Updated on 2024-09-23 GMT+08:00

ALM-14023 Percentage of Total Reserved Disk Space for Replicas Exceeds the Threshold

Description

Every 30 seconds, the system checks the percentage of total reserved disk space for replicas, calculated as Total reserved disk space for replicas / (Total reserved disk space for replicas + Total remaining disk space), and compares it with the threshold (90% by default). This alarm is generated when the percentage exceeds the threshold for the number of consecutive checks specified by Trigger Count.
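The ratio in the description can be written out as a small calculation. The following is a minimal sketch, not FusionInsight Manager code; the function name and the sample byte values are illustrative assumptions.

```python
def reserved_replica_percentage(reserved_bytes: int, remaining_bytes: int) -> float:
    """Return reserved / (reserved + remaining) as a percentage."""
    total = reserved_bytes + remaining_bytes
    if total == 0:
        # Assumption: with no reserved and no remaining space, report 0%
        # instead of dividing by zero.
        return 0.0
    return 100.0 * reserved_bytes / total


# Illustrative values only: 450 GiB reserved for replicas, 50 GiB remaining.
GIB = 1024 ** 3
pct = reserved_replica_percentage(450 * GIB, 50 * GIB)
print(f"{pct:.1f}%")  # 90.0%
```

With these sample values the result equals the default threshold exactly, so the alarm condition (percentage exceeds the threshold) is not yet met.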

The alarm is cleared in either of the following scenarios:

  • Trigger Count is 1 and the percentage of total reserved disk space for replicas is less than or equal to the threshold.
  • Trigger Count is greater than 1 and the percentage of total reserved disk space for replicas is less than or equal to 90% of the threshold (81% with the default threshold of 90%).
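The trigger and clearing rules can be sketched as a small state machine. This is an illustration of the rules as described here, not the actual alarm engine; the class and attribute names are hypothetical, and it assumes the alarm clears on a single qualifying sample, which the description does not state.

```python
from dataclasses import dataclass


@dataclass
class ReplicaReservationAlarm:
    threshold: float = 90.0   # percent; default value from the description
    trigger_count: int = 1    # consecutive checks above the threshold needed to raise
    active: bool = False
    _consecutive: int = 0

    def observe(self, pct: float) -> bool:
        """Feed one periodic sample and return whether the alarm is active."""
        if not self.active:
            # Count consecutive samples strictly above the threshold.
            self._consecutive = self._consecutive + 1 if pct > self.threshold else 0
            if self._consecutive >= self.trigger_count:
                self.active = True
        else:
            # Trigger Count 1: clear at or below the threshold.
            # Trigger Count > 1: clear at or below 90% of the threshold.
            clear_at = self.threshold if self.trigger_count == 1 else 0.9 * self.threshold
            if pct <= clear_at:
                self.active = False
                self._consecutive = 0
        return self.active


alarm = ReplicaReservationAlarm(threshold=90.0, trigger_count=3)
states = [alarm.observe(p) for p in [92, 93, 95, 85, 80]]
print(states)  # [False, False, True, True, False]
```

In the sample run, the alarm is raised on the third consecutive sample above 90%, stays active at 85% because that is still above 81%, and clears at 80%.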

Attribute

Alarm ID | Alarm Severity | Automatically Cleared
---------|----------------|----------------------
14023    | Minor          | Yes

Parameters

Name              | Meaning
------------------|--------
Source            | Specifies the cluster for which the alarm is generated.
ServiceName       | Specifies the service for which the alarm is generated.
RoleName          | Specifies the role for which the alarm is generated.
NameServiceName   | Specifies the NameService service for which the alarm is generated.
Trigger condition | Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated.

Impact on the System

The performance of writing data to HDFS is affected. If all remaining DataNode space is reserved for replicas, writing HDFS data fails.

Possible Causes

  • The alarm threshold is improperly configured.
  • The disk space configured for the HDFS cluster is insufficient.
  • The volume of services accessing HDFS is too large, so the DataNodes are overloaded.

Procedure

Check whether the alarm threshold is appropriate.

  1. On the FusionInsight Manager portal, choose O&M > Alarm > Thresholds > Name of the desired cluster > HDFS > Disk > Percentage of Reserved Space for Replicas of Unused Space to check whether the alarm threshold is appropriate. (The default threshold is 90%. Users can change it as required.)

    • If yes, go to 4.
    • If no, go to 2.

  2. Choose O&M > Alarm > Thresholds > Name of the desired cluster > HDFS > Disk > Percentage of Reserved Space for Replicas of Unused Space, click Modify, and change the threshold based on actual usage.

    Figure 1 Modify Thresholds

  3. Wait 5 minutes and check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 4.

Check whether an alarm indicating insufficient disk space is generated.

  4. On the FusionInsight Manager portal, choose O&M > Alarm > Alarms and check whether ALM-14001 HDFS Disk Usage Exceeds the Threshold or ALM-14002 DataNode Disk Usage Exceeds the Threshold exists.

    • If yes, go to 5.
    • If no, go to 7.

  5. Handle the alarm by following the instructions in ALM-14001 HDFS Disk Usage Exceeds the Threshold or ALM-14002 DataNode Disk Usage Exceeds the Threshold, and check whether the alarm is cleared.

    • If yes, go to 6.
    • If no, go to 7.

  6. Wait 5 minutes and check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 7.

Expand the DataNode capacity.

  7. Expand the DataNode capacity.
  8. Wait 5 minutes and check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 9.
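Before expanding capacity, it can help to estimate how much additional remaining space would bring the percentage down to a target value. The sketch below rearranges the ratio from the Description section; it is only a rough estimate under the assumptions that the reserved space stays constant and that all added capacity appears as remaining space. The function name and the sample numbers are hypothetical.

```python
def extra_remaining_needed(reserved_bytes: float, remaining_bytes: float,
                           target_pct: float) -> float:
    """Extra remaining space needed so reserved / (reserved + remaining) <= target_pct."""
    if not 0 < target_pct < 100:
        raise ValueError("target_pct must be between 0 and 100 (exclusive)")
    t = target_pct / 100.0
    # reserved / (reserved + remaining) <= t  <=>  remaining >= reserved * (1 - t) / t
    required_remaining = reserved_bytes * (1 - t) / t
    # If enough space already remains, nothing extra is needed.
    return max(0.0, required_remaining - remaining_bytes)


# Illustrative values in GiB: 900 reserved, 50 remaining (about 94.7%).
# Target 81%, the clearing level when Trigger Count > 1 and the threshold is 90%.
print(round(extra_remaining_needed(900, 50, 81.0), 1))
```

Choosing the clearing level rather than the threshold itself as the target leaves room for the alarm to clear when Trigger Count is greater than 1.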

Collect fault information.

  9. On the FusionInsight Manager portal, choose O&M > Log > Download.
  10. For Service, select HDFS in the required cluster.
  11. Click the icon in the upper right corner and set Start Date and End Date for log collection to 20 minutes before and 20 minutes after the alarm generation time, respectively. Then click Download.
  12. Contact the O&M personnel and send the collected logs.
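The log collection window from the steps above can be computed from the alarm generation time. This is a minimal sketch; the function name and the sample alarm time are hypothetical, and the resulting values still have to be entered manually as Start Date and End Date on the Download page.

```python
from datetime import datetime, timedelta


def log_collection_window(alarm_time: datetime, margin_minutes: int = 20):
    """Return (start, end) covering margin_minutes before and after alarm_time."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


# Hypothetical alarm generation time, for illustration only.
start, end = log_collection_window(datetime(2024, 9, 23, 10, 5))
print(start, end)  # 2024-09-23 09:45:00 2024-09-23 10:25:00
```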

Alarm Clearing

After the fault is rectified, the system automatically clears this alarm.

Related Information

None