ALM-14023 Percentage of Total Reserved Disk Space for Replicas Exceeds the Threshold

Description

Every 30 seconds, the system calculates the percentage of total reserved disk space for replicas as Total reserved disk space for replicas / (Total reserved disk space for replicas + Total remaining disk space) and compares it with the threshold (90% by default). This alarm is generated when the percentage exceeds the threshold for the number of consecutive checks specified by Trigger Count.

The alarm is cleared in either of the following scenarios: Trigger Count is 1 and the percentage of total reserved disk space for replicas is less than or equal to the threshold; or Trigger Count is greater than 1 and the percentage of total reserved disk space for replicas is less than or equal to 90% of the threshold.
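
The generation and clearing rules described above can be sketched in a few lines of Python. This is only an illustrative model of the logic as stated in this section, not the product's implementation: the names `reserved_percentage`, `AlarmState`, and `observe` are hypothetical, and treating a cluster with no space at all as 0% is an assumption.

```python
from dataclasses import dataclass


def reserved_percentage(reserved: float, remaining: float) -> float:
    """Reserved / (reserved + remaining), expressed as a percentage."""
    total = reserved + remaining
    if total <= 0:
        return 0.0  # assumption: no space at all is reported as 0%
    return 100.0 * reserved / total


@dataclass
class AlarmState:
    threshold: float = 90.0  # percent; default threshold from the text
    trigger_count: int = 1   # consecutive violations needed to raise the alarm
    consecutive: int = 0     # violations seen in a row so far
    active: bool = False

    def observe(self, pct: float) -> bool:
        """Feed one periodic sample; return whether the alarm is active."""
        if pct > self.threshold:
            self.consecutive += 1
            if self.consecutive >= self.trigger_count:
                self.active = True
        else:
            self.consecutive = 0
            # Clearing level: the threshold itself when Trigger Count is 1,
            # otherwise 90% of the threshold.
            if self.trigger_count == 1:
                clear_at = self.threshold
            else:
                clear_at = 0.9 * self.threshold
            if self.active and pct <= clear_at:
                self.active = False
        return self.active
```

With a threshold of 90% and a Trigger Count of 3, for example, the alarm in this model is raised on the third consecutive sample above 90% and stays active until a sample falls to 81% or lower.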

Attribute

Alarm ID	Alarm Severity	Automatically Cleared
14023	Minor	Yes

Parameters

Name	Meaning
Source	Specifies the cluster for which the alarm is generated.
ServiceName	Specifies the service for which the alarm is generated.
RoleName	Specifies the role for which the alarm is generated.
NameServiceName	Specifies the NameService service for which the alarm is generated.
Trigger condition	Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated.

Impact on the System

The performance of writing data to HDFS is affected. If all remaining DataNode space is reserved for replicas, writing HDFS data fails.

Possible Causes

The alarm threshold is improperly configured.
The disk space configured for the HDFS cluster is insufficient.
The volume of services accessing HDFS is too large, so the DataNodes are overloaded.

Procedure

Check whether the alarm threshold is appropriate.

1. On the FusionInsight Manager portal, choose O&M > Alarm > Thresholds > Name of the desired cluster > HDFS > Disk > Percentage of Reserved Space for Replicas of Unused Space and check whether the alarm threshold is appropriate. (The default threshold is 90%. Users can change it as required.)
- If yes, go to Step 4.
- If no, go to Step 2.
2. Choose O&M > Alarm > Thresholds > Name of the desired cluster > HDFS > Disk > Percentage of Reserved Space for Replicas of Unused Space, click Modify, and change the threshold based on the actual usage.

Figure 1 Modify Thresholds
3. Wait 5 minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 4.

Check whether an alarm indicating insufficient disk space is generated.

4. On the FusionInsight Manager portal, choose O&M > Alarm > Alarms and check whether ALM-14001 HDFS Disk Usage Exceeds the Threshold or ALM-14002 DataNode Disk Usage Exceeds the Threshold exists.
- If yes, go to Step 5.
- If no, go to Step 7.
5. Rectify the fault by referring to ALM-14001 HDFS Disk Usage Exceeds the Threshold or ALM-14002 DataNode Disk Usage Exceeds the Threshold. Then check whether the disk usage alarm is cleared.
- If yes, go to Step 6.
- If no, go to Step 7.
6. Wait 5 minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 7.

Expand the DataNode capacity.

7. Expand the DataNode capacity.
8. Wait 5 minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 9.
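
When planning the expansion, a rough estimate of how much remaining space is needed can be derived from the formula in the Description. If R is the reserved space and M the remaining space, then R / (R + M) <= t requires M >= R × (1 - t) / t. The sketch below applies that inequality. The function name `extra_remaining_needed` is hypothetical, and the estimate assumes that the reserved amount stays constant while capacity is added, which may not hold under a changing write load.

```python
def extra_remaining_needed(reserved: float, remaining: float,
                           target_pct: float) -> float:
    """Additional remaining space required so that
    reserved / (reserved + remaining) <= target_pct / 100.

    Assumption: the reserved amount does not change during the expansion.
    All sizes must use the same unit (for example, GB).
    """
    if not 0 < target_pct < 100:
        raise ValueError("target_pct must be between 0 and 100 (exclusive)")
    t = target_pct / 100.0
    # Rearranged from R / (R + M) <= t  =>  M >= R * (1 - t) / t
    required_remaining = reserved * (1.0 - t) / t
    return max(0.0, required_remaining - remaining)
```

If Trigger Count is greater than 1, the alarm clears only at 90% of the threshold, so with the default threshold a `target_pct` of 81 rather than 90 would be the value to aim for.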

Collect fault information.

9. On the FusionInsight Manager portal, choose O&M > Log > Download.
10. For Service, select HDFS in the required cluster.
11. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 20 minutes before and 20 minutes after the alarm generation time, respectively. Then, click Download.
12. Contact O&M personnel and provide the collected logs.
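
The Start Date and End Date for log collection can be worked out from the alarm generation time with simple date arithmetic. The following is a minimal sketch; the helper name `log_collection_window` is hypothetical and is not part of FusionInsight Manager.

```python
from datetime import datetime, timedelta
from typing import Tuple


def log_collection_window(alarm_time: datetime,
                          margin_minutes: int = 20) -> Tuple[datetime, datetime]:
    """Return (start, end): margin_minutes before and after the alarm time."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


# Example: an alarm generated at 12:00 on 1 January 2024
start, end = log_collection_window(datetime(2024, 1, 1, 12, 0))
print(start, end)
```

For an alarm generated at 12:00, this gives a window from 11:40 to 12:20 on the same day.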