Updated on 2022-08-12 GMT+08:00

ALM-14002 DataNode Disk Usage Exceeds the Threshold

Description

The system checks the disk usage of the DataNode every 30 seconds and compares the actual disk usage with the threshold. The DataNode Disk Usage indicator has a default threshold. This alarm is generated when the value of the DataNode Disk Usage indicator exceeds the threshold.

To change the threshold, choose O&M > Alarm > Thresholds > Name of the desired cluster > HDFS.

When the Trigger Count is 1, this alarm is cleared when the value of the DataNode Disk Usage indicator is less than or equal to the threshold. When the Trigger Count is greater than 1, this alarm is cleared when the value of the DataNode Disk Usage indicator is less than or equal to 90% of the threshold.
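The clearing rule above can be sketched as a small shell function. This is only an illustration of the rule as described, not FusionInsight Manager code; the function name alarm_should_clear and the sample threshold of 80 are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the clearing rule; not taken from the product.
# Usage: alarm_should_clear USAGE THRESHOLD TRIGGER_COUNT
# Prints "clear" if an active alarm would be cleared, otherwise "keep".
alarm_should_clear() {
  local usage="$1" threshold="$2" trigger_count="$3"
  # awk does the comparison because shell arithmetic is integer-only.
  awk -v u="$usage" -v t="$threshold" -v n="$trigger_count" 'BEGIN {
    # With Trigger Count > 1 the usage must drop to 90% of the threshold.
    limit = (n > 1) ? t * 0.9 : t
    if (u <= limit) print "clear"; else print "keep"
  }'
}
```

For example, with a hypothetical threshold of 80 and a Trigger Count of 3, a usage of 75 keeps the alarm active, because the clearing limit is 72.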

Attribute

Alarm ID    Alarm Severity    Automatically Cleared
14002       Major             Yes

Parameters

Name                 Meaning
Source               Specifies the cluster for which the alarm is generated.
ServiceName          Specifies the service for which the alarm is generated.
RoleName             Specifies the role for which the alarm is generated.
HostName             Specifies the host for which the alarm is generated.
Trigger Condition    Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated.

Impact on the System

Writing data to the Hadoop Distributed File System (HDFS) is affected.

Possible Causes

  • The cluster disk space is full.
  • Data is skewed across the DataNodes.

Procedure

Check whether the cluster disk space is full.

  1. On the FusionInsight Manager portal, choose O&M > Alarm > Alarms and check whether the ALM-14001 HDFS Disk Usage Exceeds the Threshold alarm exists.

    • If yes, run 2.
    • If no, go to the first step under "Check the balancing status of DataNode nodes."

  2. Handle the alarm as instructed in ALM-14001 HDFS Disk Usage Exceeds the Threshold. Check whether the alarm is cleared.

    • If yes, run 3.
    • If no, go to the first step under "Collect fault information."

  3. On the O&M > Alarm > Alarms page, check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to the first step under "Check the balancing status of DataNode nodes."

Check the balancing status of DataNode nodes.

  1. On the FusionInsight Manager portal, click Host. Check the number of DataNodes on each rack. If the number differs greatly, adjust the racks to ensure that the number of DataNodes on each rack is almost the same. Restart the HDFS service for the changes to take effect.
  2. Choose Cluster > Name of the desired cluster > Services > HDFS.
  3. In the Basic Information area, click NameNode(Active). The HDFS WebUI page is displayed.

    By default, the admin user does not have the permissions to manage other components. If the page cannot be opened or the displayed content is incomplete when you access the native UI of a component due to insufficient permissions, you can manually create a user with the permissions to manage that component.

  4. In the Summary area of the HDFS WebUI, check whether the Max value under DataNodes usages is 10% greater than the Median value.

    • If yes, go to 5.
    • If no, go to the first step under "Collect fault information."

  5. Data in the cluster is skewed and must be balanced. Prepare the client environment as follows:

    • Log in to the MRS client as user root.
    • If the cluster uses the normal mode, run su - omm to switch to user omm.
    • Run cd to switch to the client installation directory, and then run source bigdata_env.
    • If the cluster uses the security mode, run kinit hdfs for security authentication and enter the password as prompted. Obtain the password from the administrator.

  6. Run the following command to balance the data distribution:

    hdfs balancer -threshold 10

  7. Wait several minutes, and check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to the first step under "Collect fault information."
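The Max versus Median comparison in the steps above can also be sketched from the command line. The following is a minimal sketch, assuming that "10% greater" means a gap of more than 10 percentage points (the same unit used by the -threshold option of hdfs balancer); the function names usage_spread and needs_balancing are hypothetical, and the input is simply one DataNode usage percentage per line.

```shell
#!/usr/bin/env bash
# Illustrative sketch; the function names are hypothetical.

# usage_spread: reads one DataNode usage percentage per line on stdin
# (a trailing "%" is tolerated) and prints "MAX MEDIAN DIFFERENCE".
usage_spread() {
  LC_ALL=C sort -n | awk '
    NF { v[++n] = $1 + 0 }          # store each value as a number
    END {
      if (n == 0) exit 1            # no input: nothing to report
      median = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2
      printf "%.2f %.2f %.2f\n", v[n], median, v[n] - median
    }'
}

# needs_balancing [LIMIT]: exit status 0 if Max exceeds Median by more
# than LIMIT percentage points (default 10), otherwise 1.
needs_balancing() {
  local limit="${1:-10}"
  usage_spread | awk -v limit="$limit" '
    BEGIN { rc = 1 }
    { rc = ($3 > limit) ? 0 : 1 }
    END { exit rc }'
}
```

For the sample values 40, 42, 45, and 60, usage_spread reports a maximum of 60.00 and a median of 43.50, so needs_balancing succeeds and running the balancer would be the next step. One possible source for the per-DataNode percentages is the "DFS Used%" lines printed by hdfs dfsadmin -report, although the exact output layout may differ between versions.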

Collect fault information.

  1. On the FusionInsight Manager portal, choose O&M > Log > Download.
  2. In Service, select HDFS for the required cluster.
  3. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and after the alarm generation time, respectively. Then, click Download.
  4. Contact the O&M personnel and send the collected logs.
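The Start Date and End Date for log collection can be worked out with a short helper. This is a minimal sketch, assuming GNU date is available; the function name log_window is hypothetical, and the timestamp is treated as UTC purely to keep the arithmetic independent of the local time zone.

```shell
#!/usr/bin/env bash
# Illustrative sketch; log_window is a hypothetical helper, not a product tool.
# Usage: log_window "YYYY-MM-DD HH:MM:SS" [MINUTES]
# Prints the start time and the end time of the window on two lines.
log_window() {
  local alarm_time="$1" minutes="${2:-10}"
  local epoch
  # Convert to epoch seconds first so the offset is plain integer arithmetic.
  epoch=$(date -u -d "$alarm_time" +%s) || return 1
  date -u -d "@$((epoch - minutes * 60))" '+%Y-%m-%d %H:%M:%S'
  date -u -d "@$((epoch + minutes * 60))" '+%Y-%m-%d %H:%M:%S'
}
```

For an alarm generated at 2022-08-12 10:00:00, the window runs from 2022-08-12 09:50:00 to 2022-08-12 10:10:00.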

Alarm Clearing

After the fault is rectified, the system automatically clears this alarm.

Related Information

None