ALM-14039 Slow DataNodes Exist in the Cluster

Alarm Description

The system checks the number of slow operations per second on HDFS DataNode instances every 60 seconds and compares the number with the threshold. This alarm is generated when the number of slow operations per second of an HDFS DataNode instance has exceeded the threshold for three minutes.

This alarm is cleared when the number of slow operations per second of the HDFS DataNode instance is less than or equal to the threshold.

This alarm applies only to MRS 3.5.0 or later.
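
The trigger and clear rules above can be illustrated with a minimal sketch. It assumes that "three minutes" means three consecutive 60-second checks above the threshold, which the alarm description does not state explicitly. The function name `alarm_should_fire` and the sample values are hypothetical and are not part of MRS.

```shell
# Hypothetical helper (not part of MRS): decide whether the alarm would be active.
# Usage: alarm_should_fire THRESHOLD SAMPLE...
# Samples are integer slow-operation rates, oldest first, one per 60-second check.
alarm_should_fire() (
  threshold=$1
  shift
  window=3        # assumed: 3 consecutive checks correspond to "three minutes"
  consecutive=0
  for sample in "$@"; do
    if [ "$sample" -gt "$threshold" ]; then
      consecutive=$((consecutive + 1))   # still above the threshold
    else
      consecutive=0                      # at or below the threshold: condition clears
    fi
  done
  [ "$consecutive" -ge "$window" ]
)

# Example with the default threshold of 100 and made-up sample values.
if alarm_should_fire 100 80 120 130 140; then
  echo "alarm active"
else
  echo "alarm not active"
fi
```

A single sample at or below the threshold resets the counter, which mirrors the clear condition described above.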

Alarm Attributes

Alarm ID	Alarm Severity	Auto Cleared
14039	Major (default threshold: 100)	Yes

Alarm Parameters

Type	Parameter	Description
Location Information	Source	Specifies the cluster for which the alarm was generated.
	ServiceName	Specifies the service for which the alarm was generated.
	RoleName	Specifies the role for which the alarm was generated.
	HostName	Specifies the host for which the alarm was generated.
Additional Information	Trigger Condition	Specifies the alarm triggering condition.

Impact on the System

Slow DataNodes degrade the data read and write performance of HDFS.

Possible Causes

- The disk I/O rate of the HDFS DataNode instance is low, and the DataNode processing capability has reached a bottleneck.
- The network transmission rate between HDFS DataNode instances is low.

Handling Procedure

Check whether the disk I/O rate of the DataNode instance is low.

1. Log in to FusionInsight Manager and choose O&M > Alarm > Alarms. In the Location field of the alarm details, view the host name of the DataNode instance for which this alarm is generated.
2. Choose Cluster > Services > HDFS, click the Instances tab, and click the DataNode instance that runs on the host identified in 1.
3. Click the Chart tab and select Performance from the Chart Category area. Check whether any value in the Slow Flush or Sync Occurrences Per Second, Slow SyncWriterOsCache Occurrences Per Second, or Slow WriteDataToDisk Occurrences Per Second chart is high.
- If yes, go to 4.
- If no, go to 8.
4. On FusionInsight Manager, choose O&M > Alarm > Alarms and check whether ALM-12033 Slow Disk Fault exists.
- If yes, record the disk information in the alarm details and go to 6.
- If no, go to 5.
5. Obtain information about the disk where slow operations occur.
  a. Log in as user omm to the DataNode host identified in 1 and run the following commands to view the run log:
     cd /var/log/Bigdata/hdfs/dn/

     vim hadoop-omm-datanode-Hostname.log
  b. Search for keyword slow in the log to identify the disk where slow operations occur.
6. Rectify the fault based on the obtained disk information by following the handling procedure of ALM-12033 Slow Disk Fault.
7. Wait 5 minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 8.
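
As an alternative to opening the run log in vim, the keyword search can be done non-interactively. The following is a minimal sketch: the helper name `find_slow_entries` is hypothetical, and Hostname in the example path is the same placeholder used in the commands above. The sketch only lists matching lines and makes no assumption about the format of the log messages.

```shell
# Hypothetical helper: print log lines that contain "slow" (case-insensitive),
# each prefixed with its line number in the file.
find_slow_entries() {
  grep -in 'slow' "$1"
}

# Example call (replace Hostname with the host name of the DataNode):
# find_slow_entries /var/log/Bigdata/hdfs/dn/hadoop-omm-datanode-Hostname.log | tail -n 50
```

Piping the result through tail limits the output to the most recent matches, which are usually the ones relevant to the current alarm.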

Check whether the network transmission rate between HDFS DataNode instances is low.

8. On FusionInsight Manager, choose Cluster > Services > HDFS, click the Chart tab, select Performance in the Chart Category area, and check whether any value in the Slow Write Packet To DownStream Count Per Second or Slow Ack To Upstream Count Per Second chart is high.
- If yes, go to 9.
- If no, go to 13.
9. Log in as user omm to the DataNode host identified in 1 and run the following commands to view the run log:
   cd /var/log/Bigdata/hdfs/dn/

   vim hadoop-omm-datanode-Hostname.log
10. Search for keyword slow in the log to identify the upstream and downstream nodes where slow operations occur.
11. Check whether the network communication between the current node and the nodes obtained in 10 is normal.
- If yes, go to 13.
- If no, contact the network administrator to repair the network.
12. Wait 5 minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 13.
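
The procedure does not prescribe a tool for checking network communication between the current node and its upstream and downstream nodes. One basic option is an ICMP reachability test, sketched below under the assumption that the Linux iputils ping command is available and that ICMP is not blocked between the nodes. The helper name `check_peers` is hypothetical. A successful ping shows only that a peer is reachable, not that the transmission rate is adequate.

```shell
# Hypothetical helper: ping each peer and report which ones do not respond.
# Arguments: host names or IP addresses of the nodes identified in the run log.
check_peers() (
  failed=0
  for peer in "$@"; do
    # -c 3: send three echo requests; -W 2: wait up to 2 seconds for each reply.
    if ping -c 3 -W 2 "$peer" >/dev/null 2>&1; then
      echo "OK   $peer"
    else
      echo "FAIL $peer"
      failed=1
    fi
  done
  # Exit status is 0 only if every peer responded.
  [ "$failed" -eq 0 ]
)

# Example call (the peer names are placeholders):
# check_peers upstream-node downstream-node
```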

Collect fault information.

13. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
14. Expand the Service drop-down list, and select HDFS for the target cluster.
15. Click the edit icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
16. Contact O&M engineers and provide the collected logs.
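
The log collection window (10 minutes on either side of the alarm generation time) can be computed rather than worked out by hand. The following is a minimal sketch that assumes GNU date, as found on Linux hosts. The helper name `log_window` and the example timestamp are hypothetical.

```shell
# Hypothetical helper: print the Start Date and End Date for log collection.
# Argument: alarm generation time as "YYYY-MM-DD HH:MM:SS" in the local time zone.
# Requires GNU date (the -d option and the @EPOCH syntax).
log_window() (
  alarm_epoch=$(date -d "$1" +%s) || exit 1
  fmt='+%Y-%m-%d %H:%M:%S'
  # 600 seconds = 10 minutes before and after the alarm generation time.
  echo "Start Date: $(date -d "@$((alarm_epoch - 600))" "$fmt")"
  echo "End Date:   $(date -d "@$((alarm_epoch + 600))" "$fmt")"
)

# Example call with a made-up alarm time:
# log_window "2024-05-01 12:05:00"
```

The arithmetic is done on epoch seconds instead of relative date strings such as "10 minutes ago", because GNU date can misread a signed number that follows a time of day as a time zone offset.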