ALM-12033 Slow Disk Fault

Description

The system runs the iostat command every 3 seconds to monitor disk I/O indicators. A check counts as a slow period if the svctm value is greater than 100 ms and also greater than 1.5 times the svctm_average value within 300 seconds. If slow periods account for more than 50% of the checks within 300 seconds, the system considers the disk faulty and reports an alarm.

The value of svctm_average is the average value of all disk svctm on the current node.
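
As an illustration, the detection rule can be sketched in shell. This is a minimal sketch, not the product's implementation: the function name is_slow_disk and the input format (one line per 3-second check containing svctm and svctm_average in milliseconds) are assumptions made for the example.

```shell
# Hypothetical helper: reads "svctm svctm_average" pairs (ms), one per check,
# from stdin and prints "faulty" if more than 50% of the checks are slow periods.
is_slow_disk() {
    awk '
        # A check is slow if svctm > 100 ms AND svctm > 1.5 * svctm_average.
        $1 > 100 && $1 > 1.5 * $2 { slow++ }
        { total++ }
        END {
            if (total > 0 && slow / total > 0.5) print "faulty"
            else print "ok"
        }
    '
}
```

With 3-second sampling, a 300-second window corresponds to about 100 input lines.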

This alarm is automatically cleared after the disk is replaced.

The alarm detecting principle is as follows:

On the Linux platform, run the iostat -x -t 1 command to check whether the I/O is faulty. Specifically, check the values of the parameters in the red box in the following figure.

%iowait: Specifies the percentage of CPU time spent waiting for I/O. If the value exceeds 50% or is significantly greater than the values of %system, %user, and %idle, the I/O may be faulty.
await: Specifies the sum of the disk I/O waiting time and the I/O service time, in milliseconds. Normally, the value does not exceed 20. The value for other DataNode disks can be slightly higher but should not exceed 40.
svctm: Specifies the average service time of disk I/O requests, in milliseconds.
%util: Specifies how busy the disk is. If the value exceeds 80%, the disk may be busy.
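
The following sketch shows one way to pull these per-disk values out of iostat -x output. The function name extract_io_fields is hypothetical. It assumes an older sysstat layout in which the device header contains columns named await, svctm, and %util. Newer sysstat releases may omit some of these columns, in which case the function prints nothing.

```shell
# Hypothetical helper: reads `iostat -x` text from stdin and prints
# "device await svctm %util" for each device row. Columns are located by
# header name because their positions differ between sysstat versions.
extract_io_fields() {
    awk '
        $1 ~ /^Device/ {
            a = s = u = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "await") a = i
                if ($i == "svctm") s = i
                if ($i == "%util") u = i
            }
            next
        }
        # Device rows follow the header until a blank line.
        NF == 0 { a = s = u = 0; next }
        a && s && u { print $1, $a, $s, $u }
    '
}
```

For example, iostat -x -t 1 5 | extract_io_fields prints one line per disk for each of the five reports.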

If the value of %util is greater than 10% and the value of svctm is greater than 100 ms, the I/O is recorded as faulty. This alarm is generated when the I/O is recorded as faulty 30 times in 60 checks.
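
The counting rule above can be sketched as follows. This is only an illustration: the function name should_raise_alarm and the input format (one line per check with %util and svctm) are assumptions, and "30 times in 60 checks" is interpreted here as at least 30 faulty records among the most recent 60 checks.

```shell
# Hypothetical helper: reads "%util svctm" pairs from stdin, one per check,
# and prints "alarm" if at least 30 of the most recent 60 checks were faulty.
should_raise_alarm() {
    awk '
        {
            slot = NR % 60                      # ring buffer of the last 60 checks
            bad[slot] = ($1 > 10 && $2 > 100)   # 1 if this check is recorded as faulty
        }
        END {
            count = 0
            for (k in bad) count += bad[k]
            if (count >= 30) print "alarm"
            else print "ok"
        }
    '
}
```

Because older checks are overwritten in the ring buffer, a disk that recovers stops counting toward the alarm after 60 further checks.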

Attribute

Alarm ID	Alarm Severity	Auto Clear
12033	Major	Yes

Parameters

Name	Meaning
Source	Specifies the cluster or system for which the alarm is generated.
ServiceName	Specifies the service for which the alarm is generated.
RoleName	Specifies the role for which the alarm is generated.
HostName	Specifies the host for which the alarm is generated.
DiskName	Specifies the disk for which the alarm is generated.

Impact on the System

Service performance deteriorates and service processing capabilities decline. In severe cases, services become unavailable.

Possible Causes

The disk is aged or has bad sectors.

Procedure

Check the disk status.

1. On the FusionInsight Manager portal, choose O&M > Alarm > Alarms.
2. View the alarm details to obtain the values of the HostName and DiskName fields and the information about the faulty disk for which the alarm is generated.
3. Check whether the node for which the alarm is generated is in a virtualization environment.
- If yes, go to 4.
- If no, go to 7.
4. Check whether the storage performance provided by the virtualization environment meets the hardware requirements. After the check is complete, go to 5.
5. Log in to the node for which the alarm is generated as user root. Run the df -h command and check whether the command output contains the value of DiskName.
- If yes, go to 7.
- If no, go to 6.
6. Run the lsblk command and check whether the value of DiskName can be mapped to a disk.
- If yes, go to 7.
- If no, go to 22.
7. Log in to the node for which the alarm is generated as user root. Run the lsscsi | grep "/dev/sd[x]" command to view the disk device information and determine whether the disk has been organized into a RAID group.

The value of /dev/sd[x] is the faulty disk name obtained in 2.

For example, run the following command:

lsscsi | grep "/dev/sda"

In the command output, if ATA, SATA, or SAS is displayed in the third column, the disk has not been organized into a RAID group. If other information is displayed, the disk may have been organized into a RAID group.
- If yes, go to 12.
- If no, go to 8.
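
The decision in this step can be sketched as a small filter over the lsscsi output. The function name classify_raid and its output labels are hypothetical. The sketch assumes that the device path is the last field of each lsscsi line and that the third whitespace-separated field holds the value to inspect.

```shell
# Hypothetical helper: reads lsscsi output from stdin and, for the line that
# ends with the given device path, prints "no-raid" when the third column is
# ATA, SATA, or SAS, and "maybe-raid" otherwise.
classify_raid() {
    awk -v dev="$1" '
        $NF == dev {
            if ($3 == "ATA" || $3 == "SATA" || $3 == "SAS") print "no-raid"
            else print "maybe-raid"
        }
    '
}
```

For example, lsscsi | classify_raid /dev/sda prints no-raid for a directly attached ATA disk.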
8. Run the smartctl -i /dev/sd[x] command to check whether the hardware supports SMART.

For example, run the following command:

smartctl -i /dev/sda

In the command output, if SMART support is: Enabled is displayed, the hardware supports SMART. If Device does not support SMART is displayed, the hardware does not support SMART.
- If yes, go to 9.
- If no, go to 17.
9. Run the smartctl -H --all /dev/sd[x] command to check basic SMART information and determine whether the disk is working correctly.

For example, run the following command:

smartctl -H --all /dev/sda

Check the SMART overall-health self-assessment test result in the command output. If the result is FAILED, the disk is faulty and needs to be replaced. If the result is PASSED, check the count of Reallocated_Sector_Ct or Elements in grown defect list. If the count is greater than 100, the disk is faulty and needs to be replaced.
- If yes, go to 10.
- If no, go to 18.
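
The rule in this step can be sketched as a filter over the smartctl output. The function name smart_verdict and its output labels are hypothetical. The sketch assumes that the raw Reallocated_Sector_Ct count is the last field of its attribute row and that the grown defect list count is the last field of its line.

```shell
# Hypothetical helper: reads `smartctl -H --all` output from stdin and prints
# "replace" or "healthy" following the rule in this step.
smart_verdict() {
    awk '
        /SMART overall-health self-assessment test result:/ && /FAILED/ { failed = 1 }
        # ATA attribute table: the raw count is taken from the last column.
        $2 == "Reallocated_Sector_Ct" { count = $NF + 0 }
        # SCSI/SAS disks report the grown defect list (Glist) size instead.
        /Elements in grown defect list:/ { count = $NF + 0 }
        END {
            if (failed || count > 100) print "replace"
            else print "healthy"
        }
    '
}
```

For example, smartctl -H --all /dev/sda | smart_verdict prints replace when the self-assessment fails or the count exceeds 100.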
10. Run the smartctl -l error -H /dev/sd[x] command to check the Glist of the disk and determine whether the disk is working correctly.

For example, run the following command:

smartctl -l error -H /dev/sda

Check the Command/Feature_Name column in the command output. If READ SECTOR(S) or WRITE SECTOR(S) is displayed, the disk has bad sectors. If other errors occur, the disk circuit is faulty. The preceding errors indicate that the disk is abnormal and needs to be replaced.

If No Errors Logged is displayed, no error log exists. You can perform step 11 to trigger the disk SMART self-check.
- If yes, go to 11.
- If no, go to 18.
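
The interpretation of the error log in this step can be sketched as follows. The function name classify_error_log and its output labels are hypothetical. As a simplification, any output that contains neither a sector error nor No Errors Logged is treated as "other errors".

```shell
# Hypothetical helper: reads `smartctl -l error` output from stdin and prints
# "bad-sectors", "no-errors", or "other-errors".
classify_error_log() {
    awk '
        /No Errors Logged/ { none = 1 }
        /READ SECTOR\(S\)|WRITE SECTOR\(S\)/ { sectors = 1 }
        END {
            if (sectors) print "bad-sectors"
            else if (none) print "no-errors"
            else print "other-errors"
        }
    '
}
```

Following the text above, bad-sectors and other-errors both mean that the disk needs to be replaced, while no-errors leads to the SMART self-check.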
11. Run the smartctl -t long /dev/sd[x] command to trigger the disk SMART self-check. After the command is executed, the time when the self-check is to be completed is displayed. After the self-check is completed, repeat 9 and 10 to check whether the disk is working properly.

For example, run the following command:

smartctl -t long /dev/sda
- If yes, go to 17.
- If no, go to 18.
12. Run the smartctl -d [sat|scsi]+megaraid,[DID] -H --all /dev/sd[x] command to check whether the hardware supports SMART.
- [sat|scsi] indicates the disk type. Try both types.
- [DID] indicates the slot number. Try slots 0 to 15.
For example, run the following commands in sequence:

smartctl -d sat+megaraid,0 -H --all /dev/sda

smartctl -d sat+megaraid,1 -H --all /dev/sda

smartctl -d sat+megaraid,2 -H --all /dev/sda

...

Run the commands that combine different disk types and slots. In a command output, if SMART support is: Enabled is displayed, the disk supports SMART. Record the parameters of the disk type and slot combination. If SMART support is: Enabled is not displayed in the outputs of all the preceding command combinations, the disk does not support SMART.
- If yes, go to 13.
- If no, go to 16.
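
Trying every combination by hand is tedious, so the search in this step can be sketched as a loop. The function name find_megaraid_combo and the SMARTCTL override variable are hypothetical and exist only for this example. The smartctl options are the ones given in this step.

```shell
# SMARTCTL can be overridden for testing; it defaults to the real smartctl.
SMARTCTL="${SMARTCTL:-smartctl}"

# Hypothetical helper: prints the first "-d" argument (for example
# "sat+megaraid,2") for which the output contains "SMART support is: Enabled".
find_megaraid_combo() {
    dev="$1"
    for dtype in sat scsi; do
        did=0
        while [ "$did" -le 15 ]; do
            if "$SMARTCTL" -d "$dtype+megaraid,$did" -H --all "$dev" 2>/dev/null |
                grep -q 'SMART support is:[[:space:]]*Enabled'; then
                echo "$dtype+megaraid,$did"
                return 0
            fi
            did=$((did + 1))
        done
    done
    return 1   # no combination reported SMART as enabled
}
```

For example, find_megaraid_combo /dev/sda prints the combination to record for the following steps, or returns a non-zero status if the disk does not appear to support SMART.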
13. Run the smartctl -d [sat|scsi]+megaraid,[DID] -H --all /dev/sd[x] command recorded in 12 to check basic SMART information and determine whether the disk is working correctly.

For example, run the following command:

smartctl -d sat+megaraid,2 -H --all /dev/sda

Check the SMART overall-health self-assessment test result in the command output. If the result is FAILED, the disk is faulty and needs to be replaced. If the result is PASSED, check the count of Reallocated_Sector_Ct or Elements in grown defect list. If the count is greater than 100, the disk is faulty and needs to be replaced.
- If yes, go to 14.
- If no, go to 18.
14. Run the smartctl -d [sat|scsi]+megaraid,[DID] -l error -H /dev/sd[x] command to check the Glist of the disk and determine whether the disk is working correctly.

For example, run the following command:

smartctl -d sat+megaraid,2 -l error -H /dev/sda

Check the Command/Feature_Name column in the command output. If READ SECTOR(S) or WRITE SECTOR(S) is displayed, the disk has bad sectors. If other errors occur, the disk circuit is faulty. The preceding errors indicate that the disk is abnormal and needs to be replaced.

If No Errors Logged is displayed, no error log exists. You can perform step 15 to trigger the disk SMART self-check.
- If yes, go to 15.
- If no, go to 18.
15. Run the smartctl -d [sat|scsi]+megaraid,[DID] -t long /dev/sd[x] command to trigger the disk SMART self-check. After the command is executed, the time when the self-check is to be completed is displayed. After the self-check is completed, repeat 13 and 14 to check whether the disk is working properly.

For example, run the following command:

smartctl -d sat+megaraid,2 -t long /dev/sda
- If yes, go to 17.
- If no, go to 18.
16. If the configured RAID card does not support SMART, the disk usually does not support SMART either. In this case, use the check tool provided by the RAID card vendor to solve the problem. Then go to 17.

For example, the tool for LSI RAID cards is MegaCLI.
17. On FusionInsight Manager, choose O&M > Alarm > Alarms, click Clear in the Operation column of the alarm, and check whether the alarm is generated again for the same disk.

If the alarm has been reported three times for the current disk, you are advised to replace the disk.
- If yes, go to 18.
- If no, no further action is required.

Replace the disk.

18. On the FusionInsight Manager portal, choose O&M > Alarm > Alarms.
19. View the alarm details to obtain the values of the HostName and DiskName fields and the information about the faulty disk for which the alarm is generated.
20. Replace the faulty disk.
21. Check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 22.

Collect fault information.

22. On FusionInsight Manager, choose O&M > Log > Download.
23. Select OMS for Service and click OK.
24. Click in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and after the alarm generation time, respectively. Then, click Download.
25. Contact the O&M personnel and send the collected log information.