ALM-12017 Insufficient Disk Capacity

Alarm Description

The system checks host disk usage every 30 seconds and compares the actual usage with the threshold. Disk usage has a default threshold, and this alarm is generated when the host disk usage exceeds the specified threshold.

When the Trigger Count is 1, this alarm is cleared when the usage of a host disk partition is less than or equal to the threshold. When the Trigger Count is greater than 1, this alarm is cleared when the usage of a host disk partition is less than or equal to 90% of the threshold.
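
The clearing rule above can be sketched as a small shell function. This is only an illustration of the arithmetic, not how FusionInsight Manager implements it: the function name clear_threshold is hypothetical, and rounding down to a whole percentage is an assumption.

```shell
# clear_threshold THRESHOLD TRIGGER_COUNT  (hypothetical helper)
# Prints the usage percentage at or below which the alarm is cleared.
clear_threshold() {
    # $1: alarm threshold in percent; $2: Trigger Count
    if [ "$2" -eq 1 ]; then
        # Trigger Count is 1: cleared at or below the threshold itself
        echo "$1"
    else
        # Trigger Count > 1: cleared at or below 90% of the threshold
        # (integer arithmetic, rounds down; an assumption)
        echo $(( $1 * 90 / 100 ))
    fi
}
```

For example, clear_threshold 90 3 prints 81: with the default 90% threshold and a Trigger Count of 3, the alarm is cleared once usage falls to 81% or below.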

Alarm Attributes

Alarm ID	Alarm Severity	Auto Cleared
12017	Major	Yes

Alarm Parameters

Parameter	Description
Source	Specifies the cluster or system for which the alarm is generated.
ServiceName	Specifies the service for which the alarm is generated.
RoleName	Specifies the role for which the alarm is generated.
HostName	Specifies the host for which the alarm is generated.
PartitionName	Specifies the device partition for which the alarm is generated.
Trigger Condition	Specifies the threshold for triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated.

Impact on the System

When disk capacity is insufficient, jobs that need to modify or use data on the disk may fail.

Possible Causes

The alarm threshold is incorrect.
The disk configuration cannot meet service requirements. The disk usage reaches the upper limit.

Handling Procedure

Check whether the threshold is set properly.

Log in to FusionInsight Manager, choose O&M > Alarm > Thresholds > Host > Disk > Disk Usage and check whether the threshold (configurable, 90% by default) is appropriate.
- If yes, go to Step 4.
- If no, go to Step 2.
Locate the target threshold rule and click Modify in the Operation column to change the alarm threshold based on the current disk usage.

Figure 1 Setting an alarm threshold
After 2 minutes, check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 4.

Check whether the disk usage reaches the upper limit.

In the alarm list on FusionInsight Manager, expand the row where the alarm is located to view the host name and disk partition information in the alarm details.
Log in to the node for which the alarm is generated as user root.
Run the following command to check the system disk partition usage. Based on the disk partition name obtained in Step 4, check whether the disk is mounted to any of the following directories: /, /opt, /tmp, /var, /var/log, and /srv/BigData (can be customized).
```
df -lmPT |  awk '$2 != "iso9660"' | grep '^/dev/' | awk '{"readlink -m "$1 | getline real }{$1=real; print $0}' | sort -u -k 1,1
```
- If yes, the disk is a system disk. Then go to Step 10.
- If no, the disk is not a system disk. Then go to Step 7.
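
The system-disk check above can be sketched as a shell function that reads df output and classifies one partition. This is a minimal illustration under stated assumptions: the function name is_system_partition is hypothetical, the input follows the `df -lmPT` column layout (device in column 1, mount point in column 7), mount points contain no spaces, and the directory list is the default one from this step. Unlike the pipeline above, it does not resolve symbolic links in device names.

```shell
# is_system_partition PARTITION  (hypothetical helper)
# Reads `df -lmPT`-style output on stdin and reports whether PARTITION
# is mounted on one of the default system directories.
is_system_partition() {
    # Look up the mount point (column 7) for the given device (column 1).
    mount_point=$(awk -v dev="$1" '$1 == dev { print $7; exit }')
    case "$mount_point" in
        /|/opt|/tmp|/var|/var/log|/srv/BigData)
            echo "system disk: $mount_point"
            return 0 ;;
        "")
            echo "partition not found"
            return 2 ;;
        *)
            echo "not a system disk: $mount_point"
            return 1 ;;
    esac
}

# Example (device name is a placeholder):
#   df -lmPT | is_system_partition /dev/sda1
```

If /srv/BigData has been customized on the cluster, the case pattern would need the customized path instead.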

Run the following command to check the system disk partition usage. Determine the role of the disk based on the disk partition name obtained in Step 4.
```
df -lmPT |  awk '$2 != "iso9660"' | grep '^/dev/' | awk '{"readlink -m "$1 | getline real }{$1=real; print $0}' | sort -u -k 1,1
```

Check whether the disk is used by HDFS, Yarn, Kafka, Supervisor, or another service that requires disk storage.
- If yes, expand the cluster capacity by referring to Scaling Out an MRS Cluster and go to Step 9.
- If no, go to Step 12.
After 2 minutes, check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 12.
Run the following command to find files larger than 500 MB on the node, and check whether any large file was written to the disk by mistake:
```
find / -xdev -size +500M -exec ls -l {} \;
```
- If yes, go to Step 11.
- If no, go to Step 12.
Handle the large file and check whether the alarm is cleared 2 minutes later.
- If yes, no further action is required.
- If no, go to Step 12.
Contact the system administrator to expand the disk capacity.
After 2 minutes, check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 14.
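
For the large-file check in this procedure, it can help to see the largest files first rather than in directory order. The following is a sketch, not part of the official procedure: the function name list_large_files is hypothetical, and it assumes GNU find (for -printf) as shipped on typical Linux nodes.

```shell
# list_large_files [DIR] [MIN_MB]  (hypothetical helper)
# Lists regular files larger than MIN_MB MiB under DIR, largest first,
# as "<size in bytes><TAB><path>". Stays on one file system (-xdev).
list_large_files() {
    dir=${1:-/}
    min_mb=${2:-500}
    find "$dir" -xdev -type f -size +"${min_mb}"M -printf '%s\t%p\n' 2>/dev/null \
        | sort -rn \
        | head -n 20
}

# Example: the 20 largest files over 500 MB on the partition mounted at /var
#   list_large_files /var 500
```

Limiting the search to the mount point of the alarmed partition keeps the scan shorter than starting from /.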

Collect fault information.

On FusionInsight Manager, choose O&M > Log > Download.
Select OMS for Service and click OK.
In the upper right corner, set the time range to start 10 minutes before and end 10 minutes after the time the alarm was generated. Then click Download to collect the logs.
Contact the O&M personnel and send the collected log information.
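
The log collection window described above (10 minutes before to 10 minutes after the alarm) can be worked out with a short shell function. This is a convenience sketch only: the function name log_window is hypothetical, it assumes GNU date, and it interprets the timestamp in the local time zone of the machine where it runs.

```shell
# log_window "YYYY-MM-DD HH:MM:SS"  (hypothetical helper)
# Prints the start and end of the log collection window, one per line.
log_window() {
    # Convert the alarm time to seconds since the epoch.
    alarm_epoch=$(date -d "$1" +%s) || return 1
    # 600 seconds = 10 minutes on either side of the alarm time.
    date -d "@$(( alarm_epoch - 600 ))" '+%Y-%m-%d %H:%M:%S'
    date -d "@$(( alarm_epoch + 600 ))" '+%Y-%m-%d %H:%M:%S'
}

# Example (the timestamp is a placeholder):
#   log_window "2024-01-01 12:00:00"
```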