ALM-12053 Host File Handle Usage Exceeds the Threshold

Description

The system checks the host file handle usage every 30 seconds and compares it with the threshold (80% by default). This alarm is generated when the usage exceeds the threshold for a configured number of consecutive checks (5 by default).
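
The consecutive-check rule can be illustrated with a short shell sketch. It is not the monitoring code used by the system: the function name `first_alarm_sample` and the sample values are hypothetical, and usage values are assumed to be whole percentages.

```shell
# Hypothetical helper (not part of FusionInsight Manager).
# Prints the 1-based index of the check at which the alarm would be generated,
# or "none" if the run of over-threshold checks never gets long enough.
# $1 = threshold (%), $2 = required consecutive checks, rest = usage samples (%)
first_alarm_sample() {
    threshold=$1
    trigger_count=$2
    shift 2
    streak=0
    index=0
    for usage in "$@"; do
        index=$((index + 1))
        if [ "$usage" -gt "$threshold" ]; then
            streak=$((streak + 1))   # one more consecutive over-threshold check
        else
            streak=0                 # any check at or below the threshold resets the run
        fi
        if [ "$streak" -ge "$trigger_count" ]; then
            echo "$index"
            return 0
        fi
    done
    echo none
}

# Defaults from the description: threshold 80%, 5 consecutive checks.
first_alarm_sample 80 5 85 90 70 81 82 83 84 85   # prints 8
```

In this made-up series the third check (70%) resets the count, so the alarm is only reached at the eighth check.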

To change the threshold, choose O&M > Alarm > Thresholds > Name of the desired cluster > Host > Host Status > Host File Handle Usage.

When the Trigger Count is 1, this alarm is cleared when the host file handle usage is less than or equal to the threshold. When the Trigger Count is greater than 1, this alarm is cleared when the host file handle usage is less than or equal to 90% of the threshold.
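
As a worked example of the clearing rule, the sketch below computes the usage level at or below which the alarm clears. The function name `clear_level` is hypothetical and the calculation is only a reading of the rule stated above, not the product's implementation.

```shell
# Hypothetical helper (not part of FusionInsight Manager).
# $1 = threshold (%), $2 = Trigger Count
clear_level() {
    if [ "$2" -eq 1 ]; then
        # Trigger Count of 1: the alarm clears at or below the threshold itself.
        echo "$1"
    else
        # Trigger Count greater than 1: the alarm clears at or below 90% of the threshold.
        awk -v t="$1" 'BEGIN { printf "%g\n", t * 90 / 100 }'
    fi
}

clear_level 80 5   # prints 72
clear_level 80 1   # prints 80
```

With the default threshold of 80%, the alarm therefore clears once usage drops to 72% or lower, provided the Trigger Count is greater than 1.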

Attribute

Alarm ID	Alarm Severity	Auto Clear
12053	Major	Yes

Parameters

Name	Meaning
Source	Specifies the cluster or system for which the alarm is generated.
ServiceName	Specifies the service for which the alarm is generated.
RoleName	Specifies the role for which the alarm is generated.
HostName	Specifies the host for which the alarm is generated.
Trigger Condition	Specifies the alarm triggering condition.

Impact on the System

Service failure: When the host file handle usage exceeds the threshold, applications on the host may fail to perform I/O operations such as opening files and network operations. As a result, programs run abnormally, which may cause jobs to fail.

Possible Causes

The application process is abnormal. For example, it does not close files or sockets that it has opened.
The number of file handles cannot meet the current service requirements.
The system is abnormal.

Procedure

Check information about files opened in processes.

1. On FusionInsight Manager, click in the row where the alarm is located in the real-time alarm list and obtain the IP address of the host for which the alarm is generated.
2. Log in to the host for which the alarm is generated as user root.
3. Run the following command to list processes by the number of file handles they hold, in descending order:

for proc in /proc/[0-9]*; do if [ -d "$proc/fd" ]; then num_fds=$(ls "$proc/fd" 2>/dev/null | wc -l); pid=$(basename "$proc"); echo "$num_fds $pid"; fi; done | sort -nr | more
4. Check whether any process that has a large number of files open is abnormal. For example, check whether it holds files or sockets that should have been closed.
- If yes, go to Step 5.
- If no, go to Step 7.
5. Release the file handles occupied by the abnormal processes.
6. Wait 5 minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 7.
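
When judging whether a process is abnormal, it can help to see what kinds of handles it holds. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/fd` contains one symbolic link per open descriptor. The function names `classify_fd_target` and `fd_summary` are hypothetical.

```shell
# Hypothetical helpers (not part of FusionInsight Manager).

# Classify one /proc/<pid>/fd symlink target by its textual form.
classify_fd_target() {
    case "$1" in
        socket:*)      echo socket ;;        # network or UNIX socket
        pipe:*)        echo pipe ;;
        anon_inode:*)  echo anon_inode ;;    # for example epoll or eventfd
        *" (deleted)") echo deleted_file ;;  # file removed but still held open
        *)             echo file ;;
    esac
}

# Count the open descriptors of one process by category.
# $1 = PID; run as root to inspect processes owned by other users.
fd_summary() {
    for fd in "/proc/$1/fd"/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        classify_fd_target "$target"
    done | sort | uniq -c | sort -nr
}
```

For example, `fd_summary 12345` (12345 being a placeholder PID taken from the listing command) prints one line per category with its count. A very large number of sockets or deleted files in a long-running process may indicate handles that are not being closed, although this alone does not prove a leak.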

Increase the number of file handles.

7. On FusionInsight Manager, click in the row where the alarm is located in the real-time alarm list and obtain the IP address of the host for which the alarm is generated.
8. Log in to the host for which the alarm is generated as user root.
9. Contact the system administrator to increase the number of system file handles.
10. Run the cat /proc/sys/fs/file-nr command to view the number of used file handles and the maximum number of file handles. The first value is the number of used handles, and the third value is the maximum. Check whether the usage exceeds the threshold.
```
# cat /proc/sys/fs/file-nr
12704	0	640000
```
- If yes, go to Step 9.
- If no, go to Step 11.
11. Wait 5 minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 12.
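
To compare the `file-nr` output with the threshold, the usage can be computed as the first value divided by the third value. The sketch below does this arithmetic; the function name `file_nr_usage` is hypothetical. On Linux, the third value reflects the kernel parameter `fs.file-max`.

```shell
# Hypothetical helper (not part of FusionInsight Manager).
# Reads one line in /proc/sys/fs/file-nr format on standard input and prints
# the file handle usage as a percentage with one decimal place.
file_nr_usage() {
    awk '{ printf "%.1f\n", $1 * 100 / $3 }'
}

# Sample values from the output shown in the procedure:
printf '12704\t0\t640000\n' | file_nr_usage   # prints 2.0
```

On the alarmed host, the live value can be obtained with `file_nr_usage < /proc/sys/fs/file-nr` and compared with the configured threshold (80% by default).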

Check whether the system environment is abnormal.

12. Contact the system administrator to check whether the operating system is abnormal.
- If yes, rectify the operating system fault and go to Step 13.
- If no, go to Step 14.
13. Wait 5 minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 14.

Collect fault information.

14. On the FusionInsight Manager home page of the active cluster, choose O&M > Log > Download.
15. In Service, select OMS and click OK.
16. Set Host to the node for which the alarm is generated and the active OMS node.
17. Click in the upper right corner, and set Start Date and End Date for log collection to 30 minutes before and 30 minutes after the alarm generation time, respectively. Then, click Download.
18. Contact O&M personnel and send them the collected logs.
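
The log collection window can be worked out from the alarm generation time with a small calculation. This is a sketch only, assuming GNU date (for the `-d` option). The function name `log_window` and the timestamp are made-up examples.

```shell
# Hypothetical helper (not part of FusionInsight Manager). Requires GNU date.
# $1 = alarm generation time in local time, e.g. "2024-05-01 12:00:00"
log_window() {
    alarm_epoch=$(date -d "$1" +%s) || return 1
    start_epoch=$((alarm_epoch - 30 * 60))   # 30 minutes before the alarm
    end_epoch=$((alarm_epoch + 30 * 60))     # 30 minutes after the alarm
    echo "Start Date: $(date -d "@$start_epoch" '+%Y-%m-%d %H:%M:%S')"
    echo "End Date:   $(date -d "@$end_epoch" '+%Y-%m-%d %H:%M:%S')"
}

log_window "2024-05-01 12:00:00"
```

The two printed values are the ones to enter as Start Date and End Date in the download dialog.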