ALM-12207 Slow Disk Processing Timeout

Alarm Description

When slow disk detection is enabled, the system checks the slow disk processing status every 10 minutes by default. This alarm is generated when any of the following disk or node statuses remains unchanged for 10 hours.

Disk: Automatic isolation aborted, Isolated, Isolation failed, and De-isolation failed.

Node: Isolated, Isolation failed, Isolation cancellation failed, Node startup failed, and De-isolated.

This alarm is automatically cleared when the status of the node or disk that is in the processing timeout state changes.

This alarm applies only to MRS 3.3.1 or later.
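
The timeout condition described above comes down to comparing the time a status last changed with the current time. The following is a minimal sketch of that arithmetic only, not the actual MRS implementation. The function name `is_processing_timed_out`, the epoch-second inputs, and the inclusive boundary (exactly 10 hours counts as timed out) are all assumptions made for illustration.

```shell
#!/bin/sh
# Illustrative only: names, inputs, and the inclusive boundary are assumptions,
# not MRS internals.
TIMEOUT_SECONDS=$((10 * 60 * 60))   # 10-hour timeout from the alarm description

# Usage: is_processing_timed_out <last_status_change_epoch> <now_epoch>
# Returns 0 (true) if the status has stayed unchanged for at least the timeout.
is_processing_timed_out() {
    last_change=$1
    now=$2
    [ $((now - last_change)) -ge "$TIMEOUT_SECONDS" ]
}

# Example: a status that last changed 11 hours ago counts as timed out.
if is_processing_timed_out 0 $((11 * 3600)); then
    echo "timed out"
else
    echo "still within the timeout"
fi
```

With the default 10-minute check interval, a status would be evaluated roughly 60 times before the 10-hour timeout is reached.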

Alarm Attributes

Alarm ID	Alarm Severity	Auto Cleared
12207	Major	Yes

Alarm Parameters

Type	Parameter	Description
Location Information	Source	Specifies the cluster or system for which the alarm was generated.
	ServiceName	Specifies the service for which the alarm was generated.
	RoleName	Specifies the role for which the alarm was generated.
	HostName	Specifies the host for which the alarm was generated.
	DiskName	Specifies the disk for which the alarm was generated.
Additional Information	HostName	Specifies the host for which the alarm was generated.
	DiskName	Specifies the disk for which the alarm was generated.
	Details	Specifies the description of the slow disk isolation.

Impact on the System

If an isolated disk or node cannot be restored in a timely manner, the running of components may be affected, which further affects user services.

Possible Causes

The isolation status of the disk or node exceeds the configured timeout period for processing slow disks.

Handling Procedure

Check the cause of the slow disk processing timeout.

1. Log in to FusionInsight Manager and choose O&M > Alarm > Alarms. In the alarm list, expand the alarm details, and record the host or disk for which the alarm was generated.
2. Log in to the active OMS node as user root and run the following command to open the controller log. Look for the cause of the slow disk processing timeout and check whether the log contains obvious error information:

vi /var/log/Bigdata/controller/controller.log
- If yes, go to Step 4.
- If no, go to Step 3.
3. Log in to the node for which the alarm was generated as user root and run the following command to open the agent log. Look for the cause of the slow disk processing timeout and check whether the log contains any error information:

vi /var/log/Bigdata/nodeagent/agentlog/agent.log
- If yes, go to Step 4.
- If no, go to Step 5.
4. Contact O&M engineers to rectify the fault and manually run the command for the slow disk or node. After the command is executed, wait 5 minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 5.
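
As a non-interactive alternative to opening the controller and agent logs in vi, the log checks above could be narrowed with grep. This is a minimal sketch under two assumptions that this document does not confirm: that error entries contain the keyword `ERROR`, and that relevant entries mention the host or disk name recorded from the alarm details. The function name `scan_log` is hypothetical.

```shell
#!/bin/sh
# Sketch only: the "ERROR" keyword and the presence of the host or disk name
# in log lines are assumptions about the log format, not documented behavior.

# Usage: scan_log <log_file> <host_or_disk_name>
# Prints up to the last 20 lines that mention the target and contain "ERROR".
scan_log() {
    log_file=$1
    target=$2
    grep -F -- "$target" "$log_file" | grep -F "ERROR" | tail -n 20
}

# Example (on the active OMS node; HOST_NAME is the host recorded from the alarm):
# scan_log /var/log/Bigdata/controller/controller.log "$HOST_NAME"
# Example (on the node for which the alarm was generated):
# scan_log /var/log/Bigdata/nodeagent/agentlog/agent.log "$HOST_NAME"
```

If the function prints nothing, either no matching errors exist or the assumed keyword does not match the log format, so a manual review of the log is still needed before moving on.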

Collect fault information.

5. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
6. Select Controller and NodeAgent for Service, select the active and standby OMS nodes and the node for which the alarm was generated in the Host area, and click OK.
7. Click the edit icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
8. Contact O&M engineers and provide the collected logs.
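
The Start Date and End Date for log collection can be worked out from the alarm generation time with simple date arithmetic. The sketch below assumes GNU date (the `-d` option) is available on the machine where it is run. The function name `log_window` and the sample alarm time are hypothetical.

```shell
#!/bin/sh
# Sketch only: requires GNU date (-d); the function name is hypothetical.
WINDOW_SECONDS=$((10 * 60))   # 10 minutes on each side of the alarm time

# Usage: log_window "<YYYY-MM-DD HH:MM:SS>"
# Prints the start and the end of the collection window, one per line.
log_window() {
    alarm_epoch=$(date -d "$1" +%s) || return 1
    date -d "@$((alarm_epoch - WINDOW_SECONDS))" '+%Y-%m-%d %H:%M:%S'
    date -d "@$((alarm_epoch + WINDOW_SECONDS))" '+%Y-%m-%d %H:%M:%S'
}

# Example with a made-up alarm generation time:
log_window "2024-05-01 12:00:00"
```

The first line of output is the value for Start Date and the second is the value for End Date.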