ALM-14027 DataNode Disk Fault

Updated at: Oct 21, 2021 GMT+08:00

Description

The system checks the disk status on DataNodes every 60 seconds. This alarm is generated when a disk is faulty.

After all faulty disks on the DataNode are recovered, you need to manually clear the alarm and restart the DataNode.

Attribute

  • Alarm ID: 14027
  • Alarm Severity: Major
  • Auto Clear: No

Parameters

  • Source: Specifies the cluster for which the alarm is generated.
  • ServiceName: Specifies the name of the service for which the alarm is generated.
  • RoleName: Specifies the name of the role for which the alarm is generated.
  • HostName: Specifies the name of the host for which the alarm is generated.
  • Failed Volumes: Specifies the list of faulty disks.

Impact on the System

This alarm indicates that the DataNode has abnormal disk partitions, which may cause written files to be lost.

Possible Causes

  • The hard disk is faulty.
  • The disk permissions are configured improperly.

Procedure

Check whether a disk alarm is generated.

  1. On FusionInsight Manager, choose O&M > Alarm > Alarms and check whether ALM-12014 Partition Lost or ALM-12033 Slow Disk Fault exists.

    • If yes, go to 2.
    • If no, go to 4.

  2. Rectify the fault by referring to the handling procedure of ALM-12014 Partition Lost or ALM-12033 Slow Disk Fault. Then, check whether the alarm is cleared.

    • If yes, go to 3.
    • If no, go to 4.

  3. Wait 5 minutes and check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 4.

Modify disk permissions.

  4. Choose O&M > Alarm > Alarms and view Location and Additional Information of the alarm to obtain the location of the faulty disk.
  5. Log in as user root to the node for which the alarm is generated. Go to the directory where the faulty disk is located, and run the ll command to check whether the permission of the faulty disk is 711 and whether the owner is omm.

    • If yes, go to 8.
    • If no, go to 6.

  6. Modify the permission of the faulty disk. For example, if the faulty disk is data1, run the following commands:

    chown omm:wheel data1

    chmod 711 data1

  7. In the alarm list on Manager, click Clear in the Operation column of the alarm to manually clear the alarm. Then choose Cluster > Services > HDFS > Instance, select the DataNode, and choose More > Restart Instance. Wait 5 minutes and check whether a new alarm is reported.

    • If no, no further action is required.
    • If yes, go to 8.
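
The permission check in this procedure can also be scripted rather than read from the ll output by eye. The following is a minimal sketch, not part of the product: the function name check_data_dir is hypothetical, it assumes GNU coreutils stat is available on the node, and the defaults (711, omm) are taken from the procedure above.

```shell
#!/bin/sh
# Hypothetical helper (not a FusionInsight tool): report whether a DataNode
# data directory has the expected permission bits and owner.
# Usage: check_data_dir DIR [MODE] [OWNER]
# Defaults follow this procedure: mode 711, owner omm.
# Assumes GNU coreutils `stat -c`.
check_data_dir() {
    dir=$1
    want_mode=${2:-711}
    want_owner=${3:-omm}

    if [ ! -d "$dir" ]; then
        echo "missing: $dir is not a directory"
        return 2
    fi

    # %a = permission bits in octal, %U = owner user name
    actual=$(stat -c '%a %U' "$dir") || return 2

    if [ "$actual" = "$want_mode $want_owner" ]; then
        echo "ok: $dir ($actual)"
        return 0
    fi

    echo "mismatch: $dir is '$actual', expected '$want_mode $want_owner'"
    return 1
}
```

For example, from the directory that contains the faulty disk, check_data_dir data1 returns 0 when the permission is 711 and the owner is omm, and returns 1 otherwise, in which case the chown and chmod commands above apply.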

Collect the fault information.

  8. On the FusionInsight Manager portal, choose O&M > Log > Download.
  9. Expand the Service drop-down list, and select HDFS and OMS for the target cluster.
  10. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 20 minutes before and 20 minutes after the alarm generation time, respectively. Then, click Download.
  11. Contact O&M personnel and send the collected logs.
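
The Start Date and End Date for log collection can be worked out from the alarm generation time with a small script. This is a sketch only: the function name log_window is hypothetical, it assumes GNU date (the -d option), and it interprets the time in the local time zone of the machine it runs on.

```shell
#!/bin/sh
# Hypothetical helper (not a FusionInsight tool): print the log-collection
# window around an alarm generation time.
# Usage: log_window "YYYY-MM-DD HH:MM:SS" [MINUTES]
# MINUTES defaults to 20, as in the procedure above.
# Assumes GNU `date -d`; times are read and printed in the local time zone.
log_window() {
    alarm_time=$1
    margin_min=${2:-20}

    # Convert to epoch seconds so the subtraction and addition are exact.
    alarm_epoch=$(date -d "$alarm_time" +%s) || return 1
    start_epoch=$((alarm_epoch - margin_min * 60))
    end_epoch=$((alarm_epoch + margin_min * 60))

    echo "Start Date: $(date -d "@$start_epoch" '+%Y-%m-%d %H:%M:%S')"
    echo "End Date:   $(date -d "@$end_epoch" '+%Y-%m-%d %H:%M:%S')"
}
```

For an illustrative alarm time of 14:30:00, log_window "2021-10-21 14:30:00" yields a window from 14:10:00 to 14:50:00 on the same day. Enter these values in the Start Date and End Date fields.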

Alarm Clearing

The system does not automatically clear this alarm after the fault is rectified. You need to clear it manually.

Related Information

None
