ALM-14027 DataNode Disk Fault

Updated at: Mar 25, 2021 GMT+08:00

Description

The system checks the disk status on DataNodes every 60 seconds. This alarm is generated when a disk is faulty.

This alarm is cleared after all faulty disks on the DataNode are restored and the DataNode is restarted.

Attribute

Alarm ID    Alarm Severity    Automatically Cleared
14027       Major             Yes

Parameters

Name              Meaning
Source            Specifies the cluster for which the alarm is generated.
ServiceName       Specifies the service for which the alarm is generated.
RoleName          Specifies the role for which the alarm is generated.
HostName          Specifies the host for which the alarm is generated.
Failed Volumes    List of faulty disks.

Impact on the System

If a DataNode disk fault alarm is reported, a faulty disk partition exists on the DataNode, which may cause the loss of written files.

Possible Causes

  • The hard disk is faulty.
  • The disk permissions are assigned improperly.
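As a rough manual spot check for the two causes above, the following sketch tests whether a data directory exists and is readable, writable, and executable by the current user. It is an illustration only and is not the check that the system itself performs; the function name `check_data_dir` and the path in the usage comment are hypothetical. Run it as the user that owns the DataNode process (omm), because root bypasses most permission checks.

```shell
#!/usr/bin/env bash
# Illustrative sketch: report whether a directory looks usable by the current user.
# Returns 0 if the directory exists and is readable, writable, and executable.
check_data_dir() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "FAILED $dir: not a directory"
        return 1
    fi
    if [ ! -r "$dir" ] || [ ! -w "$dir" ] || [ ! -x "$dir" ]; then
        echo "FAILED $dir: missing read, write, or execute access"
        return 1
    fi
    echo "OK $dir"
}

# Usage (placeholder path; take the real one from the Failed Volumes parameter):
#   check_data_dir /path/to/data1
```

A FAILED result for a directory that exists points toward a permission problem, while a missing directory or I/O errors point toward a hardware or mount fault.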

Procedure

Check whether disk alarms are generated.

  1. On the FusionInsight Manager portal, choose O&M > Alarm > Alarms, and check whether ALM-12014 Partition Lost or ALM-12033 Slow Disk Fault is reported.

    • If yes, go to 2.
    • If no, go to 4.

  2. Rectify the fault by referring to ALM-12014 Partition Lost or ALM-12033 Slow Disk Fault. Then, check whether the disk alarm is cleared.

    • If yes, go to 3.
    • If no, go to 4.

  3. Five minutes later, check whether this alarm (ALM-14027) is cleared.

    • If yes, no further action is required.
    • If no, go to 4.

Assign the disk permissions properly.

  4. Choose O&M > Alarm > Alarms. On the displayed page, view Location and Additional Information of the alarm to obtain the location of the faulty disk.
  5. Log in as user root to the node for which this alarm is generated. Go to the directory where the faulty disk resides, and run the ll command to check whether the permission of the faulty disk is 711 and whether the owner is omm.

    • If yes, go to 8.
    • If no, go to 6.

  6. Modify the owner and permission of the faulty disk. For example, to modify them for disk "data1", run the following commands:

    chown omm:wheel data1

    chmod 711 data1

  7. Restart the DataNode. One minute later, check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 8.
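The permission check in this procedure can also be scripted. The following is a minimal sketch, assuming GNU coreutils `stat` is available on the node; the function name `check_disk_perm` is hypothetical, and the defaults (mode 711, owner omm) are taken from the step above.

```shell
#!/usr/bin/env bash
# Illustrative sketch: compare a directory's mode and owner with expected values.
# Prints "ok" and returns 0 on a match, prints the mismatch and returns 1 otherwise.
# Returns 2 if stat cannot read the directory.
check_disk_perm() {
    local dir="$1" want_mode="${2:-711}" want_owner="${3:-omm}"
    local mode owner
    mode=$(stat -c '%a' "$dir") || return 2    # octal permission bits, e.g. 711
    owner=$(stat -c '%U' "$dir") || return 2   # owning user name
    if [ "$mode" = "$want_mode" ] && [ "$owner" = "$want_owner" ]; then
        echo "ok"
    else
        echo "mismatch: mode=$mode owner=$owner"
        return 1
    fi
}

# Usage, from the directory where the faulty disk resides:
#   check_disk_perm data1 || { chown omm:wheel data1; chmod 711 data1; }
```

The sketch only reports and leaves the chown and chmod commands to the operator, so that nothing is changed on a disk that turns out to have a hardware fault.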

Collect fault information.

  8. On the FusionInsight Manager portal, choose O&M > Log > Download.
  9. In Service, select HDFS and OMS for the required cluster.
  10. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 20 minutes before and after the alarm generation time, respectively. Then, click Download.
  11. Contact the O&M personnel and send the collected logs.
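To work out the Start Date and End Date for the log collection step, the window of 20 minutes on either side of the alarm generation time can be computed as follows. This is a small sketch that assumes GNU `date`; the function name `log_window` and the sample timestamp are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative sketch: print the log collection window around an alarm time.
# $1 is the alarm generation time; $2 is the margin in minutes (default 20).
log_window() {
    local alarm_time="$1" minutes="${2:-20}"
    local epoch start end
    epoch=$(date -d "$alarm_time" +%s) || return 1   # convert to seconds since the epoch
    start=$(date -d "@$((epoch - minutes * 60))" '+%Y-%m-%d %H:%M:%S')
    end=$(date -d "@$((epoch + minutes * 60))" '+%Y-%m-%d %H:%M:%S')
    echo "$start to $end"
}

# Usage:
#   log_window '2021-03-25 10:00:00'
```

Converting to epoch seconds first avoids ambiguity in how `date` parses relative offsets that are appended to a timestamp.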

Alarm Clearing

After the fault is rectified, the system automatically clears this alarm.

Related Information

None
