Updated on 2024-04-11 GMT+08:00

ALM-19000 HBase Service Unavailable

Alarm Description

The alarm module checks the HBase service status every 120 seconds. This alarm is generated when the HBase service is unavailable.

This alarm is cleared when the HBase service recovers.

Alarm Attributes

| Alarm ID | Alarm Severity | Auto Cleared |
| --- | --- | --- |
| 19000 | Critical | Yes |

Alarm Parameters

| Parameter | Description |
| --- | --- |
| Source | Specifies the cluster for which the alarm was generated. |
| ServiceName | Specifies the service for which the alarm was generated. |
| RoleName | Specifies the role for which the alarm was generated. |
| HostName | Specifies the host for which the alarm was generated. |

Impact on the System

Operations such as data read/write and table creation cannot be performed.

Possible Causes

  • ZooKeeper is abnormal.
  • HDFS is abnormal.
  • HBase is abnormal.
  • The network connection is abnormal.
  • The service configuration value is incorrect.

Handling Procedure

Check the ZooKeeper service status.

  1. In the service list on FusionInsight Manager, check whether Running Status of ZooKeeper is Normal.

    • If yes, go to 5.
    • If no, go to 2.

  2. In the alarm list, check whether ALM-13000 ZooKeeper Service Unavailable exists.

    • If yes, go to 3.
    • If no, go to 5.

  3. Rectify the fault by performing the operations provided for ALM-13000 ZooKeeper Service Unavailable.
  4. Wait several minutes and check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 5.

Check the HDFS service status.

  5. In the alarm list, check whether ALM-14000 HDFS Service Unavailable exists.

    • If yes, go to 6.
    • If no, go to 8.

  6. Rectify the fault by performing the operations provided for ALM-14000 HDFS Service Unavailable.
  7. Wait several minutes and check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 8.

  8. On FusionInsight Manager, choose Cluster, click the name of the desired cluster, choose Services > HDFS, and check whether Safe Mode of HDFS is ON.

    • If yes, go to 9.
    • If no, go to 12.

  9. Log in to the node where the HDFS client is installed as user root. Run the cd command to go to the client installation directory, and then run the source bigdata_env command.

    If the cluster is in security mode, perform security authentication: obtain the password of user hdfs from the MRS cluster administrator, run the kinit hdfs command, and enter the password as prompted.

  10. Run the following command to manually exit safe mode:

    hdfs dfsadmin -safemode leave

  11. Wait several minutes and check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 12.
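The safe mode check and exit described above can also be scripted from the HDFS client. The following is a minimal sketch, assuming the client environment has already been loaded with source bigdata_env and, in security mode, kinit hdfs has been run. The function names safemode_is_on and leave_safemode_if_on are hypothetical helpers, not HDFS commands. The sketch relies on hdfs dfsadmin -safemode get, which reports the state as "Safe mode is ON" or "Safe mode is OFF".

```shell
# Hypothetical helper: reads the output of `hdfs dfsadmin -safemode get`
# from stdin and succeeds if any NameNode reports that safe mode is ON.
safemode_is_on() {
    grep -q "Safe mode is ON"
}

# Hypothetical helper: leaves safe mode only when it is actually ON.
leave_safemode_if_on() {
    if hdfs dfsadmin -safemode get | safemode_is_on; then
        hdfs dfsadmin -safemode leave
    else
        echo "Safe mode is already OFF; nothing to do."
    fi
}
```

Calling leave_safemode_if_on on the client node runs the exit command only when needed, so a cluster that is not in safe mode is left untouched.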

Check the HBase service status.

  12. On FusionInsight Manager, choose Cluster, click the name of the desired cluster, and choose Services > HBase.
  13. Check whether there is one active HMaster and one standby HMaster.

    • If yes, go to 15.
    • If no, go to 14.

  14. Click Instances and select the HMaster instance whose status is not Active. Click More and select Restart Instance to restart HMaster. Then check whether there is one active HMaster and one standby HMaster.

    • If yes, go to 15.
    • If no, go to 21.

  15. Choose Cluster, click the name of the desired cluster, choose Services > HBase, and click HMaster(Active) to access the HMaster web UI.

    By default, the admin user does not have the permissions to manage other components. If the page cannot be opened or the displayed content is incomplete when you access the native UI of a component due to insufficient permissions, you can manually create a user with the permissions to manage that component.

  16. Check whether at least one RegionServer exists under Region Servers.

    • If yes, go to 17.
    • If no, go to 21.

  17. Choose Tables > System Tables and check whether hbase:meta, hbase:namespace, and hbase:acl exist in the Table Name column, as shown in Figure 1.

    Figure 1 HBase system tables
    • If yes, go to 18.
    • If no, go to 19.

  18. Click hbase:meta, hbase:namespace, and hbase:acl and check whether each table page opens. If all pages open, the tables are normal.

    In a cluster running in normal mode, ACL permission control is disabled for HBase by default, and the hbase:acl table is generated only after ACL permission control is manually enabled. Check hbase:acl only in that case.

    • If yes, go to 19.
    • If no, go to 25.

  19. View the HMaster startup status.

    On the Tasks page shown in Figure 2, the value RUNNING in the State column indicates that HMaster is still starting, and the page shows how long HMaster has been in that state. If the state is COMPLETE, as shown in Figure 3, HMaster has started.

    Figure 2 HMaster being started
    Figure 3 HMaster startup completed

    Check whether HMaster has been in the RUNNING state for a long time.

    • If yes, go to 20.
    • If no, go to 21.

  20. On the HMaster web UI, check whether any hbase:meta region has been in the Regions in Transition state for a long time.

    Figure 4 Regions in Transition
    • If yes, go to 21.
    • If no, go to 22.

  21. After confirming that services will not be affected, log in to FusionInsight Manager, choose Cluster, click the name of the desired cluster, choose Services > HBase, click More, and select Restart Service. In the dialog box that is displayed, enter the password and click OK. Then check whether the service restarts successfully.

    • If yes, go to 22.
    • If no, go to 25.

  22. Wait several minutes and check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 25.
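The HMaster and RegionServer counts checked on the web UI above can be cross-checked from the command line. The following is a minimal sketch, assuming an HBase client with the environment loaded and that the status command of hbase shell prints a summary line of the form "1 active master, 1 backup masters, 3 servers, 0 dead, 2.0000 average load" (the exact wording may differ between HBase versions). The functions summary_field and check_hbase_summary are hypothetical helpers written for this example.

```shell
# Hypothetical helper: reads one summary line from stdin and prints the
# number that precedes the given keyword (for example "active" or "servers").
summary_field() {
    awk -v key="$1" '{
        for (i = 2; i <= NF; i++) {
            word = $i
            sub(/,$/, "", word)          # drop a trailing comma
            if (word == key) { print $(i - 1); exit }
        }
    }'
}

# Hypothetical helper: takes the summary line as $1 and reports the first
# condition that does not match one active HMaster, at least one standby
# HMaster, and at least one RegionServer.
check_hbase_summary() {
    line=$1
    active=$(printf '%s\n' "$line" | summary_field active)
    backup=$(printf '%s\n' "$line" | summary_field backup)
    servers=$(printf '%s\n' "$line" | summary_field servers)
    if [ "${active:-0}" -ne 1 ]; then
        echo "expected 1 active HMaster, found ${active:-0}"; return 1
    fi
    if [ "${backup:-0}" -lt 1 ]; then
        echo "no standby HMaster found"; return 1
    fi
    if [ "${servers:-0}" -lt 1 ]; then
        echo "no RegionServer found"; return 1
    fi
    echo "ok"
}

# Example call on a client node:
#   check_hbase_summary "$(echo status | hbase shell -n 2>/dev/null | grep 'active master')"
```

If HBase is unavailable, the shell may print nothing that matches, in which case the function reports a missing active HMaster rather than "ok".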

Check whether the HBase configurations are correctly modified.

  23. On FusionInsight Manager, choose Audit. On the Audit page, click Advanced Search, click the icon to the right of Operation Type, select Save configuration, click OK, and click Search.
  24. In the search result, check whether any historical configuration change to an HBase-related service listed in the Service column, such as ZooKeeper, HDFS, or HBase, may have affected the HBase service status. Table 1 lists some configurations that may affect the HBase service status.

    Table 1 Configurations affecting the HBase service status

    | Parameter | Possible Impact |
    | --- | --- |
    | GC_OPTS | The memory configuration may be improper. You need to check the health status of instance processes. |
    | hbase.rpc.protection | If the HBase service is not restarted offline after the value of this parameter is changed, the connection authentication fails and the HBase service becomes abnormal. |
    | hbase.regionserver.metahandler.count | If there are too many regions in the cluster but this parameter is set to a small value, regions may stay in the Regions in Transition (RIT) state and cannot be brought online for a long time. |
    | hbase.regionserver.thread.compaction.large | If this parameter is set to a large value, the node CPU usage may be too high. |
    | hbase.regionserver.thread.compaction.small | If this parameter is set to a large value, the node CPU usage may be too high. |
    | hbase.coprocessor.master.classes | If a custom coprocessor is used in the configuration, a logic error may cause the service to be unavailable. |
    | hbase.coprocessor.region.classes | If a custom coprocessor is used in the configuration, a logic error may cause the service to be unavailable. |
    | hbase.coprocessor.regionserver.classes | If a custom coprocessor is used in the configuration, a logic error may cause the service to be unavailable. |
    | zookeeper.session.timeout | If this parameter is set to a small value, the connection between HBase and ZooKeeper times out too quickly. As a result, the HMaster instance and RegionServer may restart repeatedly. |
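To compare the audited changes with what is currently in effect, the parameters from Table 1 can be pulled out of an hbase-site.xml file. The following is a rough sketch, assuming the standard Hadoop configuration layout in which each property has a name element followed by a value element; the location of hbase-site.xml depends on the installation and is passed in by the caller. The function print_hbase_params is a hypothetical helper. GC_OPTS is not covered because it is not stored as an hbase-site.xml property.

```shell
# Hypothetical helper: prints name=value for the Table 1 parameters found
# in the hbase-site.xml file given as $1. Assumes <name> appears before
# <value> within each property.
print_hbase_params() {
    wanted='^(hbase[.]rpc[.]protection|hbase[.]regionserver[.]metahandler[.]count|hbase[.]regionserver[.]thread[.]compaction[.](large|small)|hbase[.]coprocessor[.](master|region|regionserver)[.]classes|zookeeper[.]session[.]timeout)$'
    awk -v wanted="$wanted" '
        /<name>/ {
            name = $0
            sub(/.*<name>/, "", name)      # strip everything up to <name>
            sub(/<\/name>.*/, "", name)    # strip </name> and what follows
        }
        /<value>/ {
            value = $0
            sub(/.*<value>/, "", value)
            sub(/<\/value>.*/, "", value)
            if (name ~ wanted) print name "=" value
        }
    ' "$1"
}
```

A value that looks suspicious here, such as a very small zookeeper.session.timeout, is a hint to review the corresponding audit record, not proof that the parameter caused the alarm.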

Check the network connection between HMaster and dependent components.

  25. On FusionInsight Manager, choose Cluster, click the name of the desired cluster, and choose Services > HBase.
  26. Click Instances. In the HMaster instance list, record the management IP address of the active HMaster instance.
  27. Log in to the active HMaster node as user omm using the IP address obtained in 26.
  28. Run the ping command to check whether the network connection between the active HMaster node and each host where a dependent component resides is normal. The dependent components are ZooKeeper, HDFS, and Yarn. Obtain the IP addresses of their hosts in the same way as the IP address of the active HMaster node.

    • If yes, go to 31.
    • If no, go to 29.

  29. Contact the network administrator to restore the network.
  30. In the alarm list, check whether this alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 31.
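When several hosts need to be checked, the ping test above can be run in a loop from the active HMaster node. The following is a minimal sketch, assuming a Linux ping that accepts -c (packet count) and -W (reply timeout in seconds). The function check_hosts is a hypothetical helper, and the IP addresses in the example call are placeholders for the ZooKeeper, HDFS, and Yarn hosts recorded on the Instances pages.

```shell
# Hypothetical helper: pings every host given as an argument, prints one
# result line per host, and returns non-zero if any host is unreachable.
check_hosts() {
    failed=0
    for host in "$@"; do
        if ping -c 3 -W 2 "$host" > /dev/null 2>&1; then
            echo "reachable: $host"
        else
            echo "UNREACHABLE: $host"
            failed=1
        fi
    done
    return "$failed"
}

# Example call with placeholder addresses:
#   check_hosts 192.168.0.11 192.168.0.12 192.168.0.13
```

A successful ping only shows basic reachability. It does not confirm that the service ports of the dependent components are open.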

Collect fault information.

  31. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
  32. Expand the drop-down list next to the Service field. In the Services dialog box that is displayed, select the following services for the target cluster:

    • ZooKeeper
    • HDFS
    • HBase

  33. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
  34. Contact O&M personnel and provide the collected logs.
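The Start Date and End Date for log collection can be worked out from the alarm generation time. The following is a small sketch, assuming GNU date (as found on most Linux distributions) and an alarm time written as "YYYY-MM-DD HH:MM:SS" in the local time zone of the shell. The function log_window is a hypothetical helper, not a FusionInsight Manager tool.

```shell
# Hypothetical helper: prints the log collection window, 10 minutes
# (600 seconds) before and after the alarm generation time given as $1.
log_window() {
    alarm_epoch=$(date -d "$1" +%s) || return 1
    fmt='+%Y-%m-%d %H:%M:%S'
    echo "Start Date: $(date -d "@$((alarm_epoch - 600))" "$fmt")"
    echo "End Date:   $(date -d "@$((alarm_epoch + 600))" "$fmt")"
}

# Example call:
#   log_window "2024-04-11 10:00:00"
```

Converting to epoch seconds first avoids the ambiguity of relative expressions such as "- 10 minutes" placed directly after a time of day.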

Alarm Clearance

This alarm is automatically cleared after the fault is rectified.

Related Information

None