ALM-19000 HBase Service Unavailable

Description

This alarm is generated when the HBase service is unavailable. The alarm module checks the HBase service status every 120 seconds.

This alarm is cleared when the HBase service recovers.

If the multi-instance function is enabled in the cluster and multiple HBase service instances are installed, use the value of ServiceName in Location to determine which HBase service instance generated the alarm. For example, if the HBase1 service is unavailable, ServiceName=HBase1 is displayed in Location, and the operation object in the procedure must be changed from HBase to HBase1.

Attribute

Alarm ID	Alarm Severity	Automatically Cleared
19000	Critical	Yes

Parameters

Name	Meaning
Source	Specifies the cluster for which the alarm is generated.
ServiceName	Specifies the service for which the alarm is generated.
RoleName	Specifies the role for which the alarm is generated.
HostName	Specifies the host for which the alarm is generated.

Impact on the System

Operations, such as reading or writing data and creating tables, cannot be performed.

Possible Causes

The ZooKeeper service is abnormal.
The HDFS service is abnormal.
The HBase service is abnormal.
The network is abnormal.

Procedure

Check the ZooKeeper service status.

1. On FusionInsight Manager, check whether the running status of ZooKeeper in the service list is Normal.
   - If yes, go to 5.
   - If no, go to 2.
2. In the alarm list, check whether ALM-13000 ZooKeeper Service Unavailable exists.
   - If yes, go to 3.
   - If no, go to 5.
3. Rectify the fault by following the steps provided in ALM-13000 ZooKeeper Service Unavailable.
4. Wait several minutes, and check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 5.

Check the HDFS service status.

5. In the alarm list, check whether ALM-14000 HDFS Service Unavailable exists.
   - If yes, go to 6.
   - If no, go to 8.
6. Rectify the fault by following the steps provided in ALM-14000 HDFS Service Unavailable.
7. Wait several minutes, and check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 8.
8. On the FusionInsight Manager portal, choose Cluster > Name of the desired cluster > Services > HDFS. Check whether Safe Mode is ON.
   - If yes, go to 9.
   - If no, go to 12.
9. Log in to the HDFS client as user root. Run the cd command to switch to the client installation directory, and then run source bigdata_env.

   If the cluster uses the security mode, perform security authentication: obtain the password of user hdfs from the administrator, run the kinit hdfs command, and enter the password as prompted.
10. Run the following command to manually exit the safe mode:

    hdfs dfsadmin -safemode leave
11. Wait several minutes and check whether the alarm is cleared.
    - If yes, no further action is required.
    - If no, go to 12.
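
The safe mode check and exit described above can also be scripted on the HDFS client. The following is a minimal sketch, not part of the official procedure. It assumes that the client environment has already been loaded with source bigdata_env and that, in security mode, kinit hdfs has succeeded. The function names safemode_state and leave_safemode_if_on are illustrative only. The sketch relies on hdfs dfsadmin -safemode get, which reports whether safe mode is ON or OFF.

```shell
#!/usr/bin/env bash
# Sketch only: function names are illustrative and not part of the product.

# Reduce the text printed by "hdfs dfsadmin -safemode get" to ON, OFF, or UNKNOWN.
# Reads the command output on stdin so that it can be tried without a cluster.
# ON is matched first, so a cluster with any NameNode in safe mode reports ON.
safemode_state() {
    local output
    output=$(cat)
    case "$output" in
        *"Safe mode is ON"*)  echo "ON" ;;
        *"Safe mode is OFF"*) echo "OFF" ;;
        *)                    echo "UNKNOWN" ;;
    esac
}

# Leave safe mode only if HDFS reports that it is currently ON.
leave_safemode_if_on() {
    local state
    state=$(hdfs dfsadmin -safemode get 2>&1 | safemode_state)
    if [ "$state" = "ON" ]; then
        hdfs dfsadmin -safemode leave
    else
        echo "Safe mode state is $state; nothing to do."
    fi
}
```

To use the sketch, run leave_safemode_if_on on the HDFS client after the environment and authentication steps. Checking the state first avoids issuing the leave command when safe mode is already OFF or when the state cannot be read.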

Check the HBase service status.

12. On the FusionInsight Manager portal, choose Cluster > Name of the desired cluster > Services > HBase.
13. Check whether there is one active HMaster and one standby HMaster.
    - If yes, go to 15.
    - If no, go to 14.
14. Click Instances, select the HMaster whose status is not Active, click More, and select Restart Instance to restart the HMaster. Then check again whether there is one active HMaster and one standby HMaster.
    - If yes, go to 15.
    - If no, go to 21.
15. Choose Cluster > Name of the desired cluster > Services > HBase > HMaster(Active) to go to the HMaster WebUI.

    By default, the admin user does not have the permissions to manage other components. If the native UI of a component cannot be opened or its content is incomplete because of insufficient permissions, manually create a user with the permissions to manage that component.
16. Check whether at least one RegionServer exists under Region Servers.
    - If yes, go to 17.
    - If no, go to 21.
17. Choose Tables > System Tables, as shown in Figure 1. Check whether hbase:meta, hbase:namespace, and hbase:acl exist in the Table Name column.
    - If yes, go to 18.
    - If no, go to 19.

    Figure 1 HBase system table
18. As shown in Figure 1, click the hbase:meta, hbase:namespace, and hbase:acl hyperlinks and check whether the pages are properly displayed. If they are, the tables are normal.
    - If yes, go to 19.
    - If no, go to 23.

In normal mode, ACL is not enabled for HBase by default, and the hbase:acl table is generated only after ACL is manually enabled. If ACL has been enabled, check this table; otherwise, it does not need to be checked.
19. View the HMaster startup status.

    In Figure 2, a task in the RUNNING state under Tasks indicates that HMaster is still starting, and the State column shows how long HMaster has been in the RUNNING state. In Figure 3, the COMPLETE state indicates that HMaster has started.

    Check whether HMaster has been in the RUNNING state for a long time.
    - If yes, go to 20.
    - If no, go to 21.

    Figure 2 HMaster is being started

    Figure 3 HMaster is started
20. On the HMaster WebUI, check whether any hbase:meta region has been in the Region in Transition state for a long time.
    - If yes, go to 21.
    - If no, go to 22.

    Figure 4 Region in Transition
21. If services will not be affected, log in to the FusionInsight Manager portal and choose Cluster > Name of the desired cluster > Services > HBase > More > Restart Service. Enter the administrator password and click OK. Check whether the service restarts successfully.
    - If yes, go to 22.
    - If no, go to 23.
22. Wait several minutes and check whether the alarm is cleared.
    - If yes, no further action is required.
    - If no, go to 23.
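
The HMaster and RegionServer counts checked above on the WebUI can also be cross-checked from an HBase client. The following is a hedged sketch rather than the documented method. It assumes that the hbase shell is available on the client, that authentication has been completed in security mode, and that the status 'summary' command prints a line of the form "1 active master, 1 backup masters, 3 servers, 0 dead, 2.0000 average load". That line format may differ between HBase versions, and the function name hbase_topology_ok is illustrative.

```shell
#!/usr/bin/env bash
# Sketch only: hbase_topology_ok is an illustrative name, and the summary
# line format it parses is an assumption that may vary by HBase version.

# Read HBase shell output on stdin and succeed only if the summary line shows
# exactly one active master, at least one backup master, and at least one
# RegionServer.
hbase_topology_ok() {
    awk '
        /active master/ {
            found = 1
            # Fields: $1 = active masters, $4 = backup masters, $7 = RegionServers
            if ($1 == 1 && $4 >= 1 && $7 >= 1) ok = 1
        }
        END {
            if (found && ok) exit 0
            exit 1
        }
    '
}
```

A possible invocation is echo "status 'summary'" | hbase shell -n 2>/dev/null | hbase_topology_ok, followed by a check of the exit status. A non-zero status only indicates that the expected line was missing or the counts were off, so the WebUI remains the reference for the detailed state.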

Check the network connection between HMaster and dependent components.

23. On FusionInsight Manager, choose Cluster > Name of the desired cluster > Services > HBase.
24. Click Instance to display the HMaster instance list. Record the management IP address in the row of HMaster(Active).
25. Use the IP address obtained in 24 to log in to the host where the active HMaster runs as user omm.
26. Run the ping command to check whether the host that runs the active HMaster can communicate with the hosts that run the dependent components. (The dependent components include ZooKeeper, HDFS, and Yarn. Obtain the IP addresses of the hosts that run these services in the same way as the IP address of the active HMaster.)
    - If yes, go to 29.
    - If no, go to 27.
27. Contact the administrator to restore the network.
28. In the alarm list, check whether the HBase Service Unavailable alarm is cleared.
    - If yes, no further action is required.
    - If no, go to 29.
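
When several dependent hosts must be checked, the ping test above can be repeated in a loop. The following is a minimal sketch under stated assumptions: it targets the Linux ping command (-c for the packet count, -W for the reply timeout in seconds), the function name check_reachability is illustrative, and PING_CMD is a hypothetical override hook that exists only so the loop can be tried without a network.

```shell
#!/usr/bin/env bash
# Sketch only: check_reachability and PING_CMD are illustrative names.

# Command used to probe a host; defaults to the system ping.
PING_CMD=${PING_CMD:-ping}

# Probe each host given as an argument, print OK or FAIL per host,
# and return non-zero if at least one host does not answer.
check_reachability() {
    local failed host
    failed=0
    for host in "$@"; do
        if "$PING_CMD" -c 3 -W 2 "$host" >/dev/null 2>&1; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return "$failed"
}
```

For example, check_reachability 192.0.2.11 192.0.2.12 192.0.2.13 could be run on the active HMaster host, where the addresses are placeholders for the ZooKeeper, HDFS, and Yarn hosts recorded earlier. Any FAIL line corresponds to the "no" branch of the communication check.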

Collect fault information.

29. On FusionInsight Manager, choose O&M > Log > Download.
30. Select the following services in the required cluster from the Service drop-down list:
    - ZooKeeper
    - HDFS
    - HBase
31. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
32. Contact the O&M personnel and send the collected logs.
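
The log collection window described above (10 minutes before and after the alarm generation time) can be computed as follows. This is a small sketch that assumes GNU date (for the -d option). The function name log_window is illustrative. The timestamp is treated as plain wall-clock time and the arithmetic is done in UTC, so daylight saving changes do not shift the result.

```shell
#!/usr/bin/env bash
# Sketch only: log_window is an illustrative name; requires GNU date.

# Print Start Date and End Date for log collection, given the alarm
# generation time in "YYYY-MM-DD HH:MM:SS" form.
log_window() {
    local alarm_time alarm_epoch margin_seconds
    alarm_time="$1"
    margin_seconds=600   # 10 minutes on each side of the alarm time
    alarm_epoch=$(date -u -d "$alarm_time" +%s) || return 1
    echo "Start Date: $(date -u -d "@$((alarm_epoch - margin_seconds))" '+%Y-%m-%d %H:%M:%S')"
    echo "End Date:   $(date -u -d "@$((alarm_epoch + margin_seconds))" '+%Y-%m-%d %H:%M:%S')"
}
```

For an alarm generated at 2024-01-01 12:00:00, log_window '2024-01-01 12:00:00' prints a start of 2024-01-01 11:50:00 and an end of 2024-01-01 12:10:00, which are the values to enter in Start Date and End Date.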