
Updated on 2024-09-23 GMT+08:00


ALM-19000 HBase Service Unavailable

Alarm Description

The alarm module checks the HBase service status every 120 seconds. This alarm is generated when the HBase service is unavailable.

This alarm is cleared when the HBase service recovers.

Alarm Attributes

Alarm ID	Alarm Severity	Auto Cleared
19000	Critical	Yes

Alarm Parameters

Parameter	Description
Source	Specifies the cluster for which the alarm was generated.
ServiceName	Specifies the service for which the alarm was generated.
RoleName	Specifies the role for which the alarm was generated.
HostName	Specifies the host for which the alarm was generated.

Impact on the System

Operations such as data read/write and table creation cannot be performed.

Possible Causes

ZooKeeper is abnormal.
HDFS is abnormal.
HBase is abnormal.
The network connection is abnormal.
Service configuration values are incorrect.

Handling Procedure

Check the ZooKeeper service status.

1. In the service list on FusionInsight Manager, check whether Running Status of ZooKeeper is Normal.
- If yes, go to 5.
- If no, go to 2.
2. In the alarm list, check whether ALM-13000 ZooKeeper Service Unavailable exists.
- If yes, go to 3.
- If no, go to 5.
3. Rectify the fault by performing the operations provided for ALM-13000 ZooKeeper Service Unavailable.
4. Wait several minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 5.
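As a command-line cross-check of the ZooKeeper Running Status, the ZooKeeper "srvr" four-letter-word command reports the mode of each server. The following is a minimal sketch, not part of the official procedure. It assumes that nc is installed on the node, that the ZooKeeper client port is 2181, and that srvr is allowed by 4lw.commands.whitelist (srvr is in the default whitelist of ZooKeeper 3.5 and later, but MRS may configure this differently). The function names zk_mode and zk_mode_from_srvr are hypothetical.

```shell
# Hypothetical helper: extracts the server mode (leader, follower, standalone)
# from "srvr" output read on stdin. Prints nothing if no "Mode:" line exists.
zk_mode_from_srvr() {
  sed -n 's/^Mode: *\([a-z]*\).*/\1/p'
}

# Hypothetical helper: sends "srvr" to one ZooKeeper server and prints its mode.
# Assumes nc is available and srvr is whitelisted on the server.
zk_mode() {
  zk_host=$1
  zk_port=${2:-2181}   # assumed client port; check clientPort in the ZooKeeper configuration
  printf 'srvr' | nc -w 3 "$zk_host" "$zk_port" 2>/dev/null | zk_mode_from_srvr
}
```

For example, zk_mode zk-node-1 (a hypothetical host name) should print leader or follower on each healthy server of a multi-node ensemble. Empty output means that the server did not answer or that srvr is not allowed.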

Check the HDFS service status.

5. In the alarm list, check whether ALM-14000 HDFS Service Unavailable exists.
- If yes, go to 6.
- If no, go to 8.
6. Rectify the fault by performing the operations provided for ALM-14000 HDFS Service Unavailable.
7. Wait several minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 8.
8. On FusionInsight Manager, choose Cluster, click the name of the desired cluster, choose Services > HDFS, and check whether Safe Mode of HDFS is ON.
- If yes, go to 9.
- If no, go to 12.
9. Log in to the node where the HDFS client is installed as user root, run the cd command to go to the client installation directory, and run the source bigdata_env command to load the environment variables.

If the cluster is in security mode (Kerberos authentication enabled), authenticate first: obtain the password of user hdfs from the MRS cluster administrator, run the kinit hdfs command, and enter the password when prompted.
10. Run the following command to manually exit safe mode:

hdfs dfsadmin -safemode leave
11. Wait several minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 12.
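The safe mode check and the exit command can be combined so that leave is issued only when HDFS is actually in safe mode. This is a minimal sketch, assuming it runs in a shell where bigdata_env has been sourced (and kinit has been run in security mode), and that hdfs dfsadmin -safemode get prints the standard "Safe mode is ON" or "Safe mode is OFF" wording, with one line per NameNode in HA clusters. The function names safemode_is_on and leave_safemode_if_on are hypothetical.

```shell
# Hypothetical helper: returns 0 if "hdfs dfsadmin -safemode get" output on stdin
# reports safe mode ON for at least one NameNode.
safemode_is_on() {
  grep -q 'Safe mode is ON'
}

# Hypothetical helper: queries the safe mode state and leaves safe mode only if it is ON.
# Requires the HDFS client environment (source bigdata_env, plus kinit in security mode).
leave_safemode_if_on() {
  if hdfs dfsadmin -safemode get | safemode_is_on; then
    hdfs dfsadmin -safemode leave
  else
    echo "HDFS is not in safe mode; nothing to do."
  fi
}
```

Checking first avoids issuing leave blindly and leaves a record in the shell output of whether safe mode was actually on.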

Check the HBase service status.

12. On FusionInsight Manager, choose Cluster, click the name of the desired cluster, and choose Services > HBase.
13. Check whether there is one active HMaster and one standby HMaster.
- If yes, go to 15.
- If no, go to 14.
14. Click Instances and select the HMaster instance whose status is not Active. Click More and select Restart Instance. After the restart, check whether there is one active HMaster and one standby HMaster.
- If yes, go to 15.
- If no, go to 21.
During the HMaster restart, table operations cannot be performed, and the HBase web UI is inaccessible. Data read and write operations are not affected.
15. Choose Cluster, click the name of the desired cluster, choose Services > HBase, and click HMaster(Active) to access the HMaster web UI.

By default, the admin user does not have permissions to manage other components. If the native UI of a component cannot be opened or displays incomplete content because of insufficient permissions, manually create a user with permissions to manage that component.
16. Check whether at least one RegionServer is listed under Region Servers.
- If yes, go to 17.
- If no, go to 21.
17. Choose Tables > System Tables and check whether hbase:meta, hbase:namespace, and hbase:acl exist in the Table Name column, as shown in Figure 1.
- If yes, go to 18.
- If no, go to 19.
Figure 1 HBase system tables
18. Click hbase:meta, hbase:namespace, and hbase:acl in turn and check whether each page can be opened. If all pages can be opened, the tables are normal.
- If yes, go to 19.
- If no, go to 25.
  
In a cluster in normal mode (security mode disabled), ACL permission control is disabled for HBase by default, and the hbase:acl table is generated only after ACL permission control has been manually enabled. Check the hbase:acl table only in that case.
19. View the HMaster startup status.

On the Tasks page (see Figure 2), RUNNING in the State column indicates that HMaster is still starting, and the page shows how long HMaster has been in that state. If the state is COMPLETE (see Figure 3), HMaster has started.

Check whether HMaster has been in the RUNNING state for a long time.

Figure 2 HMaster being started

Figure 3 HMaster startup completed
- If yes, go to 20.
- If no, go to 21.
20. On the HMaster web UI, check whether any hbase:meta region has been in the Regions in Transition (RIT) state for a long time.

Figure 4 Regions in Transition
- If yes, go to 21.
- If no, go to 22.
21. After confirming that the restart will not affect running services, log in to FusionInsight Manager, choose Cluster, click the name of the desired cluster, choose Services > HBase, click More, and select Restart Service. In the dialog box that is displayed, enter the password and click OK. Then check whether the HBase service restarts successfully.
- If yes, go to 22.
- If no, go to 25.
While HBase restarts, the service is unavailable: data cannot be read or written, table operations cannot be performed, and the HBase web UI is inaccessible.
22. Wait several minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 25.
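The HMaster and RegionServer checks can also be approximated from the HBase shell, whose status command prints a summary line such as "1 active master, 1 backup masters, 3 servers, 0 dead, 2.0000 average load". This is a minimal sketch, assuming the HBase client on the node supports non-interactive mode (hbase shell -n, available in HBase 1.4 and later) and prints the summary in that format. The function names parse_hbase_status and hbase_status_ok are hypothetical.

```shell
# Hypothetical helper: parses the HBase shell "status" summary line on stdin, e.g.
#   1 active master, 1 backup masters, 3 servers, 0 dead, 2.0000 average load
# and prints "active backup servers dead". Prints nothing if no summary is found.
parse_hbase_status() {
  sed -n 's/^ *\([0-9][0-9]*\) active master, \([0-9][0-9]*\) backup masters, \([0-9][0-9]*\) servers, \([0-9][0-9]*\) dead.*/\1 \2 \3 \4/p'
}

# Hypothetical helper: returns 0 if the summary shows exactly one active HMaster,
# at least one standby (backup) HMaster, and at least one live RegionServer.
hbase_status_ok() {
  # Word splitting is intentional: each count becomes one positional parameter.
  set -- $(parse_hbase_status)
  [ $# -eq 4 ] && [ "$1" -eq 1 ] && [ "$2" -ge 1 ] && [ "$3" -ge 1 ]
}
```

A possible invocation from the client environment is echo "status" | hbase shell -n 2>/dev/null | hbase_status_ok && echo "HMaster and RegionServers look normal". The web UI checks remain the reference; this only gives a quick scripted signal.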

Check whether the HBase configurations are correctly modified.

23. On FusionInsight Manager, choose Audit. On the Audit page, click Advanced Search, click the icon to the right of Operation Type, select Save configuration, click OK, and click Search.

24. In the search results, check whether any historical configuration changes to HBase-related services (such as ZooKeeper, HDFS, and HBase, as shown in the Service column) may affect the HBase service status. Table 1 lists some configurations that may affect the HBase service status.

Table 1 Configurations affecting the HBase service status
Parameter	Possible Impact
GC_OPTS	The memory settings may be improper. Check the health status of the instance processes.
hbase.rpc.protection	If the HBase service is not restarted offline after this parameter is changed, connection authentication fails and the HBase service becomes abnormal.
hbase.regionserver.metahandler.count	If the cluster has a large number of regions but this parameter is set to a small value, regions may stay in transition (RIT) and cannot be brought online for a long time.
hbase.regionserver.thread.compaction.large	If this parameter is set to a large value, node CPU usage may become too high.
hbase.regionserver.thread.compaction.small	If this parameter is set to a large value, node CPU usage may become too high.
hbase.coprocessor.master.classes	If a custom coprocessor is configured, a logic error in it may make the service unavailable.
hbase.coprocessor.region.classes	If a custom coprocessor is configured, a logic error in it may make the service unavailable.
hbase.coprocessor.regionserver.classes	If a custom coprocessor is configured, a logic error in it may make the service unavailable.
zookeeper.session.timeout	If this parameter is set to a small value, sessions between HBase and ZooKeeper time out too quickly, and HMaster and RegionServer instances may restart repeatedly.
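To compare the audited changes with the values currently in effect, the Table 1 keys can be printed in one pass. This is a minimal sketch, assuming the HBase client is installed on the node and that org.apache.hadoop.hbase.util.HBaseConfTool (which prints the value of one configuration key) is available through the hbase command. It reads the client-side configuration, which may differ from what the server instances use. GC_OPTS is omitted because it is assumed to be a JVM option setting rather than an hbase-site key. The variable HBASE_KEYS and the functions print_hbase_conf and hbase_conf_get are hypothetical.

```shell
# Table 1 parameters assumed to be HBase configuration keys (GC_OPTS excluded).
HBASE_KEYS="hbase.rpc.protection
hbase.regionserver.metahandler.count
hbase.regionserver.thread.compaction.large
hbase.regionserver.thread.compaction.small
hbase.coprocessor.master.classes
hbase.coprocessor.region.classes
hbase.coprocessor.regionserver.classes
zookeeper.session.timeout"

# Hypothetical helper: prints "key=value" for each key, using the getter command
# passed as $1 (a function or program that takes a key and prints its value).
print_hbase_conf() {
  getter=$1
  for key in $HBASE_KEYS; do
    printf '%s=%s\n' "$key" "$("$getter" "$key")"
  done
}

# Hypothetical default getter: asks HBaseConfTool for the value the local client resolves.
hbase_conf_get() {
  hbase org.apache.hadoop.hbase.util.HBaseConfTool "$1" 2>/dev/null
}
```

For example, print_hbase_conf hbase_conf_get lists the current client-side values so they can be checked against the Save configuration entries found in the audit log. Passing the getter as an argument keeps the loop independent of how values are obtained.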

Check the network connection between HMaster and dependent components.

25. On FusionInsight Manager, choose Cluster, click the name of the desired cluster, and choose Services > HBase.
26. Click Instances. In the HMaster instance list, record the management IP address of the active HMaster instance.
27. Log in to the active HMaster node as user omm using the IP address obtained in step 26.
28. Run the ping command to check whether the network connections between the active HMaster node and the hosts where the dependent components (ZooKeeper, HDFS, and Yarn) reside are normal. Obtain the IP addresses of those hosts in the same way as the IP address of the active HMaster node.
- If yes, go to 31.
- If no, go to 29.
29. Contact the network administrator to restore the network.
30. In the alarm list, check whether this alarm is cleared.
- If yes, no further action is required.
- If no, go to 31.
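When several dependent hosts need to be checked, the ping test can be looped over all of them. This is a minimal sketch, assuming Linux iputils ping (where -c sets the number of echo requests and -W the reply timeout in seconds) and that the host IP addresses have already been collected from the Instances pages. The functions ping_host and check_hosts are hypothetical.

```shell
# Hypothetical default probe: three ICMP echo requests, 2-second reply timeout each.
ping_host() {
  ping -c 3 -W 2 "$1" >/dev/null 2>&1
}

# Hypothetical helper: probes every host with the command given as $1 and
# reports each result. Returns 0 only if all hosts respond.
check_hosts() {
  probe=$1
  shift
  failed=0
  for h in "$@"; do
    if "$probe" "$h"; then
      echo "OK          $h"
    else
      echo "UNREACHABLE $h"
      failed=1
    fi
  done
  return "$failed"
}
```

For example, check_hosts ping_host 192.0.2.11 192.0.2.12 192.0.2.13 (placeholder addresses; use the ZooKeeper, HDFS, and Yarn host IPs) prints one line per host, and any UNREACHABLE line is the information to give the network administrator.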

Collect fault information.

31. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
32. Expand the drop-down list next to the Service field. In the Services dialog box that is displayed, select the following services for the target cluster:
- ZooKeeper
- HDFS
- HBase
33. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then click Download.
34. Contact O&M personnel and provide the collected logs.
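The log collection window is simply the alarm time minus and plus 10 minutes (600 seconds). A minimal sketch of that arithmetic on Unix epoch seconds, assuming the alarm generation time has been converted to an epoch timestamp; the function name log_window is hypothetical.

```shell
# Hypothetical helper: prints "start end" in epoch seconds for an alarm raised at
# epoch second $1, using a margin of $2 seconds (default 600 s = 10 minutes).
log_window() {
  alarm_ts=$1
  margin=${2:-600}
  echo "$((alarm_ts - margin)) $((alarm_ts + margin))"
}
```

On nodes with GNU date, the bounds can be converted back to readable times for the Start Date and End Date fields, for example: set -- $(log_window "$(date -d '2024-09-23 10:15:00' +%s)"); date -d "@$1" '+%F %T'; date -d "@$2" '+%F %T'. The result depends on the node's time zone, so make sure it matches the time zone shown on FusionInsight Manager.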

Alarm Clearance

This alarm is automatically cleared after the fault is rectified.

Related Information

None
