ALM-19000 HBase Service Unavailable
Alarm Description
This alarm is generated when the HBase service is unavailable. The alarm module checks the HBase service status every 120 seconds.
This alarm is cleared when the HBase service recovers.
Alarm Attributes
Alarm ID | Alarm Severity | Alarm Type | Service Type | Auto Cleared
---|---|---|---|---
19000 | Critical | Error handling | HBase | Yes
Alarm Parameters
Type | Parameter | Description
---|---|---
Location Information | Source | Specifies the cluster for which the alarm is generated.
Location Information | ServiceName | Specifies the service for which the alarm is generated.
Location Information | RoleName | Specifies the role for which the alarm is generated.
Location Information | HostName | Specifies the host for which the alarm is generated.
Impact on the System
Operations, such as reading or writing data and creating tables, cannot be performed.
Possible Causes
- The ZooKeeper service is abnormal.
- The HDFS service is abnormal.
- The HBase service is abnormal.
- The network is abnormal.
- The service configuration value is incorrect.
Handling Procedure
Check the ZooKeeper service status.
1. On FusionInsight Manager, check in the service list whether the running status of ZooKeeper is Normal.
2. In the alarm list, check whether ALM-13000 ZooKeeper Service Unavailable exists.
3. Rectify the fault by following the steps provided in ALM-13000 ZooKeeper Service Unavailable.
4. Wait several minutes, and check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 5.
Check the HDFS service status.
5. In the alarm list, check whether ALM-14000 HDFS Service Unavailable exists.
6. Rectify the fault by following the steps provided in ALM-14000 HDFS Service Unavailable.
7. Wait several minutes, and check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 8.
8. On the FusionInsight Manager portal, choose Cluster > Name of the desired cluster > Services > HDFS. Check whether Safe Mode is ON.
9. Log in to the node where the HDFS client is installed as user root. Run cd to switch to the client installation directory, and run source bigdata_env.
   If the cluster uses the security mode, perform security authentication. Obtain the password of user hdfs from the administrator, run the kinit hdfs command, and enter the password as prompted.
10. Run the following command to manually exit the safe mode:
    hdfs dfsadmin -safemode leave
11. Wait several minutes and check whether the alarm is cleared.
    - If yes, no further action is required.
    - If no, go to 12.
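The safe mode check and exit described above can also be done entirely from the client command line. The following is a minimal sketch, not a definitive procedure. It assumes the client environment is already loaded (source bigdata_env, plus kinit in security mode) and that `hdfs dfsadmin -safemode get` prints a line containing "Safe mode is ON" or "Safe mode is OFF" for each NameNode. The helper name `safe_mode_is_on` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) if the text passed as $1,
# expected to be the output of `hdfs dfsadmin -safemode get`, reports
# safe mode ON for at least one NameNode.
safe_mode_is_on() {
    case "$1" in
        *"Safe mode is ON"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a client node (commented out so the sketch can be sourced safely):
# status=$(hdfs dfsadmin -safemode get)
# if safe_mode_is_on "$status"; then
#     hdfs dfsadmin -safemode leave
# fi
```

Checking the state first avoids issuing the leave command when HDFS is not in safe mode, in which case the cause of the alarm lies elsewhere.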
Check the HBase service status.
12. On the FusionInsight Manager portal, choose Cluster > Name of the desired cluster > Services > HBase.
13. Check whether there is one active HMaster and one standby HMaster.
14. Click Instances, select the HMaster whose status is not Active, click More, and select Restart Instance to restart the HMaster. Then check again whether there is one active HMaster and one standby HMaster.
15. Choose Cluster > Name of the desired cluster > Services > HBase > HMaster(Active) to go to the HMaster WebUI.
    By default, the admin user does not have the permissions to manage other components. If the page cannot be opened or the displayed content is incomplete when you access the native UI of a component due to insufficient permissions, you can manually create a user with the permissions to manage that component.
16. Check whether at least one RegionServer exists under Region Servers.
17. Choose Tables > System Tables, as shown in Figure 1. Check whether hbase:meta, hbase:namespace, and hbase:acl exist in the Table Name column.
18. As shown in Figure 1, click the hbase:meta, hbase:namespace, and hbase:acl hyperlinks and check whether the pages are properly displayed. If the pages are properly displayed, the tables are normal.
    - If they are, go to 19.
    - If they are not, go to 25.
    In normal mode, ACL is not enabled for HBase by default, and the hbase:acl table is generated only when ACL is manually enabled. Check this table only in that case. In other scenarios, this table does not need to be checked.
19. View the HMaster startup status.
    In Figure 2, if the RUNNING state exists in Tasks, HMaster is being started. The State column shows how long HMaster has been in the RUNNING state. In Figure 3, if the state is COMPLETE, HMaster has started.
    Check whether HMaster has been in the RUNNING state for a long time.
20. On the HMaster WebUI, check whether any hbase:meta region has been in the Region in Transition state for a long time.
    Figure 4 Region in Transition
21. If services will not be affected, log in to the FusionInsight Manager portal and choose Cluster > Name of the desired cluster > Services > HBase > More > Restart Service. Enter the administrator password and click OK.
22. Wait several minutes and check whether the alarm is cleared.
    - If yes, no further action is required.
    - If no, go to 25.
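The HMaster and RegionServer counts checked on the WebUI above can be cross-checked from the HBase shell. The sketch below is illustrative only. It assumes that the `status` command of the HBase shell prints a summary line of the form "1 active master, 1 backup masters, 3 servers, 0 dead, 2.0000 average load"; the exact wording may differ between HBase versions, so verify it on your cluster first. The helper names `count_of` and `hbase_status_ok` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the integer that precedes the label $2
# (for example "servers") in the status text $1.
count_of() {
    printf '%s\n' "$1" | grep -oE "[0-9]+ $2" | head -n 1 | cut -d' ' -f1
}

# Hypothetical helper: succeeds if the status text $1 shows exactly one
# active HMaster, at least one backup (standby) HMaster, and at least one
# RegionServer. A count that cannot be found is treated as 0.
hbase_status_ok() {
    active=$(count_of "$1" "active master")
    backup=$(count_of "$1" "backup masters")
    servers=$(count_of "$1" "servers")
    [ "${active:-0}" -eq 1 ] && [ "${backup:-0}" -ge 1 ] && [ "${servers:-0}" -ge 1 ]
}

# Usage on a client node (commented out so the sketch can be sourced safely):
# summary=$(echo "status" | hbase shell -n 2>/dev/null)
# hbase_status_ok "$summary" || echo "HBase master or RegionServer count is abnormal"
```

If the shell itself cannot connect, the summary line is missing and the check fails, which is consistent with the service being unavailable.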
Check whether the HBase configurations are correctly modified.
23. On FusionInsight Manager, choose Audit. On the Audit page, click Advanced Search, click the icon to the right of Operation Type, select Save configuration, click OK, and click Search.
24. In the search results, check the Service column for configuration changes to HBase-related services, such as ZooKeeper, HDFS, and HBase, that may affect the HBase service status. Table 1 lists some configurations that may affect the HBase service status.
Table 1 Configurations affecting the HBase service status

Parameter | Possible Impact
---|---
GC_OPTS | The memory configuration may be improper. You need to check the health status of instance processes.
hbase.rpc.protection | If the HBase service is not restarted offline after the value of this parameter is changed, the connection authentication fails and the HBase service becomes abnormal.
hbase.regionserver.metahandler.count | If there are too many regions in the cluster but this parameter is set to a small value, RIT may occur and regions cannot be brought online for a long time.
hbase.regionserver.thread.compaction.large | If this parameter is set to a large value, the node CPU usage may be too high.
hbase.regionserver.thread.compaction.small | If this parameter is set to a large value, the node CPU usage may be too high.
hbase.coprocessor.master.classes | If a custom coprocessor is used in the configuration, a logic error may cause the service to be unavailable.
hbase.coprocessor.region.classes | If a custom coprocessor is used in the configuration, a logic error may cause the service to be unavailable.
hbase.coprocessor.regionserver.classes | If a custom coprocessor is used in the configuration, a logic error may cause the service to be unavailable.
zookeeper.session.timeout | If this parameter is set to a small value, the connection between HBase and ZooKeeper times out too quickly. As a result, the HMaster instance and RegionServer may restart repeatedly.
Check the network connection between HMaster and dependent components.
25. On FusionInsight Manager, choose Cluster > Name of the desired cluster > Services > HBase.
26. Click Instance to display the HMaster instance list. Record the management IP address in the row of HMaster(Active).
27. Use the IP address obtained in 26 to log in, as user omm, to the host where the active HMaster runs.
28. Run the ping command to check whether the host that runs the active HMaster can communicate with the hosts that run the dependent components. (The dependent components include ZooKeeper, HDFS, and Yarn. Obtain the IP addresses of the hosts that run these services in the same way as for the active HMaster.)
29. Contact the administrator to restore the network.
30. In the alarm list, check whether HBase Service Unavailable is cleared.
    - If yes, no further action is required.
    - If no, go to 31.
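When there are many dependent hosts, the ping check above can be looped. This is a minimal sketch under stated assumptions: it assumes a Linux ping that accepts -c (packet count) and -W (reply timeout in seconds), and the function name `unreachable_hosts`, the variable `PROBE_CMD`, and the addresses in the usage line are hypothetical placeholders.

```shell
#!/bin/sh
# Command used to probe one host. Override PROBE_CMD to use a different
# probe or to dry-run the loop without a network.
PROBE_CMD=${PROBE_CMD:-"ping -c 3 -W 2"}

# Hypothetical helper: print every host from the argument list that does
# not answer the probe, one per line. Prints nothing if all hosts answer.
unreachable_hosts() {
    for host in "$@"; do
        # PROBE_CMD is intentionally unquoted so its options split into words.
        if ! $PROBE_CMD "$host" >/dev/null 2>&1; then
            printf '%s\n' "$host"
        fi
    done
}

# Usage on the active HMaster host, with the IP addresses of the ZooKeeper,
# HDFS, and Yarn hosts (placeholder addresses shown):
# unreachable_hosts 192.0.2.11 192.0.2.12 192.0.2.13
```

Any address the function prints is a candidate to report to the administrator when asking for the network to be restored.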
Collect fault information.
31. On FusionInsight Manager, choose O&M > Log > Download.
32. Select the following services in the required cluster from the Service drop-down list:
    - ZooKeeper
    - HDFS
    - HBase
33. Click the edit icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
34. Contact the O&M engineers and send the collected logs.
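The log collection window above (10 minutes before and after the alarm generation time) can be computed rather than worked out by hand. The following sketch assumes GNU date (for the -d option and the @epoch syntax) and an alarm time written as "YYYY-MM-DD HH:MM:SS" in the local time zone. The function name `log_window` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the start and end of the log collection
# window, one per line: 10 minutes (600 seconds) before and 10 minutes
# after the alarm generation time given as $1.
log_window() {
    alarm_epoch=$(date -d "$1" +%s) || return 1
    date -d "@$((alarm_epoch - 600))" '+%Y-%m-%d %H:%M:%S'
    date -d "@$((alarm_epoch + 600))" '+%Y-%m-%d %H:%M:%S'
}

# Usage:
# log_window "2024-05-01 12:00:00"
```

For the alarm time in the usage line, the function prints 2024-05-01 11:50:00 and 2024-05-01 12:10:00, which are the values to enter as Start Date and End Date.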
Alarm Clearance
After the fault is rectified, the system automatically clears this alarm.
Related Information
None.