ALM-19000 HBase Service Unavailable
Alarm Description
The alarm module checks the HBase service status every 120 seconds. This alarm is generated when the HBase service is unavailable.
This alarm is cleared when the HBase service recovers.
Alarm Attributes
Alarm ID | Alarm Severity | Auto Cleared
---|---|---
19000 | Critical | Yes
Alarm Parameters
Parameter | Description
---|---
Source | Specifies the cluster for which the alarm was generated.
ServiceName | Specifies the service for which the alarm was generated.
RoleName | Specifies the role for which the alarm was generated.
HostName | Specifies the host for which the alarm was generated.
Impact on the System
Operations such as data read/write and table creation cannot be performed.
Possible Causes
- ZooKeeper is abnormal.
- HDFS is abnormal.
- HBase is abnormal.
- The network connection is abnormal.
- The service configuration value is incorrect.
Handling Procedure
Check the ZooKeeper service status.
- In the service list on FusionInsight Manager, check whether Running Status of ZooKeeper is Normal.
- In the alarm list, check whether ALM-13000 ZooKeeper Service Unavailable exists.
- Rectify the fault by performing the operations provided for ALM-13000 ZooKeeper Service Unavailable.
- Wait several minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 5.
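Where command-line access to the cluster nodes is available, the ZooKeeper status check above can also be done from a shell. A minimal sketch, assuming the ZooKeeper client port is 2181, the `ruok` four-letter command is enabled on the quorum, and `nc` is installed (the host names below are placeholders):

```shell
#!/bin/bash
# Interpret a ZooKeeper four-letter-word "ruok" response.
# A healthy server replies with the literal string "imok".
zk_is_ok() {
  [ "$1" = "imok" ]
}

# Placeholder quorum hosts; replace with the actual ZooKeeper nodes.
ZK_HOSTS="zk-node1 zk-node2 zk-node3"
ZK_PORT=2181

for host in $ZK_HOSTS; do
  # -w 3 caps the wait at 3 seconds per host.
  reply=$(echo ruok | nc -w 3 "$host" "$ZK_PORT" 2>/dev/null)
  if zk_is_ok "$reply"; then
    echo "$host: OK"
  else
    echo "$host: not responding (reply: '$reply')"
  fi
done
```

If any quorum member fails to answer, handle it as described for ALM-13000 ZooKeeper Service Unavailable before continuing.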
Check the HDFS service status.
- In the alarm list, check whether ALM-14000 HDFS Service Unavailable exists.
- Rectify the fault by performing the operations provided for ALM-14000 HDFS Service Unavailable.
- Wait several minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 8.
- On FusionInsight Manager, choose Cluster, click the name of the desired cluster, choose Services > HDFS, and check whether Safe Mode of HDFS is ON.
- Log in to the HDFS client as user root. Run the cd command to go to the client installation directory and run the source bigdata_env command.
If the cluster uses the security mode, perform security authentication. Obtain the password of user hdfs from the MRS cluster administrator, run the kinit hdfs command, and enter the password as prompted.
- Run the following command to manually exit the safe mode:
hdfs dfsadmin -safemode leave
- Wait several minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 12.
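The safe-mode steps above can be consolidated into one script. A sketch, assuming the client is installed under the placeholder path `/opt/client` (the actual cluster commands are left commented, with a sample `hdfs dfsadmin -safemode get` output used for illustration):

```shell
#!/bin/bash
# Decide whether safe mode must be left manually, based on the output
# of "hdfs dfsadmin -safemode get" (e.g. "Safe mode is ON").
safemode_is_on() {
  case "$1" in
    *"Safe mode is ON"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Placeholder path; use the actual client installation directory.
CLIENT_DIR=/opt/client

# Uncomment on a cluster node:
# cd "$CLIENT_DIR" && source bigdata_env
# kinit hdfs   # security mode only; enter the hdfs user's password
# state=$(hdfs dfsadmin -safemode get)
state="Safe mode is ON"   # sample output for illustration
if safemode_is_on "$state"; then
  echo "HDFS is in safe mode; leave it manually:"
  # hdfs dfsadmin -safemode leave
else
  echo "HDFS safe mode is OFF"
fi
```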
Check the HBase service status.
- On FusionInsight Manager, choose Cluster, click the name of the desired cluster, and choose Services > HBase.
- Check whether there is one active HMaster and one standby HMaster.
- Click Instances and select the HMaster instance whose status is not Active. Click More and select Restart Instance to restart HMaster. Then check whether there is one active HMaster and one standby HMaster.
During the HMaster restart, table operations cannot be performed, and the HBase web UI is inaccessible. Data read and write operations are not affected.
- Choose Cluster, click the name of the desired cluster, choose Services > HBase, and click HMaster(Active) to access the HMaster web UI.
By default, the admin user does not have the permissions to manage other components. If the page cannot be opened or the displayed content is incomplete when you access the native UI of a component due to insufficient permissions, you can manually create a user with the permissions to manage that component.
- Check whether at least one RegionServer exists under Region Servers.
- Choose Tables > System Tables and check whether hbase:meta, hbase:namespace, and hbase:acl exist in the Table Name column, as shown in Figure 1.
- Click hbase:meta, hbase:namespace, and hbase:acl to check whether all pages can be opened. If all of them can be opened, the tables are normal.

- View the HMaster startup status.
On the Tasks page shown in Figure 2, a RUNNING value in the State column indicates that HMaster is starting, and the page shows how long HMaster has been in that state. As shown in Figure 3, a COMPLETE state indicates that HMaster has started.
Check whether HMaster has been in the RUNNING state for a long time.
- On the HMaster web UI, check whether any hbase:meta is in the Regions in Transition state for a long time.
Figure 4 Regions in Transition
- After ensuring that services are not affected, log in to FusionInsight Manager, choose Cluster, click the name of the desired cluster, choose Services > HBase, click More, and select Restart Service. In the dialog box that is displayed, enter the password, and click OK.
During HBase service restart, the service is unavailable. For example, data cannot be read or written, table operations cannot be performed, and the HBase web UI is inaccessible.
- Wait several minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 25.
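The RegionServer check above can also be done from the HBase shell: the `status` command prints a summary line such as `1 active master, 1 backup masters, 3 servers, 0 dead, 2.0000 average load`. A sketch that parses that line (the cluster command is left commented; a sample line is used for illustration):

```shell
#!/bin/bash
# Extract the live RegionServer count from the summary line printed by
# the HBase shell "status" command.
live_servers() {
  echo "$1" | sed -n 's/.*[, ]\([0-9][0-9]*\) servers.*/\1/p'
}

# On a cluster node with the client environment sourced, obtain it with:
#   status_line=$(echo status | hbase shell 2>/dev/null | grep 'active master')
status_line="1 active master, 1 backup masters, 3 servers, 0 dead, 2.0000 average load"

count=$(live_servers "$status_line")
if [ "${count:-0}" -ge 1 ]; then
  echo "RegionServers online: $count"
else
  echo "No live RegionServers; check the RegionServer instances"
fi
```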
Check whether the HBase configurations are correctly modified.
- On FusionInsight Manager, choose Audit. On the Audit page, click Advanced Search, click the icon on the right of Operation Type, select Save configuration, click OK, and then click Search.
- In the search result, check whether the historical configurations of HBase-related services in the Service column, such as ZooKeeper, HDFS, and HBase, may affect the HBase service status. Table 1 lists some configurations that may affect the HBase service status.
Table 1 Configurations affecting the HBase service status

Parameter | Possible Impact
---|---
GC_OPTS | The memory configuration may be improper. You need to check the health status of instance processes.
hbase.rpc.protection | If the HBase service is not restarted offline after the value of this parameter is changed, the connection authentication fails and the HBase service becomes abnormal.
hbase.regionserver.metahandler.count | If there are too many regions in the cluster but this parameter is set to a small value, RIT may occur and regions cannot be brought online for a long time.
hbase.regionserver.thread.compaction.large | If this parameter is set to a large value, the node CPU usage may be too high.
hbase.regionserver.thread.compaction.small | If this parameter is set to a large value, the node CPU usage may be too high.
hbase.coprocessor.master.classes | If a custom coprocessor is used in the configuration, a logic error may cause the service to be unavailable.
hbase.coprocessor.region.classes | If a custom coprocessor is used in the configuration, a logic error may cause the service to be unavailable.
hbase.coprocessor.regionserver.classes | If a custom coprocessor is used in the configuration, a logic error may cause the service to be unavailable.
zookeeper.session.timeout | If this parameter is set to a small value, the connection between HBase and ZooKeeper times out too quickly. As a result, the HMaster instance and RegionServer may restart repeatedly.
Check the network connection between HMaster and dependent components.
- On FusionInsight Manager, choose Cluster, click the name of the desired cluster, and choose Services > HBase.
- Click Instances. In the HMaster instance list, record the management IP address of the active HMaster instance.
- Log in to the active HMaster node as user omm through the IP address obtained in 26.
- Run the ping command to check whether the network connection between the active HMaster node and the host where the dependent components reside is normal. (The dependent components include ZooKeeper, HDFS, and Yarn. The method of obtaining the IP address of the host where the dependent components reside is the same as that of obtaining the IP address of the active HMaster node.)
- Contact the network administrator to restore the network.
- In the alarm list, check whether this alarm is cleared.
- If yes, no further action is required.
- If no, go to 31.
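The connectivity check above can be scripted from the active HMaster node. A sketch, assuming the placeholder host names below are replaced with the management IP addresses of the ZooKeeper, HDFS, and Yarn nodes recorded from FusionInsight Manager:

```shell
#!/bin/bash
# Classify a ping exit status: 0 means the host answered,
# anything else is treated as unreachable.
reachable() {
  [ "$1" -eq 0 ]
}

# Placeholder hosts; substitute the nodes of the dependent components
# (ZooKeeper, HDFS NameNode, Yarn ResourceManager).
DEPENDENT_HOSTS="zk-node1 nn-node1 rm-node1"

for host in $DEPENDENT_HOSTS; do
  ping -c 2 -W 3 "$host" >/dev/null 2>&1
  if reachable $?; then
    echo "$host: reachable"
  else
    echo "$host: UNREACHABLE; involve the network administrator"
  fi
done
```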
Collect fault information.
- On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
- Expand the drop-down list next to the Service field. In the Services dialog box that is displayed, select the following services for the target cluster:
- ZooKeeper
- HDFS
- HBase
- In the upper right corner, set Start Date and End Date for log collection to 10 minutes before and after the alarm generation time, respectively. Then, click Download.
- Contact O&M personnel and provide the collected logs.
Alarm Clearance
This alarm is automatically cleared after the fault is rectified.
Related Information
None