ALM-19020 Number of HBase WAL Files to Be Synchronized Exceeds the Threshold
Alarm Description
Every 30 seconds, the system checks the number of WAL files waiting to be synchronized on each RegionServer of every HBase service instance. This metric can be viewed on the RegionServer role monitoring page. The alarm is generated when the number of WAL files to be synchronized on a RegionServer exceeds the threshold (by default, more than 128 for 20 consecutive checks). To change the threshold, choose O&M > Alarm > Threshold Configuration > Name of the desired cluster > HBase. The alarm is cleared when the number of WAL files to be synchronized is less than or equal to the threshold.
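The trigger condition described above can be modeled with a short script. This is only an illustrative sketch, not FusionInsight Manager's actual implementation: the function name `alarm_states` and the variables `THRESHOLD` and `SMOOTHING` are hypothetical, and the input is assumed to be one metric sample per line, taken at each 30-second check.

```shell
#!/bin/sh
# Illustrative model of the trigger condition (assumption: not the Manager's real code).
# Input: one sample per line on stdin, the metric value at each 30-second check.
# Output: the modeled alarm state after each sample.
THRESHOLD="${THRESHOLD:-128}"   # default threshold from the alarm description
SMOOTHING="${SMOOTHING:-20}"    # consecutive over-threshold checks required

alarm_states() {
  awk -v t="$THRESHOLD" -v n="$SMOOTHING" '
    {
      # Count consecutive samples above the threshold; any sample at or
      # below the threshold resets the count and clears the alarm.
      if ($1 + 0 > t + 0) { over++ } else { over = 0; active = 0 }
      if (over >= n) { active = 1 }
      print (active ? "ALARM" : "OK")
    }'
}
```

With the default values, the modeled alarm can only appear after 20 consecutive checks above 128, which is roughly 10 minutes of sustained backlog. A value of exactly 128 does not count as exceeding the threshold.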
Alarm Attributes
Alarm ID | Alarm Severity | Auto Cleared |
---|---|---|
19020 | Major | Yes |
Alarm Parameters
Parameter | Description |
---|---|
Source | Specifies the cluster for which the alarm was generated. |
ServiceName | Specifies the service for which the alarm was generated. |
RoleName | Specifies the role for which the alarm was generated. |
HostName | Specifies the host for which the alarm was generated. |
Trigger Condition | Specifies the threshold for triggering the alarm. |
Impact on the System
WAL files accumulate on the active cluster because they are not synchronized to the standby cluster in time. Data becomes inconsistent between the active and standby clusters, so the latest data cannot be read from the standby cluster during an active/standby switchover or during HBase dual-read. If the fault persists, the storage space of the active cluster and of the ZooKeeper nodes will be used up, and the services of the active cluster will be interrupted.
Possible Causes
- The network is abnormal.
- The RegionServer region distribution is unbalanced.
- The HBase service scale of the standby cluster is too small.
Handling Procedure
View alarm location information.
1. Log in to FusionInsight Manager and choose O&M. In the navigation pane on the left, choose Alarm > Alarms. On the page that is displayed, locate the row containing the alarm whose Alarm ID is 19020, and view the service instance and host name in Location.
Check the network connection between RegionServers on active and standby clusters.
2. Run the ping command to check whether the network connection between the faulty RegionServer node and the host where the RegionServer of the standby cluster resides is normal. If the connection is normal, go to 5.
3. If the connection is abnormal, contact the network administrator to restore the network.
4. After the network recovers, check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 5.
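When the standby cluster has several RegionServer hosts, the ping check can be repeated for each of them from the faulty RegionServer node. The following is a minimal sketch under stated assumptions: `check_peers` and `PING_CMD` are hypothetical names, and the default options (`-c` for the packet count, `-W` for the reply timeout) are those of the standard Linux ping.

```shell
#!/bin/sh
# PING_CMD can be overridden, for example for a dry run; the default sends
# three packets and waits up to two seconds for each reply.
PING_CMD="${PING_CMD:-ping -c 3 -W 2}"

check_peers() {
  # Arguments: host names or IP addresses of the standby cluster's RegionServer hosts.
  failed=0
  for host in "$@"; do
    if $PING_CMD "$host" >/dev/null 2>&1; then
      echo "OK   $host"
    else
      echo "FAIL $host"
      failed=1
    fi
  done
  # Non-zero exit status if at least one host did not answer.
  return "$failed"
}
```

A call such as `check_peers standby-rs-1 standby-rs-2` (placeholder host names) prints one line per host. Any FAIL line points to a connection that the network administrator should examine.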
Check the RegionServer region distribution in the active cluster.
5. On FusionInsight Manager, choose Cluster > Name of the desired cluster > Services > HBase. Click HMaster(Active) to go to the web UI of the HBase instance and check whether regions are evenly distributed across the RegionServers.
6. Log in to the faulty RegionServer node as user omm.
7. Run the following commands to go to the client installation directory and set the environment variables:
cd Client installation directory
source bigdata_env
If the cluster is in security mode, perform security authentication: run the kinit hbase command and enter the password as prompted (obtain the password from the MRS cluster administrator).
8. Run the following command to start HBase Shell:
hbase shell
9. Run the following commands in HBase Shell to enable the load balancing function and check whether it is enabled:
balance_switch true
balancer_enabled
10. Run the balancer command to manually trigger load balancing.
You are advised to enable and manually trigger the load balancing function during off-peak hours.
11. Check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 12.
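The HBase Shell commands used above can also be run non-interactively, which is convenient when the operation is scheduled for off-peak hours. The sketch below assumes that the client environment has already been loaded with source bigdata_env and that kinit has been run in security mode. `balancer_commands`, `run_balancer`, and `HBASE_CMD` are hypothetical names; `-n` is the non-interactive option of HBase Shell.

```shell
#!/bin/sh
# HBASE_CMD can be overridden for a dry run; by default the hbase launcher
# found on PATH after "source bigdata_env" is used.
HBASE_CMD="${HBASE_CMD:-hbase}"

balancer_commands() {
  # The same commands as in the procedure, in the same order.
  printf '%s\n' "balance_switch true" "balancer_enabled" "balancer"
}

run_balancer() {
  # -n makes HBase Shell read the commands from stdin and then exit.
  balancer_commands | $HBASE_CMD shell -n
}
```

Calling `run_balancer` enables the balancer, prints whether it is enabled, and triggers one balancing run. The shell output should still be reviewed manually, because this sketch does not interpret it.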
Check the HBase service scale of the standby cluster.
12. Expand the HBase cluster: add a node and add a RegionServer instance on the node. Then, perform 6 to 10 to enable the load balancing function and manually trigger it.
13. On FusionInsight Manager, choose Cluster > Name of the desired cluster > Services > HBase. Click HMaster(Active) to go to the web UI of the HBase instance, refresh the page, and check whether regions are evenly distributed.
14. Check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 15.
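Whether regions are evenly distributed is easier to judge from a few numbers than from a long table. The following sketch summarizes per-RegionServer region counts. The input format (one line per RegionServer: host name, then region count) is an assumption chosen for illustration, for example values copied by hand from the HMaster web UI. `region_skew` is a hypothetical helper, not an HBase tool.

```shell
#!/bin/sh
region_skew() {
  # Input on stdin: "<hostname> <region_count>" per line (assumed format).
  awk '
    NF >= 2 {
      v = $2 + 0                      # force numeric comparison
      if (n == 0 || v > max) max = v
      if (n == 0 || v < min) min = v
      sum += v; n++
    }
    END {
      if (n == 0) { print "no data"; exit 1 }
      printf "servers=%d min=%d max=%d avg=%.1f\n", n, min, max, sum / n
    }'
}
```

A large gap between min and max relative to the average suggests that the balancer has not yet evened out the regions, or that the newly added RegionServer has not received any regions.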
Collect the fault information.
15. On FusionInsight Manager of the standby cluster, choose O&M. In the navigation pane on the left, choose Log > Download.
16. Expand the Service drop-down list, and select HBase for the target cluster.
17. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
18. Contact O&M personnel and provide the collected logs.
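The Start Date and End Date for log collection can be derived from the alarm generation time. The sketch below assumes GNU date and treats the input time as UTC. FusionInsight Manager may display times in a different time zone, so adjust accordingly. `log_window` is a hypothetical helper name.

```shell
#!/bin/sh
log_window() {
  # $1: alarm generation time as "YYYY-MM-DD HH:MM:SS" (interpreted as UTC here).
  alarm_epoch=$(date -u -d "$1" +%s) || return 1
  start=$((alarm_epoch - 600))   # 10 minutes before the alarm
  end=$((alarm_epoch + 600))     # 10 minutes after the alarm
  echo "Start Date: $(date -u -d "@$start" '+%Y-%m-%d %H:%M:%S')"
  echo "End Date:   $(date -u -d "@$end" '+%Y-%m-%d %H:%M:%S')"
}
```

For example, `log_window "2024-01-01 12:00:00"` reports a window from 11:50:00 to 12:10:00 on the same day.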
Alarm Clearance
This alarm is automatically cleared after the fault is rectified.
Related Information
None