ALM-19020 Number of HBase WAL Files to Be Synchronized Exceeds the Threshold

Alarm Description

The system checks the number of WAL files to be synchronized on each RegionServer of every HBase service instance at 30-second intervals. This metric can be viewed on the RegionServer role monitoring page. The alarm is generated when the number of WAL files to be synchronized on a RegionServer exceeds the threshold (by default, more than 128 files for 20 consecutive checks). To change the threshold, choose O&M > Alarm > Threshold Configuration > Name of the desired cluster > HBase. The alarm is cleared when the number of WAL files to be synchronized is less than or equal to the threshold.
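
The trigger and clear rule described above can be sketched as follows. This is a minimal illustration only, not the Manager implementation: the class and function names are hypothetical, and the assumption that a single sample at or below the threshold resets the streak and clears the alarm is inferred from the description.

```python
from dataclasses import dataclass

SAMPLE_INTERVAL_S = 30     # check period stated in the alarm description
DEFAULT_THRESHOLD = 128    # default threshold stated in the alarm description
DEFAULT_CONSECUTIVE = 20   # default number of consecutive checks over the threshold


@dataclass
class WalAlarmState:
    """Hypothetical model of the alarm state for one RegionServer."""
    threshold: int = DEFAULT_THRESHOLD
    consecutive_required: int = DEFAULT_CONSECUTIVE
    breaches: int = 0
    raised: bool = False

    def observe(self, wal_files_to_sync: int) -> bool:
        """Feed one periodic sample and return whether the alarm is active."""
        if wal_files_to_sync > self.threshold:
            self.breaches += 1
            if self.breaches >= self.consecutive_required:
                self.raised = True
        else:
            # Assumption: a sample at or below the threshold resets the
            # streak and clears the alarm.
            self.breaches = 0
            self.raised = False
        return self.raised


def approx_seconds_to_alarm(consecutive: int = DEFAULT_CONSECUTIVE,
                            interval_s: int = SAMPLE_INTERVAL_S) -> int:
    """Approximate time a backlog must persist before the alarm is raised."""
    return consecutive * interval_s
```

With the default values, the backlog has to stay above 128 files for roughly 20 × 30 s = 600 s (about 10 minutes) before the alarm is reported.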

Alarm Attributes

Alarm ID	Alarm Severity	Auto Cleared
19020	Major	Yes

Alarm Parameters

Parameter	Description
Source	Specifies the cluster for which the alarm was generated.
ServiceName	Specifies the service for which the alarm was generated.
RoleName	Specifies the role for which the alarm was generated.
HostName	Specifies the host for which the alarm was generated.
Trigger Condition	Specifies the threshold for triggering the alarm.

Impact on the System

A large number of WAL files accumulate. Data becomes inconsistent between the active and standby clusters, and the latest data cannot be read from the standby cluster during an active/standby switchover or during HBase dual-read. If the fault persists, the storage space of the active cluster and the ZooKeeper nodes will be used up, and the services of the active cluster will be interrupted.

Possible Causes

The network is abnormal.
The RegionServer region distribution is unbalanced.
The HBase service scale of the standby cluster is too small.

Handling Procedure

View alarm location information.

1. Log in to FusionInsight Manager and choose O&M. In the navigation pane on the left, choose Alarm > Alarms. On the page that is displayed, locate the row containing the alarm whose Alarm ID is 19020, and view the service instance and host name in Location.

Check the network connection between RegionServers on the active and standby clusters.

2. Run the ping command to check whether the network connection between the faulty RegionServer node and the host where the RegionServer of the standby cluster resides is normal.
   - If yes, go to 5.
   - If no, go to 3.
3. Contact the network administrator to restore the network.
4. After the network recovers, check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 5.
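
When the standby cluster has several RegionServer hosts, the ping check in the steps above can be repeated for each of them. The following is a rough sketch, assuming a Linux ping that accepts the -c option; the function names and the host names in the usage comment are hypothetical.

```python
import subprocess
from typing import Callable, Iterable, List


def ping_host(host: str, count: int = 3, timeout_s: int = 10) -> bool:
    """Return True if the system ping command exits successfully for the host."""
    try:
        result = subprocess.run(
            ["ping", "-c", str(count), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Treat a hung or missing ping command as "not reachable".
        return False
    return result.returncode == 0


def unreachable_hosts(hosts: Iterable[str],
                      probe: Callable[[str], bool] = ping_host) -> List[str]:
    """Return the hosts for which the probe failed, in input order."""
    return [h for h in hosts if not probe(h)]


# Usage (hypothetical host names), run on the faulty RegionServer node:
# print(unreachable_hosts(["standby-rs-1", "standby-rs-2"]))
```

An empty result corresponds to the "yes" branch of the ping check. A successful ping only confirms basic reachability and does not prove that the ports used for replication are open.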

Check the RegionServer region distribution in the active cluster.

5. On FusionInsight Manager, choose Cluster > Name of the desired cluster > Services > HBase. Click HMaster(Active) to go to the web UI of the HBase instance and check whether regions are evenly distributed across the RegionServers.
6. Log in to the faulty RegionServer node as user omm.
7. Run the following commands to go to the client installation directory and set the environment variables:

cd Client installation directory

source bigdata_env

If the cluster is in security mode, perform security authentication: run the kinit hbase command and enter the password as prompted (obtain the password from the MRS cluster administrator).
8. Run the following commands to check whether the load balancing function is enabled:

hbase shell

balancer_enabled
   - If yes, go to 10.
   - If no, go to 9.
9. Run the following commands in HBase Shell to enable the load balancing function and check whether the function is enabled:

balance_switch true

balancer_enabled
10. Run the balancer command to manually trigger the load balancing function.

You are advised to enable and manually trigger the load balancing function during off-peak hours.
11. Check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 12.
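
The procedure does not define "evenly distributed" numerically. As a rough aid when reading the region counts on the HMaster web UI, the sketch below compares each RegionServer with the mean. The function names, the sample server names, and the 20% tolerance are illustrative assumptions, not HBase or Manager settings.

```python
from statistics import mean
from typing import List, Mapping


def region_skew(regions_per_server: Mapping[str, int]) -> float:
    """Ratio of the largest region count to the mean (1.0 means perfectly even)."""
    counts = list(regions_per_server.values())
    if not counts or sum(counts) == 0:
        return 1.0
    return max(counts) / mean(counts)


def overloaded_servers(regions_per_server: Mapping[str, int],
                       tolerance: float = 0.2) -> List[str]:
    """Servers whose region count exceeds the mean by more than the tolerance."""
    counts = list(regions_per_server.values())
    if not counts:
        return []
    limit = mean(counts) * (1 + tolerance)
    return sorted(s for s, n in regions_per_server.items() if n > limit)


# Usage with counts copied by hand from the web UI (hypothetical values):
# overloaded_servers({"rs1": 100, "rs2": 20, "rs3": 30})  # -> ["rs1"]
```

A server that stays in the overloaded list after the balancer has run is a candidate for closer inspection.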

Check the HBase service scale of the standby cluster.

12. Expand the HBase cluster, add a node, and add a RegionServer instance on the node. Then, perform steps 6 to 10 to enable the load balancing function and manually trigger it.
13. On FusionInsight Manager, choose Cluster > Name of the desired cluster > Services > HBase. Click HMaster(Active) to go to the web UI of the HBase instance, refresh the page, and check whether regions are evenly distributed.
   - If yes, go to 14.
   - If no, go to 15.
14. Check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 15.

Collect the fault information.

15. On FusionInsight Manager of the standby cluster, choose O&M. In the navigation pane on the left, choose Log > Download.
16. Expand the Service drop-down list, and select HBase for the target cluster.
17. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
18. Contact O&M personnel and provide the collected logs.
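
The Start Date and End Date for log collection follow directly from the alarm generation time. A small sketch of the arithmetic, where the function name and the sample timestamp are illustrative:

```python
from datetime import datetime, timedelta
from typing import Tuple


def log_window(alarm_time: datetime,
               margin_minutes: int = 10) -> Tuple[datetime, datetime]:
    """Return (start, end) covering the margin before and after the alarm time."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


# Example: an alarm generated at 00:05 needs logs from 23:55 on the previous day.
start, end = log_window(datetime(2024, 1, 1, 0, 5))
print(start, end)  # 2023-12-31 23:55:00 2024-01-01 00:15:00
```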