Updated on 2025-09-16 GMT+08:00

ALM-19033 Number of Tasks in the RegionServer RPC Read Queue Exceeds the Threshold

Alarm Description

The system checks the number of tasks waiting in the RPC read queue of each HBase RegionServer instance every 30 seconds. This alarm is generated when the number of waiting tasks exceeds the threshold for 10 consecutive checks.

This alarm is cleared when the number of waiting tasks is less than or equal to the threshold.

This alarm applies only to MRS 3.3.1 or later.
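
The trigger and clear rules described above can be sketched as a small state machine. This is only an illustration of the documented behavior, not the monitoring system's implementation: the class and method names are hypothetical, and the threshold of 1600 is the default Major threshold.

```python
class ReadQueueAlarm:
    """Illustrative model: raise after N consecutive samples above the threshold."""

    def __init__(self, threshold, required_consecutive=10):
        self.threshold = threshold
        self.required_consecutive = required_consecutive
        self.consecutive_over = 0
        self.active = False

    def observe(self, waiting_tasks):
        """Feed one 30-second sample; return True while the alarm is active."""
        if waiting_tasks > self.threshold:
            self.consecutive_over += 1
            if self.consecutive_over >= self.required_consecutive:
                self.active = True
        else:
            # A sample at or below the threshold resets the count and clears the alarm.
            self.consecutive_over = 0
            self.active = False
        return self.active


alarm = ReadQueueAlarm(threshold=1600)
# Nine samples above the threshold, a tenth above it, then one equal to it.
states = [alarm.observe(n) for n in [1700] * 9 + [1650, 1600]]
```

In this run the alarm becomes active only on the tenth consecutive sample above the threshold and clears on the next sample, because 1600 is not greater than the threshold.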

Alarm Attributes

Alarm ID	Alarm Severity	Auto Cleared
19033	Critical (default threshold: 2000); Major (default threshold: 1600)	Yes
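
The two default thresholds in the table can be read as a mapping from queue length to severity. The sketch below assumes that a value above both thresholds is reported at the higher severity; the function and constant names are hypothetical, and the thresholds may have been changed from their defaults in a given cluster.

```python
CRITICAL_THRESHOLD = 2000  # default value from the table
MAJOR_THRESHOLD = 1600     # default value from the table


def severity_for(waiting_tasks):
    """Return the severity implied by a queue length, or None if no threshold is exceeded."""
    if waiting_tasks > CRITICAL_THRESHOLD:
        return "Critical"
    if waiting_tasks > MAJOR_THRESHOLD:
        return "Major"
    return None
```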

Alarm Parameters

Type	Parameter	Description
Location Information	Source	Specifies the cluster for which the alarm was generated.
Location Information	ServiceName	Specifies the service for which the alarm was generated.
Location Information	RoleName	Specifies the role for which the alarm was generated.
Location Information	HostName	Specifies the host for which the alarm was generated.
Additional Information	Threshold	Specifies the threshold for generating the alarm.

Impact on the System

Read requests pile up in the queue, and their response time increases. For latency-sensitive services, a large number of read requests may time out.

Possible Causes

The RegionServer heap memory configuration is improper.
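The RegionServer read cache configuration is improper (for example, hbase.bucketcache.size is too small, or BLOCKCACHE is disabled for a heavily read table).
Regions are unevenly distributed across RegionServers, causing read hotspots.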
A slow disk fault occurred.

Handling Procedure

1. Log in to FusionInsight Manager and choose O&M. In the navigation pane on the left, choose Alarm > Alarms. On the page that is displayed, locate the row containing the alarm whose Alarm ID is 19033, and view the service instance and hostname in Location.

Check the heap memory configuration.

2. In the alarm list on FusionInsight Manager, check whether the "Heap Memory Usage of the HBase Process Exceeds the Threshold" alarm is generated for the service instance identified in Step 1.
- If yes, go to Step 3.
- If no, go to Step 5.
3. Rectify the fault by following the handling procedure of "ALM-19008 Heap Memory Usage of the HBase Process Exceeds the Threshold".
4. Wait several minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 5.
5. On FusionInsight Manager, choose Cluster > Services > HBase > Chart, select GC from the chart category, and check whether the GC times and GC monitoring period are normal.
- If yes, go to Step 6.
- If no, go to Step 9.

6. Click Configurations, search for GC_OPTS, and increase the value of Xmx of the RegionServer within the allowed memory range. Set the value to a number less than or equal to 31 GB. Click Save.

Note: To determine the allowed memory range, choose Cluster > Services > HBase > Instances on FusionInsight Manager and obtain the hostnames of all nodes where RegionServer resides. Go back to the FusionInsight Manager homepage and choose Hosts. In the host list, view the Memory(GB) values of all nodes where RegionServer resides and check the remaining memory of each host. Increase the value of Xmx based on the minimum remaining memory, and ensure that the memory usage of each node does not exceed 80% after the adjustment.
7. Click Dashboard and click More > Restart Service to restart the HBase service.

Note: During the restart, the HBase service is unavailable. For example, data cannot be read or written, table operations cannot be performed, and the HBase web UI is inaccessible.
8. Wait several minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 9.
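
The Xmx sizing rule above (stay at or below 31 GB, size against the node with the least remaining memory, and keep every node at or below 80% usage) can be worked through as follows. This is a rough sketch under one simplifying assumption: raising Xmx by x GB raises the used memory of each RegionServer node by x GB. The function name and the host figures are hypothetical, and values are rounded down to whole gigabytes.

```python
import math

MAX_XMX_GB = 31          # upper bound stated above
MAX_USED_FRACTION = 0.8  # per-node memory usage ceiling stated above


def max_safe_xmx_gb(current_xmx_gb, hosts):
    """hosts: list of (total_memory_gb, used_memory_gb), one entry per RegionServer node."""
    # Headroom of a node: memory that can still be consumed before it reaches 80% usage.
    # The node with the smallest headroom limits the increase for all RegionServers.
    headroom = math.floor(
        min(total * MAX_USED_FRACTION - used for total, used in hosts)
    )
    if headroom <= 0:
        return current_xmx_gb  # the tightest node has no room to grow
    return min(current_xmx_gb + headroom, MAX_XMX_GB)


# Two nodes: 128 GB with 64 GB used, and 64 GB with 48 GB used.
suggested = max_safe_xmx_gb(16, [(128, 64), (64, 48)])
```

Here the second node allows only 3 GB of growth (51.2 GB ceiling minus 48 GB used), so the suggested Xmx is 19 GB even though the first node has far more free memory.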

Check the RegionServer configuration.

9. On FusionInsight Manager, choose Cluster > Services > HBase, click Configurations > All Configurations, and check whether hbase.bucketcache.size is properly set. A larger value indicates a larger read cache and higher read performance. Increase the value based on the remaining memory of the node and click Save. Click Dashboard and click More > Restart Service to restart the HBase service.
10. Wait several minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 11.

11. On the HBase dashboard, click the hyperlink on the right of HMaster Web UI. On the User Tables tab in the Tables area, click the name of the table that receives a large number of user read requests. In the Table Schema area on the Table tab, check whether the value of BLOCKCACHE is false.
- If yes, go to Step 12.
- If no, go to Step 14.
12. Log in to the node where the HBase client is installed as user omm. Run the following commands to change the value of BLOCKCACHE to true for the column family of the table identified in Step 11:

cd Client installation directory

source bigdata_env

kinit A user in the supergroup user group or a user with the Global Admin permission (If Kerberos authentication is disabled for the cluster, skip this operation.)

hbase shell

alter 'Table name', {NAME => 'Column family name', BLOCKCACHE => true}

Run the following command to check whether the value of BLOCKCACHE of the column family is changed to true:

describe 'Table name'
13. Wait several minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 14.
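
When several column families of the same table need BLOCKCACHE enabled, it can help to generate the HBase shell lines rather than type them one by one. The helper below is a hypothetical convenience that only builds the command text in the same form as the alter and describe commands above; it does not run anything. The table name orders and the column families info and detail are made-up examples, and the quote check is a conservative assumption rather than the HBase naming rules.

```python
def blockcache_commands(table, column_families):
    """Return HBase shell lines that enable BLOCKCACHE and then print the table schema."""
    for name in [table, *column_families]:
        # Conservative check (an assumption): refuse characters that would break the quoting.
        if "'" in name or "\\" in name:
            raise ValueError(f"unsupported character in name: {name!r}")
    lines = [
        f"alter '{table}', {{NAME => '{cf}', BLOCKCACHE => true}}"
        for cf in column_families
    ]
    # Finish with describe so the new BLOCKCACHE value can be verified.
    lines.append(f"describe '{table}'")
    return lines


commands = blockcache_commands("orders", ["info", "detail"])
print("\n".join(commands))
```

The printed lines can be pasted into an hbase shell session opened as described above.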

Check whether regions of RegionServers are evenly distributed.

14. On FusionInsight Manager, choose Cluster > Services > HBase and click HMaster(Active). On the HBase web UI, go to the Base Stats tab in the Region Servers area and check in the Num.Regions column whether regions are evenly distributed across RegionServers.
- If yes, go to Step 20.
- If no, go to Step 15.
15. Log in to the faulty RegionServer node as user omm.
16. Run the following commands to go to the client installation directory, set the environment variables, and authenticate the user:

cd Client installation directory

source bigdata_env

kinit A user in the supergroup user group or a user with the Global Admin permission (If Kerberos authentication is disabled for the cluster, skip this operation.)
17. Run the following commands to check whether load balancing is enabled:

hbase shell

balancer_enabled
If the command output is true, load balancing is enabled.
- If yes, go to Step 20.
- If no, go to Step 18.
18. Run the following commands to enable load balancing and check whether it is enabled:

balance_switch true

balancer_enabled

Then run the balancer command to manually trigger load balancing.

Note: You are advised to enable and trigger load balancing during off-peak hours.

19. Wait several minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 20.
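
The procedure does not define when regions count as evenly distributed. One possible way to make the Num.Regions check repeatable is to flag any RegionServer whose region count is well above the average. The 20% tolerance, the function name, and the server names below are all assumptions for illustration, not values from the product.

```python
def overloaded_servers(region_counts, tolerance=0.2):
    """Return RegionServers holding more than (1 + tolerance) times the mean region count."""
    mean = sum(region_counts.values()) / len(region_counts)
    limit = mean * (1 + tolerance)
    return sorted(
        server for server, count in region_counts.items() if count > limit
    )


# Num.Regions values copied by hand from the Base Stats tab (made-up numbers).
counts = {"rs-1": 100, "rs-2": 104, "rs-3": 240, "rs-4": 96}
hot = overloaded_servers(counts)
```

With these numbers the mean is 135 regions, the limit is 162, and only rs-3 is flagged. An empty result suggests that the distribution is reasonably even under this tolerance.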

Check for a slow disk fault.

20. Check whether the "Slow Disk Fault" or "Disk Unavailable" alarm is generated on the node identified in Step 1.
- If yes, go to Step 21.
- If no, go to Step 23.

21. Rectify the fault by following the handling procedure of "ALM-12033 Slow Disk Fault" or "ALM-12063 Disk Unavailable".
22. Wait several minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 23.

Collect fault information.

23. On FusionInsight Manager, choose Cluster > Services > HBase > Chart, select IO from the Chart Category area, and view the values of Maximum Pread Latency-All Instances and Maximum Read Latency-All Instances. Normal values do not exceed 100 ms.
24. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
25. Expand the Service drop-down list, and select HBase for the target cluster.
26. Click the edit icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
27. Contact O&M engineers and provide the collected logs.
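
The 100 ms latency bound and the 10-minute log collection margin described above are easy to get wrong by hand, so the arithmetic is sketched here. The function names and the sample alarm time are hypothetical; the values for Start Date and End Date still have to be entered on FusionInsight Manager.

```python
from datetime import datetime, timedelta

NORMAL_LATENCY_MS = 100  # upper bound for normal values stated above


def latency_is_normal(max_pread_ms, max_read_ms):
    """True if both maximum latencies are within the normal range."""
    return max(max_pread_ms, max_read_ms) <= NORMAL_LATENCY_MS


def log_window(alarm_time, margin_minutes=10):
    """Return (start, end) for log collection around the alarm generation time."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


# Example: an alarm generated at 14:05 on 2025-09-16 (made-up time).
start, end = log_window(datetime(2025, 9, 16, 14, 5))
```

For this example the collection window runs from 13:55 to 14:15 on the same day.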

Alarm Clearance

This alarm is automatically cleared after the fault is rectified.

Related Information

None.

