ALM-38013 Produce Request Latency in the Request Queue Exceeds the Threshold
Alarm Description
The system checks the latency of Produce requests in the request queue on each Broker instance every 30 seconds. This alarm is generated when the latency on a Broker instance has exceeded the threshold for 10 consecutive checks.
This alarm is cleared when the latency of Produce requests in the request queue is less than or equal to the threshold.
This alarm applies only to MRS 3.5.0 or later.
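The check behaves like a simple consecutive-exceedance counter over 30-second samples. Below is a minimal sketch of that logic, assuming one latency sample per check; the function and variable names are illustrative and do not reflect the actual Manager implementation.

```python
# Minimal sketch of the alarm's consecutive-exceedance logic (illustrative only).
# Assumes one latency sample (in ms) arrives every 30 seconds per Broker instance.

MAJOR_THRESHOLD_MS = 30000      # default Major threshold
CRITICAL_THRESHOLD_MS = 60000   # default Critical threshold
CONSECUTIVE_CHECKS = 10         # checks needed before the alarm is raised

def evaluate(samples_ms):
    """Yield (check_index, event) pairs as latency samples stream in."""
    exceed_count = 0
    alarm_active = False
    for i, latency in enumerate(samples_ms):
        if latency > MAJOR_THRESHOLD_MS:
            exceed_count += 1
        else:
            exceed_count = 0
            if alarm_active:            # latency back at or below the threshold
                alarm_active = False
                yield (i, "CLEARED")
        if exceed_count >= CONSECUTIVE_CHECKS and not alarm_active:
            alarm_active = True
            severity = "CRITICAL" if latency > CRITICAL_THRESHOLD_MS else "MAJOR"
            yield (i, severity)

# Example: nine high samples do not raise the alarm, the tenth does.
events = list(evaluate([35000] * 9 + [65000] + [10000]))
print(events)  # [(9, 'CRITICAL'), (10, 'CLEARED')]
```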
Alarm Attributes
| Alarm ID | Alarm Severity | Auto Cleared |
| --- | --- | --- |
| 38013 | Critical (default threshold: 60000 ms); Major (default threshold: 30000 ms) | Yes |
Alarm Parameters
| Type | Parameter | Description |
| --- | --- | --- |
| Location Information | Source | Specifies the cluster for which the alarm was generated. |
| | ServiceName | Specifies the service for which the alarm was generated. |
| | RoleName | Specifies the role for which the alarm was generated. |
| | HostName | Specifies the host for which the alarm was generated. |
Impact on the System
When the latency of Produce requests in the Broker instance's request queue exceeds the threshold, the request queue becomes congested and the response time of write requests increases. For latency-sensitive services, a large number of write requests may time out.
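How quickly congestion in the broker's request queue turns into visible write failures depends on the producer's own timeout settings. Below is a hedged sketch using the confluent-kafka Python client; the broker address, topic, and timeout values are placeholders, and the configuration keys are standard librdkafka properties rather than anything specific to this alarm.

```python
# Illustrative producer configuration showing the client-side settings that
# decide when a slow broker request queue becomes a visible write failure.
# Broker address and topic are placeholders; tune the timeouts to your SLA.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker-1:9092",
    "acks": "all",
    "request.timeout.ms": 30000,    # per-request wait for the broker's response
    "delivery.timeout.ms": 120000,  # total time before a send is reported failed
})

def on_delivery(err, msg):
    # When the broker request queue is congested, timeout errors surface here
    # instead of the message being acknowledged.
    if err is not None:
        print(f"write failed: {err}")

producer.produce("test-topic", b"payload", on_delivery=on_delivery)
producer.flush(10)
```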
Possible Causes
- The number of threads used by the Broker instance to process requests is incorrectly configured.
- A slow disk fault has occurred.
- The Broker disk I/O is busy.
- Broker partitions are unevenly distributed, and hotspotting has occurred.
Handling Procedure
Check whether the number of threads used by the Broker instance to process requests is appropriate.
1. Log in to FusionInsight Manager and choose Cluster > Services > Kafka. On the page that is displayed, click Configurations and then All Configurations.
2. Search for num.io.threads and check its value. If the value is too small, increase it; twice the number of CPU cores is recommended, up to a maximum of 64 (see the sketch after this step list). Save the configuration.
3. Click the Instances tab, select all Broker instances, click More, and select Instance Rolling Restart.
   Services may be affected or interrupted during the restart. Perform the restart during off-peak hours.
4. Wait 5 minutes and check whether the alarm is automatically cleared.
   - If yes, no further action is required.
   - If no, go to 5.
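The recommendation in 2 (twice the number of CPU cores, capped at 64) can be computed on the Broker host. The following is a minimal sketch of that calculation, assuming Python is available on the host; the value must still be applied through FusionInsight Manager as described above.

```python
# Minimal sketch: derive the recommended num.io.threads value (twice the
# number of CPU cores, capped at 64) for the local Broker host.
# The value must still be applied via FusionInsight Manager, not this script.
import os

def recommended_num_io_threads(max_value: int = 64) -> int:
    cores = os.cpu_count() or 1
    return min(2 * cores, max_value)

if __name__ == "__main__":
    print(f"Recommended num.io.threads: {recommended_num_io_threads()}")
```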
Check whether a slow disk fault has occurred.
5. On FusionInsight Manager, choose O&M > Alarm > Alarms. In the Location field of the alarm details, view the name of the host for which this alarm is generated.
6. Check whether the alarm Slow Disk Fault or Disk Unavailable is generated for the node obtained in 5.
   - If yes, rectify the fault by following the handling procedure of ALM-12033 Slow Disk Fault or ALM-12063 Disk Unavailable.
   - If no, go to 8.
7. Wait 5 minutes and check whether the alarm is automatically cleared.
   - If yes, no further action is required.
   - If no, go to 8.
Check whether the Broker disk I/O is busy.
8. Check whether the alarm Busy Broker Disk I/Os is generated for the node obtained in 5 (a manual cross-check sketch follows this step list).
   - If yes, rectify the fault by following the handling procedure of that alarm.
   - If no, go to 10.
9. Wait 5 minutes and check whether the alarm is automatically cleared.
   - If yes, no further action is required.
   - If no, go to 10.
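If you want to cross-check disk busyness manually on the node obtained in 5, the sketch below samples per-disk I/O counters twice and prints an approximate utilization. It assumes Linux and the psutil package on the Broker host, and it is only a rough diagnostic aid, not the metric the Busy Broker Disk I/Os alarm itself evaluates.

```python
# Rough per-disk busy-time check for the Broker host (Linux with psutil).
# Diagnostic aid only; the actual alarm relies on Manager's own metrics.
import time
import psutil

INTERVAL_S = 5

before = psutil.disk_io_counters(perdisk=True)
time.sleep(INTERVAL_S)
after = psutil.disk_io_counters(perdisk=True)

for disk, stats in after.items():
    if disk not in before:
        continue
    # busy_time is reported in milliseconds on Linux; convert to a percentage.
    busy_ms = stats.busy_time - before[disk].busy_time
    utilization = 100.0 * busy_ms / (INTERVAL_S * 1000)
    print(f"{disk}: {utilization:.1f}% busy over the last {INTERVAL_S}s")
```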
Check whether Broker partitions are unevenly distributed and hotspotting has occurred.
10. Choose Cluster > Services > Kafka > Chart, select Partition in the Chart Category area, zoom in on Number of Partitions-All Instances in the upper right corner, and click Distribution to check whether partitions are evenly distributed across Broker instances (a command-line sketch for checking the distribution follows this step list).
    Figure 1 Example of uneven partition distribution on Broker
11. Click the rightmost bar of the distribution chart and check whether the node obtained in 5 is among the unevenly distributed instances. If it is, perform data balancing.
12. Wait 5 minutes and check whether the alarm is automatically cleared.
    - If yes, no further action is required.
    - If no, go to 13.
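As a command-line complement to the chart in 10, the sketch below counts how many partitions each Broker hosts using the confluent-kafka Python client. The bootstrap address is a placeholder, security settings are omitted, and the chart in Manager may define the per-instance partition count slightly differently; a strongly skewed count still points to the hotspotting described above.

```python
# Count partitions hosted per Broker to spot uneven distribution (hotspotting).
# The bootstrap address is a placeholder; add security settings as needed.
from collections import Counter
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "broker-1:9092"})
metadata = admin.list_topics(timeout=10)

partitions_per_broker = Counter()
for topic in metadata.topics.values():
    for partition in topic.partitions.values():
        for broker_id in partition.replicas:   # every replica a Broker hosts
            partitions_per_broker[broker_id] += 1

for broker_id, count in sorted(partitions_per_broker.items()):
    print(f"Broker {broker_id}: {count} partitions")
```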
Collect fault information.
13. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
14. Expand the Service drop-down list, and select Kafka for the target cluster.
15. Click the edit icon in the upper right corner, set Start Date and End Date for log collection to 10 minutes before and after the alarm generation time, respectively, and then click Download.
16. Contact O&M engineers and provide the collected logs.
Alarm Clearance
This alarm is automatically cleared after the fault is rectified.
Related Information
None.