ALM-38014 Total Produce Request Latency Exceeds the Threshold
Alarm Description
The system checks the total latency of Produce requests on each Broker instance every 30 seconds. This alarm is generated when the total latency of Produce requests on a Broker instance exceeds the threshold for 10 consecutive checks.
This alarm is cleared when the total latency of Produce requests is less than or equal to the threshold.
This alarm applies only to MRS 3.5.0 or later.
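The trigger and clear conditions described above can be sketched as a small state machine. This is an illustrative model only, not the MRS implementation. The class name `ProduceLatencyAlarm` and the method `observe` are hypothetical. The sketch also assumes that a single check at or below the threshold both resets the counter and clears the alarm.

```python
class ProduceLatencyAlarm:
    """Illustrative model: raise the alarm after N consecutive checks above the threshold."""

    def __init__(self, threshold, required_consecutive=10):
        self.threshold = threshold
        self.required_consecutive = required_consecutive
        self.consecutive_exceeded = 0
        self.active = False

    def observe(self, total_latency):
        """Feed the result of one periodic check; return True while the alarm is active."""
        if total_latency > self.threshold:
            self.consecutive_exceeded += 1
            if self.consecutive_exceeded >= self.required_consecutive:
                self.active = True
        else:
            # Assumption: latency at or below the threshold resets the counter and clears the alarm.
            self.consecutive_exceeded = 0
            self.active = False
        return self.active


alarm = ProduceLatencyAlarm(threshold=60000)
states = [alarm.observe(70000) for _ in range(10)]
print(states[8], states[9])  # False True
print(alarm.observe(60000))  # False
```

With a 30-second check interval, 10 consecutive checks mean that the latency must stay above the threshold for roughly 5 minutes before the alarm is reported.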
Alarm Attributes
Alarm ID | Alarm Severity | Auto Cleared |
---|---|---|
38014 | Critical (default threshold: 120000); Major (default threshold: 60000) | Yes |
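As a rough illustration of how the two default thresholds relate, the sketch below maps a latency value to a severity. The function name `classify_severity` is hypothetical, and the sketch assumes that a value must be strictly greater than a threshold to count as exceeding it. The values use the same unit as the configured thresholds.

```python
# Default thresholds from the table above, ordered from highest to lowest.
DEFAULT_THRESHOLDS = [("Critical", 120000), ("Major", 60000)]


def classify_severity(total_latency, thresholds=DEFAULT_THRESHOLDS):
    """Return the highest severity whose threshold is exceeded, or None if none is."""
    for severity, limit in thresholds:
        if total_latency > limit:
            return severity
    return None


print(classify_severity(130000))  # Critical
print(classify_severity(90000))   # Major
print(classify_severity(60000))   # None
```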
Alarm Parameters
Type | Parameter | Description |
---|---|---|
Location Information | Source | Specifies the cluster for which the alarm was generated. |
Location Information | ServiceName | Specifies the service for which the alarm was generated. |
Location Information | RoleName | Specifies the role for which the alarm was generated. |
Location Information | HostName | Specifies the host for which the alarm was generated. |
Impact on the System
The total latency of Produce requests on the Broker instance exceeds the threshold. For latency-sensitive services, a large number of service query requests may time out.
Possible Causes
- The number of threads used by the Broker instance to process requests is incorrectly configured.
- A slow disk fault has occurred.
- The Broker disk I/O is busy.
- Broker partitions are unevenly distributed, and hotspotting has occurred.
Handling Procedure
Check whether the number of threads used by the Broker instance to process requests is appropriate.
1. Log in to FusionInsight Manager and choose Cluster > Services > Kafka. On the page that is displayed, click Configurations and then All Configurations.
2. Search for num.io.threads and check its value. If the value is too small, increase it. The recommended value is twice the number of CPU cores, up to the maximum value of 64. Save the configuration.
3. Click the Instances tab, select all Broker instances, click More, and select Instance Rolling Restart.
   Services may be affected or interrupted during the restart. Restart the instances during off-peak hours.
4. Wait 5 minutes and check whether the alarm is automatically cleared.
   - If yes, no further action is required.
   - If no, go to step 5 (check whether a slow disk fault has occurred).
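The sizing rule for num.io.threads (twice the number of CPU cores, capped at 64) can be written as a one-line calculation. The function name `recommended_io_threads` is hypothetical, and the core count is assumed to be that of the Broker host.

```python
MAX_IO_THREADS = 64  # maximum value stated in the procedure


def recommended_io_threads(cpu_cores):
    """Return twice the CPU core count, capped at the documented maximum."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    return min(2 * cpu_cores, MAX_IO_THREADS)


print(recommended_io_threads(8))   # 16
print(recommended_io_threads(48))  # 64
```

On hosts with 32 or more cores the cap applies, so the result stays at 64.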
Check whether a slow disk fault has occurred.
5. On FusionInsight Manager, choose O&M > Alarm > Alarms. In the Location field of the alarm details, view the name of the host for which this alarm was generated.
6. Check whether the Slow Disk Fault or Disk Unavailable alarm has been generated for the node obtained in step 5.
   - If yes, rectify the fault by following the handling procedure of ALM-12033 Slow Disk Fault or ALM-12063 Disk Unavailable.
   - If no, go to step 8 (check whether the Broker disk I/O is busy).
7. Wait 5 minutes and check whether the alarm is automatically cleared.
   - If yes, no further action is required.
   - If no, go to step 8 (check whether the Broker disk I/O is busy).
Check whether the Broker disk I/O is busy.
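8. Check whether the Busy Broker Disk I/Os alarm exists on the node for which this alarm was generated (the host shown in the Location field of the alarm details).
   - If yes, rectify the fault by following the handling procedure of that alarm.
   - If no, go to step 10 (check whether Broker partitions are evenly distributed).
9. Wait 5 minutes and check whether the alarm is automatically cleared.
   - If yes, no further action is required.
   - If no, go to step 10 (check whether Broker partitions are evenly distributed).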
Check whether Broker partitions are evenly distributed and hotspotting has occurred.
10. Choose Cluster > Services > Kafka > Chart and select Partition in the Chart Category area. Zoom in on the Number of Partitions-All Instances chart from its upper right corner, and click Distribution to check whether partitions are evenly distributed across the Broker instances.
    Figure 1 Example of uneven partition distribution on Broker
11. Click the rightmost bar of the uneven distribution and check whether the node for which this alarm was generated (the host shown in the Location field of the alarm details) is among the unevenly distributed instances. If it is, perform data balancing.
12. Wait 5 minutes and check whether the alarm is automatically cleared.
    - If yes, no further action is required.
    - If no, go to step 13 (collect fault information).
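The distribution check above is done visually on the chart. As a rough offline equivalent, the sketch below flags Broker instances whose partition count is well above the average. The function name `overloaded_brokers`, the broker names, the sample counts, and the 20% tolerance are all assumptions made for illustration. None of them comes from MRS or Kafka.

```python
from statistics import mean


def overloaded_brokers(partitions_per_broker, tolerance=0.2):
    """Return the brokers whose partition count exceeds the mean by more than `tolerance`."""
    if not partitions_per_broker:
        return []
    average = mean(partitions_per_broker.values())
    limit = average * (1 + tolerance)
    return sorted(b for b, count in partitions_per_broker.items() if count > limit)


# Hypothetical partition counts read from the chart.
counts = {"broker-1": 100, "broker-2": 104, "broker-3": 180}
print(overloaded_brokers(counts))  # ['broker-3']
```

If the host reported in the alarm appears in the returned list, that is consistent with the hotspotting cause and suggests that data balancing is worth performing.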
Collect fault information.
13. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
14. Expand the Service drop-down list, and select Kafka for the target cluster.
15. Click the edit icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
16. Contact O&M engineers and provide the collected logs.
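The log collection window (10 minutes on either side of the alarm generation time) can be computed as follows. The function name `log_collection_window` and the sample timestamp are hypothetical.

```python
from datetime import datetime, timedelta


def log_collection_window(alarm_time, margin_minutes=10):
    """Return (start, end) covering `margin_minutes` before and after the alarm time."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


# Hypothetical alarm generation time.
start, end = log_collection_window(datetime(2024, 1, 15, 9, 30))
print(start, end)  # 2024-01-15 09:20:00 2024-01-15 09:40:00
```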
Alarm Clearance
This alarm is automatically cleared after the fault is rectified.
Related Information
None.