ALM-38018 Kafka Consumer Lag
Alarm Description
If you have configured a threshold to report Kafka consumer lag on the Alarms page of Kafka UI (there is no such rule by default), the system reports the alarm based on the following rules:
The system checks the topics subscribed to by all consumer groups every 60 seconds. This alarm is generated when, for five consecutive checks, the difference (lag) between a consumer group's consumption progress (offset) and the log end offset of the partition (the offset of the latest message produced) exceeds the threshold configured in the alarm rule.
This alarm is cleared when the system detects that the lag is lower than the configured threshold for five consecutive checks.
This alarm applies only to MRS 3.5.0 or later.
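The rule described above can be sketched as a small state machine. This is an illustration only, not the implementation used by MRS: the names `partition_lag`, `LagAlarm`, and `observe` are hypothetical, and the handling of a lag exactly equal to the threshold (treated here as neither above nor below) is an assumption.

```python
from dataclasses import dataclass

# From the alarm description: five consecutive checks, one check every 60 seconds.
CONSECUTIVE_CHECKS = 5


def partition_lag(log_end_offset: int, consumer_offset: int) -> int:
    """Lag is the log end offset minus the consumer offset, never negative."""
    return max(log_end_offset - consumer_offset, 0)


@dataclass
class LagAlarm:
    """Tracks one consumer group/topic pair (hypothetical helper)."""
    threshold: int        # user-configured lag threshold
    active: bool = False  # whether the alarm is currently raised
    _over: int = 0        # consecutive checks above the threshold
    _under: int = 0       # consecutive checks below the threshold

    def observe(self, lag: int) -> bool:
        """Feed the result of one periodic check; return the alarm state."""
        if lag > self.threshold:
            self._over += 1
            self._under = 0
        elif lag < self.threshold:
            self._under += 1
            self._over = 0
        else:
            # Assumption: a lag equal to the threshold breaks both streaks.
            self._over = 0
            self._under = 0

        if not self.active and self._over >= CONSECUTIVE_CHECKS:
            self.active = True
        elif self.active and self._under >= CONSECUTIVE_CHECKS:
            self.active = False
        return self.active
```

With a threshold of 100, five consecutive checks reporting a lag of 500 raise the alarm on the fifth check, and five consecutive checks reporting a lag of 10 clear it again on the fifth.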
Alarm Attributes
Alarm ID | Alarm Severity | Auto Cleared |
---|---|---|
38018 | Major (manually configured threshold) | Yes |
Alarm Parameters
Type | Parameter | Description |
---|---|---|
Location Information | ServiceName | Specifies the cluster service for which the alarm was generated. |
Location Information | ConsumerGroup | Name of the Kafka consumer group for which the alarm is generated. |
Additional Information | TopicName | Specifies the Kafka topic for which the alarm is generated. |
Additional Information | ConsumerLag | Specifies the number of messages yet to be consumed by the consumers in the Kafka topic for which the alarm is generated. |
Impact on the System
Messages in Kafka topics are retained for a limited period (seven days by default). If messages are not consumed before the retention period expires, they are deleted and the data is lost.
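As a rough illustration of this risk, the time remaining before the oldest unconsumed message expires can be estimated from its timestamp. This is a minimal sketch: `time_until_expiry` is a hypothetical helper, the seven-day value is the default stated above, and the result is only approximate because Kafka deletes data by log segment rather than message by message.

```python
from datetime import datetime, timedelta

# Default retention stated above; replace with the topic's actual setting.
DEFAULT_RETENTION = timedelta(days=7)


def time_until_expiry(oldest_unconsumed: datetime, now: datetime,
                      retention: timedelta = DEFAULT_RETENTION) -> timedelta:
    """Time left before the oldest unconsumed message passes its retention period.

    A zero or negative result means the message may already have been deleted.
    """
    return (oldest_unconsumed + retention) - now
```

For example, a message produced on 1 January and still unconsumed on 6 January has about two days left under the default retention period.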
Possible Causes
- A new consumer group starts consuming messages from the beginning of the topic, leading to consumer lag.
- The threshold configured in the consumer lag alarm rule is too low.
- The Kafka topic traffic increases sharply, and a large number of messages are generated in a short period of time.
- It takes a long time for the downstream system to process the Kafka messages in the topic.
Handling Procedure
Check whether the consumer group is new.
- Log in to FusionInsight Manager and choose O&M > Alarm > Alarms. View the alarm details. In the Location Information area, view the name of the Kafka consumer group for which the alarm is generated. In the Additional Information area, view the topic name.
- Check whether the consumer group is new.
- Wait a moment and then check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to "Check whether the alarm rule configuration is improper."
Check whether the alarm rule configuration is improper.
- On FusionInsight Manager, choose Cluster > Services > Kafka. On the right of KafkaManager web UI, click the URL link to access the Kafka UI. Click Alarms and check whether the configured threshold of the consumer lag alarm is proper.
- Wait 5 minutes and check whether the alarm is automatically cleared.
- If yes, no further action is required.
- If no, go to "Check whether the topic traffic increases sharply."
Check whether the topic traffic increases sharply.
- On the Kafka UI, click Topics and check whether a large number of messages are generated in a short period of time.
- Wait a moment and then check whether the alarm is cleared.
- If yes, no further action is required.
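- If no, go to "Check whether it takes a long time for the downstream system to process messages in the Kafka topic."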
Check whether it takes a long time for the downstream system to process messages in the Kafka topic.
- Check whether the downstream system is consuming messages from the topic at a slow pace.
- Analyze the reason why downstream jobs cannot quickly consume the topic messages and rectify the fault to accelerate the consumption. Wait 5 minutes and check whether the alarm is automatically cleared.
- If yes, no further action is required.
- If no, go to "Collect fault information."
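When analyzing slow downstream consumption, it can help to estimate whether the consumer will ever catch up at its current pace. The following is a minimal sketch under simplifying assumptions: both rates are constant and measured in messages per second, and `seconds_to_clear_lag` is a hypothetical helper, not part of Kafka or MRS.

```python
from typing import Optional


def seconds_to_clear_lag(lag: int, produce_rate: float,
                         consume_rate: float) -> Optional[float]:
    """Estimate the seconds needed for the lag to drop to zero.

    Returns None when the consumer is not faster than the producer,
    because the lag will then never shrink.
    """
    if lag <= 0:
        return 0.0
    net_rate = consume_rate - produce_rate  # messages drained per second
    if net_rate <= 0:
        return None
    return lag / net_rate
```

A result of `None` indicates that the alarm will not clear without intervention, for example speeding up the downstream job or adding consumers to the group.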
Collect fault information.
- On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
- Expand the Service drop-down list, and select Kafka for the target cluster.
- Click the edit icon in the upper right corner. Set Start Date to 10 minutes before the alarm generation time and End Date to 10 minutes after it. Then, click Download.
- Contact O&M engineers and provide the collected logs.
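The log collection window described in the steps above can be computed from the alarm generation time as follows. This is an illustrative sketch only; `log_collection_window` is a hypothetical helper and is not part of FusionInsight Manager.

```python
from datetime import datetime, timedelta
from typing import Tuple


def log_collection_window(alarm_time: datetime,
                          margin_minutes: int = 10) -> Tuple[datetime, datetime]:
    """Return (start, end): the alarm time minus and plus the margin."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin
```

For an alarm generated at 12:00, the window runs from 11:50 to 12:10 on the same day.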
Alarm Clearance
This alarm is automatically cleared after the fault is rectified.
Related Information
None.