Updated on 2024-11-13 GMT+08:00

ALM-38018 Kafka Consumer Lag

Alarm Description

If you have configured a threshold to report Kafka consumer lag on the Alarms page of Kafka UI (there is no such rule by default), the system reports the alarm based on the following rules:

The system checks the topics subscribed to by all consumer groups every 60 seconds. This alarm is generated when, for five consecutive checks, the difference (lag) between the consumption progress (offset) and the log end offset of the latest message produced in the partition exceeds the threshold configured in the alarm rule.

This alarm is cleared when, for five consecutive checks, the system detects that the difference (lag) between the offsets is lower than the configured threshold.

This alarm applies only to MRS 3.5.0 or later.
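
The generate and clear rules above can be sketched as a small state machine. This is a minimal illustration only, not the actual MRS implementation: the class name LagAlarm, its fields, and the handling of a lag exactly equal to the threshold are assumptions made for the example.

```python
from dataclasses import dataclass

# From the alarm description: five consecutive checks, one check every 60 seconds.
CONSECUTIVE_CHECKS = 5


@dataclass
class LagAlarm:
    """Hypothetical tracker for one (consumer group, topic) pair."""
    threshold: int        # user-configured lag threshold, in messages
    active: bool = False  # whether the alarm is currently raised
    over: int = 0         # consecutive checks with lag above the threshold
    under: int = 0        # consecutive checks with lag below the threshold

    def observe(self, lag: int) -> bool:
        """Feed the lag seen in one check and return the resulting alarm state."""
        if lag > self.threshold:
            self.over += 1
            self.under = 0
        elif lag < self.threshold:
            self.under += 1
            self.over = 0
        else:
            # Assumption: the source does not say how lag == threshold is
            # treated, so this sketch resets both streaks.
            self.over = 0
            self.under = 0

        if not self.active and self.over >= CONSECUTIVE_CHECKS:
            self.active = True    # alarm generated
        elif self.active and self.under >= CONSECUTIVE_CHECKS:
            self.active = False   # alarm cleared
        return self.active
```

In this sketch, a single check on the other side of the threshold resets the streak, so a brief spike or dip does not change the alarm state.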

Alarm Attributes

  • Alarm ID: 38018
  • Alarm Severity: Major (manually configured threshold)
  • Auto Cleared: Yes

Alarm Parameters

  • Location Information

    • ServiceName: Specifies the cluster service for which the alarm is generated.
    • ConsumerGroup: Specifies the Kafka consumer group for which the alarm is generated.

  • Additional Information

    • TopicName: Specifies the Kafka topic for which the alarm is generated.
    • ConsumerLag: Specifies the number of messages yet to be consumed by the consumers in the Kafka topic for which the alarm is generated.
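
As a rough illustration of how a ConsumerLag value could be derived, the sketch below computes lag per partition as the log end offset minus the consumer's offset, then sums over the partitions of a topic. The function names are hypothetical, and summing across partitions is an assumption; the source does not state how the reported value is aggregated.

```python
from typing import Iterable, Tuple


def partition_lag(log_end_offset: int, consumer_offset: int) -> int:
    """Messages not yet consumed in one partition (never negative)."""
    return max(log_end_offset - consumer_offset, 0)


def topic_lag(partitions: Iterable[Tuple[int, int]]) -> int:
    """Sum the lag over (log_end_offset, consumer_offset) pairs.

    Assumption: the topic-level value is the sum of per-partition lags.
    """
    return sum(partition_lag(end, offset) for end, offset in partitions)
```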

Impact on the System

Messages in Kafka topics are retained for a limited period (seven days by default). Messages that are not consumed before the retention period expires are deleted, and the data is lost.

Possible Causes

  • A new consumer group starts consuming messages from the beginning of the topic, leading to consumer lag.
  • The threshold of the consumer lag alarm rule configured by the user is too small.
  • The Kafka topic traffic increases sharply, and a large number of messages are generated in a short period of time.
  • It takes a long time for the downstream system to process the Kafka messages in the topic.

Handling Procedure

Check whether the consumer group is new.

  1. Log in to FusionInsight Manager and choose O&M > Alarm > Alarms. View the alarm details. In the Location Information area, view the name of the Kafka consumer group for which the alarm is generated. In the Additional Information area, view the topic name.
  2. Check whether the consumer group is new.

    • If yes, go to 3.

      In a new consumer group, the consumer starts consuming messages from the beginning of the topic, which can trigger a consumer lag alarm. The alarm is automatically cleared once the downstream consumer finishes processing the backlog of topic messages.

    • If no, go to 4.

  3. Wait a moment and then check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 4.

Check whether the alarm rule configuration is improper.

  4. On FusionInsight Manager, choose Cluster > Services > Kafka. Click the URL link to the right of KafkaManager web UI to access the Kafka UI. Click Alarms and check whether the configured threshold of the consumer lag alarm is proper.

    • If yes, go to 6.
    • If no, reconfigure the threshold, save the configuration, and go to 5.

  5. Wait 5 minutes and check whether the alarm is automatically cleared.

    • If yes, no further action is required.
    • If no, go to 6.

Check whether the topic traffic increases sharply.

  6. On the Kafka UI, click Topics and check whether a large number of messages are generated in a short period of time.

    • If yes, go to 7.

      If the alarm is caused by a sharp increase in topic traffic, the alarm is automatically cleared after the downstream system consumes the backlog of topic messages.

    • If no, go to 8.

  7. Wait a moment and then check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 8.

Check whether it takes a long time for the downstream system to process messages in the Kafka topic.

  8. Check whether the downstream system is consuming messages from the topic at a slow pace.

    • If yes, go to 9.
    • If no, go to 10.

  9. Analyze why downstream jobs cannot consume the topic messages quickly enough and rectify the fault to speed up consumption. Wait 5 minutes and check whether the alarm is automatically cleared.

    • If yes, no further action is required.
    • If no, go to 10.
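
When deciding how long to wait for the alarm to clear, a back-of-the-envelope estimate of the time needed to work off the backlog can help. The sketch below is illustrative only: the function name and the assumption of constant production and consumption rates are not from the source, and the rates would have to be measured separately.

```python
from typing import Optional


def seconds_to_drain(lag: int, produce_rate: float, consume_rate: float) -> Optional[float]:
    """Estimate the seconds until the lag reaches zero.

    lag          -- messages currently waiting to be consumed
    produce_rate -- messages written to the topic per second (assumed constant)
    consume_rate -- messages consumed per second (assumed constant)

    Returns None when consumption does not outpace production, in which
    case the lag never shrinks and waiting will not clear the alarm.
    """
    if lag <= 0:
        return 0.0
    net_rate = consume_rate - produce_rate
    if net_rate <= 0:
        return None
    return lag / net_rate
```

For example, under these assumptions a lag of 60,000 messages with 500 messages per second produced and 700 consumed drains in about 300 seconds. If the function returns None, the consumer needs more throughput (or the producer less traffic) before the lag can fall below the threshold.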

Collect fault information.

  10. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
  11. Expand the Service drop-down list, and select Kafka for the target cluster.
  12. Click the edit icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
  13. Contact O&M engineers and provide the collected logs.

Alarm Clearance

This alarm is automatically cleared after the fault is rectified.

Related Information

None.