Updated on 2024-09-23 GMT+08:00

ALM-38009 Kafka Topic Overload (Applicable to MRS 3.1.0 and Earlier Versions)

Alarm Description

The system checks the overload status of each Kafka topic every 60 seconds. This alarm is generated when the percentage of partitions of a topic on the overloaded disk exceeds the threshold (40% by default).

Its Trigger Count is 1. This alarm is cleared when the percentage of partitions of a topic on the overloaded disk is lower than the threshold (40% by default).

An overloaded disk is a disk on which the I/O usage of a disk partition is greater than 80%.

For example:

The partitions of Topic A are distributed on three brokers. The I/O usages of the disk partitions on two brokers are greater than 80%.

The percentage of partitions on overloaded disks is 2/3 (about 67%), which is greater than 40%, so this alarm is generated.
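
The check in this example can be sketched as follows. This is a minimal illustration, not the actual monitoring logic of MRS: the function name and the input mapping are hypothetical, and it assumes that each partition is attributed to exactly one disk partition. Only the 80% and 40% defaults come from the description above.

```python
def topic_overloaded(disk_util_by_partition, io_threshold=80.0, ratio_threshold=0.40):
    """Return (ratio, alarm) for one topic.

    disk_util_by_partition maps a topic partition ID to the I/O usage (%)
    of the disk partition that hosts it (hypothetical input format).
    """
    if not disk_util_by_partition:
        return 0.0, False
    # Count the topic partitions that sit on an overloaded disk.
    overloaded = sum(
        1 for util in disk_util_by_partition.values() if util > io_threshold
    )
    ratio = overloaded / len(disk_util_by_partition)
    # The alarm condition is "exceeds the threshold", so use a strict comparison.
    return ratio, ratio > ratio_threshold


# Topic A: three partitions, two of them on disks above 80% I/O usage.
ratio, alarm = topic_overloaded({0: 92.5, 1: 85.0, 2: 30.0})
print(f"ratio={ratio:.2f}, alarm={alarm}")
```

With one overloaded disk out of three, the ratio drops to 1/3, which is below the default threshold, so the same function reports no alarm.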

Alarm Attributes

  • Alarm ID: 38009
  • Alarm Severity: Major
  • Auto Cleared: Yes

Alarm Parameters

  • Source: Specifies the cluster for which the alarm was generated.
  • ServiceName: Specifies the service for which the alarm was generated.
  • RoleName: Specifies the role for which the alarm was generated.
  • HostName: Specifies the host for which the alarm was generated.
  • TopicName: Specifies the Kafka topic for which the alarm was generated.

Impact on the System

The disk partition has frequent I/Os. Data may fail to be written to the Kafka topic for which the alarm is generated.

Possible Causes

  • Too many replicas are configured for the topic.
  • The producer parameters that control batch writing of messages are inappropriately configured.
  • The service traffic of this topic is too heavy, and the current partition configuration is inappropriate.

Handling Procedure

Check the number of topic replicas.

  1. On FusionInsight Manager, choose O&M > Alarm > Alarms. Locate the row that contains this alarm, expand the alarm details, and view the host name in Location.
  2. On FusionInsight Manager, choose Cluster, click the name of the desired cluster, choose Services > Kafka > KafkaTopic Monitor, search for the topic for which the alarm is generated, and check the number of replicas.
  3. If the number of replicas is greater than 3, reduce the replication factor of the topic (for example, to 3).

    Run the following command on the FusionInsight client to replan the replicas of Kafka topics:

    kafka-reassign-partitions.sh --zookeeper {zk_host}:{port}/kafka --reassignment-json-file {manual assignment json file path} --execute

    For example:

    /opt/client/Kafka/kafka/bin/kafka-reassign-partitions.sh --zookeeper 10.149.0.90:2181,10.149.0.91:2181,10.149.0.92:2181/kafka --reassignment-json-file expand-cluster-reassignment.json --execute

    In the expand-cluster-reassignment.json file, describe the brokers to which the partitions of the topic are migrated in the following format:

    {"partitions":[{"topic": "topicName","partition": 1,"replicas": [1,2,3]}],"version":1}

  4. Observe for a period of time and check whether the alarm is cleared. If the alarm persists, go to 1 under "Check the partition planning of the topic".
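
The reassignment file used in step 3 can be generated rather than written by hand. The sketch below is one possible approach, not a tool shipped with MRS: the function name and the sample assignment are hypothetical, and it assumes that keeping the first brokers in each replica list is acceptable. In practice, choose which replicas to drop based on the load of the disks that host them.

```python
import json


def build_reassignment(topic, current_assignment, target_replicas=3):
    """Build the JSON document passed to --reassignment-json-file.

    current_assignment maps a partition ID to its list of broker IDs.
    Assumption: the first target_replicas brokers of each list are kept,
    so the first entry (the preferred leader) does not change.
    """
    partitions = [
        {"topic": topic, "partition": pid, "replicas": replicas[:target_replicas]}
        for pid, replicas in sorted(current_assignment.items())
    ]
    return {"partitions": partitions, "version": 1}


# Hypothetical topic with five replicas per partition.
current = {0: [1, 2, 3, 4, 5], 1: [2, 3, 4, 5, 1]}
plan = build_reassignment("topicName", current)
print(json.dumps(plan, indent=2))
```

Save the printed document as expand-cluster-reassignment.json and pass it to kafka-reassign-partitions.sh as shown in step 3.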

Check the partition planning of the topic.

  1. On the KafkaTopic Monitor page, view Topic Input Traffic in the Topic Traffic area of each topic, obtain the topic with the largest value, and check the partitions of this topic as well as information about the host of these partitions.
  2. Log in to each host identified in 1 and run the iostat -d -x command to check the %util value of each disk.

    • If the %util value of each disk exceeds the threshold (80% by default), expand the Kafka disk capacity. After the capacity expansion, replan the topic partitions by referring to 3 under "Check the number of topic replicas".
    • If the %util values of the disks vary greatly, check the disk partition configuration of Kafka. For example, check the value of log.dirs in the ${BIGDATA_HOME}/FusionInsight_HD_8.1.0.1/1_14_Broker/etc/server.properties file.

      Run the following command to view the Filesystem information:

      df -h {log.dirs value}

      In the command output, the Filesystem column shows the device on which the directory resides.

    • If the partition where Filesystem is located matches the partition with a high %util value, plan Kafka partitions on idle disks, configure log.dirs as an idle disk directory, and replan topic partitions by referring to 3 under "Check the number of topic replicas". Ensure that the partitions of the topic are evenly distributed to each disk.

  3. Observe for a period of time and check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, repeat 1 to 2 three times. Then, go to 4.

  4. Observe for a period of time and check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 1 under "Collect fault information".
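
The decision made in step 2 from the iostat output can be sketched as follows. This is an illustration under stated assumptions: the sample output is invented for the example, the function names are hypothetical, and the 40-point spread used to decide that values "vary greatly" is an arbitrary placeholder, because the procedure does not quantify it.

```python
# Illustrative `iostat -d -x` output; the values are invented for this example.
SAMPLE = """\
Device:  rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda      0.00 1.20 0.50 3.10 12.00 88.00 55.56 0.01 2.10 1.00 2.30 0.80 12.40
sdb      0.00 9.80 1.00 410.00 8.00 52000.00 253.10 9.70 23.60 4.00 23.70 2.30 95.10
"""


def parse_iostat_util(output):
    """Map each device name to its %util value."""
    util = {}
    util_col = None
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Device"):
            # Locate %util from the header, since column order varies by sysstat version.
            util_col = fields.index("%util")
            continue
        if util_col is not None and len(fields) > util_col:
            util[fields[0]] = float(fields[util_col])
    return util


def classify(util, io_threshold=80.0, spread_threshold=40.0):
    """Suggest the branch of step 2 to follow (spread_threshold is an assumption)."""
    if not util:
        return "no data"
    values = list(util.values())
    if all(v > io_threshold for v in values):
        return "expand capacity"
    if max(values) - min(values) > spread_threshold:
        return "check log.dirs"
    return "ok"


print(classify(parse_iostat_util(SAMPLE)))
```

In the sample, only one of the two disks is above 80% and the gap between them is large, so the sketch points to the log.dirs check rather than to capacity expansion.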

Collect fault information.

  1. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
  2. Expand the Service drop-down list, and select Kafka for the target cluster.
  3. In the upper right corner, set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
  4. Contact O&M personnel and provide the collected logs.
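
The log collection window from step 3 can be computed as shown below. This is a small helper for illustration only; the function name and the alarm time are hypothetical.

```python
from datetime import datetime, timedelta


def log_window(alarm_time, margin_minutes=10):
    """Return (start, end) covering margin_minutes before and after the alarm."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


# Hypothetical alarm generation time.
start, end = log_window(datetime(2024, 9, 23, 14, 5))
print(start, end)
```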

Alarm Clearance

This alarm is automatically cleared after the fault is rectified.

Related Information

None