ALM-38009 Kafka Topic Overloaded

Updated at: Mar 25, 2021 GMT+08:00

Description

The system checks the overload status of each Kafka topic every 60 seconds. This alarm is generated when the percentage of a topic's partitions that reside on overloaded disks exceeds the threshold (40% by default).

Trigger Count is set to 1. This alarm is cleared when the percentage is lower than the threshold (40% by default).

A disk is considered overloaded if the I/O usage of its disk partitions is greater than 80%.

For example:

The partitions of TopicA are distributed on three Brokers. The I/O usage of the disk partitions on two of these Brokers is greater than 80%.

In this case, the partition percentage is 2/3, greater than 40%, and this alarm is generated.
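
The check in this example can be sketched as follows. This is a minimal illustration, not the actual implementation used by FusionInsight Manager; the function name topic_overloaded, the Broker names, and the input shape (one I/O usage value per Broker that hosts partitions of the topic) are assumptions made for the example.

```python
def topic_overloaded(io_usage_by_broker, io_limit=80.0, threshold=40.0):
    """Return (percentage, alarm) for one topic.

    io_usage_by_broker maps each Broker hosting the topic's partitions to
    the I/O usage (%) of the disk partition it uses (hypothetical input).
    io_limit and threshold default to the 80% and 40% values stated above.
    """
    if not io_usage_by_broker:
        return 0.0, False
    # Brokers whose disk partition counts as overloaded.
    overloaded = [b for b, usage in io_usage_by_broker.items() if usage > io_limit]
    percentage = 100.0 * len(overloaded) / len(io_usage_by_broker)
    return percentage, percentage > threshold


# TopicA from the example: two of three Brokers exceed 80% I/O usage.
percentage, alarm = topic_overloaded(
    {"broker-1": 91.0, "broker-2": 85.5, "broker-3": 30.0}
)
print(f"{percentage:.1f}% overloaded, alarm={alarm}")
```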

Attribute

Alarm ID    Alarm Severity    Automatically Cleared
38009       Major             Yes

Parameters

Parameter      Description
Source         Specifies the cluster for which the alarm is generated.
ServiceName    Specifies the service for which the alarm is generated.
RoleName       Specifies the role for which the alarm is generated.
HostName       Specifies the host for which the alarm is generated.
TopicName      Specifies the Kafka topic for which the alarm is generated.

Impact on the System

The I/O usage of the disk partition is high. Data may fail to be written to the Kafka topic for which the alarm is generated.

Possible Causes

  • Too many replicas are configured for the topic.
  • The parameter that controls batch writing of producer messages is inappropriately configured. The service traffic of this topic is too heavy for the current partition configuration.

Procedure

Check the number of replicas.

  1. On FusionInsight Manager, choose O&M > Alarm > Alarms. On the displayed page, select this alarm, and check the TopicName for which this alarm is generated.
  2. Choose Cluster > Name of the desired cluster > Services > Kafka > KafkaTopic Monitor. Search for the topic for which this alarm is generated. On the displayed page, view the number of replicas.
  3. If the number of replicas is greater than 3, decrease it to 3.

    Specifically, run the following command to re-plan replicas of the Kafka topic.

    kafka-reassign-partitions.sh --zookeeper {zk_host}:{port}/kafka --reassignment-json-file {manual assignment json file path} --execute

    For example:

    /opt/client/Kafka/kafka/bin/kafka-reassign-partitions.sh --zookeeper 10.149.0.90:2181,10.149.0.91:2181,10.149.0.92:2181/kafka --reassignment-json-file expand-cluster-reassignment.json --execute

    In the expand-cluster-reassignment.json file, describe the Brokers to which the partitions of the topic are migrated in the format of {"partitions":[{"topic": "topicName","partition": 1,"replicas": [1,2,3] }],"version":1}.

  4. After a period of time, check whether this alarm is cleared. If this alarm persists, go to step 1 of "Check the partition planning of the topic."
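
The reassignment file used in step 3 can be generated rather than written by hand. The sketch below builds JSON in the format shown in step 3 and trims each partition's replica list to three Brokers. The helper name build_reassignment, the topic name, and the Broker IDs are illustrative assumptions; the current assignment must be read from your own cluster.

```python
import json


def build_reassignment(topic, current_assignment, max_replicas=3):
    """Build the JSON document passed to --reassignment-json-file.

    current_assignment maps partition number -> list of Broker IDs
    (hypothetical values; read the real ones from your cluster).
    Replicas beyond max_replicas are dropped from the end of each list,
    so the first listed Broker (the preferred leader) is kept.
    """
    partitions = [
        {"topic": topic, "partition": p, "replicas": list(replicas[:max_replicas])}
        for p, replicas in sorted(current_assignment.items())
    ]
    return {"partitions": partitions, "version": 1}


# Example: a two-partition topic that currently has four replicas each.
plan = build_reassignment("topicName", {0: [1, 2, 3, 4], 1: [2, 3, 4, 1]})
print(json.dumps(plan))
```

Save the printed JSON as expand-cluster-reassignment.json and pass it to kafka-reassign-partitions.sh as shown in step 3.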

Check the partition planning of the topic.

  1. On the KafkaTopic Monitor page, choose Topic Traffic > Topic Input Traffic to find the topic with the largest Topic Input Traffic value, and check the partitions of this topic and the hosts on which these partitions reside.
  2. Log in to the hosts obtained in the previous step and run the iostat -d -x command to check the value of %util for each disk:

    • If the value is high for each disk, expand the Kafka disks. After the capacity expansion, reassign the partitions of the topic as described in step 3 of the replica check above.
    • If the values of %util vary greatly between disks, check the disk partition configuration of Kafka, that is, the log.dirs configuration item in the server.properties file. For example, this file is stored in the ${BIGDATA_HOME}/FusionInsight_HD_8.1.0/1_14_Broker/etc directory.

      Run the following command to view the Filesystem on which each log.dirs directory resides:

      df -h {value of the log.dirs configuration item}

      In the command output, the Filesystem column shows the disk partition that holds the directory.

    • If the partition shown in the Filesystem column is the one with the high %util value, plan Kafka partitions on idle disks and set log.dirs to directories on the idle disks. Then reassign the partitions of the topic as described in step 3 of the replica check above, so that the partitions of the topic are evenly distributed across disks.

  3. After a period of time, check whether the alarm is cleared.

    • If it is, no further action is required.
    • If it is not, repeat steps 1 and 2 of this procedure up to three times. If the alarm persists after the third attempt, go to step 4.

  4. After a period of time, check whether the alarm is cleared.

    • If it is, no further action is required.
    • If it is not, go to "Collect fault information."
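
When several hosts must be checked, the %util comparison in step 2 can be scripted. The following sketch extracts the %util value of each device from the text printed by iostat -d -x. The function name parse_util and the sample output are made up for illustration; column layouts differ between sysstat versions, so the sketch locates the %util column by name in the header row.

```python
def parse_util(iostat_output):
    """Return {device: %util} from the text printed by `iostat -d -x`."""
    util = {}
    column = None
    for line in iostat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Header row: "Device:" in older sysstat versions, "Device" in newer ones.
        if fields[0].startswith("Device"):
            column = fields.index("%util") if "%util" in fields else None
            continue
        # Data rows are only read after a header with a %util column was seen.
        if column is not None and len(fields) > column:
            util[fields[0]] = float(fields[column])
    return util


# Made-up sample output, for illustration only.
sample = """\
Device:  rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda      0.00 1.20 0.50 3.10 12.00 48.00 33.33 0.01 2.10 1.00 2.30 0.50 0.18
sdb      0.00 95.00 2.00 410.00 64.00 98000.00 476.00 9.80 23.70 4.00 23.80 2.30 94.60
"""

# List devices from busiest to idlest.
for device, value in sorted(parse_util(sample).items(), key=lambda kv: kv[1], reverse=True):
    print(f"{device}: {value:.2f}")
```

A large gap between the busiest and the idlest device corresponds to the "vary greatly" case in step 2; uniformly high values correspond to the disk expansion case.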

Collect fault information.

  1. On FusionInsight Manager, choose O&M > Log > Download.
  2. In the Service area, select Kafka in the required cluster.
  3. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
  4. Contact the O&M personnel and send the collected logs.
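
The Start Date and End Date in step 3 can be derived from the alarm generation time as sketched below. The function name log_window and the alarm time are hypothetical; use the generation time shown for the alarm on the Alarms page.

```python
from datetime import datetime, timedelta


def log_window(alarm_time, margin_minutes=10):
    """Return (start, end): margin_minutes before and after the alarm time."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


# Hypothetical alarm generation time.
start, end = log_window(datetime(2021, 3, 25, 14, 5))
print("Start Date:", start)
print("End Date:", end)
```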

Alarm Clearing

After the fault is rectified, the system automatically clears this alarm.

Related Information

None
