Updated on 2022-12-14 GMT+08:00

ALM-38009 Busy Broker Disk I/Os

Description

The system checks the I/O status of each Kafka disk every 60 seconds. This alarm is generated when the I/O status of a Kafka data directory disk on a broker exceeds the threshold (80% by default).

The alarm smoothing time is 3. This alarm is cleared when the disk I/O is lower than the threshold (80% by default).
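The interaction between the threshold and the smoothing time can be sketched as below. This is only an illustration: it assumes that a smoothing time of 3 means the alarm is raised after three consecutive checks above the threshold, which the text above does not state explicitly. The function name alarm_state and the sample values are hypothetical.

```shell
# alarm_state THRESHOLD SMOOTHING SAMPLE...
# Replays disk I/O samples (integer percentages) in order and prints
# the resulting state: "raised" or "cleared".
alarm_state() {
  local threshold="$1"
  local smoothing="$2"
  shift 2
  local streak=0
  local state=cleared
  local sample
  for sample in "$@"; do
    if [ "$sample" -gt "$threshold" ]; then
      streak=$((streak + 1))
      # Assumption: raise only after SMOOTHING consecutive breaches.
      if [ "$streak" -ge "$smoothing" ]; then
        state=raised
      fi
    else
      streak=0
      # Per the description, the alarm clears once I/O is lower than the threshold.
      if [ "$sample" -lt "$threshold" ]; then
        state=cleared
      fi
    fi
  done
  echo "$state"
}
```

For example, alarm_state 80 3 85 90 95 prints raised, while alarm_state 80 3 85 90 70 95 prints cleared because the streak is interrupted.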

Attribute

Alarm ID    Alarm Severity    Automatically Cleared
38009       Major             Yes

Parameters

Parameter              Description
Source                 Specifies the cluster for which the alarm is generated.
ServiceName            Specifies the service for which the alarm is generated.
RoleName               Specifies the role for which the alarm is generated.
HostName               Specifies the host for which the alarm is generated.
Data directory name    Name of the data directory of the Kafka disk with busy I/Os

Impact on the System

The I/O usage of the disk partition is high. Data may fail to be written to the Kafka topic for which the alarm is generated.

Possible Causes

  • There are many replicas configured for a topic.
  • The parameter specifying producer message batch write is inappropriately configured.
  • The service traffic of this topic is too heavy, and the current partition configuration is inappropriate.

Procedure

Check the number of replicas.

  1. On FusionInsight Manager, choose O&M > Alarm > Alarms. On the displayed page, select this alarm, and check the TopicName for which this alarm is generated.
  2. Choose Cluster > Name of the desired cluster > Services > Kafka > KafkaTopic Monitor. Search for the topic for which this alarm is generated. On the displayed page, view the number of replicas.
  3. If the number of replicas is greater than 3, decrease it to 3.

    Specifically, run the following command to re-plan replicas of the Kafka topic.

    kafka-reassign-partitions.sh --zookeeper {zk_host}:{port}/kafka --reassignment-json-file {manual assignment json file path} --execute

    For example:

    /opt/Bigdata/client/Kafka/kafka/bin/kafka-reassign-partitions.sh --zookeeper 10.149.0.90:2181,10.149.0.91:2181,10.149.0.92:2181/kafka --reassignment-json-file expand-cluster-reassignment.json --execute

    In the expand-cluster-reassignment.json file, specify the brokers to which each partition of the topic is migrated, in the following format: {"partitions":[{"topic": "topicName","partition": 1,"replicas": [1,2,3] }],"version":1}.

  4. After a period of time, check whether this alarm is cleared. If this alarm persists, go to step 1 under "Check the partition planning of the topic."
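For a topic with several partitions, the reassignment file described in step 3 can be generated rather than written by hand. The following is a minimal sketch, not a FusionInsight tool: the function name gen_reassignment is hypothetical, and the round-robin placement (three consecutive broker IDs per partition, shifted by one for each partition) is an assumed policy chosen only to spread replicas evenly.

```shell
# gen_reassignment TOPIC PARTITION_COUNT BROKER_ID...
# Prints a reassignment JSON document that gives every partition
# (numbered from 0) exactly three replicas.
gen_reassignment() {
  local topic="$1"
  local partitions="$2"
  shift 2
  if [ "$#" -lt 3 ]; then
    echo "at least 3 broker IDs are required" >&2
    return 1
  fi
  awk -v topic="$topic" -v parts="$partitions" -v ids="$*" 'BEGIN {
    n = split(ids, b, " ")
    printf "{\"partitions\":["
    for (p = 0; p < parts; p++) {
      # Assumed policy: three consecutive brokers, rotated per partition.
      printf "%s{\"topic\":\"%s\",\"partition\":%d,\"replicas\":[%s,%s,%s]}", \
        (p ? "," : ""), topic, p, \
        b[p % n + 1], b[(p + 1) % n + 1], b[(p + 2) % n + 1]
    }
    print "],\"version\":1}"
  }'
}
```

The output can be redirected to expand-cluster-reassignment.json and passed to kafka-reassign-partitions.sh with --reassignment-json-file. Running the same command with --verify instead of --execute reports whether the reassignment has completed.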

Check the partition planning of the topic.

  1. On the KafkaTopic Monitor page, click Topic Traffic > Topic Input Traffic of each topic to find the topic with the highest Topic Input Traffic. Then check the partitions of this topic and the hosts where these partitions reside.
  2. Log in to the hosts queried in step 1 of this procedure and run the iostat -d -x command to check the value of %util for each disk:

    • If the value is high for each disk, expand the Kafka disks. After the capacity expansion, reassign the partitions of the topic by following step 3 of the replica check procedure.
    • If values of %util for the disks vary greatly, check the disk partition configuration of Kafka, that is, the log.dirs configuration item in the server.properties file. For example, the file is in the ${BIGDATA_HOME}/FusionInsight_HD_8.1.0.1/1_14_Broker/etc directory.

      Run the following command to view information about the Filesystem:

      df -h {value of the log.dirs configuration item}

      The Filesystem column of the command output shows the disk partition on which each data directory resides.

    • If the partition shown in the Filesystem column matches the partition with the high %util, plan Kafka partitions on idle disks, and set log.dirs to directories on the idle disks. Then, reassign the partitions of the topic by following step 3 of the replica check procedure to ensure that the partitions are evenly distributed across disks.

  3. After a period of time, check whether the alarm is cleared.

    • If it is, no further action is required.
    • If it is not, repeat steps 1 to 2 of this procedure up to three times. If the alarm persists after the third attempt, go to step 4.

  4. After a period of time, check whether the alarm is cleared.

    • If it is, no further action is required.
    • If it is not, go to step 1 under "Collect fault information."
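The disk checks in step 2 can be partly scripted. The sketch below assumes the sysstat iostat layout, in which a header row beginning with "Device" names the columns, and a server.properties file in which log.dirs is a comma-separated list on one line. The function names busy_disks and log_dirs are hypothetical helpers, not part of Kafka or FusionInsight.

```shell
# busy_disks THRESHOLD
# Reads `iostat -d -x` output on stdin and prints "<device> <%util>"
# for every device whose %util is above THRESHOLD.
busy_disks() {
  awk -v limit="$1" '
    # Header row: find the column that holds %util.
    /^Device/ {
      col = 0
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    # Data rows after the header: compare %util numerically.
    col && NF >= col && ($col + 0) > (limit + 0) { print $1, $col }
  '
}

# log_dirs FILE
# Prints each directory listed in the log.dirs item of FILE, one per line.
log_dirs() {
  sed -n 's/^[[:space:]]*log\.dirs[[:space:]]*=[[:space:]]*//p' "$1" | tr ',' '\n'
}

# Possible usage on a broker host (the 80 matches the default alarm threshold):
#   iostat -d -x | busy_disks 80
#   log_dirs server.properties | xargs df -h
```

Note that iostat without an interval reports averages since boot. Adding an interval and count, for example iostat -d -x 5 3, also prints more recent samples, which may reflect the current load better.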

Collect fault information.

  1. On FusionInsight Manager, choose O&M > Log > Download.
  2. In the Service area, select Kafka in the required cluster.
  3. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
  4. Contact the O&M personnel and send the collected logs.
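The Start Date and End Date in step 3 can be computed from the alarm generation time. This is a small sketch that assumes GNU date (for the -d option) and a timestamp in the host's local time zone. The function name log_window is hypothetical.

```shell
# log_window "YYYY-MM-DD HH:MM:SS"
# Prints the start and end of the log collection window, one per line:
# 10 minutes before and 10 minutes after the alarm generation time.
log_window() {
  local alarm_epoch
  # Convert to epoch seconds first, which avoids ambiguous relative-time parsing.
  alarm_epoch=$(date -d "$1" +%s) || return 1
  date -d "@$((alarm_epoch - 600))" '+%Y-%m-%d %H:%M:%S'
  date -d "@$((alarm_epoch + 600))" '+%Y-%m-%d %H:%M:%S'
}
```

For an alarm generated at 2022-12-14 10:05:00, log_window prints 2022-12-14 09:55:00 and 2022-12-14 10:15:00.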

Alarm Clearing

After the fault is rectified, the system automatically clears this alarm.

Related Information

None