ALM-38009 Busy Broker Disk I/Os

Description

The system checks the I/O usage of each Kafka disk every 60 seconds. This alarm is generated when the I/O usage of a disk that holds a Kafka data directory on a broker exceeds the threshold (80% by default).

The alarm smoothing time is 3. This alarm is cleared when the disk I/O usage falls below the threshold (80% by default).
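
The trigger and clear conditions can be sketched as a small shell function. This is only an illustration: it assumes that a smoothing time of 3 means the alarm is raised after three consecutive checks above the threshold, and that a single check below the threshold clears it. The function and variable names are hypothetical and are not part of FusionInsight.

```shell
# Sketch only. ASSUMPTION: "smoothing time 3" means three consecutive
# 60-second checks above the threshold are needed before the alarm is raised.
THRESHOLD=80   # percent, default threshold from the description
SMOOTHING=3    # consecutive over-threshold checks (assumed meaning)

# alarm_state SAMPLE...
# Takes integer I/O usage samples, oldest first, and prints the alarm
# state ("raised" or "cleared") after the last sample.
alarm_state() {
    local streak=0 state="cleared" sample
    for sample in "$@"; do
        if [ "$sample" -gt "$THRESHOLD" ]; then
            streak=$((streak + 1))
            if [ "$streak" -ge "$SMOOTHING" ]; then
                state="raised"
            fi
        elif [ "$sample" -lt "$THRESHOLD" ]; then
            # Usage below the threshold clears the alarm and resets the streak.
            streak=0
            state="cleared"
        else
            # Exactly at the threshold: neither condition applies, keep the state.
            streak=0
        fi
    done
    echo "$state"
}
```

For example, alarm_state 85 90 95 reports a raised alarm, while alarm_state 85 70 90 95 does not, because the streak is interrupted by the second sample.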

Attribute

Alarm ID	Alarm Severity	Automatically Cleared
38009	Major	Yes

Parameters

Parameter	Description
Source	Specifies the cluster for which the alarm is generated.
ServiceName	Specifies the service for which the alarm is generated.
RoleName	Specifies the role for which the alarm is generated.
HostName	Specifies the host for which the alarm is generated.
Data directory name	Specifies the name of the Kafka data directory on the disk with busy I/Os.

Impact on the System

The I/O usage of the disk partition is high. Data may fail to be written to the Kafka topic for which the alarm is generated.

Possible Causes

There are many replicas configured for a topic.
The parameter specifying producer message batch write is inappropriately configured.
The service traffic of this topic is too heavy, and the current partition configuration is inappropriate.

Procedure

Check the number of replicas.

1. On FusionInsight Manager, choose O&M > Alarm > Alarms. On the displayed page, select this alarm and check the TopicName of the topic for which the alarm is generated.
2. Choose Cluster > Name of the desired cluster > Services > Kafka > KafkaTopic Monitor. Search for the topic for which the alarm is generated and view its number of replicas.
3. If the number of replicas is greater than 3, decrease it to 3.

   To do so, run the following command to reassign the replicas of the Kafka topic:

   kafka-reassign-partitions.sh --zookeeper {zk_host}:{port}/kafka --reassignment-json-file {manual assignment json file path} --execute

   For example:

   /opt/Bigdata/client/Kafka/kafka/bin/kafka-reassign-partitions.sh --zookeeper 10.149.0.90:2181,10.149.0.91:2181,10.149.0.92:2181/kafka --reassignment-json-file expand-cluster-reassignment.json --execute

   In the expand-cluster-reassignment.json file, specify the brokers to which the partitions of the topic are migrated, in the following format: {"partitions":[{"topic": "topicName","partition": 1,"replicas": [1,2,3] }],"version":1}
4. After a period of time, check whether the alarm is cleared. If the alarm persists, go to 5.
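
Writing the reassignment file by hand is error-prone for topics with many partitions. The following is a minimal sketch of a helper that prints a file in the format described above. The function name make_reassignment_json is hypothetical, and the sketch assumes that partitions are numbered consecutively from 0 and that the broker IDs passed in are valid for the cluster.

```shell
# make_reassignment_json TOPIC REPLICAS...
# Each REPLICAS argument is the comma-separated broker ID list for one
# partition, given in partition order starting at partition 0 (for example
# "1,2,3"). Prints the reassignment JSON to stdout.
# Hypothetical helper for illustration; it does not validate broker IDs.
make_reassignment_json() {
    local topic=$1; shift
    local partition=0 sep="" replicas
    printf '{"partitions":['
    for replicas in "$@"; do
        printf '%s{"topic":"%s","partition":%d,"replicas":[%s]}' \
            "$sep" "$topic" "$partition" "$replicas"
        sep=","
        partition=$((partition + 1))
    done
    printf '],"version":1}\n'
}

# Example usage (not executed here):
#   make_reassignment_json topicName 1,2,3 2,3,1 > expand-cluster-reassignment.json
```

The generated file can then be passed to kafka-reassign-partitions.sh with --reassignment-json-file and --execute. Running the same command with --verify instead of --execute should report whether the reassignment has completed.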

Check the partition planning of the topic.

5. On the KafkaTopic Monitor page, click Topic Traffic > Topic Input Traffic of each topic to find the topic with the largest Topic Input Traffic value. Then check the partitions of this topic and the hosts where these partitions reside.
6. Log in to the hosts queried in 5 and run the iostat -d -x command to check the value of %util for each disk:
   - If the value is high for every disk, expand the Kafka disks. After the capacity expansion, replan the partitions of the topic by following the instructions in 3.
   - If the values of %util vary greatly between disks, check the disk partition configuration of Kafka. For example, check the log.dirs configuration item in the server.properties file in the ${BIGDATA_HOME}/FusionInsight_HD_8.1.0.1/1_14_Broker/etc directory.
     Run the following command to view information about the Filesystem:

     df -h {value of the log.dirs configuration item}

     In the command output, note the partition shown in the Filesystem column.
     If the partition of the Filesystem matches the partition with the high %util, plan Kafka partitions on idle disks and set log.dirs to directories on the idle disks. Then replan the partitions of the topic by following the instructions in 3 to ensure that the partitions of the topic are evenly distributed across disks.
7. After a period of time, check whether the alarm is cleared.
   - If it is, no further action is required.
   - If it is not, repeat 5 and 6 up to three times. If the alarm persists after the third attempt, go to 8.
8. After a period of time, check whether the alarm is cleared.
   - If it is, no further action is required.
   - If it is not, go to 9.
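
The disk checks in this part of the procedure can be sketched with two small shell helpers. Both function names are hypothetical, and the 80 default used for %util is only an assumed cutoff, since the procedure does not define what counts as a high value. The first helper filters iostat -d -x output for busy devices. The second lists the directories configured in log.dirs so that they can be passed to df -h.

```shell
# busy_disks [LIMIT]
# Reads `iostat -d -x` output on stdin and prints "device %util" for every
# device whose %util is greater than LIMIT (default 80, an assumed cutoff).
# The %util column is located by its header name, because the column layout
# differs between sysstat versions.
busy_disks() {
    awk -v limit="${1:-80}" '
        /^Device/ {
            # Header row: remember which field holds %util.
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col && NF >= col && $col + 0 > limit { print $1, $col }
    '
}

# kafka_log_dirs FILE
# Prints each directory listed in the log.dirs item of a server.properties
# file, one per line (the value is a comma-separated list).
kafka_log_dirs() {
    sed -n 's/^[[:space:]]*log\.dirs[[:space:]]*=[[:space:]]*//p' "$1" | tr ',' '\n'
}

# Example usage on a broker host (not executed here):
#   iostat -d -x | busy_disks 80
#   kafka_log_dirs "${BIGDATA_HOME}/FusionInsight_HD_8.1.0.1/1_14_Broker/etc/server.properties" | xargs df -h
```

If a device reported by busy_disks also appears in the Filesystem column of the df -h output, that data directory is a candidate for moving to an idle disk.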

Collect fault information.

9. On FusionInsight Manager, choose O&M > Log > Download.
10. In the Service area, select Kafka in the required cluster.
11. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
12. Contact the O&M personnel and send the collected logs.
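
To avoid arithmetic mistakes when filling in Start Date and End Date, the collection window can be computed from the alarm generation time. The following sketch assumes GNU date (the -d option) and a timestamp in YYYY-MM-DD HH:MM:SS format. The function name log_window is hypothetical.

```shell
# log_window "YYYY-MM-DD HH:MM:SS"
# Prints the start and end of the log collection window: 10 minutes before
# and 10 minutes after the given alarm generation time.
# Requires GNU date; times are interpreted in the local time zone.
log_window() {
    local alarm_epoch
    alarm_epoch=$(date -d "$1" +%s) || return 1
    date -d "@$((alarm_epoch - 600))" '+Start Date: %Y-%m-%d %H:%M:%S'
    date -d "@$((alarm_epoch + 600))" '+End Date: %Y-%m-%d %H:%M:%S'
}

# Example usage (not executed here):
#   log_window "2024-01-01 12:00:00"
```

The conversion goes through epoch seconds rather than a relative expression such as "- 10 minutes", which GNU date can misread as a time zone offset when it follows a time of day.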