ALM-38009 Busy Broker Disk I/Os
Description
The system checks the I/O status of each Kafka disk every 60 seconds. This alarm is generated when the I/O status of a Kafka data directory disk on a broker exceeds the threshold (80% by default).
The alarm smoothing time is 3. This alarm is cleared when the disk I/O is lower than the threshold (80% by default).
Attribute
Alarm ID | Alarm Severity | Automatically Cleared |
---|---|---|
38009 | Major | Yes |
Parameters
Parameter | Description |
---|---|
Source | Specifies the cluster for which the alarm is generated. |
ServiceName | Specifies the service for which the alarm is generated. |
RoleName | Specifies the role for which the alarm is generated. |
HostName | Specifies the host for which the alarm is generated. |
Data directory name | Specifies the name of the Kafka data directory on the disk with busy I/Os. |
Impact on the System
The I/O usage of the disk partition is high. Data may fail to be written to the Kafka topic for which the alarm is generated.
Possible Causes
- There are many replicas configured for a topic.
- The parameter specifying producer message batch write is inappropriately configured.
- The service traffic of this topic is too heavy, and the current partition configuration is inappropriate.
Procedure
Check the number of replicas.
- On FusionInsight Manager, choose O&M > Alarm > Alarms. On the displayed page, select this alarm, and check the TopicName for which this alarm is generated.
- Choose Cluster > Name of the desired cluster > Services > Kafka > KafkaTopic Monitor. Search for the topic for which this alarm is generated, and view its number of replicas on the displayed page.
- If the number of replicas is greater than 3, decrease it to 3.
To do so, run the following command to reassign the replicas of the Kafka topic:
kafka-reassign-partitions.sh --zookeeper {zk_host}:{port}/kafka --reassignment-json-file {manual assignment json file path} --execute
For example:
/opt/Bigdata/client/Kafka/kafka/bin/kafka-reassign-partitions.sh --zookeeper 10.149.0.90:2181,10.149.0.91:2181,10.149.0.92:2181/kafka --reassignment-json-file expand-cluster-reassignment.json --execute
In the expand-cluster-reassignment.json file, describe the brokers to which the partitions of the topic are to be migrated, in the following format: {"partitions":[{"topic": "topicName","partition": 1,"replicas": [1,2,3] }],"version":1}.
- After a period of time, check whether this alarm is cleared. If this alarm persists, go to step 5 (Check the partition planning of the topic).
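The reassignment file described in the replica step above can be generated instead of written by hand. The following is a minimal sketch and not part of the product tooling: the function name make_reassignment_json, the partition count, and the broker IDs are hypothetical. It assumes that partitions are numbered from 0 and that every partition receives the same replica list.

```shell
#!/bin/sh
# Sketch only: prints a reassignment JSON document in the format shown above.
# make_reassignment_json TOPIC PARTITION_COUNT REPLICA_LIST
#   REPLICA_LIST is a comma-separated list of broker IDs, for example "1,2,3".
# The function name and all sample values are hypothetical.
make_reassignment_json() {
  mr_topic=$1
  mr_count=$2
  mr_replicas=$3
  mr_p=0
  mr_sep=""
  printf '{"partitions":['
  # Emit one entry per partition, numbered from 0 to PARTITION_COUNT - 1.
  while [ "$mr_p" -lt "$mr_count" ]; do
    printf '%s{"topic":"%s","partition":%d,"replicas":[%s]}' \
      "$mr_sep" "$mr_topic" "$mr_p" "$mr_replicas"
    mr_sep=","
    mr_p=$((mr_p + 1))
  done
  printf '],"version":1}\n'
}

# Example: a topic named "topicName" with 2 partitions, replicas on brokers 1, 2, 3.
make_reassignment_json "topicName" 2 "1,2,3"
```

Redirect the output to expand-cluster-reassignment.json before running the reassignment command. Because the first broker in each replica list is the preferred leader, a real plan would normally vary the order of the list from partition to partition. This sketch does not do that.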
Check the partition planning of the topic.
- On the KafkaTopic Monitor page, click Topic Traffic > Topic Input Traffic of each topic to find the topic with the largest Topic Input Traffic value, and check which partitions this topic has and which hosts they reside on.
- Log in to the hosts identified in step 5 and run the iostat -d -x command to check the value of %util for each disk:
- If the value is high for every disk, expand the Kafka disks. After the capacity expansion, re-plan the partitions of the topic by following the instructions in step 3.
- If the values of %util vary greatly between disks, check the disk partition configuration of Kafka, that is, the log.dirs configuration item in the server.properties file. For example, the file is in the ${BIGDATA_HOME}/FusionInsight_HD_8.1.0.1/1_14_Broker/etc directory.
Run the following command to view information about the file system:
df -h {directory configured in log.dirs}
The Filesystem column of the command output shows the partition that holds the data directory.
- If the partition shown in the Filesystem column matches the partition with the high %util, plan Kafka partitions on idle disks and set log.dirs to directories on the idle disks. Then, re-plan the partitions of the topic by following the instructions in step 3 to ensure that the partitions of the topic are evenly distributed across disks.
- After a period of time, check whether the alarm is cleared.
- After a period of time, check whether the alarm is cleared.
- If it is, no further action is required.
- If it is not, go to step 9 (Collect fault information).
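The disk checks in this section can be partly scripted. The following is a rough sketch under stated assumptions: the helper names busy_disks and log_dirs are hypothetical, and the 80 used as the limit simply mirrors the default alarm threshold. The sketch finds the %util column by its header name, because the column position differs between sysstat versions. It assumes log.dirs is a single comma-separated line in server.properties.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Kafka or FusionInsight.

# busy_disks LIMIT
#   Reads `iostat -d -x` output on stdin and prints "device %util" for every
#   device whose %util is greater than LIMIT.
busy_disks() {
  awk -v limit="$1" '
    # Header row: locate the %util column by name.
    $1 ~ /^Device/ {
      col = 0
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    # Data rows: evaluated only after a header containing %util was seen.
    col > 0 && NF >= col && ($col + 0) > (limit + 0) { print $1, $col }
  '
}

# log_dirs FILE
#   Prints each directory listed in the log.dirs item of FILE, one per line.
log_dirs() {
  sed -n 's/^[[:space:]]*log\.dirs[[:space:]]*=[[:space:]]*//p' "$1" | tr ',' '\n'
}

# Run the %util check only where iostat is installed. LC_ALL=C keeps the
# decimal separator a period so that awk compares the values numerically.
if command -v iostat >/dev/null 2>&1; then
  LC_ALL=C iostat -d -x | busy_disks 80
fi
```

To map data directories to file systems, the output of log_dirs can be passed to df, for example `log_dirs /path/to/server.properties | while read -r d; do df -h "$d"; done`, where the path is a placeholder. Note that a single iostat report shows averages since boot. Running iostat with an interval, such as `iostat -d -x 1 2`, and reading the last report gives a more current picture.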
Collect fault information.
- On FusionInsight Manager, choose O&M > Log > Download.
- In the Service area, select Kafka in the required cluster.
- Click the icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
- Contact the O&M personnel and send the collected logs.
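The Start Date and End Date for log collection can be worked out from the alarm generation time. The following is a small sketch that assumes GNU date (the -d option). The function name log_window and the sample timestamp are hypothetical. It performs plain wall-clock arithmetic with no time zone conversion.

```shell
#!/bin/sh
# Sketch only: log_window is a hypothetical helper; requires a date command
# that supports -d (GNU coreutils).
# log_window "YYYY-MM-DD HH:MM:SS"
#   Prints the times 10 minutes (600 seconds) before and after the given time.
log_window() {
  # Parse and format both with -u so the arithmetic is purely on wall-clock time.
  lw_epoch=$(date -u -d "$1" +%s) || return 1
  date -u -d "@$((lw_epoch - 600))" '+Start: %Y-%m-%d %H:%M:%S'
  date -u -d "@$((lw_epoch + 600))" '+End:   %Y-%m-%d %H:%M:%S'
}

# Example with a made-up alarm generation time.
log_window "2024-01-01 12:00:00"
```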
Alarm Clearing
After the fault is rectified, the system automatically clears this alarm.
Related Information
None