ALM-38009 Busy Broker Disk I/Os (Applicable to Versions Later Than MRS 3.1.0)

This section applies to versions later than MRS 3.1.0.

Alarm Description

The system checks the I/O status of each Kafka disk every 60 seconds. This alarm is generated when the disk I/O of a Kafka data directory on a broker exceeds the threshold (80% by default).

Its Trigger Count is 3. This alarm is cleared when the disk I/O is lower than the threshold (80% by default).
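
As a rough illustration of the description above, the following shell sketch models how a trigger count could work. It assumes that a Trigger Count of 3 means the threshold must be exceeded in three consecutive checks. The function name should_alarm and the integer samples are hypothetical and are not part of MRS.

```shell
# Hypothetical helper (not an MRS command).
# Prints "alarm" once COUNT consecutive samples exceed THRESHOLD, otherwise "ok".
# Usage: should_alarm THRESHOLD COUNT SAMPLE...
should_alarm() {
    threshold=$1
    count=$2
    shift 2
    streak=0
    for sample in "$@"; do
        if [ "$sample" -gt "$threshold" ]; then
            streak=$((streak + 1))   # one more check above the threshold
        else
            streak=0                 # a check below the threshold resets the streak
        fi
        if [ "$streak" -ge "$count" ]; then
            echo alarm
            return 0
        fi
    done
    echo ok
}
```

Under this assumption, a single check that falls below the threshold resets the count, so short I/O spikes would not raise the alarm.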

Alarm Attributes

Alarm ID	Alarm Severity	Auto Cleared
38009	Major	Yes

Alarm Parameters

Type	Parameter	Description
Location Information	Source	Specifies the cluster for which the alarm was generated.
	ServiceName	Specifies the service for which the alarm was generated.
	RoleName	Specifies the role for which the alarm was generated.
	HostName	Specifies the host for which the alarm was generated.
	DataDirectoryName	Specifies the name of the Kafka data directory with frequent disk I/Os.

Impact on the System

The disk partition has frequent I/Os. Data may fail to be written to the Kafka topic for which the alarm is generated.

Possible Causes

Too many replicas are configured for the topic.
The parameter for batch writing of producer messages is inappropriately configured.
The service traffic of the topic is too heavy, and the current partition configuration is inappropriate.

Handling Procedure

Check the number of topic replicas.

On FusionInsight Manager, choose O&M > Alarm > Alarms. In the Location field of the alarm details, view the host name for which this alarm is generated.
On FusionInsight Manager, choose Cluster > Services > Kafka > KafkaTopic Monitor, and sort topics by the number of replicas in descending order. Check whether topics with more than three replicas are experiencing high traffic.
- If yes, go to Step 3.
- If no, go to Step 5.
If such topics are found, click the topic name and verify whether the IP address resolved from the host name obtained in Step 1 is included among the hosts for all partitions of the topic. If it is, reduce the topic's replication factor to 3.

Perform the operations below on the Kafka client to replan replicas for Kafka topics.
1. Log in to the node where the Kafka client is installed as the client installation user.
2. Run the following command to switch to the client installation directory, for example, /opt/client/Kafka/kafka/bin:
```
cd /opt/client/Kafka/kafka/bin
```
3. Run the following command to configure environment variables:
```
source /opt/client/bigdata_env
```
4. Run the following command to perform user authentication (skip this step for a cluster in normal mode):
```
kinit <component service user>
```
5. Run the following command to replan the replicas of Kafka topics:
```
kafka-reassign-partitions.sh --bootstrap-server <Kafka cluster IP address:21007> --command-config ../config/client.properties --reassignment-json-file {manual assignment json file path} --execute
```
  To obtain the IP address of the Kafka cluster, log in to FusionInsight Manager and choose Cluster > Services > Kafka > Instances. Check and record the service IP address of any Broker instance. The port number of the Kafka cluster is 21007 if Kerberos authentication is enabled for the cluster (the cluster is in security mode) and is 21005 if Kerberos authentication is disabled for the cluster (the cluster is in normal mode).
  
  Example:
```
/opt/client/Kafka/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server 192.168.0.90:21007,192.168.0.91:21007,192.168.0.92:21007 --command-config /opt/client/Kafka/kafka/config/client.properties --reassignment-json-file expand-cluster-reassignment.json --execute
```
  In the expand-cluster-reassignment.json file, specify the target brokers to which the topic's partitions will be migrated. The JSON content should follow this format:
```
{"partitions":[{"topic": "topicName","partition": 1,"replicas": [1,2,3] }],"version":1}
```
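  A plan file in this format can be written by hand or generated. The following is a minimal sketch, assuming every partition of the topic should receive the same replica list. make_reassignment is a hypothetical helper, and the topic name, partition count, and broker IDs are placeholders.

```shell
# Hypothetical helper (not shipped with the Kafka client).
# Prints a reassignment plan that gives partitions 0..N-1 of one topic
# the same replica list.
# Usage: make_reassignment TOPIC PARTITION_COUNT BROKER_IDS
#   BROKER_IDS is a comma-separated list of broker IDs, for example 1,2,3
make_reassignment() {
    topic=$1
    partitions=$2
    replicas=$3
    printf '{"partitions":['
    i=0
    while [ "$i" -lt "$partitions" ]; do
        if [ "$i" -gt 0 ]; then
            printf ','               # separate the partition entries
        fi
        printf '{"topic":"%s","partition":%d,"replicas":[%s]}' \
            "$topic" "$i" "$replicas"
        i=$((i + 1))
    done
    printf '],"version":1}\n'
}
```

  For example, `make_reassignment topicName 2 1,2,3 > expand-cluster-reassignment.json` writes a plan for partitions 0 and 1. Because the first broker in each replicas list is the preferred leader, consider varying the order per partition in a real plan so that leaders are not concentrated on one broker.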
Observe for a period of time and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 5.

Check the partition planning of the topic.

On the KafkaTopic Monitor page, view Topic Input Traffic in the Topic Traffic area of each topic, identify the topic with the highest value, and check the partitions of this topic and the hosts on which these partitions reside.
Log in to the host queried in Step 5 and run the following command to check the %util value of each disk:
```
iostat -d -x
```
- If the %util value of each disk exceeds the threshold (80% by default), expand the Kafka disk capacity. After the capacity expansion, replan the topic partitions by referring to Step 3.
- If the %util values of the disks vary greatly, check the disk partition configuration of Kafka.
  For example, check the value of log.dirs in the ${BIGDATA_HOME}/FusionInsight_HD_*/x_x_Broker/etc/server.properties file.
  
  Run the following command to view the Filesystem information:
```
df -h <log.dirs value>
```
  In the command output, the Filesystem column shows the disk partition on which the directory resides.
- If the partition shown in the Filesystem column is the one with a high %util value, plan Kafka partitions on idle disks: set log.dirs to a directory on an idle disk, and then replan topic partitions by referring to Step 3. Ensure that the partitions of the topic are evenly distributed across the disks.
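
The two checks above can be scripted. The following is a minimal sketch, assuming the iostat header line starts with Device and contains a %util column, and that log.dirs is a comma-separated list on a single line of server.properties. busy_disks and kafka_log_dirs are hypothetical helper names.

```shell
# Hypothetical helpers for reading the iostat and log.dirs information.

# Reads `iostat -d -x` output on stdin and prints the device name and %util
# of every device whose %util is above the limit (default 80).
busy_disks() {
    awk -v limit="${1:-80}" '
        # Header line: find which column holds %util.
        $1 ~ /^Device/ {
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        # Device line: compare its %util value with the limit.
        col && NF >= col && $col + 0 > limit { print $1, $col }
    '
}

# Prints each directory listed in log.dirs of the given server.properties file,
# one per line.
kafka_log_dirs() {
    sed -n 's/^log\.dirs[[:space:]]*=[[:space:]]*//p' "$1" | tr ',' '\n'
}
```

For example, `iostat -d -x 1 3 | busy_disks 80` lists the busy devices, and `kafka_log_dirs <path to server.properties> | xargs df -h` shows the file system behind each data directory. The first report printed by iostat covers the time since boot, so sampling with an interval usually reflects the current load better.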
Observe for a period of time and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, repeat Step 5 to Step 6 three times. Then, go to Step 8.
Observe for a period of time and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 9.

Collect fault information.

On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
Expand the Service drop-down list, and select Kafka for the target cluster.
Click the edit icon in the upper right corner and set the time span to start 10 minutes before and end 10 minutes after the time the alarm was generated. Then, click Download to collect the logs.
Contact O&M engineers and provide the collected logs.