Updated on 2022-12-14 GMT+08:00

High Usage of Multiple Disks on a Kafka Cluster Node

Issue

The usage of multiple disks on a node in the Kafka streaming cluster is high. The Kafka service will become unavailable if the usage reaches 100%.

Symptom

A node in the customer's MRS Kafka streaming cluster has multiple disks. Due to improper partitioning and the characteristics of the services running on the cluster, the usage of some disks is high. When the usage reaches 100%, Kafka becomes unavailable.

Cause Analysis

Data on the disks must be cleared in a timely manner. Changing the global log.retention.hours value requires a Kafka service restart. To ensure service continuity, you can instead shorten the retention (aging) time of individual data-intensive topics as required, which takes effect without a restart.

Procedure

  1. Log in to the core node of the Kafka streaming cluster.
  2. Run the df -h command to check the disk usage.
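
     The output might resemble the following (the device names, mount points, and sizes are illustrative); in this example, the disk mounted on /srv/BigData/kafka/data2 is nearly full:

       Filesystem      Size  Used Avail Use% Mounted on
       /dev/vdb        500G  120G  380G  24% /srv/BigData/kafka/data1
       /dev/vdc        500G  475G   25G  95% /srv/BigData/kafka/data2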

  3. Obtain the data storage directories from the log.dirs configuration item in the Kafka configuration file /opt/Bigdata/MRS_2.1.0/1_11_Broker/etc/server.properties. Adjust the configuration file path based on the cluster version in your environment. If there are multiple disks, the directories in log.dirs are separated by commas (,).
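
     For example, you can view the configured directories directly. The command below uses the configuration file path from 3, and the output shown is only an illustration of the comma-separated format:

       grep "^log.dirs" /opt/Bigdata/MRS_2.1.0/1_11_Broker/etc/server.properties
       log.dirs=/srv/BigData/kafka/data1/kafka-logs,/srv/BigData/kafka/data2/kafka-logs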

  4. Run the cd command to go to the data storage directory, obtained in 3, that resides on the disk with high usage.
  5. Run the du -sh * command to display the names and sizes of all topic directories in the current directory.
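
     The output lists each topic partition directory (named <topic name>-<partition number>) and its size, for example (the topic names and sizes below are illustrative):

       du -sh *
       1.2G    kktest-0
       860M    kktest-1
       15G     kktest-2
       12K     __consumer_offsets-30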

  6. Determine how to change the data retention period. The default global data retention period of Kafka is seven days. Some topics may receive a large amount of data, and their partitions reside on the disks with high usage.

    • You can change the global data retention period to a smaller value to release disk space. This method requires a Kafka service restart, which may affect service running. For details, see 7.
    • You can change the data retention period of a single topic to a smaller value to release disk space. This configuration takes effect without a Kafka service restart. For details, see 8.

  7. Log in to Manager. On the Kafka service configuration page, switch to All Configurations and search for the log.retention.hours configuration item. The default value is 168 hours (7 days). Change it based on site requirements.
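
     If you want to spot-check the value currently in effect on a broker, you can also search the server.properties file described in 3. This is only a sketch: the path depends on your cluster version, and the parameter may not appear in the file if it has never been changed.

       grep "log.retention" /opt/Bigdata/MRS_2.1.0/1_11_Broker/etc/server.properties
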
  8. Change the data retention time of the topics on these disks.

    1. Check the retention time of the topic data.

      bin/kafka-topics.sh --describe --zookeeper <ZooKeeper cluster service IP address>:2181/kafka --topic kktest
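
      The command output shows the topic's partition details and any configuration overrides. If retention.ms has been set for the topic, it appears in the Configs column; if Configs is empty, the global retention period applies. The following output is illustrative:

        Topic:kktest    PartitionCount:3    ReplicationFactor:2    Configs:retention.ms=1000000
            Topic: kktest    Partition: 0    Leader: 1    Replicas: 1,2    Isr: 1,2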

    2. Set the topic data retention time. --topic indicates the topic name, and retention.ms indicates the data retention time, in milliseconds.

      kafka-topics.sh --zookeeper <ZooKeeper cluster service IP address>:2181/kafka --alter --topic kktest --config retention.ms=1000000

      After the data retention time is set, the deletion operation may not be performed immediately. The deletion operation starts after the time specified by log.retention.check.interval.ms. You can check whether the delete field exists in the server.log file of Kafka to determine whether the deletion operation has taken effect. If the delete field exists, the deletion operation has taken effect. You can also run the df -h command to check the disk usage to confirm whether the setting has taken effect.
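
      For example, you can search the broker log for deletion records directly. The log path below is an assumption and may differ depending on your cluster version; adjust it as required:

        grep -i "delete" /var/log/Bigdata/kafka/broker/server.log | tail    # log path is an assumption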