Kafka Balancing Tool Instructions

Scenario

This section describes how to use the Kafka balancing tool on a client to balance the load of the Kafka cluster based on service requirements in scenarios such as node decommissioning, node recommissioning, and load balancing.

Prerequisites

The MRS cluster administrator understands the service requirements and has prepared a Kafka administrator user (a member of the kafkaadmin group; this user is not required in normal mode).
The Kafka client has been installed.

Procedure

1. Log in as a client installation user to the node on which the Kafka client is installed.
2. Switch to the Kafka client installation directory, for example, /opt/client.

   cd /opt/client
3. Run the following command to configure environment variables:

   source bigdata_env
4. Run the following command to authenticate the user (skip this step in normal mode):

   kinit <Component service user>
5. Run the following command to switch to the Kafka directory under the client installation directory:

   cd Kafka/kafka
6. Run the kafka-balancer.sh command to balance the cluster. The commonly used commands are:
- Run the --run command to perform cluster balancing:
  ./bin/kafka-balancer.sh --run --zookeeper <Service IP address of any ZooKeeper node:zkPort/kafka> --bootstrap-server <Kafka cluster IP address:port> --throttle 10000000 --consumer-config config/consumer.properties --enable-az-aware --show-details

  This command generates the balancing solution and then executes it. --show-details is optional and prints the solution details. --throttle sets the bandwidth limit that applies while the balancing solution is executed, in bytes per second (bytes/sec). --enable-az-aware is optional and enables the cross-AZ feature when the balancing solution is generated. Use this parameter only if the cross-AZ feature has been enabled for the cluster.
- Run the --run command to decommission a node:
  ./bin/kafka-balancer.sh --run --zookeeper <Service IP address of any ZooKeeper node:zkPort/kafka> --bootstrap-server <Kafka cluster IP address:port> --throttle 10000000 --consumer-config config/consumer.properties --remove-brokers <BrokerId list> --enable-az-aware --force

  In the command, --remove-brokers specifies the IDs of the brokers to be removed. Separate multiple broker IDs with commas (,). --force is optional and forcibly generates the migration solution, ignoring the disk usage alarm. --enable-az-aware is optional and enables the cross-AZ feature when the balancing solution is generated. Use this parameter only if the cross-AZ feature has been enabled for the cluster.
  
  This command migrates data on the Broker nodes to be decommissioned to other Broker nodes.
- Run the following command to view the execution status:
  ./bin/kafka-balancer.sh --status --zookeeper <Service IP address of any ZooKeeper node:zkPort/kafka>
- Run the following command to generate a balancing solution:
  ./bin/kafka-balancer.sh --generate --zookeeper <Service IP address of any ZooKeeper node:zkPort/kafka> --bootstrap-server <Kafka cluster IP address:port> --consumer-config config/consumer.properties --enable-az-aware
  
  This command is used to generate a migration solution based on the current cluster status and print the solution to the console. --enable-az-aware is optional, indicating that the cross-AZ feature is enabled when a migration solution is generated. If this parameter is used, ensure that the cross-AZ feature has been enabled for the cluster.
- Run the following command to clear the intermediate status:
  ./bin/kafka-balancer.sh --clean --zookeeper <Service IP address of any ZooKeeper node:zkPort/kafka>
  
  This command clears the intermediate status information stored in ZooKeeper when the migration is not complete.

  In these commands, the Kafka cluster port is 21007 in security mode and 9092 in normal mode.
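The --run and --generate commands above share most of their arguments and differ mainly in the port, which depends on the cluster mode. The following is a minimal sketch of a helper that only prints such a command line for review and does not run the tool. The function names kafka_port and build_balancer_cmd are hypothetical, and the addresses in the usage note are placeholders, not values from this document.

```shell
#!/bin/sh
# Sketch only: prints a kafka-balancer.sh command line; it does not run the tool.
# kafka_port and build_balancer_cmd are hypothetical helper names.

# Kafka port as stated above: 21007 in security mode, 9092 in normal mode.
kafka_port() {
  case "$1" in
    security) echo 21007 ;;
    normal)   echo 9092 ;;
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# build_balancer_cmd ACTION MODE ZK_CONNECT BROKER_HOST [EXTRA_ARGS...]
# ACTION is run or generate. The body runs in a subshell so variables stay local.
build_balancer_cmd() (
  if [ "$#" -lt 4 ]; then
    echo "usage: build_balancer_cmd ACTION MODE ZK_CONNECT BROKER_HOST [EXTRA_ARGS...]" >&2
    exit 1
  fi
  action=$1; mode=$2; zk=$3; broker=$4
  shift 4
  case "$action" in
    run|generate) ;;
    *) echo "unsupported action: $action" >&2; exit 1 ;;
  esac
  port=$(kafka_port "$mode") || exit 1
  cmd="./bin/kafka-balancer.sh --$action --zookeeper $zk"
  cmd="$cmd --bootstrap-server $broker:$port"
  cmd="$cmd --consumer-config config/consumer.properties"
  # Append optional flags such as --throttle, --remove-brokers, or --show-details.
  if [ "$#" -gt 0 ]; then
    cmd="$cmd $*"
  fi
  printf '%s\n' "$cmd"
)
```

For example, build_balancer_cmd run security 192.168.0.10:2181/kafka 192.168.0.20 --throttle 10000000 --show-details prints a --run command that targets port 21007. Printing the command instead of executing it lets the administrator check the ZooKeeper address, port, and flags before starting a migration.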

Troubleshooting

During partition migration using the Kafka balancing tool, if the execution progress of the balancing tool is blocked due to a Broker fault in the cluster, you need to manually rectify the fault. The scenarios are as follows:

The Broker is faulty because the disk usage reaches 100%.
1. Log in to FusionInsight Manager, choose Cluster > Name of the desired cluster > Services > Kafka > Instance, stop the Broker instance in the Restoring state, and record the management IP address of the node where the instance resides and the corresponding broker.id. To view the broker.id value, click the role name, select All Configurations on the Instance Configurations page, and search for the broker.id parameter.
2. Log in to the recorded management IP address as user root, and run the df -lh command to view the mounted directory whose disk usage is 100%, for example, ${BIGDATA_DATA_HOME}/kafka/data1.
3. Go to the directory and run the du -sh * command to view the size of each file in the directory. Check whether files other than those in the kafka-logs directory exist, and determine whether these files can be deleted or migrated.
  - If yes, delete or migrate the related data and go to 8.
  - If no, go to 4.
4. Go to the kafka-logs directory, run the du -sh * command, and select a partition folder to move. Partition folders are named Topic name-Partition ID. Record the topic and partition.
5. Modify both the recovery-point-offset-checkpoint and replication-offset-checkpoint files in the kafka-logs directory as follows:
  1. Decrease the number in the second line of the file. (To remove multiple partition directories, decrease the number by the number of partition directories to be removed.)
  2. Delete the line of the to-be-removed partition. (The line structure is "Topic name Partition ID Offset". Save the line before deleting it, because it must later be added to the file of the same name in the destination directory.)
6. Modify both the recovery-point-offset-checkpoint and replication-offset-checkpoint files in the destination data directory (for example, ${BIGDATA_DATA_HOME}/kafka/data2/kafka-logs) as follows:
  - Increase the number in the second line of the file. (To move multiple partition directories, increase the number by the number of partition directories to be moved.)
  - Add the line of the to-be-moved partition to the end of the file. (The line structure is "Topic name Partition ID Offset". You can copy the line saved in 5.)
7. Move the partition to the destination directory. After the partition is moved, run the chown -R omm:wheel <Partition directory> command to set the owner and group of the partition directory.
8. Log in to FusionInsight Manager and choose Cluster > Name of the desired cluster > Services > Kafka > Instance to start the stopped Broker instance.
9. Wait for 5 to 10 minutes and check whether the health status of the Broker instance is Good.
  - If yes, wait until the alarm is cleared, and then resolve the insufficient disk capacity by following the handling procedure for "ALM-38001 Insufficient Kafka Disk Capacity".
  - If no, contact O&M support.
After the faulty Broker is recovered, the blocked balancing task continues. You can run the --status command to view the task execution progress.
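The checkpoint edits in steps 5 and 6 are easy to get wrong by hand, so the following is a minimal sketch of how they could be scripted. It assumes the file layout described above (an entry count on the second line, followed by "Topic name Partition ID Offset" lines), that the Broker instance is already stopped, and that both files exist in both directories. The function names are hypothetical, and the sketch keeps a .bak copy of each file it changes. It does not move the partition directory itself or run chown (step 7).

```shell
#!/bin/sh
# Sketch only: moves one checkpoint entry between two kafka-logs directories.
# Function names are hypothetical. Run only while the Broker instance is stopped.

# move_checkpoint_entry SRC_FILE DST_FILE TOPIC PARTITION
move_checkpoint_entry() (
  src=$1; dst=$2; topic=$3; part=$4
  # Find the "Topic Partition Offset" line; entries start at line 3.
  entry=$(awk -v t="$topic" -v p="$part" \
    'NR > 2 && ($1 "") == (t "") && $2 == p { print; exit }' "$src")
  if [ -z "$entry" ]; then
    echo "no entry for $topic-$part in $src" >&2
    exit 1
  fi
  # Keep backups so that the original files can be restored.
  cp -p "$src" "$src.bak" && cp -p "$dst" "$dst.bak" || exit 1
  # Source file: decrease the count on line 2 and drop the entry (step 5).
  awk -v t="$topic" -v p="$part" '
    NR == 2 { print $1 - 1; next }
    NR > 2 && ($1 "") == (t "") && $2 == p { next }
    { print }' "$src.bak" > "$src" || exit 1
  # Destination file: increase the count on line 2 and append the entry (step 6).
  awk 'NR == 2 { print $1 + 1; next } { print }' "$dst.bak" > "$dst" || exit 1
  printf '%s\n' "$entry" >> "$dst"
)

# move_partition_checkpoints SRC_LOG_DIR DST_LOG_DIR TOPIC PARTITION
# Applies the same change to both checkpoint files named in steps 5 and 6.
move_partition_checkpoints() (
  src_dir=$1; dst_dir=$2; topic=$3; part=$4
  for f in recovery-point-offset-checkpoint replication-offset-checkpoint; do
    move_checkpoint_entry "$src_dir/$f" "$dst_dir/$f" "$topic" "$part" || exit 1
  done
)
```

Each file is read separately because the offset recorded for the same partition can differ between the two checkpoint files. The files are rewritten in place, so their existing owner and permissions are kept.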
The Broker fault occurs because of other causes, the fault scenario is clear, and the fault can be rectified within a short period of time.
1. Restore the faulty Broker according to the root cause.
2. After the faulty Broker is recovered, the blocked balancing task continues. You can run the --status command to view the task execution progress.
The Broker fault occurs because of other causes, the fault scenario is complex, and the fault cannot be rectified within a short period of time.
1. Run the kinit <Kafka administrator account> command (skip this step in normal mode).
2. Run the zkCli.sh -server <ZooKeeper cluster service IP address:zkPort/kafka> command to log in to ZooKeeper Shell.
3. Run the addauth krbgroup command (skip this step in normal mode).
4. Delete the /admin/reassign_partitions and /controller directories.
5. The preceding steps forcibly stop the migration. After the cluster recovers, run the kafka-reassign-partitions.sh command to delete the redundant replicas generated during the intermediate process.
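The kafka-reassign-partitions.sh tool takes a JSON file that lists the replicas each partition should keep; replicas not in the list are removed. The following is a minimal sketch that produces such a file for a single partition. The function name reassignment_json and the topic, partition, and broker IDs in the usage note are hypothetical, and the sketch assumes the standard Kafka reassignment JSON format (version 1).

```shell
#!/bin/sh
# Sketch only: prints a reassignment JSON document for one partition.
# reassignment_json is a hypothetical helper name.

# reassignment_json TOPIC PARTITION REPLICAS
# REPLICAS is a comma-separated list of the broker IDs to keep, for example 1,2,3.
reassignment_json() {
  case "$2" in
    ''|*[!0-9]*) echo "partition must be a number: $2" >&2; return 1 ;;
  esac
  case "$3" in
    ''|*[!0-9,]*) echo "replicas must be broker IDs separated by commas: $3" >&2; return 1 ;;
  esac
  printf '{"version":1,"partitions":[{"topic":"%s","partition":%s,"replicas":[%s]}]}\n' \
    "$1" "$2" "$3"
}
```

For example, reassignment_json demo 0 1,2,3 > reassign.json writes a file that keeps replicas of partition 0 of topic demo on brokers 1, 2, and 3. Assuming the installed Kafka version still accepts the --zookeeper option used elsewhere in this section, the file can then be applied with ./bin/kafka-reassign-partitions.sh --zookeeper <Service IP address of any ZooKeeper node:zkPort/kafka> --reassignment-json-file reassign.json --execute.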