Handling Service Overload

High CPU usage and full disks indicate overloaded Kafka services.

High CPU usage degrades system performance and increases the risk of hardware damage.
If a disk is full, the Kafka log data stored on it goes offline. The partition replicas on that disk can then no longer be read or written, which reduces partition availability and fault tolerance. Partition leadership also switches to another broker, adding load to that broker.

Causes of high CPU usage

Too many data operation threads are configured through num.io.threads, num.network.threads, and num.replica.fetchers.
Partitions are improperly assigned, so one broker carries all production and consumption traffic.

Causes of full disk

Current disk space no longer meets the needs of the rapidly increasing service data volume.
Broker disk usage is unbalanced. All produced messages go to one partition, filling the disk that holds that partition.
The time to live (TTL) set for a topic is too long, so old data takes up too much disk space.
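
The effect of TTL on disk usage can be illustrated with a rough estimate. The sketch below assumes a constant write rate and ignores compression, index files, and segment overhead. The function name and the workload figures (5 MB/s, 3 replicas) are hypothetical and are not taken from any real cluster.

```python
def estimated_disk_bytes(ingress_bytes_per_sec: int,
                         retention_hours: int,
                         replication_factor: int) -> int:
    """Rough steady-state disk footprint of one topic across the cluster.

    Assumes a constant write rate and ignores compression, index files,
    and segment overhead, so treat the result as an approximation only.
    """
    retention_seconds = retention_hours * 3600
    return ingress_bytes_per_sec * retention_seconds * replication_factor


# Hypothetical workload: 5 MB/s written, 3 replicas of every partition.
MB = 1024 * 1024
week = estimated_disk_bytes(5 * MB, retention_hours=168, replication_factor=3)
day = estimated_disk_bytes(5 * MB, retention_hours=24, replication_factor=3)

print(f"7-day TTL: {week / 1024**4:.2f} TiB")
print(f"1-day TTL: {day / 1024**4:.2f} TiB")
```

Under these assumptions, disk usage grows linearly with the TTL, so a TTL that is seven times longer keeps seven times as much data on disk.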

Handling high CPU usage:

Optimize the thread parameters num.io.threads, num.network.threads, and num.replica.fetchers.
- Set num.io.threads and num.network.threads to multiples of the number of disks. Do not exceed the number of CPU cores.
- Set num.replica.fetchers to 5 or less.
Set topic partitions properly. Set the number of partitions to a multiple of the number of brokers.
Attach a random suffix to each message key so that messages can be evenly distributed in partitions.
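
The thread and partition rules above can be checked with a small script before a configuration is applied. This is a minimal sketch, not an official tool. The function names are hypothetical, and it assumes that the CPU core limit applies to each thread setting separately.

```python
def check_broker_settings(settings: dict, disks: int, cpu_cores: int) -> list:
    """Return a list of rule violations for the given thread settings."""
    problems = []
    for name in ("num.io.threads", "num.network.threads"):
        value = settings[name]
        if value % disks != 0:
            problems.append(f"{name}={value} is not a multiple of {disks} disks")
        # Assumption: the core limit is applied to each setting on its own.
        if value > cpu_cores:
            problems.append(f"{name}={value} exceeds {cpu_cores} CPU cores")
    fetchers = settings["num.replica.fetchers"]
    if fetchers > 5:
        problems.append(f"num.replica.fetchers={fetchers} is greater than 5")
    return problems


def partitions_match_brokers(partitions: int, brokers: int) -> bool:
    """True if the partition count is a multiple of the broker count."""
    return partitions % brokers == 0


# Hypothetical broker with 4 disks and 8 CPU cores.
settings = {
    "num.io.threads": 8,
    "num.network.threads": 6,
    "num.replica.fetchers": 8,
}
for problem in check_broker_settings(settings, disks=4, cpu_cores=8):
    print(problem)

print(partitions_match_brokers(partitions=12, brokers=3))
```

In this example, num.network.threads and num.replica.fetchers are reported, while num.io.threads and the partition count pass the checks.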

In practice, attaching a random suffix to each message key compromises global message ordering. Decide whether to use a suffix based on your service requirements.
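
The random-suffix approach can be sketched as follows. The helper names, the key order-42, and the bucket count of 16 are hypothetical. The toy_partition function uses CRC32 only as a stand-in for a key-hash partitioner and is not the hash that Kafka itself uses.

```python
import random
import zlib
from collections import Counter


def salted_key(key: str, buckets: int, rng: random.Random) -> bytes:
    """Append a random suffix in [0, buckets) to spread one hot key over many keys."""
    return f"{key}-{rng.randrange(buckets)}".encode("utf-8")


def toy_partition(key: bytes, partitions: int) -> int:
    """Stand-in for a key-hash partitioner; not Kafka's actual hash function."""
    return zlib.crc32(key) % partitions


rng = random.Random(0)  # fixed seed so the demo is repeatable
partitions = 6

# Every message shares one key, so every message lands in one partition.
plain = Counter(toy_partition(b"order-42", partitions) for _ in range(1000))

# With a suffix, the same messages are hashed under up to 16 distinct keys.
salted = Counter(
    toy_partition(salted_key("order-42", 16, rng), partitions)
    for _ in range(1000)
)

print("without suffix:", dict(plain))
print("with suffix:   ", dict(salted))
```

Because messages that shared one key are now hashed under different keys, they may land in different partitions, which is why their relative order is no longer guaranteed.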

Handling full disk: