Updated on 2024-06-19 GMT+08:00

Troubleshooting High CPU Usage of a DCS Redis Instance

Symptom

The CPU usage of a Redis instance increases dramatically within a short period of time. If the CPU usage is too high, connections may time out, and a master/standby switchover may be triggered.

Possible Causes

  1. The service QPS is high. In this case, refer to Checking QPS.
  2. Resource-consuming commands, such as KEYS, were used. In this case, refer to Locating and Disabling CPU-Intensive Commands.
  3. Redis rewrite was triggered. In this case, refer to Checking Redis Rewrite.

Checking QPS

On the Cache Manager page of the DCS console, click an instance to go to the instance details page. On the left menu, choose Performance Monitoring and then view the Ops per Second metric.

If the QPS is high, optimize your services or modify the instance specifications. For details about the QPS supported by different instance specifications, see DCS Instance Specifications.
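The Ops per Second metric on the console is the reference value. If the instance is reachable from a client and the INFO command is permitted (an assumption, not something this page states), the same figure can be roughly cross-checked from two readings of the `total_commands_processed` counter in the Redis `INFO stats` output. The following is a minimal sketch: the function name is hypothetical, and the dictionaries stand in for whatever your client returns for `INFO stats`.

```python
def ops_per_second(first_sample, second_sample, interval_seconds):
    """Average commands per second between two INFO stats samples.

    Each sample is a dict holding the Redis `total_commands_processed`
    counter, read `interval_seconds` apart.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")

    delta = (second_sample["total_commands_processed"]
             - first_sample["total_commands_processed"])
    if delta < 0:
        # The counter went backwards, for example after a restart or a
        # statistics reset, so the two samples cannot be compared.
        raise ValueError("counter was reset between samples")

    return delta / interval_seconds
```

For example, two samples of 1000 and 4000 commands taken 10 seconds apart give an average of 300 operations per second.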

Locating and Disabling CPU-Intensive Commands

Resource-consuming commands are those with a time complexity of O(N) or higher, such as KEYS. Generally, the higher the time complexity, the more resources a command uses. When such commands are run, CPU usage rises and a master/standby switchover can easily be triggered. For details about the time complexity of each command, visit the Redis official website. If KEYS is the cause, use the SCAN command instead or disable the KEYS command.
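As an illustration of replacing KEYS with SCAN, the sketch below walks the keyspace in small batches instead of in one blocking call. It assumes a client object that exposes `scan(cursor, match, count)` and returns a `(next_cursor, keys)` pair, as the redis-py client does. The function name `iter_keys` and the default batch size are hypothetical.

```python
def iter_keys(client, pattern="*", batch_size=100):
    """Yield keys matching `pattern` using incremental SCAN calls.

    Each call asks the server to examine roughly `batch_size` keys,
    so no single command has to walk the whole keyspace the way KEYS does.
    """
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=batch_size)
        yield from keys
        if cursor == 0:
            # SCAN reports a cursor of 0 when the iteration is complete.
            break
```

SCAN may return the same key more than once during an iteration, so callers that need unique keys should deduplicate the results, for example by collecting them into a set.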

  1. On the Performance Monitoring page of the DCS console, locate the period when the CPU usage is high.

  2. Use the following methods to find the commands that consume a large number of resources.

    • Redis logs queries that exceed a specified execution duration. You can find the commands that consume a large number of resources by analyzing the slow queries and their execution duration. For details, see Viewing Redis Slow Queries.
    • Use the instance diagnosis function to analyze the execution duration percentage of different commands during the period when the CPU usage is high. For details, see Diagnosing an Instance.

  3. Resolve the problem.

    • Evaluate and disable high-risk and high-consumption commands, such as FLUSHALL, KEYS, and HGETALL.
    • Optimize services. For example, avoid frequent data sorting operations.
    • (Optional) Perform the following operations to adjust instances based on service requirements:
      • Change the instance type to read/write splitting to separate read and write requests from high-consumption commands or applications.
      • Scale up the instance.
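When analyzing slow queries in step 2, it can help to group the entries by command name and rank them by total execution time. The sketch below assumes the slow query entries have already been exported as dictionaries with a `command` field (text or bytes) and a `duration` field in microseconds, which is the shape redis-py returns for SLOWLOG GET. The function name and the sample data in the usage note are hypothetical.

```python
from collections import defaultdict


def summarize_slowlog(entries):
    """Rank command names by the total time they spent in the slow log.

    Returns a list of (command_name, count, total_microseconds) tuples,
    sorted with the most expensive command first.
    """
    totals = defaultdict(lambda: {"count": 0, "total_us": 0})

    for entry in entries:
        command = entry["command"]
        if isinstance(command, bytes):
            command = command.decode("utf-8", errors="replace")

        # The first token is the command name, e.g. "KEYS" in "KEYS user:*".
        parts = command.split()
        name = parts[0].upper() if parts else "(empty)"

        totals[name]["count"] += 1
        totals[name]["total_us"] += entry["duration"]

    return sorted(
        ((name, stats["count"], stats["total_us"]) for name, stats in totals.items()),
        key=lambda row: row[2],
        reverse=True,
    )
```

Commands that appear at the top of the result during the high-CPU period are the first candidates for optimization or disabling.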

Checking Redis Rewrite

AOF persistence is enabled by default for master/standby and cluster DCS Redis instances. AOF rewrite takes place in the following scenarios:

  • If a small amount of data is written and the AOF file is not large, AOF rewrite is performed from 01:00 to 04:00 in the morning every day, and CPU usage may suddenly spike during this period.
  • When a large amount of data is written and the AOF file size exceeds the threshold (three to five times the DCS instance capacity), AOF rewrite is automatically triggered in the background regardless of the current time.

Redis rewrite is performed by running the BGSAVE or BGREWRITEAOF command. Both commands call fork() to create a child process, which may consume significant CPU resources (see the discussion) and cause CPU usage to spike for a short period of time.
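To check whether a CPU spike coincides with a rewrite, the persistence fields in the Redis INFO output can be inspected. The sketch below is a minimal example under two assumptions: that the INFO command is available on the instance, and that the client has parsed the output into a dictionary with integer values (as redis-py does). The fields `aof_rewrite_in_progress`, `rdb_bgsave_in_progress`, and `latest_fork_usec` are standard Redis INFO fields. The function name is hypothetical.

```python
def rewrite_status(info):
    """Summarize rewrite-related activity from a parsed INFO dictionary."""
    return {
        # 1 while a BGREWRITEAOF child process is running.
        "aof_rewrite_running": info.get("aof_rewrite_in_progress", 0) == 1,
        # 1 while a BGSAVE child process is running.
        "rdb_save_running": info.get("rdb_bgsave_in_progress", 0) == 1,
        # Duration of the most recent fork(), converted to milliseconds.
        "last_fork_ms": info.get("latest_fork_usec", 0) / 1000,
    }
```

If a rewrite is running during the period of high CPU usage, the spike is likely caused by the rewrite rather than by service traffic.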

If persistence is not required, disable it by changing the value of appendonly to no on the Parameters page of the instance. However, with persistence disabled, data is no longer flushed to disk, so data may be lost in extreme situations.