Updated on 2024-06-20 GMT+08:00

Configuring Alarm Rules for Critical Metrics

This section describes recommended alarm policies for critical metrics and how to configure the corresponding alarm rules. In practice, configure alarm rules for your metrics by referring to the following alarm policies.

Alarm Policies for DCS Redis Instances

Table 1 DCS Redis instance metrics to configure alarm rules for

| Metric | Value Range | Alarm Policy | Approach Upper Limit | Handling Suggestion |
| --- | --- | --- | --- | --- |
| CPU Usage | 0–100% | Alarm threshold: > 70%; Number of consecutive periods: 2; Alarm severity: Major | No | Consider capacity expansion based on the service analysis. The CPU capacity of a single-node or master/standby instance cannot be expanded. If you need larger capacity, use a cluster instance instead. This metric is available only for single-node, master/standby, and Proxy Cluster instances. For Redis Cluster instances, this metric is available only on the Redis Server level. You can view the metric on the Redis Server tab page on the Performance Monitoring page of the instance. |
| Average CPU Usage | 0–100% | Alarm threshold: > 70%; Number of consecutive periods: 2; Alarm severity: Major | No | Consider capacity expansion based on the service analysis. The CPU capacity of a single-node or master/standby instance cannot be expanded. If you need larger capacity, use a cluster instance instead. This metric is available only for single-node, master/standby, and Proxy Cluster instances. For Redis Cluster instances, this metric is available only on the Redis Server level. You can view the metric on the Redis Server tab page on the Performance Monitoring page of the instance. |
| Memory Usage | 0–100% | Alarm threshold: > 70%; Number of consecutive periods: 2; Alarm severity: Critical | No | Expand the capacity of the instance. |
| Connected Clients | 0–10,000 | Alarm threshold: > 8000; Number of consecutive periods: 2; Alarm severity: Major | No | Optimize the connection pool in the service code to prevent the number of connections from exceeding the maximum limit. Configure this alarm policy on the instance level for single-node and master/standby instances. For cluster instances, configure this alarm policy on the Redis Server and Proxy level. For single-node and master/standby instances, the maximum number of connections allowed is 10,000. You can adjust the threshold based on service requirements. |
| New Connections (Count/min) | ≥ 0 | Alarm threshold: > 10,000; Number of consecutive periods: 2; Alarm severity: Minor | - | Check whether connect is used and whether the client connection is abnormal. Use persistent connections (for example, pconnect in the phpredis client) to ensure performance. Configure this alarm policy on the instance level for single-node and master/standby instances. For cluster instances, configure this alarm policy on the Redis Server and Proxy level. |
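
To make the "Number of consecutive periods" setting concrete, the following is a minimal Python sketch of how such a policy could be evaluated. It assumes one metric sample per monitoring period and that an alarm fires only when the most recent N samples all exceed the threshold. The `AlarmPolicy` class and `alarm_triggered` function are hypothetical names used for illustration, not part of any DCS or Cloud Eye API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmPolicy:
    """One table row: threshold, consecutive periods, and severity (illustrative)."""
    metric: str
    threshold: float
    consecutive_periods: int
    severity: str


def alarm_triggered(policy: AlarmPolicy, samples: list[float]) -> bool:
    """Return True if the latest `consecutive_periods` samples all exceed the threshold."""
    n = policy.consecutive_periods
    if len(samples) < n:
        return False  # not enough history yet
    return all(value > policy.threshold for value in samples[-n:])


# CPU Usage policy from the table: > 70% for 2 consecutive periods, Major.
cpu_policy = AlarmPolicy("CPU Usage", threshold=70.0, consecutive_periods=2, severity="Major")

print(alarm_triggered(cpu_policy, [40.0, 85.0, 60.0]))  # False: a single spike
print(alarm_triggered(cpu_policy, [40.0, 85.0, 91.0]))  # True: two periods in a row
```

Requiring two consecutive periods filters out one-off spikes, which is why most policies in the table use 2 rather than 1.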

Alarm Policies for Redis Server Nodes of Cluster DCS Redis Instances

Table 2 Redis server metrics to configure alarm policies for

| Metric | Value Range | Alarm Policy | Approach Upper Limit | Handling Suggestion |
| --- | --- | --- | --- | --- |
| CPU Usage | 0–100% | Alarm threshold: > 70%; Number of consecutive periods: 2; Alarm severity: Major | No | Check the service for traffic surge. Check whether the CPU usage is evenly distributed to Redis Server nodes. If the CPU usage is high on multiple nodes, consider capacity expansion. Expanding the capacity of a cluster instance will scale out nodes to share the CPU pressure. If the CPU usage is high on a single node, check whether hot keys exist. If yes, optimize the service code to eliminate hot keys. |
| Average CPU Usage | 0–100% | Alarm threshold: > 70%; Number of consecutive periods: 2; Alarm severity: Major | No | Consider capacity expansion based on the service analysis. The CPU capacity of a single-node or master/standby instance cannot be expanded. If you need larger capacity, use a cluster instance instead. This metric is available only for single-node, master/standby, and Proxy Cluster instances. For Redis Cluster instances, this metric is available only on the Redis Server level. You can view the metric on the Redis Server tab page on the Performance Monitoring page of the instance. |
| Memory Usage | 0–100% | Alarm threshold: > 70%; Number of consecutive periods: 2; Alarm severity: Major | No | Check the service for traffic surge. Check whether the memory usage is evenly distributed to Redis Server nodes. If the memory usage is high on multiple nodes, consider capacity expansion. If the memory usage is high on a single node, check whether big keys exist. If yes, optimize the service code to eliminate big keys. |
| Connected Clients | 0–10,000 | Alarm threshold: > 8000; Number of consecutive periods: 2; Alarm severity: Major | No | Check whether the number of connections is within the appropriate range. If yes, adjust the alarm threshold. |
| New Connections | ≥ 0 | Alarm threshold: > 10,000; Number of consecutive periods: 2; Alarm severity: Minor | - | Check whether connect is used. To ensure performance, use persistent connections (for example, pconnect in the phpredis client). |
| Slow Query Logs | 0–1 | Alarm threshold: > 0; Number of consecutive periods: 1; Alarm severity: Major | - | Use the slow query function on the console to analyze slow commands. |
| Bandwidth Usage | 0–200% | Alarm threshold: > 90%; Number of consecutive periods: 2; Alarm severity: Major | Yes | Check whether the bandwidth usage increase comes from read services or write services based on the input and output flow. If the bandwidth usage of a single node is high, check whether big keys exist. Even if the bandwidth usage exceeds 100%, flow control is not necessarily performed. The actual flow control is subject to the Flow Control Times metric. Even if the bandwidth usage is below 100%, flow control may be performed. The real-time bandwidth usage is reported once in every reporting period, whereas the Flow Control Times metric is reported every second. During a reporting period, the traffic may surge within seconds and then fall back, so that by the time the bandwidth usage is reported, it has returned to the normal level. |
| Flow Control Times | ≥ 0 | Alarm threshold: > 0; Number of consecutive periods: 1; Alarm severity: Critical | Yes | Consider capacity expansion based on the specification limits, input flow, and output flow. |
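
The Bandwidth Usage notes above can be illustrated with a small simulation of why flow control may occur while the reported bandwidth usage stays well below 100%. This is a simplified sketch under stated assumptions: the reported usage is modeled as the average over a 60-second reporting period, flow control is modeled as any second whose traffic exceeds the limit, and the 100 Mbit/s limit is a made-up value. The actual flow control behavior of DCS may differ.

```python
LIMIT_MBPS = 100.0  # hypothetical bandwidth limit of one node


def bandwidth_usage_percent(per_second_mbps: list[float]) -> float:
    """Average traffic over the reporting period as a percentage of the limit."""
    return 100 * sum(per_second_mbps) / (len(per_second_mbps) * LIMIT_MBPS)


def flow_control_times(per_second_mbps: list[float]) -> int:
    """Count the seconds whose traffic exceeds the limit (simplified trigger)."""
    return sum(1 for mbps in per_second_mbps if mbps > LIMIT_MBPS)


# One 60-second period: 57 quiet seconds at 20 Mbit/s and a 3-second burst at 150 Mbit/s.
period = [20.0] * 57 + [150.0] * 3

print(bandwidth_usage_percent(period))  # far below the 90% alarm threshold
print(flow_control_times(period))       # yet the burst seconds are counted
```

In this model the averaged usage is about 26.5% while three seconds exceed the limit, which is why the Flow Control Times alarm uses a threshold of 0 and a single period.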

Alarm Policies for Proxy Nodes of Cluster DCS Redis Instances

Table 3 Proxy metrics to configure alarm policies for

| Metric | Value Range | Alarm Policy | Approach Upper Limit | Handling Suggestion |
| --- | --- | --- | --- | --- |
| CPU Usage | 0–100% | Alarm threshold: > 70%; Number of consecutive periods: 2; Alarm severity: Critical | Yes | Consider capacity expansion, which will add proxies. |
| Memory Usage | 0–100% | Alarm threshold: > 70%; Number of consecutive periods: 2; Alarm severity: Critical | Yes | Consider capacity expansion, which will add proxies. |
| Connected Clients | 0–30,000 | Alarm threshold: > 20,000; Number of consecutive periods: 2; Alarm severity: Major | No | Optimize the connection pool in the service code to prevent the number of connections from exceeding the maximum limit. |
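
As an illustration of the "optimize the connection pool" suggestion, the following standard-library sketch shows the general idea of a bounded pool: connections are reused, and callers wait for a free connection instead of opening a new one once the cap is reached. The `BoundedPool` class is hypothetical and `object` stands in for a real client connection. In a real service, you would more likely rely on the pooling options of your Redis client library.

```python
import threading
from contextlib import contextmanager


class BoundedPool:
    """Hands out at most `max_connections` connections and reuses idle ones."""

    def __init__(self, factory, max_connections):
        self._factory = factory          # callable that opens one connection
        self._idle = []                  # connections waiting to be reused
        self._lock = threading.Lock()
        self._slots = threading.BoundedSemaphore(max_connections)
        self.created = 0                 # total connections ever opened

    @contextmanager
    def connection(self, timeout=5.0):
        # Wait for a free slot rather than opening an extra connection.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("connection pool exhausted")
        conn = None
        try:
            with self._lock:
                if self._idle:
                    conn = self._idle.pop()
                else:
                    conn = self._factory()
                    self.created += 1
            yield conn
        finally:
            if conn is not None:
                with self._lock:
                    self._idle.append(conn)  # return the connection for reuse
            self._slots.release()


pool = BoundedPool(factory=object, max_connections=2)
for _ in range(100):
    with pool.connection() as conn:
        pass  # a real caller would issue commands on `conn` here

print(pool.created)  # 1: the same connection served all 100 requests
```

Reusing connections this way keeps both Connected Clients and New Connections low, because sequential requests never open more than one connection.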

Configuring an Alarm Rule for a Resource Group

Cloud Eye allows you to add DCS instances, Redis Server nodes, and proxy nodes to resource groups and manage instances and alarm rules by group to simplify O&M. For details, see Creating a Resource Group.

  1. Create a resource group.

    1. Log in to the Cloud Eye console. In the navigation pane, choose Resource Groups and then click Create Resource Group in the upper right corner.
    2. Enter a group name and add Redis Server nodes to the resource group.

      You can add Redis Server nodes of different instances to the same resource group.

      Figure 1 Creating a resource group
    3. Click Create.

  2. In the navigation pane of the Cloud Eye console, choose Alarm Management > Alarm Rules and then click Create Alarm Rule to set alarm information for the resource group.

    Create a CPU usage alarm rule for all Redis Server nodes in the resource group, as shown in the following figure.

    Figure 2 Creating an alarm rule for a resource group

  3. Click Create.
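
Conceptually, a resource group alarm rule applies one policy to every resource in the group. The sketch below models this with plain Python data. The node names and CPU samples are invented for illustration, and the evaluation logic (latest N samples all above the threshold) is an assumption rather than a description of how Cloud Eye is implemented.

```python
# Hypothetical resource group: Redis Server nodes from two different instances,
# each with its most recent CPU usage samples (one per monitoring period).
resource_group = {
    "instance-a/redis-server-1": [35.0, 72.0, 88.0],
    "instance-a/redis-server-2": [30.0, 41.0, 39.0],
    "instance-b/redis-server-1": [75.0, 81.0, 64.0],
}

CPU_THRESHOLD = 70.0       # percent, as in the alarm policy tables
CONSECUTIVE_PERIODS = 2


def nodes_in_alarm(group: dict[str, list[float]]) -> list[str]:
    """Apply the same rule to every node and return the nodes that breach it."""
    return [
        node
        for node, samples in group.items()
        if len(samples) >= CONSECUTIVE_PERIODS
        and all(v > CPU_THRESHOLD for v in samples[-CONSECUTIVE_PERIODS:])
    ]


print(nodes_in_alarm(resource_group))  # ['instance-a/redis-server-1']
```

Only the first node exceeds 70% in both of its latest two periods, so a single rule on the group pinpoints the affected node without a rule per node.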

Configuring an Alarm Rule for a Specific Resource

In the following example, an alarm rule is set for the Slow Query Logs (is_slow_log_exist) metric.

  1. Log in to the management console, and choose Application > Distributed Cache Service in the service list.
  2. In the upper left corner of the management console, select the region where your instance is located.
  3. In the navigation pane, choose Cache Manager.
  4. In the row containing the DCS instance whose metrics you want to view, click View Metric in the Operation column.

    Figure 3 Viewing instance metrics

  5. On the displayed page, locate the Slow Query Logs metric. Hover over the metric and click the icon that appears to create an alarm rule for the metric.

    The Create Alarm Rule page is displayed.

  6. Specify the alarm information.

    1. Set the alarm name and description.
    2. Specify the alarm policy and alarm severity.
      For example, the alarm policy shown in Figure 4 indicates that an alarm is triggered if slow queries exist in the instance for two consecutive periods. If no action is taken, the alarm is triggered again once every day until the value of this metric returns to 0.
      Figure 4 Setting the alarm content
    3. Set the alarm notification configurations. If you enable Alarm Notification, set the validity period, notification object, and trigger condition.
    4. Click Create.
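
The behavior described for the example policy (trigger after two consecutive periods, then repeat once a day until the metric returns to 0) can be sketched as follows. This is an illustrative model only: the one-hour period length and the `notification_periods` function are assumptions, and the actual period and repeat interval are whatever you select when creating the rule.

```python
def notification_periods(samples: list[int], consecutive_periods: int = 2,
                         repeat_every: int = 24) -> list[int]:
    """Return the period indexes at which an alarm notification would be sent.

    `samples` holds one is_slow_log_exist value (0 or 1) per period. Periods are
    assumed to be one hour long, so 24 periods correspond to one day.
    """
    sent = []
    streak = 0          # consecutive periods in which slow queries exist
    last_sent = None    # index of the most recent notification in this alarm
    for index, value in enumerate(samples):
        if value == 0:
            streak, last_sent = 0, None  # metric back to 0: the alarm clears
            continue
        streak += 1
        in_alarm = streak >= consecutive_periods
        if in_alarm and (last_sent is None or index - last_sent >= repeat_every):
            sent.append(index)
            last_sent = index
    return sent


# Slow queries exist for 50 hours, disappear for one hour, then come back.
samples = [0] + [1] * 50 + [0, 1, 1]
print(notification_periods(samples))  # [2, 26, 50, 53]
```

The first notification arrives in the second consecutive period with slow queries, repeats every 24 periods while the condition persists, and starts over after the metric has returned to 0.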