Configuring DCS Monitoring and Alarms

This section lists alarm policies for selected DCS metrics and describes how to configure alarm rules. When setting up alarm rules for your own workloads, use the following policies as a reference.

Alarm Policies for DCS Redis Instances

**Table 1** DCS Redis instance metrics to configure alarm rules for
| Metric | Value Range | Alarm Policy | Approach Upper Limit | Handling Suggestion |
| --- | --- | --- | --- | --- |
| CPU Usage | 0–100 (unit: %) | Alarm threshold: > 70%; number of consecutive periods: 2; alarm severity: Major | No | Consider capacity expansion based on an analysis of the service. The CPU capacity of a single-node or master/standby instance cannot be expanded. If you need larger capacity, use a cluster instance instead. This metric is available only for single-node, master/standby, and Proxy Cluster instances. For Redis Cluster instances, this metric is available only at the Redis Server level. You can view the metric on the Redis Server tab page on the Performance Monitoring page of the instance. |
| Average CPU Usage | 0–100 (unit: %) | Alarm threshold: > 70%; number of consecutive periods: 2; alarm severity: Major | No | Consider capacity expansion based on an analysis of the service. The CPU capacity of a single-node or master/standby instance cannot be expanded. If you need larger capacity, use a cluster instance instead. This metric is available only for single-node, master/standby, and Proxy Cluster instances. For Redis Cluster instances, this metric is available only at the Redis Server level. You can view the metric on the Redis Server tab page on the Performance Monitoring page of the instance. |
| Memory Usage | 0–100 (unit: %) | Alarm threshold: > 70%; number of consecutive periods: 2; alarm severity: Critical | No | Expand the capacity of the instance. |
| Connected Clients | 0–10,000 | Alarm threshold: > 8,000; number of consecutive periods: 2; alarm severity: Major | No | Optimize the connection pool in the service code to prevent the number of connections from exceeding the maximum limit. Configure this alarm policy at the instance level for single-node and master/standby instances. For cluster instances, configure this alarm policy at the Redis Server and Proxy level. For single-node and master/standby instances, the maximum number of connections allowed is 10,000. You can adjust the threshold based on service requirements. |
| New Connections (Count/min) | ≥ 0 | Alarm threshold: > 10,000; number of consecutive periods: 2; alarm severity: Minor | - | Check whether the client opens short-lived connections (connect) and whether client connections are abnormal. Use persistent connections (for example, pconnect in PHP clients) to ensure performance. Configure this alarm policy at the instance level for single-node and master/standby instances. For cluster instances, configure this alarm policy at the Redis Server and Proxy level. |
| Input Flow | ≥ 0 | Alarm threshold: > 80% of the assured bandwidth; number of consecutive periods: 2; alarm severity: Major | Yes | Consider capacity expansion based on an analysis of the service and the bandwidth limit. Configure this alarm only for single-node and master/standby DCS Redis 3.0 instances, and set the alarm threshold to 80% of the assured bandwidth of the instance. |
| Output Flow | ≥ 0 | Alarm threshold: > 80% of the assured bandwidth; number of consecutive periods: 2; alarm severity: Major | Yes | Consider capacity expansion based on an analysis of the service and the bandwidth limit. Configure this alarm only for single-node and master/standby DCS Redis 3.0 instances, and set the alarm threshold to 80% of the assured bandwidth of the instance. |
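
Each policy above combines a threshold with a number of consecutive periods. As a rough illustration of how the two interact, the following Python sketch checks a series of metric samples against a rule such as "> 70% for 2 consecutive periods". The function name `alarm_triggered` and the sample values are hypothetical, and the sketch assumes one sample per monitoring period; it shows the logic of the rule only and is not how Cloud Eye is implemented.

```python
from typing import Iterable


def alarm_triggered(samples: Iterable[float], threshold: float, periods: int) -> bool:
    """Return True if `periods` consecutive samples exceed `threshold`.

    Assumes one sample per monitoring period (an illustrative simplification).
    """
    streak = 0
    for value in samples:
        # A sample at or below the threshold resets the streak.
        streak = streak + 1 if value > threshold else 0
        if streak >= periods:
            return True
    return False


# Hypothetical CPU usage samples (%), one per period.
cpu_usage = [65.0, 72.0, 68.0, 71.0, 75.0]
print(alarm_triggered(cpu_usage, threshold=70.0, periods=2))  # True
```

With these samples, the isolated spike to 72% does not raise an alarm; the rule is satisfied only when 71% and 75% occur back to back.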

Alarm Policies for DCS Memcached Instances

**Table 2** DCS Memcached instance metrics to configure alarm rules for
| Metric | Value Range | Alarm Policy | Approach Upper Limit | Handling Suggestion |
| --- | --- | --- | --- | --- |
| CPU Usage | 0–100 (unit: %) | Alarm threshold: > 70%; number of consecutive periods: 2; alarm severity: Major | No | Check the service for traffic surges. The CPU capacity of a single-node or master/standby instance cannot be expanded. Analyze the service and consider splitting it or combining multiple instances into a cluster on the client side. |
| Memory Usage | 0–100 (unit: %) | Alarm threshold: > 65%; number of consecutive periods: 2; alarm severity: Minor | No | Consider expanding the instance capacity. |
| Connected Clients | 0–10,000 | Alarm threshold: > 8,000; number of consecutive periods: 2; alarm severity: Major | No | Optimize the connection pool in the service code to prevent the number of connections from exceeding the maximum limit. |
| New Connections | ≥ 0 | Alarm threshold: > 10,000; number of consecutive periods: 2; alarm severity: Minor | - | Check whether the client opens short-lived connections (connect) and whether client connections are abnormal. Use persistent connections (for example, pconnect in PHP clients) to ensure performance. |
| Input Flow | ≥ 0 | Alarm threshold: > 80% of the assured bandwidth; number of consecutive periods: 2; alarm severity: Major | Yes | Consider capacity expansion based on an analysis of the service and the bandwidth limit. For details about the bandwidth of different instance specifications, see DCS Instance Specifications. |
| Output Flow | ≥ 0 | Alarm threshold: > 80% of the assured bandwidth; number of consecutive periods: 2; alarm severity: Major | Yes | Consider capacity expansion based on an analysis of the service and the bandwidth limit. For details about the bandwidth of different instance specifications, see DCS Instance Specifications. |
| Authentication Failures | ≥ 0 | Alarm threshold: > 0; number of consecutive periods: 1; alarm severity: Critical | - | Check whether the password is entered correctly. |
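
The CPU Usage suggestion above mentions combining multiple instances into a cluster on the client side. One common way to do this is consistent hashing, where the client maps each key to one of several instances. The sketch below is a minimal illustration under stated assumptions: the class `HashRing`, the instance names, the use of SHA-256, and the choice of 100 virtual nodes per instance are all hypothetical and are not prescribed by DCS. Many client libraries ship their own sharding support, which is usually preferable to a hand-written ring.

```python
import bisect
import hashlib
from collections import Counter


class HashRing:
    """Map keys to instance names with consistent hashing (illustrative)."""

    def __init__(self, nodes, replicas=100):
        if not nodes:
            raise ValueError("at least one node is required")
        # Each node is placed on the ring `replicas` times (virtual nodes)
        # so that keys spread more evenly.
        ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._hashes = [h for h, _ in ring]
        self._nodes = [n for _, n in ring]

    @staticmethod
    def _hash(value: str) -> int:
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def node_for(self, key: str) -> str:
        # Pick the first virtual node clockwise from the key, wrapping around.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._nodes[idx]


# Hypothetical instance names; a real client would hold one connection
# (or pool) per instance and send each key to the instance returned here.
ring = HashRing(["memcached-a", "memcached-b", "memcached-c"])
counts = Counter(ring.node_for(f"user:{i}") for i in range(3000))
print(counts)
```

A property worth noting is that removing one instance from the ring only remaps the keys that were assigned to it; keys on the remaining instances stay where they are.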

Alarm Policies for Redis Server Nodes of DCS Redis Instances

**Table 3** Redis server metrics to configure alarm policies for
| Metric | Value Range | Alarm Policy | Approach Upper Limit | Handling Suggestion |
| --- | --- | --- | --- | --- |
| CPU Usage | 0–100 (unit: %) | Alarm threshold: > 70%; number of consecutive periods: 2; alarm severity: Major | No | Check the service for traffic surges. Check whether the CPU usage is evenly distributed across Redis Server nodes. If the CPU usage is high on multiple nodes, consider capacity expansion. Expanding the capacity of a cluster instance adds nodes to share the CPU load. If the CPU usage is high on a single node, check whether hot keys exist. If yes, optimize the service code to eliminate hot keys. |
| Average CPU Usage | 0–100 (unit: %) | Alarm threshold: > 70%; number of consecutive periods: 2; alarm severity: Major | No | Consider capacity expansion based on an analysis of the service. The CPU capacity of a single-node, read/write splitting, or master/standby instance cannot be expanded. If you need larger capacity, use a cluster instance instead. |
| Memory Usage | 0–100 (unit: %) | Alarm threshold: > 70%; number of consecutive periods: 2; alarm severity: Major | No | Check the service for traffic surges. Check whether the memory usage is evenly distributed across Redis Server nodes. If the memory usage is high on multiple nodes, consider capacity expansion. If the memory usage is high on a single node, check whether big keys exist. If yes, optimize the service code to eliminate big keys. |
| Connected Clients | 0–10,000 | Alarm threshold: > 8,000; number of consecutive periods: 2; alarm severity: Major | No | Check whether the number of connections is within the appropriate range. If yes, adjust the alarm threshold. |
| New Connections | ≥ 0 | Alarm threshold: > 10,000; number of consecutive periods: 2; alarm severity: Minor | - | Check whether the client opens short-lived connections (connect). To ensure performance, use persistent connections (for example, pconnect in PHP clients). |
| Slow Query Logs | 0–1 | Alarm threshold: > 0; number of consecutive periods: 1; alarm severity: Major | - | Use the slow query function on the console to analyze slow commands. |
| Bandwidth Usage | 0–200 (unit: %) | Alarm threshold: > 90%; number of consecutive periods: 2; alarm severity: Major | Yes | Based on the input and output flow, check whether the increase in bandwidth usage comes from read services or write services. If the bandwidth usage of a single node is high, check whether big keys exist. Bandwidth usage above 100% does not necessarily mean that flow control was performed; whether flow control actually occurred is indicated by the Flow Control Times metric. Conversely, flow control may be performed even when the reported bandwidth usage is below 100%. The real-time bandwidth usage is reported once per reporting period, whereas the Flow Control Times metric is reported every second. Traffic may surge for a few seconds within a reporting period and then fall back, so by the time the bandwidth usage is reported, it has already returned to a normal level. |
| Flow Control Times | ≥ 0 | Alarm threshold: > 0; number of consecutive periods: 1; alarm severity: Critical | Yes | Consider capacity expansion based on the specification limits, input flow, and output flow. Note: This metric is supported by Redis 4.0 and later, but not by Redis 3.0. |
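
The Bandwidth Usage row explains why flow control can occur even though the reported usage stays below 100%. The following Python sketch simulates that situation. It is a simplified model, not the DCS mechanism: the 60-second reporting period, the limit of 100 units per second, the traffic values, and the function name `summarize_period` are all made up for illustration, and the model assumes that usage is sampled at the end of the period and that every second above the limit counts as one flow control event (in practice, as noted above, exceeding the limit does not always lead to flow control).

```python
def summarize_period(per_second_traffic, limit):
    """Summarize one reporting period of per-second traffic samples.

    Returns (reported_usage_percent, flow_control_times). Usage is sampled
    at the end of the period; flow control is counted once per second in
    which traffic exceeds the limit (a simplifying assumption).
    """
    if not per_second_traffic:
        raise ValueError("need at least one sample")
    reported_usage = per_second_traffic[-1] / limit * 100
    flow_control_times = sum(1 for t in per_second_traffic if t > limit)
    return reported_usage, flow_control_times


# Hypothetical period: steady traffic of 40 units/s with a 3-second burst.
LIMIT = 100
traffic = [40] * 60
traffic[20:23] = [150, 180, 130]

usage, throttled = summarize_period(traffic, LIMIT)
print(f"reported usage: {usage:.0f}%, flow control times: {throttled}")
```

In this made-up period the reported usage is 40%, yet three seconds exceeded the limit. This is why the table recommends a separate alarm on Flow Control Times rather than relying on Bandwidth Usage alone.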

Alarm Policies for Proxy Nodes of DCS Redis Instances

**Table 4** Proxy metrics to configure alarm policies for
| Metric | Value Range | Alarm Policy | Approach Upper Limit | Handling Suggestion |
| --- | --- | --- | --- | --- |
| CPU Usage | 0–100 (unit: %) | Alarm threshold: > 70%; number of consecutive periods: 2; alarm severity: Critical | Yes | Consider capacity expansion, which will add proxies. |
| Memory Usage | 0–100 (unit: %) | Alarm threshold: > 70%; number of consecutive periods: 2; alarm severity: Critical | Yes | Consider capacity expansion, which will add proxies. |
| Connected Clients | 0–30,000 | Alarm threshold: > 20,000; number of consecutive periods: 2; alarm severity: Major | No | Optimize the connection pool in the service code to prevent the number of connections from exceeding the maximum limit. |
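
When optimizing the connection pool, it can help to work out how many connections each client process may hold so that the total stays within a chosen budget. The sketch below is a back-of-the-envelope calculation only. The function name `pool_size_per_process` and the figure of 40 client processes are hypothetical, and it assumes that all processes connect to the same proxy and that each uses a single, fully occupied pool.

```python
def pool_size_per_process(connection_budget: int, client_processes: int) -> int:
    """Return the largest per-process pool size that keeps the total
    number of connections at or below `connection_budget`."""
    if connection_budget <= 0 or client_processes <= 0:
        raise ValueError("both arguments must be positive")
    return connection_budget // client_processes


# Hypothetical deployment: 40 client processes, budgeted against the
# 20,000-connection alarm threshold from the table above.
size = pool_size_per_process(connection_budget=20_000, client_processes=40)
print(size)       # 500
print(size * 40)  # 20000, which does not exceed the threshold
```

The result would then be used as the maximum pool size in the client library's configuration, for example the `max_connections` argument of `ConnectionPool` in redis-py.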

Configuring an Alarm Rule for a Resource Group

Cloud Eye allows you to add DCS instances, Redis Server nodes, and proxy nodes to resource groups and manage instances and alarm rules by group to simplify O&M.

1. Create a resource group. For details, see Creating a Resource Group.
2. In the navigation pane of the Cloud Eye console, choose Alarm Management > Alarm Rules. On the displayed page, click Create Alarm Rule in the upper right corner. Alternatively, click Create Alarm Rule in the Operation column of the created resource group.
3. Create alarm rules for all resources in the resource group. Figure 1 shows an example of creating a CPU usage alarm.

   Figure 1 Creating an alarm rule for a resource group
4. Select how to send alarm notifications and click Create.

Configuring an Alarm Rule for a Specific Resource

In the following example, an alarm rule is set for the Slow Query Logs (is_slow_log_exist) metric.

1. Log in to the DCS console.
2. In the upper left corner of the management console, select the region where your instance is located.
3. In the navigation pane, choose Cache Manager.
4. In the row containing the DCS instance whose metrics you want to view, click View Metric in the Operation column.

   Figure 2 Viewing instance metrics
5. On the displayed page, locate the Slow Query Logs metric. Hover over the metric and click + to create an alarm rule for the metric.

   The Create Alarm Rule page is displayed.
6. Specify the alarm information.
   1. Set the alarm name and description.
   2. Specify the alarm policy and alarm severity.
      For example, the alarm policy shown in Figure 3 indicates that an alarm will be triggered if slow queries exist in the instance for two consecutive periods. If no action is taken, the alarm will be triggered once every day until the value of this metric returns to 0.
      Figure 3 Setting the alarm content
   3. Set the alarm notification configurations. If you enable Alarm Notification, set the validity period, notification object, and trigger condition.
   4. Click Create.
      - For more information about creating alarm rules, see Creating an Alarm Rule.
      - To modify or disable alarms, see Alarm Rule Management.
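
The example policy in this procedure (trigger after two consecutive periods, then repeat once a day until the metric returns to 0) can be sketched as a small simulation. This is an illustration of the described behavior only, not Cloud Eye's implementation. The function name `notification_times`, the 12-hour period length, and the sample values are hypothetical and chosen to keep the example short.

```python
def notification_times(samples, period_minutes, consecutive=2, repeat_minutes=24 * 60):
    """Return the times (minutes from start) at which the alarm fires.

    samples: is_slow_log_exist values (0 or 1), one per monitoring period.
    The alarm fires once `consecutive` periods in a row are above 0, then
    again every `repeat_minutes` until a sample returns to 0.
    """
    fired = []
    streak = 0
    last_fired = None
    for index, value in enumerate(samples):
        now = (index + 1) * period_minutes  # time at the end of this period
        if value <= 0:
            # The metric returned to 0: the alarm clears.
            streak = 0
            last_fired = None
            continue
        streak += 1
        due = last_fired is None or now - last_fired >= repeat_minutes
        if streak >= consecutive and due:
            fired.append(now)
            last_fired = now
    return fired


# Hypothetical samples with an unrealistically long 12-hour period.
samples = [1, 1, 1, 1, 1, 0, 1, 1]
print(notification_times(samples, period_minutes=720))  # [1440, 2880, 5760]
```

In this run the alarm first fires after the second consecutive non-zero sample, repeats one day later, clears when the metric drops to 0, and fires again once slow queries persist for two more periods.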