Configuring Alarm Thresholds for an MRS Cluster
Manager allows you to configure thresholds for monitoring metrics so that the health of each metric can be tracked. If a metric becomes abnormal and meets the preset conditions, the system triggers an alarm, which is displayed on the alarm page.
Configuring Alarm Thresholds for an MRS Cluster (MRS 3.x or Later)
- Log in to FusionInsight Manager.
- Choose O&M > Alarm > Thresholds.
- Select a monitoring metric for a host or service in the cluster.
Figure 1 Configuring the threshold for a metric
For example, after you select Host Memory Usage, the threshold information for this metric is displayed.
- If the alarm sending switch is turned on, an alarm is triggered when the threshold is reached.
- When Alarm Severity is enabled, hierarchical alarms are used: the system dynamically reports alarms of each severity based on real-time metric values and the hierarchical thresholds set for each severity. This function is supported by MRS 3.3.0 or later.
- Alarm ID and Alarm Name: information about the alarm triggered when the threshold is reached.
- Trigger Count: number of consecutive checks in which the metric must reach the threshold before an alarm is generated. Trigger Count is configurable; see the sketch after this procedure for how it interacts with Check Period.
- Check Period (s): interval, in seconds, at which the system checks the monitoring metric.
- The rules in the rule list are used to trigger alarms.
- Click Create Rule to add a rule for the monitoring metric.
Table 1 Monitoring metric rule parameters

| Parameter | Description | Example Value |
| --- | --- | --- |
| Rule Name | Name of the rule. | CPU_MAX |
| Severity | Select an alarm severity. | Critical, Major, Minor, or Warning |
| Threshold Type | Whether the maximum or minimum value of the metric triggers the alarm. If set to Max value, an alarm is generated when the metric is greater than the threshold. If set to Min value, an alarm is generated when the metric is less than the threshold. | Max value or Min value |
| Date | Date type on which the rule takes effect. | Daily, Weekly, or Others |
| Add Date | Available only when Date is set to Others. Sets the dates on which the rule takes effect; multiple dates can be selected. | 09-30 |
| Thresholds | Time range in which the rule takes effect and the threshold of the monitored metric. You can click the add icon to set multiple time ranges and thresholds, or click the delete icon to remove one. | Start and End Time: 00:00–08:30; Threshold: 10 |
- Click OK to save the rules.
- Locate the row that contains an added rule, and click Apply in the Operation column. The value of Effective for this rule changes to Yes.
A new rule can be applied only after you click Cancel for an existing rule.
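To make the interaction of Threshold Type, Trigger Count, Check Period (s), and hierarchical severities concrete, the following is a minimal Python sketch of the checking semantics described above. It is illustrative only and is not Manager's implementation; the metric-reader function, severity names, and threshold values are assumptions made for the example.

```python
import time

# Illustrative sketch of the documented threshold semantics, NOT Manager's code.
SEVERITY_ORDER = ["Warning", "Minor", "Major", "Critical"]  # ascending severity

def breached(value, threshold, threshold_type):
    """Max value: breach when the metric exceeds the threshold.
    Min value: breach when the metric falls below the threshold."""
    if threshold_type == "Max value":
        return value > threshold
    return value < threshold

def watch(read_metric, thresholds, threshold_type, trigger_count, check_period_s):
    """Check the metric every check_period_s seconds; generate an alarm only
    after trigger_count consecutive checks breach a threshold. With hierarchical
    thresholds (MRS 3.3.0 or later), the highest breached severity is reported.
    thresholds maps severity -> threshold, e.g. {"Major": 90.0, "Critical": 95.0}.
    """
    consecutive = 0
    while True:
        value = read_metric()
        hit = [s for s in SEVERITY_ORDER
               if s in thresholds and breached(value, thresholds[s], threshold_type)]
        if hit:
            consecutive += 1
            if consecutive >= trigger_count:
                print(f"ALARM [{hit[-1]}]: metric value {value}")
                consecutive = 0  # start counting again after reporting
        else:
            consecutive = 0  # breaches must be consecutive
        time.sleep(check_period_s)

# Hypothetical usage: Host Memory Usage with a Max value threshold of 90.0%,
# checked every 30 s, alarming after one breach (get_host_memory_usage is a
# placeholder for whatever supplies the metric value):
# watch(get_host_memory_usage, {"Major": 90.0}, "Max value", 1, 30)
```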
Configuring Alarm Thresholds for an MRS Cluster (MRS 2.x or Earlier)
- On MRS Manager, click System.
- In Configuration, click Configure Alarm Threshold under Monitoring and Alarm, select monitoring metrics as planned, and set their baselines.
- Click a metric, for example, CPU Usage, and click Create Rule.
- In the displayed dialog box, set the monitoring metric rule parameters.
Table 2 Monitoring metric rule parameters

| Parameter | Description | Example Value |
| --- | --- | --- |
| Rule Name | Name of the rule. | CPU_MAX |
| Reference Date | Date on which the reference metric history is generated. | 2014/11/06 |
| Threshold Type | Whether the maximum or minimum value of the metric triggers the alarm. If set to Max value, an alarm is generated when the metric is greater than the threshold. If set to Min value, an alarm is generated when the metric is less than the threshold. | Max value or Min value |
| Severity | Alarm severity of the rule. | Critical, Major, Minor, or Warning |
| Time Range | Period in which the rule takes effect. | From 00:00 to 23:59 |
| Threshold | Threshold of the monitored metric. | 80 |
| Date | Type of date on which the rule takes effect. | Workday, Weekend, or Other |
| Add Date | Valid only when Date is set to Other. You can select multiple dates. | 11/30 |
- Click OK. A message is displayed in the upper right corner of the page, indicating that the template is saved successfully.
Send alarm is selected by default. Trigger Count indicates the number of consecutive checks in which a monitoring metric must reach the threshold before an alarm is generated; this value is configurable. Check Period (s) indicates the interval at which MRS Manager checks monitoring metrics.
- Locate the row that contains the newly added rule and click Apply in the Operation column. A message in the upper right corner indicates that rule xx is successfully added. To cancel a rule, click Cancel in the Operation column; a message in the upper right corner indicates that rule xx is successfully canceled.
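The Date, Add Date, and Time Range parameters in Table 2 (and the corresponding Date and Thresholds parameters in Table 1) jointly determine whether a rule is in effect at a given moment. The following Python sketch shows one way to evaluate that decision, assuming a simple rule dictionary; the field names are illustrative and are not an MRS Manager API.

```python
from datetime import datetime

def rule_in_effect(rule, now=None):
    """Return True if the rule applies at `now`.
    `rule` is an assumed illustrative structure, e.g.:
    {"date": "Weekend", "dates": [], "start": "00:00", "end": "23:59"}"""
    now = now or datetime.now()
    date_type = rule["date"]
    if date_type == "Workday" and now.weekday() >= 5:   # Sat/Sun excluded
        return False
    if date_type == "Weekend" and now.weekday() < 5:    # Mon-Fri excluded
        return False
    if date_type == "Other" and now.strftime("%m/%d") not in rule["dates"]:
        return False
    # Time Range: the rule is active only between start and end (inclusive).
    hhmm = now.strftime("%H:%M")
    return rule["start"] <= hhmm <= rule["end"]

# Example: a weekend-only rule active all day.
rule = {"date": "Weekend", "dates": [], "start": "00:00", "end": "23:59"}
print(rule_in_effect(rule, datetime(2014, 11, 8, 10, 30)))  # a Saturday -> True
```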
Monitoring Metric Reference (MRS 3.x or Later)
FusionInsight Manager alarm monitoring metrics are classified into node metrics and service metrics. Table 3 lists the node metrics whose thresholds can be configured, and Table 4 lists the service metrics whose thresholds can be configured for each component.
Table 3 Node monitoring metrics

| Metric Group | Metric | Description | Default Threshold |
| --- | --- | --- | --- |
| CPU | Host CPU Usage | Reflects the computing and control capabilities of the cluster in a measurement period. Observing this metric helps you understand the overall resource usage of the cluster. | 90.0% |
| Disk | Disk Usage | Disk usage of a host. | 90.0% |
| Disk | Disk Inode Usage | Disk inode usage in a measurement period. | 80.0% |
| Memory | Host Memory Usage | Average memory usage at the current time. | 90.0% |
| Host Status | Host File Handle Usage | Usage of file handles on the host in a measurement period. | 80.0% |
| Host Status | Host PID Usage | PID usage of a host. | 90% |
| Network Status | TCP Ephemeral Port Usage | Usage of temporary TCP ports on the host in a measurement period. | 80.0% |
| Network Reading | Read Packet Error Rate | Read packet error rate of the network interface on the host in a measurement period. | 0.5% |
| Network Reading | Read Packet Dropped Rate | Read packet dropped rate of the network interface on the host in a measurement period. | 0.5% |
| Network Reading | Read Throughput Rate | Average read throughput (at the MAC layer) of the network interface in a measurement period. | 80% |
| Network Writing | Write Packet Error Rate | Write packet error rate of the network interface on the host in a measurement period. | 0.5% |
| Network Writing | Write Packet Dropped Rate | Write packet dropped rate of the network interface on the host in a measurement period. | 0.5% |
| Network Writing | Write Throughput Rate | Average write throughput (at the MAC layer) of the network interface in a measurement period. | 80% |
| Process | Uninterruptible Sleep Process | Number of D-state (uninterruptible sleep) processes on the host in a measurement period. | 0 |
| Process | omm Process Usage | omm process usage in a measurement period. | 90 |
Table 4 Service monitoring metrics

| Service | Metric Group | Metric Name | Metric Description | Default Threshold |
| --- | --- | --- | --- | --- |
| DBService | Database | Usage of the Number of Database Connections | Usage of database connections | 90% |
| DBService | Database | Disk Space Usage of the Data Directory | Disk space usage of the data directory | 80% |
| Flume | Agent | Heap Memory Usage Calculate | Flume heap memory usage | 95.0% |
| Flume | Agent | Flume Direct Memory Usage Statistics | Flume direct memory usage | 80.0% |
| Flume | Agent | Flume Non-heap Memory Usage | Flume non-heap memory usage | 80.0% |
| Flume | Agent | Total GC duration of Flume process | Flume total GC time | 12000 ms |
| HBase | GC | GC time for old generation | Total GC time of RegionServer | 5000 ms |
| HBase | GC | GC time for old generation | Total GC time of HMaster | 5000 ms |
| HBase | CPU & memory | RegionServer Direct Memory Usage Statistics | RegionServer direct memory usage | 90% |
| HBase | CPU & memory | RegionServer Heap Memory Usage Statistics | RegionServer heap memory usage | 90% |
| HBase | CPU & memory | HMaster Direct Memory Usage | HMaster direct memory usage | 90% |
| HBase | CPU & memory | HMaster Heap Memory Usage Statistics | HMaster heap memory usage | 90% |
| HBase | Service | Number of Online Regions of a RegionServer | Number of regions of a RegionServer | 2000 |
| HBase | Service | Region in transaction count over threshold | Number of regions in the RIT state that reach the threshold duration | 1 |
| HBase | Replication | Replication sync failed times (RegionServer) | Number of times that DR data fails to be synchronized | 1 |
| HBase | Replication | Number of Log Files to Be Synchronized in the Active Cluster | Number of log files to be synchronized in the active cluster | 128 |
| HBase | Replication | Number of HFiles to Be Synchronized in the Active Cluster | Number of HFiles to be synchronized in the active cluster | 128 |
| HBase | Queue | Compaction Queue Size | Size of the compaction queue | 100 |
| HDFS | File and Block | Lost Blocks | Number of block replicas that HDFS has lost | 0 |
| HDFS | File and Block | Blocks Under Replicated | Total number of blocks that the NameNode needs to replicate | 1000 |
| HDFS | RPC | Average Time of Active NameNode RPC Processing | Average NameNode RPC processing time | 100 ms |
| HDFS | RPC | Average Time of Active NameNode RPC Queuing | Average NameNode RPC queuing time | 200 ms |
| HDFS | Disk | HDFS Disk Usage | HDFS disk usage | 80% |
| HDFS | Disk | DataNode Disk Usage | Disk usage of DataNodes in HDFS | 80% |
| HDFS | Disk | Percentage of Reserved Space for Replicas of Unused Space | Percentage of the disk space reserved for replicas to the total unused disk space of DataNodes | 90% |
| HDFS | Resource | Faulty DataNodes | Number of faulty DataNodes | 3 |
| HDFS | Resource | NameNode Non-Heap Memory Usage Statistics | NameNode non-heap memory usage | 90% |
| HDFS | Resource | NameNode Direct Memory Usage Statistics | NameNode direct memory usage | 90% |
| HDFS | Resource | NameNode Heap Memory Usage Statistics | NameNode heap memory usage | 95% |
| HDFS | Resource | DataNode Direct Memory Usage Statistics | DataNode direct memory usage | 90% |
| HDFS | Resource | DataNode Heap Memory Usage Statistics | DataNode heap memory usage | 95% |
| HDFS | Resource | DataNode Non-Heap Memory Usage Statistics | DataNode non-heap memory usage | 90% |
| HDFS | Garbage Collection | GC Time (NameNode) | GC duration of NameNodes per minute | 12000 ms |
| HDFS | Garbage Collection | GC Time (DataNode) | GC duration of DataNodes per minute | 12000 ms |
| Hive | HQL | Percentage of HQL Statements That Are Executed Successfully by Hive | Percentage of HQL statements that Hive executes successfully | 90.0% |
| Hive | Background | Background Thread Usage | Background thread usage | 90% |
| Hive | GC | Total GC time of MetaStore | Total GC time of MetaStore | 12000 ms |
| Hive | GC | Total GC Time in Milliseconds | Total GC time of HiveServer | 12000 ms |
| Hive | Capacity | Percentage of HDFS Space Used by Hive to the Available Space | Percentage of the HDFS space used by Hive to the available space | 85.0% |
| Hive | CPU & memory | MetaStore Direct Memory Usage Statistics | MetaStore direct memory usage | 95% |
| Hive | CPU & memory | MetaStore Non-Heap Memory Usage Statistics | MetaStore non-heap memory usage | 95% |
| Hive | CPU & memory | MetaStore Heap Memory Usage Statistics | MetaStore heap memory usage | 95% |
| Hive | CPU & memory | HiveServer Direct Memory Usage Statistics | HiveServer direct memory usage | 95% |
| Hive | CPU & memory | HiveServer Non-Heap Memory Usage Statistics | HiveServer non-heap memory usage | 95% |
| Hive | CPU & memory | HiveServer Heap Memory Usage Statistics | HiveServer heap memory usage | 95% |
| Hive | Session | Percentage of Sessions Connected to the HiveServer to Maximum Number of Sessions Allowed by the HiveServer | Percentage of current HiveServer sessions to the maximum number of sessions allowed by the HiveServer | 90.0% |
| Kafka | Partition | Percentage of Partitions That Are Not Completely Synchronized | Percentage of partitions that are not completely synchronized to total partitions | 50% |
| Kafka | Others | Unavailable Partition Percentage | Percentage of unavailable partitions of each Kafka topic | 40% |
| Kafka | Others | User Connection Usage on Broker | Usage of user connections on the Broker | 80% |
| Kafka | Disk | Broker Disk Usage | Usage of the disk where the Broker data directory is located | 80.0% |
| Kafka | Disk | Disk I/O Rate of a Broker | I/O usage of the disk where the Broker data directory is located | 80% |
| Kafka | Process | Broker GC Duration per Minute | GC duration of the Broker process per minute | 12000 ms |
| Kafka | Process | Heap Memory Usage of Kafka | Kafka heap memory usage | 95% |
| Kafka | Process | Kafka Direct Memory Usage | Kafka direct memory usage | 95% |
| Loader | Memory | Heap Memory Usage Calculate | Loader heap memory usage | 95% |
| Loader | Memory | Direct Memory Usage of Loader | Loader direct memory usage | 80.0% |
| Loader | Memory | Non-heap Memory Usage of Loader | Loader non-heap memory usage | 80% |
| Loader | GC | Total GC time of Loader | Total GC time of Loader | 12000 ms |
| MapReduce | Garbage Collection | GC Time | GC time | 12000 ms |
| MapReduce | Resource | JobHistoryServer Direct Memory Usage Statistics | JobHistoryServer direct memory usage | 90% |
| MapReduce | Resource | JobHistoryServer Non-Heap Memory Usage Statistics | JobHistoryServer non-heap memory usage | 90% |
| MapReduce | Resource | JobHistoryServer Heap Memory Usage Statistics | JobHistoryServer heap memory usage | 95% |
| Oozie | Memory | Oozie Heap Memory Usage Calculate | Oozie heap memory usage | 95.0% |
| Oozie | Memory | Oozie Direct Memory Usage | Oozie direct memory usage | 80.0% |
| Oozie | Memory | Oozie Non-heap Memory Usage | Oozie non-heap memory usage | 80% |
| Oozie | GC | Total GC duration of Oozie | Total GC time of Oozie | 12000 ms |
| Spark/Spark2x | Memory | JDBCServer2x Heap Memory Usage Statistics | JDBCServer2x heap memory usage | 95% |
| Spark/Spark2x | Memory | JDBCServer2x Direct Memory Usage Statistics | JDBCServer2x direct memory usage | 95% |
| Spark/Spark2x | Memory | JDBCServer2x Non-Heap Memory Usage Statistics | JDBCServer2x non-heap memory usage | 95% |
| Spark/Spark2x | Memory | JobHistory2x Direct Memory Usage Statistics | JobHistory2x direct memory usage | 95% |
| Spark/Spark2x | Memory | JobHistory2x Non-Heap Memory Usage Statistics | JobHistory2x non-heap memory usage | 95% |
| Spark/Spark2x | Memory | JobHistory2x Heap Memory Usage Statistics | JobHistory2x heap memory usage | 95% |
| Spark/Spark2x | Memory | IndexServer2x Direct Memory Usage Statistics | IndexServer2x direct memory usage | 95% |
| Spark/Spark2x | Memory | IndexServer2x Heap Memory Usage Statistics | IndexServer2x heap memory usage | 95% |
| Spark/Spark2x | Memory | IndexServer2x Non-Heap Memory Usage Statistics | IndexServer2x non-heap memory usage | 95% |
| Spark/Spark2x | GC Count | Full GC Number of JDBCServer2x | Full GC count of JDBCServer2x | 12 |
| Spark/Spark2x | GC Count | Full GC Number of JobHistory2x | Full GC count of JobHistory2x | 12 |
| Spark/Spark2x | GC Count | Full GC Number of IndexServer2x | Full GC count of IndexServer2x | 12 |
| Spark/Spark2x | GC Time | Total GC Time in Milliseconds | Total GC time of JDBCServer2x | 12000 ms |
| Spark/Spark2x | GC Time | Total GC Time in Milliseconds | Total GC time of JobHistory2x | 12000 ms |
| Spark/Spark2x | GC Time | Total GC Time in Milliseconds | Total GC time of IndexServer2x | 12000 ms |
| Storm | Cluster | Number of Available Supervisors | Number of available Supervisor processes in the cluster in a measurement period | 1 |
| Storm | Cluster | Slot Usage | Slot usage in the cluster in a measurement period | 80.0% |
| Storm | Nimbus | Nimbus Heap Memory Usage Calculate | Nimbus heap memory usage | 80% |
| Yarn | Resources | NodeManager Direct Memory Usage Statistics | NodeManager direct memory usage | 90% |
| Yarn | Resources | NodeManager Heap Memory Usage Statistics | NodeManager heap memory usage | 95% |
| Yarn | Resources | NodeManager Non-Heap Memory Usage Statistics | NodeManager non-heap memory usage | 90% |
| Yarn | Resources | ResourceManager Direct Memory Usage Statistics | ResourceManager direct memory usage | 90% |
| Yarn | Resources | ResourceManager Heap Memory Usage Statistics | ResourceManager heap memory usage | 95% |
| Yarn | Resources | ResourceManager Non-Heap Memory Usage Statistics | ResourceManager non-heap memory usage | 90% |
| Yarn | Garbage collection | GC Time | GC duration of NodeManager per minute | 12000 ms |
| Yarn | Garbage collection | GC Time | GC duration of ResourceManager per minute | 12000 ms |
| Yarn | Others | Failed Applications of root queue | Number of failed tasks in the root queue | 50 |
| Yarn | Others | Terminated Applications of root queue | Number of killed tasks in the root queue | 50 |
| Yarn | CPU & memory | Pending Memory | Pending memory capacity | 83886080 MB |
| Yarn | Application | Pending Applications | Number of pending tasks | 60 |
| ZooKeeper | Connection | ZooKeeper Connections Usage | Percentage of used connections to the total connections allowed by ZooKeeper | 80% |
| ZooKeeper | CPU & memory | Heap Memory Usage Calculate | ZooKeeper heap memory usage | 95% |
| ZooKeeper | CPU & memory | Directmemory Usage Calculate | ZooKeeper direct memory usage | 80% |
| ZooKeeper | GC | ZooKeeper GC Duration per Minute | GC duration of ZooKeeper per minute | 12000 ms |
| Ranger | GC | UserSync GC Duration | UserSync garbage collection (GC) duration | 12000 ms |
| Ranger | GC | RangerAdmin GC Duration | RangerAdmin GC duration | 12000 ms |
| Ranger | GC | TagSync GC Duration | TagSync GC duration | 12000 ms |
| Ranger | CPU & memory | UserSync Non-Heap Memory Usage | UserSync non-heap memory usage | 80.0% |
| Ranger | CPU & memory | UserSync Direct Memory Usage | UserSync direct memory usage | 80.0% |
| Ranger | CPU & memory | UserSync Heap Memory Usage | UserSync heap memory usage | 95.0% |
| Ranger | CPU & memory | RangerAdmin Non-Heap Memory Usage | RangerAdmin non-heap memory usage | 80.0% |
| Ranger | CPU & memory | RangerAdmin Heap Memory Usage | RangerAdmin heap memory usage | 95.0% |
| Ranger | CPU & memory | RangerAdmin Direct Memory Usage | RangerAdmin direct memory usage | 80.0% |
| Ranger | CPU & memory | TagSync Direct Memory Usage | TagSync direct memory usage | 80.0% |
| Ranger | CPU & memory | TagSync Non-Heap Memory Usage | TagSync non-heap memory usage | 80.0% |
| Ranger | CPU & memory | TagSync Heap Memory Usage | TagSync heap memory usage | 95.0% |
| ClickHouse | Cluster Quota | ClickHouse service quantity quota usage in ZooKeeper | Usage of the quota for ZooKeeper nodes used by the ClickHouse service | 90% |
| ClickHouse | Cluster Quota | Capacity quota usage of the ClickHouse service in ZooKeeper | Usage of the capacity quota for the ZooKeeper directory used by the ClickHouse service | 90% |
| IoTDB | GC | IoTDBServer GC Duration | IoTDBServer garbage collection (GC) duration | 12000 ms |
| IoTDB | CPU & memory | IoTDBServer Heap Memory Usage | IoTDBServer heap memory usage | 90% |
| IoTDB | CPU & memory | IoTDBServer Direct Memory Usage | IoTDBServer direct memory usage | 90% |