Configuring Alarm Thresholds for an MRS Cluster
Manager allows you to configure thresholds for monitoring metrics so that the health of each metric can be tracked. If a metric becomes abnormal and meets the preset conditions, the system triggers an alarm, which is displayed on the alarm page.
Configuring Alarm Thresholds for an MRS Cluster (MRS 3.x or Later)
- Log in to FusionInsight Manager.
- Choose O&M > Alarm > Thresholds.
- Select a monitoring metric for a host or service in the cluster.
Figure 1 Configuring the threshold for a metric
For example, after you select Host Memory Usage, the threshold information for this metric is displayed.
- If the alarm sending switch is turned on, an alarm is triggered when the threshold is reached.
- When Alarm Severity is enabled, hierarchical alarms are used: the system dynamically reports alarms of each severity based on real-time metric values and the hierarchical thresholds set for each severity. This function is supported by MRS 3.3.0 or later.
- Alarm ID and Alarm Name: information about the alarm triggered when the threshold is reached.
- Trigger Count: number of consecutive checks in which the metric must reach the threshold before an alarm is generated. Trigger Count is configurable; see the sketch after this procedure for how it interacts with Check Period.
- Check Period (s): interval, in seconds, at which the system checks the monitoring metric.
- The rules in the rule list are used to trigger alarms.
- Click Create Rule to add a rule for the monitoring metric.
Table 1 Monitoring metric rule parameters

| Parameter | Description | Example Value |
| --- | --- | --- |
| Rule Name | Name of the rule. | CPU_MAX |
| Severity | Select an alarm severity. | Critical, Major, Minor, or Warning |
| Threshold Type | Whether the maximum or minimum value of the metric triggers the alarm. If set to Max value, an alarm is generated when the metric is greater than the threshold. If set to Min value, an alarm is generated when the metric is less than the threshold. | Max value or Min value |
| Date | Date type on which the rule takes effect. | Daily, Weekly, or Others |
| Add Date | Available only when Date is set to Others. Sets the dates on which the rule takes effect; multiple dates can be selected. | 09-30 |
| Thresholds | Time range in which the rule takes effect and the threshold of the monitored metric. You can click the add icon to set multiple time ranges and thresholds, or click the delete icon to remove one. | Start and End Time: 00:00–08:30; Threshold: 10 |
- Click OK to save the rules.
- Locate the row that contains an added rule, and click Apply in the Operation column. The value of Effective for this rule changes to Yes.
A new rule can be applied only after you click Cancel for an existing rule.
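To make the interaction of Threshold Type, Trigger Count, Check Period (s), and hierarchical severities concrete, the following is a minimal Python sketch of the checking semantics described above. It is illustrative only and is not Manager's implementation; the metric-reader function, severity names, and threshold values are assumptions made for the example.

```python
import time

# Illustrative sketch of the documented threshold semantics, NOT Manager's code.
SEVERITY_ORDER = ["Warning", "Minor", "Major", "Critical"]  # ascending severity

def breached(value, threshold, threshold_type):
    """Max value: breach when the metric exceeds the threshold.
    Min value: breach when the metric falls below the threshold."""
    if threshold_type == "Max value":
        return value > threshold
    return value < threshold

def watch(read_metric, thresholds, threshold_type, trigger_count, check_period_s):
    """Check the metric every check_period_s seconds; generate an alarm only
    after trigger_count consecutive checks breach a threshold. With hierarchical
    thresholds (MRS 3.3.0 or later), the highest breached severity is reported.
    thresholds maps severity -> threshold, e.g. {"Major": 90.0, "Critical": 95.0}.
    """
    consecutive = 0
    while True:
        value = read_metric()
        hit = [s for s in SEVERITY_ORDER
               if s in thresholds and breached(value, thresholds[s], threshold_type)]
        if hit:
            consecutive += 1
            if consecutive >= trigger_count:
                print(f"ALARM [{hit[-1]}]: metric value {value}")
                consecutive = 0  # start counting again after reporting
        else:
            consecutive = 0  # breaches must be consecutive
        time.sleep(check_period_s)

# Hypothetical usage: Host Memory Usage with a Max value threshold of 90.0%,
# checked every 30 s, alarming after one breach (get_host_memory_usage is a
# placeholder for whatever supplies the metric value):
# watch(get_host_memory_usage, {"Major": 90.0}, "Max value", 1, 30)
```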
Configuring Alarm Thresholds for an MRS Cluster (MRS 2.x or Earlier)
- On MRS Manager, click System.
- In Configuration, click Configure Alarm Threshold under Monitoring and Alarm, select monitoring metrics as planned, and set their baselines.
- Click a metric, for example, CPU Usage, and click Create Rule.
- In the displayed dialog box, set the monitoring metric rule parameters.
Table 2 Monitoring metric rule parameters

| Parameter | Description | Example Value |
| --- | --- | --- |
| Rule Name | Name of the rule. | CPU_MAX |
| Reference Date | Date on which the reference metric history is generated. | 2014/11/06 |
| Threshold Type | Whether the maximum or minimum value of the metric triggers the alarm. If set to Max value, an alarm is generated when the metric is greater than the threshold. If set to Min value, an alarm is generated when the metric is less than the threshold. | Max value or Min value |
| Severity | Alarm severity of the rule. | Critical, Major, Minor, or Warning |
| Time Range | Period in which the rule takes effect. | From 00:00 to 23:59 |
| Threshold | Threshold of the monitored metric. | 80 |
| Date | Type of date on which the rule takes effect. | Workday, Weekend, or Other |
| Add Date | Valid only when Date is set to Other. You can select multiple dates. | 11/30 |
- Click OK. A message is displayed in the upper right corner of the page, indicating that the template is saved successfully.
Send alarm is selected by default. Trigger Count indicates the number of consecutive checks in which a monitoring metric must reach the threshold before an alarm is generated; this value is configurable. Check Period (s) indicates the interval at which MRS Manager checks monitoring metrics.
- Locate the row that contains the newly added rule and click Apply in the Operation column. A message in the upper right corner indicates that rule xx is successfully added. To cancel a rule, click Cancel in the Operation column; a message in the upper right corner indicates that rule xx is successfully canceled.
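The Date, Add Date, and Time Range parameters in Table 2 (and the corresponding Date and Thresholds parameters in Table 1) jointly determine whether a rule is in effect at a given moment. The following Python sketch shows one way to evaluate that decision, assuming a simple rule dictionary; the field names are illustrative and are not an MRS Manager API.

```python
from datetime import datetime

def rule_in_effect(rule, now=None):
    """Return True if the rule applies at `now`.
    `rule` is an assumed illustrative structure, e.g.:
    {"date": "Weekend", "dates": [], "start": "00:00", "end": "23:59"}"""
    now = now or datetime.now()
    date_type = rule["date"]
    if date_type == "Workday" and now.weekday() >= 5:   # Sat/Sun excluded
        return False
    if date_type == "Weekend" and now.weekday() < 5:    # Mon-Fri excluded
        return False
    if date_type == "Other" and now.strftime("%m/%d") not in rule["dates"]:
        return False
    # Time Range: the rule is active only between start and end (inclusive).
    hhmm = now.strftime("%H:%M")
    return rule["start"] <= hhmm <= rule["end"]

# Example: a weekend-only rule active all day.
rule = {"date": "Weekend", "dates": [], "start": "00:00", "end": "23:59"}
print(rule_in_effect(rule, datetime(2014, 11, 8, 10, 30)))  # a Saturday -> True
```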
Monitoring Metric Reference (MRS 3.x or Later)
FusionInsight Manager alarm monitoring metrics are classified into node metrics and service metrics. Table 3 lists the node metrics whose thresholds can be configured, and Table 4 lists the service metrics whose thresholds can be configured for each component.
Table 3 Node monitoring metrics

| Metric Group | Metric | Description | Default Threshold |
| --- | --- | --- | --- |
| CPU | Host CPU Usage | Reflects the computing and control capabilities of the cluster in a measurement period. Observing this metric helps you understand the overall resource usage of the cluster. | 90.0% |
| Disk | Disk Usage | Disk usage of a host. | 90.0% |
| Disk | Disk Inode Usage | Disk inode usage in a measurement period. | 80.0% |
| Memory | Host Memory Usage | Average memory usage at the current time. | 90.0% |
| Host Status | Host File Handle Usage | Usage of file handles on the host in a measurement period. | 80.0% |
| Host Status | Host PID Usage | PID usage of a host. | 90% |
| Network Status | TCP Ephemeral Port Usage | Usage of temporary TCP ports on the host in a measurement period. | 80.0% |
| Network Reading | Read Packet Error Rate | Read packet error rate of the network interface on the host in a measurement period. | 0.5% |
| Network Reading | Read Packet Dropped Rate | Read packet dropped rate of the network interface on the host in a measurement period. | 0.5% |
| Network Reading | Read Throughput Rate | Average read throughput (at the MAC layer) of the network interface in a measurement period. | 80% |
| Network Writing | Write Packet Error Rate | Write packet error rate of the network interface on the host in a measurement period. | 0.5% |
| Network Writing | Write Packet Dropped Rate | Write packet dropped rate of the network interface on the host in a measurement period. | 0.5% |
| Network Writing | Write Throughput Rate | Average write throughput (at the MAC layer) of the network interface in a measurement period. | 80% |
| Process | Uninterruptible Sleep Process | Number of D-state (uninterruptible sleep) processes on the host in a measurement period. | 0 |
| Process | omm Process Usage | omm process usage in a measurement period. | 90 |
Table 4 Service monitoring metrics

| Service | Metric Group | Metric Name | Metric Description | Default Threshold |
| --- | --- | --- | --- | --- |
| DBService | Database | Usage of the Number of Database Connections | Usage of database connections | 90% |
| DBService | Database | Disk Space Usage of the Data Directory | Disk space usage of the data directory | 80% |
| Flume | Agent | Heap Memory Usage Calculate | Flume heap memory usage | 95.0% |
| Flume | Agent | Flume Direct Memory Usage Statistics | Flume direct memory usage | 80.0% |
| Flume | Agent | Flume Non-heap Memory Usage | Flume non-heap memory usage | 80.0% |
| Flume | Agent | Total GC duration of Flume process | Flume total GC time | 12000 ms |
| HBase | GC | GC time for old generation | Total GC time of RegionServer | 5000 ms |
| HBase | GC | GC time for old generation | Total GC time of HMaster | 5000 ms |
| HBase | CPU & memory | RegionServer Direct Memory Usage Statistics | RegionServer direct memory usage | 90% |
| HBase | CPU & memory | RegionServer Heap Memory Usage Statistics | RegionServer heap memory usage | 90% |
| HBase | CPU & memory | HMaster Direct Memory Usage | HMaster direct memory usage | 90% |
| HBase | CPU & memory | HMaster Heap Memory Usage Statistics | HMaster heap memory usage | 90% |
| HBase | Service | Number of Online Regions of a RegionServer | Number of regions of a RegionServer | 2000 |
| HBase | Service | Region in transaction count over threshold | Number of regions in the RIT state that reach the threshold duration | 1 |
| HBase | Replication | Replication sync failed times (RegionServer) | Number of times that DR data fails to be synchronized | 1 |
| HBase | Replication | Number of Log Files to Be Synchronized in the Active Cluster | Number of log files to be synchronized in the active cluster | 128 |
| HBase | Replication | Number of HFiles to Be Synchronized in the Active Cluster | Number of HFiles to be synchronized in the active cluster | 128 |
| HBase | Queue | Compaction Queue Size | Size of the compaction queue | 100 |
| HDFS | File and Block | Lost Blocks | Number of block replicas that HDFS has lost | 0 |
| HDFS | File and Block | Blocks Under Replicated | Total number of blocks that the NameNode needs to replicate | 1000 |
| HDFS | RPC | Average Time of Active NameNode RPC Processing | Average NameNode RPC processing time | 100 ms |
| HDFS | RPC | Average Time of Active NameNode RPC Queuing | Average NameNode RPC queuing time | 200 ms |
| HDFS | Disk | HDFS Disk Usage | HDFS disk usage | 80% |
| HDFS | Disk | DataNode Disk Usage | Disk usage of DataNodes in HDFS | 80% |
| HDFS | Disk | Percentage of Reserved Space for Replicas of Unused Space | Percentage of the disk space reserved for replicas to the total unused disk space of DataNodes | 90% |
| HDFS | Resource | Faulty DataNodes | Number of faulty DataNodes | 3 |
| HDFS | Resource | NameNode Non-Heap Memory Usage Statistics | NameNode non-heap memory usage | 90% |
| HDFS | Resource | NameNode Direct Memory Usage Statistics | NameNode direct memory usage | 90% |
| HDFS | Resource | NameNode Heap Memory Usage Statistics | NameNode heap memory usage | 95% |
| HDFS | Resource | DataNode Direct Memory Usage Statistics | DataNode direct memory usage | 90% |
| HDFS | Resource | DataNode Heap Memory Usage Statistics | DataNode heap memory usage | 95% |
| HDFS | Resource | DataNode Non-Heap Memory Usage Statistics | DataNode non-heap memory usage | 90% |
| HDFS | Garbage Collection | GC Time (NameNode) | GC duration of NameNodes per minute | 12000 ms |
| HDFS | Garbage Collection | GC Time (DataNode) | GC duration of DataNodes per minute | 12000 ms |
| Hive | HQL | Percentage of HQL Statements That Are Executed Successfully by Hive | Percentage of HQL statements that Hive executes successfully | 90.0% |
| Hive | Background | Background Thread Usage | Background thread usage | 90% |
| Hive | GC | Total GC time of MetaStore | Total GC time of MetaStore | 12000 ms |
| Hive | GC | Total GC Time in Milliseconds | Total GC time of HiveServer | 12000 ms |
| Hive | Capacity | Percentage of HDFS Space Used by Hive to the Available Space | Percentage of the HDFS space used by Hive to the available space | 85.0% |
| Hive | CPU & memory | MetaStore Direct Memory Usage Statistics | MetaStore direct memory usage | 95% |
| Hive | CPU & memory | MetaStore Non-Heap Memory Usage Statistics | MetaStore non-heap memory usage | 95% |
| Hive | CPU & memory | MetaStore Heap Memory Usage Statistics | MetaStore heap memory usage | 95% |
| Hive | CPU & memory | HiveServer Direct Memory Usage Statistics | HiveServer direct memory usage | 95% |
| Hive | CPU & memory | HiveServer Non-Heap Memory Usage Statistics | HiveServer non-heap memory usage | 95% |
| Hive | CPU & memory | HiveServer Heap Memory Usage Statistics | HiveServer heap memory usage | 95% |
| Hive | Session | Percentage of Sessions Connected to the HiveServer to Maximum Number of Sessions Allowed by the HiveServer | Percentage of current HiveServer sessions to the maximum number of sessions allowed by the HiveServer | 90.0% |
| Kafka | Partition | Percentage of Partitions That Are Not Completely Synchronized | Percentage of partitions that are not completely synchronized to total partitions | 50% |
| Kafka | Others | Unavailable Partition Percentage | Percentage of unavailable partitions of each Kafka topic | 40% |
| Kafka | Others | User Connection Usage on Broker | Usage of user connections on the Broker | 80% |
| Kafka | Disk | Broker Disk Usage | Usage of the disk where the Broker data directory is located | 80.0% |
| Kafka | Disk | Disk I/O Rate of a Broker | I/O usage of the disk where the Broker data directory is located | 80% |
| Kafka | Process | Broker GC Duration per Minute | GC duration of the Broker process per minute | 12000 ms |
| Kafka | Process | Heap Memory Usage of Kafka | Kafka heap memory usage | 95% |
| Kafka | Process | Kafka Direct Memory Usage | Kafka direct memory usage | 95% |
| Loader | Memory | Heap Memory Usage Calculate | Loader heap memory usage | 95% |
| Loader | Memory | Direct Memory Usage of Loader | Loader direct memory usage | 80.0% |
| Loader | Memory | Non-heap Memory Usage of Loader | Loader non-heap memory usage | 80% |
| Loader | GC | Total GC time of Loader | Total GC time of Loader | 12000 ms |
| MapReduce | Garbage Collection | GC Time | GC time | 12000 ms |
| MapReduce | Resource | JobHistoryServer Direct Memory Usage Statistics | JobHistoryServer direct memory usage | 90% |
| MapReduce | Resource | JobHistoryServer Non-Heap Memory Usage Statistics | JobHistoryServer non-heap memory usage | 90% |
| MapReduce | Resource | JobHistoryServer Heap Memory Usage Statistics | JobHistoryServer heap memory usage | 95% |
| Oozie | Memory | Oozie Heap Memory Usage Calculate | Oozie heap memory usage | 95.0% |
| Oozie | Memory | Oozie Direct Memory Usage | Oozie direct memory usage | 80.0% |
| Oozie | Memory | Oozie Non-heap Memory Usage | Oozie non-heap memory usage | 80% |
| Oozie | GC | Total GC duration of Oozie | Total GC time of Oozie | 12000 ms |
| Spark/Spark2x | Memory | JDBCServer2x Heap Memory Usage Statistics | JDBCServer2x heap memory usage | 95% |
| Spark/Spark2x | Memory | JDBCServer2x Direct Memory Usage Statistics | JDBCServer2x direct memory usage | 95% |
| Spark/Spark2x | Memory | JDBCServer2x Non-Heap Memory Usage Statistics | JDBCServer2x non-heap memory usage | 95% |
| Spark/Spark2x | Memory | JobHistory2x Direct Memory Usage Statistics | JobHistory2x direct memory usage | 95% |
| Spark/Spark2x | Memory | JobHistory2x Non-Heap Memory Usage Statistics | JobHistory2x non-heap memory usage | 95% |
| Spark/Spark2x | Memory | JobHistory2x Heap Memory Usage Statistics | JobHistory2x heap memory usage | 95% |
| Spark/Spark2x | Memory | IndexServer2x Direct Memory Usage Statistics | IndexServer2x direct memory usage | 95% |
| Spark/Spark2x | Memory | IndexServer2x Heap Memory Usage Statistics | IndexServer2x heap memory usage | 95% |
| Spark/Spark2x | Memory | IndexServer2x Non-Heap Memory Usage Statistics | IndexServer2x non-heap memory usage | 95% |
| Spark/Spark2x | GC Count | Full GC Number of JDBCServer2x | Full GC count of JDBCServer2x | 12 |
| Spark/Spark2x | GC Count | Full GC Number of JobHistory2x | Full GC count of JobHistory2x | 12 |
| Spark/Spark2x | GC Count | Full GC Number of IndexServer2x | Full GC count of IndexServer2x | 12 |
| Spark/Spark2x | GC Time | Total GC Time in Milliseconds | Total GC time of JDBCServer2x | 12000 ms |
| Spark/Spark2x | GC Time | Total GC Time in Milliseconds | Total GC time of JobHistory2x | 12000 ms |
| Spark/Spark2x | GC Time | Total GC Time in Milliseconds | Total GC time of IndexServer2x | 12000 ms |
| Storm | Cluster | Number of Available Supervisors | Number of available Supervisor processes in the cluster in a measurement period | 1 |
| Storm | Cluster | Slot Usage | Slot usage in the cluster in a measurement period | 80.0% |
| Storm | Nimbus | Nimbus Heap Memory Usage Calculate | Nimbus heap memory usage | 80% |
| Yarn | Resources | NodeManager Direct Memory Usage Statistics | NodeManager direct memory usage | 90% |
| Yarn | Resources | NodeManager Heap Memory Usage Statistics | NodeManager heap memory usage | 95% |
| Yarn | Resources | NodeManager Non-Heap Memory Usage Statistics | NodeManager non-heap memory usage | 90% |
| Yarn | Resources | ResourceManager Direct Memory Usage Statistics | ResourceManager direct memory usage | 90% |
| Yarn | Resources | ResourceManager Heap Memory Usage Statistics | ResourceManager heap memory usage | 95% |
| Yarn | Resources | ResourceManager Non-Heap Memory Usage Statistics | ResourceManager non-heap memory usage | 90% |
| Yarn | Garbage collection | GC Time | GC duration of NodeManager per minute | 12000 ms |
| Yarn | Garbage collection | GC Time | GC duration of ResourceManager per minute | 12000 ms |
| Yarn | Others | Failed Applications of root queue | Number of failed tasks in the root queue | 50 |
| Yarn | Others | Terminated Applications of root queue | Number of killed tasks in the root queue | 50 |
| Yarn | CPU & memory | Pending Memory | Pending memory capacity | 83886080 MB |
| Yarn | Application | Pending Applications | Number of pending tasks | 60 |
| ZooKeeper | Connection | ZooKeeper Connections Usage | Percentage of used connections to the total connections allowed by ZooKeeper | 80% |
| ZooKeeper | CPU & memory | Heap Memory Usage Calculate | ZooKeeper heap memory usage | 95% |
| ZooKeeper | CPU & memory | Directmemory Usage Calculate | ZooKeeper direct memory usage | 80% |
| ZooKeeper | GC | ZooKeeper GC Duration per Minute | GC duration of ZooKeeper per minute | 12000 ms |
| Ranger | GC | UserSync GC Duration | UserSync garbage collection (GC) duration | 12000 ms |
| Ranger | GC | RangerAdmin GC Duration | RangerAdmin GC duration | 12000 ms |
| Ranger | GC | TagSync GC Duration | TagSync GC duration | 12000 ms |
| Ranger | CPU & memory | UserSync Non-Heap Memory Usage | UserSync non-heap memory usage | 80.0% |
| Ranger | CPU & memory | UserSync Direct Memory Usage | UserSync direct memory usage | 80.0% |
| Ranger | CPU & memory | UserSync Heap Memory Usage | UserSync heap memory usage | 95.0% |
| Ranger | CPU & memory | RangerAdmin Non-Heap Memory Usage | RangerAdmin non-heap memory usage | 80.0% |
| Ranger | CPU & memory | RangerAdmin Heap Memory Usage | RangerAdmin heap memory usage | 95.0% |
| Ranger | CPU & memory | RangerAdmin Direct Memory Usage | RangerAdmin direct memory usage | 80.0% |
| Ranger | CPU & memory | TagSync Direct Memory Usage | TagSync direct memory usage | 80.0% |
| Ranger | CPU & memory | TagSync Non-Heap Memory Usage | TagSync non-heap memory usage | 80.0% |
| Ranger | CPU & memory | TagSync Heap Memory Usage | TagSync heap memory usage | 95.0% |
| ClickHouse | Cluster Quota | ClickHouse service quantity quota usage in ZooKeeper | Usage of the quota for ZooKeeper nodes used by the ClickHouse service | 90% |
| ClickHouse | Cluster Quota | Capacity quota usage of the ClickHouse service in ZooKeeper | Usage of the capacity quota for the ZooKeeper directory used by the ClickHouse service | 90% |
| IoTDB | GC | IoTDBServer GC Duration | IoTDBServer garbage collection (GC) duration | 12000 ms |
| IoTDB | CPU & memory | IoTDBServer Heap Memory Usage | IoTDBServer heap memory usage | 90% |
| IoTDB | CPU & memory | IoTDBServer Direct Memory Usage | IoTDBServer direct memory usage | 90% |