Updated on 2024-09-23 GMT+08:00

Configuring Alarm Thresholds for an MRS Cluster

Manager allows you to configure thresholds for monitoring metrics so that the health of each metric is monitored. If a metric's data is abnormal and meets the preset conditions, the system triggers an alarm, which is displayed on the alarm page.

Configuring Alarm Thresholds for an MRS Cluster (MRS 3.x or Later)

  1. Log in to FusionInsight Manager.
  2. Choose O&M > Alarm > Thresholds.
  3. Select a monitoring metric for a host or service in the cluster.

    Figure 1 Configuring the threshold for a metric
    For example, after you select Host Memory Usage, the threshold information for this metric is displayed.
    • If the alarm sending switch is turned on, an alarm is triggered when the threshold is reached.
    • When Alarm Severity is enabled, hierarchical alarms are enabled: the system reports alarms at each severity based on the real-time metric values and the hierarchical thresholds set for each severity. This function is supported in MRS 3.3.0 or later.
    • Alarm ID and Alarm Name: information about the alarm triggered by the threshold.
    • Trigger Count: FusionInsight Manager checks whether the value of the monitoring metric reaches the threshold. An alarm is generated only after the number of consecutive checks that breach the threshold reaches the value of Trigger Count. Trigger Count is configurable.
    • Check Period (s): interval, in seconds, at which the system checks the monitoring metric.
    • The rules in the rule list are used to trigger alarms.
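The Trigger Count and Check Period behavior described above can be sketched as follows. This is a minimal illustration of the consecutive-breach semantics, not Manager's actual implementation; the class and method names are invented:

```python
class ThresholdMonitor:
    """Raises an alarm after `trigger_count` consecutive threshold breaches."""

    def __init__(self, threshold, trigger_count):
        self.threshold = threshold
        self.trigger_count = trigger_count
        self.consecutive = 0

    def check(self, value):
        """Called once per check period; returns True when an alarm fires."""
        if value > self.threshold:
            self.consecutive += 1
        else:
            self.consecutive = 0  # any in-range reading resets the count
        return self.consecutive >= self.trigger_count

# Host Memory Usage with its default threshold of 90.0% and Trigger Count 3:
monitor = ThresholdMonitor(threshold=90.0, trigger_count=3)
readings = [91.2, 95.0, 88.0, 92.1, 93.4, 94.8]  # one reading per check period
fired = [monitor.check(v) for v in readings]
# The dip to 88.0 resets the count, so the alarm fires only at the last reading.
```

This is why a brief spike does not alarm: only Trigger Count consecutive breaching checks, each one Check Period apart, generate an alarm.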

  4. Click Create Rule to add a rule for the monitoring metric.

    Table 1 Monitoring metric rule parameters

    • Rule Name: Name of the rule. Example: CPU_MAX
    • Severity: Alarm severity. Options: Critical, Major, Minor, Warning
    • Threshold Type: Whether the maximum or minimum value of the metric triggers the alarm. If Threshold Type is set to Max value, an alarm is generated when the metric value is greater than the threshold. If Threshold Type is set to Min value, an alarm is generated when the metric value is less than the threshold. Options: Max value, Min value
    • Date: Date on which the rule takes effect. Options: Daily, Weekly, Others
    • Add Date: Available only when Date is set to Others. Specifies the dates on which the rule takes effect; multiple dates can be selected. Example: 09-30
    • Thresholds: Time range in which the rule takes effect and the threshold of the monitored metric within that range. Example: Start and End Time 00:00–08:30, Threshold 10. You can click the add button to set multiple time ranges and thresholds, or click the delete button to remove one.
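The Threshold Type and effective time range semantics from Table 1 can be sketched as a single rule check. This is an illustration only; the function name and parameters are invented, and Manager evaluates rules internally:

```python
from datetime import time

def rule_matches(value, threshold, threshold_type, now, start, end):
    """Sketch of evaluating one rule: the metric breaches the threshold
    AND the current time falls inside the rule's effective window."""
    in_window = start <= now <= end
    if threshold_type == "Max value":
        breached = value > threshold   # alarm when the metric exceeds the threshold
    elif threshold_type == "Min value":
        breached = value < threshold   # alarm when the metric falls below the threshold
    else:
        raise ValueError(f"unknown threshold type: {threshold_type}")
    return in_window and breached

# A Max value rule effective from 00:00 to 08:30 with threshold 10:
print(rule_matches(12.5, 10, "Max value", time(7, 15), time(0, 0), time(8, 30)))  # True
print(rule_matches(12.5, 10, "Max value", time(9, 0), time(0, 0), time(8, 30)))   # False: outside the window
```

A breach outside the rule's time range is ignored, which is why a rule can carry different thresholds for different time ranges.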

  5. Click OK to save the rules.
  6. Locate the row that contains an added rule, and click Apply in the Operation column. The value of Effective for this rule changes to Yes.

    A new rule can be applied only after you click Cancel for an existing rule.

Configuring Alarm Thresholds for an MRS Cluster (MRS 2.x or Earlier)

  1. On MRS Manager, click System.
  2. In Configuration, click Configure Alarm Threshold under Monitoring and Alarm, select monitoring metrics as planned, and set their baselines.
  3. Click a metric, for example, CPU Usage, and click Create Rule.
  4. In the displayed dialog box, set the monitoring metric rule parameters.

    Table 2 Monitoring metric rule parameters

    • Rule Name: Name of the rule. Example: CPU_MAX
    • Reference Date: Date of the reference metric history data. Example: 2014/11/06
    • Threshold Type: Whether the maximum or minimum value of the metric triggers the alarm. If Threshold Type is set to Max value, an alarm is generated when the metric value is greater than the threshold. If Threshold Type is set to Min value, an alarm is generated when the metric value is less than the threshold. Options: Max value, Min value
    • Severity: Alarm severity. Options: Critical, Major, Minor, Warning
    • Time Range: Period in which the rule takes effect. Example: from 00:00 to 23:59
    • Threshold: Threshold of the monitored metric. Example: 80
    • Date: Type of date on which the rule takes effect. Options: Workday, Weekend, Other
    • Add Date: Valid only when Date is set to Other; multiple dates can be selected. Example: 11/30

  5. Click OK. A message is displayed in the upper right corner of the page, indicating that the template is saved successfully.

    Send alarm is selected by default. Trigger Count: MRS Manager checks whether the value of the monitoring metric reaches the threshold; an alarm is generated only after the number of consecutive checks that breach the threshold reaches the value of Trigger Count. Trigger Count is configurable. Check Period (s) is the interval at which MRS Manager checks the monitoring metric.

  6. Locate the row that contains the newly added rule and click Apply in the Operation column. A message is displayed in the upper right corner, indicating that rule xx is added successfully. To cancel a rule, click Cancel in the Operation column; a message in the upper right corner indicates that rule xx is canceled successfully.
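The Date and Add Date settings in Table 2 control which days a rule applies to. A minimal sketch of that check, assuming workday means Monday through Friday (the function name and the month/day encoding of Add Date entries are invented for illustration):

```python
from datetime import date

def rule_effective_on(day, date_type, extra_dates=()):
    """Sketch: does a rule with the given Date setting apply on `day`?
    `extra_dates` holds (month, day) pairs from Add Date, used when
    Date is set to 'Other'."""
    if date_type == "Workday":
        return day.weekday() < 5   # Monday (0) through Friday (4)
    if date_type == "Weekend":
        return day.weekday() >= 5  # Saturday (5) and Sunday (6)
    if date_type == "Other":
        return (day.month, day.day) in extra_dates
    raise ValueError(f"unknown date type: {date_type}")

print(rule_effective_on(date(2014, 11, 6), "Workday"))             # True: a Thursday
print(rule_effective_on(date(2014, 11, 30), "Other", {(11, 30)}))  # True: listed under Add Date
```

A rule whose date check fails is simply not evaluated that day, regardless of the metric's value.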

Monitoring Metric Reference (MRS 3.x or Later)

FusionInsight Manager alarm monitoring metrics are classified into node metrics and cluster service metrics. Table 3 lists the node metrics whose thresholds can be configured, and Table 4 lists the component metrics whose thresholds can be configured.
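As a simple illustration of how a node metric relates to its default threshold, the sketch below derives a memory usage percentage from total and available memory and compares it with the 90.0% default for Host Memory Usage in Table 3. The helper function is invented; Manager computes metric values internally:

```python
def memory_usage_percent(mem_total_kb, mem_available_kb):
    """Percentage of memory in use, the kind of value a metric such as
    Host Memory Usage reports (a simplified illustration, not Manager's
    exact formula)."""
    return round(100.0 * (mem_total_kb - mem_available_kb) / mem_total_kb, 1)

DEFAULT_THRESHOLD = 90.0  # default threshold for Host Memory Usage (Table 3)

# A host with 16 GB total and about 1.2 GB available:
usage = memory_usage_percent(mem_total_kb=16_384_000, mem_available_kb=1_228_800)
print(usage, usage > DEFAULT_THRESHOLD)  # 92.5 True -> eligible to trigger an alarm
```

Whether this actually raises an alarm still depends on the Trigger Count and Check Period settings described earlier.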

Table 3 Node monitoring metrics

CPU
  • Host CPU Usage: CPU usage of the host in a measurement period, which reflects the overall computing load of the cluster. Default threshold: 90.0%

Disk
  • Disk Usage: Disk usage of the host. Default threshold: 90.0%
  • Disk Inode Usage: Disk inode usage in a measurement period. Default threshold: 80.0%

Memory
  • Host Memory Usage: Average memory usage of the host at the current time. Default threshold: 90.0%

Host Status
  • Host File Handle Usage: File handle usage of the host in a measurement period. Default threshold: 80.0%
  • Host PID Usage: PID usage of the host. Default threshold: 90%

Network Status
  • TCP Ephemeral Port Usage: Usage of ephemeral (temporary) TCP ports on the host in a measurement period. Default threshold: 80.0%

Network Reading
  • Read Packet Error Rate: Read packet error rate of the network interface in a measurement period. Default threshold: 0.5%
  • Read Packet Dropped Rate: Read packet dropped rate of the network interface in a measurement period. Default threshold: 0.5%
  • Read Throughput Rate: Average read throughput (at the MAC layer) of the network interface in a measurement period. Default threshold: 80%

Network Writing
  • Write Packet Error Rate: Write packet error rate of the network interface in a measurement period. Default threshold: 0.5%
  • Write Packet Dropped Rate: Write packet dropped rate of the network interface in a measurement period. Default threshold: 0.5%
  • Write Throughput Rate: Average write throughput (at the MAC layer) of the network interface in a measurement period. Default threshold: 80%

Process
  • Uninterruptible Sleep Process: Number of processes in the uninterruptible sleep (D) state on the host in a measurement period. Default threshold: 0
  • omm Process Usage: omm process usage in a measurement period. Default threshold: 90

Table 4 Cluster service metrics

DBService
  Database
  • Usage of the Number of Database Connections: Usage of database connections. Default threshold: 90%
  • Disk Space Usage of the Data Directory: Disk space usage of the data directory. Default threshold: 80%

Flume
  Agent
  • Heap Memory Usage Calculate: Flume heap memory usage. Default threshold: 95.0%
  • Flume Direct Memory Usage Statistics: Flume direct memory usage. Default threshold: 80.0%
  • Flume Non-heap Memory Usage: Flume non-heap memory usage. Default threshold: 80.0%
  • Total GC duration of Flume process: Total GC time of the Flume process. Default threshold: 12000 ms

HBase
  GC
  • GC time for old generation: Total GC time of RegionServer. Default threshold: 5000 ms
  • GC time for old generation: Total GC time of HMaster. Default threshold: 5000 ms
  CPU & memory
  • RegionServer Direct Memory Usage Statistics: RegionServer direct memory usage. Default threshold: 90%
  • RegionServer Heap Memory Usage Statistics: RegionServer heap memory usage. Default threshold: 90%
  • HMaster Direct Memory Usage: HMaster direct memory usage. Default threshold: 90%
  • HMaster Heap Memory Usage Statistics: HMaster heap memory usage. Default threshold: 90%
  Service
  • Number of Online Regions of a RegionServer: Number of regions on a RegionServer. Default threshold: 2000
  • Region in transaction count over threshold: Number of regions that stay in the RIT state longer than the threshold duration. Default threshold: 1
  Replication
  • Replication sync failed times (RegionServer): Number of times that DR data fails to be synchronized. Default threshold: 1
  • Number of Log Files to Be Synchronized in the Active Cluster: Number of log files to be synchronized in the active cluster. Default threshold: 128
  • Number of HFiles to Be Synchronized in the Active Cluster: Number of HFiles to be synchronized in the active cluster. Default threshold: 128
  Queue
  • Compaction Queue Size: Size of the compaction queue. Default threshold: 100

HDFS
  File and Block
  • Lost Blocks: Number of block replicas that HDFS is missing. Default threshold: 0
  • Blocks Under Replicated: Total number of blocks that the NameNode needs to replicate. Default threshold: 1000
  RPC
  • Average Time of Active NameNode RPC Processing: Average NameNode RPC processing time. Default threshold: 100 ms
  • Average Time of Active NameNode RPC Queuing: Average NameNode RPC queuing time. Default threshold: 200 ms
  Disk
  • HDFS Disk Usage: HDFS disk usage. Default threshold: 80%
  • DataNode Disk Usage: Disk usage of DataNodes in HDFS. Default threshold: 80%
  • Percentage of Reserved Space for Replicas of Unused Space: Percentage of the disk space reserved for replicas to the total unused disk space of DataNodes. Default threshold: 90%
  Resource
  • Faulty DataNodes: Number of faulty DataNodes. Default threshold: 3
  • NameNode Non-Heap Memory Usage Statistics: NameNode non-heap memory usage. Default threshold: 90%
  • NameNode Direct Memory Usage Statistics: NameNode direct memory usage. Default threshold: 90%
  • NameNode Heap Memory Usage Statistics: NameNode heap memory usage. Default threshold: 95%
  • DataNode Direct Memory Usage Statistics: DataNode direct memory usage. Default threshold: 90%
  • DataNode Heap Memory Usage Statistics: DataNode heap memory usage. Default threshold: 95%
  • DataNode Non-Heap Memory Usage Statistics: DataNode non-heap memory usage. Default threshold: 90%
  Garbage Collection
  • GC Time (NameNode): GC duration of the NameNode per minute. Default threshold: 12000 ms
  • GC Time (DataNode): GC duration of the DataNode per minute. Default threshold: 12000 ms

Hive
  HQL
  • Percentage of HQL Statements That Are Executed Successfully by Hive: Percentage of HQL statements that Hive executes successfully. Default threshold: 90.0%
  Background
  • Background Thread Usage: Background thread usage. Default threshold: 90%
  GC
  • Total GC time of MetaStore: Total GC time of MetaStore. Default threshold: 12000 ms
  • Total GC Time in Milliseconds: Total GC time of HiveServer. Default threshold: 12000 ms
  Capacity
  • Percentage of HDFS Space Used by Hive to the Available Space: Percentage of the HDFS space used by Hive to the available space. Default threshold: 85.0%
  CPU & memory
  • MetaStore Direct Memory Usage Statistics: MetaStore direct memory usage. Default threshold: 95%
  • MetaStore Non-Heap Memory Usage Statistics: MetaStore non-heap memory usage. Default threshold: 95%
  • MetaStore Heap Memory Usage Statistics: MetaStore heap memory usage. Default threshold: 95%
  • HiveServer Direct Memory Usage Statistics: HiveServer direct memory usage. Default threshold: 95%
  • HiveServer Non-Heap Memory Usage Statistics: HiveServer non-heap memory usage. Default threshold: 95%
  • HiveServer Heap Memory Usage Statistics: HiveServer heap memory usage. Default threshold: 95%
  Session
  • Percentage of Sessions Connected to the HiveServer to Maximum Number of Sessions Allowed by the HiveServer: Percentage of the sessions connected to the HiveServer to the maximum number of sessions the HiveServer allows. Default threshold: 90.0%

Kafka
  Partition
  • Percentage of Partitions That Are Not Completely Synchronized: Percentage of partitions that are not completely synchronized to total partitions. Default threshold: 50%
  Others
  • Unavailable Partition Percentage: Percentage of unavailable partitions of each Kafka topic. Default threshold: 40%
  • User Connection Usage on Broker: Usage of user connections on the Broker. Default threshold: 80%
  Disk
  • Broker Disk Usage: Usage of the disk where the Broker data directory is located. Default threshold: 80.0%
  • Disk I/O Rate of a Broker: I/O usage of the disk where the Broker data directory is located. Default threshold: 80%
  Process
  • Broker GC Duration per Minute: GC duration of the Broker process per minute. Default threshold: 12000 ms
  • Heap Memory Usage of Kafka: Kafka heap memory usage. Default threshold: 95%
  • Kafka Direct Memory Usage: Kafka direct memory usage. Default threshold: 95%

Loader
  Memory
  • Heap Memory Usage Calculate: Loader heap memory usage. Default threshold: 95%
  • Direct Memory Usage of Loader: Loader direct memory usage. Default threshold: 80.0%
  • Non-heap Memory Usage of Loader: Loader non-heap memory usage. Default threshold: 80%
  GC
  • Total GC time of Loader: Total GC time of Loader. Default threshold: 12000 ms

MapReduce
  Garbage Collection
  • GC Time: GC time. Default threshold: 12000 ms
  Resource
  • JobHistoryServer Direct Memory Usage Statistics: JobHistoryServer direct memory usage. Default threshold: 90%
  • JobHistoryServer Non-Heap Memory Usage Statistics: JobHistoryServer non-heap memory usage. Default threshold: 90%
  • JobHistoryServer Heap Memory Usage Statistics: JobHistoryServer heap memory usage. Default threshold: 95%

Oozie
  Memory
  • Oozie Heap Memory Usage Calculate: Oozie heap memory usage. Default threshold: 95.0%
  • Oozie Direct Memory Usage: Oozie direct memory usage. Default threshold: 80.0%
  • Oozie Non-heap Memory Usage: Oozie non-heap memory usage. Default threshold: 80%
  GC
  • Total GC duration of Oozie: Total GC time of Oozie. Default threshold: 12000 ms

Spark/Spark2x
  Memory
  • JDBCServer2x Heap Memory Usage Statistics: JDBCServer2x heap memory usage. Default threshold: 95%
  • JDBCServer2x Direct Memory Usage Statistics: JDBCServer2x direct memory usage. Default threshold: 95%
  • JDBCServer2x Non-Heap Memory Usage Statistics: JDBCServer2x non-heap memory usage. Default threshold: 95%
  • JobHistory2x Direct Memory Usage Statistics: JobHistory2x direct memory usage. Default threshold: 95%
  • JobHistory2x Non-Heap Memory Usage Statistics: JobHistory2x non-heap memory usage. Default threshold: 95%
  • JobHistory2x Heap Memory Usage Statistics: JobHistory2x heap memory usage. Default threshold: 95%
  • IndexServer2x Direct Memory Usage Statistics: IndexServer2x direct memory usage. Default threshold: 95%
  • IndexServer2x Heap Memory Usage Statistics: IndexServer2x heap memory usage. Default threshold: 95%
  • IndexServer2x Non-Heap Memory Usage Statistics: IndexServer2x non-heap memory usage. Default threshold: 95%
  GC Count
  • Full GC Number of JDBCServer2x: Number of full GCs of JDBCServer2x. Default threshold: 12
  • Full GC Number of JobHistory2x: Number of full GCs of JobHistory2x. Default threshold: 12
  • Full GC Number of IndexServer2x: Number of full GCs of IndexServer2x. Default threshold: 12
  GC Time
  • Total GC Time in Milliseconds: Total GC time of JDBCServer2x. Default threshold: 12000 ms
  • Total GC Time in Milliseconds: Total GC time of JobHistory2x. Default threshold: 12000 ms
  • Total GC Time in Milliseconds: Total GC time of IndexServer2x. Default threshold: 12000 ms

Storm
  Cluster
  • Number of Available Supervisors: Number of available Supervisor processes in the cluster in a measurement period. Default threshold: 1
  • Slot Usage: Slot usage in the cluster in a measurement period. Default threshold: 80.0%
  Nimbus
  • Nimbus Heap Memory Usage Calculate: Nimbus heap memory usage. Default threshold: 80%

Yarn
  Resources
  • NodeManager Direct Memory Usage Statistics: NodeManager direct memory usage. Default threshold: 90%
  • NodeManager Heap Memory Usage Statistics: NodeManager heap memory usage. Default threshold: 95%
  • NodeManager Non-Heap Memory Usage Statistics: NodeManager non-heap memory usage. Default threshold: 90%
  • ResourceManager Direct Memory Usage Statistics: ResourceManager direct memory usage. Default threshold: 90%
  • ResourceManager Heap Memory Usage Statistics: ResourceManager heap memory usage. Default threshold: 95%
  • ResourceManager Non-Heap Memory Usage Statistics: ResourceManager non-heap memory usage. Default threshold: 90%
  Garbage collection
  • GC Time: GC duration of NodeManager per minute. Default threshold: 12000 ms
  • GC Time: GC duration of ResourceManager per minute. Default threshold: 12000 ms
  Others
  • Failed Applications of root queue: Number of failed tasks in the root queue. Default threshold: 50
  • Terminated Applications of root queue: Number of killed tasks in the root queue. Default threshold: 50
  CPU & memory
  • Pending Memory: Pending memory capacity. Default threshold: 83886080 MB
  Application
  • Pending Applications: Number of pending tasks. Default threshold: 60

ZooKeeper
  Connection
  • ZooKeeper Connections Usage: Percentage of the used connections to the total connections of ZooKeeper. Default threshold: 80%
  CPU & memory
  • Heap Memory Usage Calculate: ZooKeeper heap memory usage. Default threshold: 95%
  • Directmemory Usage Calculate: ZooKeeper direct memory usage. Default threshold: 80%
  GC
  • ZooKeeper GC Duration per Minute: GC duration of ZooKeeper per minute. Default threshold: 12000 ms

Ranger
  GC
  • UserSync GC Duration: UserSync garbage collection (GC) duration. Default threshold: 12000 ms
  • RangerAdmin GC Duration: RangerAdmin GC duration. Default threshold: 12000 ms
  • TagSync GC Duration: TagSync GC duration. Default threshold: 12000 ms
  CPU & memory
  • UserSync Non-Heap Memory Usage: UserSync non-heap memory usage. Default threshold: 80.0%
  • UserSync Direct Memory Usage: UserSync direct memory usage. Default threshold: 80.0%
  • UserSync Heap Memory Usage: UserSync heap memory usage. Default threshold: 95.0%
  • RangerAdmin Non-Heap Memory Usage: RangerAdmin non-heap memory usage. Default threshold: 80.0%
  • RangerAdmin Heap Memory Usage: RangerAdmin heap memory usage. Default threshold: 95.0%
  • RangerAdmin Direct Memory Usage: RangerAdmin direct memory usage. Default threshold: 80.0%
  • TagSync Direct Memory Usage: TagSync direct memory usage. Default threshold: 80.0%
  • TagSync Non-Heap Memory Usage: TagSync non-heap memory usage. Default threshold: 80.0%
  • TagSync Heap Memory Usage: TagSync heap memory usage. Default threshold: 95.0%

ClickHouse
  Cluster Quota
  • ClickHouse service quantity quota usage in ZooKeeper: Usage of the znode quantity quota that the ClickHouse service holds in ZooKeeper. Default threshold: 90%
  • Capacity quota usage of the ClickHouse service in ZooKeeper: Usage of the capacity quota of the ZooKeeper directory used by the ClickHouse service. Default threshold: 90%

IoTDB
  GC
  • IoTDBServer GC Duration: IoTDBServer garbage collection (GC) duration. Default threshold: 12000 ms
  CPU & memory
  • IoTDBServer Heap Memory Usage: IoTDBServer heap memory usage. Default threshold: 90%
  • IoTDBServer Direct Memory Usage: IoTDBServer direct memory usage. Default threshold: 90%