Updated on 2022-12-14 GMT+08:00

Configuring the Threshold

Scenarios

You can configure monitoring indicator thresholds to monitor the health status of indicators on FusionInsight Manager. If abnormal data occurs and the preset conditions are met, the system triggers an alarm and displays the alarm information on the alarm page.

Procedure

  1. Log in to FusionInsight Manager.
  2. Choose O&M > Alarm > Thresholds.
  3. Select a monitoring indicator for a specified host or service in the cluster.

    Figure 1 Configuring indicator thresholds
    For example, after you select Host Memory Usage, the information about this indicator threshold is displayed.
    • If the alarm sending switch is turned on, an alarm is triggered when the alarm threshold is reached.
    • The alarm ID and alarm name identify the alarm that is triggered by the threshold.
    • Trigger Count: FusionInsight Manager checks whether the value of each monitored indicator reaches the threshold. If the threshold is reached in a number of consecutive checks equal to the value of Trigger Count, the system sends an alarm. The value can be customized.
    • Check Period (s): the interval at which the system checks monitoring indicators.
    • The rule list shows the rules for triggering an alarm.

  4. Click Create Rule to add rules used for monitoring indicators.

    Table 1 Monitoring indicator rule parameters

    | Parameter | Value | Description |
    |---|---|---|
    | Rule Name | CPU_MAX (example value) | Name of a rule. |
    | Alarm Severity | Critical, Major, Minor, Warning | Severity of the alarm: Critical, Major, Minor, or Warning. |
    | Threshold Type | Max value, Min value | Whether the threshold is a maximum or minimum value of the indicator. If this parameter is set to Max value, the system generates an alarm when the actual value of the indicator is greater than the threshold. If this parameter is set to Min value, the system generates an alarm when the actual value of the indicator is less than the threshold. |
    | Date | Daily, Weekly, Others | Date when the rule takes effect. |
    | Add Date | 09-30 | This parameter is available only when Date is set to Others. You can set the date when the rule takes effect. Multiple dates can be selected. |
    | Thresholds | Start and End Time: 00:00 to 08:30 | Time range when the rule takes effect. |
    | Thresholds | Threshold: 10 | Threshold of the indicator monitored by the rule. |

    For the last parameter in the table, you can click the add or delete icon to add or delete start and end times or alarm indicator thresholds.

  5. Click OK to save the rules.
  6. Locate the row that contains an added rule, and click Apply in the Operation column. The value of Effective for this rule changes to Yes.

    To apply a different rule, you must first click Cancel for the rule that is currently applied.
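
The way Threshold Type, the start and end time, and Trigger Count interact can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and field names are hypothetical and are not a FusionInsight Manager API, the end of the time range is treated as inclusive, and a check that does not reach the threshold is assumed to reset the consecutive count.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class ThresholdRule:
    """Hypothetical model of one rule; field names are illustrative only."""
    name: str
    threshold_type: str  # "max" or "min", mirroring Threshold Type
    start: time          # start of the effective time range
    end: time            # end of the effective time range (assumed inclusive)
    threshold: float

    def breached(self, value: float, now: time) -> bool:
        """Return True if the rule is in effect at `now` and `value` crosses the threshold."""
        if not (self.start <= now <= self.end):
            return False  # outside the effective time range
        if self.threshold_type == "max":
            return value > self.threshold
        if self.threshold_type == "min":
            return value < self.threshold
        raise ValueError(f"unknown threshold type: {self.threshold_type}")


@dataclass
class ConsecutiveTrigger:
    """Counts consecutive breaching checks against Trigger Count."""
    trigger_count: int
    streak: int = 0

    def check(self, breached: bool) -> bool:
        """Record one check (run once per Check Period); True means send the alarm."""
        # Assumption: a check that does not breach resets the streak.
        self.streak = self.streak + 1 if breached else 0
        return self.streak == self.trigger_count


# Example values from Table 1: CPU_MAX, 00:00 to 08:30, threshold 10.
rule = ThresholdRule("CPU_MAX", "max", time(0, 0), time(8, 30), 10)
trigger = ConsecutiveTrigger(trigger_count=3)
samples = [12.0, 15.0, 8.0, 11.0, 13.0, 16.0]  # made-up indicator values
alarms = [trigger.check(rule.breached(v, time(7, 0))) for v in samples]
print(alarms)  # only the last check completes three breaches in a row
```

In this sketch the third sample falls below the threshold, so the count starts over and the alarm is raised only on the sixth check.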

Monitoring Indicator Reference

FusionInsight Manager alarm monitoring indicators are categorized into node information indicators and cluster service indicators. Table 2 describes the node indicators whose thresholds can be configured, and Table 3 describes the cluster service indicators.

Table 2 Monitoring indicators on each node

| Monitoring Indicator Group Name | Indicator Name | Description | Default Threshold |
|---|---|---|---|
| CPU | Host CPU Usage | This indicator reflects the computing and control capabilities of the current cluster in a measurement period. By observing the indicator value, you can better understand the overall resource usage of the cluster. | 90.0% |
| Disk | Disk Usage | Indicates the disk usage of a host. | 90.0% |
| Disk | Disk Inode Usage | Indicates the disk inode usage in a measurement period. | 80.0% |
| Memory | Host Memory Usage | Indicates the average memory usage at the current time. | 90.0% |
| Host Status | Host File Handle Usage | Indicates the usage of file handles of the host in a measurement period. | 80.0% |
| Host Status | Host PID Usage | Indicates the PID usage of a host. | 90% |
| Network Status | TCP Ephemeral Port Usage | Indicates the usage of temporary TCP ports of the host in a measurement period. | 80.0% |
| Network Reading | Read Packet Error Rate | Indicates the read packet error rate of the network interface on the host in a measurement period. | 0.5% |
| Network Reading | Read Packet Dropped Rate | Indicates the read packet dropped rate of the network interface on the host in a measurement period. | 0.5% |
| Network Reading | Read Throughput Rate | Indicates the average read throughput (at MAC layer) of the network interface in a measurement period. | 80% |
| Network Writing | Write Packet Error Rate | Indicates the write packet error rate of the network interface on the host in a measurement period. | 0.5% |
| Network Writing | Write Packet Dropped Rate | Indicates the write packet dropped rate of the network interface on the host in a measurement period. | 0.5% |
| Network Writing | Write Throughput Rate | Indicates the average write throughput (at MAC layer) of the network interface in a measurement period. | 80% |
| Process | Uninterruptible Sleep Process | Indicates the number of D state processes on the host in a measurement period. | 0 |
| Process | omm Process Usage | Indicates the usage of the omm process within a measurement period. | 90 |
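
As a rough illustration of how the defaults in Table 2 could be used outside the console, the sketch below compares a set of readings against a few of those defaults. It assumes that these defaults act as Max value thresholds (an alarm condition when the reading is greater than the threshold). The dictionary, the function name, and the sample readings are hypothetical and only serve the example.

```python
# Default thresholds copied from Table 2 (percent values; a subset for brevity).
NODE_DEFAULTS = {
    "Host CPU Usage": 90.0,
    "Disk Usage": 90.0,
    "Disk Inode Usage": 80.0,
    "Host Memory Usage": 90.0,
    "Host File Handle Usage": 80.0,
    "TCP Ephemeral Port Usage": 80.0,
}


def over_default(readings: dict) -> list:
    """Return the indicator names whose reading is greater than the default threshold."""
    # Assumption: each default is treated as a Max value threshold.
    return sorted(
        name
        for name, value in readings.items()
        if name in NODE_DEFAULTS and value > NODE_DEFAULTS[name]
    )


# Made-up readings for one host.
readings = {"Host CPU Usage": 93.5, "Disk Usage": 71.0, "Disk Inode Usage": 80.0}
flagged = over_default(readings)
print(flagged)
```

A reading that equals the threshold, such as Disk Inode Usage at 80.0 here, is not flagged because the comparison is strictly greater than.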

Table 3 Cluster service indicators

| Service | Monitoring Indicator Group Name | Indicator Name | Description | Default Threshold |
|---|---|---|---|---|
| DBService | Database | Database Connections Usage | Indicates the usage of the number of database connections. | 90% |
| DBService | Database | Disk Space Usage of the Data Directory | Indicates the disk space usage of the data directory. | 80% |
| Flume | Agent | Heap Memory Usage Calculate | Indicates the Flume heap memory usage. | 95.0% |
| Flume | Agent | Flume Direct Memory Usage Statistics | Indicates the Flume direct memory usage. | 80.0% |
| Flume | Agent | Flume Non-heap Memory Usage | Indicates the Flume non-heap memory usage. | 80.0% |
| Flume | Agent | Total GC duration of Flume process | Indicates the Flume total GC time. | 12000ms |
| HBase | GC | GC time for old generation | Indicates the total GC time of RegionServer. | 5000ms |
| HBase | GC | GC time for old generation | Indicates the total GC time of HMaster. | 5000ms |
| HBase | CPU and Memory | RegionServer Direct Memory Usage Statistics | Indicates the RegionServer direct memory usage. | 90% |
| HBase | CPU and Memory | RegionServer Heap Memory Usage Statistics | Indicates the RegionServer heap memory usage. | 90% |
| HBase | CPU and Memory | HMaster Direct Memory Usage | Indicates the HMaster direct memory usage. | 90% |
| HBase | CPU and Memory | HMaster Heap Memory Usage Statistics | Indicates the HMaster heap memory usage. | 90% |
| HBase | Service | Regions | Indicates the number of regions of a RegionServer. | 2000 |
| HBase | Service | Region in transaction count over threshold | Indicates the number of regions that are in the RIT state and reach the threshold duration. | 1 |
| HBase | Replication | Replication sync failed times | Indicates the number of times that DR data fails to be synchronized. | 1 |
| HBase | Queue | Compaction Queue Size | Indicates the compaction queue size. | 100 |
| HDFS | File and Block | Lost Blocks | Indicates the number of missing copy blocks in the HDFS file system. | 0 |
| HDFS | File and Block | Blocks Under Replicated | Indicates the total number of blocks that need to be replicated by the NameNode. | 1000 |
| HDFS | RPC | Average Time of Active NameNode RPC Processing | Indicates the average RPC processing time. | 100ms |
| HDFS | RPC | Average Time of Active NameNode RPC Queuing | Indicates the average RPC queuing time. | 200ms |
| HDFS | Disk | Disk Usage | Indicates the HDFS disk usage. | 80% |
| HDFS | Disk | Percentage of DataNode Capacity | Indicates the disk usage of DataNodes in the HDFS. | 80% |
| HDFS | Disk | Percentage of Reserved Space for Replicas of Unused Space | Indicates the percentage of the reserved disk space of all the copies to the total unused disk space of DataNodes. | 90% |
| HDFS | Resource | Faulty DataNodes | Indicates the number of faulty DataNodes. | 3 |
| HDFS | Resource | NameNode Non Heap Memory Usage Statistics | Indicates the percentage of NameNode non-heap memory usage. | 90% |
| HDFS | Resource | NameNode Direct Memory Usage Statistics | Indicates the percentage of direct memory used by NameNodes. | 90% |
| HDFS | Resource | NameNode Heap Memory Usage Statistics | Indicates the percentage of NameNode heap memory usage. | 95% |
| HDFS | Resource | DataNode Non Heap Memory Usage Statistics | Indicates the percentage of DataNode non-heap memory usage. | 90% |
| HDFS | Resource | DataNode Direct Memory Usage Statistics | Indicates the percentage of direct memory used by DataNodes. | 90% |
| HDFS | Resource | DataNode Heap Memory Usage Statistics | Indicates the percentage of DataNode heap memory usage. | 95% |
| HDFS | Garbage Collection | GC Time | Indicates the garbage collection (GC) duration of NameNodes per minute. | 12000ms |
| HDFS | Garbage Collection | GC Time | Indicates the GC duration of DataNodes per minute. | 12000ms |
| Hive | HQL | Percentage of HQL Statements That Are Executed Successfully by Hive | Indicates the percentage of HQL statements that are executed successfully by Hive. | 90.0% |
| Hive | Background | Background Thread Usage | Indicates the percentage of background thread usage. | 90% |
| Hive | GC | Total GC Time in Milliseconds | Indicates the total GC time of MetaStore. | 12000ms |
| Hive | GC | Total GC Time in Milliseconds | Indicates the total GC time of HiveServer. | 12000ms |
| Hive | Capacity | Percentage of HDFS Space Used by Hive to the Available Space | Indicates the percentage of HDFS space used by Hive to the available space. | 85.0% |
| Hive | CPU and Memory | MetaStore Direct Memory Usage Statistics | Indicates the MetaStore direct memory usage. | 95% |
| Hive | CPU and Memory | MetaStore Non-Heap Memory Usage Statistics | Indicates the MetaStore non-heap memory usage. | 95% |
| Hive | CPU and Memory | MetaStore Heap Memory Usage Statistics | Indicates the MetaStore heap memory usage. | 95% |
| Hive | CPU and Memory | HiveServer Direct Memory Usage Statistics | Indicates the HiveServer direct memory usage. | 95% |
| Hive | CPU and Memory | HiveServer Non-Heap Memory Usage Statistics | Indicates the HiveServer non-heap memory usage. | 95% |
| Hive | CPU and Memory | HiveServer Heap Memory Usage Statistics | Indicates the HiveServer heap memory usage. | 95% |
| Hive | Session | Percentage of Sessions Connected to the HiveServer to Maximum Number of Sessions Allowed by the HiveServer | Indicates the percentage of the number of sessions connected to the HiveServer to the maximum number of sessions allowed by the HiveServer. | 90.0% |
| Kafka | Partition | Percentage of Partitions That Are Not Completely Synchronized | Indicates the percentage of partitions that are not completely synchronized to total partitions. | 50% |
| Kafka | Other | Unavailable Partition Percentage | Indicates the percentage of unavailable partitions. | 40% |
| Kafka | Other | User Connection Usage on Broker | Indicates the user connection usage on the Broker. | 80% |
| Kafka | Disk | Broker Disk Usage | Indicates the disk usage of the disk where the Broker data directory is located. | 80% |
| Kafka | Process | Broker GC Duration per Minute | Indicates the GC duration of the Broker process per minute. | 12000ms |
| Kafka | Process | Heap Memory Usage of Kafka | Indicates the Kafka heap memory usage. | 95% |
| Kafka | Process | Kafka Direct Memory Usage | Indicates the Kafka direct memory usage. | 95% |
| Loader | Memory | Heap Memory Usage Calculate | Indicates the Loader heap memory usage. | 95% |
| Loader | Memory | Loader Direct Memory Usage Statistics | Indicates the Loader direct memory usage. | 80.0% |
| Loader | Memory | Non heap Memory Usage Calculate | Indicates the Loader non-heap memory usage. | 80% |
| Loader | GC | Total GC time in milliseconds | Indicates the total GC time of Loader. | 12000ms |
| MapReduce | Garbage Collection | GC Time | Indicates the GC time. | 12000ms |
| MapReduce | Resource | JobHistoryServer Direct Memory Usage Statistics | Indicates the JobHistoryServer direct memory usage. | 90% |
| MapReduce | Resource | JobHistoryServer Non Heap Memory Usage Statistics | Indicates the JobHistoryServer non-heap memory usage. | 90% |
| MapReduce | Resource | JobHistoryServer Heap Memory Usage Statistics | Indicates the JobHistoryServer heap memory usage. | 95% |
| Oozie | Memory | Heap Memory Usage Calculate | Indicates the Oozie heap memory usage. | 95.0% |
| Oozie | Memory | Oozie Direct Buffer Resource Percentage | Indicates the Oozie direct memory usage. | 80.0% |
| Oozie | Memory | Non Heap Memory Usage Calculate | Indicates the Oozie non-heap memory usage. | 80% |
| Oozie | GC | Total GC duration of Oozie process | Indicates the Oozie total GC time. | 12000ms |
| Spark2x | Memory | JDBCServer2x Heap Memory Usage Statistics | Indicates the JDBCServer2x heap memory usage. | 95% |
| Spark2x | Memory | JDBCServer2x Direct Memory Usage Statistics | Indicates the JDBCServer2x direct memory usage. | 95% |
| Spark2x | Memory | JDBCServer2x Non-Heap Memory Usage Statistics | Indicates the JDBCServer2x non-heap memory usage. | 95% |
| Spark2x | Memory | JobHistory2x Direct Memory Usage Statistics | Indicates the JobHistory2x direct memory usage. | 95% |
| Spark2x | Memory | JobHistory2x Non-Heap Memory Usage Statistics | Indicates the JobHistory2x non-heap memory usage. | 95% |
| Spark2x | Memory | JobHistory2x Heap Memory Usage Statistics | Indicates the JobHistory2x heap memory usage. | 95% |
| Spark2x | Memory | IndexServer2x Direct Memory Usage Statistics | Indicates the IndexServer2x direct memory usage. | 95% |
| Spark2x | Memory | IndexServer2x Heap Memory Usage Statistics | Indicates the IndexServer2x heap memory usage. | 95% |
| Spark2x | Memory | IndexServer2x Non-Heap Memory Usage Statistics | Indicates the IndexServer2x non-heap memory usage. | 95% |
| Spark2x | GC number | Full GC Number of JDBCServer2x | Indicates the total GC number of JDBCServer2x. | 12 |
| Spark2x | GC number | Full GC Number of JobHistory2x | Indicates the total GC number of JobHistory2x. | 12 |
| Spark2x | GC number | Full GC Number of IndexServer2x | Indicates the total GC number of IndexServer2x. | 12 |
| Spark2x | GC Time | Total GC time in milliseconds | Indicates the total GC time of JDBCServer2x. | 12000ms |
| Spark2x | GC Time | Total GC time in milliseconds | Indicates the total GC time of JobHistory2x. | 12000ms |
| Spark2x | GC Time | Total GC time in milliseconds | Indicates the total GC time of IndexServer2x. | 12000ms |
| Storm | Cluster | Number of Available Supervisors | Indicates the number of available Supervisor processes in the cluster in a measurement period. | 1 |
| Storm | Cluster | Slot Usage | Indicates the slot usage in the cluster in a measurement period. | 80.0% |
| Storm | Nimbus | Heap Memory Usage Calculate | Indicates the Nimbus heap memory usage. | 80% |
| Yarn | Resource | NodeManager Direct Memory Usage Statistics | Indicates the percentage of direct memory used by NodeManagers. | 90% |
| Yarn | Resource | NodeManager Heap Memory Usage Statistics | Indicates the percentage of NodeManager heap memory usage. | 95% |
| Yarn | Resource | NodeManager Non Heap Memory Usage Statistics | Indicates the percentage of NodeManager non-heap memory usage. | 90% |
| Yarn | Resource | ResourceManager Direct Memory Usage Statistics | Indicates the ResourceManager direct memory usage. | 90% |
| Yarn | Resource | ResourceManager Heap Memory Usage Statistics | Indicates the ResourceManager heap memory usage. | 95% |
| Yarn | Resource | ResourceManager Non Heap Memory Usage Statistics | Indicates the ResourceManager non-heap memory usage. | 90% |
| Yarn | CPU and Memory | Pending Memory | Indicates the pending memory capacity. | 83886080MB |
| Yarn | Other | Failed Applications of root queue | Indicates the number of failed tasks in the root queue. | 50 |
| Yarn | Other | Terminated Applications of root queue | Indicates the number of killed tasks in the root queue. | 50 |
| Yarn | Garbage collection | GC Time | Indicates the GC duration of NodeManager per minute. | 12000ms |
| Yarn | Garbage collection | GC Time | Indicates the GC duration of ResourceManager per minute. | 12000ms |
| Yarn | Application | Pending Applications | Indicates the number of pending tasks. | 60 |
| ZooKeeper | Connection | ZooKeeper Connections Usage | Indicates the percentage of the used connections to the total connections of ZooKeeper. | 80% |
| ZooKeeper | CPU and Memory | Heap Memory Usage Calculate | Indicates the ZooKeeper heap memory usage. | 95% |
| ZooKeeper | CPU and Memory | Direct Memory Usage Calculate | Indicates the ZooKeeper direct memory usage. | 80% |
| ZooKeeper | GC | ZooKeeper GC Duration per Minute | Indicates the GC time of ZooKeeper every minute. | 12000ms |
| meta | OBS Meta data Operations | Average Time for Calling the OBS Metadata API | Indicates the average time for calling the OBS metadata APIs. | 500ms |
| meta | OBS Meta data Operations | Success Rate for Calling the OBS Metadata API | Indicates the success rate of calling the OBS metadata APIs. | 99.0% |
| meta | OBS data write operation | Success Rate for Calling the OBS Write API | Indicates the success rate of calling the OBS data write APIs. | 99.0% |
| meta | OBS data read operation | Success Rate for Calling the OBS Data Read API | Indicates the success rate of calling the OBS data read operation APIs. | 99.0% |
| Ranger | GC | UserSync GC Duration | Indicates the UserSync garbage collection (GC) duration. | 12000ms |
| Ranger | GC | RangerAdmin GC Duration | Indicates the RangerAdmin garbage collection (GC) duration. | 12000ms |
| Ranger | GC | TagSync GC Duration | Indicates the TagSync garbage collection (GC) duration. | 12000ms |
| Ranger | CPU and Memory | UserSync Non-Heap Memory Usage | Indicates the UserSync non-heap memory usage in percentage. | 80.0% |
| Ranger | CPU and Memory | UserSync Direct Memory Usage | Indicates the UserSync direct memory usage in percentage. | 80.0% |
| Ranger | CPU and Memory | UserSync Heap Memory Usage | Indicates the UserSync heap memory usage in percentage. | 95.0% |
| Ranger | CPU and Memory | RangerAdmin Non-Heap Memory Usage | Indicates the RangerAdmin non-heap memory usage in percentage. | 80.0% |
| Ranger | CPU and Memory | RangerAdmin Heap Memory Usage | Indicates the RangerAdmin heap memory usage in percentage. | 95.0% |
| Ranger | CPU and Memory | RangerAdmin Direct Memory Usage | Indicates the RangerAdmin direct memory usage in percentage. | 80.0% |
| Ranger | CPU and Memory | TagSync Direct Memory Usage | Indicates the TagSync direct memory usage in percentage. | 80.0% |
| Ranger | CPU and Memory | TagSync Non-Heap Memory Usage | Indicates the TagSync non-heap memory usage in percentage. | 80.0% |
| Ranger | CPU and Memory | TagSync Heap Memory Usage | Indicates the TagSync heap memory usage in percentage. | 95.0% |
| ClickHouse | Cluster Quota | Clickhouse service quantity quota usage in ZooKeeper | Indicates the quota of the ZooKeeper nodes used by the ClickHouse service. | 90% |
| ClickHouse | Cluster Quota | Capacity quota usage of the Clickhouse service in ZooKeeper | Indicates the capacity quota of the ZooKeeper directory used by the ClickHouse service. | 90% |
| IoTDB | GC | IoTDBServer GC Duration | Indicates the IoTDBServer garbage collection (GC) duration. | 12000ms |
| IoTDB | CPU and Memory | IoTDBServer Heap Memory Usage | Indicates the IoTDBServer heap memory usage in percentage. | 90% |
| IoTDB | CPU and Memory | IoTDBServer Direct Memory Usage | Indicates the IoTDBServer direct memory usage in percentage. | 90% |
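
The Default Threshold column mixes percentages, durations in milliseconds, a capacity in MB, and plain counts. If you copy these defaults into your own scripts, a small helper such as the following sketch could split each entry into a number and a unit. The function name is hypothetical, and the set of recognized units (%, ms, MB, or none) is simply what appears in the tables above.

```python
import re
from typing import Tuple

# Matches entries such as "90.0%", "12000ms", "83886080MB", or "0".
_THRESHOLD_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(%|ms|MB)?")


def parse_threshold(text: str) -> Tuple[float, str]:
    """Split a Default Threshold entry into (value, unit); unit is "" for plain counts."""
    match = _THRESHOLD_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized threshold: {text!r}")
    return float(match.group(1)), match.group(2) or ""


for entry in ["90.0%", "12000ms", "83886080MB", "0"]:
    print(entry, "->", parse_threshold(entry))
```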