Updated on 2024-08-12 GMT+08:00

Configuring Thresholds for Alarms

MRS clusters provide easy-to-use alarming functions with intuitive monitoring metric views. You can quickly view statistics on key performance indicators (KPIs) of a cluster and evaluate the cluster health status. MRS allows you to configure metric thresholds so that you stay informed of the cluster health status. If a metric breaches its threshold, the system generates an alarm and displays it on the metric dashboard.

If you have verified that some alarms have a negligible impact on services, or that some alarm thresholds need to be adjusted, you can customize the thresholds of cluster metrics or mask the alarms as required.

You can set thresholds for alarms of node information metrics and cluster service metrics. For details about these metrics, their impacts on the system, and default thresholds, see Monitoring Metric Reference.

The conditions reported by these alarms may affect cluster functions or running jobs. Before you mask alarms or modify alarm rules, evaluate the operation risks.

Modifying Rules for Alarms with Custom Thresholds

  1. Log in to FusionInsight Manager of the target MRS cluster by referring to Accessing Log in the FusionInsight Manager (MRS 3.x or Later).
  2. Choose O&M > Alarm > Thresholds.
  3. Select a metric for a host or service in the cluster. For example, select Host Memory Usage.

    Figure 1 Viewing an alarm threshold
    • Switch: If this switch is turned on, an alarm will be triggered when the metric breaches this threshold.
    • Trigger Count: Manager periodically checks whether the metric breaches the threshold. An alarm is generated when the number of consecutive checks in which the metric breaches the threshold reaches the value of Trigger Count. The value can be customized. If an alarm is reported too frequently, you can set Trigger Count to a larger value to reduce the alarm frequency.
    • Check Period (s): Interval, in seconds, between two consecutive checks.
    • The rules to trigger alarms are listed on the page.

  4. Modify an alarm rule.

    • Add a new rule.
      1. Click Create Rule to add a rule that defines how an alarm will be triggered. For details, see Table 1.
      2. Click OK to save the rule.
      3. Locate the row that contains a rule that is in use, and click Cancel in the Operation column. If no rule is in use, skip this step.
      4. Locate the row that contains the new rule, and click Apply in the Operation column. The value of Effective for this rule changes to Yes.
    • Modify an existing rule.
      1. Click Modify in the Operation column of the row that contains the target rule.
      2. Modify rule parameters by referring to Table 1.
      3. Click OK.

    The following table lists the rule parameters you need to set for triggering an alarm of Host Memory Usage.

    Table 1 Alarm rule parameters

    | Parameter | Description | Example Value |
    | --- | --- | --- |
    | Rule Name | Rule name | mrs_test |
    | Severity | Alarm severity. The options are as follows: Critical, Major, Minor, Warning | Major |
    | Threshold Type | Maximum or minimum value of a metric. Max value: An alarm will be generated when the metric value is greater than this value. Min value: An alarm will be generated when the metric value is less than this value. | Max. Value |
    | Date | How often the rule takes effect: Daily, Weekly, Others | Daily |
    | Add Date | Date when the rule takes effect. This parameter is available only when Date is set to Others. You can set multiple dates. | - |
    | Thresholds | Start and End Time: Period when the rule takes effect. | 00:00 - 23:59 |
    | Thresholds | Threshold: Alarm threshold value | 85 |
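
The way Threshold Type, the Start and End Time window, and Trigger Count work together can be sketched in Python. This is only an illustration under stated assumptions, not FusionInsight Manager code: the ThresholdRule class and the first_alarm_index function are hypothetical names, the sample values are made up, and the sketch assumes that a check that passes (or that falls outside the rule's time window) resets the count of consecutive breaches.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class ThresholdRule:
    """Hypothetical model of one rule from Table 1 (not a Manager API)."""
    name: str
    severity: str
    threshold_type: str  # "max" or "min"
    start: time
    end: time
    threshold: float

    def is_active(self, now: time) -> bool:
        # The rule applies only inside its Start and End Time window.
        return self.start <= now <= self.end

    def breached(self, value: float) -> bool:
        # Max value: breach when the metric is greater than the threshold.
        # Min value: breach when the metric is less than the threshold.
        if self.threshold_type == "max":
            return value > self.threshold
        return value < self.threshold


def first_alarm_index(rule, samples, trigger_count):
    """Return the index of the check that raises the alarm, or None.

    samples is a list of (time, value) pairs, one pair per check period.
    """
    consecutive = 0
    for index, (now, value) in enumerate(samples):
        if rule.is_active(now) and rule.breached(value):
            consecutive += 1
        else:
            consecutive = 0  # assumption: a passing check resets the streak
        if consecutive >= trigger_count:
            return index
    return None


# Example values from Table 1: Max value threshold of 85, effective all day.
rule = ThresholdRule("mrs_test", "Major", "max", time(0, 0), time(23, 59), 85.0)
samples = [
    (time(10, 0, 0), 88.0),
    (time(10, 0, 30), 80.0),
    (time(10, 1, 0), 90.0),
    (time(10, 1, 30), 91.0),
    (time(10, 2, 0), 92.0),
]
print(first_alarm_index(rule, samples, trigger_count=1))  # 0
print(first_alarm_index(rule, samples, trigger_count=3))  # 4
```

With Trigger Count set to 1, the first breach raises the alarm. With Trigger Count set to 3, the isolated spike at the first check is ignored and the alarm is raised only after three breaches in a row, which is why a larger value reduces the alarm frequency.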

Masking Specified Alarms

  1. Log in to FusionInsight Manager of the target MRS cluster by referring to Accessing Log in the FusionInsight Manager (MRS 3.x or Later).
  2. Choose O&M > Alarm > Masking.
  3. In the list on the left of the displayed page, select the target service or module.
  4. Click Mask in the Operation column of the alarm you want to mask. In the dialog box that is displayed, click OK to change the masking status of the alarm to Mask.

    Figure 2 Masking an alarm
    • You can search for specified alarms in the list.
    • To cancel alarm masking, click Unmask in the row of the target alarm. In the dialog box that is displayed, click OK to change the alarm masking status to Display.
    • If you need to perform operations on multiple alarms at a time, select the alarms and click Mask or Unmask on the top of the list.

FAQ

  • How Do I View Uncleared Alarms in a Cluster?
    1. Log in to the MRS management console.
    2. Click the name of the target cluster and click the Alarms tab.
    3. Click Advanced Search, set Alarm Status to Uncleared, and click Search.
    4. Uncleared alarms of the current cluster are displayed.
  • How Do I Clear a Cluster Alarm?

    You can handle the alarms by referring to the alarm help. To view the help document, perform the following steps:

    • Console: Log in to the MRS management console, click the name of the target cluster, click the Alarms tab, and click View Help in the Operation column of the alarm list. Then, clear the alarm by referring to the alarm handling procedure.
    • Manager: Log in to FusionInsight Manager, choose O&M > Alarm > Alarms, and click View Help in the Operation column. Then, clear the alarm by referring to the alarm handling procedure.

Monitoring Metric Reference

FusionInsight Manager monitoring metrics are classified into node information metrics and cluster service metrics. Table 2 lists the metrics whose thresholds can be configured for a node, and Table 3 lists the metrics whose thresholds can be configured for a component.

Table 2 Node monitoring metrics and corresponding alarms

| Metric Group | Metric | ID | Alarm | Impact on System | Default Threshold |
| --- | --- | --- | --- | --- | --- |
| CPU | Host CPU Usage | 12016 | CPU Usage Exceeds the Threshold | Service processes respond slowly or become unavailable. | 90.0% |
| Disk | Disk Usage | 12017 | Insufficient Disk Capacity | Service processes become unavailable. | 90.0% |
| Disk | Disk Inode Usage | 12051 | Disk Inode Usage Exceeds the Threshold | Data cannot be properly written to the file system. | 80.0% |
| Memory | Host Memory Usage | 12018 | Memory Usage Exceeds the Threshold | Service processes respond slowly or become unavailable. | 90.0% |
| Host Status | Host File Handle Usage | 12053 | Host File Handle Usage Exceeds the Threshold | I/O operations, such as opening a file or connecting to the network, cannot be performed and programs run abnormally. | 80.0% |
| Host Status | Host PID Usage | 12027 | Host PID Usage Exceeds the Threshold | No PID is available for new processes and service processes are unavailable. | 90% |
| Network Status | TCP Temporary Port Usage | 12052 | TCP Temporary Port Usage Exceeds the Threshold | Services on the host fail to establish external connections and are interrupted. | 80.0% |
| Network Reading | Read Packet Error Rate | 12047 | Read Packet Error Rate Exceeds the Threshold | The communication is intermittently interrupted, and services time out. | 0.5% |
| Network Reading | Read Packet Dropped Rate | 12045 | Read Packet Dropped Rate Exceeds the Threshold | The service performance deteriorates or some services time out. | 0.5% |
| Network Reading | Read Throughput Rate | 12049 | Read Throughput Rate Exceeds the Threshold | The service system runs abnormally or is unavailable. | 80% |
| Network Writing | Write Packet Error Rate | 12048 | Write Packet Error Rate Exceeds the Threshold | The communication is intermittently interrupted, and services time out. | 0.5% |
| Network Writing | Write Packet Dropped Rate | 12046 | Write Packet Dropped Rate Exceeds the Threshold | The service performance deteriorates or some services time out. | 0.5% |
| Network Writing | Write Throughput Rate | 12050 | Write Throughput Rate Exceeds the Threshold | The service system runs abnormally or is unavailable. | 80% |
| Process | Total Number of Processes in D and Z States | 12028 | Number of Processes in the D State and Z State on a Host Exceeds the Threshold | Excessive system resources are used and service processes respond slowly. | 0 |
| Process | omm Process Usage | 12061 | Process Usage Exceeds the Threshold | Switching to user omm fails. New omm processes cannot be created. | 90 |
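
To illustrate how the default thresholds in Table 2 are read, the following Python sketch compares sampled node metrics with the percentage defaults of a few metrics from the table. It is a sketch under stated assumptions rather than Manager behavior: the DEFAULT_NODE_THRESHOLDS dictionary and the breached_alarm_ids function are hypothetical names, the sample values are made up, and every metric is assumed to use a Max value rule, so a breach exists only when the sample is strictly greater than the threshold.

```python
# Alarm IDs and default thresholds (in percent) copied from Table 2.
DEFAULT_NODE_THRESHOLDS = {
    "Host CPU Usage": (12016, 90.0),
    "Disk Usage": (12017, 90.0),
    "Disk Inode Usage": (12051, 80.0),
    "Host Memory Usage": (12018, 90.0),
    "Host File Handle Usage": (12053, 80.0),
    "Host PID Usage": (12027, 90.0),
    "TCP Temporary Port Usage": (12052, 80.0),
}


def breached_alarm_ids(sample):
    """Return the sorted alarm IDs whose default threshold is exceeded.

    sample maps a metric name from Table 2 to its current value in percent.
    """
    ids = []
    for metric, value in sample.items():
        alarm_id, threshold = DEFAULT_NODE_THRESHOLDS[metric]
        if value > threshold:  # Max value rule: strictly greater than
            ids.append(alarm_id)
    return sorted(ids)


# Made-up sample for one host.
sample = {
    "Host CPU Usage": 93.5,
    "Disk Usage": 71.0,
    "Host Memory Usage": 90.0,
}
print(breached_alarm_ids(sample))  # [12016]
```

Host Memory Usage at exactly 90.0 does not count as a breach here because a Max value rule fires only when the metric is greater than the threshold.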

Table 3 Cluster monitoring metrics and corresponding alarms

| Service | Metric | ID | Alarm Name | Impact on System | Default Threshold |
| --- | --- | --- | --- | --- | --- |
| DBService | Usage of the Number of Database Connections | 27005 | Database Connection Usage Exceeds the Threshold | Upper-layer services may fail to connect to the DBService database, affecting services. | 90% |
| DBService | Disk Space Usage of the Data Directory | 27006 | Disk Space Usage of the Data Directory Exceeds the Threshold | Service processes become unavailable. When the disk space usage of the data directory exceeds 90%, the database enters the read-only mode and the Database Enters the Read-Only Mode alarm is generated. As a result, service data is lost. | 80% |
| Flume | Heap Memory Resource Percentage | 24006 | Heap Memory Usage of Flume Server Exceeds the Threshold | Heap memory overflow may cause service breakdown. | 95.0% |
| Flume | Direct Memory Usage Statistics | 24007 | Flume Server Direct Memory Usage Exceeds the Threshold | Direct memory overflow may cause service breakdown. | 80.0% |
| Flume | Non-heap Memory Usage | 24008 | Flume Server Non-Heap Memory Usage Exceeds the Threshold | Non-heap memory overflow may cause service breakdown. | 80.0% |
| Flume | Total GC Duration | 24009 | Flume Server GC Duration Exceeds the Threshold | Flume data transmission efficiency decreases. | 12000ms |
| HBase | GC Duration of Old Generation | 19007 | HBase GC Duration Exceeds the Threshold | If the old generation GC duration exceeds the threshold, HBase data read and write are affected. | 5000ms |
| HBase | RegionServer Direct Memory Usage Statistics | 19009 | Direct Memory Usage of the HBase Process Exceeds the Threshold | If the available HBase direct memory is insufficient, a memory overflow occurs and the service breaks down. | 90% |
| HBase | RegionServer Heap Memory Usage Statistics | 19008 | Heap Memory Usage of the HBase Process Exceeds the Threshold | If the available HBase memory is insufficient, a memory overflow occurs and the service breaks down. | 90% |
| HBase | HMaster Direct Memory Usage | 19009 | Direct Memory Usage of the HBase Process Exceeds the Threshold | If the available HBase direct memory is insufficient, a memory overflow occurs and the service breaks down. | 90% |
| HBase | HMaster Heap Memory Usage Statistics | 19008 | Heap Memory Usage of the HBase Process Exceeds the Threshold | If the available HBase memory is insufficient, a memory overflow occurs and the service breaks down. | 90% |
| HBase | Number of Online Regions of a RegionServer | 19011 | Number of RegionServer Regions Exceeds the Threshold | The data read/write performance of HBase is affected when the number of regions on a RegionServer exceeds the threshold. | 2000 |
| HBase | Region in RIT State That Reaches the Threshold Duration | 19013 | Duration of Regions in RIT State Exceeds the Threshold | Some data in the table is lost or becomes unavailable. | 1 |
| HBase | Handler Usage of RegionServer | 19021 | Number of Active Handlers of RegionServer Exceeds the Threshold | RegionServers or HBase cannot provide services properly. | 90% |
| HBase | Synchronization Failures in Disaster Recovery | 19006 | HBase Replication Sync Failed | HBase data in a cluster fails to be synchronized to the standby cluster, causing data inconsistency between active and standby clusters. | 1 |
| HBase | Number of Log Files to Be Synchronized in the Active Cluster | 19020 | Number of HBase WAL Files to Be Synchronized Exceeds the Threshold | If the number of WAL files to be synchronized by a RegionServer exceeds the threshold, the number of ZNodes used by HBase exceeds the threshold, affecting the HBase service status. | 128 |
| HBase | Number of HFiles to Be Synchronized in the Active Cluster | 19019 | Number of HFiles to Be Synchronized Exceeds the Threshold | If the number of HFiles to be synchronized by a RegionServer exceeds the threshold, the number of ZNodes used by HBase exceeds the threshold, affecting the HBase service status. | 128 |
| HBase | Compaction Queue Size | 19018 | HBase Compaction Queue Size Exceeds the Threshold | The cluster performance may deteriorate, affecting data read and write. | 100 |
| HDFS | Lost Blocks | 14003 | Number of Lost HDFS Blocks Exceeds the Threshold | Data stored in HDFS is lost. HDFS may enter the safe mode and cannot provide write services. Lost block data cannot be restored. | 0 |
| HDFS | Blocks Under Replicated | 14028 | Number of Blocks to Be Supplemented Exceeds the Threshold | Data stored in HDFS is lost. HDFS may enter the safe mode and cannot provide write services. Lost block data cannot be restored. | 1000 |
| HDFS | Average Time of Active NameNode RPC Processing | 14021 | Average NameNode RPC Processing Time Exceeds the Threshold | NameNode cannot process the RPC requests from HDFS clients, upper-layer services that depend on HDFS, and DataNode in a timely manner. Specifically, the services that access HDFS run slowly or the HDFS service is unavailable. | 100ms |
| HDFS | Average Time of Active NameNode RPC Queuing | 14022 | Average NameNode RPC Queuing Time Exceeds the Threshold | NameNode cannot process the RPC requests from HDFS clients, upper-layer services that depend on HDFS, and DataNode in a timely manner. Specifically, the services that access HDFS run slowly or the HDFS service is unavailable. | 200ms |
| HDFS | HDFS Disk Usage | 14001 | HDFS Disk Usage Exceeds the Threshold | The performance of writing data to HDFS is affected. | 80% |
| HDFS | DataNode Disk Usage | 14002 | DataNode Disk Usage Exceeds the Threshold | Insufficient disk space affects data writes to HDFS. | 80% |
| HDFS | Percentage of Reserved Space for Replicas of Unused Space | 14023 | Percentage of Total Reserved Disk Space for Replicas Exceeds the Threshold | The performance of writing data to HDFS is affected. If all unused DataNode space is reserved for replicas, writing HDFS data fails. | 90% |
| HDFS | Total Faulty DataNodes | 14009 | Number of Dead DataNodes Exceeds the Threshold | Faulty DataNodes cannot provide HDFS services. | 3 |
| HDFS | NameNode Non-Heap Memory Usage Statistics | 14018 | NameNode Non-Heap Memory Usage Exceeds the Threshold | If the non-heap memory usage of the HDFS NameNode is too high, data read/write performance of HDFS will be affected. | 90% |
| HDFS | NameNode Direct Memory Usage Statistics | 14017 | NameNode Direct Memory Usage Exceeds the Threshold | If the available direct memory of NameNode instances is insufficient, a memory overflow may occur and the service breaks down. | 90% |
| HDFS | NameNode Heap Memory Usage Statistics | 14007 | NameNode Heap Memory Usage Exceeds the Threshold | If the heap memory usage of the HDFS NameNode is too high, data read/write performance of HDFS will be affected. | 95% |
| HDFS | DataNode Direct Memory Usage Statistics | 14016 | DataNode Direct Memory Usage Exceeds the Threshold | If the available direct memory of DataNode instances is insufficient, a memory overflow may occur and the service breaks down. | 90% |
| HDFS | DataNode Heap Memory Usage Statistics | 14008 | DataNode Heap Memory Usage Exceeds the Threshold | If the heap memory usage of the HDFS DataNode is too high, data read/write performance of HDFS will be affected. | 95% |
| HDFS | DataNode Non-Heap Memory Usage Statistics | 14019 | DataNode Non-Heap Memory Usage Exceeds the Threshold | If the non-heap memory usage of the HDFS DataNode is too high, data read/write performance of HDFS will be affected. | 90% |
| HDFS | NameNode GC Duration Statistics | 14014 | NameNode GC Duration Exceeds the Threshold | A long GC duration of the NameNode process may interrupt the services. | 12000ms |
| HDFS | DataNode GC Duration Statistics | 14015 | DataNode GC Duration Exceeds the Threshold | A long GC duration of the DataNode process may interrupt the services. | 12000ms |
| Hive | Hive SQL Execution Success Rate (Percentage) | 16002 | Hive SQL Execution Success Rate Is Lower Than the Threshold | The system configuration and performance cannot meet service processing requirements. | 90.0% |
| Hive | Background Thread Usage | 16003 | Background Thread Usage Exceeds the Threshold | There are too many background threads, so newly submitted tasks cannot run in time. | 90% |
| Hive | Total GC Duration of MetaStore | 16007 | Hive GC Duration Exceeds the Threshold | If the GC duration exceeds the threshold, Hive data read and write are affected. | 12000ms |
| Hive | Total GC Duration of HiveServer | 16007 | Hive GC Duration Exceeds the Threshold | If the GC duration exceeds the threshold, Hive data read and write are affected. | 12000ms |
| Hive | Percentage of HDFS Space Used by Hive to the Available Space | 16001 | Hive Warehouse Space Usage Exceeds the Threshold | The system fails to write data, which causes data loss. | 85.0% |
| Hive | MetaStore Direct Memory Usage Statistics | 16006 | Direct Memory Usage of the Hive Process Exceeds the Threshold | When the direct memory usage of Hive is too high, the performance of Hive task operation is affected. In addition, a memory overflow may occur so that the Hive service is unavailable. | 95% |
| Hive | MetaStore Non-Heap Memory Usage Statistics | 16008 | Non-heap Memory Usage of the Hive Service Exceeds the Threshold | When the non-heap memory usage of Hive is too high, the performance of Hive task operation is affected. In addition, a memory overflow may occur so that the Hive service is unavailable. | 95% |
| Hive | MetaStore Heap Memory Usage Statistics | 16005 | Heap Memory Usage of the Hive Process Exceeds the Threshold | When the heap memory usage of Hive is too high, the performance of Hive task operation is affected. In addition, a memory overflow may occur so that the Hive service is unavailable. | 95% |
| Hive | HiveServer Direct Memory Usage Statistics | 16006 | Direct Memory Usage of the Hive Process Exceeds the Threshold | When the direct memory usage of Hive is too high, the performance of Hive task operation is affected. In addition, a memory overflow may occur so that the Hive service is unavailable. | 95% |
| Hive | HiveServer Non-Heap Memory Usage Statistics | 16008 | Non-heap Memory Usage of the Hive Service Exceeds the Threshold | When the non-heap memory usage of Hive is too high, the performance of Hive task operation is affected. In addition, a memory overflow may occur so that the Hive service is unavailable. | 95% |
| Hive | HiveServer Heap Memory Usage Statistics | 16005 | Heap Memory Usage of the Hive Process Exceeds the Threshold | When the heap memory usage of Hive is too high, the performance of Hive task operation is affected. In addition, a memory overflow may occur so that the Hive service is unavailable. | 95% |
| Hive | Percentage of Sessions Connected to the HiveServer to Maximum Number of Sessions Allowed by the HiveServer | 16000 | Percentage of Sessions Connected to the HiveServer to Maximum Number Allowed Exceeds the Threshold | If a connection alarm is generated, too many sessions are connected to the HiveServer and new connections cannot be created. | 90.0% |
| Kafka | Percentage of Partitions That Are Not Completely Synchronized | 38006 | Percentage of Kafka Partitions That Are Not Completely Synchronized Exceeds the Threshold | Too many Kafka partitions that are not completely synchronized affect service reliability. In addition, data may be lost when leaders are switched. | 50% |
| Kafka | User Connection Usage on Broker | 38011 | User Connection Usage on Broker Exceeds the Threshold | If the number of connections of a user is excessive, the user cannot create new connections to the Broker. | 80% |
| Kafka | Broker Disk Usage | 38001 | Insufficient Kafka Disk Capacity | Kafka data write operations fail. | 80.0% |
| Kafka | Disk I/O Rate of a Broker | 38009 | Busy Broker Disk I/Os | The disk partition has frequent I/Os. Data may fail to be written to the Kafka topic for which the alarm is generated. | 80% |
| Kafka | Broker GC Duration per Minute | 38005 | GC Duration of the Broker Process Exceeds the Threshold | A long GC duration of the Broker process may interrupt the services. | 12000ms |
| Kafka | Heap Memory Usage of Kafka | 38002 | Kafka Heap Memory Usage Exceeds the Threshold | If the available Kafka heap memory is insufficient, a memory overflow occurs and the service breaks down. | 95% |
| Kafka | Kafka Direct Memory Usage | 38004 | Kafka Direct Memory Usage Exceeds the Threshold | If the available direct memory of the Kafka service is insufficient, a memory overflow occurs and the service breaks down. | 95% |
| Loader | Heap Memory Usage | 23004 | Loader Heap Memory Usage Exceeds the Threshold | Heap memory overflow may cause service breakdown. | 95% |
| Loader | Direct Memory Usage Statistics | 23006 | Loader Direct Memory Usage Exceeds the Threshold | Direct memory overflow may cause service breakdown. | 80.0% |
| Loader | Non-heap Memory Usage | 23005 | Loader Non-Heap Memory Usage Exceeds the Threshold | Non-heap memory overflow may cause service breakdown. | 80% |
| Loader | Total GC Duration | 23007 | GC Duration of the Loader Process Exceeds the Threshold | Loader service response is slow. | 12000ms |
| MapReduce | GC Duration Statistics | 18012 | JobHistoryServer GC Duration Exceeds the Threshold | A long GC duration of the JobHistoryServer process may interrupt the services. | 12000ms |
| MapReduce | JobHistoryServer Direct Memory Usage Statistics | 18015 | JobHistoryServer Direct Memory Usage Exceeds the Threshold | If the available direct memory of the MapReduce service is insufficient, a memory overflow occurs and the service breaks down. | 90% |
| MapReduce | JobHistoryServer Non-Heap Memory Usage Statistics | 18019 | Non-Heap Memory Usage of JobHistoryServer Exceeds the Threshold | When the non-heap memory usage of MapReduce JobHistoryServer is too high, the performance of MapReduce task submission and operation is affected. In addition, a memory overflow may occur so that the MapReduce service is unavailable. | 90% |
| MapReduce | JobHistoryServer Heap Memory Usage Statistics | 18009 | Heap Memory Usage of JobHistoryServer Exceeds the Threshold | When the heap memory usage of MapReduce JobHistoryServer is too high, the performance of MapReduce log archiving is affected. In addition, a memory overflow may occur, leading to unavailable YARN service. | 95% |
| Oozie | Heap Memory Usage | 17004 | Oozie Heap Memory Usage Exceeds the Threshold | Heap memory overflow may cause service breakdown. | 95.0% |
| Oozie | Direct Memory Usage | 17006 | Oozie Direct Memory Usage Exceeds the Threshold | Direct memory overflow may cause service breakdown. | 80.0% |
| Oozie | Non-heap Memory Usage | 17005 | Oozie Non-Heap Memory Usage Exceeds the Threshold | Non-heap memory overflow may cause service breakdown. | 80% |
| Oozie | Total GC Duration | 17007 | GC Duration of the Oozie Process Exceeds the Threshold | Oozie responds slowly when it is used to submit tasks. | 12000ms |
| Spark2x | JDBCServer2x Heap Memory Usage Statistics | 43010 | Heap Memory Usage of the JDBCServer2x Process Exceeds the Threshold | If the available JDBCServer2x process heap memory is insufficient, a memory overflow occurs and the service breaks down. | 95% |
| Spark2x | JDBCServer2x Direct Memory Usage Statistics | 43012 | Direct Heap Memory Usage of the JDBCServer2x Process Exceeds the Threshold | If the available JDBCServer2x process direct heap memory is insufficient, a memory overflow occurs and the service breaks down. | 95% |
| Spark2x | JDBCServer2x Non-Heap Memory Usage Statistics | 43011 | Non-Heap Memory Usage of the JDBCServer2x Process Exceeds the Threshold | If the available JDBCServer2x process non-heap memory is insufficient, a memory overflow occurs and the service breaks down. | 95% |
| Spark2x | JobHistory2x Direct Memory Usage Statistics | 43008 | Direct Memory Usage of the JobHistory2x Process Exceeds the Threshold | If the available JobHistory2x process direct memory is insufficient, a memory overflow occurs and the service breaks down. | 95% |
| Spark2x | JobHistory2x Non-Heap Memory Usage Statistics | 43007 | Non-Heap Memory Usage of the JobHistory2x Process Exceeds the Threshold | If the available JobHistory2x process non-heap memory is insufficient, a memory overflow occurs and the service breaks down. | 95% |
| Spark2x | JobHistory2x Heap Memory Usage Statistics | 43006 | Heap Memory Usage of the JobHistory2x Process Exceeds the Threshold | If the available JobHistory2x process heap memory is insufficient, a memory overflow occurs and the service breaks down. | 95% |
| Spark2x | IndexServer2x Direct Memory Usage Statistics | 43021 | Direct Memory Usage of the IndexServer2x Process Exceeds the Threshold | If the available IndexServer2x process direct memory is insufficient, a memory overflow occurs and the service breaks down. | 95% |
| Spark2x | IndexServer2x Heap Memory Usage Statistics | 43019 | Heap Memory Usage of the IndexServer2x Process Exceeds the Threshold | If the available IndexServer2x process heap memory is insufficient, a memory overflow occurs and the service breaks down. | 95% |
| Spark2x | IndexServer2x Non-Heap Memory Usage Statistics | 43020 | Non-Heap Memory Usage of the IndexServer2x Process Exceeds the Threshold | If the available IndexServer2x process non-heap memory is insufficient, a memory overflow occurs and the service breaks down. | 95% |
| Spark2x | Full GC Number of JDBCServer2x | 43017 | JDBCServer2x Process Full GC Number Exceeds the Threshold | The performance of the JDBCServer2x process is affected, or the JDBCServer2x process even becomes unavailable. | 12 |
| Spark2x | Full GC Number of JobHistory2x | 43018 | JobHistory2x Process Full GC Number Exceeds the Threshold | The performance of the JobHistory2x process is affected, or the JobHistory2x process even becomes unavailable. | 12 |
| Spark2x | Full GC Number of IndexServer2x | 43023 | IndexServer2x Process Full GC Number Exceeds the Threshold | If the GC number exceeds the threshold, IndexServer2x may run with low performance or even become unavailable. | 12 |
| Spark2x | Total GC Duration (in Milliseconds) of JDBCServer2x | 43013 | JDBCServer2x Process GC Duration Exceeds the Threshold | If the GC duration exceeds the threshold, JDBCServer2x may run with low performance. | 12000ms |
| Spark2x | Total GC Duration (in Milliseconds) of JobHistory2x | 43009 | JobHistory2x Process GC Duration Exceeds the Threshold | If the GC duration exceeds the threshold, JobHistory2x may run with low performance. | 12000ms |
| Spark2x | Total GC Duration (in Milliseconds) of IndexServer2x | 43022 | IndexServer2x Process GC Duration Exceeds the Threshold | If the GC duration exceeds the threshold, IndexServer2x may run with low performance or even become unavailable. | 12000ms |
| Storm | Number of Available Supervisors | 26052 | Number of Available Supervisors of the Storm Service Is Less Than the Threshold | Existing tasks in the cluster cannot be performed. The cluster can receive new Storm tasks, but cannot perform these tasks. | 1 |
| Storm | Slot Usage | 26053 | Storm Slot Usage Exceeds the Threshold | New Storm tasks cannot be performed. | 80.0% |
| Storm | Nimbus Heap Memory Usage | 26054 | Nimbus Heap Memory Usage Exceeds the Threshold | When the heap memory usage of Storm Nimbus is too high, frequent GCs occur. In addition, a memory overflow may occur so that the Yarn service is unavailable. | 80% |
| Yarn | NodeManager Direct Memory Usage Statistics | 18014 | NodeManager Direct Memory Usage Exceeds the Threshold | If the available direct memory of NodeManager is insufficient, a memory overflow occurs and the service breaks down. | 90% |
| Yarn | NodeManager Heap Memory Usage Statistics | 18018 | NodeManager Heap Memory Usage Exceeds the Threshold | When the heap memory usage of Yarn NodeManager is too high, the performance of Yarn task submission and operation is affected. In addition, a memory overflow may occur so that the Yarn service is unavailable. | 95% |
| Yarn | NodeManager Non-Heap Memory Usage Statistics | 18017 | NodeManager Non-heap Memory Usage Exceeds the Threshold | When the non-heap memory usage of Yarn NodeManager is too high, the performance of Yarn task submission and operation is affected. In addition, a memory overflow may occur so that the Yarn service is unavailable. | 90% |
| Yarn | ResourceManager Direct Memory Usage Statistics | 18013 | ResourceManager Direct Memory Usage Exceeds the Threshold | If the available direct memory of ResourceManager is insufficient, a memory overflow occurs and the service breaks down. | 90% |
| Yarn | ResourceManager Heap Memory Usage Statistics | 18008 | ResourceManager Heap Memory Usage Exceeds the Threshold | When the heap memory usage of Yarn ResourceManager is too high, the performance of Yarn task submission and operation is affected. In addition, a memory overflow may occur so that the Yarn service is unavailable. | 95% |
| Yarn | ResourceManager Non-Heap Memory Usage Statistics | 18016 | ResourceManager Non-Heap Memory Usage Exceeds the Threshold | When the non-heap memory usage of Yarn ResourceManager is too high, the performance of Yarn task submission and operation is affected. In addition, a memory overflow may occur so that the Yarn service is unavailable. | 90% |
| Yarn | NodeManager GC Duration Statistics | 18011 | NodeManager GC Duration Exceeds the Threshold | A long GC duration of the NodeManager process may interrupt the services. | 12000ms |
| Yarn | ResourceManager GC Duration Statistics | 18010 | ResourceManager GC Duration Exceeds the Threshold | A long GC duration of the ResourceManager process may interrupt the services. | 12000ms |
| Yarn | Number of Failed Tasks in the Root Queue | 18026 | Number of Failed Yarn Tasks Exceeds the Threshold | A large number of application tasks fail to be executed. Failed tasks need to be submitted again. | 50 |
| Yarn | Terminated Applications of the Root Queue | 18025 | Number of Terminated Yarn Tasks Exceeds the Threshold | A large number of application tasks are forcibly stopped. | 50 |
| Yarn | Pending Memory | 18024 | Pending Yarn Memory Usage Exceeds the Threshold | It takes a long time to end an application. A new application cannot run after submission. | 83886080MB |
| Yarn | Pending Tasks | 18023 | Number of Pending Yarn Tasks Exceeds the Threshold | It takes a long time to end an application. A new application cannot run for a long time after submission. | 60 |
| ZooKeeper | ZooKeeper Connections Usage | 13001 | Available ZooKeeper Connections Are Insufficient | Available ZooKeeper connections are insufficient. When the connection usage reaches 100%, external connections cannot be handled. | 80% |
| ZooKeeper | ZooKeeper Heap Memory Usage | 13004 | ZooKeeper Heap Memory Usage Exceeds the Threshold | If the available ZooKeeper memory is insufficient, a memory overflow occurs and the service breaks down. | 95% |
| ZooKeeper | ZooKeeper Direct Memory Usage | 13002 | ZooKeeper Direct Memory Usage Exceeds the Threshold | If the available ZooKeeper memory is insufficient, a memory overflow occurs and the service breaks down. | 80% |
| ZooKeeper | ZooKeeper GC Duration per Minute | 13003 | GC Duration of the ZooKeeper Process Exceeds the Threshold | A long GC duration of the ZooKeeper process may interrupt the services. | 12000ms |
| Ranger | UserSync GC Duration | 45284 | UserSync GC Duration Exceeds the Threshold | UserSync responds slowly. | 12000ms |
| Ranger | PolicySync GC Duration | 45292 | PolicySync GC Duration Exceeds the Threshold | PolicySync responds slowly. | 12000ms |
| Ranger | RangerAdmin GC Duration | 45280 | RangerAdmin GC Duration Exceeds the Threshold | RangerAdmin responds slowly. | 12000ms |
| Ranger | TagSync GC Duration | 45288 | TagSync GC Duration Exceeds the Threshold | TagSync responds slowly. | 12000ms |
| Ranger | UserSync Non-Heap Memory Usage | 45283 | UserSync Non-Heap Memory Usage Exceeds the Threshold | Non-heap memory overflow may cause service breakdown. | 80.0% |
| Ranger | UserSync Direct Memory Usage | 45282 | UserSync Direct Memory Usage Exceeds the Threshold | Direct memory overflow may cause service breakdown. | 80.0% |
| Ranger | UserSync Heap Memory Usage | 45281 | UserSync Heap Memory Usage Exceeds the Threshold | Heap memory overflow may cause service breakdown. | 95.0% |
| Ranger | PolicySync Direct Memory Usage | 45290 | PolicySync Direct Memory Usage Exceeds the Threshold | Direct memory overflow may cause service breakdown. | 80.0% |
| Ranger | PolicySync Heap Memory Usage | 45289 | PolicySync Heap Memory Usage Exceeds the Threshold | Heap memory overflow may cause service breakdown. | 95.0% |
| Ranger | PolicySync Non-Heap Memory Usage | 45291 | PolicySync Non-Heap Memory Usage Exceeds the Threshold | Non-heap memory overflow may cause service breakdown. | 80.0% |
| Ranger | RangerAdmin Non-Heap Memory Usage | 45279 | RangerAdmin Non-Heap Memory Usage Exceeds the Threshold | Non-heap memory overflow may cause service breakdown. | 80.0% |
| Ranger | RangerAdmin Heap Memory Usage | 45277 | RangerAdmin Heap Memory Usage Exceeds the Threshold | Heap memory overflow may cause service breakdown. | 95.0% |
| Ranger | RangerAdmin Direct Memory Usage | 45278 | RangerAdmin Direct Memory Usage Exceeds the Threshold | Direct memory overflow may cause service breakdown. | 80.0% |
| Ranger | TagSync Direct Memory Usage | 45286 | TagSync Direct Memory Usage Exceeds the Threshold | Direct memory overflow may cause service breakdown. | 80.0% |
| Ranger | TagSync Non-Heap Memory Usage | 45287 | TagSync Non-Heap Memory Usage Exceeds the Threshold | Non-heap memory overflow may cause service breakdown. | 80.0% |
| Ranger | TagSync Heap Memory Usage | 45285 | TagSync Heap Memory Usage Exceeds the Threshold | Heap memory overflow may cause service breakdown. | 95.0% |
| ClickHouse | ClickHouse Service Quantity Quota Usage in ZooKeeper | 45426 | ClickHouse Service Quantity Quota Usage in ZooKeeper Exceeds the Threshold | After the ZooKeeper quantity quota of the ClickHouse service exceeds the threshold, you cannot perform cluster operations on the ClickHouse service on FusionInsight Manager. As a result, the ClickHouse service cannot be used. | 90% |
| ClickHouse | ClickHouse Service Capacity Quota Usage in ZooKeeper | 45427 | ClickHouse Service Capacity Quota Usage in ZooKeeper Exceeds the Threshold | After the ZooKeeper capacity quota of the ClickHouse service exceeds the threshold, you cannot perform cluster operations on the ClickHouse service on FusionInsight Manager. As a result, the ClickHouse service cannot be used. | 90% |
| IoTDB | Maximum Merge (Intra-Space Merge) Latency | 45594 | IoTDBServer Intra-Space Merge Duration Exceeds the Threshold | Data write is blocked and the write operation performance is affected. | 300000ms |
| IoTDB | Maximum Merge (Flush) Latency | 45593 | IoTDBServer Flush Execution Duration Exceeds the Threshold | Data write is blocked and the write operation performance is affected. | 300000ms |
| IoTDB | Maximum Merge (Cross-Space Merge) Latency | 45595 | IoTDBServer Cross-Space Merge Duration Exceeds the Threshold | Data write is blocked and the write operation performance is affected. | 300000ms |
| IoTDB | Maximum RPC (executeStatement) Latency | 45592 | IoTDBServer RPC Execution Duration Exceeds the Threshold | Running performance of the IoTDBServer process is affected. | 10000s |
| IoTDB | Total GC Duration of IoTDBServer | 45587 | IoTDBServer GC Duration Exceeds the Threshold | A long GC duration of the IoTDBServer process may interrupt the services. | 12000ms |
| IoTDB | Total GC Duration of ConfigNode | 45590 | ConfigNode GC Duration Exceeds the Threshold | A long GC duration of the ConfigNode process may interrupt services. | 12000ms |
| IoTDB | IoTDBServer Heap Memory Usage | 45586 | IoTDBServer Heap Memory Usage Exceeds the Threshold | If the available IoTDBServer process heap memory is insufficient, a memory overflow occurs and the service breaks down. | 90% |
| IoTDB | IoTDBServer Direct Memory Usage | 45588 | IoTDBServer Direct Memory Usage Exceeds the Threshold | Direct memory overflow may cause service breakdown. | 90% |
| IoTDB | ConfigNode Heap Memory Usage | 45589 | ConfigNode Heap Memory Usage Exceeds the Threshold | If the available ConfigNode process heap memory is insufficient, a memory overflow occurs and the service breaks down. | 90% |
| IoTDB | ConfigNode Direct Memory Usage | 45591 | ConfigNode Direct Memory Usage Exceeds the Threshold | Direct memory overflow may cause the IoTDB instance to be unavailable. | 90% |
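
The default thresholds in Table 3 mix several units (percentages, milliseconds, seconds, megabytes, and plain counts). If you copy these values into your own tooling, a small parser helps keep the number and the unit apart. The following Python sketch shows one way to do that. The parse_threshold function is a hypothetical helper, not part of MRS or FusionInsight Manager, and the label "count" for unitless values is an assumption made for this example.

```python
import re

# Number followed by an optional unit as written in Table 3.
_THRESHOLD_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)\s*(%|ms|s|MB)?$")


def parse_threshold(text):
    """Split a default threshold such as '12000ms' into (value, unit).

    Values without a unit (for example '128') are labeled 'count'.
    """
    match = _THRESHOLD_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized threshold: {text!r}")
    number, unit = match.groups()
    return float(number), unit or "count"


for raw in ["95.0%", "12000ms", "10000s", "83886080MB", "128"]:
    print(raw, "->", parse_threshold(raw))
```

Keeping the unit alongside the value avoids comparing, for example, a GC duration in milliseconds with an RPC latency listed in seconds.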