Updated on 2024-11-29 GMT+08:00

Configuring Alarm Thresholds

Scenario

You can configure thresholds for monitoring metrics on FusionInsight Manager to monitor their health status. If a metric value becomes abnormal and the preset conditions are met, the system triggers an alarm and displays the alarm information on the alarm page.

Procedure

  1. Log in to FusionInsight Manager.
  2. Choose O&M > Alarm > Thresholds.
  3. Select a monitoring metric for a host or service in the cluster.

    Figure 1 Configuring the threshold for a metric
    For example, after you select Host Memory Usage, the threshold information for this metric is displayed.
    • When Switch is turned on, an alarm is triggered once the threshold condition is met.
    • When Alarm Severity is turned on, hierarchical alarms are enabled: the system reports an alarm of the corresponding severity based on the real-time metric value and the threshold configured for each severity.
    • Alarm ID and Alarm Name: the alarm that is generated when the threshold is reached.
    • Trigger Count: FusionInsight Manager checks whether the metric value crosses the threshold. If the threshold is crossed in this many consecutive checks, an alarm is generated. Trigger Count is configurable (see the sketch after this list).
    • Check Period (s): interval, in seconds, at which the system checks the monitoring metric.
    • The rules in the rule list determine when alarms are triggered.
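    The effect of Trigger Count and Check Period (s) can be pictured with a short sketch. The following Python snippet is illustrative only (the class and function names are hypothetical, not a FusionInsight Manager interface): one metric sample is evaluated per check period, and an alarm is raised only after the threshold has been crossed in the configured number of consecutive checks.

      # Illustrative sketch only: hypothetical names, not a FusionInsight Manager API.
      from dataclasses import dataclass

      @dataclass
      class ThresholdRule:
          threshold: float        # e.g. 90.0 for Host Memory Usage
          max_type: bool = True   # True: Max value (alarm when value > threshold)
          trigger_count: int = 3  # consecutive breaches required before an alarm

      class ThresholdMonitor:
          """Evaluates one metric sample per check period and decides when to alarm."""
          def __init__(self, rule: ThresholdRule):
              self.rule = rule
              self.breaches = 0   # consecutive checks in which the threshold was crossed

          def check(self, value: float) -> bool:
              crossed = (value > self.rule.threshold if self.rule.max_type
                         else value < self.rule.threshold)
              self.breaches = self.breaches + 1 if crossed else 0
              # An alarm is generated only after trigger_count consecutive breaches.
              return self.breaches >= self.rule.trigger_count

      monitor = ThresholdMonitor(ThresholdRule(threshold=90.0, trigger_count=3))
      for sample in (85.0, 92.0, 93.0, 95.0):   # one sample per check period
          print(sample, "alarm" if monitor.check(sample) else "ok")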

  4. Click Create Rule to add a rule for monitoring the metric.

    Table 1 Monitoring indicator rule parameters

    Rule Name
    Description: Name of the rule.
    Example value: CPU_MAX

    Severity
    Description: Alarm severity of the rule. After Alarm Severity is turned on, the alarm severity is configured in Thresholds instead.
    Example values: Critical, Major, Minor, Warning

    Threshold Type
    Description: Whether the maximum or minimum value of the metric is used as the alarm trigger threshold. If Threshold Type is set to Max value, the system generates an alarm when the metric value is greater than the threshold; if it is set to Min value, an alarm is generated when the metric value is less than the threshold.
    Example values: Max value, Min value

    Date
    Description: Date on which the rule takes effect. If Alarm Severity is on, only Daily is supported.
    Example values: Daily, Weekly, Others

    Add Date
    Description: Available only when Date is set to Others. Specifies the dates on which the rule takes effect; multiple dates can be selected.
    Example value: 09-30

    Thresholds
    Description: Time range in which the rule takes effect and the thresholds of the monitored metric within that range. If Alarm Severity is on, the start time and end time cannot be changed and default to 00:00-23:59, and different thresholds can be set for different alarm severities (illustrated in the sketch after this table). You can add multiple time ranges for the threshold or delete existing ones.
    Example values: Start and End Time 00:00–08:30; an alarm severity and its threshold
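    To see how hierarchical thresholds map a metric value to an alarm severity when Alarm Severity is on, consider the minimal sketch below (a hypothetical helper, not product code). With a Max value threshold type, the system reports the highest severity whose threshold the current value exceeds:

      # Illustrative sketch only: hypothetical helper, not a FusionInsight Manager API.
      # Severity thresholds as configured in the Thresholds parameter (Max value type),
      # for example Host Memory Usage: 95% -> critical, 90% -> major.
      SEVERITY_ORDER = ("critical", "major", "minor", "warning")

      def pick_severity(value, thresholds):
          """Return the highest severity whose threshold the value exceeds, or None."""
          for severity in SEVERITY_ORDER:           # most severe first
              limit = thresholds.get(severity)
              if limit is not None and value > limit:
                  return severity
          return None

      thresholds = {"critical": 95.0, "major": 90.0}
      print(pick_severity(97.0, thresholds))   # critical
      print(pick_severity(92.0, thresholds))   # major
      print(pick_severity(80.0, thresholds))   # None -> no alarm is generated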

  5. Click OK to save the rules.
  6. Locate the row that contains an added rule, and click Apply in the Operation column. The value of Effective for this rule changes to Yes.

    If a rule is already applied to the metric, the new rule can be applied only after you click Cancel for the existing rule.

Monitoring Metric Reference

FusionInsight Manager alarm monitoring metrics are classified into node metrics and cluster service metrics. Table 2 describes the node metrics for which you can configure thresholds, and Table 3 describes the cluster service metrics.

Table 2 Node monitoring metrics

Metric Group

Metric

Description

Default Threshold

CPU

Host CPU Usage

This indicator reflects the computing and control capabilities of the current cluster in a measurement period. By observing the indicator value, you can better understand the overall resource usage of the cluster.

90.0%

Disk

Disk Usage

Indicates the disk usage of a host.

95% (critical)

85% (major)

Disk Inode Usage

Indicates the disk inode usage in a measurement period.

95% (critical)

80% (major)

Memory

Host Memory Usage

Indicates the average memory usage at the current time.

95% (critical)

90% (major)

Host Status

Host File Handle Usage

Indicates the usage of file handles of the host in a measurement period.

95% (critical)

80% (major)

Host PID Usage

Indicates the PID usage of a host.

95% (critical)

90% (major)

Network Status

TCP Ephemeral Port Usage

Indicates the usage of temporary TCP ports of the host in a measurement period.

95% (critical)

80% (major)

Network Reading

Read Packet Error Rate

Indicates the read packet error rate of the network interface on the host in a measurement period.

5% (critical)

0.5% (major)

Read Packet Dropped Rate

Indicates the read packet dropped rate of the network interface on the host in a measurement period.

5% (critical)

0.5% (major)

Read Throughput Rate

Indicates the average read throughput (at MAC layer) of the network interface in a measurement period.

80%

Network Writing

Write Packet Error Rate

Indicates the write packet error rate of the network interface on the host in a measurement period.

5% (critical)

0.5% (major)

Write Packet Dropped Rate

Indicates the write packet dropped rate of the network interface on the host in a measurement period.

5% (critical)

0.5% (major)

Write Throughput Rate

Indicates the average write throughput (at MAC layer) of the network interface in a measurement period.

80%

Process

Uninterruptible Sleep Process

Number of D state and Z state processes on the host in a measurement period

0

omm Process Usage

omm process usage in a measurement period

95% (critical)

90% (major)

Table 3 Cluster service indicators

Service

Metric Group

Metric

Description

Default Threshold

DBService

Database

Usage of the Number of Database Connections

Indicates the usage of the number of database connections.

95% (critical)

90% (major)

Disk Space Usage of the Data Directory

Disk space usage of the data directory

85% (critical)

80% (major)

MOTService

Database

MOT Connections Usage

Usage of MOTService database connections

90%

MOT Disk Space Usage of the Data Directory

Disk space usage of the MOTService data directory

80%

MOT Used Memory Percentage

MOTService memory usage

85%

MOT Used CPU Percentage

MOTService CPU usage

80%

Elasticsearch

Disk

Data Directory Usage

Elasticsearch data directory usage

80%

Garbage Collection

GC Time

Garbage collection duration of the Elasticsearch instance process

30000 ms

Memory

Heap Memory Usage

Elasticsearch heap memory usage

90%

Shard

Elasticsearch Shard Document Number

Number of Elasticsearch sharded files

100000000

Elasticsearch Shard Data Volume

Size of Elasticsearch shards

41943040

Number of Instance Shards

Total number of Elasticsearch instance shards

400

Replica Quantity Statistics

Total shard number

Number of primary shards whose Elasticsearch status is down

70000

Flume

Agent

Flume Heap Memory Usage Calculate

Indicates the Flume heap memory usage.

95.0% (critical)

90.0% (major)

Flume Direct Memory Usage Statistics

Indicates the Flume direct memory usage.

90.0% (critical)

80.0% (major)

Flume Non-heap Memory Usage

Indicates the Flume non-heap memory usage.

80.0%

Total GC duration of Flume process

Indicates the Flume total GC time.

12000 ms

FTP-Server

Process

FTP-Server Heap Memory Usage Calculate

Indicates the FTP-Server heap memory usage.

95.0%

FTP-Server Direct Buffer Usage Statistics

Indicates the FTP-Server direct memory usage.

80.0%

FTP-Server Non-Heap Memory Usage

Indicates the FTP-Server non-heap memory usage.

80.0%

Total GC duration of FTP-Server process

Indicates the total GC time of FTP-Server.

12000 ms

HBase

GC

GC time for old generation

Total GC time of RegionServer

5000 ms

GC time for old generation

Total GC time of HMaster

5000 ms

CPU & memory

RegionServer Direct Memory Usage Statistics

RegionServer direct memory usage

90%

RegionServer Heap Memory Usage Statistics

RegionServer heap memory usage

90%

HMaster Direct Memory Usage

HMaster direct memory usage

90%

HMaster Heap Memory Usage Statistics

HMaster heap memory usage

90%

Service

Number of Online Regions of a RegionServer

Number of regions of a RegionServer

5000 (critical)

2000 (major)

Regions in Transition Count Over Threshold

Number of regions that are in the RIT state and reach the threshold duration

1

Handler

RegionServer Handler Usage

Handler usage of RegionServer

100% (critical)

90% (major)

Replication

Replication sync failed times (RegionServer)

Number of times that DR data fails to be synchronized

1

Number of Log Files to Be Synchronized in the Active Cluster

Number of log files to be synchronized in the active cluster

128

Number of HFiles to Be Synchronized in the Active Cluster

Number of HFiles to be synchronized in the active cluster

128

RPC

Number of RegionServer Opened Connections

Number of open RegionServer RPC connections

200 (critical)

100 (major)

99th Percentile of the RegionServer RPC Request Response Time

99th percentile of the RegionServer RPC request response time

10000 ms (critical)

5000 ms (major)

99th Percentile of the RegionServer RPC Request Processing Time

99th percentile of the RegionServer RPC request processing time

10000 ms (critical)

5000 ms (major)

Operation statistics

Number of Timed-Out WAL Writes in RegionServers

Number of timed-out WAL writes in RegionServers

500 (critical)

300 (major)

Queue

Number of Tasks in RegionServer RPC Write Queues

Number of tasks in RegionServer RPC write queues

2000 (critical)

1600 (major)

Number of Tasks in RegionServer RPC Read Queues

Number of tasks in RegionServer RPC read queues

2000 (critical)

1600 (major)

RegionServer Call Queue Size

RegionServer call queue size

838860800 (critical)

629145600 (major)

Compaction Queue Size

Size of the Compaction queue

100

HDFS

File and Block

Lost Blocks

Number of backup blocks that the HDFS file system lacks

0

Blocks Under Replicated

Total number of blocks that need to be replicated by the NameNode

1000

RPC

Average Time of Active NameNode RPC Processing

Average NameNode RPC processing time

100 ms (major)

200 ms (critical)

Average Time of Active NameNode RPC Queuing

Average NameNode RPC queuing time

200 ms (major)

300 ms (critical)

Disk

HDFS Disk Usage

HDFS disk usage

80% (major)

90% (critical)

DataNode Disk Usage

Disk usage of DataNodes in the HDFS

80%

Percentage of Reserved Space for Replicas of Unused Space

Percentage of the disk space reserved for replicas to the total unused disk space on DataNodes

90%

Resource

Faulty DataNodes

Number of faulty DataNodes

3

NameNode Non-Heap Memory Usage Statistics

Percentage of NameNode non-heap memory usage

90%

NameNode Direct Memory Usage Statistics

Percentage of direct memory used by NameNodes

90%

NameNode Heap Memory Usage Statistics

Percentage of NameNode heap memory usage

95%

DataNode Direct Memory Usage Statistics

Percentage of direct memory used by DataNodes

90%

DataNode Heap Memory Usage Statistics

DataNode heap memory usage

95%

DataNode Non-Heap Memory Usage Statistics

Percentage of DataNode non-heap memory usage

90%

Garbage Collection

GC Time (NameNode)

Garbage collection (GC) duration of NameNodes per minute

10000 ms (major)

15000 ms (critical)

GC Time (DataNode)

GC duration of DataNodes per minute

12000 ms (major)

20000 ms (critical)

Hive

HQL

Percentage of HQL Statements That Are Executed Successfully by Hive

Percentage of HQL statements that are executed successfully by Hive

90% (critical)

80% (major)

Connections

Percentage of Number of Sessions Connected to the MetaStore to the Maximum Allowed (MetaStore)

Percentage of the number of sessions connected to MetaStore to the maximum number of sessions allowed by MetaStore

90% (critical)

80% (major)

Background

Background Thread Usage

Background thread usage

90% (critical)

80% (major)

GC

Total GC time of MetaStore

Total GC time of MetaStore

12000 ms

HiveServer Total GC Time in Milliseconds

Total GC time of HiveServer

12000 ms

Capacity

Percentage of HDFS Space Used by Hive to the Available Space

Percentage of HDFS space used by Hive to the available space

95% (critical)

85% (major)

CPU & memory

MetaStore Direct Memory Usage Statistics

MetaStore direct memory usage

95% (critical)

85% (major)

MetaStore Non-Heap Memory Usage Statistics

MetaStore non-heap memory usage

95% (critical)

85% (major)

MetaStore Heap Memory Usage Statistics

MetaStore heap memory usage

95% (critical)

85% (major)

HiveServer Direct Memory Usage Statistics

HiveServer direct memory usage

95% (critical)

85% (major)

HiveServer Non-Heap Memory Usage Statistics

HiveServer non-heap memory usage

95% (critical)

85% (major)

HiveServer Heap Memory Usage Statistics

HiveServer heap memory usage

95% (critical)

85% (major)

Session

Percentage of Sessions Connected to the HiveServer to Maximum Number of Sessions Allowed by the HiveServer

Indicates the percentage of the number of sessions connected to the HiveServer to the maximum number of sessions allowed by the HiveServer.

90% (critical)

80% (major)

Kafka

Partition

Percentage of Partitions That Are Not Completely Synchronized

Indicates the percentage of partitions that are not completely synchronized to total partitions.

60% (critical)

50% (major)

Disk

Broker Disk Usage

Indicates the disk usage of the disk where the Broker data directory is located.

90% (critical)

85% (major)

Disk I/O Rate of a Broker

I/O usage of the disk where the Broker data directory is located

80%

Process

Broker GC Duration per Minute

Indicates the GC duration of the Broker process per minute.

12000 ms

Heap Memory Usage of Kafka

Indicates the Kafka heap memory usage.

95%

Kafka Direct Memory Usage

Indicates the Kafka direct memory usage.

100% (critical)

95% (major)

Others

User Connection Usage on Broker

Usage of user connections on Broker

90% (critical)

85% (major)

Loader

Memory

Heap Memory Usage Calculate

Indicates the Loader heap memory usage.

95% (critical)

80% (major)

Direct Memory Usage of Loader

Indicates the Loader direct memory usage.

95% (critical)

80% (major)

Non-heap Memory Usage of Loader

Indicates the Loader non-heap memory usage.

95% (critical)

80% (major)

GC

Total GC time of Loader

Indicates the total GC time of Loader.

20000 ms (critical)

12000 ms (major)

MapReduce

Garbage Collection

GC Time

Indicates the GC time.

20000 ms (critical)

12000 ms (major)

Resource

JobHistoryServer Direct Memory Usage Statistics

Indicates the JobHistoryServer direct memory usage.

95% (critical)

90% (major)

JobHistoryServer Non-Heap Memory Usage Statistics

Indicates the JobHistoryServer non-heap memory usage.

95% (critical)

90% (major)

JobHistoryServer Heap Memory Usage Statistics

Indicates the JobHistoryServer heap memory usage.

95% (critical)

90% (major)

Metadata

Others

Heap Memory Usage Calculate

Indicates the Metadata heap memory usage.

95%

Metadata Direct Memory Usage Statistics

Indicates the metadata direct memory usage.

80.0%

Metadata Non-heap Memory Usage

Indicates the metadata non-heap memory usage.

80.0%

Total GC time of Metadata

Indicates the metadata total GC time.

20000 ms (critical)

12000 ms (major)

Oozie

Memory

Oozie Heap Memory Usage Calculate

Indicates the Oozie heap memory usage.

95%

Oozie Direct Memory Usage

Indicates the Oozie direct memory usage.

90%

Oozie Non-heap Memory Usage

Indicates the Oozie non-heap memory usage.

90%

GC

Total GC duration of Oozie

Indicates the Oozie total GC time.

20000 ms (critical)

12000 ms (major)

Solr

Replica Quantity Statistics

Bad Replica Number

Number of bad replicas of a Solr instance

0

Garbage Collection

GC Time

Garbage collection duration of the Solr instance process

12000 ms

Memory

Heap Memory Usage

Indicates the heap memory usage.

99% (critical)

95% (major)

Shard

Solr Shard Data Volume

Data volume of Solr shards

83886080 (critical)

41943040 (major)

Solr Shard Document Number

Number of Solr shard documents

400000000

Spark

Memory

JDBCServer Heap Memory Usage Statistics

JDBCServer heap memory usage

95% (critical)

85% (major)

JDBCServer Direct Memory Usage Statistics

JDBCServer direct memory usage

95% (critical)

85% (major)

JDBCServer Non-Heap Memory Usage Statistics

JDBCServer non-heap memory usage

95% (critical)

85% (major)

JobHistory Direct Memory Usage Statistics

JobHistory direct memory usage

95% (major)

85% (minor)

JobHistory Non-Heap Memory Usage Statistics

JobHistory non-heap memory usage

95% (major)

85% (minor)

JobHistory Heap Memory Usage Statistics

JobHistory heap memory usage

95% (major)

85% (minor)

IndexServer Direct Memory Usage Statistics

IndexServer direct memory usage

95% (critical)

85% (major)

IndexServer Heap Memory Usage Statistics

IndexServer heap memory usage

95% (critical)

85% (major)

IndexServer Non-Heap Memory Usage Statistics

IndexServer non-heap memory usage

95% (critical)

85% (major)

GC Count

Full GC Number of JDBCServer

Full GC times of JDBCServer

12 (critical)

9 (major)

Full GC Number of JobHistory

Full GC times of JobHistory

12 (critical)

9 (major)

Full GC Number of IndexServer

Full GC times of IndexServer

12 (critical)

9 (major)

GC Time

JDBCServer Total GC Time in Milliseconds

Total GC time of JDBCServer

12000 ms (critical)

9600 ms (major)

JobHistory Total GC Time in Milliseconds

Total GC time of JobHistory

12000 ms (major)

9600 ms (minor)

IndexServer Total GC Time in Milliseconds

Total GC time of IndexServer

12000 ms (critical)

9600 ms (major)

Yarn

Resources

NodeManager Direct Memory Usage Statistics

Indicates the percentage of direct memory used by NodeManagers.

90%

NodeManager Heap Memory Usage Statistics

Indicates the percentage of NodeManager heap memory usage.

95%

NodeManager Non-Heap Memory Usage Statistics

Indicates the percentage of NodeManager non-heap memory usage.

90%

ResourceManager Direct Memory Usage Statistics

Indicates the ResourceManager direct memory usage.

90%

ResourceManager Heap Memory Usage Statistics

Indicates the ResourceManager heap memory usage.

95%

ResourceManager Non-Heap Memory Usage Statistics

Indicates the ResourceManager non-heap memory usage.

90%

Garbage collection

GC Time

Indicates the GC duration of NodeManager per minute.

12000 ms (major)

20000 ms (critical)

GC Time

Indicates the GC duration of ResourceManager per minute.

10000 ms (major)

15000 ms (critical)

Others

Failed Applications of root queue

Number of failed tasks in the root queue

50

Terminated Applications of root queue

Number of killed tasks in the root queue

50

CPU & memory

Pending Memory

Pending memory capacity

83886080 MB

Application

Pending Applications

Pending tasks

60

ZooKeeper

Connection

ZooKeeper Connections Usage

Indicates the percentage of the used connections to the total connections of ZooKeeper.

80% (major)

90% (critical)

CPU & memory

ZooKeeper Heap Memory Usage

Indicates the ZooKeeper heap memory usage.

95%

ZooKeeper Direct Memory Usage

Indicates the ZooKeeper direct memory usage.

80%

GC

ZooKeeper GC Duration per Minute

Indicates the GC time of ZooKeeper every minute.

5000 ms (major)

10000 ms (critical)

meta

OBS data write operation

Total Number of Failed OBS Write API Calls

Total number of failed OBS write API calls

10

OBS exception

Total Number of OBSFileConflictException Errors

Total number of OBSFileConflictException errors

5

Total Number of OBS AccessControlExceptions Errors

Total number of OBS AccessControlExceptions errors

5

Total Number of OBS EOFException Errors

Total number of OBS EOFException errors

5

Total Number of OBSMethodNotAllowedException Errors

Total number of OBSMethodNotAllowedException errors

5

Total Number of OBSIOException Errors

Total number of OBSIOException errors

5

Total Number of OBS FileNotFoundException Errors

Total number of OBS FileNotFoundException errors

5

Total Number of Throttled OBS Operations

Total number of throttled OBS operations

5

Total Number of OBSIllegalArgumentExceptions Errors

Total number of OBSIllegalArgumentExceptions errors

5

Total Number of Other OBS Exceptions

Total number of other OBS exceptions reported by all nodes

5

OBS data read operation

Total Number of Failed OBS Read API Calls

Total number of failed OBS read API calls

10

Total Number of Failed OBS readFully API Calls

Total number of failed OBS readFully API calls

10

Ranger

GC

UserSync GC Duration

UserSync garbage collection (GC) duration

20000 ms (critical)

12000 ms (major)

PolicySync GC Duration

PolicySync GC Duration

20000 ms (critical)

12000 ms (major)

RangerAdmin GC Duration

RangerAdmin GC duration

20000 ms (critical)

12000 ms (major)

TagSync GC Duration

TagSync GC duration

20000 ms (critical)

12000 ms (major)

CPU & memory

UserSync Non-Heap Memory Usage

UserSync non-heap memory usage

80.0%

UserSync Direct Memory Usage

UserSync direct memory usage

80.0%

UserSync Heap Memory Usage

UserSync heap memory usage

95.0%

PolicySync Direct Memory Usage

Percentage of the PolicySync direct memory usage

80.0%

PolicySync Heap Memory Usage

Percentage of PolicySync heap memory usage

95.0%

PolicySync Non-Heap Memory Usage

Percentage of PolicySync non-heap memory usage

80.0%

RangerAdmin Non-Heap Memory Usage

RangerAdmin non-heap memory usage

80.0%

RangerAdmin Heap Memory Usage

RangerAdmin heap memory usage

95.0%

RangerAdmin Direct Memory Usage

RangerAdmin direct memory usage

80.0%

TagSync Direct Memory Usage

TagSync direct memory usage

80.0%

TagSync Non-Heap Memory Usage

TagSync non-heap memory usage

80.0%

TagSync Heap Memory Usage

TagSync heap memory usage

95.0%

ClickHouse

Cluster Quota

Clickhouse service quantity quota usage in ZooKeeper

Quota of the ZooKeeper nodes used by a ClickHouse service

95% (critical)

90% (major)

Capacity quota usage of the Clickhouse service in ZooKeeper

Capacity quota of ZooKeeper directory used by the ClickHouse service

95% (critical)

90% (major)

Concurrencies

Concurrency Number (ClickHouseServer)

Actual number of concurrent SQL statements of the ClickHouse service

90

IoTDB

Merge

Maximum Task Merge (Intra-Space Merge) Latency

Maximum latency of IoTDBServer intra-space merge

300000 ms

Maximum Merge Task (Flush) Latency

Maximum latency of IoTDBServer flush execution

300000 ms

Maximum Task Merge (Cross-Space Merge) Latency

Maximum latency of IoTDBServer cross-space merge

300000 ms

RPC

Maximum RPC (executeStatement) Latency

Maximum latency of IoTDBServer RPC execution

10000 s

GC

Total GC duration of IoTDBServer

Total time used for IoTDBServer garbage collection (GC)

30000 ms (critical)

12000 ms (major)

Total GC Duration of ConfigNode

Total time used for ConfigNode garbage collection (GC)

30000 ms (critical)

12000 ms (major)

Memory

IoTDBServer Heap Memory Usage

IoTDBServer heap memory usage

100% (critical)

90% (major)

IoTDBServer Direct Memory Usage

IoTDBServer direct memory usage

100% (critical)

90% (major)

ConfigNode Heap Memory Usage

Percentage of the ConfigNode heap memory usage

100% (critical)

90% (major)

ConfigNode Direct Memory Usage

Percentage of the ConfigNode direct memory usage

100% (critical)

90% (major)

Containers

Others

Metaspace Usage

WebContainer metaspace usage

75.0%

Non-Heap Memory Usage

WebContainer non-heap memory usage

75.0%

Heap Memory Usage

WebContainer heap memory usage

95.0%

Failure Rate of Application Service Calling

Failure rate of application service calling (SGP)

10.0

Application Service Calling Latency

Application service calling latency (SGP)

10000.0

Maximum Number of Concurrent Application Services

Maximum number of concurrent application services (SGP)

120

BLU Health Status

BLU health status statistics

50.0%

LdapServer

Others

Process Connections of a Single SlapdServer Instance

Number of SlapdServer process connections

1000

CPU Usage of a Single SlapdServer Instance

SlapdServer CPU usage

1200%

Guardian

GC

TokenServer GC Duration

TokenServer GC duration

12000 ms

CPU & memory

TokenServer Heap Memory Usage

Percentage of the heap memory used by the TokenServer process

95.0%

TokenServer Non-Heap Memory Usage

Percentage of the non-heap memory used by the TokenServer process

80.0%

TokenServer Direct Memory Usage

Percentage of the TokenServer direct memory usage

80.0%

Doris

JVM

Accumulated Old-Generation GC Duration

Accumulated old-generation GC duration of the FE process

3000 ms

Connection

Ratio of the Number of MySQL Port Connections (FE)

Proportion of connections to the MySQL port of the FE node

95%

Disk

BE Data Disk Usage

BE data disk usage

95%

Disk Status of a Specified Data Directory

Statistics on abnormal disk status of a specified data directory on the BE.

1

Performance

Maximum Compaction Score of All BE Nodes

Maximum compaction score of all BE nodes

10

Maximum Duration of RPC Requests Received by Each Method of the FE Thrift Interface

Maximum duration of RPC requests received by each method of the FE thrift interface.

5000 ms

Queue

Queue Length of BE Periodic Report Tasks on the FE

Queue length of BE periodic report tasks on the FE node

10

Number of FE Tasks Queuing in the Thread Pool Interacting with the BE

Number of FE tasks queuing in the thread pool interacting with the BE node

10

Number of FE Tasks Queuing in the Task Processing Thread Pool

Number of FE tasks that are queuing in the task processing thread pool on the FE node

10

Queue Length of Query Execution Thread Pool

Queue length of query execution thread pool

20

Exception

Failed Metadata Image Generation

Failed metadata image generation on the FE node

1

Failed Historical Metadata Image Clearing

Failed historical metadata image clearing on the FE node

1

Status of the Doris FE instance (FE)

Process status statistics of the Doris FE instance.

0

Status of the Doris BE instance (BE)

Process status statistics of the Doris BE instance.

0

Error Rate of TCP Packet Receiving (BE)

Error rate of TCP packet receiving on the BE

5%

Whether the Number of Task Failures of a Certain Type Increases (BE)

Whether the number of failures of a certain type of tasks executed on the BE increases

1

CPU and Memory

FE CPU Usage

CPU usage statistics on FE nodes

95% (critical)

90% (major)

FE Memory Usage

Memory usage statistics on FE nodes

90% (critical)

85% (major)

FE Memory Usage

Memory usage of FE nodes

95%

FE Heap Memory Usage Rate

Heap memory usage of FE nodes

95%

BE Memory Usage Rate

Memory usage statistics on BE nodes

90% (critical)

85% (major)

Maximum BE Memory and Remaining Machine Memory on the BE

Whether the maximum memory required by the BE exceeds the remaining available memory on the node

1

BE CPU Usage

CPU usage statistics on BE nodes

95% (critical)

90% (major)