Updated on 2022-12-16 GMT+08:00

Database Monitoring Tool

Background

As the IT infrastructure, databases must be stable, high-throughput, and low-latency. To achieve the preceding objectives, ensure that the database cluster can continuously provide stable and fast services for users. A comprehensive database monitoring tool is required to help database O&M personnel monitor database running in real time and detect, locate, and handle exceptions in a timely manner.

The Database Monitor Service (DMS) is a native database monitoring service of GaussDB(DWS). It provides comprehensive database resource consumption indicators and database service execution indicators. DMS provides a visual tool to monitor the real-time and historical running status of database clusters, helping you detect, locate, and rectify faults.

Database Monitoring Principles

DMS uses a three-layer structure to monitor the GaussDB(DWS) database cluster:

  • Collection (dms-agent): The agent is embedded in each node of the database cluster to collect the cluster running status, collect the original data of cluster running, and report the data.
  • Storage (dms-collection): receives data reported by the agent and saves the data to the metric database.
  • Analysis (dms-monitoring): A large amount of data in the monitoring database is used, and the monitoring view is displayed to users through aggregation calculation.
Figure 1 DMS architecture

Database Monitoring Metrics

Table 1 describes the database monitoring metrics. For more information, see Database Monitoring Overview

Table 1 GaussDB(DWS) monitoring metrics

Monitored Object

Metric

Description

Value Range

Monitoring Period (Raw Data)

Cluster Overview

Cluster Status

Status of a cluster.

Normal/Abnormal/Degraded

30s

Nodes

Number of available nodes and total number of nodes (Available/Total) in a cluster.

≥ 0

60s

CNs

Number of CNs in a cluster.

≥ 0

60s

Databases

Number of created databases in a cluster.

≥ 0

90s

Resource Consumption

CPU Usage

Average real-time CPU usage of all nodes in a cluster.

0% to 100%

30s

Memory Usage

Average real-time memory usage of all nodes in a cluster.

0% to 100%

30s

Disk Usage

Average real-time disk usage of all nodes in a cluster.

0% to 100%

30s

Disk I/O

Average real-time disk I/O of all nodes in a cluster.

≥ 0 KB/s

30s

Network I/O

Average real-time network I/O of all NICs in a cluster.

≥ 0 KB/s

30s

Top 5 Time-Consuming Queries

Query ID

ID of a query, which is automatically generated by the database.

≥ 0

180s

SQL Statement

Query statement executed by a user.

Character string

180s

Execution Time

Execution time of a query statement (unit: ms).

≥ 0 ms

180s

Top 5 Queries with Most Data Written to Disk

Query ID

ID of a query, which is automatically generated by the database.

≥ 0

180s

SQL Statement

Query statement executed by a user.

Character string

180s

Data Written to Disk

Data to be written to disks after a user runs a statement (unit: MB).

≥ 0 MB

180s

Cluster Resource Metrics

CPU Usage

Average CPU usage and skew ratio of all nodes in the cluster. The formula for calculating the skew is (max-avg)/max.

0% to 100%

30s

Memory Usage

Average memory usage and skew ratio of all nodes in the cluster. The formula for calculating the skew is (max-avg)/max.

0% to 100%

30s

Disk Usage

Average usage and skew ratio of all disks in the cluster. The formula for calculating the skew is (max-avg)/max.

0% to 100%

30s

Disk I/O Usage

Average I/O usage and skew rate of all disks in the cluster. The formula for calculating the skew is (max-avg)/max.

0% to 100%

30s

Network I/O Usage

Average I/O usage and skew rate of all NICs in the cluster. The formula for calculating the skew is (max-avg)/max.

0% to 100%

30s

Key Database Metrics

Cluster Status

Cluster running status.

Normal/Degraded/Abnormal

30s

Cluster Abnormal CNs

Number of abnormal CNs in the cluster

≥ 0

60s

Cluster Read-only

Whether the cluster is in the read-only state

Yes/No

30s

Concurrent Sessions

Number of concurrent sessions in a cluster within a specified period.

≥ 0

30s

Concurrent Queries

Number of concurrent queries in a cluster within a specified period.

≥ 0

30s

Node Monitoring-Overview

Node Name

Name of a node in a cluster.

Character string

30s

CPU Usage

CPU usage of a host.

0% to 100%

30s

Memory Usage

Memory usage of a host.

0% to 100%

30s

Average Disk Usage (%)

Disk usage of a host.

0% to 100%

30s

IP Address

Service IP address of a host.

Character string

30s

Disk I/O

Disk I/O of a host (unit: KB/s)

≥ 0 KB/s

30s

TCP Protocol Stack Retransmission Rate

Retransmission rate of TCP packets per unit time.

0% to 100%

30s

Status

Running status of a host

Online/Offline

30s

Node Monitoring-Disks

Node Name

Name of a node in a cluster.

Character string

30s

Disk Name

Name of a disk on a host.

Character string

30s

Disk Capacity

Disk capacity of the host (unit: GB)

≥ 0 GB

30s

Disk Usage

Disk usage of a host.

0% to 100%

30s

Disk Read Rate

Disk read rate of the host (unit: KB/s)

≥ 0 KB/s

30s

Disk Write Rate

Disk write rate of the host (unit: KB/s)

≥ 0 KB/s

30s

I/O Wait Time (await, ms)

Average waiting time for each I/O request (unit: ms)

≥ 0 ms

30s

I/O Service Time (svctm, ms)

Average processing time for each I/O request (unit: ms)

≥ 0 ms

30s

I/O Utility (util, %)

Disk I/O usage of a host.

0% to 100%

30s

Node Monitoring-Network

Node Name

Name of a node in a cluster.

Character string

30s

NIC Name

Name of the NIC on a host.

Character string

30s

NIC Status

NIC status.

up/down

30s

NIC Speed

Working rate of a NIC, in Mbit/s.

≥ 0

30s

Received Packets

Number of received packets of a NIC.

≥ 0

30s

Sent Packets

Number of sent packets of a NIC.

≥ 0

30s

Lost Packets Received

Number of received lost packets of a NIC.

≥ 0

30s

Receive Rate

Number of bytes received by a NIC per unit of time (KB/s).

≥ 0 KB/s

30s

Transmit Rate

Number of bytes sent by a NIC per unit of time (unit: KB/s)

≥ 0 KB/s

30s

Database Monitoring

Database Name

Name of the database created by a user in a cluster.

Character string

60s

Usage

Used capacity of the current database (unit: GB).

≥ 0 GB

86400s

Users

Number of users in the current database.

≥ 0

30s

Sessions

Number of sessions in the current database.

≥ 0

30s

Applications

Number of applications in the current database.

≥ 0

30s

Queries

Number of active queries in the current database.

≥ 0

30s

Scanning Rows

Number of rows returned by the full table scan query in the current database.

≥ 0

60s

Index Query Rows

Number of rows returned by the index query in the current database.

≥ 0

60s

Inserted Rows

Number of rows inserted in the current database.

≥ 0

60s

Updated Rows

Number of rows updated in the current database.

≥ 0

60s

Deleted Rows

Number of rows deleted from the current database.

≥ 0

60s

Executed Transactions

Number of transaction executions on the current database.

≥ 0

60s

Transaction Rollbacks

Number of transactions in the current database that have been rolled back.

≥ 0

60s

Deadlocks

Number of deadlocks detected in the current database.

≥ 0

60s

Temporary Files

Number of temporary files created in the current database.

≥ 0

60s

Temporary File Capacity

Size of temporary files written by the current database, in GB.

≥ 0

60s

Performance Monitoring

Cluster CPU Usage

Historical trend of the average CPU usage and skew of all nodes in the cluster. The formula for calculating the skew is (max-avg)/max.

0% to 100%

30s

Cluster Memory Usage

Historical trend of the average memory usage and skew of all nodes in the cluster. The formula for calculating the skew is (max-avg)/max.

0% to 100%

30s

Cluster Disk Usage

Historical trend of the average disk usage and skew of all nodes in the cluster. The formula for calculating the skew is (max-avg)/max.

0% to 100%

30s

Cluster Disk I/O

Historical trend of the average disk I/O and skew of all disks in the cluster. The formula for calculating the skew is (max-avg)/max.

0% to 100%

30s

Cluster Network I/O

Historical trend of the average network I/O value and skew of all NICs in the cluster. The formula for calculating the skew is (max-avg)/max.

0% to 100%

30s

Cluster Status

Historical trend of the cluster status.

Normal/Abnormal/Degraded

30s

Cluster Read-only

Historical trend of the cluster read-only status change trend.

Yes/No

30s

Cluster Abnormal CNs

Historical trend of the number of abnormal CNs in the cluster.

≥ 0

60s

Cluster Abnormal DNs

Historical trend of the number of abnormal DNs in the cluster.

≥ 0

60s

Cluster CPU Usage of DNs

Historical trends of the average CPU usage and skew ratio changes of all DNs in the cluster. The formula for calculating the skew ratio is (max – avg)/max.

0% to 100%

60s

Cluster Sessions

Historical trend of the number of sessions in a cluster.

≥ 0

30s

Cluster Queries

Historical change trend of the number of queries in the cluster.

≥ 0

30s

Cluster Deadlocks

Historical trend of the number of deadlocks in a cluster.

≥ 0

60s

Cluster TPS

Average number of transactions per second of all databases in a cluster. Formula: (delta_xact_commit + delta_xact_rollback)/current_collect_rate

≥0

60s

Cluster QPS

Average number of concurrent requests per second of all databases in a cluster. Formula: delta_query_count/current_collect_rate

≥ 0

60s

Database Sessions

Historical trend of the number of sessions on a single database in a cluster.

≥ 0

30s

Database Queries

Historical trend of the number of queries on a single database in a cluster.

≥ 0

30s

Database Inserted Rows

Historical trend of the number of rows inserted into a single database in a cluster.

≥ 0

60s

Database Updated Rows

Historical trend of the number of updated rows in a single database in a cluster.

≥ 0

60s

Database Deleted Rows

Historical trend of the number of deleted rows in a single database in a cluster.

≥ 0

60s

Database Capacity

Historical trend of the capacity in a single database in a cluster.

≥ 0

86400s

Live Session

Session ID

ID of the current session (query thread ID).

Character string

30s

User Name

Name of the user who executes the current session.

Character string

30s

Database Name

Name of the database connected to the current session.

Character string

30s

Session Duration

Duration of the current session (unit: ms).

≥ 0 ms

30s

Application Name

Name of the application that creates the current session.

Character string

30s

Queries

Number of SQL statements executed in the current session.

≥ 0

30s

Latest Query Duration

Duration for executing the previous SQL statement in the current session.

≥ 0 ms

30s

Client IP Address

IP address of the client that initiates the current session.

Character string

30s

Connected CN

Connected CN of the current session.

Character string

30s

Session Status

Execution status of the current session.

Running/Idle/Retry

30s

Real-Time Query

Query ID

Query ID of a current query statement, which is a unique identifier allocated by the kernel to each query statement.

Character string

30s

User Name

Name of the user who submits the current query statement.

Character string

30s

Database Name

Name of the database corresponding to the current query statement.

Character string

30s

Application Name

Name of the application corresponding to the current query statement.

Character string

30s

Workload Queue

Name of the workload queue that carries the current query statement.

Character string

30s

Submitted

Timestamp when the current query statement is submitted.

Character string

30s

Blocking Time

Waiting time before the current query statement is executed, in ms.

≥ 0

30s

Execution Time

Execution time of the current query statement, in ms.

≥ 0

30s

CPU Time

Total CPU time spent by the current query statement on all DNs, in ms.

≥ 0

30s

CPU Time Skew

CPU time skew of the current query statement among all DNs.

0% to 100%

30s

Statement

Query statement that is being executed.

Character string

30s

Connected CN

Name of the CN that submits the current query statement.

Character string

30s

Client IP Address

IP address of the client that submits the current query statement.

Character string

30s

Lane

Lane where the current query statement is located.

Fast Lane/Slow Lane

30s

Query Status

Query status of the statement that is being executed.

Character string

30s

Session ID

Session ID of the current query statement, which is a unique identifier allocated by the kernel to each client connection.

Character string

30s

Queuing Status

Status of the current query execution in the database, indicating whether the query is queued in the workload queue.

Yes/No

30s

Historical Query

Query ID

Query ID of a query statement, which is a unique identifier allocated by the kernel to each query statement.

Character string

180s

User Name

Name of the user who submits a query statement.

Character string

180s

Application Name

Application name corresponding to a query statement.

Character string

180s

Database Name

Name of the database corresponding to a query statement.

Character string

180s

Workload Queue

Name of the workload queue that carries the current query statement.

Character string

180s

Submitted

Timestamp when a query statement is submitted.

Character string

180s

Blocking Time

Waiting time before the query statement is executed, in ms.

≥ 0

180s

Execution Time

Execution time of the query statement, in ms.

≥ 0

180s

CPU Time

Total CPU time spent by the query statement on all DNs, in ms.

≥ 0

180s

CPU Time Skew

CPU time skew of a query statement executed on all DNs.

0% to 100%

180s

Statement

Query statements to be parsed

Character string

180s

Slow Instance Monitoring

Slow Instance

Number of slow instances detected at the current time point.

≥ 0

240s

Detected

Time when a slow instance is detected for the first time.

Character string

240s

Node Name

Name of the node where the slow instance is deployed.

Character string

240s

Instance

Name of an instance.

Character string

240s

Slow Node Detections (within 24 hours)

Number of times that a slow instance is detected within 24 hours.

≥ 0

240s

Workload Queue Monitoring

Workload Queue

Name of the workload queue in the cluster.

Character string

120s

CPU Usage

Real-time CPU usage of the workload queue.

0% to 100%

120s

CPU Resource

CPU usage quotas of the workload queue.

0% to 100%

120s

Real-Time Concurrent Short Queries

Number of real-time concurrent simple queries in a workload queue.

≥ 0

120s

Concurrent Short Queries

Concurrent simple query quotas of a workload queue.

≥ 0

120s

Real-Time Concurrent Queries

Number of real-time concurrent complex queries in a workload queue.

≥ 0

120s

Query Concurrency

Concurrent complex query quotas of a workload queue.

≥ 0

120s

Storage

Storage quota of the workload queue.

≥ 0

120s

Disk Usage

Disk usage of the workload queue.

0% to 100%

120s

Memory

Memory quota of the workload queue.

≥ 0

120s

Memory Usage

Memory usage of the workload queue.

0% to 100%

120s

Waiting Queries

User

Name of the user of waiting queries

Character string

120s

Application

Name of the application to be queried.

Character string

120s

Database

Name of the database to be queried.

Character string

120s

Queuing Status

Execution status of a query in the database (CCN/CN/DN).

Character string

120s

Wait Time

Waiting time for a waiting query (unit: ms).

≥ 0 ms

120s

Workload Queue

Workload queue to which the waiting query belongs.

Character string

120s

Statement

Query statement for the waiting status.

Character string

120s

Circuit Breaking Queries

Query ID

Query ID of the circuit breaking query statement.

Character string

120s

Query Statement

Query statement for the circuit breaking status.

Character string

120s

Blocking Time

Blocking time before the query statement triggers circuit breaking, in ms.

≥ 0

120s

Execution Time

Execution time before the query statement triggers circuit breaking, in ms.

≥ 0

120s

CPU Time

Average CPU time consumed by each DN before the query statement triggers circuit breaking, in ms.

≥ 0

120s

CPU Skew

Skew rate of CPU time consumed by each DN before the query statement triggers circuit breaking.

0% to 100%

120s

Exception Handling

Handling method after the query statement triggers circuit breaking.

Abort/Degrade

120s

Status

Circuit breaking handling status of a query statement.

Executing/Completed

120s

SQL Tuning

Query ID

IP address of the current query (query logic ID).

Character string

180s

Database

Name of the database where the current query is executed.

Character string

180s

Schema Name

Name of the current query schema.

Character string

180s

User Name

Name of the user who performs the query.

Character string

180s

Client

Name of the client that initiates the current query.

Character string

180s

Client IP Address

IP address of the client that initiates the current query.

Character string

180s

Running Time

Execution time of the current query, in ms.

≥ 0

180s

CPU Time

CPU time of the current query, in ms.

≥ 0

180s

Scale-Out Started

Start time of the current query.

Timestamp

180s

Completed

End time of the current query.

Timestamp

180s

Details

Details about the current query.

Character string

180s

INODE

Inode Usage

Disk inode usage.

0% to 100%

30s

SCHEMA

Schema Usage

Database schema usage.

0% to 100%

3600s