Database Monitoring Tool
Background
As the IT infrastructure, databases must be stable, high-throughput, and low-latency. To achieve the preceding objectives, ensure that the database cluster can continuously provide stable and fast services for users. A comprehensive database monitoring tool is required to help database O&M personnel monitor database running in real time and detect, locate, and handle exceptions in a timely manner.
The Database Monitor Service (DMS) is a native database monitoring service of GaussDB(DWS). It provides comprehensive database resource consumption indicators and database service execution indicators. DMS provides a visual tool to monitor the real-time and historical running status of database clusters, helping you detect, locate, and rectify faults.
Database Monitoring Principles
DMS uses a three-layer structure to monitor the GaussDB(DWS) database cluster:
- Collection (dms-agent): The agent is embedded in each node of the database cluster to collect the cluster running status, collect the original data of cluster running, and report the data.
- Storage (dms-collection): receives data reported by the agent and saves the data to the metric database.
- Analysis (dms-monitoring): A large amount of data in the monitoring database is used, and the monitoring view is displayed to users through aggregation calculation.
Database Monitoring Metrics
Table 1 describes the database monitoring metrics. For more information, see Database Monitoring Overview
Monitored Object |
Metric |
Description |
Value Range |
Monitoring Period (Raw Data) |
---|---|---|---|---|
Cluster Overview |
Cluster Status |
Status of a cluster. |
Normal/Abnormal/Degraded |
30s |
Nodes |
Number of available nodes and total number of nodes (Available/Total) in a cluster. |
≥ 0 |
60s |
|
CNs |
Number of CNs in a cluster. |
≥ 0 |
60s |
|
Databases |
Number of created databases in a cluster. |
≥ 0 |
90s |
|
Resource Consumption |
CPU Usage |
Average real-time CPU usage of all nodes in a cluster. |
0% to 100% |
30s |
Memory Usage |
Average real-time memory usage of all nodes in a cluster. |
0% to 100% |
30s |
|
Disk Usage |
Average real-time disk usage of all nodes in a cluster. |
0% to 100% |
30s |
|
Disk I/O |
Average real-time disk I/O of all nodes in a cluster. |
≥ 0 KB/s |
30s |
|
Network I/O |
Average real-time network I/O of all NICs in a cluster. |
≥ 0 KB/s |
30s |
|
Top 5 Time-Consuming Queries |
Query ID |
ID of a query, which is automatically generated by the database. |
≥ 0 |
180s |
SQL Statement |
Query statement executed by a user. |
Character string |
180s |
|
Execution Time |
Execution time of a query statement (unit: ms). |
≥ 0 ms |
180s |
|
Top 5 Queries with Most Data Written to Disk |
Query ID |
ID of a query, which is automatically generated by the database. |
≥ 0 |
180s |
SQL Statement |
Query statement executed by a user. |
Character string |
180s |
|
Data Written to Disk |
Data to be written to disks after a user runs a statement (unit: MB). |
≥ 0 MB |
180s |
|
Cluster Resource Metrics |
CPU Usage |
Average CPU usage and skew ratio of all nodes in the cluster. The formula for calculating the skew is (max-avg)/max. |
0% to 100% |
30s |
Memory Usage |
Average memory usage and skew ratio of all nodes in the cluster. The formula for calculating the skew is (max-avg)/max. |
0% to 100% |
30s |
|
Disk Usage |
Average usage and skew ratio of all disks in the cluster. The formula for calculating the skew is (max-avg)/max. |
0% to 100% |
30s |
|
Disk I/O Usage |
Average I/O usage and skew rate of all disks in the cluster. The formula for calculating the skew is (max-avg)/max. |
0% to 100% |
30s |
|
Network I/O Usage |
Average I/O usage and skew rate of all NICs in the cluster. The formula for calculating the skew is (max-avg)/max. |
0% to 100% |
30s |
|
Key Database Metrics |
Cluster Status |
Cluster running status. |
Normal/Degraded/Abnormal |
30s |
Cluster Abnormal CNs |
Number of abnormal CNs in the cluster |
≥ 0 |
60s |
|
Cluster Read-only |
Whether the cluster is in the read-only state |
Yes/No |
30s |
|
Concurrent Sessions |
Number of concurrent sessions in a cluster within a specified period. |
≥ 0 |
30s |
|
Concurrent Queries |
Number of concurrent queries in a cluster within a specified period. |
≥ 0 |
30s |
|
Node Monitoring-Overview |
Node Name |
Name of a node in a cluster. |
Character string |
30s |
CPU Usage |
CPU usage of a host. |
0% to 100% |
30s |
|
Memory Usage |
Memory usage of a host. |
0% to 100% |
30s |
|
Average Disk Usage (%) |
Disk usage of a host. |
0% to 100% |
30s |
|
IP Address |
Service IP address of a host. |
Character string |
30s |
|
Disk I/O |
Disk I/O of a host (unit: KB/s) |
≥ 0 KB/s |
30s |
|
TCP Protocol Stack Retransmission Rate |
Retransmission rate of TCP packets per unit time. |
0% to 100% |
30s |
|
Status |
Running status of a host |
Online/Offline |
30s |
|
Node Monitoring-Disks |
Node Name |
Name of a node in a cluster. |
Character string |
30s |
Disk Name |
Name of a disk on a host. |
Character string |
30s |
|
Disk Capacity |
Disk capacity of the host (unit: GB) |
≥ 0 GB |
30s |
|
Disk Usage |
Disk usage of a host. |
0% to 100% |
30s |
|
Disk Read Rate |
Disk read rate of the host (unit: KB/s) |
≥ 0 KB/s |
30s |
|
Disk Write Rate |
Disk write rate of the host (unit: KB/s) |
≥ 0 KB/s |
30s |
|
I/O Wait Time (await, ms) |
Average waiting time for each I/O request (unit: ms) |
≥ 0 ms |
30s |
|
I/O Service Time (svctm, ms) |
Average processing time for each I/O request (unit: ms) |
≥ 0 ms |
30s |
|
I/O Utility (util, %) |
Disk I/O usage of a host. |
0% to 100% |
30s |
|
Node Monitoring-Network |
Node Name |
Name of a node in a cluster. |
Character string |
30s |
NIC Name |
Name of the NIC on a host. |
Character string |
30s |
|
NIC Status |
NIC status. |
up/down |
30s |
|
NIC Speed |
Working rate of a NIC, in Mbit/s. |
≥ 0 |
30s |
|
Received Packets |
Number of received packets of a NIC. |
≥ 0 |
30s |
|
Sent Packets |
Number of sent packets of a NIC. |
≥ 0 |
30s |
|
Lost Packets Received |
Number of received lost packets of a NIC. |
≥ 0 |
30s |
|
Receive Rate |
Number of bytes received by a NIC per unit of time (KB/s). |
≥ 0 KB/s |
30s |
|
Transmit Rate |
Number of bytes sent by a NIC per unit of time (unit: KB/s) |
≥ 0 KB/s |
30s |
|
Database Monitoring |
Database Name |
Name of the database created by a user in a cluster. |
Character string |
60s |
Usage |
Used capacity of the current database (unit: GB). |
≥ 0 GB |
86400s |
|
Users |
Number of users in the current database. |
≥ 0 |
30s |
|
Sessions |
Number of sessions in the current database. |
≥ 0 |
30s |
|
Applications |
Number of applications in the current database. |
≥ 0 |
30s |
|
Queries |
Number of active queries in the current database. |
≥ 0 |
30s |
|
Scanning Rows |
Number of rows returned by the full table scan query in the current database. |
≥ 0 |
60s |
|
Index Query Rows |
Number of rows returned by the index query in the current database. |
≥ 0 |
60s |
|
Inserted Rows |
Number of rows inserted in the current database. |
≥ 0 |
60s |
|
Updated Rows |
Number of rows updated in the current database. |
≥ 0 |
60s |
|
Deleted Rows |
Number of rows deleted from the current database. |
≥ 0 |
60s |
|
Executed Transactions |
Number of transaction executions on the current database. |
≥ 0 |
60s |
|
Transaction Rollbacks |
Number of transactions in the current database that have been rolled back. |
≥ 0 |
60s |
|
Deadlocks |
Number of deadlocks detected in the current database. |
≥ 0 |
60s |
|
Temporary Files |
Number of temporary files created in the current database. |
≥ 0 |
60s |
|
Temporary File Capacity |
Size of temporary files written by the current database, in GB. |
≥ 0 |
60s |
|
Performance Monitoring |
Cluster CPU Usage |
Historical trend of the average CPU usage and skew of all nodes in the cluster. The formula for calculating the skew is (max-avg)/max. |
0% to 100% |
30s |
Cluster Memory Usage |
Historical trend of the average memory usage and skew of all nodes in the cluster. The formula for calculating the skew is (max-avg)/max. |
0% to 100% |
30s |
|
Cluster Disk Usage |
Historical trend of the average disk usage and skew of all nodes in the cluster. The formula for calculating the skew is (max-avg)/max. |
0% to 100% |
30s |
|
Cluster Disk I/O |
Historical trend of the average disk I/O and skew of all disks in the cluster. The formula for calculating the skew is (max-avg)/max. |
0% to 100% |
30s |
|
Cluster Network I/O |
Historical trend of the average network I/O value and skew of all NICs in the cluster. The formula for calculating the skew is (max-avg)/max. |
0% to 100% |
30s |
|
Cluster Status |
Historical trend of the cluster status. |
Normal/Abnormal/Degraded |
30s |
|
Cluster Read-only |
Historical trend of the cluster read-only status change trend. |
Yes/No |
30s |
|
Cluster Abnormal CNs |
Historical trend of the number of abnormal CNs in the cluster. |
≥ 0 |
60s |
|
Cluster Abnormal DNs |
Historical trend of the number of abnormal DNs in the cluster. |
≥ 0 |
60s |
|
Cluster CPU Usage of DNs |
Historical trends of the average CPU usage and skew ratio changes of all DNs in the cluster. The formula for calculating the skew ratio is (max – avg)/max. |
0% to 100% |
60s |
|
Cluster Sessions |
Historical trend of the number of sessions in a cluster. |
≥ 0 |
30s |
|
Cluster Queries |
Historical change trend of the number of queries in the cluster. |
≥ 0 |
30s |
|
Cluster Deadlocks |
Historical trend of the number of deadlocks in a cluster. |
≥ 0 |
60s |
|
Cluster TPS |
Average number of transactions per second of all databases in a cluster. Formula: (delta_xact_commit + delta_xact_rollback)/current_collect_rate |
≥0 |
60s |
|
Cluster QPS |
Average number of concurrent requests per second of all databases in a cluster. Formula: delta_query_count/current_collect_rate |
≥ 0 |
60s |
|
Database Sessions |
Historical trend of the number of sessions on a single database in a cluster. |
≥ 0 |
30s |
|
Database Queries |
Historical trend of the number of queries on a single database in a cluster. |
≥ 0 |
30s |
|
Database Inserted Rows |
Historical trend of the number of rows inserted into a single database in a cluster. |
≥ 0 |
60s |
|
Database Updated Rows |
Historical trend of the number of updated rows in a single database in a cluster. |
≥ 0 |
60s |
|
Database Deleted Rows |
Historical trend of the number of deleted rows in a single database in a cluster. |
≥ 0 |
60s |
|
Database Capacity |
Historical trend of the capacity in a single database in a cluster. |
≥ 0 |
86400s |
|
Live Session |
Session ID |
ID of the current session (query thread ID). |
Character string |
30s |
User Name |
Name of the user who executes the current session. |
Character string |
30s |
|
Database Name |
Name of the database connected to the current session. |
Character string |
30s |
|
Session Duration |
Duration of the current session (unit: ms). |
≥ 0 ms |
30s |
|
Application Name |
Name of the application that creates the current session. |
Character string |
30s |
|
Queries |
Number of SQL statements executed in the current session. |
≥ 0 |
30s |
|
Latest Query Duration |
Duration for executing the previous SQL statement in the current session. |
≥ 0 ms |
30s |
|
Client IP Address |
IP address of the client that initiates the current session. |
Character string |
30s |
|
Connected CN |
Connected CN of the current session. |
Character string |
30s |
|
Session Status |
Execution status of the current session. |
Running/Idle/Retry |
30s |
|
Real-Time Query |
Query ID |
Query ID of a current query statement, which is a unique identifier allocated by the kernel to each query statement. |
Character string |
30s |
User Name |
Name of the user who submits the current query statement. |
Character string |
30s |
|
Database Name |
Name of the database corresponding to the current query statement. |
Character string |
30s |
|
Application Name |
Name of the application corresponding to the current query statement. |
Character string |
30s |
|
Workload Queue |
Name of the workload queue that carries the current query statement. |
Character string |
30s |
|
Submitted |
Timestamp when the current query statement is submitted. |
Character string |
30s |
|
Blocking Time |
Waiting time before the current query statement is executed, in ms. |
≥ 0 |
30s |
|
Execution Time |
Execution time of the current query statement, in ms. |
≥ 0 |
30s |
|
CPU Time |
Total CPU time spent by the current query statement on all DNs, in ms. |
≥ 0 |
30s |
|
CPU Time Skew |
CPU time skew of the current query statement among all DNs. |
0% to 100% |
30s |
|
Statement |
Query statement that is being executed. |
Character string |
30s |
|
Connected CN |
Name of the CN that submits the current query statement. |
Character string |
30s |
|
Client IP Address |
IP address of the client that submits the current query statement. |
Character string |
30s |
|
Lane |
Lane where the current query statement is located. |
Fast Lane/Slow Lane |
30s |
|
Query Status |
Query status of the statement that is being executed. |
Character string |
30s |
|
Session ID |
Session ID of the current query statement, which is a unique identifier allocated by the kernel to each client connection. |
Character string |
30s |
|
Queuing Status |
Status of the current query execution in the database, indicating whether the query is queued in the workload queue. |
Yes/No |
30s |
|
Historical Query |
Query ID |
Query ID of a query statement, which is a unique identifier allocated by the kernel to each query statement. |
Character string |
180s |
User Name |
Name of the user who submits a query statement. |
Character string |
180s |
|
Application Name |
Application name corresponding to a query statement. |
Character string |
180s |
|
Database Name |
Name of the database corresponding to a query statement. |
Character string |
180s |
|
Workload Queue |
Name of the workload queue that carries the current query statement. |
Character string |
180s |
|
Submitted |
Timestamp when a query statement is submitted. |
Character string |
180s |
|
Blocking Time |
Waiting time before the query statement is executed, in ms. |
≥ 0 |
180s |
|
Execution Time |
Execution time of the query statement, in ms. |
≥ 0 |
180s |
|
CPU Time |
Total CPU time spent by the query statement on all DNs, in ms. |
≥ 0 |
180s |
|
CPU Time Skew |
CPU time skew of a query statement executed on all DNs. |
0% to 100% |
180s |
|
Statement |
Query statements to be parsed |
Character string |
180s |
|
Slow Instance Monitoring |
Slow Instance |
Number of slow instances detected at the current time point. |
≥ 0 |
240s |
Detected |
Time when a slow instance is detected for the first time. |
Character string |
240s |
|
Node Name |
Name of the node where the slow instance is deployed. |
Character string |
240s |
|
Instance |
Name of an instance. |
Character string |
240s |
|
Slow Node Detections (within 24 hours) |
Number of times that a slow instance is detected within 24 hours. |
≥ 0 |
240s |
|
Workload Queue Monitoring |
Workload Queue |
Name of the workload queue in the cluster. |
Character string |
120s |
CPU Usage |
Real-time CPU usage of the workload queue. |
0% to 100% |
120s |
|
CPU Resource |
CPU usage quotas of the workload queue. |
0% to 100% |
120s |
|
Real-Time Concurrent Short Queries |
Number of real-time concurrent simple queries in a workload queue. |
≥ 0 |
120s |
|
Concurrent Short Queries |
Concurrent simple query quotas of a workload queue. |
≥ 0 |
120s |
|
Real-Time Concurrent Queries |
Number of real-time concurrent complex queries in a workload queue. |
≥ 0 |
120s |
|
Query Concurrency |
Concurrent complex query quotas of a workload queue. |
≥ 0 |
120s |
|
Storage |
Storage quota of the workload queue. |
≥ 0 |
120s |
|
Disk Usage |
Disk usage of the workload queue. |
0% to 100% |
120s |
|
Memory |
Memory quota of the workload queue. |
≥ 0 |
120s |
|
Memory Usage |
Memory usage of the workload queue. |
0% to 100% |
120s |
|
Waiting Queries |
User |
Name of the user of waiting queries |
Character string |
120s |
Application |
Name of the application to be queried. |
Character string |
120s |
|
Database |
Name of the database to be queried. |
Character string |
120s |
|
Queuing Status |
Execution status of a query in the database (CCN/CN/DN). |
Character string |
120s |
|
Wait Time |
Waiting time for a waiting query (unit: ms). |
≥ 0 ms |
120s |
|
Workload Queue |
Workload queue to which the waiting query belongs. |
Character string |
120s |
|
Statement |
Query statement for the waiting status. |
Character string |
120s |
|
Circuit Breaking Queries |
Query ID |
Query ID of the circuit breaking query statement. |
Character string |
120s |
Query Statement |
Query statement for the circuit breaking status. |
Character string |
120s |
|
Blocking Time |
Blocking time before the query statement triggers circuit breaking, in ms. |
≥ 0 |
120s |
|
Execution Time |
Execution time before the query statement triggers circuit breaking, in ms. |
≥ 0 |
120s |
|
CPU Time |
Average CPU time consumed by each DN before the query statement triggers circuit breaking, in ms. |
≥ 0 |
120s |
|
CPU Skew |
Skew rate of CPU time consumed by each DN before the query statement triggers circuit breaking. |
0% to 100% |
120s |
|
Exception Handling |
Handling method after the query statement triggers circuit breaking. |
Abort/Degrade |
120s |
|
Status |
Circuit breaking handling status of a query statement. |
Executing/Completed |
120s |
|
SQL Tuning |
Query ID |
IP address of the current query (query logic ID). |
Character string |
180s |
Database |
Name of the database where the current query is executed. |
Character string |
180s |
|
Schema Name |
Name of the current query schema. |
Character string |
180s |
|
User Name |
Name of the user who performs the query. |
Character string |
180s |
|
Client |
Name of the client that initiates the current query. |
Character string |
180s |
|
Client IP Address |
IP address of the client that initiates the current query. |
Character string |
180s |
|
Running Time |
Execution time of the current query, in ms. |
≥ 0 |
180s |
|
CPU Time |
CPU time of the current query, in ms. |
≥ 0 |
180s |
|
Scale-Out Started |
Start time of the current query. |
Timestamp |
180s |
|
Completed |
End time of the current query. |
Timestamp |
180s |
|
Details |
Details about the current query. |
Character string |
180s |
|
INODE |
Inode Usage |
Disk inode usage. |
0% to 100% |
30s |
SCHEMA |
Schema Usage |
Database schema usage. |
0% to 100% |
3600s |
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot