Updated on 2023-03-30 GMT+08:00

Cluster Management and HA

Description

GaussDB(DWS) provides the cluster manager (CM) module to manage and monitor the running status of each functional unit and physical resource in the distributed system, ensuring that the entire system runs stably. A CM runs in either the primary or the standby role. Normally, only the primary CM provides the GaussDB(DWS) cluster management service. If the primary CM becomes faulty, the standby CM is promoted to primary and takes over cluster management.
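The primary/standby behavior described above can be illustrated with a minimal sketch. This is not the GaussDB(DWS) implementation: the `ClusterManager` class, the `select_primary` function, and the node names are hypothetical, and the choice to demote a faulty primary to standby is an assumption made only to keep the example self-consistent.

```python
from dataclasses import dataclass


@dataclass
class ClusterManager:
    name: str
    role: str            # "primary" or "standby"
    healthy: bool = True


def select_primary(cms):
    """Return the CM that should serve management requests, or None.

    A healthy primary keeps its role. Otherwise the first healthy
    standby is promoted (illustrative policy, not the product's).
    """
    primary = next((cm for cm in cms if cm.role == "primary"), None)
    if primary is not None and primary.healthy:
        return primary
    for cm in cms:
        if cm.role == "standby" and cm.healthy:
            if primary is not None:
                primary.role = "standby"  # assumption: faulty primary is demoted
            cm.role = "primary"
            return cm
    return None  # no healthy CM is available


cms = [ClusterManager("cm1", "primary"), ClusterManager("cm2", "standby")]
cms[0].healthy = False             # simulate a primary CM fault
print(select_primary(cms).name)    # the promoted standby
```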

Technical Principles

The cluster management module consists of the CMServer, CMAgent, and Monitor components, and provides tools for querying cluster status, starting and stopping a cluster, performing primary/standby switchover, and rebuilding instances. CMServer is deployed only on the primary and standby CMs. As the brain of the entire GaussDB(DWS) cluster, CMServer processes various status information reported by CMAgent and determines whether to change the status. Deployed on all nodes, CMAgent functions as an instance agent process, reports the status of CNs, DNs, GTMs, and other instances to CMServer, and receives and executes commands delivered by CMServer. Monitor is deployed on all nodes as a scheduled task that restarts CMAgent when it is stopped.
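The division of work among the three components can be sketched as follows. This is a simplified model under stated assumptions, not the actual component interfaces: the class and method names, the status strings, the `"recover"` command, and the instance names `cn_5001` and `dn_6001` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CMAgent:
    node: str
    instances: dict                      # instance name -> status string
    running: bool = True
    executed: list = field(default_factory=list)

    def report(self):
        """Snapshot of local instance status, as sent to CMServer."""
        return {"node": self.node, "instances": dict(self.instances)}

    def execute(self, command):
        """Record a command delivered by CMServer (a real agent would act on it)."""
        self.executed.append(command)


class CMServer:
    def process(self, report):
        """Return one hypothetical command per instance not reporting 'normal'."""
        return [("recover", name)
                for name, status in report["instances"].items()
                if status != "normal"]


def monitor_tick(agent):
    """One run of the scheduled task: restart CMAgent if it has stopped."""
    if not agent.running:
        agent.running = True
        return True
    return False


agent = CMAgent("node1", {"cn_5001": "normal", "dn_6001": "abnormal"})
for command in CMServer().process(agent.report()):
    agent.execute(command)
print(agent.executed)
```

In this model the agent only reports and executes, while all decisions stay in the server, which mirrors the description of CMServer as the decision-making component.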

Hang Detection

The cluster management module uses short-lived connections to check whether an instance process is in an abnormal state, such as a network fault, suspended disk I/O, or a hung process or thread. If necessary, the cluster management module triggers a DN or GTM primary/standby switchover, or removes the affected CN.

Take DNs as an example. By default, CMAgent attempts to connect to each instance every 180 seconds. If a connection attempt fails, CMAgent retries every 84 seconds. If the connection fails five consecutive times, a primary/standby DN switchover is triggered. A complete hang detection period therefore takes about 600 seconds.
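The figure of about 600 seconds is consistent with one check interval plus five retry intervals: 180 + 5 × 84 = 600. The sketch below reproduces that arithmetic. It is an illustrative reading of the defaults, not the CMAgent implementation: the function name and the `probe` callable are hypothetical, and charging one retry interval per failed attempt is an assumption chosen because it matches the stated total.

```python
def simulate_hang_detection(probe, check_interval_s=180,
                            retry_interval_s=84, max_failures=5):
    """Return the seconds elapsed when a switchover would be triggered.

    Returns None if a connection attempt succeeds before the
    failure limit is reached. `probe` is a callable that returns
    True when the connection succeeds.
    """
    elapsed = check_interval_s          # wait for the scheduled check
    failures = 0
    while failures < max_failures:
        if probe():
            return None                 # instance responded, no switchover
        failures += 1
        elapsed += retry_interval_s     # assumed cost of each failed attempt
    return elapsed


# A DN that never responds is detected after the full period.
print(simulate_hang_detection(lambda: False))
```

A single successful attempt ends the cycle in this model, so only consecutive failures lead to a switchover.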

CN Retry

GaussDB(DWS) provides the CN Retry function to automatically retry SQL statements when an exception occurs, improving service continuity.

The CN triggers the retry mechanism when an error is reported during statement execution. For retryable errors, the CN rolls back the executed operation and executes the statement again. If the statement still fails, the error information is reported to the client.
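The retry flow can be sketched as a small wrapper. This is a conceptual model only, not the CN's internal logic: `RetryableError`, `execute_with_retry`, the two callables, and the `max_retries` limit are hypothetical names and parameters introduced for illustration.

```python
class RetryableError(Exception):
    """Hypothetical marker for errors that are eligible for retry."""


def execute_with_retry(run_statement, rollback, max_retries=1):
    """Run a statement, retrying on retryable errors.

    Before each retry the partial work is rolled back. If the
    statement still fails after the last retry, the error is
    re-raised, which models reporting it to the client.
    Non-retryable errors propagate immediately.
    """
    for attempt in range(max_retries + 1):
        try:
            return run_statement()
        except RetryableError:
            if attempt == max_retries:
                raise                   # still failing: report to the client
            rollback()                  # undo the executed operation, then retry


calls = {"run": 0, "rollback": 0}


def flaky_statement():
    calls["run"] += 1
    if calls["run"] == 1:
        raise RetryableError("simulated transient fault")
    return "ok"


def rollback():
    calls["rollback"] += 1


print(execute_with_retry(flaky_statement, rollback), calls)
```

The caller sees only the final result or the final error, which matches the way the retry is hidden from the client.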

The retry process is transparent to users.