Help Center/ GaussDB(DWS)/ Best Practices/ Cluster Management/ Scaling Options for GaussDB(DWS) with a Coupled Storage-Compute Architecture

Updated on 2024-10-29 GMT+08:00

View PDF

Scaling Options for GaussDB(DWS) with a Coupled Storage-Compute Architecture

Scalability is a critical feature for cloud services. It refers to cloud services' ability to increase or decrease compute and storage resources to meet changing demand, achieving a balance between performance and cost.

Typically, a distributed architecture offers the following types of scalability:

Scale-out (horizontal scaling)
With a scale-out, more nodes are added to an existing system to increase storage and compute capacities. For GaussDB(DWS), this means to expand the cluster size. To ensure proper resource utilization, make sure the hardware devices you add use the same specifications as the ones already in the cluster do.

Scale-in (horizontal scaling)
Scale-in is the opposite of scale-out. With a scale-in, nodes are removed from an existing system to decrease storage and compute capacities and by doing so, increase resource utilization. GaussDB(DWS) is deployed by security ring, which means GaussDB(DWS) clusters are scaled in or out by security ring as well. We will talk about security rings in more detail in a later section.

Scale-up (vertical scaling)
With a scale-up, more CPUs, memory, disks, or NICs are added to existing servers to increase the corresponding capacities. In some cases, lower-capacity hardware is replaced by higher-capacity ones. This is also referred to as hardware upgrade, which may entail an OS upgrade sometimes.

Scale-down (vertical scaling)
Scale-down is the opposite of scale-up. With a scale-down, the hardware of an existing system is downgraded to match demand.

GaussDB(DWS) offers the standard data warehouse (DWS 2.0) and stream data warehouse, both of which use a distributed architecture with coupled storage and compute. They support both horizontal and vertical scaling. A cluster resizing option allows customers to perform horizontal and vertical scaling at the same time. The cluster topology can also be adjusted.

A Closer Look at GaussDB(DWS) Cluster Topology

To fully understand the scalability of GaussDB(DWS), one needs to understand GaussDB(DWS)'s typical cluster topology. The following figure shows a simplified ECS+EVS deployment structure of GaussDB(DWS).

ECSs provide compute resources, including CPUs and memory. GaussDB(DWS) database instances (such as CNs and DNs) are deployed on ECSs.
EVS provides storage resources. An EVS disk is attached to each DN.
All ECSs in a GaussDB(DWS) cluster are within the same VPC to ensure high-speed connections between them.
All the database instances deployed on ECSs form a distributed, massively parallel processing database (MPPDB) cluster to provide data analysis and processing capabilities as a whole.

Figure 1 Cluster topology

Once you have had a good look at the typical topology of a GaussDB(DWS) cluster, you can better understand GaussDB(DWS)'s scalability features. At present, GaussDB(DWS) offers the following scaling options: disk scaling, node flavor change, cluster scale-out, cluster scale-in, cluster resizing, and CN addition or deletion, as illustrated by the figure below:

Figure 2 GaussDB(DWS) scaling options

Disk Scaling

With disk scaling, the size of all EVS disks attached to all ECSs in a cluster is changed. This option can be used to quickly scale disk capacity.
Disk capacity can only be scaled up, and not down.
Disk scaling is a lightweight operation that typically can be completed within 5 to 10 minutes. It does not entail data migration or the restarting of services, so it does not interrupt services. Nonetheless, you are advised to perform this operation during off-peak hours.
GaussDB(DWS) standard data warehouses and stream data warehouses support this operation. The cluster version must be 8.1.1.203 or later.
For details, see Disk Capacity Expansion of an EVS Cluster.

Figure 3 Disk scaling

Changing the Node Flavor

This operation changes the flavor of all ECSs in a cluster. It can be used to quickly change CPU and memory specifications.
A flavor is a preset resource template of a combination of a specific number of vCPUs and memory. For example, the flavor dwsx.16xlarge includes 64 vCPUs and 512 GB memory.
Changing the node flavor is a lightweight operation that typically can be completed within 5 to 10 minutes. It does not involve data migration, but services will need to be restarted once, causing a service interruption in minutes. You are advised to perform this operation during off-peak hours.
GaussDB(DWS) standard data warehouses and stream data warehouses support this operation. The cluster version must be 8.1.1.300 or later.
For details, see Changing the Node Flavor.

Figure 4 Changing the node flavor

Scaling Out a Cluster

Cluster scale-out is a typical horizontal scaling scenario for MPPDBs, where homogeneous nodes are added to an existing cluster to increase capacity. GaussDB(DWS) 2.0 uses coupled storage and compute, so a cluster scale-out expands both compute and storage capacities.

To balance the load and achieve optimal performance, metadata replication and data redistribution are performed during a cluster scale-out. Therefore, the time needed to complete a cluster scale-out is positively correlated with the number of database objects as well as the data size. To ensure reliability, new nodes are automatically added to security rings. This is why at least three nodes must be added for a scale-out operation.

Figure 5 Scaling out a cluster

8.1.1 and later versions support online scale-out. During an online scale-out, GaussDB(DWS) does not restart and can continue to provide services. During data redistribution, you can perform insert, update, and delete operations on tables, but data updates may still be blocked for a short period of time. Redistribution consumes large quantities of CPU and I/O resources, significantly impacting job performance. Therefore, you are advised to perform redistribution when services are stopped or during periods of light load. A phase-by-phase approach is recommended for cluster scale-out: Perform high-concurrency redistribution during periods of light load, and stop redistribution or perform low-concurrency redistribution during periods of heavy load.

Cluster scale-out can be performed phase by phase or in one-click mode.

A phase-by-phase approach separates a scale-out operation into three phases: adding ECSs, adding nodes, and data redistribution. You can schedule the scale-out tasks in a way that can minimize the risk of service interruption.

On the other hand, a one-click scale-out is more convenient to users.

**Table 1** Comparing two different scale-out approaches
Approach	Characteristics	Impact
Phase-by-phase scale-out	A scale-out operation is divided into three phases: adding ECSs, adding nodes, and data redistribution. You can schedule each phase for the most appropriate times and perform them separately.	The risk of service interruption can be minimized.
One-click scale-out	During a one-click scale-out, adding ECSs, adding nodes, and redistributing data are all performed automatically.	It is more convenient to users.

GaussDB(DWS) Cluster Security Ring

A security ring is the minimum set of nodes required for the horizontal deployment of multi-replica DNs. Cluster scale-out and scale-in are both performed by security ring. The main idea behind security rings is fault isolation. Any fault that occurs within a security ring stays within that ring.

GaussDB(DWS) uses a primary-standby-secondary architecture, so the minimum number of nodes in a security ring is 3. When a fault occurs within a ring, it has no impact on nodes outside that ring. The scope of impact is minimized (3 nodes), and the impact on each node in that faulty ring is 1/(N-1), that is, 1/2. In extreme scenarios, the entire cluster is a security ring. If a fault occurs within this ring, the scope of impact is the largest (the entire cluster), but the impact on each node in the ring is the smallest, that is, 1/(N-1).

A common practice is to form an N+1 ring, where each node evenly distributes its N replicas to the remaining N nodes in the same ring. When a fault occurs in the ring, the scope of impact in the entire cluster is N+1 nodes, and the impact on each node in the ring is 1/N.

Figure 6 Typical N+1 security ring

Scaling In a Cluster

Cluster scale-in is also a typical horizontal scaling scenario for MPPDBs, where some of the nodes of an existing cluster are removed to reduce capacity. A cluster scale-in reduces both compute and storage capacities.
Each GaussDB(DWS) cluster physically consists of multiple ECSs. To improve reliability, a set number of ECSs (typically three) form a logical security ring, so each GaussDB(DWS) cluster consists of a number of security rings. A cluster scale-in is performed by security ring. The security rings at the end of a cluster are first removed.
A cluster scale-in involves data migration. Data on the removed nodes needs to be redistributed to the remaining nodes. This means the time needed to complete a cluster scale-in is positively correlated with the number of database objects as well as the data size.
GaussDB(DWS) standard data warehouses and stream data warehouses support cluster scale-in. 8.1.1.300 and later versions support online scale-in. During an online scale-in, GaussDB(DWS) does not restart and can continue to provide services. During data redistribution, you can perform insert, update, and delete operations on tables, but data updates may still be blocked for a short period of time. Redistribution consumes large quantities of CPU and I/O resources, significantly impacting job performance. Therefore, you are advised to perform redistribution when services are stopped or during periods of light load.

Figure 7 Scaling in a cluster

Adding or Deleting CNs

Adding or deleting coordinator nodes (CNs) is another way of cluster scaling in GaussDB(DWS).
CNs are an important component of GaussDB(DWS). It provides interfaces to external applications, optimizes global execution plans, distributes execution plans to data nodes (DNs), and summarizes results from each node into a single result set.
CN capacities determine the entire cluster's concurrency handling capability. By adding more CNs, you increase the cluster's concurrency handling capability.
CNs use a multi-active architecture. To ensure data consistency, if data on some CNs is damaged, DDL services will be blocked. To quickly restore DDL services, you can remove the faulty CNs.
In 8.1.1 and later versions, GaussDB(DWS) standard data warehouses and stream data warehouses support this operation.
When a CN is added, metadata needs to synchronized. The time it takes to add a CN depends on the metadata size. In 8.1.3, CNs can be added and deleted online. During CN addition, GaussDB(DWS) does not restart and can continue to provide services. DDL services will be blocked for a short period of time (with no error reported). No other services are affected.

Figure 8 Adding or deleting a CN

Resizing a Cluster

Cluster resizing allows you to perform horizontal and vertical scaling at the same time, including cluster scale-out and scale-in, as well as scale-up and scale-down. The cluster topology can also be adjusted.
Clustering resizing relies on multiple node groups and data redistribution. During cluster resizing, a new cluster is created based on new resource requirements and cluster planning. Then, data is redistributed between the old and new clusters. Once data migration is complete, services are migrated to the new cluster, and after that, the old cluster is released.
Cluster resizing involves data migration. Data on the nodes in the old cluster needs to be redistributed to the nodes in the new cluster, with the data still available in the old cluster. The time it takes to resize a cluster is positively correlated with the number of database objects as well as the data size.
GaussDB(DWS) standard data warehouses support cluster resizing, but agents must be upgraded to 8.2.0.2. Currently, during cluster resizing, the old cluster can only support read-only services. Online service capabilities can be expected later.
For details, see Changing All Specifications.

Figure 9 Resizing a cluster

Comparing Different Scaling Options

The table below compares different scaling options for GaussDB(DWS).

**Table 2** Comparing different scaling options for GaussDB(DWS)
Option	Scaled Object	Scope	Impact	Product
Disk scaling	Disk capacity	EVS disks attached to all ECSs in a cluster	Can be completed within 5 to 10 minutes. There is no need to restart services, so it has no impact on services. Should be performed during off-peak hours.	Cluster version: 8.1.1.203 or later Product form: standard data warehouse and stream data warehouse
Changing the node flavor	Compute capacity	The flavor (CPU cores and memory size) of all ECSs in a cluster	Can be completed within 5 to 10 minutes. Services will need to be restarted once, causing a service interruption in minutes. Should be performed during off-peak hours.	Cluster version: 8.1.1.300 or later Product form: standard data warehouse and stream data warehouse
Cluster scale-out	Disk and compute capacities	Adding homogeneous ECSs in a distributed architecture	Online scale-out supported. During an online scale-out, GaussDB(DWS) does not restart and can continue to provide services. The duration is positively correlated with the number of database objects as well as the data size.	Cluster version: all versions. Online scale-out is supported since 8.1.1. Product form: standard data warehouse and stream data warehouse
Cluster scale-in	Disk and compute capacities	Removing some of the ECSs in a distributed architecture	Online scale-in supported. During an online scale-in, GaussDB(DWS) does not restart and can continue to provide services. The duration is positively correlated with the number of database objects as well as the data size.	Cluster version: 8.1.1.300 Product form: standard data warehouse and stream data warehouse
Cluster resizing	Disk and compute capacities, and cluster topology	Using a new ECS flavor (new hardware specifications) and new cluster topology to create a new cluster, and redistributing data between the old and new clusters	The duration is positively correlated with the number of database objects as well as the data size. Read-only services can be provided during cluster resizing.	Cluster version: Agent 8.2.0.2 or later Product form: standard data warehouse
Adding or deleting CNs	CN instances	Adding CNs to enhance concurrency, or removing faulty CNs to quickly restore DDL services	Online addition and deletion of CNs is supported in 8.1.3 and later. During CN addition, GaussDB(DWS) does not restart and can continue to provide services.	Cluster version: 8.1.1. (Online addition and deletion of CNs is supported in 8.1.3 and later.) Product form: standard data warehouse and stream data warehouse

Application Scenarios for Different Scaling Options

Table 3 describes when to use each scaling option.

**Table 3** Application scenarios for different scaling options for GaussDB(DWS)
Category	Problem to Solve	Recommended Scaling Option	Impact on Services	Estimated Duration
Storage	Insufficient storage space. CPU, memory, and disk I/O capacities are sufficient.	Increase disk capacity.	Online services can be maintained.	No need for data migration. Can be completed within 5 to 10 minutes.
Storage	Excessive storage space, which needs to be reduced to cut costs. CPU, memory, and disk I/O capacities are sufficient.	Create a cluster with smaller disk capacity (but otherwise unchanged), and migrate data to the new cluster by performing a DR switchover.	Data becomes read-only during the DR switchover, which typically takes less than 30 minutes.	The duration is positively correlated with the data size.
Compute	Insufficient CPU or memory capacity	Use a larger ECS flavor.	The cluster needs to restart once.	No need for data migration. Can be completed within 5 to 10 minutes.
Compute	Insufficient disk I/O	Create a cluster with smaller disk capacity (but otherwise unchanged), and migrate data to the new cluster by performing a DR switchover.	Data becomes read-only during the DR switchover, which typically takes less than 30 minutes.	The duration is positively correlated with the data size.
Distributed compute and storage	Insufficient distributed capabilities due to insufficient nodes	Scale out the cluster.	Online services can be maintained (partially impacted).	Data migration is needed. The duration is positively correlated with the sizes of metadata as well as service data.
Distributed compute and storage	Too many nodes, leading to a high cost	Scale in the cluster.	Online services can be maintained (partially impacted).	Data migration is needed. The duration is positively correlated with the size of service data.
Cluster topology	Change both the cluster topology and node flavor (the number of DNs changes).	Resizes the cluster.	Read-only services	Data migration is needed. The duration is positively correlated with the sizes of metadata as well as service data.
	Change both the cluster topology and node flavor (the number of DNs remains the same).	Perform cluster DR switchover and data migration	Online services can be maintained (partially impacted).	Data migration is needed. The duration is positively correlated with the size of service data.
	Insufficient concurrency support	Add CNs.	Online services can be maintained (partially impacted).	Data migration is needed. The duration is positively correlated with the size of metadata.