GDS-based Cross-Cluster Interconnection

Background

In medium and large enterprises, data warehouses are typically layered: multiple GaussDB(DWS) clusters are deployed, and data is synchronized across them. This synchronization must handle large data volumes and run in parallel, which makes efficient cross-cluster data synchronization one of the key issues in the data field.

Function

Full data migration between GaussDB(DWS) clusters

Partial data migration based on filter conditions between GaussDB(DWS) clusters

Technical Principles

An SQL statement that triggers synchronization is converted, through query rewriting, into a pair of GDS jobs: an export job and an import job. The export job runs in the source cluster and the import job runs in the destination cluster, forming an efficient, real-time data transfer channel for data migration and synchronization. You can run synchronization on the destination cluster to pull data from the source cluster or, conversely, run it on the source cluster to push data to the destination cluster.
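The job pairing described above can be modeled in a few lines of Python. This is only an illustrative sketch, not GaussDB(DWS) code: the names `Job` and `plan_sync` are hypothetical, and the model captures only which job runs where and which one is started over the remote connection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    kind: str           # "export" or "import"
    cluster: str        # "source" or "destination"
    foreign_table: str  # kind of GDS foreign table the job uses
    remote: bool        # True if the job is started over a remote connection


def plan_sync(direction: str) -> tuple:
    """Return the (first, second) jobs for a 'pull' or 'push' synchronization.

    Hypothetical model: the export job always runs on the source cluster with
    a write-only foreign table, and the import job always runs on the
    destination cluster with a read-only foreign table. The direction only
    decides which cluster the statement is executed on, and therefore which
    job has to be started remotely.
    """
    if direction not in ("pull", "push"):
        raise ValueError(f"unknown direction: {direction!r}")
    export = Job("export", "source", "write-only", remote=(direction == "pull"))
    import_ = Job("import", "destination", "read-only", remote=(direction == "push"))
    # The remotely started job comes first in both directions.
    return (export, import_) if direction == "pull" else (import_, export)
```

Calling `plan_sync("pull")` yields a remote export job on the source cluster followed by a local import job on the destination cluster; `plan_sync("push")` yields the mirror image.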

Pulling data (synchronization executed on the destination cluster):

1. Remotely connect to the source cluster, create a GDS write-only foreign table, and initiate an export job.

2. Create a GDS read-only foreign table and initiate an import job.

3. Worker thread A receives data from the source cluster and writes the data to a local file.

4. Worker thread B reads local file data and sends the data to the destination cluster.

5. The destination cluster obtains the final result based on the job results at both ends and returns the final result to the user.
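Step 5 combines the results of the export job and the import job into one final result. The source does not state the exact rule, so the sketch below assumes a simple one: the synchronization succeeds only if both jobs succeed and report the same row count. `JobResult` and `final_result` are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobResult:
    ok: bool         # whether the job finished without error
    rows: int        # number of rows the job processed
    error: str = ""  # error message when ok is False


def final_result(export: JobResult, import_: JobResult) -> JobResult:
    """Merge the job results from both ends (assumed rule, see lead-in)."""
    if not export.ok:
        return JobResult(False, 0, f"export failed: {export.error}")
    if not import_.ok:
        return JobResult(False, 0, f"import failed: {import_.error}")
    if export.rows != import_.rows:
        # Assumption: a mismatch between exported and imported rows is an error.
        return JobResult(
            False, 0,
            f"row count mismatch: exported {export.rows}, imported {import_.rows}",
        )
    return JobResult(True, import_.rows)
```

Under this assumed rule, a failure at either end is reported to the user as a failure of the whole statement, which matches the single-statement interface described in this section.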

Pushing data (synchronization executed on the source cluster):

1. Remotely connect to the destination cluster, create a GDS read-only foreign table, and initiate an import job.

2. Create a GDS write-only foreign table and initiate an export job.

3. Worker thread A receives data from the source cluster and writes the data to the named pipe.

4. Worker thread B reads the named pipe data and sends it to the destination cluster.

5. The source cluster integrates its export job result and the import job result of the destination cluster, then returns the final result to the user.
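Steps 3 and 4 hand data from worker thread A to worker thread B through a named pipe. The following minimal sketch shows that hand-off pattern with Python's standard library on a POSIX system. It is not GDS code: the function `relay_through_pipe`, the pipe name, and the in-memory `chunks` that stand in for network traffic are all assumptions made for illustration.

```python
import os
import tempfile
import threading


def relay_through_pipe(chunks):
    """Pass byte chunks from one thread to another through a named pipe."""
    received = []
    with tempfile.TemporaryDirectory() as tmp:
        fifo = os.path.join(tmp, "gds_relay.pipe")  # hypothetical pipe name
        os.mkfifo(fifo)  # POSIX only

        def worker_a():
            # Stands in for worker thread A: data arriving from the source
            # cluster is written into the pipe.
            with open(fifo, "wb") as writer:
                for chunk in chunks:
                    writer.write(chunk)

        def worker_b():
            # Stands in for worker thread B: data is read from the pipe and
            # forwarded (here, collected in a list instead of sent on).
            with open(fifo, "rb") as reader:
                while True:
                    data = reader.read(4096)
                    if not data:  # writer closed the pipe: end of stream
                        break
                    received.append(data)

        a = threading.Thread(target=worker_a)
        b = threading.Thread(target=worker_b)
        a.start()
        b.start()
        a.join()
        b.join()
    return b"".join(received)
```

Data written to a named pipe passes through a kernel buffer and is not stored as file contents on disk, so the two threads can stream an arbitrarily large transfer without staging it in a regular file.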

Benefits

A single SQL statement starts the migration. With GDS, the computing power of the nodes in both clusters is fully utilized, providing convenient and efficient logical data synchronization or migration between GaussDB(DWS) clusters. The transfer does not occupy disk space, which improves system resource utilization.
