Updated on 2024-03-13 GMT+08:00

Practice of Data Interconnection Between Two DWS Clusters Based on GDS

This practice demonstrates how to migrate 15 million rows of data between two data warehouse clusters within minutes by using the high-concurrency import and export capability of GDS.

  • This function is supported only by clusters of version 8.1.2 or later.
  • GDS is a high-concurrency import and export tool developed by GaussDB(DWS). For more information, visit GDS Usage Description.
  • This section describes only the operation practice. For details about GDS interconnection and syntax description, see GDS-based cross-cluster interconnection.

This practice takes about 90 minutes. The cloud service resources used in this practice are Data Warehouse Service (DWS), Elastic Cloud Server (ECS), and Virtual Private Cloud (VPC). The basic process is as follows:

  1. Preparations
  2. Step 1: Creating Two DWS Clusters
  3. Step 2: Preparing Source Data
  4. Step 3: Installing and Starting the GDS Server
  5. Step 4: Implementing Data Interconnection Across DWS Clusters

Supported Regions

Table 1 Regions and OBS bucket names

  Region     | OBS Bucket
  -----------|---------------------
  EU-Dublin  | dws-demo-eu-west-101

Constraints

In this practice, two sets of DWS and ECS services are deployed in the same region and VPC to ensure network connectivity.

Preparations

  • You have registered a Huawei account and enabled Huawei Cloud services. Before using GaussDB(DWS), check the account status: the account cannot be in arrears or frozen.
  • You have obtained the AK and SK of the account.
  • You have created a VPC and subnet. For details, see Creating a VPC.

Step 1: Creating Two DWS Clusters

Create two GaussDB(DWS) clusters in the EU-Dublin region. For details, see Creating a Cluster. The two clusters are named dws-demo01 and dws-demo02.

Step 2: Preparing Source Data

  1. On the Cluster Management page of the GaussDB(DWS) console, click Login in the Operation column of the source cluster dws-demo01.

    This practice uses version 8.1.3.x as an example. Clusters of version 8.1.2 or earlier do not support this login mode; use Data Studio to connect to such clusters instead. For details, see Using Data Studio to Connect to a Cluster.

  2. The login username is dbadmin, the database name is gaussdb, and the password is the password of user dbadmin set during data warehouse cluster creation. Select Remember Password, enable Collect Metadata Periodically and Show Executed SQL Statements, and click Log In.

    Figure 1 Logging In to GaussDB(DWS)

  3. Click the database name gaussdb and click SQL Window in the upper right corner to access the SQL editor.
  4. Copy the following SQL statement to the SQL window and click Execute SQL to create the test TPC-H table ORDERS.

    CREATE TABLE ORDERS
    (
      O_ORDERKEY BIGINT NOT NULL,
      O_CUSTKEY BIGINT NOT NULL,
      O_ORDERSTATUS CHAR(1) NOT NULL,
      O_TOTALPRICE DECIMAL(15,2) NOT NULL,
      O_ORDERDATE DATE NOT NULL,
      O_ORDERPRIORITY CHAR(15) NOT NULL,
      O_CLERK CHAR(15) NOT NULL,
      O_SHIPPRIORITY BIGINT NOT NULL,
      O_COMMENT VARCHAR(79) NOT NULL
    )
    WITH (orientation = column)
    DISTRIBUTE BY HASH(O_ORDERKEY)
    PARTITION BY RANGE(O_ORDERDATE)
    (
      PARTITION O_ORDERDATE_1 VALUES LESS THAN('1993-01-01 00:00:00'),
      PARTITION O_ORDERDATE_2 VALUES LESS THAN('1994-01-01 00:00:00'),
      PARTITION O_ORDERDATE_3 VALUES LESS THAN('1995-01-01 00:00:00'),
      PARTITION O_ORDERDATE_4 VALUES LESS THAN('1996-01-01 00:00:00'),
      PARTITION O_ORDERDATE_5 VALUES LESS THAN('1997-01-01 00:00:00'),
      PARTITION O_ORDERDATE_6 VALUES LESS THAN('1998-01-01 00:00:00'),
      PARTITION O_ORDERDATE_7 VALUES LESS THAN('1999-01-01 00:00:00')
    );
    

  5. Run the following SQL statement to create an OBS foreign table:

    Replace the AK and SK values with the actual AK and SK of the account. Obtain <obs_bucket_name> from Supported Regions.

    Hard-coded or plaintext AK/SK pairs are risky. For security purposes, encrypt your AK and SK and store them in a configuration file or in environment variables.

    CREATE FOREIGN TABLE ORDERS01
    (
      LIKE orders
    )
    SERVER gsmpp_server
    OPTIONS (
      ENCODING 'utf8',
      LOCATION 'obs://<obs_bucket_name>/tpch/orders.tbl',
      FORMAT 'text',
      DELIMITER '|',
      ACCESS_KEY 'access_key_value_to_be_replaced',
      SECRET_ACCESS_KEY 'secret_access_key_value_to_be_replaced',
      CHUNKSIZE '64',
      IGNORE_EXTRA_DATA 'on'
    );
    

  6. Run the following SQL statement to import data from the OBS foreign table into the source cluster. The import takes about 2 minutes. Please wait.

    If an import error occurs, the AK or SK value in the foreign table definition is probably incorrect. In that case, run DROP FOREIGN TABLE orders01; to delete the foreign table, create it again with the correct values, and rerun the following import statement:

    INSERT INTO orders SELECT * FROM orders01;
    

  7. Repeat the preceding steps to log in to the target cluster dws-demo02 and run the following SQL statement to create the target table orders:

    CREATE TABLE ORDERS
    (
      O_ORDERKEY BIGINT NOT NULL,
      O_CUSTKEY BIGINT NOT NULL,
      O_ORDERSTATUS CHAR(1) NOT NULL,
      O_TOTALPRICE DECIMAL(15,2) NOT NULL,
      O_ORDERDATE DATE NOT NULL,
      O_ORDERPRIORITY CHAR(15) NOT NULL,
      O_CLERK CHAR(15) NOT NULL,
      O_SHIPPRIORITY BIGINT NOT NULL,
      O_COMMENT VARCHAR(79) NOT NULL
    )
    WITH (orientation = column)
    DISTRIBUTE BY HASH(O_ORDERKEY)
    PARTITION BY RANGE(O_ORDERDATE)
    (
      PARTITION O_ORDERDATE_1 VALUES LESS THAN('1993-01-01 00:00:00'),
      PARTITION O_ORDERDATE_2 VALUES LESS THAN('1994-01-01 00:00:00'),
      PARTITION O_ORDERDATE_3 VALUES LESS THAN('1995-01-01 00:00:00'),
      PARTITION O_ORDERDATE_4 VALUES LESS THAN('1996-01-01 00:00:00'),
      PARTITION O_ORDERDATE_5 VALUES LESS THAN('1997-01-01 00:00:00'),
      PARTITION O_ORDERDATE_6 VALUES LESS THAN('1998-01-01 00:00:00'),
      PARTITION O_ORDERDATE_7 VALUES LESS THAN('1999-01-01 00:00:00')
    );
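The security note in 5 about hard-coded AK/SK pairs can be put into practice with a small wrapper script. The sketch below generates the foreign-table DDL with the credentials taken from environment variables; the MY_AK/MY_SK variable names and the output path are illustrative, not a DWS convention:

```shell
#!/bin/sh
# Build the CREATE FOREIGN TABLE statement with the AK/SK injected
# from environment variables instead of hard-coding them in SQL.
# Fall back to the placeholder strings if the variables are unset.
MY_AK=${MY_AK:-access_key_value_to_be_replaced}
MY_SK=${MY_SK:-secret_access_key_value_to_be_replaced}

cat > /tmp/create_orders01.sql <<EOF
CREATE FOREIGN TABLE ORDERS01 ( LIKE orders )
SERVER gsmpp_server
OPTIONS (
  ENCODING 'utf8',
  LOCATION 'obs://<obs_bucket_name>/tpch/orders.tbl',
  FORMAT 'text',
  DELIMITER '|',
  ACCESS_KEY '${MY_AK}',
  SECRET_ACCESS_KEY '${MY_SK}',
  CHUNKSIZE '64',
  IGNORE_EXTRA_DATA 'on'
);
EOF
echo "wrote /tmp/create_orders01.sql"
```

Export MY_AK and MY_SK before running the script, then execute the generated file with your SQL client; the exact client invocation depends on your setup.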
    

Step 3: Installing and Starting the GDS Server

  1. Create an ECS by referring to Purchasing an ECS. The ECS must be created in the same region and VPC as the GaussDB(DWS) clusters. In this example, CentOS 7.6 is selected as the ECS image.
  2. Download the GDS package.

    1. Log in to the GaussDB(DWS) console.
    2. In the navigation tree on the left, click Connections.
    3. Select the GDS client of the corresponding version from the drop-down list of CLI Client.

      Select a version based on the cluster version and the OS where the client is installed.

      The CPU architecture of the client must be the same as that of the cluster. If the cluster uses the x86 specifications, select the x86 client.

    4. Click Download.

  3. Use an SFTP tool to upload the downloaded client package (for example, dws_client_8.2.x_redhat_x64.zip) to the /opt directory of the ECS.
  4. Log in to the ECS as the root user and run the following commands to go to the /opt directory and decompress the client package:

    cd /opt
    unzip dws_client_8.2.x_redhat_x64.zip
    

  5. Create a GDS user and the user group to which the user belongs. This user is used to start GDS and read source data.

    groupadd gdsgrp
    useradd -g gdsgrp gds_user
    

  6. Change the owner of the GDS package directory and source data file directory to the GDS user.

    chown -R gds_user:gdsgrp /opt/gds/bin
    chown -R gds_user:gdsgrp /opt
    

  7. Switch to user gds_user.

    su - gds_user
    

  8. Run the following commands to go to the gds directory and source the environment variables:

    cd /opt/gds/bin
    source gds_env
    

  9. Run the following command to start GDS. Replace <ECS private IP> with the internal IP address of the ECS, which you can view on the ECS console.

    /opt/gds/bin/gds -d /opt -p <ECS private IP>:5000 -H 0.0.0.0/0 -l /opt/gds/bin/gds_log.txt -D -t 2

    In this command, -d specifies the directory containing the data files, -p the IP address and port that GDS listens on, -H the network segment allowed to connect, -l the log file, -D daemon (background) mode, and -t the number of concurrent threads.

  10. Enable the network ports between the ECS and DWS.

    The GDS server (the ECS in this practice) needs to communicate with DWS. By default, the security group of the ECS does not allow inbound traffic on GDS port 5000 or DWS port 8000. Perform the following steps:

    1. Return to the ECS console and click the ECS name to go to the ECS details page.
    2. Switch to the Security Groups tab and click Configure Rule.
    3. Select Inbound Rules, click Add Rule, set Priority to 1, set Protocol Port to 5000, and click OK.

    4. Repeat the preceding steps to add an inbound rule for port 8000.
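Before moving on to Step 4, you can verify the GDS setup from the ECS shell. The sketch below assumes the port and paths used in this practice; the checks themselves are generic:

```shell
#!/bin/sh
# Sanity checks after starting GDS. The port and paths are the ones
# used in this practice; adjust them to your environment.
GDS_PORT=5000

# 1. The client package architecture must match this host:
#    x86_64 -> x86 package, aarch64 -> Arm (Kunpeng) package.
uname -m

# 2. The gds process should be running (started by gds_user).
ps -ef | grep '[g]ds -d' || echo "gds process not found"

# 3. Something should be listening on the GDS port.
(ss -lnt 2>/dev/null || netstat -lnt 2>/dev/null) | grep ":$GDS_PORT" \
  || echo "nothing listening on port $GDS_PORT"
```

If a check fails, inspect the log file passed with -l (/opt/gds/bin/gds_log.txt in this practice) for the reason.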

Step 4: Implementing Data Interconnection Across DWS Clusters

  1. Create a server.

    1. Obtain the private IP address of the source data warehouse cluster: switch to the DWS console, choose Cluster Management in the navigation pane, and click the source cluster name dws-demo01.
    2. Go to the cluster details page and record the internal IP address of DWS.

    3. Switch back to the DWS console and click Login in the Operation column of the target cluster dws-demo02. The SQL window is displayed.

      Run the following command to create a server:

      Use the private IP address of the source cluster recorded in the previous step and the private IP address of the ECS obtained from the ECS console. The password of user dbadmin is the one set during data warehouse cluster creation.

      CREATE SERVER server_remote FOREIGN DATA WRAPPER GC_FDW OPTIONS
      (
        address 'Private network IP address of the source DWS cluster:8000',
        dbname 'gaussdb',
        username 'dbadmin',
        password 'Password of user dbadmin',
        syncsrv 'gsfs://Internal IP address of the ECS server:5000'
      );

  2. Create a foreign table for interconnection.

    In the SQL window of the target cluster dws-demo02, run the following command to create the foreign table for interconnection:

    CREATE FOREIGN TABLE ft_orders
    (
      O_ORDERKEY BIGINT,
      O_CUSTKEY BIGINT,
      O_ORDERSTATUS CHAR(1),
      O_TOTALPRICE DECIMAL(15,2),
      O_ORDERDATE DATE,
      O_ORDERPRIORITY CHAR(15),
      O_CLERK CHAR(15),
      O_SHIPPRIORITY BIGINT,
      O_COMMENT VARCHAR(79)
    )
    SERVER server_remote
    OPTIONS
    (
      schema_name 'public',
      table_name 'orders',
      encoding 'SQL_ASCII'
    );
    

  3. Import all table data.

    In the SQL window, run the following SQL statement to import the full data from the ft_orders foreign table. The import takes about 1 minute. Please wait.

    INSERT INTO orders SELECT * FROM ft_orders;
    

    Run the following SQL statement to verify that all 15 million rows were imported successfully:

    SELECT count(*) FROM orders;
    

  4. Import data based on filter criteria.

    Run the following SQL statement to import data that matches the filter criteria:

    INSERT INTO orders SELECT * FROM ft_orders WHERE o_orderkey < '10000000';
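The SQL statements in this step can also be collected into a single scripted run against the target cluster. The sketch below is a dry run: it only prints the client commands so you can review them first. The host placeholder and the gsql flags shown are assumptions to adapt to your client version:

```shell
#!/bin/sh
# Dry run of the Step 4 SQL flow. TARGET_HOST and the gsql flags are
# placeholders/assumptions; swap 'echo' inside run_sql for a real
# client invocation once the values are filled in.
TARGET_HOST=${TARGET_HOST:-target.cluster.ip}
CMDS_FILE=/tmp/dws_step4_cmds.txt
: > "$CMDS_FILE"   # start with an empty command list

run_sql() {
  cmd="gsql -d gaussdb -h $TARGET_HOST -p 8000 -U dbadmin -c \"$1\""
  echo "$cmd"                 # show the command for review
  echo "$cmd" >> "$CMDS_FILE" # keep a copy for later execution
}

run_sql "INSERT INTO orders SELECT * FROM ft_orders;"
run_sql "SELECT count(*) FROM orders;"
run_sql "INSERT INTO orders SELECT * FROM ft_orders WHERE o_orderkey < '10000000';"
```

Reviewing the generated command list before running it against a production cluster avoids accidental duplicate imports, since each INSERT appends rows rather than replacing them.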