
Tutorial: Using GDS to Import Data from a Remote Server

Overview

This practice demonstrates how to use General Data Service (GDS) to import data from a remote server to GaussDB(DWS).

GaussDB(DWS) allows you to import data in TXT, CSV, or FIXED format.

In this tutorial, you will:

  • Generate the source data files in CSV format to be used in this tutorial.
  • Upload the source data files to a data server.
  • Create a foreign table for importing data from the data server to GaussDB(DWS) through GDS.
  • Connect to GaussDB(DWS), create a table, and import data into the table.
  • Analyze import errors based on the information in the error table and correct these errors.

Preparing an ECS as the GDS Server

For details about how to purchase a Linux ECS, see "Purchasing an ECS" in the Elastic Cloud Server Getting Started guide. After the purchase, log in to the ECS by referring to "Logging In to a Linux ECS".

  • The ECS OS must be supported by the GDS package.
  • The ECS and the GaussDB(DWS) cluster must be in the same region, VPC, and subnet.
  • The ECS security group must allow access from the GaussDB(DWS) cluster. That is, add an inbound rule to the security group as follows:
    • Protocol: TCP
    • Port: 5000
    • Source: Select IP Address and enter the IP address of the GaussDB(DWS) cluster, for example, 192.168.0.10/32.
  • If a firewall is enabled on the ECS, ensure that the GDS listening port is opened on the firewall:
    iptables -I INPUT -p tcp -m tcp --dport <gds_port> -j ACCEPT
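
    # If the ECS uses firewalld instead of iptables, an equivalent rule can be
    # added as sketched below (assuming GDS listens on TCP port 5000; adjust the
    # port to match the -p setting used when starting GDS).
    firewall-cmd --permanent --add-port=5000/tcp
    firewall-cmd --reload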
    

Downloading the GDS Package

  1. Log in to the GaussDB(DWS) console.
  2. In the navigation tree on the left, click Connections.
  3. From the CLI Client drop-down list, select the GDS client of the corresponding version.

    Select a version based on the cluster version and the OS where the client is installed.

    The CPU architecture of the client must be the same as that of the cluster. If the cluster uses the x86 specifications, select the x86 client.

  4. Click Download.

Preparing Source Data Files

  • Data file product_info0.csv
    100,XHDK-A,2017-09-01,A,2017 Shirt Women,red,M,328,2017-09-04,715,good!
    205,KDKE-B,2017-09-01,A,2017 T-shirt Women,pink,L,584,2017-09-05,40,very good!
    300,JODL-X,2017-09-01,A,2017 T-shirt men,red,XL,15,2017-09-03,502,Bad.
    310,QQPX-R,2017-09-02,B,2017 jacket women,red,L,411,2017-09-05,436,It's nice.
    150,ABEF-C,2017-09-03,B,2017 Jeans Women,blue,M,123,2017-09-06,120,good.
    
  • Data file product_info1.csv
    200,BCQP-E,2017-09-04,B,2017 casual pants men,black,L,997,2017-09-10,301,good quality.
    250,EABE-D,2017-09-10,A,2017 dress women,black,S,841,2017-09-15,299,This dress fits well.
    108,CDXK-F,2017-09-11,A,2017 dress women,red,M,85,2017-09-14,22,It's really amazing to buy.
    450,MMCE-H,2017-09-11,A,2017 jacket women,white,M,114,2017-09-14,22,very good.
    260,OCDA-G,2017-09-12,B,2017 woolen coat women,red,L,2004,2017-09-15,826,Very comfortable.
    
  • Data file product_info2.csv
    980,"ZKDS-J",2017-09-13,"B","2017 Women's Cotton Clothing","red","M",112,,,
    98,"FKQB-I",2017-09-15,"B","2017 new shoes men","red","M",4345,2017-09-18,5473
    50,"DMQY-K",2017-09-21,"A","2017 pants men","red","37",28,2017-09-25,58,"good","good","good"
    80,"GKLW-l",2017-09-22,"A","2017 Jeans Men","red","39",58,2017-09-25,72,"Very comfortable."
    30,"HWEC-L",2017-09-23,"A","2017 shoes women","red","M",403,2017-09-26,607,"good!"
    40,"IQPD-M",2017-09-24,"B","2017 new pants Women","red","M",35,2017-09-27,52,"very good."
    50,"LPEC-N",2017-09-25,"B","2017 dress Women","red","M",29,2017-09-28,47,"not good at all."
    60,"NQAB-O",2017-09-26,"B","2017 jacket women","red","S",69,2017-09-29,70,"It's beautiful."
    70,"HWNB-P",2017-09-27,"B","2017 jacket women","red","L",30,2017-09-30,55,"I like it so much"
    80,"JKHU-Q",2017-09-29,"C","2017 T-shirt","red","M",90,2017-10-02,82,"very good."
    
  1. Create a text file, open it with a local text editor, and copy the sample data into the file.
  2. Set the file encoding to UTF-8 without BOM (in most editors, this option is in the Format or Encoding menu).
  3. Choose File > Save As.
  4. In the displayed dialog box, enter the file name, set the file name extension to .csv, and click Save.
  5. Log in to the GDS server as user root.
  6. Create the /input_data directory for storing the data file.

    mkdir -p /input_data
    

  7. Use MobaXterm to upload source data files to the created directory.
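
    If MobaXterm is not available, the files can also be uploaded from a local Linux or macOS shell with scp, as sketched below (192.168.0.90 is the example GDS server address used in this practice; replace it with the actual address).

    # Assumption: the three CSV files are in the current local directory and the
    # GDS server accepts SSH logins as user root at 192.168.0.90.
    scp product_info0.csv product_info1.csv product_info2.csv root@192.168.0.90:/input_data/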

Installing and Starting GDS

  1. Log in to the GDS server as user root and create the /opt/bin/dws directory for storing the GDS package.

    mkdir -p /opt/bin/dws
    

  2. Upload the GDS package to the created directory.

    For example, upload the dws_client_8.1.x_redhat_x64.zip package to the created directory.

  3. Go to the directory and decompress the package.

    cd /opt/bin/dws
    unzip dws_client_8.1.x_redhat_x64.zip
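
    # Optional check (a sketch): after decompression, the gds binary used in the
    # following steps should be present in the extracted directory.
    ls /opt/bin/dws/gds/bin/gds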
    

  4. Create the user gds_user and the user group gdsgrp to which the user belongs. This user is used to start GDS and must have read permission on the source data file directory.

    groupadd gdsgrp
    useradd -g gdsgrp gds_user
    

  5. Change the owner and group of the GDS directory and the source data file directory to gds_user and gdsgrp.

    chown -R gds_user:gdsgrp /opt/bin/dws/gds
    chown -R gds_user:gdsgrp /input_data
    

  6. Switch to user gds_user.

    su - gds_user
    

    If the cluster version is 8.0.x or earlier, skip step 7 and go to step 8.

    If the cluster version is 8.1.x or later, go to the next step.

  7. Run the environment dependency script (applicable only to 8.1.x and later).

    cd /opt/bin/dws/gds/bin
    source gds_env
    

  8. Start GDS.

    /opt/bin/dws/gds/bin/gds -d /input_data/ -p 192.168.0.90:5000 -H 10.10.0.1/24 -l /opt/bin/dws/gds/gds_log.txt -D
    

    Replace the example values with actual ones as required.

    • -d dir: directory that stores the data files to be imported. This practice uses /input_data/ as an example.
    • -p ip:port: listening IP address and port for GDS. The default IP address is 127.0.0.1. Replace it with the IP address of a 10GE network that can communicate with GaussDB(DWS). The port number ranges from 1024 to 65535, and the default port is 8098. This practice uses 192.168.0.90:5000 as an example.
    • -H address_string: hosts that are allowed to connect to and use GDS. The value must be in CIDR format. Set this parameter so that the GaussDB(DWS) cluster can access GDS for data import. Ensure that the network segment covers all hosts in the GaussDB(DWS) cluster.
    • -l log_file: GDS log directory and log file name. This practice uses /opt/bin/dws/gds/gds_log.txt as an example.
    • -D: runs GDS in daemon mode. This parameter is supported only on Linux.
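
    After GDS starts, you can optionally confirm that it is running and listening on the expected port. The following sketch assumes the example port 5000 used in this practice:

    # If either command prints nothing, GDS is not running or not listening on
    # the expected port.
    ps -ef | grep '[g]ds'
    ss -tnl | grep 5000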

Creating a Foreign Table

  1. Use an SQL client to connect to the GaussDB(DWS) database.
  2. Create the following foreign table:

    LOCATION: Replace it with the actual GDS address and port number.

    DROP FOREIGN TABLE IF EXISTS product_info_ext;
    CREATE FOREIGN TABLE product_info_ext
    (
        product_price                integer        not null,
        product_id                   char(30)       not null,
        product_time                 date           ,
        product_level                char(10)       ,
        product_name                 varchar(200)   ,
        product_type1                varchar(20)    ,
        product_type2                char(10)       ,
        product_monthly_sales_cnt    integer        ,
        product_comment_time         date           ,
        product_comment_num          integer        ,
        product_comment_content      varchar(200)              
    ) 
    SERVER gsmpp_server 
    OPTIONS(
    LOCATION 'gsfs://192.168.0.90:5000/*',
    FORMAT 'CSV' ,
    DELIMITER ',',
    ENCODING 'utf8',
    HEADER 'false',
    FILL_MISSING_FIELDS 'true',
    IGNORE_EXTRA_DATA 'true'
    )
    READ ONLY
    LOG INTO product_info_err 
    PER NODE REJECT LIMIT 'unlimited';
    

    If the following information is displayed, the foreign table has been created:

    CREATE FOREIGN TABLE
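
    Optionally, you can verify that the cluster can reach GDS by querying the foreign table directly. The following check (a minimal sketch) streams the CSV files through GDS without writing any data into the cluster:

    -- Reads the source files through GDS; the result should match the total
    -- number of rows in the CSV files.
    SELECT count(*) FROM product_info_ext;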
    

Importing Data

  1. Run the following statements to create the product_info table in GaussDB(DWS) to store imported data:

    DROP TABLE IF EXISTS product_info;
    CREATE TABLE product_info
    (
        product_price                integer        not null,
        product_id                   char(30)       not null,
        product_time                 date           ,
        product_level                char(10)       ,
        product_name                 varchar(200)   ,
        product_type1                varchar(20)    ,
        product_type2                char(10)       ,
        product_monthly_sales_cnt    integer        ,
        product_comment_time         date           ,
        product_comment_num          integer        ,
        product_comment_content      varchar(200)                   
    ) 
    WITH (
    orientation = column,
    compression=middle
    ) 
    DISTRIBUTE BY hash (product_id);
    

  2. Import data from source data files to the product_info table through the foreign table product_info_ext.

    INSERT INTO product_info SELECT * FROM product_info_ext;
    
    If the following information is displayed, the data has been imported:
    INSERT 0 20
    

  3. Run the following SELECT statement to check the number of rows imported into GaussDB(DWS).

    SELECT count(*) FROM product_info;
    

    If the following information is displayed, the data has been imported:

    count 
    -------
         20
    (1 row)
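
    If the count is smaller than the number of rows in the source files, some rows were rejected during the import. Because the foreign table is defined with LOG INTO product_info_err, the rejected rows and the rejection reasons can be inspected in that error table, for example:

    -- Shows the raw records that failed to be imported and the error details.
    SELECT * FROM product_info_err;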
    

  4. Run VACUUM FULL on the product_info table.

    VACUUM FULL product_info;
    

  5. Update the statistics of the product_info table.

    ANALYZE product_info;
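
    Optionally, spot-check a few imported rows after the statistics are updated (a minimal sketch using columns from the product_info table created above):

    -- View a sample of the imported data.
    SELECT product_id, product_name, product_price
    FROM product_info
    ORDER BY product_id
    LIMIT 5;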
    

Stopping GDS

  1. Log in to the data server where GDS is installed as user gds_user.
  2. Perform the following operations to stop GDS:

    1. Query the GDS process ID. In this example, the GDS process ID is 128954.
      ps -ef|grep gds
      gds_user 128954      1  0 15:03 ?        00:00:00 gds -d /input_data/ -p 192.168.0.90:5000 -l /opt/bin/dws/gds/gds_log.txt  -D
      gds_user 129003 118723  0 15:04 pts/0    00:00:00 grep gds
    2. Run the kill command to stop GDS. In the command, 128954 is the GDS process ID obtained in the previous step.
      kill -9 128954
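
      # Optional check (a sketch): list remaining gds processes; no output means
      # GDS has stopped.
      ps -ef | grep '[g]ds'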

Deleting Resources

  1. Run the following command to delete the target table product_info:

    DROP TABLE product_info;
    

    If the following information is displayed, the table has been deleted:

    DROP TABLE
    

  2. Run the following command to delete the foreign table product_info_ext:

    DROP FOREIGN TABLE product_info_ext;
    

    If the following information is displayed, the table has been deleted:

    DROP FOREIGN TABLE