Updated on 2023-10-23 GMT+08:00

Installing, Configuring, and Starting GDS

Scenarios

GaussDB uses GDS to allocate source data for parallel data import. GDS needs to be deployed on data servers.

If a large volume of data is stored on multiple servers, deploy, configure, and start GDS on each server. Then, data on all the servers can be imported in parallel. The procedure for installing, configuring, and starting GDS is the same on each data server. This section describes how to perform this procedure on one data server.

Background

  1. GDS can be installed on the following x86 and ARM operating systems: EulerOS 2.5 and EulerOS 2.8.
  2. The GDS version must be consistent with the database version. Otherwise, the import or export may fail or stop responding.

    Therefore, do not use an earlier version of GDS. After the database is upgraded, download the GDS of the new version as instructed in Procedure. When the import or export starts, GaussDB checks the GDS version and will display an error message and terminate the import or export if it detects a version mismatch.

    To obtain the GDS version, run the following command in the GDS decompression directory:

    gds -V

    To view the database version, run the following SQL statement after connecting to the database:

    SELECT version();
    
  • The data server where GDS is deployed must use the recommended OS and communication parameter settings, which are the same as those of the cluster. For services to run properly, ensure that the GDS data server and the cluster can communicate with each other.
    To use the inspection package to check system parameters on a data server, perform the following operations:
    1. Copy the inspection package to the GDS data server.
    2. Run the following command to check the system configuration parameters:
      gs_check -i CheckSysParams -L
    3. Modify parameter settings as prompted and run the command in the previous step again.

      If the message "Warning reason: variable 'net.ipv4.tcp_retries1' RealValue '3' ExpectedValue '5'." is displayed, run the following commands:

      vim /etc/sysctl.conf   # Set net.ipv4.tcp_retries1=5.
      sysctl -p              # Make the parameter settings take effect.
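The manual edit above can also be scripted. The following is a minimal sketch, assuming GNU sed and a sysctl-style file with one key = value pair per line. The helper name set_sysctl_conf is hypothetical and is not part of gs_check or GDS.

```shell
#!/usr/bin/env bash
# Sketch: set key = value in a sysctl-style configuration file.
# set_sysctl_conf is a hypothetical helper name, not a GaussDB tool.
set_sysctl_conf() {
    local key="$1" value="$2" file="$3" pattern
    # Escape the dots in the key so they match literally in the regex.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -Eq "^[[:space:]]*${pattern}[[:space:]]*=" "$file" 2>/dev/null; then
        # The key already exists: rewrite its assignment in place.
        sed -i -E "s|^[[:space:]]*${pattern}[[:space:]]*=.*|${key} = ${value}|" "$file"
    else
        # The key is absent: append a new assignment.
        printf '%s = %s\n' "$key" "$value" >> "$file"
    fi
}

# Example (run as root on the data server):
#   set_sysctl_conf net.ipv4.tcp_retries1 5 /etc/sysctl.conf && sysctl -p
```

Run gs_check again afterwards to confirm that the warning is gone.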

Procedure

  1. Log in to the data server where GDS is to be installed, and create a GDS user and its user group. This user is used to start GDS and read source data.

    groupadd gdsgrp
    useradd -g gdsgrp gds_user

  2. Switch to user gds_user.

    su - gds_user

  3. Create the /opt/bin directory for storing the GDS package.

    mkdir -p /opt/bin

  4. Change the owner of the GDS package directory and source data file directory to the GDS user.

    chown -R gds_user:gdsgrp /opt/bin/gds 
    chown -R gds_user:gdsgrp /input_data

  5. Upload the GDS package to the created directory.

    Use the SUSE Linux package as an example. Upload the GDS package GaussDB-Kernel-VxxxRxxxCxx-SUSE11-64bit-Gds.tar.gz in the software installation package to the newly created directory.

  6. (Optional) If SSL is used, upload the SSL certificates to the directory created in Step 3.

    The certificates are stored in the $GAUSSHOME/share/sslcert/gds directory of GaussDB. Download them from that directory and upload them to the data server.

  7. Go to the new directory and decompress the package.

    cd /opt/bin
    tar -zxvf GaussDB-Kernel-VxxxRxxxCxx-SUSE11-64bit-Gds.tar.gz
    # GDS depends on the cJSON dynamic library, so the path of the dynamic library must be configured.
    export LD_LIBRARY_PATH="/opt/bin/lib:$LD_LIBRARY_PATH"

  8. Start GDS.

    GDS is portable ("green") software: it requires no installation and can be started as soon as the package is decompressed. You can start it in either of two ways. One is to run the gds command with the startup parameters on the command line. The other is to write the startup parameters into the gds.conf configuration file and run gds_ctl.py to start GDS. The gds command is recommended for a one-off import. The gds.conf configuration file is recommended when you need to import data repeatedly.
    • Run the gds command to start GDS.
      • If data is transmitted in non-SSL mode, run the following command to start GDS:
        gds -d dir -p ip:port -H address_string -l log_file -D -t worker_num --enable-ssl off

        Example:

        /opt/bin/gds/gds -d /input_data/ -p 192.168.0.90:5000 -H 10.10.0.1/24 -l /opt/bin/gds/gds_log.txt -D -t 2 --enable-ssl off
      • If data is transmitted in SSL mode, run the following command to start GDS:
        gds -d dir -p ip:port -H address_string -l log_file -D -t worker_num --enable-ssl on --ssl-dir Cert_file

        Example:

        After uploading the SSL certificates mentioned in Step 6 to /opt/bin, run the following command:
        /opt/bin/gds/gds -d /input_data/ -p 192.168.0.90:5000 -H 10.10.0.1/24 -l /opt/bin/gds/gds_log.txt -D --enable-ssl on --ssl-dir /opt/bin/

      Replace the placeholder values (dir, ip:port, address_string, log_file, worker_num, and Cert_file) as required.

      • -d dir: directory that stores source data files. It is /input_data/ in this tutorial.
      • -p ip:port: listening IP address and port for GDS. The default IP address is 127.0.0.1. Replace it with the IP address of a 10GE network that can communicate with GaussDB. The listening port can be any value from 1024 to 65535. The default port is 8098. This parameter is set to 192.168.0.90:5000 in this tutorial.
      • -H address_string: network segment for hosts that can connect to and use GDS. The value must be in CIDR format. Set this parameter to enable the GaussDB cluster to access GDS for data import. Ensure that the network segment covers all hosts in the GaussDB cluster.
      • -l log_file: GDS log directory and log file name. This tutorial uses /opt/bin/gds/gds_log.txt as an example.
      • -D: runs GDS in daemon mode. This parameter is used only in Linux.
      • -t worker_num: number of concurrent GDS threads. If the data server and GaussDB have robust I/O resources, you can increase the number of concurrent GDS threads.

        GDS determines the number of threads based on the number of parallel import transactions. Even if multi-thread import is configured before GDS startup, the import of a single transaction will not be accelerated. By default, an INSERT statement is an import transaction.

      • --enable-ssl: whether data is transmitted in SSL encryption mode. The default value is on, which enables SSL encryption. If this parameter is omitted or set to on, you must also add --ssl-dir to specify the SSL certificate directory.
      • --ssl-dir Cert_file: SSL certificate directory. Set it to the certificate directory mentioned in Step 6.
      • For details on how to set more parameters, see Server Tools > GDS > Parameter Description in the Tool Reference.
    • Run the gds_ctl.py command to start GDS.
      1. Run the following command to open the gds.conf configuration file in the config directory of the GDS package, and modify the file. In this case, GDS is not in SSL mode. For details on the parameters in the gds.conf configuration file, see Table 1.
        vim /opt/bin/gds/config/gds.conf

        Example:

        The gds.conf configuration file contains the following information:

        <?xml version="1.0"?>
        <config>
        <gds name="gds1" ip="192.168.0.90" port="5000" data_dir="/input_data/" err_dir="/err" data_seg="100MB" err_seg="100MB" log_file="/log/gds_log.txt" host="10.10.0.1/24" daemon='true' recursive="true" parallel="32"></gds>
        </config>

        Details are as follows:

        • The data server IP address is 192.168.0.90 and the GDS listening port is 5000.
        • Data files are stored in the /input_data/ directory.
        • Error log files are stored in the /err directory.
        • The size of a single data file is 100 MB.
        • The size of a single error log file is 100 MB.
        • Run logs are stored in the /log/gds_log.txt file.
        • Only nodes with the IP address being 10.10.0.* can be connected.
        • The GDS process is running in daemon mode.
        • Recursive data file directories are used.
        • The number of concurrent import threads is 32.
      2. Start GDS and check whether it has been started:
        python3 gds_ctl.py start

        Example:

        cd /opt/bin/gds
        python3 gds_ctl.py start
        Start GDS gds1                  [OK]
        gds [options]:
         -d dir            Set data directory.
         -p port           Set GDS listening port.
            ip:port        Set GDS listening ip address and port.
         -l log_file       Set log file.
         -H secure_ip_range
                           Set secure IP checklist in CIDR notation.
                           Required for GDS to start.
         -e dir            Set error log directory.
         -E size           Set size of per error log segment.(0 < size < 1TB)
         -S size           Set size of data segment.(1MB < size < 100TB)
         -t worker_num     Set number of worker thread in multi-thread mode,
                           the upper limit is 32. If without setting, the
                           default value is 1.
         -s status_file    Enable GDS status report.
         -D                Run the GDS as a daemon process.
         -r                Read the working directory recursively.
         -h                Display usage.
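The constraints on the gds startup parameters described above (port range 1024 to 65535, CIDR notation for -H) can be checked before GDS is started. The following is a minimal sketch in bash. The function names valid_port, valid_cidr, and build_gds_cmd are hypothetical helpers, and the script only prints the non-SSL command line rather than running it.

```shell
#!/usr/bin/env bash
# Sketch: validate -p and -H values, then print a non-SSL gds command line.
# valid_port, valid_cidr and build_gds_cmd are hypothetical helper names.

# Accept only integer ports within 1024-65535.
valid_port() {
    [[ "$1" =~ ^[0-9]{1,5}$ ]] && (( 10#$1 >= 1024 && 10#$1 <= 65535 ))
}

# Accept IPv4 CIDR notation such as 10.10.0.1/24.
valid_cidr() {
    local ip prefix octet
    [[ "$1" =~ ^([0-9]{1,3}(\.[0-9]{1,3}){3})/([0-9]{1,2})$ ]] || return 1
    ip="${BASH_REMATCH[1]}"
    prefix="${BASH_REMATCH[3]}"
    (( 10#$prefix <= 32 )) || return 1
    # Every octet must be at most 255.
    for octet in ${ip//./ }; do
        (( 10#$octet <= 255 )) || return 1
    done
}

# Print the command instead of executing it, so it can be reviewed first.
build_gds_cmd() {
    local dir="$1" ip="$2" port="$3" cidr="$4" log="$5" workers="$6"
    valid_port "$port" || { echo "invalid port: $port" >&2; return 1; }
    valid_cidr "$cidr" || { echo "invalid CIDR: $cidr" >&2; return 1; }
    printf 'gds -d %s -p %s:%s -H %s -l %s -D -t %s --enable-ssl off\n' \
        "$dir" "$ip" "$port" "$cidr" "$log" "$workers"
}

# Example with the values used in this tutorial:
#   build_gds_cmd /input_data/ 192.168.0.90 5000 10.10.0.1/24 /opt/bin/gds/gds_log.txt 2
```

Printing the command keeps the sketch free of side effects. To start GDS, run the printed line as the GDS user.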

The GDS binary depends on some common library files. If GDS is deployed on a physical node outside the cluster, and that node does not provide these library files or provides incompatible versions, an error message similar to "/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found" may be displayed during startup. In this case, log in to a physical node in the cluster, copy the corresponding library file (for example, libstdc++.so.6 or libgcc_s.so.1) from the $GAUSSHOME/lib directory to the library directory added to LD_LIBRARY_PATH in Step 7 (/opt/bin/lib in this example), and repeat Step 7 to set the environment variable. Then restart GDS.

If the problem persists, GDS does not support the current physical environment or platform. You are advised to switch to a supported physical environment and try again.
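Library problems of this kind can often be spotted before startup by inspecting the output of the standard ldd tool. The following is a minimal sketch, assuming ldd is available on the data server and that the GDS binary is at /opt/bin/gds/gds as in the examples above. The function name report_unresolved is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: filter ldd-style output for unresolved libraries or symbol versions.
# report_unresolved is a hypothetical helper name.
report_unresolved() {
    # Strip leading whitespace and keep only lines that mention "not found".
    # The exit status is non-zero when nothing is unresolved.
    sed -E 's/^[[:space:]]+//' | grep 'not found'
}

# Example:
#   if ldd /opt/bin/gds/gds 2>&1 | report_unresolved; then
#       echo "Copy the libraries listed above from the cluster node."
#   fi
```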

gds.conf Parameter Description

Table 1 Attributes in the gds.conf file

| Attribute | Description | Value Range |
|-----------|-------------|-------------|
| name | Identifier | - |
| ip | Listening IP address | The IP address must be valid. Default value: 127.0.0.1 |
| port | Listening port | Value range: an integer ranging from 1024 to 65535. Default value: 8098 |
| data_dir | Data file directory | - |
| err_dir | Error log file directory | Default value: data file directory |
| log_file | Log file path | - |
| host | Host IP address allowed to connect to GDS (The value must be in CIDR format, and this parameter is set for the Linux OS only.) | - |
| recursive | Whether the data file directories are recursive | Value range: true (recursive) or false (not recursive). Default value: false |
| daemon | Whether the process runs in daemon mode | Value range: true (the server runs in daemon mode) or false (the server does not run in daemon mode). Default value: false |
| parallel | Number of concurrent data import threads | Value range: an integer ranging from 0 to 32. Default value: 1 |
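
When GDS is deployed on several data servers, the gds.conf file can be generated rather than edited by hand. The following is a minimal sketch in bash that writes a single-instance file using the attributes in Table 1. The function name write_gds_conf and its argument order are assumptions made for illustration, and daemon and recursive are fixed to example values.

```shell
#!/usr/bin/env bash
# Sketch: write a single-instance gds.conf from positional arguments.
# write_gds_conf is a hypothetical helper; attribute names come from Table 1.
write_gds_conf() {
    local out="$1" name="$2" ip="$3" port="$4"
    local data_dir="$5" log_file="$6" host="$7" parallel="${8:-1}"
    # Table 1 limits parallel to an integer from 0 to 32.
    if ! [[ "$parallel" =~ ^[0-9]{1,2}$ ]] || (( 10#$parallel > 32 )); then
        echo "parallel must be an integer from 0 to 32" >&2
        return 1
    fi
    cat > "$out" <<EOF
<?xml version="1.0"?>
<config>
<gds name="${name}" ip="${ip}" port="${port}" data_dir="${data_dir}" log_file="${log_file}" host="${host}" daemon="true" recursive="false" parallel="${parallel}"></gds>
</config>
EOF
}

# Example with the values used in this tutorial:
#   write_gds_conf /opt/bin/gds/config/gds.conf gds1 192.168.0.90 5000 \
#       /input_data/ /log/gds_log.txt 10.10.0.1/24 2
```

After the file is written, start GDS with python3 gds_ctl.py start as described in the procedure.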