Help Center> GaussDB(DWS)> Best Practices> Import and Export> Tutorial: Exporting ORC Data to MRS
Updated on 2024-04-25 GMT+08:00

Tutorial: Exporting ORC Data to MRS

GaussDB(DWS) allows you to export ORC data to MRS using an HDFS foreign table. You can specify the export mode and export data format in the foreign table. Data is exported from GaussDB(DWS) in parallel using multiple DNs and stored in HDFS. In this way, the overall export performance is improved.

Preparing the Environment

Create a GaussDB(DWS) cluster. Ensure that the MRS and GaussDB(DWS) clusters are in the same region, AZ, and VPC subnet and that the clusters can communicate with each other.

Creating an MRS Cluster

  1. Log in to the HUAWEI CLOUD console, choose Analytics > MapReduce Service and click Buy Cluster. Click the Custom Config tab, configure software parameters, and click Next.

    Table 1 Software configuration

    Parameter

    Example Value

    Region

    CN-Hong Kong

    Cluster Name

    mrs_01

    Cluster Version

    MRS 1.9.2 (recommended)

    NOTE:
    • For clusters of version 8.1.1.300 and later, MRS clusters support versions 1.6.*, 1.7.*, 1.8.*, 1.9.*, 2.0.*, 3.0.*, 3.1.*, and later (* indicates a number).
    • For clusters earlier than version 8.1.1.300, MRS clusters support versions 1.6.*, 1.7.*, 1.8.*, 1.9.*, and 2.0.* (* indicates a number).

    Cluster Type

    Analysis Cluster

  2. Configure hardware parameters and click Next.

    Table 2 Hardware configuration

    Parameter

    Example Value

    Billing Mode

    Pay-per-use

    AZ

    AZ2

    VPC

    vpc-01

    Subnet

    subnet-01

    Security Group

    Auto create

    EIP

    10.x.x.x

    Enterprise Project

    default

    Master

    2

    Analysis Core

    3

    Analysis Task

    0

  3. Configure the advanced settings based on the following table, click Buy Now, and wait for about 15 minutes for the cluster creation to complete.

    Table 3 Advanced settings

    Parameter

    Example Value

    Tag

    test01

    Hostname Prefix

    (Optional) Prefix for the name of an ECS or BMS in the cluster.

    Auto Scaling

    Retain the default value.

    Bootstrap Action

    Retain the default value. MRS 3.x does not support this parameter.

    Agency

    Retain the default value.

    Data Disk Encryption

    This function is disabled by default. Retain the default value.

    Alarm

    Retain the default value.

    Rule Name

    Retain the default value.

    Topic Name

    Select a topic.

    Kerberos Authentication

    This parameter is enabled by default.

    User Name

    admin

    Password

    This password is used to log in to the cluster management page.

    Confirm Password

    Enter the password of user admin again.

    Login Mode

    Password

    User Name

    root

    Password

    This password is used to remotely log in to the ECS.

    Confirm Password

    Enter the password of user root again.

    Secure Communications

    Select Enable.

Creating an MRS Data Source Connection

  1. Log in to the GaussDB(DWS) console and click the created data warehouse cluster. Ensure that the GaussDB(DWS) and MRS clusters are in the same region, AZ, and VPC subnet.
  2. Click the MRS Data Source tab and click Create MRS Cluster Connection.
  3. Select data source mrs_01 created in the previous step, enter the MRS account name admin and its password, and click OK.

Creating a Foreign Server

  1. Use Data Studio to connect to the created GaussDB(DWS) cluster.
  2. Create a user dbuser that has the permission for creating databases.

    1
    CREATE USER dbuser WITH CREATEDB PASSWORD 'password';
    

  1. Switch to user dbuser.

    1
    SET ROLE dbuser PASSWORD 'password';
    

  1. Create a database mydatabase.

    1
    CREATE DATABASE mydatabase;
    

  1. Perform the following steps to switch to database mydatabase:

    1. In the Object Browser window of the Data Studio client, right-click the database connection and choose Refresh from the shortcut menu. Then, the new database is displayed.
    2. Right-click the database name mydatabase and select Connect to DB from the shortcut menu.
    3. Right-click the database name mydatabase and select Open Terminal from the shortcut menu. The SQL command window for connecting to a database is displayed. Perform the following steps in the window.

  2. Grant the permission to create external servers to user dbuser. In 8.1.1 and later versions, you also need to grant the permission to use the public mode.

    1
    2
    GRANT ALL ON FOREIGN DATA WRAPPER hdfs_fdw TO dbuser;
    In GRANT ALL ON SCHEMA public TO dbuser; //8.1.1 and later versions, common users do not have permission on the public mode and need to grant permission. In versions earlier than 8.1.1, you do not need to perform this operation.
    

    The name of FOREIGN DATA WRAPPER must be hdfs_fdw. dbuser indicates the username of CREATE SERVER.

  3. Grant user dbuser the permission for using foreign tables.

    1
    ALTER USER dbuser USEFT;
    

  4. Switch to the Postgres database and query the foreign server automatically created by the system after the MRS data source is created.

    1
    SELECT * FROM pg_foreign_server;
    

    Information similar to the following is displayed:

    1
    2
    3
    4
    5
    6
                         srvname                      | srvowner | srvfdw | srvtype | srvversion | srvacl |                                                     srvoptions
    --------------------------------------------------+----------+--------+---------+------------+--------+---------------------------------------------------------------------------------------------------------------------
     gsmpp_server                                     |       10 |  13673 |         |            |        |
     gsmpp_errorinfo_server                           |       10 |  13678 |         |            |        |
     hdfs_server_8f79ada0_d998_4026_9020_80d6de2692ca |    16476 |  13685 |         |            |        | {"address=192.168.1.245:9820,192.168.1.218:9820",hdfscfgpath=/MRS/8f79ada0-d998-4026-9020-80d6de2692ca,type=hdfs}
    (3 rows)
    

  5. Switch to database mydatabase and switch to user dbuser.

    1
    SET ROLE dbuser PASSWORD 'password';
    

  6. Create a foreign server.

    The server name, address, and configuration path must be the same as those in 8.

    1
    2
    3
    4
    5
    6
    7
    CREATE SERVER hdfs_server_8f79ada0_d998_4026_9020_80d6de2692ca FOREIGN DATA WRAPPER HDFS_FDW 
    OPTIONS 
    (
    address '192.168.1.245:9820,192.168.1.218:9820',   //The intranet IP addresses of the active and standby master nodes on the MRS management plane, which can be used to communicate with GaussDB(DWS).
    hdfscfgpath '/MRS/8f79ada0-d998-4026-9020-80d6de2692ca',
    type 'hdfs'
    );
    

  7. View the foreign server.

    1
    SELECT * FROM pg_foreign_server WHERE srvname='hdfs_server_8f79ada0_d998_4026_9020_80d6de2692ca';
    

    The server is successfully created if information similar to the following is displayed:

    1
    2
    3
    4
                         srvname                      | srvowner | srvfdw | srvtype | srvversion | srvacl |                                                     srvoptions
    --------------------------------------------------+----------+--------+---------+------------+--------+---------------------------------------------------------------------------------------------------------------------
     hdfs_server_8f79ada0_d998_4026_9020_80d6de2692ca |    16476 |  13685 |         |            |        | {"address=192.168.1.245:9820,192.168.1.218:9820",hdfscfgpath=/MRS/8f79ada0-d998-4026-9020-80d6de2692ca,type=hdfs}
    (1 row)
    

Creating a Foreign Table

Create an OBS foreign table that does not contain partition columns. The foreign server associated with the table is hdfs_server, the format of the file on HDFS corresponding to the table is ORC, and the data storage path on OBS is /user/hive/warehouse/product_info_orc/.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
DROP FOREIGN TABLE IF EXISTS product_info_output_ext;
CREATE FOREIGN TABLE product_info_output_ext
(
    product_price                integer        ,
    product_id                   char(30)       ,
    product_time                 date           ,
    product_level                char(10)       ,
    product_name                 varchar(200)   ,
    product_type1                varchar(20)    ,
    product_type2                char(10)       ,
    product_monthly_sales_cnt    integer        ,
    product_comment_time         date           ,
    product_comment_num          integer        ,
    product_comment_content      varchar(200)                      
) SERVER hdfs_server_8f79ada0_d998_4026_9020_80d6de2692ca 
OPTIONS (
format 'orc', 
foldername '/user/hive/warehouse/product_info_orc/',
   compression 'snappy',
    version '0.12'
) Write Only;

Exporting Data

Create an ordinary table product_info_output.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
DROP TABLE product_info_output;
CREATE TABLE product_info_output 
(    
    product_price                int            ,
    product_id                   char(30)       ,
    product_time                 date           ,
    product_level                char(10)       ,
    product_name                 varchar(200)   ,
    product_type1                varchar(20)    ,
    product_type2                char(10)       ,
    product_monthly_sales_cnt    int            ,
    product_comment_time         date           ,
    product_comment_num          int        ,
    product_comment_content      varchar(200)                   
) 
with (orientation = column,compression=middle)
distribute by hash (product_name);
Export data from table product_info_output to a data file using the product_info_output_ext foreign table.
1
INSERT INTO product_info_output_ext SELECT * FROM product_info_output;

If information similar to the following is displayed, the data has been exported.

1
INSERT 0 10

Viewing the Export Result

  1. Go to the MRS cluster list. Click a cluster name to go to the cluster details page.
  2. Click the Files tab and click HDFS File List. Check the exported ORC file in the user/hive/warehouse/product_info_orc directory.

    ORC data exported from GaussDB(DWS) complies with the following rules:

    1. Data exported to MRS (HDFS): When data is exported from a DN, the data is stored in HDFS in the segment format. The file is named in the format of mpp_DatabaseName_SchemaName_TableName_NodeName_n.orc.
    2. You are advised to export data from different clusters or databases to different paths. The maximum size of an ORC file is 128 MB, and that of a stripe file is 64 MB.
    3. After the export is complete, the _SUCCESS file is generated.