Updated on 2024-10-17 GMT+08:00

Connecting DataX to OBS

Overview

DataX is a data synchronization framework. It can efficiently synchronize data among heterogeneous data sources such as MySQL, SQL Server, Oracle, PostgreSQL, HDFS, Hive, HBase, OTS and ODPS. In big data scenarios, OBS can replace HDFS in the Hadoop system. This section describes how to connect DataX to OBS.

Procedure

  1. Download the DataX source code. This example uses version datax_v202308.
  2. Modify and compile DataX.

    1. Upgrade the Hadoop version that HdfsReader and HdfsWriter depend on. This example upgrades Hadoop to version 2.8.3.

      Modify the pom.xml files under datax/hdfswriter/ and datax/hdfsreader/ as follows:

      <properties>
      <!--Upgrade from 2.7.1 to 2.8.3-->
      <hadoop.version>2.8.3</hadoop.version>
      </properties>
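    2. Compile DataX. Run the following command in the root directory of the DataX source code:

      mvn -U clean package assembly:assembly -Dmaven.test.skip=true

    3. After the build completes, confirm that the datax.tar.gz file has been generated in the target directory under the root directory of the DataX source code.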
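The pom.xml change in the first substep can also be scripted. The following is a minimal sketch, assuming each pom.xml already contains a `<hadoop.version>` property as shown in the example; the helper names `set_hadoop_version` and `update_poms` are hypothetical and not part of DataX.

```python
import re
from pathlib import Path

# Matches the <hadoop.version> property and captures its opening and closing tags.
HADOOP_VERSION_RE = re.compile(r"(<hadoop\.version>)[^<]*(</hadoop\.version>)")


def set_hadoop_version(pom_text: str, version: str) -> str:
    """Return pom_text with every <hadoop.version> value replaced by version."""
    new_text, count = HADOOP_VERSION_RE.subn(
        lambda m: m.group(1) + version + m.group(2), pom_text
    )
    if count == 0:
        # Fail loudly instead of silently leaving the file unchanged.
        raise ValueError("no <hadoop.version> property found")
    return new_text


def update_poms(datax_root: str, version: str = "2.8.3") -> None:
    """Rewrite the pom.xml files of the hdfsreader and hdfswriter modules in place."""
    for module in ("hdfsreader", "hdfswriter"):
        pom = Path(datax_root) / module / "pom.xml"
        text = pom.read_text(encoding="utf-8")
        pom.write_text(set_hadoop_version(text, version), encoding="utf-8")
```

Calling `update_poms("datax")` from the directory that contains the DataX source tree would apply the change to both modules before the Maven build is run.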

  3. Install DataX.

    1. Decompress datax.tar.gz to the /opt/datax directory.
    2. Download hadoop-huaweicloud from GitHub. Use the latest hadoop-huaweicloud release built for Hadoop 2.8.3, for example, hadoop-huaweicloud-2.8.3-hw-53.8.
    3. Save the downloaded JAR package to both the /opt/datax/plugin/writer/hdfswriter/libs and /opt/datax/plugin/reader/hdfsreader/libs directories.
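Copying the JAR package into both plugin directories can be sketched as follows. This assumes DataX has already been decompressed to /opt/datax; the helper name `install_jar` is hypothetical, and the JAR file name passed to it is whatever file was actually downloaded.

```python
import shutil
from pathlib import Path

# Plugin library directories, relative to the DataX installation directory.
PLUGIN_LIB_DIRS = (
    "plugin/writer/hdfswriter/libs",
    "plugin/reader/hdfsreader/libs",
)


def install_jar(jar_path, datax_home="/opt/datax"):
    """Copy the JAR into the hdfswriter and hdfsreader libs directories.

    Returns the list of destination paths that were written.
    """
    jar = Path(jar_path)
    copied = []
    for rel in PLUGIN_LIB_DIRS:
        lib_dir = Path(datax_home) / rel
        if not lib_dir.is_dir():
            # A missing directory usually means DataX was not decompressed here.
            raise FileNotFoundError(f"missing plugin directory: {lib_dir}")
        copied.append(Path(shutil.copy2(jar, lib_dir)))
    return copied
```

Checking that the returned paths exist is a quick way to confirm that both the reader and the writer plugin can load the OBS file system classes.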

  4. Check whether the connection is successful.

    Example: txtfilereader is the source, and OBS is the destination.
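The reader in this example expects a text file with two tab-delimited columns. A minimal sketch for creating such a test file is shown below; the helper name `write_sample` and the sample rows are hypothetical.

```python
import csv


def write_sample(path, rows):
    """Write rows as tab-delimited, UTF-8 text, one record per line."""
    with open(path, "w", encoding="utf-8", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
```

For example, `write_sample("/opt/test.txt", [["a", "b"], ["c", "d"]])` would produce a file that matches the two STRING columns at index 0 and index 1.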

    1. Create a job configuration file named file2obs.json. In the hdfswriter parameters, defaultFS specifies the OBS bucket, path specifies a path in that bucket, and the hadoopConfig block with the OBS settings must be added. Replace the placeholder values with your own.

      {
          "setting":{},
          "job":{
              "setting":{
                  "speed":{
                      "channel":2
                  }
              },
              "content":[
                  {
                      "reader":{
                          "name":"txtfilereader",
                          "parameter":{
                              "path":[
                                  "/opt/test.txt"
                              ],
                              "encoding":"UTF-8",
                              "column":[
                                  {
                                      "index":0,
                                      "type":"STRING"
                                  },
                                  {
                                      "index":1,
                                      "type":"STRING"
                                  }
                              ],
                              "fieldDelimiter":"\t"
                          }
                      },
                      "writer":{
                          "name":"hdfswriter",
                          "parameter":{
                              "defaultFS":"obs://obs-bucket",
                              "fileType":"text",
                              "path":"/test",
                              "fileName":"test",
                              "column":[
                                  {
                                      "name":"col1",
                                      "type":"STRING"
                                  },
                                  {
                                      "name":"col2",
                                      "type":"STRING"
                                  }
                              ],
                              "writeMode":"append",
                              "fieldDelimiter":"\t",
                              "hadoopConfig":{
                                  "fs.obs.impl":"org.apache.hadoop.fs.obs.OBSFileSystem",
                                  "fs.obs.access.key":"AK that can access OBS",
                                  "fs.obs.secret.key":"SK that can access OBS",
                                  "fs.obs.endpoint":"Endpoint of the region where the OBS bucket is located"
                              }
                          }
                      }
                  }
              ]
          }
      }
    2. Start DataX:

      python /opt/datax/bin/datax.py file2obs.json
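The job file can also be generated programmatically, which guarantees valid JSON and keeps the AK and SK out of a hand-edited file. The following is a minimal sketch that mirrors the configuration above; the helper names `build_job` and `write_job` are hypothetical and not part of DataX.

```python
import json


def build_job(bucket, obs_path, access_key, secret_key, endpoint,
              source_file="/opt/test.txt"):
    """Build a DataX job dict: txtfilereader as source, hdfswriter (OBS) as destination."""
    reader = {
        "name": "txtfilereader",
        "parameter": {
            "path": [source_file],
            "encoding": "UTF-8",
            "column": [
                {"index": 0, "type": "STRING"},
                {"index": 1, "type": "STRING"},
            ],
            "fieldDelimiter": "\t",
        },
    }
    writer = {
        "name": "hdfswriter",
        "parameter": {
            "defaultFS": f"obs://{bucket}",  # OBS bucket
            "fileType": "text",
            "path": obs_path,                # path in the OBS bucket
            "fileName": "test",
            "column": [
                {"name": "col1", "type": "STRING"},
                {"name": "col2", "type": "STRING"},
            ],
            "writeMode": "append",
            "fieldDelimiter": "\t",
            # Hadoop settings required for the OBS file system.
            "hadoopConfig": {
                "fs.obs.impl": "org.apache.hadoop.fs.obs.OBSFileSystem",
                "fs.obs.access.key": access_key,
                "fs.obs.secret.key": secret_key,
                "fs.obs.endpoint": endpoint,
            },
        },
    }
    return {
        "setting": {},
        "job": {
            "setting": {"speed": {"channel": 2}},
            "content": [{"reader": reader, "writer": writer}],
        },
    }


def write_job(job, out_path="file2obs.json"):
    """Serialize the job dict to a JSON file that datax.py can read."""
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(job, fh, indent=4)
```

One possible usage is to read the AK and SK from environment variables (the variable names are up to you), pass them to `build_job`, call `write_job`, and then start DataX with the generated file2obs.json as shown above.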