Updated on 2025-07-30 GMT+08:00

Using Kettle to Migrate Small Tables from AWS Redshift to a GaussDB(DWS) Cluster

This practice demonstrates how to use the open-source tool Kettle to migrate Redshift data to GaussDB(DWS).

Kettle Overview

Kettle is an open-source Extract, Transform, Load (ETL) tool, also known as KDE Extraction, Transportation, Transformation, and Loading Environment. It provides a GUI that enables you to design and configure ETL processes by dragging and dropping components. Kettle supports multiple data sources and destinations, including relational databases, files, APIs, and Hadoop. It offers extensive conversion and cleansing capabilities such as data format transformation, filtering, merging, and computing.

Key features include:

  • Building data pipelines with drag-and-drop operations, no coding required
  • Connecting to multiple data sources
  • Visualizing data pipelines
  • Developing pipelines using templates
  • Visualizing scheduled tasks
  • Comprehensive Hadoop support

Note the following when using Kettle for migration:

  • An EIP must be bound to the GaussDB(DWS) cluster before Kettle can connect to it.
  • Both Kettle and CDM are suited to batch processing. Kettle is recommended when the data volume or number of tables is small; otherwise, use CDM.
  • Kettle can both export data from and import data into databases.
  • Kettle can synchronize data in real time using triggers, timestamp fields, or Kafka.

This practice takes about 90 minutes. Migrating data from Redshift involves the following operations:

  1. Prerequisites
  2. Migrating Metadata
  3. Deploying Kettle
  4. Creating a Transformation Task and Configuring Source and Destination Databases
  5. Migrating Data
  6. Executing Migration Jobs Concurrently
  7. Optimizing the Migration Job
  8. Checking Table Data

Prerequisites

  • You have purchased a GaussDB(DWS) cluster, bound an EIP to the cluster, and created the destination database dws_vd.
  • You have downloaded and installed the DSC tool.
  • You have downloaded and installed the DataCheck tool.
  • You have obtained the Kettle tool package (this chapter uses version 9.4.0.0-343 as an example). The tool package includes the GaussDB(DWS) driver package, the Redshift driver package, and the plugins folder.
    Figure 1 Kettle tool package
  • JDK 1.8 has been installed and its environment variables have been configured.

Migrating Metadata

Use the DSC migration tool developed by Huawei Cloud to convert the DDL statements of the sample tables exported from the Redshift source into SQL statements that can be executed in GaussDB(DWS).

  1. Decompress the obtained DSC tool package and save the files to be converted to the input directory of the DSC.

    Redshift table design does not include indexes. Performance is determined mainly by the distribution style and key (DISTSTYLE/DISTKEY) and the sort key (SORTKEY) specified in the DDL.

    Figure 2 DDL statement
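
    For reference, the following is a minimal sketch of such a conversion, using a hypothetical sales_order table. The actual statements come from the Redshift export and from DSC, and how DSC maps the sort key depends on its configuration.

    -- Source DDL exported from Redshift (illustrative table and columns)
    CREATE TABLE public.sales_order (
        order_id   BIGINT,
        store_id   INTEGER,
        sale_date  DATE,
        amount     NUMERIC(18,2)
    )
    DISTSTYLE KEY
    DISTKEY (store_id)
    SORTKEY (sale_date);

    -- Converted DDL for GaussDB(DWS) (illustrative): the distribution key maps to DISTRIBUTE BY HASH
    CREATE TABLE public.sales_order (
        order_id   BIGINT,
        store_id   INTEGER,
        sale_date  DATE,
        amount     NUMERIC(18,2)
    )
    DISTRIBUTE BY HASH (store_id);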

  2. Customize the features-pg.properties file.
  3. Go to the DSC directory and run the runDSC script: double-click runDSC.bat in Windows, or run runDSC.sh in Linux.
  4. Convert the syntax.

    runDSC.bat -S postgres
    

  5. View the result and conversion information.
  6. Connect to GaussDB(DWS) and execute the DDL statements converted in the previous step to create the tables.

    For more information about DSC, see DSC Tool Guide.

Deploying Kettle

  1. Unzip the downloaded Kettle tool package.

    Figure 3 Unzipping the Kettle tool package

  2. From the unzipped tool package, copy the Redshift driver package and the GaussDB(DWS) driver package to the lib folder of Kettle.

    Figure 4 lib folder

  3. Copy the dws-table-output-plugin folder from the package to the plugins folder of Kettle.

    Figure 5 plugins folder

Creating a Transformation Task and Configuring Source and Destination Databases

  1. After deploying Kettle, double-click the Spoon script in the data-integration directory of Kettle to start Kettle.

    Figure 6 Starting Kettle

    The following page is displayed.

    Figure 7 Kettle GUI

  2. Choose File > New > Transformation to create a new transformation.

    Figure 8 Creating a transformation task

  3. Under View, right-click Database connections, choose New, and configure the source and destination database connections.

    • The connection configuration of the redshift-VD database (source) is as follows:
      • Custom connection URL: jdbc:redshift://dwdsi-simulacao.cslagxelrdnm.us-east-1.redshift.amazonaws.com:xxxx/dsidw
      • Username: xxxxxxx
      • Password: xxxxxxx
      Figure 9 Configuring the source
    • The connection configuration of the DWS-VD database (destination) is as follows:
      • Host Name: EIP of the GaussDB(DWS) cluster.
      • Database Name: dws_vd
      • Username: dbadmin
      • Password: password of user dbadmin.
      Figure 10 Configuring the destination
      Figure 11 options

      After the connections are configured, right-click the source and destination connections and click Share so that you do not need to reconfigure them in later tasks.

  4. Under Design, drag the Table input component from the Input directory and the DWS TableOutput component from the Output directory to the right panel.
  5. Draw a hop from the Table input component to the DWS TableOutput component to connect the two components (for example, hold down Shift and drag from Table input to DWS TableOutput).

    Figure 12 Connecting components

Migrating Data

Full data migration

  1. Right-click Table Input and choose Edit from the shortcut menu. Select the source database link.

    Figure 13 Editing Table Input
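
    In the Table input editor, the SQL statement determines what is read from the source. For a full migration this is typically a full-table SELECT, for example (table and column names are illustrative):

    -- Full extraction of the source table
    SELECT order_id, store_id, sale_date, amount
    FROM public.sales_order;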

  2. Right-click DWS TableOutput and choose Edit from the shortcut menu. Select the destination database link. Select Truncate table and Specify database fields, click Get fields under Database fields to acquire the source and destination field mapping, and then click OK.

    Figure 14 Editing Table output
    Figure 15 Editing database fields

  3. After the configuration, click Run to start the migration task.

    Figure 16 Running the task

Incremental data migration

The steps for incremental and full migration are essentially the same. The difference is that a WHERE condition is added to the source SQL statement, and Truncate Table is deselected in the destination configuration.
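
For example, assuming the source table has a column that records the last update time (the table and column names below are illustrative), the Table input SQL for an incremental run could look like this:

    -- Incremental extraction: only rows changed since the last migration run
    SELECT order_id, store_id, sale_date, amount
    FROM public.sales_order
    WHERE last_update_time >= '2025-01-01 00:00:00';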

  1. Right-click Table Input and choose Edit from the shortcut menu. Select the source database link.

    Figure 17 Editing Table Input

  2. Right-click DWS TableOutput and choose Edit from the shortcut menu. Select the destination database link. Deselect Truncate table, click Get fields under Database fields to acquire the source and destination field mapping, and then click OK.

    Figure 18 Editing Table output

  3. After the configuration, click Run to start the migration task.

    Figure 19 Running the task

Executing Migration Jobs Concurrently

A job combines the transformations of multiple tables into one task so that they can be executed together, which speeds up the migration.

  1. Choose File > New > Job. Drag the required components (Start and Transformation) to the panel and connect them with hops.

    Figure 20 Creating a job

  2. Double-click Transformation and select the transformation task that has been configured and saved.

    Figure 21 Configuring a transformation

  3. Right-click the Start component and select Run Next Entries in Parallel to enable concurrent execution of the subsequent transformation tasks.

    Figure 22 Setting concurrency

    After the setting, a double slash will appear between the Start and Transformation components.

    Figure 23 Setting concurrency succeeded

  4. Click Run to execute the transformation tasks concurrently.

    Figure 24 Executing the transformation tasks concurrently

Optimizing the Migration Job

  1. Configure the Kettle memory.

    To allow more concurrent Kettle tasks and a larger data cache, you can increase the memory allocated to Kettle.

    Open the Spoon.bat script in a text editor such as Notepad++ and edit the JVM memory settings (typically the -Xmx value in the PENTAHO_DI_JAVA_OPTIONS line). Setting the maximum heap to 60% to 80% of the host memory is recommended.

    Figure 25 Configuring the memory

  2. Configure a job.

    • For tables with fewer than 10 million records, configure about 10 tables per job.
    • For larger tables with around 100 million records, configure 2 to 3 tables per job. If an issue occurs, the affected tables can then be rescheduled individually, which improves efficiency.
    • For tables with over 100 million records, especially those with many fields, Kettle's extraction can be slow. Choose a migration method based on your specific service needs.
      Figure 26 Configuring a job
    • Group tasks with similar table data volumes into the same job so that all tasks finish at about the same time and no tail task delays subsequent jobs. A sample query for estimating source table sizes is shown after this list.
    • If a task error occurs, check the error logs. Issues are often due to source link restrictions causing disconnections. In such cases, restart Kettle and try again.
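
    To decide how to group tables, you can first check their approximate sizes on the source. As a reference, the following query uses the Redshift system view SVV_TABLE_INFO (available to users with the required privileges) to list tables with their estimated row counts and sizes:

    -- Approximate row counts and sizes (in 1 MB blocks) of source tables, for grouping them into jobs
    SELECT "schema", "table", tbl_rows, size
    FROM svv_table_info
    ORDER BY tbl_rows DESC;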

Checking Table Data

After the migration, check whether the data on the source and destination databases is consistent using DataCheck.

  1. Download and unzip DataCheck-*.zip, and then go to the DataCheck-* directory. For a description of the files in the directory, see Table 1.
  2. Configure the tool package.

    • In Windows:
      Open the dbinfo.properties file in the conf folder and configure it based on your actual needs.

      Run the following command in the tool to generate the ciphertext used for src.passwd and dws.passwd, replacing password with the actual password:

      encryption.bat password

      After the command is executed, an encrypted file is generated in the local bin directory.

    • In Linux:

      The method of generating the ciphertext is similar to that for Windows. The command is sh encryption.sh password. Other steps are the same.

  3. Check data.

    In Windows:

    1. Open the check_input.xlsx file, enter the schemas, databases, source table, and destination table to be checked, and fill in Row Range with a data query statement to specify the query scope.
      • The source database name configured in the configuration file is used by default. If a different source database name is specified in the check_input.xlsx file, the name in the file takes precedence.
      • The Check Strategy offers three levels: high, middle, and low. If unspecified, the default is low.
      • The Check Mode supports statistics (statistical value checks).

      The following figure shows the check_input.xlsx file for metadata comparison.

      Figure 27 check_input.xlsx
    2. Run the datacheck.bat command in the bin directory to execute the check tool.

    3. View the generated check result file check_input_result.xlsx.

      The following figure shows the statistical value check.

    In Linux:

    1. Edit and upload the check_input.xlsx file by referring to step 1 for Windows.
    2. Run the sh datacheck.sh command to execute the check tool.

    3. View the check result in the check_input_result.xlsx file. (The check result analysis is the same as that for Windows.)

Related Information

For more information, see DataCheck.

Table 1 Description of the DataCheck directory

All of the following files and folders are located in the DataCheck directory:

  • bin: Entry scripts of the check tool.
    • Windows version: datacheck.bat
    • Linux version: datacheck.sh
  • conf: Configuration files, used to configure the connections to the source and destination databases and to set log printing.
  • lib: JAR packages required for running the check tool.
  • check_input.xlsx: Information about the tables to be checked, including schema, table, and column names, as well as the check level and check rules. Three check levels are supported: high, middle, and low; the default is low.
  • logs: Not included in the package. Generated automatically when the check tool runs to record the tool's execution process.
  • check_input_result.xlsx: Not included in the package. After the check tool runs, the check result file is generated in the same directory as check_input.xlsx.

Table 2 Basic functions of the data check tool

DataCheck provides the following basic functions:

  • Check data in GaussDB(DWS), MySQL, and PostgreSQL databases.
  • Check common fields, such as numeric, time, and character types.
  • Support three check levels, including high, middle, and low.
  • Check schemas, table names, and column names.
  • Specify the check scope of records. By default, all records are checked.
  • Support various check methods, including count(*), max, min, sum, avg, and sampling details check.
  • Output the check result and related check details.

Table 3 Data check levels

  • Low
    • Description: quantity check
    • Syntax: number of records: COUNT(*)
  • Middle
    • Description: quantity check and numeric type check
    • Syntax: number of records: COUNT(*); value check: MAX, MIN, SUM, and AVG
  • High
    • Description: quantity check, numeric type check, date type check, and character type check
    • Syntax: number of records: COUNT(*); value check: MAX, MIN, SUM, and AVG; date check: MAX and MIN; character check: ORDER BY ... LIMIT 1000, which reads the data and checks whether the content is the same
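
As a rough illustration, a middle-level check on a numeric column corresponds to comparing aggregate values computed on the source and destination tables, for example (table and column names are illustrative; the actual statements are generated by DataCheck):

    -- Middle-level statistical check: run on both source and destination and compare the results
    SELECT COUNT(*) AS row_cnt,
           MAX(amount) AS max_amount,
           MIN(amount) AS min_amount,
           SUM(amount) AS sum_amount,
           AVG(amount) AS avg_amount
    FROM public.sales_order;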