
Using Kettle to Migrate Data from BigQuery to a DWS Cluster

This practice demonstrates how to use the open-source tool Kettle to migrate data from BigQuery to DWS. Before the migration, the DDL syntax of the metadata is converted using DSC. Historical and incremental data is then migrated using Kettle.

For details about Kettle, see Kettle Overview.

This practice takes about 90 minutes. Migrating data from BigQuery involves the following operations:

  1. Prerequisites: Prepare the migration tool Kettle and related suite packages, and obtain the JSON key file of BigQuery.
  2. Step 1: Configure the Connection Between BigQuery and DWS: Use Kettle to create two database connections to connect BigQuery and DWS.
  3. Step 2: Migrate Metadata: Use DSC to convert and migrate the metadata.
  4. Step 3: Migrate All Service Data: Migrate all historical data.
  5. Step 4: Migrate Incremental Service Data: Migrate incremental data.
  6. Step 5: Execute Migration Jobs Concurrently: Create a job to concurrently execute transformation tasks for simultaneous table migration.
  7. Step 6: Check Table Data: Verify data consistency after migration.

Notes and Constraints

  • Entire database migration is not supported.
  • A single transformation migrates one table. To migrate multiple tables at a time, create multiple transformations within a single job.

Prerequisites

JDK 11 or later has been installed, and related environment variables have been configured.

Pre-migration Checks

  1. Install Kettle and configure dws-client. For details, see Using Kettle to Import Data.
  2. After Kettle is installed and configured, go to the data-integration\lib directory of the Kettle folder and delete the guava-17.0.jar file so that Kettle can connect to BigQuery (the bundled Guava version conflicts with the BigQuery JDBC driver).

  3. Obtain the JSON key file from Google Cloud and configure Kettle.

    The following operations are for reference only. For details, visit the Google Cloud official website.
    1. Log in to Google Cloud and select IAM from the product drop-down list.
    2. In the navigation pane, choose Service Accounts. In the service account list, locate the target service account, click the more actions icon on the right, and select Manage keys. If no service account is available, create one.
    3. Choose Create new key > JSON, click Create, save the key locally, and copy the local storage path.

Step 1: Configure the Connection Between BigQuery and DWS

  1. Open the data-integration folder extracted from the Kettle installation package and double-click Spoon.bat to start Kettle.

    If it fails to run, check whether the JDK version is 11 or later.

  2. Select File > New > Transformation.

  3. Select View and double-click Database connections.

  4. In the displayed Database Connection dialog box, select General in the left pane and enter the following information in sequence.

    • Connection name: Enter Bigquery.
    • Connection type: Select Google BigQuery.
    • Access: Select Native(JDBC).
    • Host Name: Enter https://www.googleapis.com/bigquery/v2.
    • Project ID: Enter the project ID recorded in the JSON key file obtained in 3 of Pre-migration Checks.
    • Port Number: Enter 443.

  5. In the left pane, click Options and add the following information in sequence:

    • OAuthServiceAcctEmail: Enter the email address of the service account whose JSON key was generated in Google Cloud IAM (the client_email field in the JSON key file).
    • OAuthPvtKeyPath: Enter the local path where the JSON key file is stored.
    • OAuthType: Enter 0.

  6. Click Test.
  7. After the connection is successful, click OK.

    The connection may fail if the client cannot find the required JAR packages. In this case, download the latest BigQuery JDBC driver package from the official website, decompress it, save the JAR files to the data-integration\lib folder, delete the guava-17.0.jar file, restart the client, and connect to the database again.

  8. Right-click Database connections under Transformation xx, click New, and configure the destination database connection.
  9. On the left, select General and set the following parameters:

    • Connection name: Enter DWS.
    • Connection type: Select PostgreSQL.
    • Access: Select Native(JDBC).
    • Host Name: Enter the IP address of the DWS cluster, for example, 10.140.xx.xx.
    • Database Name: Enter gaussdb.
    • Port Number: Enter 8000.
    • Username: Enter dbadmin.
    • Password: Enter the password of user dbadmin of DWS.

  10. On the left, click Options and add the following parameter.

    stringtype: Set it to unspecified so that string parameters can be implicitly converted to the target column types.

  11. Confirm the parameters and click Test. After the connection is successful, click OK. (The JDBC URLs that the two connections correspond to are sketched after this procedure for reference.)
  12. Double-click the name of Transformation xx, switch to the Monitoring tab, and set Step performance measurement interval (ms) to 1000 to optimize the Kettle processing performance.
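
For reference only, the two connections configured above correspond roughly to the following JDBC URLs. The project ID, service account email, key path, and DWS IP address are placeholders, and the exact BigQuery URL format depends on the JDBC driver version, so verify it against the driver documentation.

  jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=myproject;OAuthType=0;OAuthServiceAcctEmail=kettle-sa@myproject.iam.gserviceaccount.com;OAuthPvtKeyPath=C:\keys\myproject-key.json;

  jdbc:postgresql://10.140.xx.xx:8000/gaussdb?stringtype=unspecified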

Step 2: Migrate Metadata

Use the Huawei-developed tool DSC to convert the DDLs of the metadata exported from BigQuery into SQL statements compatible with DWS.

  1. Export metadata and incremental metadata.

    1. Obtain table information.
      SELECT * FROM Information_schema.tables
      
    2. Obtain the DDLs of the tables.
      SELECT ddl FROM `project.dataset.INFORMATION_SCHEMA.TABLES` WHERE table_name = 'your_table_name'
      

      Export the query results.

    3. If the metadata changes, query the most recently modified tables.
      SELECT * FROM Information_schema.tables ORDER BY last_modified_time DESC;
      

      Export the metadata as described in 1.b, compare it with the metadata in DWS, convert the changed DDLs using DSC, delete the affected tables in DWS, execute the new DDLs, and migrate those tables in full.

  2. Configure the DSC tool. For details, see DSC Configuration.
  3. Use the DSC to convert the DDLs.

    1. Place the SQL file to be converted in the input folder of the DSC tool.
    2. Open the cmd window and go to the DSC directory.

    3. Execute the runDSC script (runDSC.sh in the Linux environment).
      runDSC.bat -S mysql

    4. View the conversion results and logs in the output directory of the DSC tool, and compare the DDLs before and after the conversion. (An illustrative before-and-after example is provided after this procedure.)

  4. Connect to the DWS database and execute the converted DDLs to create the target tables.
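
The following is a minimal, illustrative example of what a converted DDL may look like. The project, dataset, table, and column names are hypothetical, and the actual type mapping depends on the DSC version and conversion settings, so always use the DDLs generated in the output directory.

  -- BigQuery DDL before conversion (hypothetical table)
  CREATE TABLE `myproject.mydataset.t_order` (
    order_id INT64,
    order_name STRING,
    order_amount NUMERIC,
    order_time TIMESTAMP
  );

  -- DWS DDL after conversion (assumed mapping: INT64 to BIGINT, STRING to VARCHAR)
  CREATE TABLE mydataset.t_order (
    order_id BIGINT,
    order_name VARCHAR,
    order_amount NUMERIC,
    order_time TIMESTAMP
  );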

Step 3: Migrate All Service Data

  1. In the left pane of Kettle, select Design > Input > Table input.
  2. Double-click Table input, set Connection to the BigQuery connection name configured in Step 1: Configure the Connection Between BigQuery and DWS, and enter the query SQL statement in the SQL editing box (an example query is provided after this procedure).

  3. Click OK. The table input information is created.
  4. In the left pane, select Design > Output > Table output.
  5. Double-click Table output and enter the following information.

    • Connection: Select the DWS connection name configured in Step 1: Configure the Connection Between BigQuery and DWS.
    • Target schema: Enter the corresponding schema.
    • Target table: Enter the corresponding table name.
    • Truncate table: Select this option if the table is fully migrated multiple times, so that existing data is cleared before each load. Otherwise, ignore it.

  6. Click OK. The table output information is created successfully.
  7. In the right pane, move the pointer to Table input and drag the arrow to Table output.

  8. Click Run to execute the transformation.
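
For reference, the query in the Table input step for a full migration is typically a plain SELECT against the source table. The project, dataset, table, and column names below are placeholders; listing the columns explicitly keeps the field order aligned with the target table.

  SELECT order_id, order_name, order_amount, order_time
  FROM `myproject.mydataset.t_order`;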

Step 4: Migrate Incremental Service Data

The steps for incremental migration are similar to those for full migration. The difference is that WHERE conditions are added to the source SQL statements.

  1. Configure BigQuery. In the right pane, edit Table input, keep the source database connection, and add a WHERE condition to the SQL statement so that only the incremental data is queried.

    This method works for tables that have a time field. For tables without a time field but with partitions, import the data partition by partition. For tables with neither, delete all data in the target table and import the full table again. (Example queries are provided after this procedure.)
    Figure 1 Configuring information for Table input

  2. Configure the target database. Right-click Table output and edit it.

    Figure 2 Configuring information for Table output

  3. Click Run to execute the migration task.
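
The following incremental queries are minimal sketches only. They assume the hypothetical table `myproject.mydataset.t_order` with a TIMESTAMP column order_time and, in the second case, ingestion-time partitioning; adjust the column names, cutoff values, and partition filters to your own schema.

  -- Incremental extraction based on a time field (the cutoff value is a placeholder)
  SELECT order_id, order_name, order_amount, order_time
  FROM `myproject.mydataset.t_order`
  WHERE order_time >= TIMESTAMP('2025-10-01 00:00:00');

  -- Extraction by partition for an ingestion-time partitioned table
  SELECT order_id, order_name, order_amount, order_time
  FROM `myproject.mydataset.t_order`
  WHERE _PARTITIONDATE = DATE '2025-10-01';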

Step 5: Execute Migration Jobs Concurrently

A job combines the transformations of multiple tables into one task so that they can be executed together for faster migration.

  1. Choose File > New > Job. Drag the target components to the panel and connect them using lines.

    Figure 3 Creating a job

  2. Double-click Transformation and select the transformation task that has been configured and saved.

    Figure 4 Configuring a transformation

  3. Right-click the Start component and select Run Next Entries in Parallel to enable concurrent execution of the transformation tasks.

    Figure 5 Setting concurrency

    After the setting, a double slash will appear between the Start and Transformation components.

    Figure 6 Setting concurrency succeeded

  4. Click Run to execute the transformation tasks concurrently.

    Figure 7 Executing the transformation tasks concurrently

Step 6: Check Table Data

After the migration, check whether the data in the source and destination databases is consistent using DataCheck. (A manual spot-check example is provided at the end of this section.)

  1. Create an ECS (running Linux or Windows) and bind an EIP to the ECS to access BigQuery.
  2. Upload the software package to the specified directory (user-defined) on the ECS, decompress the DataCheck-*.zip package, and go to the DataCheck-* directory. For how to use the files in the directory, see Table 1.
  3. Configure the tool package.

    • In Windows:
      Open the dbinfo.properties file in the DataCheck/conf folder and configure it as needed.

      You can use the following command provided by the tool to generate the ciphertext for src.passwd and dws.passwd:

      encryption.bat password

      After the command is executed, an encrypted file is generated in the local bin directory.

    • In Linux:

      Decompress the DataCheck package and go to the directory where the properties file is stored.

      cd /DataCheck/conf
      cat dbinfo.properties

      Configure the properties file as required; the parameters are the same as in the Windows case.

      The method of generating the ciphertext is similar to that in Windows. The command is sh encryption.sh password.

  4. Check data.

    In Windows:

    1. Open the check_input.xlsx file, enter the schemas, source table, and destination table to be checked, and fill in Row Range with a data query statement to specify the query scope.
      • The Check Strategy offers three levels: high, middle, and low. If unspecified, the default is low.
      • The Check Mode supports statistics (statistical value checks).

      The following figure shows the check_input.xlsx file for metadata comparison.

      Figure 8 check_input.xlsx
    2. Run the datacheck.bat command in the bin directory to execute the check tool.

    3. View the generated check result file check_input_result.xlsx.

      Check result analysis:

      • If Status is No Pass, the check fails.
      • The Check Result Diff column shows that the avg values in the value check are different. The avg value of DWS is 61.5125, and that of the source database is 61.5000.
      • The specific check command is displayed in Check SQL.

    In Linux:

    1. Edit and upload the check_input.xlsx file by referring to step 1 for Windows.
    2. Run the sh datacheck.sh command to execute the check tool.

    3. View the check result in the check_input_result.xlsx file. (The check result analysis is the same as that for Windows.)
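
In addition to DataCheck, you can spot-check consistency by running the same aggregate query on both sides and comparing the results. The table and column names below are placeholders; the BigQuery query uses the backtick-quoted fully qualified table name, while the DWS query uses schema.table.

  -- Run in BigQuery (source)
  SELECT COUNT(*) AS row_cnt, AVG(order_amount) AS avg_amount
  FROM `myproject.mydataset.t_order`;

  -- Run in DWS (target)
  SELECT COUNT(*) AS row_cnt, AVG(order_amount) AS avg_amount
  FROM mydataset.t_order;

If the row counts and statistical values match, the table is consistent at the statistics level, which is what the statistics check mode verifies.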