Reading Data from PostgreSQL CDC and Writing Data to GaussDB(DWS)

Description

Change Data Capture (CDC) synchronizes incremental changes from a source database to one or more destinations. During synchronization, the data can also be processed, for example, grouped (GROUP BY) or joined across multiple tables (JOIN).

This example creates a PostgreSQL CDC source table to monitor PostgreSQL data changes and insert the changed data into a GaussDB(DWS) database.

Prerequisites

You have created an RDS for PostgreSQL instance. In this example, the RDS for PostgreSQL database version is 11.

The version of the RDS for PostgreSQL database cannot be earlier than 11.
You have created a GaussDB(DWS) instance.

Overall Process

Figure 1 shows the overall development process.

Figure 1 Job development flowchart

Step 1: Create an Elastic Resource Pool and Create Queues Within It

Step 2: Create an RDS for PostgreSQL Database and Table

Step 3: Create a GaussDB(DWS) Database and Table

Step 4: Create an Enhanced Datasource Connection

Step 5: Run a Job

Step 6: Send Data and Query Results

Step 1: Create an Elastic Resource Pool and Create Queues Within It

The CIDR block of a new queue cannot overlap with the CIDR blocks of the data sources, which in this example are the RDS for PostgreSQL and GaussDB(DWS) instances. Otherwise, datasource connections will fail to be created.
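One way to sanity-check a planned CIDR block against a data source subnet is the inet overlap operator in PostgreSQL, which can be run from any PostgreSQL session. This is only a sketch: 172.16.0.0/19 is the example CIDR block used in Table 1, while 192.168.0.0/16 is an assumed data source subnet that you should replace with the subnet of your own RDS for PostgreSQL or GaussDB(DWS) instance.

```sql
-- && returns true when either network contains or equals the other,
-- which means the two CIDR blocks overlap.
-- 192.168.0.0/16 is a hypothetical data source subnet.
SELECT inet '172.16.0.0/19' && inet '192.168.0.0/16' AS cidr_overlaps;
```

A result of false (f) indicates that the two blocks do not overlap.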

Log in to the DLI management console.
In the navigation pane on the left, choose Resources > Resource Pool.
On the displayed page, click Buy Resource Pool in the upper right corner.

On the displayed page, set the parameters.

In this example, we will buy the resource pool in the CN East-Shanghai2 region. Table 1 describes the parameters.

**Table 1** Parameter descriptions
Parameter	Description	Example Value
Region	Select a region where you want to buy the elastic resource pool.	CN East-Shanghai2
Project	Project uniquely preset by the system for each region	Default
Name	Name of the elastic resource pool	dli_resource_pool
Specifications	Specifications of the elastic resource pool	Standard
CU Range	The maximum and minimum CUs allowed for the elastic resource pool	64-64
CIDR Block	CIDR block the elastic resource pool belongs to. If you use an enhanced datasource connection, this CIDR block cannot overlap that of the data source. Once set, this CIDR block cannot be changed.	172.16.0.0/19
Enterprise Project	Select an enterprise project for the elastic resource pool.	default

Click Buy.
Click Submit.
In the elastic resource pool list, locate the pool you just created and click Add Queue in the Operation column.

Set the basic parameters listed below.

**Table 2** Basic parameters for adding a queue
Parameter	Description	Example Value
Name	Name of the queue to add	dli_queue_01
Type	Type of the queue. To execute SQL jobs, select For SQL. To execute Flink or Spark jobs, select For general purpose.	For general purpose (this example runs a Flink job)
Engine	SQL queue engine. The options are Spark and HetuEngine.	Spark
Enterprise Project	Select an enterprise project.	default

Click Next and configure scaling policies for the queue.

Click Create to add a scaling policy with varying priority, period, minimum CUs, and maximum CUs.

Figure 2 shows the scaling policy configured in this example.

Figure 2 Configuring a scaling policy when adding a queue

**Table 3** Scaling policy parameters
Parameter	Description	Example Value
Priority	Priority of the scaling policy in the current elastic resource pool. A larger value indicates a higher priority. In this example, only one scaling policy is configured, so its priority is set to 1 by default.	1
Period	The first scaling policy is the default policy, and its Period parameter configuration cannot be deleted or modified. The period for the scaling policy is from 00 to 24.	00–24
Min CU	Minimum number of CUs allowed by the scaling policy	16
Max CU	Maximum number of CUs allowed by the scaling policy	64

Click OK.

Step 2: Create an RDS for PostgreSQL Database and Table

Log in to the RDS console. On the displayed page, locate the desired RDS for PostgreSQL instance, click More in its Operation column, and select Log In.
In the login dialog box that appears, enter the username and password and click Log In.
Create a database named testrdsdb.
Create a schema named test for the testrdsdb database.

Choose SQL Operations > SQL Query. On the page displayed, create an RDS for PostgreSQL table.

create table test.cdc_order(
  order_id VARCHAR,
  order_channel VARCHAR,
  order_time VARCHAR,
  pay_amount FLOAT8,
  real_pay FLOAT8,
  pay_time VARCHAR,
  user_id VARCHAR,
  user_name VARCHAR,
  area_id VARCHAR,
  primary key(order_id));

Run the following statement in the PostgreSQL instance so that UPDATE and DELETE change events carry the complete previous row:

ALTER TABLE test.cdc_order REPLICA IDENTITY FULL;
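To confirm that the table is ready for change capture, queries like the following may help. They are a sketch that assumes the current user can read pg_class. The wal_level check rests on the assumption that the CDC connector uses PostgreSQL logical decoding; this guide does not state the required setting explicitly.

```sql
-- relreplident: 'f' means REPLICA IDENTITY FULL,
-- 'd' is the default (primary key columns only).
SELECT relname, relreplident
FROM pg_class
WHERE oid = 'test.cdc_order'::regclass;

-- Logical decoding requires wal_level to be set to logical.
SHOW wal_level;
```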

Step 3: Create a GaussDB(DWS) Database and Table

Connect to the created GaussDB(DWS) cluster.
Connect to the default database gaussdb of a GaussDB(DWS) cluster.
```
gsql -d gaussdb -h Connection address of the GaussDB(DWS) cluster -U dbadmin -p 8000 -W password -r
```
- gaussdb: Default database of the GaussDB(DWS) cluster
- Connection address of the GaussDB(DWS) cluster: If a public network address is used for connection, set this parameter to the public network IP address or domain name. If a private network address is used for connection, set this parameter to the private network IP address or domain name. If an ELB is used for connection, set this parameter to the ELB address.
- dbadmin: Default administrator username used during cluster creation
- -W password: Password of the administrator
Run the following command to create the testdwsdb database:
```
CREATE DATABASE testdwsdb;
```

Run the following command to exit the gaussdb database and connect to testdwsdb:

\q
gsql -d testdwsdb -h Connection address of the GaussDB(DWS) cluster -U dbadmin -p 8000 -W password -r

Run the following commands to create a table:

create schema test;
set current_schema= test;
drop table if exists dws_order;
CREATE TABLE dws_order
(
  order_id VARCHAR,
  order_channel VARCHAR,
  order_time VARCHAR,
  pay_amount FLOAT8,
  real_pay FLOAT8,
  pay_time VARCHAR,
  user_id VARCHAR,
  user_name VARCHAR,
  area_id VARCHAR
);
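Because the sink table must match the columns declared in the Flink job, it can be worth listing the columns that were actually created. The query below is a minimal sketch; it assumes the standard information_schema views are available in your GaussDB(DWS) cluster, as they are in PostgreSQL.

```sql
-- List the columns of test.dws_order in definition order.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'test'
  AND table_name = 'dws_order'
ORDER BY ordinal_position;
```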

Step 4: Create an Enhanced Datasource Connection

Connecting DLI to RDS
1. Go to the RDS console. In the navigation pane on the left, choose Instances. On the displayed page, click the name of the desired RDS instance. Basic information of the instance is displayed.
2. In the Connection Information pane, obtain the floating IP address, database port, VPC, and subnet.
3. Click the security group name. On the displayed page, click the Inbound Rules tab and add a rule to allow access from DLI queues. For example, if the CIDR block of the queue is 10.0.0.0/16, set Priority to 1, Action to Allow, Protocol to TCP, Type to IPv4, Source to 10.0.0.0/16, and click OK.
4. Log in to the DLI management console. In the navigation pane on the left, choose Datasource Connections. On the displayed page, click Create in the Enhanced tab.
5. In the displayed dialog box, set the following parameters:
  - Connection Name: Enter a name for the enhanced datasource connection. For this example, enter dli_rds.
  - Resource Pool: Select the elastic resource pool created in Step 1: Create an Elastic Resource Pool and Create Queues Within It.
  - VPC: Select the VPC of the RDS instance.
  - Subnet: Select the subnet of the RDS instance.
  - Set other parameters as you need.
  Click OK. Click the name of the created datasource connection to view its status. You can perform subsequent steps only after the connection status changes to Active.
6. In the navigation pane on the left, choose Resources > Queue Management. On the page displayed, locate the queue you created in Step 1: Create an Elastic Resource Pool and Create Queues Within It, click More in the Operation column, and select Test Address Connectivity.
7. In the displayed dialog box, enter the floating IP address and database port of the RDS instance obtained in step 2, in the format IP address:port, in the Address box and click Test to check whether the database is reachable.
Connecting DLI to GaussDB(DWS)
1. On the GaussDB(DWS) management console, choose Clusters. On the displayed page, click the name of the created GaussDB(DWS) cluster to view basic information.
2. On the Basic Information tab, locate the Database Attributes pane and obtain the private IP address and port number of the instance. In the Network pane, obtain VPC and subnet information.
3. Click the security group name. On the displayed page, click the Inbound Rules tab and add a rule to allow access from DLI queues. For example, if the CIDR block of the queue is 10.0.0.0/16, set Priority to 1, Action to Allow, Protocol to TCP, Type to IPv4, Source to 10.0.0.0/16, and click OK.
4. Check whether the RDS instance and GaussDB(DWS) instance are in the same VPC and subnet.
  1. If they are, go to step 7. You do not need to create an enhanced datasource connection again.
  2. If they are not, go to step 5. Create an enhanced datasource connection to connect DLI to the subnet where the GaussDB(DWS) instance is located.
5. Log in to the DLI management console. In the navigation pane on the left, choose Datasource Connections. On the displayed page, click Create in the Enhanced tab.
6. In the displayed dialog box, set the following parameters:
  - Connection Name: Enter a name for the enhanced datasource connection. For this example, enter dli_dws.
  - Resource Pool: Select the elastic resource pool created in Step 1: Create an Elastic Resource Pool and Create Queues Within It.
  - VPC: Select the VPC of the GaussDB(DWS) instance.
  - Subnet: Select the subnet of the GaussDB(DWS) instance.
  - Set other parameters as you need.
  Click OK. Click the name of the created datasource connection to view its status. You can perform subsequent steps only after the connection status changes to Active.
7. In the navigation pane on the left, choose Resources > Queue Management. On the page displayed, locate the queue you created in Step 1: Create an Elastic Resource Pool and Create Queues Within It, click More in the Operation column, and select Test Address Connectivity.
8. In the displayed dialog box, enter the private IP address and port number of the GaussDB(DWS) instance obtained in step 2, in the format IP address:port, in the Address box and click Test to check whether the database is reachable.

Step 5: Run a Job

On the DLI management console, choose Job Management > Flink Jobs. On the Flink Jobs page, click Create Job.
In the Create Job dialog box, set Type to Flink OpenSource SQL and Name to FlinkCDCPostgreDWS. Click OK.

On the job editing page, set the following parameters and retain the default values of other parameters.

Queue: Select the queue created in Step 1: Create an Elastic Resource Pool and Create Queues Within It.
Flink Version: Select 1.12.
Save Job Log: Enable this function.
OBS Bucket: Select an OBS bucket for storing job logs and grant access permissions of the OBS bucket as prompted.
Enable Checkpointing: Enable this function.

Enter a SQL statement in the editing pane. The following is an example. Modify the connection parameters, such as IP addresses, ports, and credentials, as you need.

In this example, the syntax version of Flink OpenSource SQL is 1.12. The data source is PostgreSQL CDC and the result data is written to GaussDB(DWS).

**Table 4** Job running parameters
Parameter	Description
Queue	A shared queue is selected by default. You can select a CCE queue with dedicated resources and configure the following parameters: UDF Jar: UDF Jar file. Before selecting such a file, upload the corresponding JAR file to the OBS bucket and choose Data Management > Package Management to create a package. For details, see Creating a Package. In SQL, you can call a UDF that is inserted into a JAR file. NOTE: When creating a job, a sub-user can only select the queue that has been allocated to the user. If the remaining capacity of the selected queue cannot meet the job requirements, the system automatically scales up the capacity and you will be billed based on the increased capacity. When a queue is idle, the system automatically scales in its capacity.
CUs	Sum of the number of compute units and JobManager CUs of DLI. CU is also the billing unit of DLI. One CU equals 1 vCPU and 4 GB of memory. The value is the number of CUs required for job running and cannot exceed the number of CUs in the bound queue.
Job Manager CUs	Number of CUs of the management unit.
Parallelism	Number of parallel tasks that each operator of the Flink OpenSource SQL job runs at the same time. NOTE: This value cannot be greater than four times the compute units (number of CUs minus the number of JobManager CUs).
Task Manager Configuration	Whether to set Task Manager resource parameters. If this option is selected, you need to set the following parameters: CU(s) per TM: Number of resources occupied by each Task Manager. Slot(s) per TM: Number of slots contained in each Task Manager.
OBS Bucket	OBS bucket to store job logs and checkpoint information. If the selected OBS bucket is not authorized, click Authorize.
Save Job Log	Whether to save job run logs to OBS. The logs are saved in Bucket name/jobs/logs/Directory starting with the job ID. CAUTION: You are advised to configure this parameter. Otherwise, no run log is generated after the job is executed. If the job fails, the run log cannot be obtained for fault locating. If this option is selected, you need to set the following parameters: OBS Bucket: Select an OBS bucket to store user job logs. If the selected OBS bucket is not authorized, click Authorize. NOTE: If Enable Checkpointing and Save Job Log are both selected, you only need to authorize OBS once.
Alarm Generation upon Job Exception	Whether to notify users of any job exceptions, such as running exceptions or arrears, via SMS or email. If this option is selected, you need to set the following parameters: SMN Topic Select a user-defined SMN topic. For details about how to create a custom SMN topic, see "Creating a Topic" in Simple Message Notification User Guide.
Enable Checkpointing	Whether to enable job snapshots. If this function is enabled, jobs can be restored based on checkpoints. If this option is selected, you need to set the following parameters: Checkpoint Interval: interval for creating checkpoints, in seconds. The value ranges from 1 to 999999, and the default value is 30. Checkpoint Mode: checkpointing mode, which can be set to either of the following values: At least once: Events are processed at least once. Exactly once: Events are processed only once. OBS Bucket: Select an OBS bucket to store your checkpoints. If the selected OBS bucket is not authorized, click Authorize. Checkpoints are saved in Bucket name/jobs/checkpoint/Directory starting with the job ID. NOTE: If Enable Checkpointing and Save Job Log are both selected, you only need to authorize OBS once.
Auto Restart upon Exception	Whether to enable automatic restart. If this function is enabled, jobs will be automatically restarted and restored when exceptions occur. If this option is selected, you need to set the following parameters: Max. Retry Attempts: maximum number of retries upon an exception. The unit is times/hour. Unlimited: The number of retries is unlimited. Limited: The number of retries is user-defined. Restore Job from Checkpoint: This parameter is available only when Enable Checkpointing is selected.
Idle State Retention Time	How long the state of a key is retained without being updated before it is removed in GroupBy or Window. The default value is 1 hour.
Dirty Data Policy	Policy for processing dirty data. The following policies are supported: Ignore, Trigger a job exception, and Save. If you set this field to Save, Dirty Data Dump Address must be set. Click the address box to select the OBS path for storing dirty data.
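The Parallelism limit described in Table 4 can be worked through with simple arithmetic. The sketch below uses hypothetical values (6 CUs in total, 1 JobManager CU) that are not prescribed by this guide; it can be run in any PostgreSQL-compatible session.

```sql
-- compute units   = total CUs - JobManager CUs
-- max parallelism = 4 * compute units
-- 6 and 1 are assumed example values.
SELECT 6 - 1       AS compute_units,
       4 * (6 - 1) AS max_parallelism;
```

With these assumed values, Parallelism can be set to at most 20.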

create table PostgreCdcSource(
  order_id string,
  order_channel string,
  order_time string,
  pay_amount double,
  real_pay double,
  pay_time string,
  user_id string,
  user_name string,
  area_id string,
  primary key (order_id) not enforced
) with (
  'connector' = 'postgres-cdc',
  'hostname' = '192.168.15.153', -- IP address of the PostgreSQL instance
  'port' = '5432', -- Port number of the PostgreSQL instance
  'pwd_auth_name' = 'xxxxx', -- Name of the datasource authentication of the password type created on DLI. If datasource authentication is used, you do not need to set the username and password for the job.
  'database-name' = 'testrdsdb', -- Database name of the PostgreSQL instance
  'schema-name' = 'test', -- Schema in the PostgreSQL database
  'table-name' = 'cdc_order' -- Table name in the PostgreSQL database
);

create table dwsSink(
  order_id string,
  order_channel string,
  order_time string,
  pay_amount double,
  real_pay double,
  pay_time string,
  user_id string,
  user_name string,
  area_id string,
  primary key (order_id) not enforced
) with (
  'connector' = 'gaussdb',
  'driver' = 'com.huawei.gauss200.jdbc.Driver',
  'url' = 'jdbc:gaussdb://192.168.168.16:8000/testdwsdb', -- 192.168.168.16:8000 indicates the internal IP address and port of the GaussDB(DWS) instance. testdwsdb indicates the name of the created GaussDB(DWS) database.
  'table-name' = 'test\".\"dws_order', -- test indicates the schema of the created GaussDB(DWS) table, and dws_order indicates the GaussDB(DWS) table name.
  'username' = 'xxxxx', -- Username of the GaussDB(DWS) instance
  'password' = 'xxxxx', -- Password of the GaussDB(DWS) instance
  'write.mode' = 'insert'
);

insert into dwsSink select * from PostgreCdcSource where pay_amount > 100;
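Because select * relies on the source and sink tables declaring their columns in the same order, an explicit column list can make the mapping easier to review. The following is an equivalent sketch of the same statement that uses the column names defined in the two tables above. It is not a required change; if you use it, use it in place of the statement above rather than in addition to it.

```sql
-- Same filter as above, with the column mapping spelled out.
insert into dwsSink
select
  order_id,
  order_channel,
  order_time,
  pay_amount,
  real_pay,
  pay_time,
  user_id,
  user_name,
  area_id
from PostgreCdcSource
where pay_amount > 100;
```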

Click Check Semantic and ensure that the SQL statement passes the check. Click Save. Click Start, confirm the job parameters, and click Start Now to execute the job. Wait until the job status changes to Running.

Step 6: Send Data and Query Results

Log in to the RDS console. On the displayed page, locate the desired RDS for PostgreSQL instance, click More in its Operation column, and select Log In.
In the login dialog box that appears, enter the username and password and click Log In.

In the Operation column of the row that contains the created database, click SQL Window and enter the following statement to insert data into the table:

insert into test.cdc_order values
('202103241000000001','webShop','2021-03-24 10:00:00','50.00','100.00','2021-03-24 10:02:03','0001','Alice','330106'),
('202103251606060001','appShop','2021-03-24 12:06:06','200.00','180.00','2021-03-24 16:10:06','0002','Jason','330106'),
('202103261000000001','webShop','2021-03-24 14:03:00','300.00','100.00','2021-03-24 10:02:03','0003','Lily','330106'),
('202103271606060001','appShop','2021-03-24 16:36:06','99.00','150.00','2021-03-24 16:10:06','0001','Henry','330106');
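Before checking the sink, you can preview which of these rows satisfy the job's pay_amount > 100 condition by running the same predicate against the source table. This sketch assumes that the four rows above are the only rows in test.cdc_order.

```sql
-- Rows that the Flink job is expected to forward to GaussDB(DWS).
SELECT order_id, pay_amount
FROM test.cdc_order
WHERE pay_amount > 100
ORDER BY order_id;
```

Only orders 202103251606060001 (pay_amount 200) and 202103261000000001 (pay_amount 300) match, so these are the rows to expect in the GaussDB(DWS) table.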

Connect to the created GaussDB(DWS) cluster.

Connect to the testdwsdb database of the GaussDB(DWS) cluster.

gsql -d testdwsdb -h Connection address of the GaussDB(DWS) cluster -U dbadmin -p 8000 -W password -r

Run the following statement to query table data:

select * from test.dws_order;

The query result is as follows:

order_id                   order_channel     order_time             pay_amount  real_pay  pay_time              user_id  user_name  area_id
202103251606060001         appShop         2021-03-24 12:06:06       200.0      180.0   2021-03-24 16:10:06      0002      Jason     330106
202103261000000001         webShop         2021-03-24 14:03:00       300.0      100.0   2021-03-24 10:02:03      0003      Lily      330106
