Updated on 2024-11-29 GMT+08:00

Synchronizing Open-source Debezium JSON Data

Debezium is an open-source distributed platform for change data capture (CDC). It records row-level changes to each table as event streams. Debezium is built on top of Kafka and provides a set of connectors compatible with Kafka Connect. Each connector captures change events from a specific database and sends the event streams to Kafka topics. CDL can process JSON-format create (c), update (u), and delete (d) event messages captured by version 1.4.0 of the Debezium connectors for MySQL, PostgreSQL, and Oracle databases.
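As an illustration of the create (c), update (u), and delete (d) messages described above, the following minimal sketch reads one Debezium JSON event and picks out the operation and the affected row. It assumes the standard Debezium event envelope with the op, before, and after fields. The helper name summarize_event and the sample row are hypothetical, and this is not the parsing logic used by CDL.

```python
import json

# Operation codes carried in the "op" field of a Debezium change event.
OPERATIONS = {"c": "create", "u": "update", "d": "delete"}


def summarize_event(raw_message: str) -> dict:
    """Return the operation name and the relevant row image of one event.

    Hypothetical helper for illustration only.
    """
    event = json.loads(raw_message)
    # When the JSON converter embeds schemas, the event body sits under "payload".
    payload = event.get("payload", event)
    op = payload["op"]
    if op not in OPERATIONS:
        raise ValueError(f"unsupported operation code: {op!r}")
    # A delete carries the old row in "before"; create and update carry the new row in "after".
    row = payload["before"] if op == "d" else payload["after"]
    return {"operation": OPERATIONS[op], "row": row}


# Sample event with made-up values.
sample = json.dumps({
    "before": None,
    "after": {"id": 1, "name": "alice"},
    "source": {"connector": "mysql", "db": "demo", "table": "users"},
    "op": "c",
    "ts_ms": 1700000000000,
})
print(summarize_event(sample))
```

Other operation codes, such as the one Debezium uses for snapshot reads, are rejected in this sketch because the text above lists only c, u, and d as supported.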

Database Data Types and Spark (Hudi) Data Types

To write data to Hudi by consuming database change event messages in Debezium JSON format, see 2.6.3.12-Synchronizing Debezium JSON Data from ThirdKafka to Hudi. The following tables list the supported database data types and their mappings to Spark data types.

Table 1 Mapping between PostgreSQL and Spark (Hudi) data types

| PostgreSQL | Spark (Hudi) |
| --- | --- |
| int2 | int |
| int4 | int |
| int8 | bigint |
| numeric[p, s] | decimal[p,s] if decimal.handling.mode of the Debezium connector is precise (default value); string if it is string; double if it is double |
| bool | boolean |
| char | string |
| varchar | string |
| text | string |
| timestamptz | timestamp |
| timestamp | timestamp |
| date | date |
| json, jsonb | string |
| float4 | float |
| float8 | double |
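The decimal row of Table 1 depends on how the connector encodes the value. As a sketch of what the precise mode typically looks like on the consumer side, the example below assumes the value arrives as a base64 string that holds the unscaled integer in big-endian two's-complement form, with the scale supplied separately (this is how the Kafka Connect JSON converter usually represents its Decimal logical type). The helper name decode_precise_decimal is hypothetical.

```python
import base64
from decimal import Decimal


def decode_precise_decimal(encoded: str, scale: int) -> Decimal:
    """Decode a base64-encoded unscaled integer into a Decimal.

    Hypothetical helper; assumes big-endian two's-complement bytes
    and a scale taken from the field's schema parameters.
    """
    raw = base64.b64decode(encoded)
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    # Shift the decimal point left by `scale` digits.
    return Decimal(unscaled).scaleb(-scale)


# "MDk=" is the base64 form of the bytes 0x30 0x39, that is, the integer 12345.
print(decode_precise_decimal("MDk=", 2))  # 123.45
```

In the string and double modes no decoding of this kind is needed, because the value is already a JSON string or a JSON number.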

Table 2 Mapping between MySQL and Spark (Hudi) data types

| MySQL | Spark (Hudi) |
| --- | --- |
| int | int |
| integer | int |
| bigint | bigint |
| double | double |
| decimal[p,s] | decimal[p,s] if decimal.handling.mode of the Debezium connector is precise (default value); string if it is string; double if it is double |
| varchar | string |
| char | string |
| text | string |
| timestamp | timestamp |
| datetime | timestamp |
| date | date |
| json | string |
| float | double |
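The decimal mapping in Table 2 is selected by the decimal.handling.mode property of the Debezium connector. The following sketch builds a Kafka Connect registration payload for the Debezium MySQL connector with that property set. The property names follow the Debezium 1.4 MySQL connector. Every host name, credential, topic, and table name is a placeholder, and the function build_mysql_connector_config is hypothetical.

```python
import json

# Values accepted for decimal.handling.mode, as listed in the table above.
VALID_DECIMAL_MODES = {"precise", "string", "double"}


def build_mysql_connector_config(decimal_mode: str = "precise") -> dict:
    """Return a connector registration payload (all values are placeholders)."""
    if decimal_mode not in VALID_DECIMAL_MODES:
        raise ValueError(f"invalid decimal.handling.mode: {decimal_mode!r}")
    return {
        "name": "demo-mysql-connector",  # placeholder connector name
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": "mysql.example.internal",  # placeholder
            "database.port": "3306",
            "database.user": "debezium",  # placeholder
            "database.password": "change-me",  # placeholder
            "database.server.id": "184054",  # placeholder
            "database.server.name": "demo",  # placeholder
            "table.include.list": "demo.orders",  # placeholder
            "database.history.kafka.bootstrap.servers": "kafka.example.internal:9092",
            "database.history.kafka.topic": "schema-changes.demo",
            # Controls whether DECIMAL columns arrive as decimal, string, or double.
            "decimal.handling.mode": decimal_mode,
        },
    }


print(json.dumps(build_mysql_connector_config("string"), indent=2))
```

With the string mode chosen here, a MySQL decimal[p,s] column would be written to Hudi as a string, as the table indicates.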

Table 3 Mapping between Oracle and Spark (Hudi) data types

| Oracle | Spark (Hudi) |
| --- | --- |
| NUMBER(1,0) | boolean |
| NUMBER(P, 0), P in [2, 9] | int |
| NUMBER(P, 0), P in [10, 18] | bigint |
| NUMBER(P, 0), P >= 19; NUMBER(P, S > 0); NUMBER[(P)] | decimal if decimal.handling.mode of the Debezium connector is precise (default value); string if it is string; double if it is double |
| FLOAT | decimal |
| BINARY_DOUBLE | double |
| CHAR | string |
| VARCHAR | string |
| TIMESTAMP | timestamp |
| TIMESTAMP WITH TIME ZONE | timestamp |
| DATE | timestamp |
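The NUMBER rows of Table 3 amount to a small decision rule on precision, scale, and decimal.handling.mode. The sketch below restates that rule as a function so the boundaries are easy to check. The function name oracle_number_to_spark is hypothetical. A scale of None stands for a NUMBER declared without an explicit scale, which this sketch treats like the NUMBER[(P)] entry, and negative scales are rejected because the table does not cover them.

```python
from typing import Optional

# Spark type chosen for high-precision or fractional NUMBER columns, per mode.
MODE_TO_SPARK = {"precise": "decimal", "string": "string", "double": "double"}


def oracle_number_to_spark(
    precision: Optional[int] = None,
    scale: Optional[int] = None,
    decimal_mode: str = "precise",
) -> str:
    """Map an Oracle NUMBER declaration to a Spark (Hudi) type name.

    Hypothetical helper that mirrors the NUMBER rows of Table 3.
    """
    if decimal_mode not in MODE_TO_SPARK:
        raise ValueError(f"invalid decimal.handling.mode: {decimal_mode!r}")
    if scale is not None and scale < 0:
        raise ValueError("negative scale is not covered by the table")
    if precision is not None and scale == 0:
        if precision == 1:
            return "boolean"
        if 2 <= precision <= 9:
            return "int"
        if 10 <= precision <= 18:
            return "bigint"
    # P >= 19 with scale 0, any S > 0, or no explicit scale.
    return MODE_TO_SPARK[decimal_mode]


print(oracle_number_to_spark(9, 0))             # int
print(oracle_number_to_spark(19, 0))            # decimal
print(oracle_number_to_spark(10, 2, "string"))  # string
```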