Updated on 2024-11-29 GMT+08:00

Submitting Flink on Hudi Jobs

Parameters

Table 1 Parameters for submitting a job

Parameter

Description

Recommended Value

Description

-c

Main class name

Set this parameter as needed.

Mandatory

-ynm

Flink YARN job name

Set this parameter as needed.

Mandatory

execution.checkpointing.interval

Interval for triggering checkpointing, in milliseconds

60000

Mandatory. Use -yD to pass this parameter.

execution.checkpointing.timeout

Checkpoint timeout, in minutes. The default value is 30 minutes.

30min

Mandatory. Use -yD to pass this parameter.

parallelism.default

Parallelism of a job, for example, a join job. The default value is 1.

Set this parameter as needed.

Mandatory. Use -yD to pass this parameter.

table.exec.state.ttl

Flink status TTL (join TTL). The default value is 0.

Set this parameter as needed.

Mandatory. Use -yD to pass this parameter.

Development Suggestions

  • The checkpoint interval must be greater than the checkpoint execution duration.

    The checkpoint execution duration depends on the checkpoint data volume. The larger the data volume, the longer the execution duration.

    The checkpoint execution duration depends on the number of partitions. The more the partitions, the longer the execution duration.

  • The checkpoint timeout must be greater than the checkpoint interval.

    The checkpoint interval indicates the interval for triggering a checkpoint operation. If a checkpoint operation takes longer time than the checkpoint timeout, the job fails.

  • If CDC is used, Changelog needs to be enabled for Hudi table read and write.

    To ensure Flink calculation accuracy when CDC is used, retain +I, +U, -U, and -D in Hudi tables. Changelog must be enabled when data is written to or read from the same Hudi table.

  • Hudi tables of the COW type are used when data is ingested to the lake in batches.

    Copy-on-write tables use Parquet columnar storage only. Files to be updated are rewritten by merging versions synchronously during writing. Less storage space is required. However, synchronous data rewriting increases the data update cost and read latency. Therefore, COW tables are not suitable for real-time data ingestion.