Submitting Flink on Hudi Jobs
Parameters
Parameter | Description | Recommended Value | Remarks
---|---|---|---
-c | Main class name | Set this parameter as needed. | Mandatory
-ynm | Flink YARN job name | Set this parameter as needed. | Mandatory
execution.checkpointing.interval | Interval for triggering checkpointing, in milliseconds | 60000 | Mandatory. Use -yD to pass this parameter.
execution.checkpointing.timeout | Checkpoint timeout. The default value is 30 minutes. | 30min | Mandatory. Use -yD to pass this parameter.
parallelism.default | Default parallelism of a job, for example, a join job. The default value is 1. | Set this parameter as needed. | Mandatory. Use -yD to pass this parameter.
table.exec.state.ttl | Flink state TTL (for example, join state TTL). The default value is 0, meaning state never expires. | Set this parameter as needed. | Mandatory. Use -yD to pass this parameter.
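The parameters above are combined into a single submission command; the options marked "Use -yD" are passed as `-yD key=value` pairs. The following is a sketch only: the main class, job name, JAR path, and parallelism value are hypothetical placeholders.

```shell
# Hypothetical submission command: replace the class name, job name,
# JAR path, and parallelism with values for your own job.
# Checkpointing and state options from the table are passed with -yD.
flink run -m yarn-cluster \
  -c com.example.HudiWriteJob \
  -ynm hudi-write-job \
  -yD execution.checkpointing.interval=60000 \
  -yD execution.checkpointing.timeout=30min \
  -yD parallelism.default=4 \
  -yD table.exec.state.ttl=0 \
  /opt/jobs/hudi-write-job.jar
```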
Development Suggestions
- The checkpoint interval must be greater than the checkpoint execution duration.
The execution duration depends on the checkpoint data volume: the larger the volume, the longer a checkpoint takes. It also grows with the number of partitions: the more partitions there are, the longer a checkpoint takes.
- The checkpoint timeout must be greater than the checkpoint interval.
The checkpoint interval controls how often a checkpoint operation is triggered. If a single checkpoint operation takes longer than the checkpoint timeout, the job fails.
- If CDC is used, enable Changelog for both reading and writing the Hudi table.
To keep Flink computation accurate when CDC is used, the +I, +U, -U, and -D change records must be retained in the Hudi table. Changelog must therefore be enabled both when writing to and when reading from the same Hudi table.
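As a sketch, changelog retention is controlled by the Hudi Flink table option `changelog.enabled`; the table name, path, and columns below are hypothetical. The same option must be set on the table definition used by both the writing and the reading job.

```sql
-- Hypothetical table definition; 'changelog.enabled' = 'true' retains
-- the +I/+U/-U/-D change records needed for accurate CDC computation.
CREATE TABLE hudi_cdc_table (
  id   INT PRIMARY KEY NOT ENFORCED,
  name STRING,
  ts   TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi_cdc_table',
  'table.type' = 'MERGE_ON_READ',
  'changelog.enabled' = 'true'
);
```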
- Use Hudi tables of the COW (copy-on-write) type when data is ingested to the lake in batches.
Copy-on-write tables store data only in Parquet columnar files; files containing updated records are rewritten synchronously during writing by merging versions. This keeps storage usage low, but the synchronous rewriting increases the data update cost and write latency. Therefore, COW tables are not suitable for real-time data ingestion.
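A COW table for batch ingestion is selected through the Hudi Flink table option `table.type`. The sketch below uses hypothetical table, path, and column names.

```sql
-- Hypothetical batch-ingestion table stored as copy-on-write
-- (Parquet files only, versions merged synchronously on write).
CREATE TABLE hudi_batch_table (
  id   INT PRIMARY KEY NOT ENFORCED,
  name STRING,
  ts   TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi_batch_table',
  'table.type' = 'COPY_ON_WRITE'
);
```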