Kafka Incremental Extraction

Overview

Kafka incremental extraction refers to extracting data from Kafka within a specified time range to achieve periodic data synchronization. This policy is suitable for periodically synchronizing data in Kafka to other storage systems (such as a Hive data lake). By properly configuring the start time, end time, and scheduling period for extracting Kafka data, you can create a task that synchronizes incremental data periodically and efficiently.

Scenarios

Common scenarios include but are not limited to the following:

Hourly synchronization: New data in Kafka is synchronized to a Hive data lake every hour.
Daily synchronization: New data in Kafka is synchronized to a Hive data lake every day.
More granular periodic synchronization: You can set a more granular interval (for example, every 15 minutes) for synchronizing data.

Procedure

Configure job parameter variables.
- Start time: Enter startTime:#{DateUtil.format(DateUtil.addHours(Job.planTime,-1),"yyyy-MM-dd HH:mm:ss")}, which indicates one hour before the task scheduling time.
- End time: Enter endTime:#{DateUtil.format(Job.planTime,"yyyy-MM-dd HH:mm:ss")}, which indicates the task scheduling time.
Figure 1 Configuring job parameter variables
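
The two expressions above resolve to a pair of formatted timestamps derived from the planned scheduling time. The following Python sketch only mirrors that arithmetic for illustration; it is not the platform's expression engine. The function name extraction_window and the sample plan time are hypothetical, and "%Y-%m-%d %H:%M:%S" is assumed to be the Python equivalent of the "yyyy-MM-dd HH:mm:ss" pattern.

```python
from datetime import datetime, timedelta

# Assumed Python equivalent of the "yyyy-MM-dd HH:mm:ss" pattern used in the job parameters.
FMT = "%Y-%m-%d %H:%M:%S"


def extraction_window(plan_time: datetime, hours_back: int = 1) -> dict:
    """Mimic startTime/endTime: start is hours_back before the plan time, end is the plan time."""
    start = plan_time - timedelta(hours=hours_back)
    return {
        "startTime": start.strftime(FMT),
        "endTime": plan_time.strftime(FMT),
    }


# Hypothetical run planned for 10:00 on 2024-05-01.
window = extraction_window(datetime(2024, 5, 1, 10, 0, 0))
print(window)
```

For a run planned at 10:00:00, the sketch yields a start time of 09:00:00 and an end time of 10:00:00 on the same day.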
Configure a Kafka read task.
Set Consumption Record Policy to Time Range, Start Time to ${startTime}, and End Time to ${endTime}.

Figure 2 Configuring a Kafka read task
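
Conceptually, the Time Range policy selects only the messages whose timestamps fall between the start time and the end time. The sketch below illustrates that idea on in-memory records and does not reflect how the read task is implemented. It assumes that record timestamps are epoch milliseconds, that the configured times are interpreted as UTC, and that the end time is exclusive; none of these details is stated in this document. The names to_epoch_ms and in_window are hypothetical.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M:%S"


def to_epoch_ms(text: str) -> int:
    """Parse a formatted time as UTC (an assumption) and return epoch milliseconds."""
    dt = datetime.strptime(text, FMT).replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


def in_window(records, start_time: str, end_time: str):
    """Keep records with start <= timestamp < end (half-open range, an assumption)."""
    lo, hi = to_epoch_ms(start_time), to_epoch_ms(end_time)
    return [r for r in records if lo <= r["timestamp"] < hi]


# Hypothetical records standing in for Kafka messages.
records = [
    {"timestamp": to_epoch_ms("2024-05-01 08:59:59"), "value": "too early"},
    {"timestamp": to_epoch_ms("2024-05-01 09:00:00"), "value": "first"},
    {"timestamp": to_epoch_ms("2024-05-01 09:30:00"), "value": "middle"},
    {"timestamp": to_epoch_ms("2024-05-01 10:00:00"), "value": "next run"},
]

selected = in_window(records, "2024-05-01 09:00:00", "2024-05-01 10:00:00")
print([r["value"] for r in selected])
```

Under the half-open assumption, a message stamped exactly at the end time is left for the next run, so it is not extracted twice.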
Configure a scheduling policy.
Set Scheduling Frequency to Hours. The task then runs once every hour, and each run extracts the Kafka messages generated in the hour before its scheduling time.

Figure 3 Configuring the scheduling period
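
The scheduling period and the start-time offset need to match so that the windows of consecutive runs line up without gaps or overlaps. The following Python sketch checks that property for a few hypothetical schedules. The helper names run_windows and is_contiguous are illustrative only, and the sketch assumes each run's window ends at its planned scheduling time.

```python
from datetime import datetime, timedelta


def run_windows(first_plan_time, period, lookback, runs):
    """Return (start, end) pairs for consecutive runs spaced `period` apart."""
    out = []
    for i in range(runs):
        plan_time = first_plan_time + i * period
        out.append((plan_time - lookback, plan_time))
    return out


def is_contiguous(wins):
    """True when each window starts exactly where the previous one ended."""
    return all(a[1] == b[0] for a, b in zip(wins, wins[1:]))


first = datetime(2024, 5, 1, 10, 0, 0)

# Hourly schedule with a one-hour lookback, as configured in this procedure.
hourly = run_windows(first, timedelta(hours=1), timedelta(hours=1), 3)
# A 15-minute schedule needs a 15-minute lookback to stay contiguous.
quarter = run_windows(first, timedelta(minutes=15), timedelta(minutes=15), 4)
# Hourly schedule with only a 30-minute lookback leaves gaps between runs.
mismatch = run_windows(first, timedelta(hours=1), timedelta(minutes=30), 3)

print(is_contiguous(hourly), is_contiguous(quarter), is_contiguous(mismatch))
```

If the interval is changed, for example to every 15 minutes, the start-time expression would need a matching 15-minute offset; otherwise some messages are skipped or read twice.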

Summary

By setting the start time and end time with job parameter variables and matching them to the scheduling period, you can create a task that periodically synchronizes incremental data from Kafka to other storage systems, such as a Hive data lake. Because each run reads only the messages in its own time range, data that has already been synchronized is not extracted again. Adjust the time range and scheduling period based on your requirements and environment.
