Updated on 2024-08-30 GMT+08:00

Incremental File Migration

CDM supports incremental migration of file systems. After a full migration is complete, you can export all new files, or only specified directories or files.

Currently, CDM supports the following incremental migration modes:

  1. Exporting the files in a specified directory
    • Application scenarios: The migration source is a file system (OBS/HDFS/FTP/SFTP). In incremental migration, only the specified files are written to the migration destination. The existing records are not updated or deleted.
    • Key configurations: File/Path Filter and Schedule Execution
    • Prerequisites: The source directory or file name contains a time field.
  2. Exporting the files modified after the specified time point
    • Application scenarios: The migration source is a file system (OBS/HDFS/FTP/SFTP). The specified time point refers to the time when the file is modified. CDM migrates the files modified at or after the specified time point.
    • Key configurations: Time Filter and Schedule Execution
    • Prerequisites: None

If you have configured a date and time macro variable and schedule the CDM job through DataArts Factory of DataArts Studio, the system replaces the macro variable with (Planned start time of the data development job - Offset) rather than (Actual start time of the CDM job - Offset).
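The difference matters when a scheduled run starts late. The following minimal Python sketch resolves a date macro against both start times. The helper resolve_date_macro is hypothetical and only mimics ${dateformat(yyyyMMdd,-1,DAY)}; it is not CDM's implementation, and the sample times are invented. It assumes the macro's offset (-1) is a signed number of days applied to the base time.

```python
from datetime import datetime, timedelta


def resolve_date_macro(base_time: datetime, offset_days: int) -> str:
    """Hypothetical stand-in for ${dateformat(yyyyMMdd,<offset>,DAY)}.

    Assumption: the offset is a signed day count applied to base_time.
    """
    return (base_time + timedelta(days=offset_days)).strftime("%Y%m%d")


# Invented example: the run was planned just before midnight
# but actually started just after midnight.
planned_start = datetime(2017, 10, 16, 23, 50)
actual_start = datetime(2017, 10, 17, 0, 5)

from_planned = resolve_date_macro(planned_start, -1)
from_actual = resolve_date_macro(actual_start, -1)

print(from_planned)  # 20171015
print(from_actual)   # 20171016
```

Under these assumptions, a job scheduled through DataArts Factory would resolve the macro to 20171015 (based on the planned start time), even though the run actually began on the next calendar day.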

File/Path Filter

  • Parameter position: When creating a table/file migration job, if the migration source is a file system, set Filter Type in advanced attributes of Source Job Configuration to Wildcard or Regular expression.
  • Parameter principle: If you select Wildcard for Filter Type, CDM filters files or paths based on the configured wildcard character and migrates only files or paths that meet the specified condition.
  • Example configurations:
    Suppose that the source file name contains a date and time field. For example, a file generated at 2017-10-15 20:25:26 is named /opt/data/file_20171015202526.data. Set the parameters as follows:
    1. Filter Type: Select Wildcard.
    2. File Filter: Enter "*${dateformat(yyyyMMdd,-1,DAY)}*", which is the format of the macro variables of date and time supported by CDM. For details, see Using Macro Variables of Date and Time.
      Figure 1 Filtering files
    3. Schedule Execution: Set Cycle (days) to 1.

In this way, you can import the files generated in the previous day to the destination directory every day to implement incremental synchronization.
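The effect of this configuration can be approximated with a short Python sketch. It is an illustration only, not how CDM evaluates filters: build_file_filter and select_files are hypothetical helpers, the file list is invented, and Python's fnmatch wildcard rules are assumed to be close enough to CDM's Wildcard filter for a plain * pattern.

```python
from datetime import datetime, timedelta
from fnmatch import fnmatchcase


def build_file_filter(run_time: datetime, offset_days: int = -1) -> str:
    """Mimic "*${dateformat(yyyyMMdd,-1,DAY)}*" for a given run time."""
    date_part = (run_time + timedelta(days=offset_days)).strftime("%Y%m%d")
    return f"*{date_part}*"


def select_files(names, pattern):
    """Keep only the names that match the wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Invented source listing; only the middle file was generated on 2017-10-15.
source_files = [
    "file_20171014235959.data",
    "file_20171015202526.data",
    "file_20171016000001.data",
]

# A daily run on 2017-10-16 picks up the files from the previous day.
pattern = build_file_filter(datetime(2017, 10, 16, 0, 0))
selected = select_files(source_files, pattern)

print(pattern)   # *20171015*
print(selected)  # ['file_20171015202526.data']
```

Because the pattern is rebuilt from the run date on every cycle, each daily run selects a different one-day slice of the source files.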

In incremental file migration, Path Filter is used in the same way as File Filter. The path name must contain the time field. In this case, all files in the specified path can be synchronized periodically.

Time Filter

  • Parameter position: When creating a table/file migration job, if the migration source is a file system, select Yes for Time Filter.
  • Parameter principle: After you specify the start time and end time, only files that are modified between the start time (included) and end time (excluded) will be migrated.
  • Example configurations:
    For example, if you want CDM to synchronize only the files modified from January 1, 2021 (included) to January 1, 2022 (excluded) to the destination, configure the following parameters:
    1. Time Filter: select Yes.
    2. Minimum Timestamp: Enter a value in the format of yyyy-MM-dd HH:mm:ss, such as 2021-01-01 00:00:00.
    3. Maximum Timestamp: Enter a value in the format of yyyy-MM-dd HH:mm:ss, such as 2022-01-01 00:00:00.
    Figure 2 Time Filter

In this way, the CDM job migrates only the files modified from January 1, 2021 (included) to January 1, 2022 (excluded), and performs incremental synchronization the next time it is started.
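The boundary behavior of Time Filter (start time included, end time excluded) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: in_time_window is a hypothetical helper, the file names and modification times are invented, and the timestamps are parsed with the yyyy-MM-dd HH:mm:ss format used above.

```python
from datetime import datetime

# Python equivalent of the yyyy-MM-dd HH:mm:ss format.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def in_time_window(modified: datetime, minimum: str, maximum: str) -> bool:
    """Return True if minimum <= modified < maximum (half-open window)."""
    start = datetime.strptime(minimum, TIMESTAMP_FORMAT)
    end = datetime.strptime(maximum, TIMESTAMP_FORMAT)
    return start <= modified < end


# Invented files with their last-modified times.
files = {
    "a.log": datetime(2020, 12, 31, 23, 59, 59),  # before the window
    "b.log": datetime(2021, 1, 1, 0, 0, 0),       # exactly at the start: included
    "c.log": datetime(2021, 6, 15, 12, 30, 0),    # inside the window
    "d.log": datetime(2022, 1, 1, 0, 0, 0),       # exactly at the end: excluded
}

migrated = [
    name
    for name, mtime in files.items()
    if in_time_window(mtime, "2021-01-01 00:00:00", "2022-01-01 00:00:00")
]

print(migrated)  # ['b.log', 'c.log']
```

A half-open window means that consecutive windows can share a boundary timestamp without migrating the same file twice.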