Updated on 2022-09-22 GMT+08:00

From HDFS

When the source link of a job is a Link to HDFS, that is, when data is exported from MRS HDFS, FusionInsight HDFS, or Apache HDFS, configure the source job parameters as described in Table 1.

Table 1 Parameter description

Category | Parameter | Description | Example Value

Basic parameters

Source Link Name

Select the source link from the drop-down list box.

hdfs_to_cdm

Source Directory/File

This parameter is available only when Pull List File is set to No.

Directory or file path from which data will be extracted.

This parameter can contain macro variables of date and time, and a single path name can contain multiple macro variables. When a macro variable of date and time is used with a scheduled job, incremental data can be synchronized periodically. For details, see Incremental Synchronization Using the Macro Variables of Date and Time.

NOTE:

If you have configured a macro variable of date and time and schedule a CDM job through DataArts Factory of DataArts Studio, the system replaces the macro variable with (Planned start time of the data development job - Offset) rather than (Actual start time of the CDM job - Offset).

/user/cdm/

File Format

File format used when transferring data. The options are as follows:
  • CSV: Source files will be migrated to tables after being converted to CSV format.
  • Binary: Files (even those not in binary format) are transferred as they are, without being parsed. Use this option for file copy.
  • Parquet: Source files will be migrated to tables after being converted to Parquet format.

CSV

Pull List File

This parameter is displayed only when File Format is set to Binary.

If the pull list file function is enabled, the content of a file (such as a .txt file) in an OBS bucket is read as the list of files to be migrated. Each entry in the file must be the absolute path of a file to be migrated (not a directory). The following is example content:
/mrs/job-properties/application_1634891604621_0014/job.properties
/mrs/job-properties/application_1634891604621_0029/job.properties

Yes
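Before uploading a list file, it can help to check that every entry looks like an absolute file path. The following is a minimal Python sketch, not part of CDM: the function name `check_list_entries` is hypothetical, and treating a trailing slash as "this is a directory" is only a heuristic assumed for illustration.

```python
from typing import List, Tuple


def check_list_entries(text: str) -> Tuple[List[str], List[str]]:
    """Split list-file content into usable entries and rejected lines."""
    accepted: List[str] = []
    rejected: List[str] = []
    for raw in text.splitlines():
        entry = raw.strip()
        if not entry:
            continue  # ignore blank lines
        # Heuristic (assumption): an entry must start with "/" and must not
        # end with "/", because it has to name a file rather than a directory.
        if entry.startswith("/") and not entry.endswith("/"):
            accepted.append(entry)
        else:
            rejected.append(entry)
    return accepted, rejected


sample = (
    "/mrs/job-properties/application_1634891604621_0014/job.properties\n"
    "/mrs/job-properties/application_1634891604621_0029/job.properties\n"
)
accepted, rejected = check_list_entries(sample)
print(len(accepted), "entries accepted;", len(rejected), "rejected")
```

A check like this only inspects the text of the list file. It does not confirm that the listed files actually exist in HDFS.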

OBS Link of List File

This parameter is available only when Pull List File is set to Yes. You can select the OBS link where the list file is located.

OBS_test_link

OBS Bucket of entries files

This parameter is available only when Pull List File is set to Yes. It indicates the name of the OBS bucket where the list file is located.

01

Path/Directory of entries files

This parameter is available only when Pull List File is set to Yes. It indicates the absolute path or directory of the list file in the OBS bucket.

/0521/Lists.txt

Advanced attributes

Line Separator

Line feed character in a file. By default, the system automatically identifies \n, \r, and \r\n. This parameter is displayed only when File Format is set to CSV.

\n

Field Delimiter

Character used to separate fields in the file. To set the Tab key as the delimiter, set this parameter to \t. This parameter is displayed only when File Format is set to CSV.

,

Use First Row as Header

This parameter is displayed only when File Format is set to CSV. When you migrate a CSV file to a table, CDM writes all rows to the table by default. If you set this parameter to Yes, CDM treats the first row of the CSV file as the header row and does not write it to the destination table.

No
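The combined effect of Line Separator, Field Delimiter, and Use First Row as Header can be illustrated with Python's standard csv module. This is a sketch of the described behavior under stated assumptions, not CDM's parser: the function `read_rows` and its parameter names are hypothetical.

```python
import csv
import io
from typing import List


def read_rows(text: str, field_delimiter: str = ",",
              first_row_is_header: bool = False) -> List[List[str]]:
    """Return the rows that would be written to the destination table."""
    # newline="" leaves \n, \r, and \r\n untouched so that the csv module
    # can recognize any of them as a line ending.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=field_delimiter)
    rows = [row for row in reader if row]  # skip empty lines
    # With the header option on, the first row is not written as data.
    return rows[1:] if first_row_is_header else rows


# Tab-delimited sample that mixes all three line endings.
data = "id\tname\r\n1\talice\n2\tbob\r"
print(read_rows(data, field_delimiter="\t", first_row_is_header=True))
```

With the header option off, the same call returns three rows, the first of which is the header text itself.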

Source File Processing Method

Operation performed on source files after the job completes.
  • No action
  • Rename: After the job completes, the source files are renamed by appending usernames and timestamps as suffixes to the file names.
  • Delete: After the job completes, the source files are deleted.

No action

Start Job by Marker File

Whether to start the job based on a marker file. The job starts only if the marker file exists in the source path. If the marker file does not exist, the job is suspended for the period specified by Suspension Period.

ok.txt
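The marker-file behavior can be pictured as a bounded wait. The Python sketch below is an illustration only and makes two assumptions that the description does not confirm: that the marker is checked repeatedly at a fixed interval, and that the wait gives up once the suspension period has elapsed. The function `wait_for_marker` and all of its parameters are hypothetical.

```python
import time
from typing import Callable


def wait_for_marker(marker_exists: Callable[[], bool],
                    suspension_period_s: float,
                    poll_interval_s: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True once the marker is seen, False if the period runs out."""
    waited = 0.0
    while True:
        if marker_exists():
            return True  # marker file found: the job may start
        if waited >= suspension_period_s:
            return False  # assumption: stop waiting after the period
        sleep(poll_interval_s)
        waited += poll_interval_s


# Simulated source path in which ok.txt appears on the third check.
checks = iter([False, False, True])
started = wait_for_marker(lambda: next(checks), suspension_period_s=10,
                          sleep=lambda seconds: None)
print("job started:", started)
```

Passing `sleep` as a parameter keeps the example fast to run. A real check would query the source path for the marker file instead of reading from a prepared sequence.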

Filter Type

Only paths or files that meet the filtering conditions are transferred. The options are None, Wildcard, and Regex. For details, see Incremental File Migration.

-

Path Filter

If you set Filter Type to Wildcard, enter a wildcard character to filter paths. The paths that meet the filtering condition are migrated. You can configure multiple paths separated by commas (,).

NOTE:

If you have configured a macro variable of date and time and schedule a CDM job through DataArts Factory of DataArts Studio, the system replaces the macro variable with (Planned start time of the data development job - Offset) rather than (Actual start time of the CDM job - Offset).

*input

File Filter

If you set Filter Type to Wildcard, you can enter a wildcard character to search for files in a specified path. The files that meet the search criteria are migrated. You can configure multiple files separated by commas (,).

NOTE:

If you have configured a macro variable of date and time and schedule a CDM job through DataArts Factory of DataArts Studio, the system replaces the macro variable with (Planned start time of the data development job - Offset) rather than (Actual start time of the CDM job - Offset).

*.csv
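The comma-separated wildcard filters used by Path Filter and File Filter can be approximated with Python's fnmatch module. This is a sketch under the assumption that CDM wildcards behave like shell-style patterns, which the description does not spell out. The helper names `split_patterns` and `select` are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def split_patterns(value: str) -> List[str]:
    """Split a comma-separated filter value into individual patterns."""
    return [part.strip() for part in value.split(",") if part.strip()]


def select(names: Iterable[str], filter_value: str) -> List[str]:
    """Keep the names that match at least one of the patterns."""
    patterns = split_patterns(filter_value)
    return [name for name in names
            if any(fnmatchcase(name, pattern) for pattern in patterns)]


print(select(["raw_input", "output", "test_input"], "*input"))
print(select(["a.csv", "b.txt", "c.log"], "*.csv, *.txt"))
```

`fnmatchcase` is used instead of `fnmatch` so that matching stays case-sensitive on every platform.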

Time Filter

If you select Yes, files are transferred based on their modification time.

Yes

Minimum Timestamp

If you set Time Filter to Yes and specify a point in time for this parameter, only the files modified after the specified time are transferred. The time format must be yyyy-MM-dd HH:mm:ss.

This parameter can be set to a macro variable of date and time. For example, ${timestamp(dateformat(yyyy-MM-dd HH:mm:ss,-90,DAY))} indicates that only files generated within the latest 90 days are migrated.

NOTE:

If you have configured a macro variable of date and time and schedule a CDM job through DataArts Factory of DataArts Studio, the system replaces the macro variable with (Planned start time of the data development job - Offset) rather than (Actual start time of the CDM job - Offset).

2019-07-01 00:00:00

Maximum Timestamp

If you set Time Filter to Yes and specify a point in time for this parameter, only the files modified before the specified time are transferred. The time format must be yyyy-MM-dd HH:mm:ss.

This parameter can be set to a macro variable of date and time. For example, ${timestamp(dateformat(yyyy-MM-dd HH:mm:ss))} indicates that only the files whose modification time is earlier than the current time are migrated.

NOTE:

If you have configured a macro variable of date and time and schedule a CDM job through DataArts Factory of DataArts Studio, the system replaces the macro variable with (Planned start time of the data development job - Offset) rather than (Actual start time of the CDM job - Offset).

2019-07-30 00:00:00
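The time window defined by Minimum Timestamp and Maximum Timestamp can be sketched in Python as follows. The sketch reads "after" and "before" as strict comparisons, which is an assumption, and `filter_by_mtime` and `days_before` are hypothetical helpers rather than CDM functions. The second helper mirrors the idea of a 90-day offset computed from a reference time such as the planned start time.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # Python equivalent of yyyy-MM-dd HH:mm:ss


def filter_by_mtime(files: Dict[str, datetime],
                    minimum: Optional[str] = None,
                    maximum: Optional[str] = None) -> List[str]:
    """Keep files modified after `minimum` and before `maximum`."""
    lower = datetime.strptime(minimum, TIME_FORMAT) if minimum else None
    upper = datetime.strptime(maximum, TIME_FORMAT) if maximum else None
    selected = []
    for name, modified in files.items():
        if lower is not None and modified <= lower:
            continue  # not modified after the minimum timestamp
        if upper is not None and modified >= upper:
            continue  # not modified before the maximum timestamp
        selected.append(name)
    return selected


def days_before(reference: datetime, days: int) -> str:
    """Format the point in time `days` days before `reference`."""
    return (reference - timedelta(days=days)).strftime(TIME_FORMAT)


files = {
    "june.csv": datetime(2019, 6, 30, 12, 0, 0),
    "july.csv": datetime(2019, 7, 15, 0, 0, 0),
    "august.csv": datetime(2019, 8, 1, 0, 0, 0),
}
print(filter_by_mtime(files, "2019-07-01 00:00:00", "2019-07-30 00:00:00"))
print(days_before(datetime(2019, 9, 29), 90))
```

Only `july.csv` falls inside the window given by the two example values.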

Create Snapshot

If you set this parameter to Yes, CDM creates a snapshot for the source directory to be migrated (the snapshot cannot be created for a single file) before it reads files from HDFS. Then CDM migrates the data in the snapshot.

Only the HDFS administrator can create a snapshot. After the CDM job is completed, the snapshot is deleted.

No

Encryption

This parameter is displayed only when File Format is set to Binary.

If the source data is encrypted, CDM can decrypt the data before exporting it. Select whether to decrypt the source data and select a decryption algorithm. The options are as follows:
  • NONE: Export data without decrypting it.
  • AES-256-GCM: Decrypt the data using the AES 256-bit algorithm. Currently, only AES-256-GCM (NoPadding) is supported. This algorithm is used for encryption at the migration destination and for decryption at the migration source.

For details, see Encryption and Decryption During File Migration.

AES-256-GCM

DEK

This parameter is displayed only when Encryption is set to AES-256-GCM. The key consists of 64 hexadecimal characters and must be the same as the DEK configured during encryption. If the decryption and encryption keys do not match, the system does not report an exception, but the decrypted data is incorrect.

DD0AE00DFECD78BF051BCFDA25BD4E320DB0A7AC75A1F3FC3D3C56A457DCDC1B

IV

This parameter is displayed only when Encryption is set to AES-256-GCM. The initialization vector consists of 32 hexadecimal characters and must be the same as the IV configured during encryption. If the initialization vectors do not match, the system does not report an exception, but the decrypted data is incorrect.

5C91687BA886EDCD12ACBC3FF19A3C3F
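Because a mismatched DEK or IV produces incorrect data without raising an exception, it is worth at least checking the format of both values before you save the job. The Python sketch below only validates length and character set; it performs no decryption, and the function names are hypothetical.

```python
import string

HEX_CHARS = set(string.hexdigits)


def is_hex_of_length(value: str, length: int) -> bool:
    """True if `value` has exactly `length` hexadecimal characters."""
    return len(value) == length and all(ch in HEX_CHARS for ch in value)


def check_key_material(dek: str, iv: str) -> None:
    """Raise ValueError if the DEK or IV has the wrong format."""
    if not is_hex_of_length(dek, 64):
        raise ValueError("DEK must consist of 64 hexadecimal characters")
    if not is_hex_of_length(iv, 32):
        raise ValueError("IV must consist of 32 hexadecimal characters")


dek = "DD0AE00DFECD78BF051BCFDA25BD4E320DB0A7AC75A1F3FC3D3C56A457DCDC1B"
iv = "5C91687BA886EDCD12ACBC3FF19A3C3F"
check_key_material(dek, iv)
# 64 hex characters encode 32 bytes (256 bits); 32 hex characters encode 16 bytes.
print(len(bytes.fromhex(dek)), len(bytes.fromhex(iv)))
```

A format check cannot detect a well-formed but wrong key. The values must still be identical to those configured during encryption.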

MD5 File Extension

This parameter is displayed only when File Format is set to Binary.

This parameter is used to check whether the files extracted by CDM are consistent with source files. For details, see MD5 Verification.

.md5
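The idea behind an MD5 consistency check can be sketched with Python's hashlib. The sketch assumes that the file with the .md5 extension stores the hexadecimal digest of its companion data file as the first whitespace-separated token; that layout is an assumption for illustration, and both function names are hypothetical.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as lowercase hexadecimal."""
    return hashlib.md5(data).hexdigest()


def matches_md5_file(data: bytes, md5_file_text: str) -> bool:
    """Compare `data` against the digest recorded in a .md5 file."""
    # Assumption: the digest is the first token in the .md5 file.
    tokens = md5_file_text.split()
    return bool(tokens) and tokens[0].lower() == md5_hex(data)


content = b"id,name\n1,alice\n"
recorded = md5_hex(content) + "\n"  # what a matching .md5 file would hold
print(matches_md5_file(content, recorded))
print(matches_md5_file(content + b"tampered", recorded))
```

The first call prints True because the data is unchanged. The second prints False because the content no longer matches the recorded digest.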

HDFS supports the UTF-8 encoding only. Retain the default value UTF-8.