Updated on 2022-12-07 GMT+08:00

Dumping Data to MRS

Prerequisites

DIS cannot dump data to MRS 3.x or later versions. Kerberos authentication must be disabled for the MRS cluster to which data is to be dumped.

Source Data Type: JSON, BLOB, and CSV; Dump File Format: Text

Table 1 Parameters for configuring a Text dump file

Parameter: Task Name
Description: Name of the dump task. The names of dump tasks created for the same stream must be unique. A dump task name is 1 to 64 characters long and can contain only letters, digits, hyphens (-), and underscores (_).
Value: -

Parameter: MRS Cluster
Description: Click Select. In the Select MRS Cluster dialog box, select an MRS cluster. Data can be dumped only to an MRS cluster that is not authenticated by Kerberos. You can only select, not enter, a value in this field.
Value: -

Parameter: HDFS Path
Description: Click Select. In the Select HDFS Path dialog box, select an HDFS path. You can only select, not enter, a value in this field. This parameter is available only after you select an MRS cluster.
Value: -

Parameter: File Directory
Description: Directory created in MRS to store files dumped from the DIS stream. The directory name can contain a maximum of 50 characters. By default, this parameter is left unspecified.
Value: -

Parameter: Offset
Description:
  • Latest: Maximum offset, indicating that the latest data will be read.
  • Earliest: Minimum offset, indicating that the earliest data will be read.
Value: Latest

Parameter: Dump Interval (s)
Description: Interval at which data from the DIS stream is imported into the dump destination, such as OBS, MRS, DLI, or DWS. If no data is pushed to the DIS stream during the specified interval, no dump file is generated. Value range: 30 to 900. Unit: second. Default value: 300.
Value: -

Parameter: Temporary Bucket
Description: OBS bucket in which a directory is created to temporarily store user data. The data in the directory is deleted after being dumped to the specified destination.
Value: -

Parameter: Temporary Directory
Description: Directory in the chosen Temporary Bucket for temporarily storing data. The data in the directory is deleted after being dumped to the specified destination. If this field is left blank, the data is stored directly in the Temporary Bucket.
Value: -
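Taken together, the Table 1 fields describe a single dump-task definition. The following minimal sketch shows how they might be assembled programmatically; the dictionary keys are hypothetical stand-ins for the console fields, not the authoritative names used by the DIS dump-task API, so check the API reference for the actual request schema.

    # Hypothetical dump-task definition mirroring the Table 1 parameters.
    # Key names are illustrative only, not the real DIS API keys.
    mrs_dump_task = {
        "destination_type": "MRS",
        "task_name": "dis_to_mrs_01",      # 1-64 chars: letters, digits, -, _
        "mrs_cluster": "mrs_demo",         # Kerberos authentication disabled
        "hdfs_path": "/user/dis",          # selected, not typed, in the console
        "file_directory": "stream01",      # at most 50 characters; optional
        "offset": "LATEST",                # or "EARLIEST"
        "dump_interval": 300,              # seconds; allowed range 30-900
        "temp_bucket": "dis-temp-bucket",  # OBS bucket for intermediate data
        "temp_directory": "dis-tmp",       # empty means the bucket root is used
    }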

Source Data Type: JSON and CSV; Dump File Format: Parquet

Table 2 lists the differentiated parameters that need to be set when the source data type is JSON or CSV, the dump destination is MRS, and the dump file format is Parquet. For details about how to configure other common parameters, see Table 1.

Table 2 Parameters for configuring a Parquet dump file

Parameter: Source Data Schema
Description: JSON or CSV data example, used to describe the JSON or CSV data format. DIS can generate an Avro schema based on the JSON or CSV data sample and convert the uploaded JSON or CSV data to the Parquet format.
Value: -
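To make the Source Data Schema parameter concrete, here is a sketch of a JSON sample and the kind of Avro record schema that could be derived from it. The field names and the schema name are invented for illustration, and the exact schema DIS generates may differ.

    import json

    # Hypothetical JSON sample supplied as the Source Data Schema.
    sample_record = {"device_id": "d001", "temperature": 23.5, "online": True}

    # An Avro record schema of the kind that could be derived from the sample:
    # one field per JSON key, with Avro types mirroring the JSON value types.
    avro_schema = {
        "type": "record",
        "name": "DisRecord",  # illustrative name
        "fields": [
            {"name": "device_id", "type": "string"},
            {"name": "temperature", "type": "double"},
            {"name": "online", "type": "boolean"},
        ],
    }

    print(json.dumps(avro_schema, indent=2))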

Source Data Type: JSON and CSV; Dump File Format: CarbonData

Table 3 lists the differentiated parameters that need to be set when the source data type is JSON or CSV, the dump destination is MRS, and the dump file format is CarbonData. For details about how to configure other common parameters, see Table 1.

Table 3 Parameters for configuring a CarbonData dump file

Parameter: Source Data Schema
Description: JSON or CSV data example, used to describe the JSON or CSV data format. DIS can generate an Avro schema based on the JSON or CSV data sample and convert the uploaded JSON or CSV data to the CarbonData format.
Value: -

Parameter: CarbonData Retrieval Attribute
Description: Attributes of the carbon table, used to create a carbon writer. The following keys are supported (see the sketch after this table):
  • table_blocksize: Block size of the table. The value ranges from 1 MB to 2048 MB. The default value is 1024 MB.
  • table_blocklet_size: Size of a blocklet in a file. The default value is 64 MB.
  • local_dictionary_enable: Whether the local dictionary is enabled. Possible values are true and false. The default value is false.
  • sort_columns: Specifies the index columns. Multiple index columns are separated by commas (,).
  • sort_scope: Specifies the scope in which data is sorted during loading. The following values are supported:
    • local_sort: Default value, indicating that data is sorted within each node.
    • no_sort: Data is not sorted. Use it when data needs to be loaded quickly; after loading, you can run the Compaction command to build an index while the system is idle.
    • batch_sort: CarbonData files are generated after in-memory batches are sorted within a node, without a full sort across the node's data. This improves loading speed, but query performance is inferior to that of local_sort.
Value: -
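The retrieval attributes are simple key-value pairs. Below is a minimal sketch of a plausible attribute set using the keys listed in Table 3; the column names and values, and the exact syntax the console expects, are assumptions for illustration.

    # Hypothetical CarbonData retrieval attributes built from the keys in
    # Table 3. Column names and the string-map syntax are illustrative.
    carbon_attributes = {
        "table_blocksize": "512",            # MB; range 1-2048, default 1024
        "table_blocklet_size": "64",         # MB; default 64
        "local_dictionary_enable": "false",  # true or false
        "sort_columns": "device_id,event_time",  # comma-separated index columns
        "sort_scope": "local_sort",          # local_sort, no_sort, or batch_sort
    }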