Creating a Data Processing Task

Description

Data processing is a core step in data engineering. It preprocesses data with dataset processing operators so that the data meets model training standards and service requirements.

Currently, text, video, image, and weather datasets can be processed.

Procedure

On the Create Processing Job page, select the dataset to be processed.
Click Next. The processing step orchestration page is displayed. The list on the left shows the processing operators available for the current dataset.
1. In the Adding Operator pane on the left, select the required operators.
2. On the processing step orchestration page on the right, drag operators to adjust their execution order.
3. During orchestration, you can click Save new template in the upper right corner to save the current orchestration as a template. You can then select this template when creating subsequent data processing tasks.
  If you select a processing template, the processing steps you have already orchestrated will be deleted.

After the processing steps are orchestrated, click Next to go to the Task Configuration page.

Set resource parameters by referring to Table 1. In addition to the parameters listed in the table, you can add custom parameters.

**Table 1** Parameter configuration
Parameter	Description
numExecutors	Number of executors. The default value is 2. The minimum value of the product of numExecutors and executorMemory is 4, and the maximum value is 16.
executorCores	Number of CPU cores used by each executor process. The default value is 2. The minimum value of the product of numExecutors and executorMemory is 4, and the maximum value is 16. The ratio of executorCores to executorMemory must be in the range of 1:2 to 1:4.
executorMemory	Memory size used by each executor process. The default value is 4. The ratio of executorCores to executorMemory must be in the range of 1:2 to 1:4.
driverCores	Number of CPU cores used by each driver process. The default value is 2. The ratio of driverCores to driverMemory must be in the range of 1:2 to 1:4.
driverMemory	Memory used by the driver process. The default value is 4. The ratio of driverCores to driverMemory must be in the range of 1:2 to 1:4.
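
The constraints in Table 1 can be checked before you enter values on the Task Configuration page. The following is a minimal sketch, not part of the product: the function name validate_resources is hypothetical, the range bounds are assumed to be inclusive, and the product rule is applied exactly as worded in Table 1. The platform's own validation may differ.

```python
# Default values taken from Table 1.
DEFAULTS = {
    "numExecutors": 2,
    "executorCores": 2,
    "executorMemory": 4,
    "driverCores": 2,
    "driverMemory": 4,
}


def validate_resources(overrides=None):
    """Return a list of rule violations; an empty list means all rules pass.

    Hypothetical helper for illustration only.
    """
    cfg = {**DEFAULTS, **(overrides or {})}
    errors = []

    # Product rule as worded in Table 1 (bounds assumed inclusive).
    product = cfg["numExecutors"] * cfg["executorMemory"]
    if not 4 <= product <= 16:
        errors.append(
            f"numExecutors * executorMemory = {product}, expected 4 to 16"
        )

    # Ratio rule: cores:memory between 1:2 and 1:4,
    # which means memory must be 2x to 4x the number of cores.
    pairs = (
        ("executorCores", "executorMemory"),
        ("driverCores", "driverMemory"),
    )
    for cores_key, mem_key in pairs:
        cores, mem = cfg[cores_key], cfg[mem_key]
        if not 2 * cores <= mem <= 4 * cores:
            errors.append(
                f"{cores_key}:{mem_key} = {cores}:{mem}, expected 1:2 to 1:4"
            )

    return errors
```

Calling validate_resources() with no arguments checks the default values, which satisfy every rule. Passing a dictionary such as {"numExecutors": 5} overrides individual parameters and returns one message per violated rule.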

Automatically Generate Processing Dataset: If this option is enabled, a processed dataset is generated automatically after the task is executed successfully. This function can be used for publishing downstream datasets. If this option is disabled, you must generate the processed dataset manually from the processing task list.
Enter the dataset name, description, and extended information (optional, including the industry, language, and custom information).