Creating a Data Processing Task
Description
Data processing is a core step in data engineering. It preprocesses data with dataset processing operators so that the data meets model training standards and service requirements.
Currently, text, video, image, and weather datasets can be processed.
Procedure
- On the Create Processing Job page, select the dataset to be processed.
- Click Next. The processing step orchestration page is displayed. The list on the left shows the processing operators available for the current dataset.
- In the Adding Operator pane on the left, select the required operators.
- On the processing step orchestration page on the right, drag operators to adjust the operator execution sequence.
- During orchestration, you can click Save new template in the upper right corner to save the current orchestration as a template. You can then select a saved processing template when creating subsequent data processing tasks.
Note: If you select a processing template, the processing steps that you have already orchestrated will be deleted.
- After the processing steps are orchestrated, click Next to go to the Task Configuration page.
- Set resource parameters by referring to Table 1. You can also add custom parameters in addition to those listed in the table.
Table 1 Parameter configuration

| Parameter | Description |
|---|---|
| numExecutors | Number of executors. The default value is 2. The minimum value of the product of numExecutors and executorMemory is 4, and the maximum value is 16. |
| executorCores | Number of CPU cores used by each executor process. The default value is 2. The minimum value of the product of numExecutors and executorMemory is 4, and the maximum value is 16. The ratio of executorCores to executorMemory must be in the range of 1:2 to 1:4. |
| executorMemory | Memory size used by each executor process. The default value is 4. The ratio of executorCores to executorMemory must be in the range of 1:2 to 1:4. |
| driverCores | Number of CPU cores used by each driver process. The default value is 2. The ratio of driverCores to driverMemory must be in the range of 1:2 to 1:4. |
| driverMemory | Memory size used by the driver process. The default value is 4. The ratio of driverCores to driverMemory must be in the range of 1:2 to 1:4. |
- Automatically Generate Processing Dataset: If this option is enabled, a processed dataset is generated automatically after the task is executed successfully. This dataset can be used for publishing downstream datasets. If the option is disabled, you need to generate the processed dataset manually from the processing task list.
Enter the dataset name, description, and extended information (optional, including the industry, language, and custom information).
- Click Start Process in the lower right corner to start the processing task.
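The constraints in Table 1 can be checked before the values are entered on the Task Configuration page. The following is a minimal sketch, not a platform API: the class `ResourceConfig` and the function `validate` are hypothetical names introduced for illustration, and the rules are encoded exactly as Table 1 states them (the product of numExecutors and executorMemory between 4 and 16, and core-to-memory ratios between 1:2 and 1:4).

```python
from dataclasses import dataclass


@dataclass
class ResourceConfig:
    """Hypothetical container for the Table 1 parameters (defaults from Table 1)."""
    num_executors: int = 2
    executor_cores: int = 2
    executor_memory: int = 4
    driver_cores: int = 2
    driver_memory: int = 4


def _ratio_ok(cores: int, memory: int) -> bool:
    # A cores:memory ratio between 1:2 and 1:4 means memory is 2x to 4x the cores.
    return 2 * cores <= memory <= 4 * cores


def validate(cfg: ResourceConfig) -> list[str]:
    """Return a list of violated Table 1 constraints (empty if the config is valid)."""
    errors = []

    # Product rule as written in Table 1: numExecutors * executorMemory in [4, 16].
    product = cfg.num_executors * cfg.executor_memory
    if not 4 <= product <= 16:
        errors.append(
            f"numExecutors * executorMemory is {product}; it must be between 4 and 16"
        )

    if not _ratio_ok(cfg.executor_cores, cfg.executor_memory):
        errors.append("executorCores:executorMemory must be between 1:2 and 1:4")

    if not _ratio_ok(cfg.driver_cores, cfg.driver_memory):
        errors.append("driverCores:driverMemory must be between 1:2 and 1:4")

    return errors


if __name__ == "__main__":
    # The defaults from Table 1 satisfy every constraint.
    print(validate(ResourceConfig()))
    # Eight executors with 4 units of memory each exceed the product limit.
    print(validate(ResourceConfig(num_executors=8)))
```

Returning a list of messages rather than raising on the first failure lets you see every violated constraint at once, which is convenient when several parameters are adjusted together.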