Combining Text Datasets Based on a Specific Ratio
Data combination merges multiple datasets at a specified ratio and publishes the combined dataset. A well-chosen ratio helps keep the combined dataset diverse, balanced, and representative.
If a single dataset meets your requirements, skip this section and proceed with Publishing Text Datasets.
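As a rough illustration of what "combining based on a ratio" means in terms of record counts, the sketch below converts a ratio into a number of records per dataset. The function name, the dataset names, and the rounding rule (largest remainder) are assumptions made for this example; they do not describe how the platform computes counts internally.

```python
def records_per_dataset(ratio, total):
    """Split `total` records across datasets in proportion to `ratio`.

    Hypothetical helper: uses the largest-remainder method so that the
    returned counts always sum to `total`.
    """
    weight_sum = sum(ratio.values())
    if weight_sum <= 0:
        raise ValueError("ratio weights must sum to a positive number")
    # Exact (possibly fractional) share of each dataset.
    exact = {name: total * w / weight_sum for name, w in ratio.items()}
    # Round every share down first.
    counts = {name: int(share) for name, share in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining records to the largest fractional parts first.
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Illustrative dataset names and a 3:1:1 ratio over 1000 records.
example = records_per_dataset({"news": 3, "dialogue": 1, "code": 1}, total=1000)
print(example)
```

The resulting counts are the kind of per-dataset record numbers you would then enter when configuring the combination task.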
Creating a Text Data Combination Task
To create a text dataset combination task, perform the following steps:
- Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
Figure 1 My Spaces
- In the navigation pane, choose Data Engineering > Data Processing > Combine Task. On the displayed page, click Create data combine in the upper right corner.
- In the Dataset Modality area, select the dataset modality for which data combination is to be performed. Text, image, and prediction datasets can be combined, as shown in Figure 2.
- In the Select Dataset area, select at least two text datasets and click Next.
- In the data combination area, select one of the two supported modes: by dataset or by label.
  - By Dataset: Set the number of data records to take from each dataset to be combined, as shown in Figure 3.
  - By Label: This mode applies to text datasets that have been processed by the data labeling operator. You can obtain the label name and value on the dataset details page after performing the operations in Processing Text Datasets. Figure 4 shows an example.
- After the data combination configuration is complete, click Next in the lower right corner to go to the resource configuration page and select whether to automatically generate a processed dataset.
Resource Allocation:
Click the expand icon to expand the resource configuration and set task resources. To customize parameters, click Add Parameters and enter the parameter name and value.
Table 1 Parameter configuration

| Parameter Name | Description |
| --- | --- |
| numExecutors | Number of executors. The default value is 2. An executor is a process running on a worker node. It executes tasks and returns the calculation result to the driver. Each core in an executor runs one task at a time, so adding executors (if they are available) lets more tasks run concurrently and improves efficiency. Constraint: numExecutors x executorMemory must be greater than or equal to 4 and less than or equal to 16. |
| executorCores | Number of CPU cores used by each executor process. The default value is 2. Multiple cores in an executor can run multiple tasks at the same time, which increases task concurrency. However, because all cores share the memory of an executor, you need to balance the memory and the number of cores. Constraints: numExecutors x executorMemory must be greater than or equal to 4 and less than or equal to 16. The ratio of executorCores to executorMemory must be in the range of 1:2 to 1:4. |
| executorMemory | Memory size used by each executor process. The default value is 4. The executor memory is used for task execution and communication. Increase the memory for a job that requires a large amount of resources, and use a smaller memory to run small jobs concurrently. Constraint: The ratio of executorCores to executorMemory must be in the range of 1:2 to 1:4. |
| driverCores | Number of CPU cores used by each driver process. The default value is 2. The driver schedules jobs and communicates with executors. Constraint: The ratio of driverCores to driverMemory must be in the range of 1:2 to 1:4. |
| driverMemory | Memory used by the driver process. The default value is 4. The driver schedules jobs and communicates with executors. Increase the driver memory when the number of tasks or their level of parallelism increases. Constraint: The ratio of driverCores to driverMemory must be in the range of 1:2 to 1:4. |
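The constraints listed for the parameters in Table 1 can be checked locally before you enter values on the page. The sketch below is a hypothetical pre-check, not a platform API: the function name and messages are invented for illustration, the default arguments mirror the defaults stated in Table 1, and a cores-to-memory ratio of 1:2 to 1:4 is read as "memory is between two and four times the number of cores".

```python
def check_resources(num_executors=2, executor_cores=2, executor_memory=4,
                    driver_cores=2, driver_memory=4):
    """Return a list of violated constraints (empty if all checks pass).

    Hypothetical local check based on the constraints stated in Table 1.
    """
    problems = []
    # numExecutors x executorMemory must lie in the range 4..16.
    if not 4 <= num_executors * executor_memory <= 16:
        problems.append("numExecutors x executorMemory must be between 4 and 16")
    # A cores:memory ratio of 1:2 to 1:4 means memory is 2x to 4x the cores.
    if not 2 * executor_cores <= executor_memory <= 4 * executor_cores:
        problems.append("executorCores:executorMemory must be between 1:2 and 1:4")
    if not 2 * driver_cores <= driver_memory <= 4 * driver_cores:
        problems.append("driverCores:driverMemory must be between 1:2 and 1:4")
    return problems


print(check_resources())                 # default values from Table 1
print(check_resources(num_executors=5))  # 5 x 4 = 20 exceeds the upper bound
```

With the default values, every check passes. Raising numExecutors to 5 without lowering executorMemory breaks the product constraint, which shows why the two parameters need to be adjusted together.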
Figure 5 Resource Allocation
Automatically Generate Processing Dataset:
Select this option and configure the information for generating a processed dataset, as shown in Figure 6. Click Confirm in the lower right corner. The platform starts the data combination task. After the combination task is successfully executed, a processed dataset is automatically generated.
If you do not select this option, click OK in the lower right corner. The platform starts the data combination task. After the combination task is successfully executed, manually generate a processed dataset.
(Optional) Extended Info:
You can select the industry and language, or customize dataset properties.
Figure 7 Extended Info
- Click Confirm. On the Data Combine Task page, after the task is executed successfully, check that the status is Success.
- Click Generate in the Operation column to generate a published dataset.
To view the published dataset, choose Data Engineering > Data Management > Datasets, and click the Published Dataset tab.
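For readers who want an intuition for the two combination modes configured in the procedure above, the following sketch mimics them on in-memory records. It is a sketch under stated assumptions: the function names, the record fields (`text`, `quality`), and the use of random sampling without replacement are all hypothetical, and the platform's actual selection logic is not documented here.

```python
import random


def combine_by_dataset(datasets, counts, seed=0):
    """Draw counts[name] records from each dataset, then shuffle the result."""
    rng = random.Random(seed)
    combined = []
    for name, records in datasets.items():
        # Never request more records than the dataset holds.
        take = min(counts.get(name, 0), len(records))
        combined.extend(rng.sample(records, take))
    rng.shuffle(combined)
    return combined


def combine_by_label(records, label_name, counts, seed=0):
    """Draw counts[value] records for each value of the given label."""
    rng = random.Random(seed)
    groups = {}
    for record in records:
        groups.setdefault(record.get(label_name), []).append(record)
    combined = []
    for value, wanted in counts.items():
        pool = groups.get(value, [])
        combined.extend(rng.sample(pool, min(wanted, len(pool))))
    rng.shuffle(combined)
    return combined


# Two toy datasets; "quality" stands in for a label added by a labeling step.
datasets = {
    "faq": [{"text": f"faq {i}", "quality": "high"} for i in range(10)],
    "forum": [{"text": f"forum {i}", "quality": "low"} for i in range(10)],
}
mixed = combine_by_dataset(datasets, {"faq": 6, "forum": 2})
all_records = datasets["faq"] + datasets["forum"]
by_label = combine_by_label(all_records, "quality", {"high": 3, "low": 1})
print(len(mixed), len(by_label))
```

In the by-dataset case the counts are keyed by dataset, while in the by-label case they are keyed by label value, which is why the label name and value must be known before configuring that mode.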