Updated on 2025-07-02 GMT+08:00

Combining Image Datasets Based on a Specific Ratio

Data combination merges multiple datasets according to a specified ratio and publishes the result as a single dataset. A well-chosen ratio keeps the combined dataset diverse, balanced, and representative.

If a single dataset meets your requirements, skip this section and proceed with Publishing Image Datasets.
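
To make the idea of a combination ratio concrete, the following Python sketch draws samples from two datasets in a 3:1 ratio. It only illustrates the concept and is not the platform's implementation; the function name combine_by_ratio, the file names, and the rounding behavior are all assumptions made for this example.

```python
import random


def combine_by_ratio(datasets, ratios, total, seed=0):
    """Draw about `total` samples from `datasets` in proportion to `ratios`.

    datasets: dict mapping a dataset name to a list of sample identifiers.
    ratios:   dict mapping the same names to positive weights, e.g. 3 and 1.
    """
    if set(datasets) != set(ratios):
        raise ValueError("datasets and ratios must use the same names")
    weight_sum = sum(ratios.values())
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    combined = []
    for name, samples in datasets.items():
        # Per-dataset quota; rounding means the final size can differ
        # slightly from `total` for ratios that do not divide evenly.
        quota = round(total * ratios[name] / weight_sum)
        if quota > len(samples):
            raise ValueError(f"{name} has only {len(samples)} samples, needs {quota}")
        combined.extend(rng.sample(samples, quota))
    rng.shuffle(combined)
    return combined


# Hypothetical file names standing in for two image datasets.
street = [f"street_{i}.jpg" for i in range(100)]
indoor = [f"indoor_{i}.jpg" for i in range(100)]
mixed = combine_by_ratio(
    {"street": street, "indoor": indoor},
    {"street": 3, "indoor": 1},
    total=40,
)
print(len(mixed))
```

With a 3:1 ratio and a target of 40 samples, the sketch takes 30 samples from the first dataset and 10 from the second. On the platform, you set the ratio on the Data Combine page instead of writing code.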

Creating an Image Data Combination Task

To create an image dataset combination task, perform the following steps:

  1. Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
    Figure 1 My Spaces
  2. In the navigation pane, choose Data Engineering > Data Processing > Combine Task. On the displayed page, click Create data combine in the upper right corner.
  3. In the Select Dataset area, select at least two image datasets and click Next.
  4. On the Data Combine page, set the ratio of different datasets and click Next.
  5. After the data combination configuration is complete, click Next in the lower right corner to go to the resource configuration page and select whether to automatically generate a processed dataset.
    • Resource Allocation

      Expand the resource configuration area and set the task resources. To customize parameters, click Add Parameters and enter the parameter name and value.

      Table 1 Parameter configuration

      | Parameter Name | Description |
      |---|---|
      | numExecutors | Number of executors. The default value is 2. numExecutors x executorMemory must be greater than or equal to 4 and less than or equal to 16. |
      | executorCores | Number of CPU cores used by each executor process. The default value is 2. numExecutors x executorMemory must be greater than or equal to 4 and less than or equal to 16. The ratio of executorCores to executorMemory must be in the range of 1:2 to 1:4. |
      | executorMemory | Memory size used by each executor process. The default value is 4. The ratio of executorCores to executorMemory must be in the range of 1:2 to 1:4. |
      | driverCores | Number of CPU cores used by each driver process. The default value is 2. The ratio of driverCores to driverMemory must be in the range of 1:2 to 1:4. |
      | driverMemory | Memory size used by the driver process. The default value is 4. The ratio of driverCores to driverMemory must be in the range of 1:2 to 1:4. |

      Figure 2 Resource Allocation
    • Automatically Generate Processing Dataset

      Select and configure the information about the generated dataset, as shown in Figure 3. Click OK in the lower right corner. The platform starts the data combination task. After the task is successfully executed, a processed dataset is automatically generated.

      If you do not select this option, click OK in the lower right corner. The platform starts the combination task. After the combination task is successfully executed, manually generate a processed dataset.

      Figure 3 Automatically Generate Processing Dataset
    • (Optional) Extended Info

      You can select the industry and language, or customize dataset properties.

      Figure 4 Extended Info
  6. Click OK. On the Data Combine Task page, wait for the task to finish and check that its status is Success.
  7. Click Generate in the Operation column to generate a published dataset.

    To view the published dataset, choose Data Engineering > Data Management > Datasets, and click the Published Dataset tab.
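
The resource constraints listed in Table 1 can be checked before you enter values on the resource configuration page. The following Python sketch is one way to do that. The function name check_resources is hypothetical, and the sketch assumes that a cores-to-memory ratio "in the range of 1:2 to 1:4" means the memory value is between two and four times the core count. It does not call any platform API.

```python
# Default values as documented in Table 1.
DEFAULTS = {
    "numExecutors": 2,
    "executorCores": 2,
    "executorMemory": 4,
    "driverCores": 2,
    "driverMemory": 4,
}


def check_resources(params):
    """Return a list of constraint violations; an empty list means no violations."""
    p = {**DEFAULTS, **params}  # values not supplied fall back to the defaults
    problems = []

    # Constraint as stated in Table 1: 4 <= numExecutors x executorMemory <= 16.
    product = p["numExecutors"] * p["executorMemory"]
    if not 4 <= product <= 16:
        problems.append(f"numExecutors x executorMemory = {product}, expected 4 to 16")

    # Assumed reading of "1:2 to 1:4": memory is 2 to 4 times the core count.
    for role in ("executor", "driver"):
        cores, memory = p[f"{role}Cores"], p[f"{role}Memory"]
        if not 2 * cores <= memory <= 4 * cores:
            problems.append(
                f"{role}Cores:{role}Memory = {cores}:{memory}, expected 1:2 to 1:4"
            )
    return problems


default_problems = check_resources({})
oversized_problems = check_resources({"executorMemory": 16})
for message in oversized_problems:
    print(message)
```

The default values satisfy every constraint, so the first call returns an empty list. Raising executorMemory to 16 while leaving the other defaults unchanged violates both the product limit and the executor ratio.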