Combining and Publishing Datasets

Introduction to Data Combination

The quality and variety of data sources play a crucial role in the development of specific capabilities in LLMs. Based on their origin, fine-tuning data can be categorized into the following types:

General Q&A data: Covers mathematics, code, and logical reasoning, and is used to retain the general capabilities of the model.
Industry-specific Q&A data: Used to improve the model's ability to solve downstream tasks.

For example, the dataset used for a financial L1 model consists of 25% general mathematical data, 20.5% general code data, 21.5% general logical reasoning data, 12.5% general non-logical reasoning data, and 20.5% industry-specific data.
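
To make the example mix concrete, the following is a minimal sketch that turns the percentages quoted above into per-category record counts for a target dataset size. The function name, the category keys, and the target of 10,000 records are hypothetical and chosen only for illustration; the platform does not necessarily compute counts this way.

```python
from math import floor

# Percentages quoted above for the financial L1 model example.
MIX_PERCENT = {
    "general_math": 25.0,
    "general_code": 20.5,
    "general_logical_reasoning": 21.5,
    "general_non_logical_reasoning": 12.5,
    "industry_specific": 20.5,
}


def allocate_records(total, mix_percent):
    """Split `total` records across categories by percentage.

    Uses largest-remainder rounding so the counts always sum to `total`.
    """
    if abs(sum(mix_percent.values()) - 100.0) > 1e-9:
        raise ValueError("percentages must sum to 100")
    exact = {name: total * pct / 100.0 for name, pct in mix_percent.items()}
    counts = {name: floor(value) for name, value in exact.items()}
    leftover = total - sum(counts.values())
    # Give the leftover records to the categories with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical target size, used only to illustrate the arithmetic.
counts = allocate_records(10_000, MIX_PERCENT)
for name, count in counts.items():
    print(f"{name}: {count}")
```

The resulting counts can then be entered as the number of records per dataset when combining by dataset.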

During actual training, the ratio of general Q&A data to industry-specific Q&A data is crucial. If the proportion of industry-specific data is too high, the model may lose too many general capabilities. On the other hand, if the proportion is too low, the model may not effectively learn the necessary industry knowledge. Typically, the ratio of industry-specific Q&A data to general Q&A data is around 1:3. However, if the quality of the industry-specific data is high, this ratio can be increased to enhance the model's performance in the target domain. If you want to retain as many general capabilities as possible, it is advised to include more high-quality general data.
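
The ratio guidance above reduces to simple arithmetic. The sketch below, assuming you already know how many industry-specific records you have, estimates how many general records are needed to reach a target ratio such as 1:3. Both function names are hypothetical helpers and are not part of ModelArts Studio.

```python
def general_records_needed(industry_records, industry_parts=1, general_parts=3):
    """Return the number of general records that matches the target ratio.

    The default 1:3 is the typical industry-to-general ratio mentioned above.
    The result is rounded up so the general share never falls below the target.
    """
    if industry_records < 0 or industry_parts <= 0 or general_parts < 0:
        raise ValueError("counts and ratio parts must be non-negative, industry_parts > 0")
    return (industry_records * general_parts + industry_parts - 1) // industry_parts


def industry_share(industry_records, general_records):
    """Return the fraction of the combined dataset that is industry-specific."""
    total = industry_records + general_records
    if total == 0:
        raise ValueError("the combined dataset is empty")
    return industry_records / total


# Hypothetical figures: 5,000 industry-specific records at the typical 1:3 ratio.
general = general_records_needed(5_000)
print(general, industry_share(5_000, general))
```

Raising `industry_parts` relative to `general_parts` models the case where high-quality industry data justifies a larger industry share.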

For different industry scenarios, a more suitable combination strategy should be considered:

Healthcare: Patient consultation, case analysis, and drug recommendation are key services, and typically require accurate and high-quality domain-specific data. The data combination strategy should prioritize medical field data and real-world data from individual hospitals to ensure the model can effectively process professional text and more practical cases.
Finance: Financial news, stock market analysis reports, and financial regulations are key data sources. The data combination strategy should primarily focus on financial news and financial reports. However, the actual implementation should take data quality into account. If certain datasets contain a high volume of low knowledge-density content (e.g., financial reports), their proportion in the training data should be reduced.
Legal: The focus should be on legal provisions, case law, judicial documents, and contracts. This domain has high requirements for professionalism, and the data often includes numerous names and locations. Therefore, targeted data processing is necessary. The data combination strategy should emphasize domain-specific legal content and avoid over-inclusion of general data. Given the typically high quality of legal documents, the proportion of industry-specific data can be increased to enhance the model's professional performance.
Customer service: This domain includes customer conversation logs, FAQ data, and customer service manuals. The data combination should focus on content related to user interaction and question-answering. Customer conversation data is generally of lower quality, so the proportion of industry-specific data can be reduced.

Procedure for Data Combination and Publishing

You can use the dataset combination function on ModelArts Studio as follows:

Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
In the navigation pane, choose Data Engineering > Data Processing > Combine Task. On the displayed page, click Create data combine in the upper right corner.
In the Dataset Modality area, select the dataset modality for which data combination is to be performed. Text, image, video, and prediction datasets can be combined, as shown in Figure 1.
Figure 1 Dataset Modality
In the Select Dataset area, select at least two datasets of the selected modality (for example, two image datasets) and click Next.
In the data combination area, two modes are supported: by dataset and by label.
- By dataset: You can set the number of data records in the datasets to be combined, as shown in Figure 2.
  Figure 2 Example of setting the number of data records in the datasets to be combined
- By label: This mode is applicable to text datasets processed by the data labeling operator. You can obtain the label name and value on the dataset details page after performing the operations in Processing Datasets.
  Figure 3 shows an example.
  
  Figure 3 Example of setting the labels for filtering the data to be combined
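
As a conceptual illustration of the two modes, the sketch below combines small in-memory datasets by record count and by label. It is not the platform's implementation: the record layout (a `labels` dictionary on each record), the function names, and the use of random sampling to pick records are all assumptions made for this example.

```python
import random


def combine_by_dataset(datasets, record_counts, seed=0):
    """Take the requested number of records from each dataset and concatenate them.

    Random sampling is an assumption; the platform may select records differently.
    """
    rng = random.Random(seed)
    combined = []
    for name, records in datasets.items():
        wanted = record_counts[name]
        if wanted > len(records):
            raise ValueError(f"{name} has only {len(records)} records, {wanted} requested")
        combined.extend(rng.sample(records, wanted))
    return combined


def combine_by_label(datasets, label_name, allowed_values):
    """Keep only the records whose label `label_name` has one of `allowed_values`."""
    return [
        record
        for records in datasets.values()
        for record in records
        if record.get("labels", {}).get(label_name) in allowed_values
    ]


# Hypothetical toy datasets; the "labels" field and the "quality" label are invented.
datasets = {
    "general_qa": [
        {"text": f"general {i}", "labels": {"quality": "high" if i % 2 == 0 else "low"}}
        for i in range(6)
    ],
    "finance_qa": [
        {"text": f"finance {i}", "labels": {"quality": "high"}} for i in range(4)
    ],
}

by_dataset = combine_by_dataset(datasets, {"general_qa": 3, "finance_qa": 2})
by_label = combine_by_label(datasets, "quality", {"high"})
print(len(by_dataset), len(by_label))
```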

After the data combination configuration is complete, click Next in the lower right corner to go to the resource configuration page and select whether to automatically generate a processed dataset.

Resource Allocation:

Click the expand icon to expand resource configuration and set task resources. You can also customize parameters. Click Add Parameters and enter the parameter name and value.

Table 1 Parameter configuration
Parameter	Description
numExecutors	Number of executors. The default value is 2. The minimum value of the product of numExecutors and executorMemory is 4, and the maximum value is 16.
executorCores	Number of CPU cores used by each executor process. The default value is 2. The minimum value of the product of numExecutors and executorMemory is 4, and the maximum value is 16. The ratio of executorCores to executorMemory must be in the range of 1:2 to 1:4.
executorMemory	Memory size used by each executor process. The default value is 4. The ratio of executorCores to executorMemory must be in the range of 1:2 to 1:4.
driverCores	Number of CPU cores used by each driver process. The default value is 2. The ratio of driverCores to driverMemory must be in the range of 1:2 to 1:4.
driverMemory	Memory used by the driver process. The default value is 4. The ratio of driverCores to driverMemory must be in the range of 1:2 to 1:4.
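
Before submitting a task, it can help to check a planned configuration against the constraints in Table 1. The following is a minimal sketch that encodes those constraints exactly as the table words them; `check_resources` is a hypothetical helper, not a platform API, and the platform performs its own validation.

```python
def check_resources(num_executors=2, executor_cores=2, executor_memory=4,
                    driver_cores=2, driver_memory=4):
    """Return a list of constraint violations (empty if the settings look valid).

    Defaults mirror the default values listed in Table 1.
    """
    values = {
        "numExecutors": num_executors,
        "executorCores": executor_cores,
        "executorMemory": executor_memory,
        "driverCores": driver_cores,
        "driverMemory": driver_memory,
    }
    problems = [f"{name} must be positive" for name, value in values.items() if value <= 0]
    if problems:
        return problems

    # Product constraint as stated in Table 1: between 4 and 16.
    product = num_executors * executor_memory
    if not 4 <= product <= 16:
        problems.append(f"numExecutors * executorMemory is {product}, expected 4 to 16")

    # A cores-to-memory ratio of 1:2 to 1:4 means memory is 2 to 4 times the cores.
    if not 2 * executor_cores <= executor_memory <= 4 * executor_cores:
        problems.append("executorCores:executorMemory must be between 1:2 and 1:4")
    if not 2 * driver_cores <= driver_memory <= 4 * driver_cores:
        problems.append("driverCores:driverMemory must be between 1:2 and 1:4")
    return problems


print(check_resources())                  # default values from Table 1
print(check_resources(num_executors=8))   # product 8 * 4 = 32 exceeds the maximum
```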

Figure 4 Resource Allocation

Automatically Generate Processing Dataset

Select this option and configure the information for generating a processed dataset, as shown in Figure 5. Click Confirm in the lower right corner. The platform starts the data combination task. After the task is successfully executed, a processed dataset is automatically generated.

If you do not select this option, click OK in the lower right corner. The platform starts the data combination task. After the combination task is successfully executed, manually generate a processed dataset.

Figure 5 Automatically Generate Processing Dataset

(Optional) Extended Info

You can select the industry and language, or customize dataset properties.

Figure 6 Extended Info

Click Confirm. On the Data Combine Task page, wait for the task to finish and check that its status is Success.
Click Generate in the Operation column to generate a published dataset.
To view the published dataset, choose Data Engineering > Data Management > Datasets, and click the Published Dataset tab.