Synthesizing Text Datasets

Constraints

Large models are required for debugging preset synthesis instructions, creating custom instructions, and running synthesis tasks. Therefore, you need to purchase and deploy the models on ModelArts Studio.

Creating a Text Data Synthesis Task

Before synthesizing a text dataset, import data. For details, see Importing Data to the Pangu Platform.

To create a text dataset synthesis task, perform the following steps:

Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
Figure 1 My Spaces
In the navigation pane, choose Data Engineering > Data Processing > Synthesis Task. On the displayed page, click Create data synthesis task in the upper right corner.
On the Create data synthesis task page, select the source dataset, the content to be synthesized, and the expected number of synthetic data records, as shown in Figure 2. The expected number of synthetic data records is initially set to the number of data records in the dataset.
Figure 2 Expected number of synthetic data records
If the source dataset type is the same as the target dataset type, you can enable Integrate the source data set into the synthesized data. After the synthesis task is complete, the generated data is merged with the original dataset. Click Next.
The synthesis orchestration page is displayed, as shown in Figure 3. The start bar of the page displays the fixed fields corresponding to the data type of the current dataset and any custom fields in the dataset. These fields can be selected as the input and output variables of an instruction. In the Add Instruction pane on the left, you can select preset or custom instructions. Orchestrate the instructions in a logical order, so that the output of one instruction can serve as the input of a later one. If you select three instructions of the Q&A pair type at the same time, only one Q&A pair is saved in the final output result.
Figure 3 Synthesis orchestration example
After orchestrating the instructions, you can click Create Synthesis Template to save the current orchestration for one-click reuse. To reuse a saved orchestration, click Select Synthesis Template on the right and select a template, as shown in Figure 4. A template is visible only when the input data type of the current dataset is the same as the input data type recorded when the template was saved. For example, a template saved with pre-trained text as the input and single-turn Q&A as the output is invisible if the current input data type is single-turn Q&A.
Figure 4 Select Synthesis Template
- Preset instructions: The platform provides multiple preset instructions for you to execute synthesis tasks. For details, see Introduction to Preset Data Instructions.
- Custom instructions: The platform supports orchestration of custom instructions. For details about how to create a custom instruction, see Creating a Custom Data Synthesis Instruction.
After selecting an instruction, click OK and set the instruction parameters.
Figure 5 shows an example of configuring synthesis instruction parameters for a pre-trained text dataset. The synthesis task uses the pre-trained text to generate Q&A pairs. In addition to the fixed fields context and target of the Q&A pairs, the intermediate instruction result can be saved to the final output result. After the instruction is orchestrated, click Save Synthesis Template on the right. The template can be selected with one click to generate Q&A pairs for subsequent pre-trained text.

Figure 5 Example of configuring parameters for combining pre-trained text data into an instruction
After the instruction orchestration is complete, click Enable commissioning in the upper right corner to preview the instruction effect.
After the instruction commissioning is complete, click Next in the lower right corner and select whether to automatically generate a processed dataset.
If you select this option, configure the information for generating a processed dataset, as shown in Figure 6, and click Create and Start in the lower right corner. The platform starts the data synthesis task. After the synthesis task is successfully executed, a processed dataset is automatically generated.

If you do not select this option, click Create and Start in the lower right corner. The platform starts the data synthesis task. After the synthesis task is successfully executed, manually generate a processed dataset.

Figure 6 Automatically Generate Processing Dataset
After the data synthesis task is successfully executed, the status changes from Running to Success, indicating that the data has been synthesized.

After data synthesis is complete, if you do not need to use the data labeling and data combination functions, click Generate in the Operation column on the Synthesis Task page to generate a processed dataset.

To view the processed dataset, choose Data Engineering > Data Management > Datasets, and click the Processed Dataset tab.
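
The orchestration described in the steps above can be pictured as a small pipeline in which each instruction reads named fields and writes one named field. The following Python sketch is only an illustration of that idea and is not the platform's implementation. The `Instruction` class, the `orchestrate` function, the `text` field name, and the `ask` and `answer` stubs are all hypothetical; only the fixed Q&A fields context and target come from this section.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, str]


@dataclass
class Instruction:
    """One orchestrated step: reads named fields, writes one named field."""
    name: str
    inputs: List[str]
    output: str
    run: Callable[..., str]  # hypothetical stand-in for a call to the deployed model


def orchestrate(record: Record, steps: List[Instruction]) -> Record:
    """Run the steps in order; each step can read fields written by earlier steps."""
    fields = dict(record)  # fields of the source dataset (the "start bar")
    for step in steps:
        missing = [name for name in step.inputs if name not in fields]
        if missing:
            raise KeyError(f"{step.name}: missing input fields {missing}")
        fields[step.output] = step.run(*(fields[name] for name in step.inputs))
    return fields


# Hypothetical stubs that imitate model output so the sketch runs offline.
def ask(text: str) -> str:
    return f"What is the main point of: {text}"


def answer(text: str, question: str) -> str:
    return f"According to the text, {text}"


steps = [
    Instruction("Generating questions based on text", ["text"], "context", ask),
    Instruction("Generating answers based on text", ["text", "context"], "target", answer),
]

# "text" is an assumed field name for a pre-trained text record.
result = orchestrate({"text": "Rivers flow to the sea."}, steps)
qa_pair = {"context": result["context"], "target": result["target"]}
```

Because the second step lists context among its inputs, it can only run after the first step has produced that field, which is why the order of instructions matters during orchestration.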

Introduction to Preset Data Instructions

The data synthesis function of ModelArts Studio provides preset instructions. To view them, choose Data Engineering > Data Processing > Data Synthesis. On the Synthesis Task page, click Manage Synthesis Instructions. On the System Preset tab page, view the instruction details, as shown in Figure 7. Click Commissioning to view the commissioning guide, as shown in Figure 8.

For details about the list of preset instructions, see Table 1.

Figure 7 Instruction details

Figure 8 Commissioning guide

**Table 1** List of preset instructions
Category	Instruction Name	Description
Generate Questions	Rewrite the question to a less difficult one.	This instruction can be used to enable the model to generate a simpler and less difficult question based on the question entered by the user.
	Rewriting the question to a more difficult one	This instruction can be used to enable the model to generate a more complex and more difficult question based on the question entered by the user.
	Generating answering requirements based on the questions	This instruction generates answering requirements for an input question. The requirements are not directly related to the content of the original question. It can be orchestrated with an instruction that answers a question according to answering requirements, to synthesize answers in various styles.
	Generating similar questions based on samples_few-shot	This instruction uses multiple question examples entered by the user to generate one or more new questions that match the sample style.
	Generating questions based on text	This instruction generates a question based on the context entered by the user. It can be used to synthesize and orchestrate text generation Q&A pairs.
	Question rewriting	This instruction is used to rewrite a question to generate more complex questions, which can be used for instruction generalization.
Generate Answers	Answer rewriting	This instruction is used to modify the answer style based on the user-specified settings without changing the answer content. It can be used to generate and orchestrate instructions with persona settings to implement generalization of Q&A pairs.
	Generating answers based on text_complying with requirements	This instruction is used to generate answers based on the input context and the instructions and questions specified by users. It can be used to generate and orchestrate instructions to implement generalization of factual Q&A pairs.
	Generating an answer to a question	This instruction is used to generate answers based on questions.
	Generating an answer based on the text_with a designated persona	This instruction is used to generate answers based on the input context and user-specified persona settings and questions. It can be used to generate and orchestrate instructions with persona settings to implement generalization of factual Q&A pairs.
	Generating questions and answers based on questions_with a designated persona	This instruction is used to generate questions and answers based on the user-specified persona settings and questions. It can be used to generate and orchestrate instructions with persona settings to implement generalization of Q&A pairs.
	Generating answers based on text	This instruction is used to generate answers based on the context and questions entered by users. It can be used for synthesis of factual answers.
Generating Q&A pairs	Generating Q&A pairs based on text_true or false	This instruction is used to construct a true or false question from the text provided by the user, and provide a correct answer.
	Generating Q&A pairs based on text_fill-in-blank	This instruction is used to construct a fill-in-blank question from the text provided by the user, and provide a correct answer.
	Generating Q&A pairs based on text_Q&A question	This instruction is used to construct a question from the text provided by the user, and provide a correct answer.
	Extracting Q&A pairs based on text_financial scenario	This instruction is used to extract Q&A pairs based on financial documents entered by users.
Generating a persona setting	Generating a persona setting based on a question	This instruction is used to generate a persona setting based on the question entered by the user.
Other	Generalizing bad cases	This instruction is used to generate attack questions that may cause the model to make mistakes in similar scenarios, based on the bad case questions and answers provided by users. Users can specify the number of attack questions to be generated. A maximum of 10 attack questions can be generated.
	Deriving the problem solving idea based on the answer	This instruction is used to generate the corresponding problem solving ideas based on the questions and answers entered by the user.
	Instruction generalization	This instruction is used to generalize instructions based on the style specified by the user. It can be used to generate and orchestrate related instructions that match Q&A pairs of the specified requirements to implement Q&A pair generalization.

Creating a Custom Data Synthesis Instruction

The platform allows you to create custom data synthesis instructions.

This section uses the scenario of generating a topic prose passage as an example to describe how to configure a custom data synthesis instruction.

Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
Figure 9 My Spaces
In the navigation pane, choose Data Engineering > Data Processing > Data Synthesis. On the displayed page, click Manage Synthetic Instructions. On the Custom tab page, click Create Custom Instruction.
In the Create Instruction dialog box, enter the name and description, and click OK. The page for configuring a synthetic instruction is displayed.
Select the variable identifier "{{}}" and enter the instruction "Use {{topic}} as the subject and write a prose passage with no more than {{num}} characters."
Click Identify and then click Yes.

Figure 10 Instruction configuration

Set variables based on Table 2.

**Table 2** Data instruction variable configuration
Variable Type	Variable Name	Data Type	Variable Description
Input variables	topic	string	Topic
Input variables	num	string	Maximum number of characters
Output variables	output	string	Prose

The large model reads the variable description of an output variable to understand what to generate, so fill in this field carefully.
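
To make the role of the "{{}}" variable identifier concrete, the following Python sketch finds the variables in the example instruction and fills them with values. It is a minimal illustration under stated assumptions and does not describe how the platform itself parses instructions. The `identify_variables` and `render` functions and the sample values are hypothetical.

```python
import re
from typing import Dict, List

# Matches {{name}}, allowing optional spaces inside the braces (an assumption).
VARIABLE = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def identify_variables(template: str) -> List[str]:
    """Return variable names in order of first appearance, without duplicates."""
    names: List[str] = []
    for name in VARIABLE.findall(template):
        if name not in names:
            names.append(name)
    return names


def render(template: str, values: Dict[str, str]) -> str:
    """Replace every {{name}} with its value; fail loudly if a value is missing."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value for variable {name!r}")
        return values[name]

    return VARIABLE.sub(substitute, template)


template = (
    "Use {{topic}} as the subject and write a prose passage "
    "with no more than {{num}} characters."
)
variables = identify_variables(template)  # ['topic', 'num']
prompt = render(template, {"topic": "autumn rain", "num": "200"})
```

Both values are passed as strings, which matches the string data type listed for topic and num in Table 2.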

Figure 11 Configuring variables

Commission data instructions.
- In the Commissioning > Model area, select the model required by the instruction and click Configure Hyperparameters to customize hyperparameter values.
  - Temperature: sampling temperature parameter. A higher value, such as 0.8, makes the output more random, while a lower value, such as 0.2, makes it more focused and deterministic. The value range is 0 to 1.
  - Diversity: nucleus sampling parameter, top_p. The model considers only the tokens that make up the top_p probability mass. For example, 0.1 indicates that only the tokens comprising the top 10% of the probability mass are considered. Set either Temperature or Diversity, not both. The value range is 0 to 1.
  - Repetition Punishment: penalty applied to repeated sampling. A higher value indicates a heavier penalty. Set this parameter to reduce the likelihood that the model repeats itself. The value range is [-2.0, 2.0].
  - Sampling: sampling parameter, top_k. In each round of token generation, the k tokens with the highest probability are retained as candidates. A larger value produces more varied text.
- In the Commissioning > Input area, you can view the effect by assigning values to variables.
  Figure 12 Instruction commissioning
After the commissioning is complete, click Create Now to create the data instruction.
The created data instruction is displayed on the Manage Synthetic Instructions > Custom page.
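
The Temperature, Diversity (top_p), and Sampling (top_k) hyperparameters described in the commissioning step can be illustrated with a toy next-token distribution. The following Python sketch shows the general technique only and makes no claim about how the deployed models implement it. The function names and the three-token vocabulary are hypothetical, and the sketch requires a temperature greater than 0. Repetition Punishment is left out because this section does not specify how the penalty is applied.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("this sketch needs temperature > 0")
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(score - peak) for tok, score in scaled.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def _renormalize(probs: Dict[str, float]) -> Dict[str, float]:
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}


def top_k_filter(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable tokens as candidates."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return _renormalize(dict(ranked[:k]))


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most probable tokens until their cumulative mass reaches top_p."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    return _renormalize(kept)


logits = {"rain": 2.0, "sun": 1.0, "snow": 0.0}  # made-up scores
focused = apply_temperature(logits, 0.2)  # mass concentrates on "rain"
varied = apply_temperature(logits, 0.8)   # mass is spread more evenly
nucleus = top_p_filter(varied, 0.1)       # only the single most probable token survives
two_best = top_k_filter(varied, 2)        # "rain" and "sun" remain as candidates
```

A token would then be drawn at random from the filtered distribution, so a smaller top_p or top_k narrows the set of possible outputs, while a higher temperature widens it.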