Updated on 2025-07-02 GMT+08:00

Creating a Data Synthesis Task

Description

Data synthesis uses a preset or custom data instruction to process the original data and generate new data over a specified number of epochs (synthesis rounds).

Currently, data synthesis supports the following text dataset types: single-turn Q&A, single-turn Q&A (persona), Q&A ranking, DPO, and DPO (persona).
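The idea of generating new data over a number of rounds can be illustrated with a minimal sketch. It assumes that each round applies one instruction once to every source record; the platform's actual scheduling is not described here. The names `synthesize` and `rewrite_question_stub` and the `question`/`answer` fields are hypothetical and do not come from the platform, and the stub stands in for a model call.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def synthesize(
    source: List[Record],
    instruction: Callable[[Record, int], Record],
    rounds: int,
) -> List[Record]:
    """Apply one instruction to every source record, once per round."""
    generated: List[Record] = []
    for round_index in range(rounds):
        for record in source:
            generated.append(instruction(record, round_index))
    return generated


def rewrite_question_stub(record: Record, round_index: int) -> Record:
    # Hypothetical stand-in for a model call: it only tags the question
    # with the round number so that each round yields a distinct record.
    return {
        "question": f"{record['question']} (variant {round_index + 1})",
        "answer": record["answer"],
    }


source_records = [
    {"question": "What is data synthesis?", "answer": "Generating new data from existing data."},
    {"question": "What is DPO?", "answer": "Direct Preference Optimization."},
]

synthetic_records = synthesize(source_records, rewrite_question_stub, rounds=3)
print(len(synthetic_records))  # 2 source records x 3 rounds = 6
```

Under this assumption the number of generated records grows linearly with the number of rounds, which is why the expected number of synthetic records is set when the task is created.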

Procedure

  1. On the Create data synthesis task page, set Synthesis content and Expected number of synthetic data records.
  2. If the source dataset has the same structure as the synthesized dataset, you can enable Integrate the source data set into the synthesized data. When this option is enabled, the generated data is merged with the source dataset after all synthesis rounds are complete. Click Next.
  3. On the Synthesis step orchestration page, select preset or custom instructions in the Add Instruction pane on the left.
    • Preset instructions. The platform provides multiple preset instructions for you to execute synthesis tasks. For details, see Table 1.
      Table 1 List of preset data instructions

      | Category | Instruction Name | Description |
      | --- | --- | --- |
      | Generate Questions | Rewrite the question to a less difficult one | This instruction enables the model to generate a simpler, less difficult question based on the question entered by the user. |
      | Generate Questions | Rewriting the question to a more difficult one | This instruction enables the model to generate a more complex, more difficult question based on the question entered by the user. |
      | Generate Questions | Generating answering requirements based on the questions | This instruction generates answering requirements for an input question. The requirements are not directly related to the content of the original question. It can be orchestrated with an instruction that answers a question according to answering requirements, to synthesize answers in various styles. |
      | Generate Questions | Generating similar questions based on samples_few-shot | This instruction uses multiple question examples entered by the user to generate one or more new questions that match the sample style. |
      | Generate Questions | Generate questions based on text | This instruction generates a question based on the context entered by the user. It can be used in orchestration to synthesize text-based Q&A pairs. |
      | Generate Questions | Question rewriting | This instruction rewrites a question to generate more complex questions, which can be used for instruction generalization. |
      | Generate Answers | Answer rewriting | This instruction modifies the answer style based on the user-specified settings without changing the answer content. It can be orchestrated with instructions that use persona settings to generalize Q&A pairs. |
      | Generate Answers | Generating answers based on text_complying with requirements | This instruction generates answers based on the input context and the instructions and questions specified by users. It can be orchestrated with other instructions to generalize factual Q&A pairs. |
      | Generate Answers | Generating an answer to a question | This instruction generates answers based on questions. |
      | Generate Answers | Generating an answer based on the text_with a designated persona | This instruction generates answers based on the input context and the user-specified persona settings and questions. It can be orchestrated with instructions that use persona settings to generalize factual Q&A pairs. |
      | Generate Answers | Generating questions and answers based on questions_with a designated persona | This instruction generates questions and answers based on the user-specified persona settings and questions. It can be orchestrated with instructions that use persona settings to generalize Q&A pairs. |
      | Generate Answers | Generating answers based on text | This instruction generates answers based on the context and questions entered by users. It can be used to synthesize factual answers. |
      | Generating Q&A pairs | Generating Q&A pairs based on text_true or false | This instruction constructs a true-or-false question from the text provided by the user and provides the correct answer. |
      | Generating Q&A pairs | Generating Q&A pairs based on text_fill-in-blank | This instruction constructs a fill-in-the-blank question from the text provided by the user and provides the correct answer. |
      | Generating Q&A pairs | Generating Q&A pairs based on text_single-answer question | This instruction constructs a single-answer question with four options from the text provided by the user and provides the correct answer. |
      | Generating Q&A pairs | Generating Q&A pairs based on text_multiple-answer question | This instruction constructs a multiple-answer question with four options from the text provided by the user and provides the correct answer. |
      | Generating Q&A pairs | Generating Q&A pairs based on text_Q&A question | This instruction constructs a question from the text provided by the user and provides the correct answer. |
      | Generating Q&A pairs | Extracting Q&A pairs based on text_financial scenario | This instruction extracts Q&A pairs from financial documents entered by users. |
      | Generating a persona setting | Generating a persona setting based on a question | This instruction generates a persona setting based on the question entered by the user. |
      | Others | Generalizing bad cases | This instruction uses the bad-case questions and answers provided by users to generate attack questions that may cause the model to make mistakes in similar scenarios. Users can specify the number of attack questions to generate, up to a maximum of 10. |
      | Others | Deriving the problem solving idea based on the answer | This instruction generates the corresponding problem-solving ideas based on the questions and answers entered by the user. |
      | Others | Instruction generalization | This instruction generalizes instructions based on the style specified by the user. It can be orchestrated with related instructions to produce Q&A pairs that match the specified requirements, implementing Q&A pair generalization. |

    • Custom instructions. The platform supports orchestration of custom instructions. To manage custom instructions, choose Data Engineering > Data Processing > Data Synthesis in the navigation pane, and then click Manage Synthetic Instructions on the displayed page.
  4. Configure the parameters for the selected instructions.

    Figure 1 shows an example of configuring synthesis instruction parameters for a pre-trained text dataset. The synthesis task uses the pre-trained text to generate Q&A pairs.

    Figure 1 Example of configuring parameters for combining pre-trained text data into an instruction
  5. After the instruction orchestration is complete, click Enable commissioning in the upper right corner to preview the instruction effect.
  6. After the instruction commissioning is complete, click Next in the lower right corner. Configure the following information:
    • Automatically Generate Processing Dataset: If this option is enabled, a processed dataset is generated automatically after the task is successfully executed, and it can then be used for publishing downstream datasets. If this option is disabled, you must generate the processed dataset manually from the processing task list.
    • Enter the dataset name and description.
    • (Optional) Enter extended information, including the industry, language, and custom information.
  7. Click Create and Start. The platform starts the data synthesis task.
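The procedure above is carried out in the console, but its logic can be sketched in code for clarity. The following is a minimal illustration, not the platform's implementation: it chains two stubbed steps modeled on the preset instructions "Generate questions based on text" and "Generating answers based on text", stops at the expected number of records, and merges the source data only when its structure matches the synthesized data. All function names and the `text`, `question`, and `answer` fields are hypothetical, and the stubs stand in for model calls.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Step = Callable[[Record], Record]


def generate_question_from_text(record: Record) -> Record:
    # Stub: a real step would ask a model to write a question about the text.
    return {**record, "question": f"What does this passage say? {record['text'][:30]}"}


def generate_answer_from_text(record: Record) -> Record:
    # Stub: a real step would ask a model to answer using the text as context.
    return {**record, "answer": record["text"]}


def keep_qa_fields(record: Record) -> Record:
    # Drop the source text so that the output is a plain Q&A pair.
    return {"question": record["question"], "answer": record["answer"]}


def run_pipeline(source: List[Record], steps: List[Step], expected_count: int) -> List[Record]:
    """Run the orchestrated steps in order, stopping at expected_count records."""
    results: List[Record] = []
    for record in source:
        if len(results) >= expected_count:
            break
        for step in steps:
            record = step(record)
        results.append(record)
    return results


def same_structure(a: List[Record], b: List[Record]) -> bool:
    """Return True when every record in both lists has the same set of keys."""
    return len({frozenset(record) for record in a + b}) <= 1


def merge_if_compatible(source: List[Record], synthetic: List[Record], integrate: bool) -> List[Record]:
    # Mirrors the condition in step 2: merge only when the option is enabled
    # and the source and synthesized structures match.
    if integrate and same_structure(source, synthetic):
        return source + synthetic
    return synthetic


text_source = [
    {"text": "Data synthesis generates new records from source data."},
    {"text": "Preset instructions are provided by the platform."},
    {"text": "Custom instructions can also be orchestrated."},
]
steps = [generate_question_from_text, generate_answer_from_text, keep_qa_fields]

qa_pairs = run_pipeline(text_source, steps, expected_count=2)
merged = merge_if_compatible(text_source, qa_pairs, integrate=True)
print(len(qa_pairs), len(merged))
```

In this example the source records contain only text while the synthesized records are Q&A pairs, so the structures differ and the source data is left out of the result even though integration is requested.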