Updated on 2025-11-04 GMT+08:00

Training Data Description

Supported Data Formats

The MindSpeed-LLM and LLaMA-Factory frameworks support the following common dataset formats:

  • Alpaca
  • ShareGPT
  • MOSS (only for MindSpeed-LLM)


Custom Data

  1. LLaMA-Factory: the dataset is included in the code package. For details about how to modify it, see README_zh.md.
  2. MindSpeed-LLM: The data formats are as follows:
    • Pre-training data: You can also prepare your own pre-training data. The requirements are as follows:

      The data must be in standard .json format. Use --json-key to select the columns used for training; the default text field name is text. For example, the Wikipedia dataset contains four columns (id, url, title, and text), and --json-key specifies which of them to train on. For details, see Processing a Pre-training Dataset. A conversion sketch follows the sample record below.

      {
          "id": "1",
          "url": "https://simple.wikipedia.org/wiki/April",
          "title": "April",
          "text": "April is the fourth month..."
      }
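      For example, a custom corpus can be wrapped into this structure before preprocessing. The following is a minimal sketch; the raw_corpus directory, the .txt extension, and the output file name are assumptions for illustration, not values fixed by the preprocessing script.

      import json
      from pathlib import Path

      # Minimal sketch: wrap raw .txt files into records whose "text" column
      # matches the default --json-key. Paths here are placeholders.
      records = []
      for i, path in enumerate(sorted(Path("raw_corpus").glob("*.txt")), 1):
          records.append({"id": str(i), "text": path.read_text(encoding="utf-8")})

      with open("pretrain_data.json", "w", encoding="utf-8") as f:
          json.dump(records, f, ensure_ascii=False, indent=2)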
    • MOSS instruction fine-tuning data: This case also supports MOSS data in standard .json format, covering both multi-turn dialogues and instruction Q&A. The following is an example; a helper for assembling such records is sketched after it.
      {
        "conversation_id": 1,
        "meta_instruction": "",
        "num_turns": 3,
        "chat": {
          "turn_1": {
            "Human": "<|Human|>: How do you guarantee adherence to proper security guidelines during work?<eoh>\n",
            "Inner Thoughts": "<|Inner Thoughts|>: None<eot>\n",
            "Commands": "<|Commands|>: None<eoc>\n",
            "Tool Responses": "<|Results|>: None<eor>\n",
            "MOSS": "<|MOSS|>: To ensure xxx, the following suggestions are provided:\n\n1. Understand related best practices.\n\nThese xxx environments.<eom>\n"
          },
          "turn_2": { ... },
          "turn_3": { ... }
        },
        "category": "Brainstorming"
      }
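      The special tokens and per-turn fields make MOSS records tedious to write by hand, so a small helper can assemble them from plain question-answer pairs. This is a minimal sketch based only on the sample above; the helper name and its defaults are illustrative.

      import json

      # Sketch: build a MOSS-format record from plain (question, answer) turns.
      # Field names and special tokens (<eoh>, <eom>, ...) follow the sample above.
      def moss_record(conversation_id, turns, category=""):
          chat = {}
          for i, (question, answer) in enumerate(turns, 1):
              chat[f"turn_{i}"] = {
                  "Human": f"<|Human|>: {question}<eoh>\n",
                  "Inner Thoughts": "<|Inner Thoughts|>: None<eot>\n",
                  "Commands": "<|Commands|>: None<eoc>\n",
                  "Tool Responses": "<|Results|>: None<eor>\n",
                  "MOSS": f"<|MOSS|>: {answer}<eom>\n",
              }
          return {
              "conversation_id": conversation_id,
              "meta_instruction": "",
              "num_turns": len(turns),
              "chat": chat,
              "category": category,
          }

      print(json.dumps(moss_record(1, [("Hi", "Hello!")]), ensure_ascii=False, indent=2))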
    • Alpaca instruction fine-tuning data: For details, see the Alpaca-style dataset. The dataset contains the following fields; a conversion sketch follows the example.
      • instruction: the task the model should perform. Each instruction is unique.
      • input: optional context or input for the task. When present, it is concatenated with the instruction into a single prompt: instruction\ninput.
      • output: the expected model answer to the instruction.
      • system: system prompt. It is used to set a scenario or provide guidelines for the entire dialogue.
      • history: a list that stores past conversations as pairs of user messages and model responses. This helps maintain consistency and coherence in the dialogue.
      [
          {
              "instruction": "Human instruction (required)",
              "input": "Human input (optional)",
              "output": "Model answer (required)",
              "system": "system prompt (optional)",
              "history": [
                  ["First round instruction (optional)," "First round answer (optional)"],
                  ["Second round instruction (optional)," "Second round answer (optional)"]
              ]
          }
      ]
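      Plain prompt-response pairs can be converted into this format with a few lines of code. The sketch below is illustrative; the pairs list and the output file name are assumptions.

      import json

      # Sketch: turn plain (prompt, response) pairs into Alpaca-style records.
      pairs = [("Summarize the following article.", "The article argues that...")]

      alpaca = [
          {"instruction": prompt, "input": "", "output": response}
          for prompt, response in pairs
      ]

      with open("alpaca_data.json", "w", encoding="utf-8") as f:
          json.dump(alpaca, f, ensure_ascii=False, indent=2)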
    • ShareGPT instruction fine-tuning data: The ShareGPT format originates from conversations between ChatGPT and its users. Because it gathers multi-turn exchanges that mimic real user-AI interactions, it is well suited to training dialogue systems. For details, see the ShareGPT dataset. The dataset contains the following fields; a conversion sketch follows the example.
      • conversations: a list of dialogue objects, each consisting of a speaker (from) and the speech content (value).
      • from: the role of the speaker. Besides human and gpt, the example below also uses function_call and observation for tool interactions.
      • value: the specific dialogue content.
      • system: system prompt. It is used to set a scenario or provide guidelines for the entire dialogue.
      • tools: describes the available external tools or functions, which the model may use to perform certain tasks or obtain more information.
      [
          {
              "conversations": [
                  {
                      "from": "human",
                      "value": "human instruction"
                  },
                  {
                      "from": "function_call",
                      "value": "tool parameter"
                  },
                  {
                      "from": "observation",
                      "value": "tool result"
                  },
                  {
                      "from": "gpt",
                      "value": "model answer"
                  }
              ],
              "system": "system prompt (optional)",
              "tools": "tool description (optional)"
          }
      ]
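      Because both formats are plain JSON, converting between them is mechanical. The following sketch maps Alpaca-style records to ShareGPT-style conversations; the file names are placeholders, and only the fields shown in the two examples are assumed.

      import json

      # Sketch: convert an Alpaca-style record into a ShareGPT-style one,
      # using only the fields shown in the examples above.
      def alpaca_to_sharegpt(record):
          prompt = record["instruction"]
          if record.get("input"):
              prompt += "\n" + record["input"]
          return {
              "conversations": [
                  {"from": "human", "value": prompt},
                  {"from": "gpt", "value": record["output"]},
              ],
              "system": record.get("system", ""),
          }

      with open("alpaca_data.json", encoding="utf-8") as f:
          converted = [alpaca_to_sharegpt(r) for r in json.load(f)]

      with open("sharegpt_data.json", "w", encoding="utf-8") as f:
          json.dump(converted, f, ensure_ascii=False, indent=2)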
  3. MindSpeed-MM: The data format is as follows:
    • Qwen2.5VL-3b/7b: Download the COCO2017 dataset by referring to Preparing and Processing a Dataset and convert it with the provided scripts, or preprocess a custom dataset into the following format (a conversion sketch follows the example):
    {
      "id": 1,
      "image": "image path",
      "conversations": [
          {"from": "human", "value": "human instruction"},
          {"from": "gpt", "value": "model answer"}
      ]
    }
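    A custom image-caption dataset can be preprocessed into this format as follows. This is a minimal sketch; the directory layout (one .txt caption per .jpg image), the fixed human prompt, and the output file name are assumptions for illustration.

    import json
    from pathlib import Path

    # Sketch: build records in the format above from image/caption file pairs.
    records = []
    for i, img in enumerate(sorted(Path("images").glob("*.jpg")), 1):
        caption = img.with_suffix(".txt").read_text(encoding="utf-8").strip()
        records.append({
            "id": i,
            "image": str(img),
            "conversations": [
                {"from": "human", "value": "Describe this image."},
                {"from": "gpt", "value": caption},
            ],
        })

    with open("mm_data.json", "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)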