Format Requirements for Other Datasets

In addition to text, image, video, and audio datasets, you can import other types of datasets: custom datasets used for model training, such as common open-source datasets in the Alpaca and ShareGPT formats.

Import from OBS: The size of a single file or compressed package cannot exceed 20 GB. If multiple files are imported, the total file size cannot exceed 20 GB.

Local upload: The size of a single file cannot exceed 1 GB, and a maximum of 20 files can be uploaded at a time.
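The local upload limits above can be checked before uploading. The following is a minimal sketch, not part of the product: the function names are hypothetical, and it assumes that 1 GB means 1024**3 bytes, which the source does not specify.

```python
import os

# Assumption: "1 GB" is interpreted as 1024**3 bytes; adjust if the service
# counts 10**9 bytes instead. All names here are illustrative.
MAX_FILE_BYTES = 1024 ** 3
MAX_FILES_PER_UPLOAD = 20


def check_local_upload(sizes):
    """Return a list of problems for one upload batch.

    `sizes` maps each file name to its size in bytes.
    """
    problems = []
    if len(sizes) > MAX_FILES_PER_UPLOAD:
        problems.append(
            f"too many files: {len(sizes)} > {MAX_FILES_PER_UPLOAD}"
        )
    for name, size in sizes.items():
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: {size} bytes exceeds the 1 GB limit")
    return problems


def sizes_from_paths(paths):
    """Collect on-disk sizes for real files so they can be checked."""
    return {path: os.path.getsize(path) for path in paths}
```

For files on disk, call check_local_upload(sizes_from_paths(paths)); an empty list means the batch is within both limits.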

This section describes the requirements for common open-source dataset formats.

Alpaca Dataset Format Requirements

Alpaca is a common dataset format used by open-source models (such as the DeepSeek and Qwen series) and is the main dataset format used for fine-tuning open-source models. It is especially used for instruction-tuning. The data format provides a clear task description (instruction), input, and output.

Typical Alpaca dataset format:

[
    {
        "instruction": "Human instruction (required)",
        "input": "Human input (optional)",
        "output": "Model answer (required)",
        "system": "System prompt (optional)",
        "history": [
            [
                "First-turn instruction (optional)",
                "First-turn answer (optional)"
            ],
            [
                "Second-turn instruction (optional)",
                "Second-turn answer (optional)"
            ]
        ]
    }
]

Field description:

instruction: Task instruction, which tells the model what operation needs to be performed.
input: Input required for the task. If the task is open-ended or does not require explicit input, this field can be an empty string.
output: Expected output of the task, which is the content that the model needs to generate given the instruction and input. To train a model that incorporates a chain-of-thought (CoT) or thinking process, you can wrap the reasoning process in <think> and </think> tags or prepend a prompt such as "Let's think step by step."
system: System prompt, which specifies the style or role. This field is optional.
history: A list of [instruction, answer] pairs, one for each turn of the earlier conversation. During supervised instruction fine-tuning, the answers in the historical messages are also used for model learning. This field is optional.
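The field rules above can be checked with a short script before import. This is a minimal sketch under stated assumptions: validate_alpaca_record is a hypothetical helper, not a platform API, and it treats "required" as "present and a non-empty string", which is stricter than the source states.

```python
import json

REQUIRED_FIELDS = ("instruction", "output")
OPTIONAL_TEXT_FIELDS = ("input", "system")


def validate_alpaca_record(record):
    """Return a list of problems found in one Alpaca-style record."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []
    # Assumption: required fields must be non-empty strings.
    for key in REQUIRED_FIELDS:
        value = record.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"'{key}' must be a non-empty string")
    # Optional text fields may be absent or empty, but must be strings.
    for key in OPTIONAL_TEXT_FIELDS:
        if key in record and not isinstance(record[key], str):
            problems.append(f"'{key}' must be a string when present")
    history = record.get("history", [])
    if not isinstance(history, list):
        problems.append("'history' must be a list")
    else:
        for index, turn in enumerate(history):
            is_pair = (
                isinstance(turn, (list, tuple))
                and len(turn) == 2
                and all(isinstance(item, str) for item in turn)
            )
            if not is_pair:
                problems.append(
                    f"history[{index}] must be an [instruction, answer] pair"
                )
    return problems


# Illustrative record; the reasoning is wrapped in <think> tags as described.
sample = {
    "instruction": "Translate the sentence into French.",
    "input": "Good morning.",
    "output": "<think>This is a greeting.</think>Bonjour.",
    "history": [["Translate 'thank you' into French.", "Merci."]],
}

# A dataset file is a JSON array of such records.
serialized = json.dumps([sample], ensure_ascii=False, indent=4)
```

An empty list from validate_alpaca_record(sample) means that the record matches the structure described above.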

Features:

The Alpaca data format is simple and easy to understand.
The task instruction and input content are separated, making it suitable for various natural language processing tasks, such as text generation, translation, and summarization.

ShareGPT Dataset Format Requirements

The ShareGPT format originates from a dataset that records conversations between ChatGPT and users. It is mainly used for training dialog systems. It gathers and organizes multi-turn exchanges that mimic real user-AI interactions. ShareGPT datasets support diverse role types, such as human, gpt, observation, and function_call. Messages are listed in the conversations field, each tagged with its role.

Typical ShareGPT dataset format:

[
    {
        "conversations": [
            {
                "from": "human",
                "value": "human instruction"
            },
            {
                "from": "function_call",
                "value": "tool parameter"
            },
            {
                "from": "observation",
                "value": "tool result"
            },
            {
                "from": "gpt",
                "value": "model answer"
            }
        ],
        "system": "system prompt (optional)",
        "tools": "tool description (optional)"
    }
]

Field description:

conversations: A list of conversations, including the role and content of each turn of conversation. This field is mandatory. The role fields are defined as follows:
- human: The instruction given by humans in a conversation.
- function_call: Tool calling. The tool is an API that provides a certain function.
- observation: The result of function_call.
- gpt: The answer provided by the model based on the instruction given by humans.
Note: Messages with the human and observation roles must be in odd positions, and messages with the gpt and function_call roles must be in even positions.
system: System prompt. This field is optional.
tools: A description of function_call. This field is optional.
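The role position rule can be verified mechanically. The sketch below is illustrative only: validate_sharegpt_record is a hypothetical helper, positions are counted from 1, and the tool-calling role is assumed to be spelled function_call, as in the sample format.

```python
# Roles allowed at odd and even positions (counted from 1).
ODD_ROLES = {"human", "observation"}
EVEN_ROLES = {"gpt", "function_call"}


def validate_sharegpt_record(record):
    """Return a list of problems found in one ShareGPT-style record."""
    conversations = record.get("conversations")
    if not isinstance(conversations, list) or not conversations:
        return ["'conversations' is mandatory and must be a non-empty list"]
    problems = []
    for position, message in enumerate(conversations, start=1):
        if not isinstance(message, dict):
            problems.append(f"position {position}: message must be an object")
            continue
        if not isinstance(message.get("value"), str):
            problems.append(f"position {position}: 'value' must be a string")
        role = message.get("from")
        allowed = ODD_ROLES if position % 2 == 1 else EVEN_ROLES
        if role not in allowed:
            problems.append(
                f"position {position}: role {role!r} is not allowed here, "
                f"expected one of {sorted(allowed)}"
            )
    return problems


# Same role order as the sample format above.
sample = {
    "conversations": [
        {"from": "human", "value": "human instruction"},
        {"from": "function_call", "value": "tool parameter"},
        {"from": "observation", "value": "tool result"},
        {"from": "gpt", "value": "model answer"},
    ],
    "system": "system prompt (optional)",
}
```

A record that starts with a gpt message, for example, is reported as a violation at position 1.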

Features:

The ShareGPT format is closer to the way humans interact with AI and is suitable for building and fine-tuning conversational models.

Suggestions

The Alpaca format is suitable for single-turn instruction-tuning, such as task-oriented dialogs, Q&A systems, or tool calls. Its structured design makes it easier for models to understand and respond to explicit instructions. It is often used for lightweight fine-tuning (such as LoRA fine-tuning) or basic capability training (such as text generation and translation).
The ShareGPT format focuses on multi-turn dialog scenarios. It records the interaction history between users and assistants through the conversations field. It is suitable for training conversational models (such as chatbots and customer service assistants) and performs better in tasks that require dialog coherence, such as context understanding, emotional dialogs, or complex reasoning.
The two formats can be used together, with the former enhancing basic capabilities and the latter improving interaction experience.
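When both formats are used together, it can help to bring Alpaca records into the ShareGPT structure so that one pipeline handles all data. The following is a rough sketch, not a platform feature: alpaca_to_sharegpt is a hypothetical helper, and joining instruction and input with a newline is an assumption about how the two fields should be combined.

```python
def alpaca_to_sharegpt(record):
    """Convert one Alpaca-style record into a ShareGPT-style record."""
    conversations = []
    # Earlier turns come first, each as a human/gpt pair.
    for past_instruction, past_answer in record.get("history", []):
        conversations.append({"from": "human", "value": past_instruction})
        conversations.append({"from": "gpt", "value": past_answer})
    # Assumption: instruction and non-empty input are joined by a newline.
    prompt = record["instruction"]
    if record.get("input"):
        prompt = f"{prompt}\n{record['input']}"
    conversations.append({"from": "human", "value": prompt})
    conversations.append({"from": "gpt", "value": record["output"]})
    converted = {"conversations": conversations}
    if record.get("system"):
        converted["system"] = record["system"]
    return converted


# Illustrative input record.
alpaca_record = {
    "instruction": "Summarize the text.",
    "input": "The meeting was moved to Friday.",
    "output": "The meeting is now on Friday.",
    "system": "You are a concise assistant.",
    "history": [["Say hello.", "Hello."]],
}
sharegpt_record = alpaca_to_sharegpt(alpaca_record)
```

The converted record keeps human messages in odd positions and gpt messages in even positions, which matches the ShareGPT position rule.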

