Updated on 2025-07-28 GMT+08:00

Using Data Engineering to Build a DeepSeek Model Dataset

Process of Building a DeepSeek Model Dataset

Table 1 describes how to use data engineering to build a third-party model dataset on ModelArts Studio.

Table 1 Process of building a third-party model dataset

| Procedure | Step | Description | Reference |
| --- | --- | --- | --- |
| Importing data to the Pangu platform | Creating an import task | Import data stored in OBS or locally into the platform for centralized management and for subsequent processing or publishing. NOTE: When importing a dataset, set the dataset type to Single Round QA. | Importing Data to the Pangu Platform |
| Processing other datasets | Processing other datasets | Use custom processing operators to preprocess the data so that it meets model training standards and service requirements. | Processing Other Datasets |
| Publishing other datasets | Publishing other datasets | Publish a single dataset in a specific format as a published dataset that can be used for subsequent model training. | Publishing Other Datasets |
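Before importing local data, it can help to tidy the Q&A pairs offline. The following is a minimal sketch, not the platform's processing-operator interface: it assumes the records use the context and target field names listed under the dataset requirements on this page, and the helper names clean_qa_records and to_jsonl are hypothetical.

```python
import json


def clean_qa_records(records):
    """Trim whitespace, drop incomplete pairs, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        context = (record.get("context") or "").strip()
        target = (record.get("target") or "").strip()
        if not context or not target:
            continue  # skip pairs with a missing question or answer
        key = (context, target)
        if key in seen:
            continue  # skip exact duplicates
        seen.add(key)
        cleaned.append({"context": context, "target": target})
    return cleaned


def to_jsonl(records):
    """Serialize records as JSONL: one compact JSON object per line."""
    return "\n".join(
        json.dumps(r, ensure_ascii=False, separators=(",", ":")) for r in records
    )


# Illustrative input only: a padded pair, its duplicate, and an incomplete pair.
raw = [
    {"context": " Hello, please introduce yourself. ", "target": "I am a Pangu model."},
    {"context": "Hello, please introduce yourself.", "target": "I am a Pangu model."},
    {"context": "", "target": "An answer without a question."},
]
cleaned = clean_qa_records(raw)
print(to_jsonl(cleaned))
```

Only the first pair survives here; the second is a duplicate after trimming, and the third has no question. The resulting text can be saved as a .jsonl file and then imported with the dataset type set to Single Round QA.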

Requirements on DeepSeek Datasets

Table 2 Dataset format for DeepSeek models

| Model Type | Training Type | Data Volume | Dataset Format | Description |
| --- | --- | --- | --- | --- |
| DeepSeek-R1-32K, DeepSeek-V3-32k | Pre-training | > 15B tokens | JSONL | The text field holds the text data used for pre-training. Example: {"text":"Pangu models include the NLP model, multimodal model, CV model, scientific computing model, and prediction model."} |
| DeepSeek-V3-32k | Fine-tuning (single-turn Q&A) | 10,000 to 1,000,000 data records | JSONL | The data consists of Q&A pairs. context and target indicate the question and answer, respectively. Example: {"context":"Hello, please introduce yourself.","target":"I am a Pangu model."} |
| DeepSeek-R1-32K | Fine-tuning (single-turn Q&A) | 10,000 to 1,000,000 data records | JSONL | The data consists of Q&A pairs. context and target indicate the question and answer, respectively. In addition to the model answer, target should include <think>\nReasoning process\n</think>, which indicates the reasoning process of the model. Template: {"context":"Question","target":"<think>\nReasoning process\n</think>\n\nModel answer"} Example: {"context":"What is 2 + 2?","target":"<think>\nAccording to the basic arithmetic definition, the sum of 2 and 2 is the combination of two values.\nVerification is performed through object counting (e.g., two apples plus two apples) or by shifting values on a number line (e.g., from 0 to 2 to 4).\nThis value adheres to the natural number axiomatic system. For example, the successor of 2 is 3, and the successor of 3 is 4.\n</think>\n\n**Answer:** 2 + 2 = 4"} |
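The formats in Table 2 can be checked locally before a dataset is imported. The following is a minimal sketch rather than a platform tool: the function names and the kind labels ("pretrain", "sft", "sft_r1") are hypothetical, and it assumes the DeepSeek-R1 template (<think>, reasoning, </think>, a blank line, then the answer) should be matched strictly.

```python
import json
import re

# Layout of the DeepSeek-R1 target field as shown in Table 2:
# <think>\nReasoning process\n</think>\n\nModel answer
R1_TARGET = re.compile(r"^<think>\n(.+?)\n</think>\n\n(.+)$", re.DOTALL)


def build_r1_record(question, reasoning, answer):
    """Assemble one R1 fine-tuning record with the reasoning wrapped in <think> tags."""
    target = f"<think>\n{reasoning}\n</think>\n\n{answer}"
    return {"context": question, "target": target}


def check_line(line, kind):
    """Return the problems found in one JSONL line (an empty list means OK).

    kind is a hypothetical label: "pretrain", "sft", or "sft_r1".
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]

    # Pre-training records need "text"; Q&A records need "context" and "target".
    required = ["text"] if kind == "pretrain" else ["context", "target"]
    problems = []
    for field in required:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    if kind == "sft_r1" and not problems and not R1_TARGET.match(record["target"]):
        problems.append("target lacks <think>...</think> followed by an answer")
    return problems


record = build_r1_record(
    "What is 2 + 2?",
    "The successor of 3 is 4.",
    "**Answer:** 2 + 2 = 4",
)
line = json.dumps(record, ensure_ascii=False)
print(check_line(line, "sft_r1"))
print(check_line('{"context":"Hi","target":"Hello"}', "sft_r1"))
```

The first call reports no problems. The second flags the record because its target has no reasoning block, although the same line would pass as an "sft" record for DeepSeek-V3-32k.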