Using Data Engineering to Build a DeepSeek Model Dataset
Process of Building a DeepSeek Model Dataset
Table 1 describes how to use data engineering to build a third-party model dataset on ModelArts Studio.
Procedure | Step | Description | Reference |
---|---|---|---|
Importing data to the Pangu platform | Creating an import task | Import data stored in OBS or local data into the platform for centralized management, facilitating subsequent processing or publishing. NOTE: When importing a dataset, set the dataset type to Single Round QA. | |
Processing other datasets | Processing other datasets | Use custom processing operators to preprocess data, ensuring it meets the model training standards and service requirements. | |
Publishing other datasets | Publishing other datasets | Data publishing refers to publishing a single dataset in a specific format as a published dataset for subsequent model training operations. | |
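Before an import task is created, local Q&A data has to exist as a file that can be imported. The following is a minimal sketch of one way to write such a file as JSONL, assuming the context and target field names listed under Requirements on DeepSeek Datasets. The helper name write_single_round_qa is hypothetical, and the sketch does not call any platform or OBS API; uploading the file is not shown.

```python
import json
from pathlib import Path


def write_single_round_qa(pairs, path):
    """Write (question, answer) pairs as JSONL, one JSON object per line.

    Hypothetical helper: the field names follow the single-turn Q&A
    format described in this document; no platform API is involved.
    """
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for question, answer in pairs:
            record = {"context": question, "target": answer}
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

For example, write_single_round_qa([("Hello, please introduce yourself.", "I am a Pangu model.")], "single_round_qa.jsonl") produces a one-line file that matches the fine-tuning example in the requirements table.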
Requirements on DeepSeek Datasets
Model Type | Training Type | Data Volume | Dataset Format | Description |
---|---|---|---|---|
DeepSeek-R1-32K, DeepSeek-V3-32k | Pre-training | > 15B tokens | JSONL | JSONL format: text indicates the text data used for pre-training. The following is an example: {"text":"Pangu models include the NLP model, multimodal model, CV model, scientific computing model, and prediction model."} |
DeepSeek-V3-32k | Fine-tuning (single-turn Q&A) | 10,000 to 1,000,000 data records | JSONL | JSONL format: The data consists of Q&A pairs. context and target indicate the question and answer, respectively. The following is an example: {"context":"Hello, please introduce yourself.","target":"I am a Pangu model."} |
DeepSeek-R1-32K | Fine-tuning (single-turn Q&A) | 10,000 to 1,000,000 data records | JSONL | JSONL format: The data consists of Q&A pairs. context and target indicate the question and answer, respectively. In addition to the model answer, target should include <think>\nReasoning process\n</think>, which indicates the reasoning process of the model. Format: {"context":"Question","target":"<think>\nReasoning process\n</think>\n\nModel answer"} Example: {"context":"What is 2 + 2?","target":"<think>\nAccording to the basic arithmetic definition, the sum of 2 and 2 is the combination of two values.\nVerification is performed through object counting (e.g., two apples plus two apples) or by shifting values on a number line (e.g., from 0 to 2 to 4).\nThis value adheres to the natural number axiomatic system. For example, the successor of 2 is 3, and the successor of 3 is 4.\n</think>\n\n**Answer:** 2 + 2 = 4"} |
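The format requirements above can be checked locally before a dataset is imported. The following is a minimal sketch of such a check, not a platform tool: the function names and the kind labels ("pretrain", "sft", "sft_r1") are hypothetical. It also assumes that the reasoning block must appear at the start of target, as in the DeepSeek-R1 example. The pre-training token count is not checked, because that would require the model tokenizer.

```python
import json

THINK_OPEN = "<think>\n"
THINK_CLOSE = "\n</think>"


def check_record(line, kind):
    """Return a list of problems found in one JSONL line (empty if none).

    kind is a hypothetical label: "pretrain", "sft", or "sft_r1".
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]

    required = ["text"] if kind == "pretrain" else ["context", "target"]
    problems = []
    for key in required:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")

    if kind == "sft_r1" and not problems:
        target = record["target"]
        # Assumption: the reasoning block leads the target and the
        # model answer follows it, as in the documented example.
        if not target.startswith(THINK_OPEN) or THINK_CLOSE not in target:
            problems.append("target lacks <think>...</think> reasoning block")
        elif not target.split(THINK_CLOSE, 1)[1].strip():
            problems.append("target has no answer after </think>")
    return problems


def check_volume(num_records, low=10_000, high=1_000_000):
    """True if a fine-tuning dataset size is within the documented range."""
    return low <= num_records <= high
```

Running check_record on every line of a file and check_volume on the line count gives a quick pass/fail signal; any non-empty problem list points to a record worth fixing before import.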