Obtaining Source Data

Common Dataset Types

Fine-tuning datasets consist of Q&A data and fall into two categories: general datasets (language understanding, programming, mathematics, and logical reasoning) and industry datasets (law, healthcare, finance, and so on).

Data Acquisition Method

Open-source datasets:
- General datasets
  - Chinese SmolTalk dataset
    smoltalk-chinese is a Chinese fine-tuning dataset modeled on the SmolTalk dataset. It contains more than 700,000 synthetic records and is designed to improve the performance, diversity, and adaptability of Chinese LLMs across a range of tasks.
    
    Download link:
    
    https://modelscope.cn/datasets/opencsg/smoltalk-chinese/summary
  - OpenThoughts3-1.2M
    OpenThoughts3-1.2M is the result of a rigorous experimental pipeline that ablates design choices in question sourcing, question selection, and answer generation. The final dataset consists of 850,000 math questions, 250,000 code questions, and 100,000 science questions.
    
    Download link:
    
    https://modelscope.cn/datasets/open-thoughts/OpenThoughts3-1.2M
  - SYNTHETIC-1
    SYNTHETIC-1 is a reasoning dataset generated by DeepSeek-R1 using crowdsourced compute and annotated by diverse verifiers (such as LLM judges or symbolic math verifiers).
    
    Download link:
    
    https://modelscope.cn/datasets/PrimeIntellect/SYNTHETIC-1
- Industry datasets
  - Fino1_Reasoning_Path_FinQA
    Fino1 is a financial reasoning dataset based on FinQA, with GPT-4o-generated reasoning paths to enhance structured financial question answering.
    
    Download link:
    
    https://modelscope.cn/datasets/TheFinAI/Fino1_Reasoning_Path_FinQA
  - OpenFinData
    OpenFinData is an open-source financial evaluation dataset jointly released by EastMoney.com and Shanghai AI Lab. It is built on the diverse financial services of EastMoney.com and aims to reflect realistic industry scenarios with comprehensive, professional coverage. It provides high-quality data resources for researchers and developers in financial technology.
    
    Download link:
    
    https://modelscope.cn/datasets/Shanghai_AI_Laboratory/open-compass-OpenFinData/summary
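
The download links above can also be used programmatically. The following is a minimal sketch, assuming the `modelscope` Python package is installed and that its `MsDataset.load` call accepts a `namespace/name` dataset ID. The helper names `dataset_id_from_url` and `load_from_modelscope` are hypothetical, and the `train` split is an assumption: some datasets use other split names or also need a subset name, so check the dataset page first.

```python
from urllib.parse import urlparse


def dataset_id_from_url(url: str) -> str:
    """Extract 'namespace/name' from a ModelScope dataset page URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected path shape: /datasets/<namespace>/<name>[/summary]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a ModelScope dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def load_from_modelscope(url: str, split: str = "train"):
    """Download one split. Needs `pip install modelscope` and network access.

    The split name is an assumption; adjust it to match the dataset page.
    """
    # Imported lazily so the URL helper works without the package installed.
    from modelscope.msdatasets import MsDataset

    return MsDataset.load(dataset_id_from_url(url), split=split)


if __name__ == "__main__":
    url = "https://modelscope.cn/datasets/opencsg/smoltalk-chinese/summary"
    print(dataset_id_from_url(url))  # opencsg/smoltalk-chinese
```
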
Self-Instruct: uses a language model to generate diverse or similar instructions from a small set of seed instructions.
Evolve-Instruct: rewrites existing seed instructions into more complex instructions.
SelfQA: automatically constructs Q&A pairs from unlabeled text.
Web page Q&A pair mining: mines real user questions from Q&A web pages.
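
The Self-Instruct idea can be sketched as a small loop: sample a few instructions from the pool, ask a model for a new one, and keep it only if it is not a near-duplicate. This is an illustrative sketch, not a reference implementation. The `generate` callable is a hypothetical stand-in for whatever model API is used, the prompt wording is an assumption, and the similarity filter uses `difflib` from the Python standard library as a simple substitute for a text-overlap metric.

```python
import random
from difflib import SequenceMatcher
from typing import Callable, List


def is_novel(candidate: str, pool: List[str], threshold: float = 0.7) -> bool:
    """Return True if the candidate is not too similar to any pooled instruction."""
    return all(
        SequenceMatcher(None, candidate.lower(), seen.lower()).ratio() < threshold
        for seen in pool
    )


def self_instruct(
    seeds: List[str],
    generate: Callable[[str], str],  # hypothetical model call: prompt -> text
    rounds: int = 3,
    shots: int = 2,
    seed: int = 0,
) -> List[str]:
    """Grow an instruction pool from seed instructions; return only the new ones."""
    rng = random.Random(seed)
    pool = list(seeds)
    for _ in range(rounds):
        # Show the model a few existing instructions as in-context examples.
        examples = rng.sample(pool, k=min(shots, len(pool)))
        prompt = "Write one new task instruction in the style of:\n" + "\n".join(
            f"- {e}" for e in examples
        )
        candidate = generate(prompt).strip()
        # Discard empty outputs and near-duplicates of anything already pooled.
        if candidate and is_novel(candidate, pool):
            pool.append(candidate)
    return pool[len(seeds):]
```

Accepted instructions join the pool, so later rounds can sample them as examples. That feedback is what lets a small seed set expand into a more diverse one.
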

In industry-specific incremental training, the most common goals are to strengthen domain knowledge or to accomplish specific tasks. Domain knowledge can be learned by applying SelfQA to professional books or by mining real user questions from related industry forums. For industry-specific tasks, Self-Instruct can be used to expand a set of seed instructions.
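
Applying SelfQA to a professional book can be sketched in two steps per passage: ask a model to write a question the passage answers, then ask it to answer that question using only the passage. The sketch below is a simplified illustration under assumptions. The `ask` callable is a hypothetical stand-in for a model API, the prompt wording is invented for this example, and passages are split on blank lines.

```python
from typing import Callable, Dict, List


def split_passages(text: str, max_chars: int = 500) -> List[str]:
    """Group blank-line-separated paragraphs into passages of about max_chars.

    A single paragraph longer than max_chars is kept whole, not cut.
    """
    passages: List[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if current and len(current) + len(para) + 1 > max_chars:
            passages.append(current)
            current = para
        else:
            current = f"{current}\n{para}" if current else para
    if current:
        passages.append(current)
    return passages


def self_qa(
    text: str,
    ask: Callable[[str], str],  # hypothetical model call: prompt -> text
    max_chars: int = 500,
) -> List[Dict[str, str]]:
    """Build Q&A pairs from unlabeled text, one pair per passage."""
    pairs: List[Dict[str, str]] = []
    for passage in split_passages(text, max_chars):
        # Step 1: generate a question grounded in the passage.
        question = ask(
            f"Read the text and write one question it answers.\n\nText:\n{passage}"
        ).strip()
        if not question:
            continue
        # Step 2: answer that question using only the same passage.
        answer = ask(
            "Answer the question using only the text.\n\n"
            f"Text:\n{passage}\n\nQuestion: {question}"
        ).strip()
        if answer:
            pairs.append({"question": question, "answer": answer})
    return pairs
```

In practice the generated pairs would still need quality filtering before they are used for fine-tuning.
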
