
Obtaining Source Data

Common Data Sources

Common data sources, typically collected as raw text or as files such as PDF and Word documents, include the following:

  • Web pages: Web data is abundant and widely available on the Internet and can be collected with a crawler. However, web page data often contains noise and may be in disordered formats, requiring thorough processing and filtering to extract high-quality, usable information.
  • Dialogs: Dialog data helps improve a model's conversational ability. It can be obtained from written dialogs, chat logs, forum posts, and social media comments, but it is more challenging to collect and process than other sources.
  • Books: Text from books is typically more formal, detailed, and lengthy, with a generally higher quality. This helps models accumulate rich linguistic knowledge and improve modeling of long-range semantic relationships. Such data can be obtained from e-book websites.
  • Code: Compared with natural language text, code is mainly presented in a structured programming language form. Training on code data can enhance a model's understanding of structured semantics and its ability to perform logical reasoning. You can download relevant datasets from programming Q&A communities such as Stack Exchange or open-source code websites such as GitHub and Gitee.
  • Academic papers: Academic papers help enhance an LLM's understanding of scientific knowledge. They can be downloaded from authoritative sources such as academic journals and CNKI.
  • Open-source datasets (a minimal loading sketch follows this list):
    • General datasets: General datasets provide large-scale Internet text data, making them highly suitable for pre-training models across a wide range of NLP tasks.
      • FineWeb Edu

        FineWeb Edu is a subset of FineWeb released by Hugging Face. It was built by training an educational-quality classifier on synthetic annotations generated by the Llama-3-70B-Instruct model and using it to filter FineWeb, yielding an educational web dataset of about 1.3 trillion tokens that outperforms all publicly accessible web datasets. Smaller samples of 10B, 100B, and 350B tokens are also provided for quick use.

        Released: June 2024

        Download link: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu/tree/main

      • OpenNewsArchive (open news library)

        OpenNewsArchive is jointly developed by several organizations, including OpenDataLab, Midu, and SenseTime. It contains 8.8 million news articles covering a range of topics and sources. Each article includes fields such as the title, content, release date, and language, and the dataset has been cleaned and deduplicated. The total size is about 11 GB, consisting mainly of Chinese data.

        Released: May 2024

        Download link: https://openxlab.org.cn/datasets/OpenDataLab/OpenNewsArchive

      • ChineseFinewebEdu

        The Chinese Fineweb Edu dataset is a carefully constructed, high-quality Chinese pre-training corpus designed for natural language processing tasks in the education field. It was extracted from massive raw data through strict filtering and deduplication, with a scoring model trained on a small amount of labeled data used to retain only high-value educational content, ensuring data quality and diversity. The final dataset contains about 90 million high-quality Chinese text records, with a total size of about 300 GB.

        Released: August 2024

        Download link: https://huggingface.co/datasets/opencsg/chinese-fineweb-edu/tree/main

      • CCI 3.0

        The Chinese Corpora Internet (CCI) 3.0 dataset was open-sourced to address the scarcity of high-quality, safe Chinese datasets. Building on the earlier CCI dataset, the developers expanded the data sources and adopted stricter cleaning methods to construct CCI 3.0. The dataset consists of high-quality, reliable Internet data from trusted sources that has undergone strict cleaning and deduplication, with targeted detection and filtering for content quality and safety. The released CCI 3.0 corpus is about 1,000 GB in size.

        Released: September 2024

        Download link: https://huggingface.co/datasets/BAAI/CCI3-Data/tree/main

      • CCI 3.0-HQ

        CCI 3.0-HQ is a high-quality 500 GB subset of CCI 3.0, built by the Beijing Academy of Artificial Intelligence (BAAI) using a novel two-stage hybrid filtering pipeline that significantly improves data quality. To evaluate its effectiveness, the developers trained a 0.5B-parameter model from scratch on 100B tokens drawn from various datasets; on 10 benchmarks in a zero-shot setting, it outperformed models trained on CCI 3.0, SkyPile, and WanjuanV1.

        Released: September 2024

        Download link: https://huggingface.co/datasets/BAAI/CCI3-HQ/tree/main

    • Domain-specific datasets:
      • IndustryCorpus

        IndustryCorpus is a pre-training dataset curated by BAAI that spans 18 industries, including healthcare, education, finance, and law. It aims to improve the performance of industry-specific models. The total size of the dataset is 3.4 TB. IndustryCorpus combines resources from multiple large-scale datasets such as WuDao; after 22 domain-specific processing steps were applied, the resulting dataset comprises 1 TB of high-quality Chinese data and 2.4 TB of English data.

        Released: June 2024

        Download link: https://huggingface.co/datasets/BAAI/IndustryCorpus/tree/main/IndustryCorpus

      • IndustryCorpus2

        IndustryCorpus2 is an upgraded version of IndustryCorpus. On top of the original data, it introduces additional high-quality sources, such as the Pile, BigCode, and open-web-math, adding mathematical and code data. To better fit the industry classification system, the developers redesigned the industry categories by combining the national economic industry classification (20 categories) issued by the National Bureau of Statistics with a world knowledge taxonomy, resulting in 31 categories that cover most mainstream industries. They also applied a combination of rule-based and model-based filtering, which greatly improved overall data quality. The resulting dataset comprises 1 TB of high-quality Chinese data and 2.2 TB of English data.

        Released: November 2024

        Download link: https://www.modelscope.cn/datasets/BAAI/IndustryCorpus2/files

      • YiZhao financial dataset

        The YiZhao dataset is a roughly 2 TB high-quality training dataset for financial models. It covers a broad range of financial events, market dynamics, financial products, and trading patterns. The raw data is processed with the cleaning tools open-sourced alongside the dataset, financial data classifiers, and security risk identification models to produce cleaned Chinese and English corpora that are strongly relevant to finance and aligned with socialist core values. The resulting datasets include a 936-GB Chinese text dataset, a 100-GB English text dataset, and a 1-TB high-quality multimodal dataset.

        Released: December 2024

        Download link: https://www.modelscope.cn/datasets/CMB_AILab/YiZhao-FinDataSet/files

      • Duxiaoman-DI/FinCorpus

        The Duxiaoman-DI/FinCorpus dataset is constructed based on an in-depth understanding of information needs in the financial field. It collects and integrates various kinds of Chinese financial information, such as announcements of listed companies, financial news, financial articles, and financial exam questions. It covers multiple aspects of the financial sector, including but not limited to market dynamics, corporate operations, and financial policies. The total size of the dataset is about 20 GB.

        Released: September 2023

        Download link: https://hf-mirror.com/datasets/Duxiaoman-DI/FinCorpus/tree/main
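
The open-source datasets listed above are hosted on platforms such as Hugging Face and ModelScope and can be pulled programmatically rather than downloaded by hand. The following is a minimal sketch using the Hugging Face datasets library with FineWeb Edu as an example; it assumes the library is installed (pip install datasets) and that the "sample-10BT" configuration and record fields shown on the dataset card are still current, so verify them before use.

    from datasets import load_dataset

    # Stream a small sample instead of downloading the full ~1.3T-token corpus to disk.
    ds = load_dataset(
        "HuggingFaceFW/fineweb-edu",
        name="sample-10BT",  # 10B-token sample; see the dataset card for other configurations
        split="train",
        streaming=True,
    )

    # Inspect a few records to check the available fields (for example "text", "url", "score").
    for i, record in enumerate(ds):
        print(record["text"][:200])
        if i >= 2:
            break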

Data Acquisition Methods

  • Open APIs: Many websites and platforms provide APIs through which structured text data can be obtained efficiently. Examples include the Twitter API, news APIs, and the Reddit API (see the example after this list).
  • Crawling: For content that is not available through open APIs, web crawlers can be used to extract it (a crawling sketch follows the API example below). However, it is essential to comply with relevant laws, website terms of use, and ethical standards.
  • Purchasing/Licensing data: Some companies or organizations offer data from specific fields. Acquiring such data through purchase or licensing is an effective and legitimate way to obtain it.
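
As an illustration of the open-API route, the following sketch pulls recent posts from Reddit's public JSON endpoint with the requests library. The subreddit, User-Agent string, and post count are placeholders chosen for this example; production collection should authenticate against the official API and respect its rate limits and terms of service.

    import requests

    # Public JSON endpoint for recent posts; illustrative only.
    resp = requests.get(
        "https://www.reddit.com/r/MachineLearning/new.json",
        params={"limit": 10},
        headers={"User-Agent": "data-collection-example/0.1"},
        timeout=10,
    )
    resp.raise_for_status()

    for child in resp.json()["data"]["children"]:
        post = child["data"]
        # "title" and "selftext" carry the post text; "selftext" is empty for link posts.
        print(post["title"])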
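
For pages without an API, a crawler typically fetches the HTML, strips boilerplate, and keeps only the main text, which is then filtered and deduplicated as described for web page data above. The following is a minimal sketch using requests and BeautifulSoup; the URL and the length threshold are placeholders, and a real crawler must also honor robots.txt and throttle its requests.

    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/article"  # placeholder URL
    html = requests.get(url, timeout=10).text

    soup = BeautifulSoup(html, "html.parser")
    # Remove script, style, and navigation noise before extracting text.
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()

    text = " ".join(soup.get_text(separator=" ").split())

    # Crude quality filter: keep only reasonably long extracts for further cleaning.
    if len(text) > 500:
        print(text[:300])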