
Obtaining Source Data

Common Data Sources

Common data sources, typically collected as raw text or as files such as PDF and Word documents, include the following:

  • Web pages: Web data is abundant and widely available on the Internet and can be collected with a crawler. However, web page data often contains noise and may be in disordered formats, requiring thorough processing and filtering to extract high-quality, usable information.
  • Dialogs: Dialog data helps improve a model's conversational ability. It can be obtained from written dialogs, chat logs, forum posts, and social media comments, but it is more challenging to collect and process than other sources.
  • Books: Text from books is typically more formal, detailed, and lengthy, with a generally higher quality. This helps models accumulate rich linguistic knowledge and improve modeling of long-range semantic relationships. Such data can be obtained from e-book websites.
  • Code: Compared with natural language text, code is mainly presented in a structured programming language form. Training on code data can enhance a model's understanding of structured semantics and its ability to perform logical reasoning. You can download relevant datasets from programming Q&A communities such as Stack Exchange or open-source code websites such as GitHub and Gitee.
  • Academic papers: Academic papers help enhance an LLM's understanding of scientific knowledge. They can be downloaded from authoritative sources such as academic journals and CNKI.
  • Open-source datasets (a minimal loading sketch follows this list):
    • General datasets: General datasets provide large-scale Internet text data, making them highly suitable for pre-training models across a wide range of NLP tasks.
      • FineWeb Edu

        FineWeb Edu is a subset of FineWeb released by Hugging Face. It was built by training an educational-quality classifier on synthetic annotations generated by the Llama-3-70B-Instruct model and using it to filter FineWeb, yielding an educational web dataset of about 1.3 trillion tokens that outperforms all publicly accessible web datasets. Smaller samples of 10B, 100B, and 350B tokens are also provided for quick use.

        Released: June 2024

        Download link: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu/tree/main

      • OpenNewsArchive (open news library)

        OpenNewsArchive is jointly developed by several organizations, including OpenDataLab, Midu, and SenseTime. It contains 8.8 million news articles covering a range of topics and sources. Each article includes fields such as the title, content, release date, and language, and the dataset has been cleaned and deduplicated. The total size is about 11 GB, consisting mainly of Chinese data.

        Released: May 2024

        Download link: https://openxlab.org.cn/datasets/OpenDataLab/OpenNewsArchive

      • ChineseFinewebEdu

        The Chinese Fineweb Edu dataset is a carefully constructed, high-quality Chinese pre-training corpus designed for natural language processing tasks in the education field. It was extracted from massive raw data through strict filtering and deduplication, with a scoring model trained on a small amount of labeled data used to retain only high-value educational content, ensuring data quality and diversity. The final dataset contains about 90 million high-quality Chinese text records, with a total size of about 300 GB.

        Released: August 2024

        Download link: https://huggingface.co/datasets/opencsg/chinese-fineweb-edu/tree/main

      • CCI 3.0

        The Chinese Corpora Internet (CCI) 3.0 dataset was open-sourced to address the scarcity of high-quality, safe Chinese datasets. Building on the earlier CCI dataset, the developers expanded the data sources and adopted stricter cleaning methods to construct CCI 3.0. The dataset consists of high-quality, reliable Internet data from trusted sources that has undergone strict cleaning and deduplication, with targeted detection and filtering for content quality and safety. The released CCI 3.0 corpus is about 1,000 GB in size.

        Released: September 2024

        Download link: https://huggingface.co/datasets/BAAI/CCI3-Data/tree/main

      • CCI 3.0-HQ

        CCI 3.0-HQ is a high-quality 500 GB subset of CCI 3.0, built by the Beijing Academy of Artificial Intelligence (BAAI) using a novel two-stage hybrid filtering pipeline that significantly improves data quality. To evaluate its effectiveness, the developers trained a 0.5B-parameter model from scratch on 100B tokens drawn from various datasets; on 10 benchmarks in a zero-shot setting, it outperformed models trained on CCI 3.0, SkyPile, and WanjuanV1.

        Released: September 2024

        Download link: https://huggingface.co/datasets/BAAI/CCI3-HQ/tree/main

    • Domain-specific datasets:
      • IndustryCorpus

        IndustryCorpus is a pre-training dataset curated by BAAI that spans 18 industries, including healthcare, education, finance, and law. It aims to improve the performance of industry-specific models. The total size of the dataset is 3.4 TB. IndustryCorpus combines resources from multiple large-scale datasets such as WuDao; after 22 domain-specific processing steps were applied, the resulting dataset comprises 1 TB of high-quality Chinese data and 2.4 TB of English data.

        Released: June 2024

        Download link: https://huggingface.co/datasets/BAAI/IndustryCorpus/tree/main/IndustryCorpus

      • IndustryCorpus2

        IndustryCorpus2 is an upgraded version of IndustryCorpus. On top of the original data, it introduces additional high-quality sources, such as the Pile, BigCode, and open-web-math, adding mathematical and code data. To better fit the industry classification system, the developers redesigned the industry categories by combining the national economic industry classification (20 categories) issued by the National Bureau of Statistics with a world knowledge taxonomy, resulting in 31 categories that cover most mainstream industries. They also applied a combination of rule-based and model-based filtering, which greatly improved overall data quality. The resulting dataset comprises 1 TB of high-quality Chinese data and 2.2 TB of English data.

        Released: November 2024

        Download link: https://www.modelscope.cn/datasets/BAAI/IndustryCorpus2/files

      • YiZhao financial dataset

        The YiZhao dataset is a roughly 2 TB high-quality training dataset for financial models. It covers a broad range of financial events, market dynamics, financial products, and trading patterns. The raw data is processed with the cleaning tools open-sourced alongside the dataset, financial data classifiers, and security risk identification models to produce cleaned Chinese and English corpora that are strongly relevant to finance and aligned with socialist core values. The resulting datasets include a 936-GB Chinese text dataset, a 100-GB English text dataset, and a 1-TB high-quality multimodal dataset.

        Released: December 2024

        Download link: https://www.modelscope.cn/datasets/CMB_AILab/YiZhao-FinDataSet/files

      • Duxiaoman-DI/FinCorpus

        The Duxiaoman-DI/FinCorpus dataset is constructed based on an in-depth understanding of information needs in the financial field. It collects and integrates various kinds of Chinese financial information, such as announcements of listed companies, financial news, financial articles, and financial exam questions. It covers multiple aspects of the financial sector, including but not limited to market dynamics, corporate operations, and financial policies. The total size of the dataset is about 20 GB.

        Released: September 2023

        Download link: https://hf-mirror.com/datasets/Duxiaoman-DI/FinCorpus/tree/main
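
The open-source datasets listed above are hosted on platforms such as Hugging Face and ModelScope and can be pulled programmatically rather than downloaded by hand. The following is a minimal sketch using the Hugging Face datasets library with FineWeb Edu as an example; it assumes the library is installed (pip install datasets) and that the "sample-10BT" configuration and record fields shown on the dataset card are still current, so verify them before use.

    from datasets import load_dataset

    # Stream a small sample instead of downloading the full ~1.3T-token corpus to disk.
    ds = load_dataset(
        "HuggingFaceFW/fineweb-edu",
        name="sample-10BT",  # 10B-token sample; see the dataset card for other configurations
        split="train",
        streaming=True,
    )

    # Inspect a few records to check the available fields (for example "text", "url", "score").
    for i, record in enumerate(ds):
        print(record["text"][:200])
        if i >= 2:
            break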

Data Acquisition Methods

  • Open APIs: Many websites and platforms provide APIs through which structured text data can be obtained efficiently. Examples include the Twitter API, news APIs, and the Reddit API (see the example after this list).
  • Crawling: For content that is not available through open APIs, web crawlers can be used to extract it (a crawling sketch follows the API example below). However, it is essential to comply with relevant laws, website terms of use, and ethical standards.
  • Purchasing/Licensing data: Some companies or organizations offer data from specific fields. Acquiring such data through purchase or licensing is an effective and legitimate way to obtain it.
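
As an illustration of the open-API route, the following sketch pulls recent posts from Reddit's public JSON endpoint with the requests library. The subreddit, User-Agent string, and post count are placeholders chosen for this example; production collection should authenticate against the official API and respect its rate limits and terms of service.

    import requests

    # Public JSON endpoint for recent posts; illustrative only.
    resp = requests.get(
        "https://www.reddit.com/r/MachineLearning/new.json",
        params={"limit": 10},
        headers={"User-Agent": "data-collection-example/0.1"},
        timeout=10,
    )
    resp.raise_for_status()

    for child in resp.json()["data"]["children"]:
        post = child["data"]
        # "title" and "selftext" carry the post text; "selftext" is empty for link posts.
        print(post["title"])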
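
For pages without an API, a crawler typically fetches the HTML, strips boilerplate, and keeps only the main text, which is then filtered and deduplicated as described for web page data above. The following is a minimal sketch using requests and BeautifulSoup; the URL and the length threshold are placeholders, and a real crawler must also honor robots.txt and throttle its requests.

    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/article"  # placeholder URL
    html = requests.get(url, timeout=10).text

    soup = BeautifulSoup(html, "html.parser")
    # Remove script, style, and navigation noise before extracting text.
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()

    text = " ".join(soup.get_text(separator=" ").split())

    # Crude quality filter: keep only reasonably long extracts for further cleaning.
    if len(text) > 500:
        print(text[:300])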