Updated on 2025-07-28 GMT+08:00

Format Requirements for Text Datasets

ModelArts Studio supports creating text datasets. When you create a dataset, you can import data in various formats. Table 1 lists the format requirements for each type of file content.

Table 1 Format requirements for text datasets

File Content

File Format

File Requirements

Document

txt, mobi, epub, docx, and pdf

Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited.

Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.

Web page

html

Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited.

Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.
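The local upload limits that recur throughout Table 1 (10 MB per file, at most 100 files) can be checked before uploading. The following is a minimal sketch, not part of ModelArts Studio: the function name check_local_upload is hypothetical, and treating 10 MB as 10 * 1024 * 1024 bytes is an assumption, because the table does not define the unit precisely.

```python
import os

# Limits taken from Table 1. Interpreting "10 MB" as 10 * 1024 * 1024 bytes
# is an assumption; the table does not say whether MB is binary or decimal.
MAX_LOCAL_FILE_BYTES = 10 * 1024 * 1024
MAX_LOCAL_FILE_COUNT = 100


def check_local_upload(paths):
    """Return a list of problems; an empty list means the batch fits the limits."""
    problems = []
    if len(paths) > MAX_LOCAL_FILE_COUNT:
        problems.append(
            f"{len(paths)} files exceeds the limit of {MAX_LOCAL_FILE_COUNT}"
        )
    for path in paths:
        size = os.path.getsize(path)
        if size > MAX_LOCAL_FILE_BYTES:
            problems.append(f"{path}: {size} bytes exceeds {MAX_LOCAL_FILE_BYTES}")
    return problems
```

Files that exceed these limits can still be imported from OBS, where the per-file limit is 50 GB.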

Pre-trained text

jsonl

  • JSONL format: text indicates the text data used for pre-training. The following is an example:
    {"text":"Pangu Models are Pangu series AI models launched by Huawei, including the NLP model, multimodal model, CV model, scientific computing model, and prediction model."}
  • Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited.

    Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.
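As an illustration of the pre-trained text format, the sketch below serializes a list of strings into JSONL with one {"text": ...} object per line. The helper name to_pretrain_jsonl is hypothetical; only the text field comes from the table.

```python
import json


def to_pretrain_jsonl(texts):
    """Serialize each string as one {"text": ...} JSON object per line."""
    # json.dumps escapes embedded newlines, so every record stays on one line.
    # ensure_ascii=False keeps non-ASCII characters readable in the output.
    return "\n".join(json.dumps({"text": t}, ensure_ascii=False) for t in texts)
```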

Single-turn Q&A

jsonl and csv

  • JSONL format: The data consists of Q&A pairs. context and target indicate the question and answer, respectively. The following is an example:
    {"context": "Hello, please introduce yourself.","target": "I am a Pangu model."}
  • CSV format: The first column in the CSV file corresponds to context, and the second column corresponds to target. The following is an example:
    "Hello, please introduce yourself.","I am a Pangu model."
  • Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited.

    Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.
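Because the CSV and JSONL layouts for single-turn Q&A carry the same two fields, one can be converted to the other. The sketch below assumes the CSV file has no header row (the example above shows none); the function name qa_csv_to_jsonl is hypothetical.

```python
import csv
import io
import json


def qa_csv_to_jsonl(csv_text):
    """Convert two-column CSV text (context, target) to JSONL text."""
    lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:  # skip blank lines
            continue
        # Column 1 is the question (context), column 2 is the answer (target).
        record = {"context": row[0], "target": row[1]}
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)
```

The csv module handles quoted fields, so commas inside a question or answer do not split the column.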

Single-turn Q&A (with a system persona)

jsonl and csv

  • JSONL format: system indicates the persona, context indicates the question, and target indicates the answer.
    {"system": "You're a smart and humorous Q&A assistant.","context": "Hello, please introduce yourself.","target":"Hello. I'm your smart assistant."}
  • CSV format: In the CSV file, the first column corresponds to system, and the second and third columns correspond to context and target, respectively.
    "You're a smart and humorous Q&A assistant.","Hello, please introduce yourself.","Hello. I'm your smart assistant."
  • Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited.

    Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.
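The three-column CSV layout can be produced with a standard CSV writer. The sketch below is one possible approach, with a hypothetical function name (persona_records_to_csv); it quotes every field, matching the style of the example above, and writes columns in the order system, context, target.

```python
import csv
import io


def persona_records_to_csv(records):
    """Write dicts with system/context/target keys as three-column CSV text."""
    buf = io.StringIO()
    # QUOTE_ALL wraps every field in double quotes, so commas inside a
    # persona, question, or answer cannot be mistaken for column separators.
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for record in records:
        writer.writerow([record["system"], record["context"], record["target"]])
    return buf.getvalue()
```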

Multi-turn Q&A

jsonl

  • JSONL format: Each line is an array consisting of at least one Q&A pair, in the form [{"context":"context content 1","target":"target content 1"},{"context":"context content 2","target":"target content 2"}]. context and target indicate the question and answer, respectively. The following is an example:
    [{"context":"Hello","target":"Hello, what can I do for you?"},{"context":"Please introduce Huawei Cloud products.","target":"Huawei Cloud products include but are not limited to compute, storage, and network products."}]
  • Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited.

    Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.
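A line of multi-turn data can be checked against the rule above (a non-empty array of context/target pairs) before import. This is a minimal sketch of such a check, not the validation that ModelArts Studio itself performs; the function name validate_multi_turn_line is hypothetical.

```python
import json


def validate_multi_turn_line(line):
    """Return True if one JSONL line is a non-empty array of Q&A pairs."""
    try:
        turns = json.loads(line)
    except json.JSONDecodeError:
        return False
    # The format requires an array with at least one Q&A pair.
    if not isinstance(turns, list) or not turns:
        return False
    # Every element must carry a string question and a string answer.
    return all(
        isinstance(turn, dict)
        and isinstance(turn.get("context"), str)
        and isinstance(turn.get("target"), str)
        for turn in turns
    )
```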

Multi-turn Q&A (with a system persona)

jsonl

  • JSONL format: Each line is an array consisting of the persona followed by at least one Q&A pair. system indicates the persona, context indicates the question, and target indicates the answer. The following is an example:
    [{"system": "You are a book recommendation expert."},{"context":"Hi","target":"Hi. What can I do for you?"},{"context":"Can you recommend some books to me?","target":"Of course. Based on your interests, I recommend The Future of Autonomous Driving."}]
  • Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited.

    Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.
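In the example above, the persona appears as its own first element of the array, ahead of the Q&A pairs. The sketch below builds one such line under that assumption; the function name build_persona_dialogue and its arguments are hypothetical.

```python
import json


def build_persona_dialogue(system, turns):
    """Build one JSONL line: a persona element followed by Q&A pairs.

    turns is a list of (question, answer) tuples.
    """
    if not turns:
        raise ValueError("at least one Q&A pair is required")
    # Assumption drawn from the example: the persona is a separate
    # {"system": ...} object placed before the Q&A pairs.
    items = [{"system": system}]
    items += [{"context": question, "target": answer} for question, answer in turns]
    return json.dumps(items, ensure_ascii=False)
```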

Q&A ranking

jsonl and csv

  • JSONL format: context indicates the question, and targets lists the answers in order of human preference, with the most preferred answer first. The following is an example:
    {"context":"context content","targets":["Answer 1","Answer 2","Answer 3"]}
  • CSV format: The first column in the CSV file corresponds to context, and the remaining columns contain the answers. The following is an example:
    "Question","Answer 1","Answer 2","Answer 3"
  • Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited.

    Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.
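Since the position of each answer in targets encodes its rank, answers need to be sorted before they are written. The sketch below assumes each answer comes with a numeric preference score, which is a hypothetical input not described in the table; the function name ranking_record is also hypothetical.

```python
import json


def ranking_record(context, scored_answers):
    """Build one Q&A ranking line from (answer, score) pairs.

    A higher score means a more preferred answer (an assumption of this sketch).
    """
    # Sort so the most preferred answer comes first, as the format requires.
    ordered = sorted(scored_answers, key=lambda pair: pair[1], reverse=True)
    record = {"context": context, "targets": [answer for answer, _ in ordered]}
    return json.dumps(record, ensure_ascii=False)
```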

Direct Preference Optimization (DPO)

jsonl

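  • JSONL format: context is an array that holds the question, target indicates the expected correct answer, and bad_target indicates an incorrect or unexpected answer. For multi-turn data, context lists the earlier questions and answers in order, followed by the latest question.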
    Single-turn Q&A
    {"context": ["Hello, please introduce yourself."],"target":"I'm a Pangu model.","bad_target":"Sorry, I can't assist with that."}
    Multi-turn Q&A
    {"context": ["Hello, please introduce yourself.", "I'm a Pangu model.", "Please introduce Huawei Cloud products."],"target":"Huawei Cloud products include but are not limited to compute, storage, and network products."}
  • Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited.

    Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.
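In the DPO examples above, context is an array: a single question for single-turn data, or alternating questions and answers ending with the latest question for multi-turn data. The sketch below builds a record from that reading of the examples; the function name dpo_record and its parameters are hypothetical.

```python
import json


def dpo_record(history, question, target, bad_target):
    """Build one DPO line.

    history is a list of earlier (question, answer) pairs; pass an empty
    list for single-turn data.
    """
    context = []
    for past_question, past_answer in history:
        # Earlier turns are flattened into alternating question/answer entries.
        context.extend([past_question, past_answer])
    context.append(question)  # the latest question always comes last
    record = {"context": context, "target": target, "bad_target": bad_target}
    return json.dumps(record, ensure_ascii=False)
```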

DPO (with a system persona)

jsonl

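  • JSONL format: system indicates the persona, context is an array that holds the question, target indicates the expected correct answer, and bad_target indicates an incorrect or unexpected answer. For multi-turn data, context lists the earlier questions and answers in order, followed by the latest question.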
    Single-turn Q&A (with a system persona)
    {"system":"You are a humorous Q&A assistant.","context": ["Hello, please introduce yourself."],"target":"Hello. I am your smart assistant. How can I help you?","bad_target":"Sorry, I can't assist with that."}
    Multi-turn Q&A (with a system persona)
    {"system":"You are a humorous Q&A assistant.","context": ["Hello, please introduce yourself.", "Hello, I am your smart assistant. How can I help you?", "Please introduce Huawei Cloud products."], "target":"Huawei Cloud provides compute, storage, and network products, as well as many other products.","bad_target":"Sorry, I can't assist with that."
  • Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited.

    Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.
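A DPO record with a system persona has several fields that are easy to omit or truncate, so a pre-import check can help. The sketch below is an illustrative check only, not the validation that ModelArts Studio performs. The function name dpo_persona_problems is hypothetical, and the rule that context must have an odd number of entries is an assumption inferred from the examples, where context always ends with a question.

```python
import json

REQUIRED_STRING_FIELDS = ("system", "target", "bad_target")


def dpo_persona_problems(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = [
        f"missing or non-string field: {name}"
        for name in REQUIRED_STRING_FIELDS
        if not isinstance(record.get(name), str)
    ]
    context = record.get("context")
    if not isinstance(context, list) or not all(isinstance(c, str) for c in context):
        problems.append("context is not an array of strings")
    elif len(context) % 2 == 0:
        # Assumption: questions and answers alternate and a question comes last.
        problems.append("context should end with a question (odd length)")
    return problems
```

A line that is cut off before its closing brace fails at the JSON parsing step and is reported as invalid JSON.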