Format Requirements for Text Datasets

ModelArts Studio supports the creation of text datasets. When creating a dataset, you can import data in various formats. Table 1 lists the format requirements for each type of file content.

**Table 1** Format requirements for text datasets
File Content	File Format	File Requirements
Document	txt, mobi, epub, docx, and pdf	Import from OBS: The size of a single file cannot exceed 1 GB, and the number of files is not limited. Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.
Web page	html	Import from OBS: The size of a single file cannot exceed 1 GB, and the number of files is not limited. Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.
Pre-trained text	jsonl	JSONL format: text indicates the text data used for pre-training. The following is an example: {"text":"Pangu Models are Pangu series AI models launched by Huawei, including the NLP model, multimodal model, CV model, scientific computing model, and prediction model."} Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited. Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.
Single-turn Q&A	jsonl and csv	JSONL Pangu format - non-CoT: The data consists of Q&A pairs. context and target indicate the question and answer, respectively. The following is an example: {"context": "Hello, please introduce yourself.","target": "I am a Pangu model."} JSONL Pangu format - CoT: The data consists of Q&A pairs. context and target indicate the question and answer, respectively. target must contain the think tag pair to indicate the thinking process. The following is an example: {"context":["Hello, please introduce yourself"], "target": "<think> The user asks me to introduce myself. First, I need to clarify the user's identity and usage scenario.</think>I am the Pangu model."} CSV Pangu format - non-CoT: The first column in the CSV file corresponds to context, and the second column corresponds to target. The following is an example: "Hello, please introduce yourself.","I am a Pangu model." CSV Pangu format - CoT: The first column in the CSV file corresponds to context, the second column corresponds to target, and target must contain the think tag pair to indicate the thinking process. The following is an example: "Hello, please introduce yourself","<think> The user asks me to introduce myself. First, I need to clarify the user's identity and usage scenario.</think>I am the Pangu model." Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited. Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.
Single-turn Q&A (with a system persona)	jsonl and csv	JSONL Pangu format - non-CoT: system indicates the persona, context indicates the question, and target indicates the answer. The following is an example: {"system":"You're a smart and humorous Q&A assistant.","context": ["Hello, please introduce yourself."],"target":"Hello. I'm your smart assistant."} JSONL Pangu format - CoT: system indicates the persona, context indicates the question, and target indicates the answer. target must contain the think tag pair to indicate the thinking process. The following is an example: {"system":"You are a smart and humorous Q&A assistant.","context":["Hello, please introduce yourself."],"target":"<think>The user asks me to introduce myself. First, I need to clarify the user's identity and usage scenario.</think>Hello. I'm your smart assistant."} CSV Pangu format - non-CoT: The first column in the CSV file corresponds to system, the second column corresponds to context, and the third column corresponds to target. The following is an example: "You're a smart and humorous Q&A assistant.","Hello, please introduce yourself.","Hello. I'm your smart assistant." CSV Pangu format - CoT: The first column in the CSV file corresponds to system, the second column corresponds to context, the third column corresponds to target, and target must contain the think tag pair to indicate the thinking process. The following is an example: "You are a smart and humorous Q&A assistant.","Hello, please introduce yourself.","<think>The user asks me to introduce myself. First, I need to clarify the user's identity and usage scenario.</think>Hello. I'm your smart assistant." Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited. Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.
Multi-turn Q&A	jsonl	JSONL Pangu format - non-CoT: Array format, consisting of one or more turns of Q&A pairs. context indicates the question, and target indicates the answer. The following is an example: [{"context":["Hello"],"target":"Hello, what can I do for you?"},{"context":["Please introduce Huawei Cloud products."],"target":"Huawei Cloud products include but are not limited to compute, storage, and network products."}] JSONL Pangu format - CoT: Array format, consisting of one or more turns of Q&A pairs. context indicates the question, and target indicates the answer. The target of at least one turn must contain the think tag pair to indicate the thinking process. The following is an example: [{"context":["Hello"],"target":"<think>The user asks me to introduce myself. First, I need to clarify the user's identity and usage scenario.</think>Hello, is there anything I can help you with?"},{"context":["Please introduce Huawei Cloud products."],"target":"Huawei Cloud provides product services including but not limited to computing, storage, and network."}] Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited. Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.
Multi-turn Q&A (with a system persona)	jsonl	JSONL Pangu format - non-CoT: Array format, consisting of a persona and one or more turns of Q&A pairs. system indicates the persona, context indicates the question, and target indicates the answer. The following is an example: [{"system": "You are a book recommendation expert."},{"context":["Hi"],"target":"Hi. What can I do for you?"},{"context":["Can you recommend some books to me?"],"target":"Of course. Based on your interests, I recommend The Future of Autonomous Driving."}] JSONL Pangu format - CoT: Array format, consisting of a persona and one or more turns of Q&A pairs. system indicates the persona, context indicates the question, and target indicates the answer. The target of at least one turn must contain the think tag pair to indicate the thinking process. The following is an example: [{"system":"You are a book recommendation expert."},{"context":["Hi"],"target":"<think> The user is greeting. I need to reply and ask for more information.</think>Hi. What can I do for you?"},{"context":["Can you recommend some books for me?"],"target":"<think>I need to recommend books to the user as an expert.</think>Of course. Based on your interests, I recommend you read The Future of Autonomous Driving."}] Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited. Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.
Q&A sorting	jsonl and csv	JSONL format: context indicates the question, and targets lists the answers in order of human preference, with the most preferred answer first. The following is an example: {"context":"context content","targets":["Answer 1","Answer 2","Answer 3"]} CSV format: The first column in the CSV file corresponds to context, and the remaining columns are the answers in order of preference. The following is an example: "Question","Answer 1","Answer 2","Answer 3" Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited. Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.
Direct Preference Optimization (DPO)	jsonl	JSONL Pangu format - non-CoT: context indicates the question, target indicates the expected correct answer, and bad_target indicates the incorrect answer that does not meet the expectation. The following are examples. Single-turn Q&A: {"context": ["Hello, please introduce yourself."],"target":"I'm a Pangu model.","bad_target":"Sorry, I can't assist with that."} Multi-turn Q&A: {"context": ["Hello, please introduce yourself.", "I'm a Pangu model.", "Please introduce Huawei Cloud products."],"target":"Huawei Cloud products include but are not limited to compute, storage, and network products."} JSONL Pangu format - CoT: context indicates the question, target indicates the expected correct answer, and bad_target indicates the incorrect answer that does not meet the expectation. At least one answer must contain the think tag pair to indicate the thinking process. The following are examples. Single-turn Q&A: {"context": ["Hello, please introduce yourself."],"target":"I'm a Pangu model.","bad_target":"Sorry, I can't assist with that."} Multi-turn Q&A: {"context": ["Hello, please introduce yourself.", "I'm a Pangu model.", "Please introduce Huawei Cloud products."],"target":"<think>The user wants to learn about Huawei Cloud products.</think>Huawei Cloud products include but are not limited to compute, storage, and network products."} Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited. Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.
DPO (with a system persona)	jsonl	JSONL Pangu format - non-CoT: system indicates the persona, context indicates the question, target indicates the expected correct answer, and bad_target indicates the incorrect answer that does not meet the expectation. The following are examples. Single-turn Q&A with a persona: {"system":"You are a humorous Q&A assistant.","context": ["Hello, please introduce yourself."],"target":"Hello. I am your smart assistant. How can I help you?","bad_target":"Sorry, I can't assist with that."} Multi-turn Q&A with a persona: {"system":"You are a humorous Q&A assistant.","context": ["Hello, please introduce yourself.", "Hello, I am your smart assistant. How can I help you?", "Please introduce Huawei Cloud products."], "target":"Huawei Cloud provides compute, storage, and network products, as well as many other products.","bad_target":"Sorry, I can't assist with that."} JSONL Pangu format - CoT: system indicates the persona, context indicates the question, target indicates the expected correct answer, and bad_target indicates the incorrect answer that does not meet the expectation. At least one answer must contain the think tag pair to indicate the thinking process. The following are examples. Single-turn Q&A with a persona: {"system":"You are a humorous Q&A assistant.","context": ["Hello, please introduce yourself."],"target":"Hello. I am your smart assistant. How can I help you?","bad_target":"Sorry, I can't assist with that."} Multi-turn Q&A with a persona: {"system":"You are a humorous Q&A assistant.","context": ["Hello, please introduce yourself.", "Hello, I am your smart assistant. How can I help you?", "Please introduce Huawei Cloud products."], "target":"<think>The customer wants to know more about products.</think>Huawei Cloud provides compute, storage, and network products, as well as many other products.","bad_target":"Sorry, I can't assist with that."} Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited. Local upload: The size of a single file cannot exceed 10 MB, and the number of files cannot exceed 100.
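
The single-turn Q&A rules and the local upload limits in Table 1 can be checked before import with a short script. The following is a minimal sketch, not the validation that ModelArts Studio itself performs. The names check_qa_record, check_local_upload, and THINK_PAIR are hypothetical, and the sketch assumes that 10 MB means 10 x 1024 x 1024 bytes and that a valid think tag pair is an opening <think> tag followed later by a closing </think> tag.

```python
import json
import re

# Assumption: a "think tag pair" is <think> followed later by </think>.
THINK_PAIR = re.compile(r"<think>.*?</think>", re.DOTALL)

# Local upload limits from Table 1 (assumes 1 MB = 1024 * 1024 bytes).
LOCAL_MAX_BYTES = 10 * 1024 * 1024
LOCAL_MAX_FILES = 100


def check_qa_record(line, cot=False, with_system=False):
    """Return a list of problems found in one single-turn Q&A JSONL line."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record must be a JSON object"]

    problems = []
    required = ["context", "target"] + (["system"] if with_system else [])
    for key in required:
        if key not in record:
            problems.append(f"missing field: {key}")

    # CoT records must carry the thinking process inside target.
    target = record.get("target")
    if cot and isinstance(target, str) and not THINK_PAIR.search(target):
        problems.append("target has no <think>...</think> pair")
    return problems


def check_local_upload(file_sizes):
    """Check a {file name: size in bytes} mapping against the local upload limits."""
    problems = []
    if len(file_sizes) > LOCAL_MAX_FILES:
        problems.append(f"too many files: {len(file_sizes)} > {LOCAL_MAX_FILES}")
    for name, size in file_sizes.items():
        if size > LOCAL_MAX_BYTES:
            problems.append(f"{name}: exceeds 10 MB")
    return problems


if __name__ == "__main__":
    sample = '{"context": ["Hello, please introduce yourself."], "target": "I am a Pangu model."}'
    print(check_qa_record(sample))            # checked as non-CoT
    print(check_qa_record(sample, cot=True))  # checked as CoT
```

Each function returns a list of problems instead of raising an exception, so every line of a large file can be checked in one pass and all findings reported together. The same pattern extends to the multi-turn and DPO formats by adjusting the required fields.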