Updated on 2025-11-19 GMT+08:00

Preprocessing Data

Obtain the general and industry fine-tuning datasets by following the method described in 3.2.1-Obtaining Source Data. The data format is the same as that of the industry fine-tuning data. Convert the text to the JSONL format: each line is one JSON object that contains at least two keys, context (the question) and target (the answer). The system field is optional and can be customized.
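The conversion described above can be sketched in Python as follows. This is a minimal illustration, not the platform's own tooling: the function names build_record and write_jsonl and the output path are assumptions made for this example; only the keys system, context, and target come from this section.

```python
import json
from typing import Iterable, Optional, Sequence


def build_record(context: Sequence[str], target: str,
                 system: Optional[str] = None) -> dict:
    """Build one sample using the keys named in this section.

    build_record is a hypothetical helper, not part of any SDK.
    """
    record = {"context": list(context), "target": target}
    if system is not None:
        # The system field is optional; when present it is a single string.
        record = {"system": system, **record}
    return record


def write_jsonl(records: Iterable[dict], path: str) -> None:
    """Write one JSON object per line (the JSONL layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # json.dumps escapes embedded quotes and newlines,
            # so every record stays on a single physical line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

For example, write_jsonl([build_record(["Q"], "A", system="xxx")], "train.jsonl") writes a one-line file. Serializing with json.dumps rather than by string concatenation avoids invalid lines when an answer contains quotation marks or line breaks.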

Data example:

Single-turn dialog: {"system":"xxx","context":["Q"],"target":"A"}
The context value can be a string or a list that contains a single element.

Multi-turn dialog: {"system":"xxx","context":["Q1","A1",...,"Qn"],"target":"An"}

The system field is optional. If you configure it, it can contain only one string. Currently, only one persona is supported.

Example of logical inference data with a chain of thought:

{"system": "You are an AI assistant that is good at article writing, text analysis, code writing, mathematical operations, and logical reasoning. You focus on providing high-quality answers to users' questions.", "context": "Premise: I like Masterpiece Theatre.\nHypothesis: Masterpiece Theatre is one of my favorite programs since I was a child.\nDoes the premise include the hypothesis?\nOptions:\n- Yes\n- Uncertain\n- No", "target": "<think>\nOK. I'll analyze this question now. The premise is that \"I like Masterpiece Theatre,\" and the hypothesis is that \"Masterpiece Theatre is one of my favorite programs since I was a child.\" Then, the question is whether the premise includes the hypothesis. The options are Yes, Uncertain, and No.\n\nFirst of all, I need to clarify the premise and hypothesis. A premise is usually a statement of a fact or condition, while a hypothesis is an inference or conjecture based on the premise. Next, I'd like to see if the premise already includes the content in the hypothesis.\n\n\"I like Masterpiece Theatre\" is mentioned in the premise, indicating that the speaker likes the program. The hypothesis further states that this preference started in childhood and that it is one of his/her favorite programs. So, the premise is just a statement of the status quo, that is, what he/she likes now, without mentioning the past or whether it is one of his/her favorite shows.\n\nTherefore, the premise does not include the information in the hypothesis. The hypothesis adds more background and details on top of the premise, which are not mentioned in the premise. Therefore, the conclusion should be that the premise does not include the hypothesis and the option is \"No.\"\n</think>Premise: \"I like Masterpiece Theatre.\"\nHypothesis: \"Masterpiece Theatre is one of my favorite programs since I was a child.\"\n\nThe premise mentions \"I like Masterpiece Theatre,\" but does not specify when this preference started, nor does it mention that it is one of his/her favorite programs. Therefore, the premise does not include the information in the hypothesis.\n\nAnswer: No"}
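Before uploading a file, it can help to check each line against the rules stated in this section. The sketch below is one possible check, not an official validator: the function name validate_line is hypothetical, and the rule that a context list has an odd number of entries is an assumption inferred from the Q1, A1, ..., Qn pattern shown above.

```python
import json


def validate_line(line: str) -> list:
    """Return the problems found in one JSONL line (empty list if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        # Typical cause: an unescaped quotation mark inside a string value.
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["line is not a JSON object"]

    problems = []
    context = record.get("context")
    if isinstance(context, list):
        if not context or not all(isinstance(c, str) for c in context):
            problems.append("context list must be non-empty and contain only strings")
        elif len(context) % 2 == 0:
            # Assumption: turns alternate Q1, A1, ..., Qn, so the length is odd.
            problems.append("context list should end with a question (odd length)")
    elif not isinstance(context, str):
        problems.append("context must be a string or a list of strings")
    if not isinstance(record.get("target"), str):
        problems.append("target must be a string")
    if "system" in record and not isinstance(record["system"], str):
        problems.append("system, when present, must be a single string")
    return problems
```

Calling validate_line on every line of the file and reporting the line numbers that return a non-empty list is usually enough to locate malformed samples.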