Updated on 2025-07-28 GMT+08:00

Text Dataset Processing Operators

The data processing operators provide multiple data operation capabilities, including data extraction, filtering, conversion, and labeling. These operators help you extract useful information from massive data and perform deep processing to generate high-quality training data.

The platform supports processing of text datasets, including data extraction, data conversion, data filtering, and data labeling. Table 1 lists the capabilities of text processing operators.

Table 1 Text dataset processing operator capabilities

Category

Operator Name

Operator Description

Data extraction

Word document content extraction

Extracts text from a Word document and retains the contents, titles, and body of the original document, but does not retain images, tables, formulas, headers, or footers.

TXT content extraction

Extracts all text content from a TXT file.

CSV content extraction

Reads all text content from a CSV file and generates data in JSON format based on the key value of the file content type template.

PDF content extraction

Extracts text from PDF files and converts the text into structured data. Texts, tables, and formulas can be extracted.

JSON content extraction

Extracts key-value pair information from a JSON file.

HTML content extraction

Extracts HTML data content based on the tag path, and deletes other content irrelevant to the tag path to be extracted.

E-book content extraction

Extracts all text content from an e-book (MOBI or EPUB).

Data conversion

Personal data anonymization

Anonymizes or directly deletes sensitive personal information, such as mobile numbers, identity documents, email addresses, URLs, license plate numbers in China, IP addresses, MAC addresses, IMEIs, passports, and vehicle identification numbers.

Symbol standardization

Searches for non-standardized symbols in the text and converts them to a standardized, unified form.

  • Unified space: All Unicode spaces (such as U+00A0 and U+200A) are converted to standard spaces (U+0020).
  • DBC to SBC: Converts full-width characters in documents to half-width characters.
  • Punctuation normalization: The following symbols support a unified format:
    • {"? ": "\?\? "}
  • Number symbol normalization

Custom regular expression replacement

Uses a custom regular expression to replace text content. The data items remain unchanged.

The following is an example:

  • Remove "References" and the content following it: \nReferences[\s\S]*
  • For PDF content, remove the content before "0 Introduction" (the content before the introduction is irrelevant to knowledge): [\s\S]{0,10000}0 Introduction
  • Delete the content irrelevant to knowledge before "1.1 Introduction to Java" from a PDF file: [\s\S]{0,10000}1\.1 Introduction to Java

Date and time format conversion

Automatically identifies the date, time, and week, and converts the date, time, and week based on the selected format.

Data filtering

Filtering abnormal characters

Searches for abnormal characters in each data record in the dataset and replaces them with null values. The data items remain unchanged.

  • Invisible characters, for example, U+0000-U+001F
  • Web page tag symbols: <style></style>
  • Special space: [\u2000-\u2009]

Custom regular expression filtering

Deletes data that matches the custom regular expression.

User-defined keyword filtering

Deletes data that contains the specified keywords.

Sensitive word filtering

Automatically detects and filters sensitive data such as pornography, violence, and politics in text.

Filtering based on the text length

Retains the data within the specified length range based on the configured text length.

Redundant information filtering

Deletes redundant information from the text based on the paragraph granularity without changing the data items.

Examples include figure captions, table captions, and references.

N-gram feature filtering

Measures the degree of repetition in a document. Repetition is calculated over n-grams using the configured feature N, and one of the following two algorithms compares the result against the feature threshold. If the result is greater than the threshold, the document is deleted.

  • Top-gram filtering: Calculates the proportion of the most repeated gram to the total length. If the proportion is greater than the feature threshold, the document is deleted.
  • Gram repetition rate filtering: Calculates the proportion of all repeated grams to the total length. If the proportion is greater than the feature threshold, the document is deleted.

Paragraph feature filtering

The filtering is based on the following:

  • Paragraph repetition rate
  • Proportion of the length of repeated paragraphs
  • Proportion of non-Chinese characters

Sentence feature filtering

Uses punctuation marks in a document as sentence separators and collects statistics on the length of each sentence. If the average sentence length of a document is greater than the configured length, the document is retained; otherwise, the entire document is deleted. The filtering is based on the following:

  • Average length of sentences to be retained

Word feature filtering

Segments a document into words based on the system word library and counts the total number of words. The average word length is the total length of all words divided by the total number of words. If both the word count and average word length conditions are met, the document is retained. The filtering is based on the following:

  • Number of words to be retained
  • Average length of words to be retained

Filtering the incomplete sentence at the end of a paragraph

Checks whether the content at the end of a paragraph is complete based on the sentence-level filtering granularity, and filters out the content if the content is incomplete.

Advertisement data filtering

Deletes sentences containing advertisement content from the text. The filtering granularity is sentence.

Q&A pair filtering

Filters Q&A pairs that meet the following conditions:

  • The question is not in the string format.
  • The answer is empty.
  • The answer is meaningless.

Language filtering

Detects the language of a document using a language detection model and retains only documents in the required language.

Global Text Deduplication

Detects and removes duplicate or highly similar text from data to prevent model overfitting or reduced generalization.

General semantics scoring of SFT data

Uses the LLM to check and score the general semantics of SFT data and filter data based on the scoring threshold.

Rule-based inspection and filtering of SFT data

Checks and filters SFT data based on the selected rule.

CoT scoring of SFT data

Uses the LLM to check and score the CoT of SFT data and filter data based on the scoring threshold.

Data labeling

Pre-trained text classification

Classifies the pre-trained text, such as news, education, and health. The supported languages include Chinese and English.

General quality assessment

Assesses the general quality of text, such as fluency, clarity, and diversity.

Problem timeliness evaluation

Checks whether a question is time-sensitive and provides the reason.

Answer quality scoring

Scores the quality of answers to fine-tuning datasets, such as logical consistency and fact correctness.

Syntax quality assessment

Assesses the syntax quality of texts, such as relevance and standardization.

Word Document Content Extraction

  • Applicable file format: document > docx
  • Parameter description:

    Type of content to be extracted: Extracts text from a Word document and retains the contents, titles, and body of the original document, but does not retain images, tables, formulas, headers, or footers.

  • Parameter configuration example:

    No parameters need to be set. By default, the contents, title, and body of the original document are retained, and the images, tables, formulas, headers, and footers are not retained.

  • Extraction example

    {"fileName":"JAVA from Beginner to Master.docx","text":"JAVA is a cross-platform..."}

TXT Content Extraction

  • Applicable file format: document > txt
  • Parameter description:

    Type of content to be extracted: By default, the full text is extracted into one line. Text can also be extracted by paragraph, in which case it is split into multiple lines based on the entered separators. Multiple separators are separated by vertical bars (|). A separator can contain a maximum of 100 characters.

  • Extraction example

    {"fileName":"TXT file name.txt","text":"This is the first line."}
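The paragraph-splitting behavior described above can be sketched in Python; the function name and the handling of multiple separators are assumptions based on this section's parameter description, not the operator's actual implementation:

```python
import re

def split_txt(text, separators):
    # Separators are configured as a "|"-separated list (e.g. "!|?");
    # each one is escaped so it is matched literally, then the text is
    # split into one segment per output line.
    pattern = "|".join(re.escape(s) for s in separators.split("|"))
    return [seg for seg in re.split(pattern, text) if seg]
```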

CSV Content Extraction

  • Applicable file formats: Text > single-turn Q&A, single-turn Q&A (with a persona), and Q&A ranking.
  • Parameter description

    Type of content to be extracted: Reads all text content from a CSV file and generates data in JSON format based on the key value of the file content type template.

  • Parameter configuration example

    No parameters need to be set.

  • Extraction example

    If the CSV content is "Hello, please introduce yourself.","I am Pangu model.", the extracted result is {"context":"Hello, please introduce yourself.","target":"I am Pangu model."}
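A minimal sketch of this mapping in Python, assuming the single-turn Q&A template assigns the first CSV column to context and the second to target (the key names and order are assumptions taken from the example above):

```python
import csv
import io

def extract_csv(content, keys=("context", "target")):
    # Read every CSV row and zip it onto the template keys,
    # producing one JSON-style record per row.
    return [dict(zip(keys, row)) for row in csv.reader(io.StringIO(content))]
```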

PDF Content Extraction

  • Applicable file format: document > PDF
  • Parameter description

    Type of content to be extracted: By default, the text, tables, formulas, and titles are retained. You can select the type to be saved. The types that are not selected will be removed.

    Refined content extraction: Indicates whether image content extraction is supported after layout analysis.

  • Parameter configuration example

  • Extraction example

    {"fileName":"JAVA from Beginner to Master.pdf","text":"JAVA is a cross-platform..."}

JSON Content Extraction

  • Applicable file formats: Text > pre-trained text, single turn Q&A, single-turn Q&A (with a system persona), multi-turn Q&A, multi-turn Q&A (with a system persona), Q&A ranking, DPO, and DPO (with a system persona).
  • Parameter description

    Type of content to be extracted: Reads the key-value pair information in all text.

  • Parameter configuration example

    No parameters need to be set.

  • Extraction example

    {"context": "Hello, please introduce yourself.","target":"I am a Pangu model."}

HTML Content Extraction

  • Applicable file format: text > web page
  • Parameter description

    Type of content to be extracted: The default file encoding format is UTF-8. The GB2312 format is supported. By default, the body is extracted. You can customize the content to be extracted. Multiple labels can be extracted. Labels are separated by commas (,), for example, A, B, C. That is, the content of label A, B, or C is extracted.

  • Parameter configuration example

  • Extraction example

    {"text":"#\nI am Pangu model.\nPangu model is an advanced AI model that is dedicated to providing intelligent solutions for various industries.\n","fileName":"Web page.html"}

E-book Content Extraction

  • Applicable file format: document > mobi/epub
  • Parameter description

    Type of content to be extracted: Extracts all text content from MOBI or EPUB e-books.

  • Parameter configuration example

    No parameters need to be set.

  • Extraction example

    {"fileName":"JAVA from Beginner to Master.epub","text":"JAVA is a cross-platform..."}

Personal Data Anonymization

  • Applicable file format: text
  • Parameter description

    Type of content to be converted: Anonymizes sensitive personal information in the text, such as mobile numbers, ID cards, email addresses, URLs, license plate numbers in China, IP addresses, MAC addresses, IMEIs, passports, and vehicle identification numbers. By default, all options are selected. You can also select some of them.

  • Parameter configuration example

  • Conversion example

    Before processing: "Data is from www.test.com"

    After processing: "Data is from *******"

Symbol Standardization

  • Applicable file format: text
  • Parameter description

    Type of content to be converted: Non-standard symbols in the text can be converted to standard symbols. The non-standard symbols include spaces, DBC symbols, punctuations, and number symbols. By default, all non-standard symbols are selected. The filtering granularity is character.

  • Parameter configuration example

Custom Regular Expression Replacement

  • Applicable file format: text
  • Parameter description

    Type of content to be converted: Uses a custom regular expression to replace text content. The data items remain unchanged.

  • Parameter configuration example

  • Conversion example

    Before processing: {"text":"This is the main content aeiou in the test aeiou. "}

    After processing: {"text":"This is the main content 11111 in the test 11111. "}
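The conversion above can be reproduced with a few lines of Python; the pattern and replacement below are the user-configured values from this example, not fixed parts of the operator:

```python
import re

def regex_replace(text, pattern=r"aeiou", repl="11111"):
    # Replace every match of the configured pattern; the number of
    # data items is unchanged, only the text field is rewritten.
    return re.sub(pattern, repl, text)
```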

Date and Time Format Conversion

  • Applicable file format: text
  • Parameter description

    Type of content to be converted: Automatically identifies the date, time, and week, and converts the date, time, and week based on the selected format. The conversion types include date format, time format, and week format. By default, all of them are selected. You can also select some of them.

  • Parameter configuration example

  • Conversion example

    Before processing: {"text":"Today is Monday, March 3, 2025. The rain is heavy in the morning. "}

    After processing: {"text":"Today is Monday, 2025-03-03 00:00:00. The rain is heavy in the morning. "}
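A simplified Python sketch of the date conversion shown above; it handles only the "Month D, YYYY" pattern from this example, whereas the operator recognizes dates, times, and weekdays automatically:

```python
import re
from datetime import datetime

def normalize_dates(text):
    # Rewrite dates like "March 3, 2025" as "2025-03-03 00:00:00".
    # The regex is a narrow assumption covering this example only.
    def repl(match):
        dt = datetime.strptime(match.group(0), "%B %d, %Y")
        return dt.strftime("%Y-%m-%d %H:%M:%S")
    return re.sub(r"[A-Z][a-z]+ \d{1,2}, \d{4}", repl, text)
```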

Filtering Abnormal Characters

  • Applicable file format: text
  • Parameter description

    Type of content to be filtered: Searches for abnormal characters in each data record in the dataset and replaces them with null values. The data items remain unchanged. Types of abnormal characters include invisible characters, emojis, web page labels, special characters, garbled characters, and special spaces. By default, all types are selected. You can also select some of them.

  • Parameter configuration example

  • Filtering example

    Before processing: {"text":"Test exception. <style></style>Haha. Limited-time offer! ☺"}

    After processing: {"text":"Test exception. Haha. Limited-time offer!"}
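The character classes below are assumptions drawn from the categories listed in this section (invisible characters, web page tags, special spaces) plus a small emoji range; the operator's real character sets are broader:

```python
import re

ABNORMAL_PATTERNS = [
    r"[\u0000-\u001f]",      # invisible control characters
    r"<style>.*?</style>",   # web page tag symbols
    r"[\u2000-\u2009]",      # special Unicode spaces
    r"[\u2600-\u27bf]",      # a subset of emoji/symbol code points
]

def strip_abnormal(text):
    # Replace each abnormal character or tag with an empty string;
    # the data item itself is kept.
    for pattern in ABNORMAL_PATTERNS:
        text = re.sub(pattern, "", text, flags=re.DOTALL)
    return text
```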

Custom Regular Expression Filtering

  • Applicable file format: text
  • Parameter description

    Type of content to be filtered: Filters content based on a custom regular expression. The filtering granularity can be character (default) or paragraph.

  • Parameter configuration example

  • Filtering example

    Filtering out the content following "References"

    Before processing: {"text":"This is the body content. References [1] Author 1, Article 1, Journal 1, 2021. [2] Author 2, Article 2, Journal 2, 2022."}

    After processing: {"text":"This is the body content. "}
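The "References" example above can be checked with a short Python sketch; the expression is a user-configured value and the function name is illustrative:

```python
import re

def regex_filter(text, pattern=r"References[\s\S]*"):
    # Delete every match of the configured expression; here the
    # pattern removes "References" and all content after it.
    return re.sub(pattern, "", text)
```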

Custom Keyword Filtering

  • Applicable file format: text
  • Parameter description

    Type of content to be filtered: The filtering granularity can be character (default), paragraph, or document. Keywords to be deleted can be imported from OBS or entered as text.

  • Parameter configuration example

  • Filtering example

    Filtering by keyword "test"

    Before processing: {"text":"Keyword test. This is a test data record. "}

    After processing: {"text":"Keyword. This is a test data record. "}

Sensitive Word Filtering

  • Applicable file format: text
  • Parameter description

    Type of content to be filtered: Automatically detects and filters sensitive data such as pornographic, violent, and political content in the text. Sensitive words need to be preset. The filtering granularity can be character (default), paragraph, or document.

  • Parameter configuration example

  • Filtering example

    Before processing: {"text":" prostitute test"}

    After processing: {"text":"test"}

Filtering Based on the Text Length

  • Applicable file format: text
  • Parameter description

    Type of content to be filtered: Retains data within the specified text length range. By default, the range of text length to be retained is 100 to 1,000 characters, which can be modified.

  • Parameter configuration example

  • Filtering example

    Before processing: {"text":"Test length"}

    After processing: {"text":""}
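A sketch of the length filter using the default range of 100 to 1,000 characters; the function and field names are illustrative:

```python
def keep_by_length(records, min_len=100, max_len=1000):
    # Retain only records whose text length falls within the
    # configured range; everything else is dropped.
    return [r for r in records if min_len <= len(r["text"]) <= max_len]
```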

Redundant Information Filtering

  • Applicable file format: text
  • Parameter description

    Type of content to be filtered: Deletes redundant information from the text based on the paragraph granularity without changing the data items. The content types that can be filtered include figures, notes, and references. By default, all content types are selected. You can also select some of them.

N-gram Feature Filtering

  • Applicable file format: text
  • Parameter description

    Type of content to be filtered: The filtering granularity is document. You can select top-gram filtering or gram repetition rate filtering. By default, top-gram filtering is selected. In top-gram filtering mode, the default value of feature N is 2, and the default value of feature threshold is 0.18. In gram repetition rate filtering mode, the default value of feature N is 2, and the default value of feature threshold is 0.15. The values can be changed.

  • Parameter configuration example

  • Filtering example

    Before processing: {"text":"Wake up. Today is Sunday. Today is a holiday. Tomorrow is Monday. Tomorrow is a working day. "}

    After processing: {"text":""}
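One plausible reading of top-gram filtering, sketched in Python with the defaults from this section (feature N = 2, threshold 0.18); using character-level n-grams is an assumption, as the operator may segment into words instead:

```python
from collections import Counter

def top_gram_ratio(text, n=2):
    # Share of the total length covered by the single most
    # repeated n-gram.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    _, count = Counter(grams).most_common(1)[0]
    return count * n / len(text)

def filter_doc(text, n=2, threshold=0.18):
    # Delete the document (return None) when the ratio exceeds
    # the feature threshold; otherwise keep it.
    return None if top_gram_ratio(text, n) > threshold else text
```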

Paragraph Feature Filtering

  • Applicable file format: text
  • Parameter description

    Types of content to be filtered: Filters the content based on the document filtering granularity, paragraph repetition rate, proportion of the length of repeated paragraphs, and proportion of non-Chinese characters. If any of the specified conditions is not met, the content is filtered out. The default values are as follows: The paragraph repetition rate is less than or equal to 65%, the proportion of the length of repeated paragraphs is less than or equal to 65%, and the proportion of non-Chinese characters ranges from 1% to 50%. The values can be changed.

  • Parameter configuration example

  • Filtering example

    Before processing: {"text":"It is said that the fox only appears to those with pure hearts and sincere wishes. Under the light of the moon, it will gracefully emerge, gazing at the visitor with eyes that shimmer with wisdom. Only when the fox senses the visitor's sincerity and purity will it speak, asking about their desires. Yet, the granting of a wish is never without cost. Each fulfilled wish demands a corresponding price—perhaps a cherished memory, a beloved possession, or even a piece of one's life. Therefore, villagers must carefully ponder before making a wish, considering whether they are truly willing to pay such a price."}

    After processing:

Sentence Feature Filtering

  • Applicable file format: text
  • Parameter description

    Type of content to be filtered: Filters the content based on the document filtering granularity and the average sentence length to be retained. If the content does not meet the requirements, the content is filtered out. The default average sentence length to be retained is greater than or equal to 10 characters, which can be modified.

  • Parameter configuration example

  • Filtering example

    Before processing: {"text":"In a small village, there is a legend. According to the legend, a mysterious fox appears in the village forest every full moon night."}

    After processing: {"text":""}
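The average-sentence-length rule can be sketched as follows; the punctuation set used as sentence separators is an assumption:

```python
import re

def filter_by_avg_sentence_length(text, min_avg=10):
    # Split on common Chinese/English sentence-ending punctuation and
    # retain the document only if the average sentence length is at
    # least the configured minimum (default 10 characters).
    sentences = [s for s in re.split(r"[.!?。!?]", text) if s.strip()]
    if not sentences:
        return ""
    avg = sum(len(s) for s in sentences) / len(sentences)
    return text if avg >= min_avg else ""
```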

Word Feature Filtering

  • Applicable file format: text
  • Parameter description

    Type of content to be filtered: Filters the content by document based on the number of words to be retained (50 to 100,000 by default) and the average length of words to be retained (50 to 100,000 characters by default). If either condition is not met, the content is filtered out. The default values can be changed.

  • Parameter configuration example

  • Filtering example

    Before processing: {"text":"It is said that the fox only appears to those with pure hearts and sincere wishes. " }

    After processing: {"text":""}

Filtering of the Incomplete Sentence at the End of a Paragraph

  • Applicable file format: text
  • Parameter description

    Type of content to be filtered: Checks whether the content at the end of a paragraph is complete based on the sentence-level filtering granularity, and filters out the content if the content is incomplete.

  • Parameter configuration example

  • Filtering example

    Before processing: "JAVA is an object-oriented programming language. Use JAVA to,"

    After processing: "JAVA is an object-oriented programming language."
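A sketch of the end-of-paragraph check; treating '.', '!', '?' and their full-width counterparts as terminal punctuation is an assumption:

```python
import re

TERMINAL = ".!?。!?"

def drop_incomplete_tail(text):
    # If the paragraph already ends with terminal punctuation it is
    # complete; otherwise the trailing fragment is removed.
    text = text.strip()
    if not text or text[-1] in TERMINAL:
        return text
    parts = re.split(r"(?<=[.!?。!?])", text)
    return "".join(parts[:-1]).strip()
```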

Advertisement Data Filtering

  • Applicable file format: text
  • Parameter description

    Type of content to be filtered: Deletes a sentence that includes advertisement data from the text, based on a filtering granularity of a sentence.

  • Parameter configuration example

  • Filtering example

    Before processing: {"text":"Special discount! Buy our products and enjoy a discount of up to 50%! Click the link below to avail the discount at https://example.com. Seize this opportunity now and take action!"}

    After processing: {"text":""}

Q&A Pair Filtering

  • Applicable file format: text
  • Parameter description

    Type of content to be filtered: Filters the Q&A pairs whose questions are not in the string format, answers are empty, or answers are meaningless.

  • Parameter configuration example

    No parameters need to be set.

  • Filtering example

    Before processing: {"text":"[{"context":"Hello","target":"Yes"},{"context":"list","target":""}]"}

    After processing: {"text":"[{"context":"Hello","target":"Yes"}]"}

Language Filtering

  • Applicable file format: text
  • Parameter description

    Type of content to be filtered: Filters the content by document based on the language to be retained and the deletion threshold. The language to be retained is Chinese by default; English can also be selected. Documents whose detection score for the retained language is lower than the threshold (0.65 by default) are deleted. The values can be modified.

  • Parameter configuration example

  • Filtering example

    Before processing: {"text":"Hello, my name is Li Ming. I am excited to introduce myself and share a bit about who I am."}

    After processing: {"text":""}

Global Text Deduplication

  • Applicable file format: text
  • Parameter description

    Type of content to be filtered: Detects and removes duplicate or highly similar text from data to prevent model overfitting or reduced generalization.

  • Parameter configuration example

    No parameters need to be set.

  • Filtering example

    Before processing: [{"fileName":"text1.txt","text":"It is said that the fox only appears to those with pure hearts and sincere wishes."},{"fileName":"text2.txt","text":"It is said that the fox only appears to those with pure hearts and sincere wishes."}]

    After processing: [{"fileName":"text1.txt","text":"It is said that the fox only appears to those with pure hearts and sincere wishes."}]
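Exact-match deduplication on the text field can be sketched as below; the real operator also detects highly similar (near-duplicate) text, which a set lookup like this does not capture:

```python
def deduplicate(records):
    # Keep the first occurrence of each distinct text value and
    # drop later exact duplicates.
    seen, unique = set(), []
    for record in records:
        key = record["text"].strip()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```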

General Semantics Scoring of SFT Data

  • Applicable file formats: Text > single-turn Q&A, single-turn Q&A (with a persona), and Q&A ranking.
  • Parameter description

    Type of content to be filtered: Uses the LLM to check and score the general semantics of SFT data and filter data based on the scoring threshold. You need to select a model. The value of the filter field can be Yes or No. The default value is No. The threshold value ranges from 0.0 to 10.0. Samples whose value is less than the threshold will be filtered out. The default threshold is 6, and the threshold can be changed.

  • Filtering example

    Before processing: {"system":"You are an expert in calculations.","context":"Calculate the travel time based on the distance between two cities. Starting city: Beijing, destination city: Shanghai, distance: 1,318 km. \n","target":"To calculate the travel time between two cities based on the distance, we need to consider the mode of travel, as each has a different average speed. Here's a general calculation for common modes of travel between Beijing and Shanghai, which are 1,318 km apart. High-speed train: Approximately 5-6 hours; Plane: Approximately 2 hours; Car: Approximately 13-14 hours. "}{"system":"You are a Q&A expert","context":"Find the number of occurrences of a specific keyword in a given article. \nArticle: Many people will stay up late to watch the matches during this year's World Cup. \nKeyword: World Cup\n","target":"Keyword\"World Cup\" appears once in the article. "}

    After processing: {"context":"Find the number of occurrences of a specific keyword in the given article. \nArticle: Many people will stay up late to watch the matches during this year's World Cup. \nKeyword: World Cup\n","filter":0.0,"qa_quality score":{"reason":"The large model correctly identified that the keyword \"World Cup\" appeared once in the article and provided the correct answer. ","score":10.0},"system":"You are a Q&A expert","target":"Keyword \"World Cup\" appears once in the article. "}

Rule-based Inspection and Filtering of SFT Data

  • Applicable file formats: Text > single-turn Q&A, single-turn Q&A (with a persona), and Q&A ranking.
  • Parameter description

    Type of content to be filtered: Checks the SFT data quality and filters the data based on the selected rules. The filtering rules include character strings, truncated text, incomplete content, mixed Chinese and English characters, mixed Traditional and Simplified Chinese characters, repeated content, special characters, unaligned parentheses, repeated patterns, garbled characters, inconsistent Chinese and English replies, sensitive model identity, and no slow thinking. By default, all filtering rules are selected; you can also select some of them. The filter field can be set to Yes or No. The default value is No.

  • Filtering example

    Before processing: {"context":"Hello, please introduce yourself.","target":"I am Pangu hello world."}{"context":"Which poet is referred to as the Poet Immortal?","target":"Hello! There are Traditional Chinese."}

    After processing:

CoT Scoring of SFT Data

  • Applicable file formats: Text > single-turn Q&A, single-turn Q&A (with a persona), and Q&A ranking.
  • Parameter description

    Type of content to be filtered: Uses the LLM to check and score the CoT of SFT data and filter data based on the scoring threshold. You need to select a model. The value of the filter field can be Yes or No. The default value is No. The threshold value ranges from 0.0 to 10.0. Samples whose value is less than the threshold will be filtered out. The default threshold is 6, and the threshold can be changed. The data judgment rule and data scoring rule can be edited. The value can contain a maximum of 1000 characters.

  • Filtering example

    Before processing: {"context":"Context","targets":["hello","hi","hello"]}

    After processing: {"context":"Context","targets":["hello","hi","hello"],"qa_cot_score":[{"result":"Incorrect","score":0.0,"reason":"The model's answer is irrelevant."},{"result":"Incorrect","score":0.0,"reason":"The answer is irrelevant."},{"result":"Incorrect","score":0.0,"reason":"The model's answer is irrelevant."}]}

Pre-trained Text Classification

  • Applicable file format: text > pre-trained text
  • Parameter description

    Type of content to be labeled: Classifies the content of the pre-trained text, for example, news, education, and health. The supported languages include Chinese and English. The default language is Chinese.

  • Parameter configuration example

  • Example

    {"fileName":"News Labeling Test.docx","text":"Beijing, March 3 (Xinhua) According to the People's Bank of China, the financial market operation in January showed that China issued a total of CNY5,102.75 billion in bonds in January. Of which, CNY1,018.5 billion in government bonds, CNY557.57 billion in local government bonds, CNY704.21 billion in financial bonds, CNY1,279.17 billion in corporate credit bonds, CNY2.73 billion in asset-backed securities, and CNY1,514.78 billion in interbank certificates of deposit. \n As of the end of January, the bond market custody balance in China was CNY178.2 trillion. Of which, the custody balance of the interbank market was CNY156.9 trillion, and that of the exchange market was CNY21.3 trillion. \nAs of the end of January, the custody balance of foreign institutions in the Chinese bond market was CNY4.2 trillion, accounting for 2.3% of the custody balance of the Chinese bond market. Of which, the bond custody balance of foreign institutions in the interbank bond market was CNY4.1 trillion. By bond type, foreign institutions held CNY2.0 trillion of government bonds (48.8%), CNY1.1 trillion of certificates of deposit (25.8%), and CNY0.9 trillion of policy bank bonds (20.8%). \n","pre_classification":"Economy"}

General Quality Assessment

  • Applicable file format: text > pre-trained text
  • Parameter description

    Type of content to be labeled: Assesses the general quality of text, such as fluency, clarity, and diversity. You need to select a model and an industry. The industry can be manually entered.

  • Parameter configuration example

  • Example

    {"fileName":"News Labeling Test.docx","text":"Beijing, March 3 (Xinhua) According to the People's Bank of China, the financial market operation in January showed that China issued a total of CNY5,102.75 billion in bonds in January. Of which, CNY1,018.5 billion in government bonds, CNY557.57 billion in local government bonds, CNY704.21 billion in financial bonds, CNY1,279.17 billion in corporate credit bonds, CNY2.73 billion in asset-backed securities, and CNY1,514.78 billion in interbank certificates of deposit. \n As of the end of January, the bond market custody balance in China was CNY178.2 trillion. Of which, the custody balance of the interbank market was CNY156.9 trillion, and that of the exchange market was CNY21.3 trillion. \nAs of the end of January, the custody balance of foreign institutions in the Chinese bond market was CNY4.2 trillion, accounting for 2.3% of the custody balance of the Chinese bond market. Of which, the bond custody balance of foreign institutions in the interbank bond market was CNY4.1 trillion. By bond type, foreign institutions held CNY2.0 trillion of government bonds (48.8%), CNY1.1 trillion of certificates of deposit (25.8%), and CNY0.9 trillion of policy bank bonds (20.8%). \n","generalscore":{"Instructiveness":"5","Cleanliness":"5","isIncorrect":"false","Toxicity":"false","Richness":"5","Fluency":"5","knowledge":"5"}}

Problem Timeliness Evaluation

  • Applicable file format: text > single-turn Q&A
  • Parameter description

    Type of content to be labeled: Determines whether the issue is time-sensitive and provides the reason. You need to select a model.

  • Parameter configuration example

  • Example

    {"context":"1-1-2 Where is the capital of Jiangsu Province?","target":"Nanjing","timeliness_classification":0}

Answer Quality Scoring

  • Applicable file format: text > single-turn Q&A
  • Parameter description

    Type of content to be labeled: Scores the quality of answers to fine-tuning datasets, such as logical consistency and fact correctness. You need to select a model.

  • Parameter configuration example

  • Example

    {"context":"1-2-1 Where is the capital of China? ","target":"Beijing","answer score":{"Logical coherence":10,"Comprehensive score":10,"User requirements satisfied":10,"Completeness":10,"Fact correctness":10}}

Syntax Quality Assessment

  • Applicable file format: text > single-turn Q&A
  • Parameter description

    Type of content to be labeled: Assesses the syntax quality of the text, for example, relevance and standardization. You need to select a model.

  • Parameter configuration example

  • Example

    {"target":"Beijing","context":"1-2-1 Where is the capital of China?","grammar score":{"Unrelated reply":0,"Fact error":0,"Logical error":0,"Non-standard language":0,"Sentence truncation":0,"Improper multi-language mixing":0,"Meaningless repetition":0}}