Image Dataset Processing Operators

The data processing operators provide multiple data operation capabilities, including data extraction, filtering, conversion, and labeling. These operators help you extract useful information from massive data and perform deep processing to generate high-quality training data.

The platform provides image and text processing operators. For details about the operator capabilities, see Table 1.

**Table 1** Image processing operator capabilities
Category	Operator Name	Operator Description
Data extraction	Image and text extraction	Extracts JSON text and images from the compressed image-text package and performs structured parsing (Base64 encoding) on the images to facilitate the use of image-text processing operators.
Data filtering	Image Metadata Filtering	Cleans image/text data based on the image width and height, file size, and aspect ratio threshold.
	Text Length Filtering in Image-Text Pairs	Filters out the image-text pairs whose text length is not within the specified text length range. The length of a Chinese character or an English letter is counted as 1.
	Text language filtering in image-text pairs	Detects the language of each image-text pair using a language detection model and filters out the pairs that are not in the language to be retained. Note: The language detection model may occasionally misjudge the language.
	Image/text deduplication	Deduplicates images and text based on structured images. Checks whether the number of images corresponding to the same text exceeds the threshold. If yes, the system randomly deletes redundant images and retains only the images and text within the threshold.
	Image deduplication	Filters out duplicate image-text pairs after image structuring.
Data labeling	Pornographic Image Detection	Labels pornographic images.
	Dangerous Situation Image Detection	Labels dangerous situation images.
	Violent and Terrorism Image Detection	Labels violent and terrorism images.
Data conversion	Filtering of abnormal characters in images and text	Replaces abnormal characters in the text data with null values, leaving the number of data entries unchanged. Abnormal characters include invisible characters (for example, U+0000 to U+001F) and web page tags (for example, <p>).
	Adding Watermarks	Covers the image with translucent watermark text arranged at a fixed spacing and a specific angle.

Image and text extraction

Applicable file formats:
tar+jsonl: All images are saved as a TAR package. Images can be in JPG, JPEG, PNG, or BMP format. The image text is saved as a JSONL file. The image name in the JSONL file must be the same as that in the TAR package.
Parameter description:
Type of content to be extracted: Extract the JSON text and images from the image-text package and perform structured parsing on the images.
Parameter configuration example:
No parameters need to be set.
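
The extraction step can be pictured with the following minimal sketch. It is not the platform's implementation: the function name `extract_pairs`, the JSONL key `image`, and the output key `image_base64` are assumptions made for illustration. It reads the images from a TAR package, matches them to JSONL records by image name, and attaches the Base64-encoded image bytes to each record.

```python
import base64
import io
import json
import tarfile

IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".bmp")


def extract_pairs(tar_bytes, jsonl_text, image_key="image"):
    """Pair each JSONL record with the Base64-encoded bytes of its image.

    The key names ("image", "image_base64") are hypothetical; the real
    JSONL schema is not specified in this document.
    """
    # Collect the supported image files from the TAR package by name.
    images = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.lower().endswith(IMAGE_SUFFIXES):
                images[member.name] = tar.extractfile(member).read()

    pairs, missing = [], []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        name = record[image_key]
        if name not in images:
            # The image name in the JSONL file has no match in the TAR package.
            missing.append(name)
            continue
        record["image_base64"] = base64.b64encode(images[name]).decode("ascii")
        pairs.append(record)
    return pairs, missing
```

Records whose image name has no match in the TAR package are returned separately, which makes naming mismatches easy to spot before the data is uploaded.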

Image Metadata Filtering

Applicable file formats:
JPG, JPEG, PNG, and BMP

tar: All images are saved as a TAR package. The images in the TAR package can be in JPG, JPEG, PNG, or BMP format.
Parameter description:
Type of content to be filtered:

Minimum width and height: If the width or height of an image is less than the value of this parameter, the image will be filtered out.

Minimum file size (B): If the file size is less than the minimum file size, the file will be filtered out.
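
As an illustration of the two thresholds above, the following sketch applies them to image metadata that has already been read. The `ImageMeta` type and the `keep_image` function are hypothetical names, not part of the platform, and the aspect ratio threshold mentioned in Table 1 is left out because its parameters are not described here.

```python
from dataclasses import dataclass


@dataclass
class ImageMeta:
    width: int       # pixels
    height: int      # pixels
    size_bytes: int  # file size in bytes


def keep_image(meta, min_side, min_size_bytes):
    """Return True if the image passes both thresholds."""
    # Filtered out if the width OR the height is below the minimum.
    if meta.width < min_side or meta.height < min_side:
        return False
    # Filtered out if the file is smaller than the minimum file size.
    if meta.size_bytes < min_size_bytes:
        return False
    return True
```

For example, with `min_side=64` and `min_size_bytes=1024`, a 100 x 50 image is filtered out because its height is below 64, regardless of its file size.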

Text Length Filtering in Image-Text Pairs

Applicable file formats:
tar+jsonl: All images are saved as a TAR package. The images can be in JPG, JPEG, PNG, or BMP format.

The image text is saved as a JSONL file. The image name in the JSONL file must be the same as that in the TAR package.
Parameter description:
Type of content to be filtered: Filters out the image-text pairs whose text length is not within the specified text length range. The length of a Chinese character or an English letter is counted as 1.

Text Language Filtering in Image-Text Pairs

Applicable file formats:
tar+jsonl: All images are saved as a TAR package. The images can be in JPG, JPEG, PNG, or BMP format.

The image text is saved as a JSONL file. The image name in the JSONL file must be the same as that in the TAR package.
Parameter description:
Type of content to be filtered: Detects the language of each image-text pair using a language detection model and filters out the pairs that are not in the language to be retained. Note: The language detection model may occasionally misjudge the language.

Image/Text Deduplication

Applicable file formats:
tar+jsonl: All images are saved as a TAR package. The images can be in JPG, JPEG, PNG, or BMP format.

The image text is saved as a JSONL file. The image name in the JSONL file must be the same as that in the TAR package.
Parameter description:
Type of content to be filtered:
1. Deduplicates images and text based on structured images.
2. Checks whether the number of images corresponding to the same text exceeds the threshold. If yes, the system randomly deletes redundant images and retains only the images and text within the threshold.
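
The two steps above can be sketched as follows. This is only an illustration under stated assumptions: each pair is a dictionary with hypothetical `image` and `text` keys, duplicates are detected by exact equality of both values, and `dedupe_and_cap` is not a platform function.

```python
import random
from collections import defaultdict


def dedupe_and_cap(pairs, threshold, seed=None):
    """Drop exact duplicate pairs, then keep at most `threshold` images per text."""
    rng = random.Random(seed)  # seed only makes the random choice repeatable
    seen = set()
    by_text = defaultdict(list)

    # Step 1: remove pairs whose (image, text) combination was already seen.
    for pair in pairs:
        key = (pair["image"], pair["text"])
        if key in seen:
            continue
        seen.add(key)
        by_text[pair["text"]].append(pair)

    # Step 2: if one text has more images than the threshold,
    # randomly keep `threshold` of them and drop the rest.
    kept = []
    for group in by_text.values():
        if len(group) > threshold:
            group = rng.sample(group, threshold)
        kept.extend(group)
    return kept
```

Because the redundant images are chosen at random, two runs over the same data can retain different images unless a seed is fixed.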

Image Deduplication

Applicable file formats:
JPG, JPEG, PNG, and BMP

tar: All images are saved as a TAR package. The images in the TAR package can be in JPG, JPEG, PNG, or BMP format.
Parameter description:
Type of content to be filtered: After image structuring, duplicate image/text pairs are filtered out.
Parameter configuration example:
No parameters need to be set.
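
This document does not describe how the operator decides that two images are duplicates. As a rough illustration only, the sketch below treats images as duplicates when their bytes are identical, compared through a SHA-256 digest; the function name and the exact-match criterion are assumptions, and the platform may use a different comparison.

```python
import hashlib


def drop_duplicate_images(images):
    """Return the names of images to keep, keeping the first of each duplicate set.

    `images` is an iterable of (name, data) tuples, where data is the raw
    image bytes. Byte-identical comparison is an assumption of this sketch.
    """
    seen = set()
    unique = []
    for name, data in images:
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            continue  # same bytes as an earlier image
        seen.add(digest)
        unique.append(name)
    return unique
```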

Pornographic Image Detection

Applicable file formats:
JPG, JPEG, PNG, and BMP

tar: All images are saved as a TAR package. The images in the TAR package can be in JPG, JPEG, PNG, or BMP format.
Parameter description:
Type of content to be labeled: Scores the pornographic content of the image. A higher score indicates a higher risk. The score range is (0, 100). Images whose score is greater than or equal to 50 are considered pornographic images.
Parameter configuration example:
No parameters need to be set.
Detection example:
The results are stored in the annotation file as the image_porn object.

suggestion: indicates whether the file passes the check. pass indicates that the file passes the check and no problem occurs. review indicates that manual review is required. You can choose to bypass or block the file based on your review policy. block indicates that the file to be reviewed is problematic.

confidence: detection confidence of the model. (Note that the confidence indicates the confidence of the model-provided suggestions.) If suggestion is pass, the value is 0. If suggestion is review or block, the value ranges from 0 to 1.

label: label of the pornographic content detected by the model. If no pornographic content is detected, the value is empty.
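
A downstream script might act on these fields as in the sketch below. The layout of the annotation record (a dictionary containing an `image_porn` object with `suggestion`, `confidence`, and `label`) follows the field descriptions above, but the exact file structure, the function name, and the returned action names are assumptions.

```python
def route_by_suggestion(annotation, key="image_porn", block_on_review=False):
    """Map the detection suggestion to an action for the image.

    Returns "keep", "drop", or "manual_review" (hypothetical action names).
    """
    suggestion = annotation[key]["suggestion"]
    if suggestion == "pass":
        return "keep"
    if suggestion == "block":
        return "drop"
    if suggestion == "review":
        # The review policy decides whether to block or to queue for a human.
        return "drop" if block_on_review else "manual_review"
    raise ValueError("unexpected suggestion: %r" % suggestion)
```

The `block_on_review` flag stands in for the review policy mentioned above: a strict pipeline can drop every image that needs review, while a lenient one queues it for manual inspection.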

Dangerous Situation Image Detection

Applicable file formats:
JPG, JPEG, PNG, and BMP

tar: All images are saved as a TAR package. The images in the TAR package can be in JPG, JPEG, PNG, or BMP format.
Parameter description:
Type of content to be labeled: Labels the content of dangerous situation images.
Parameter configuration example:
No parameters need to be set.
Detection example:
The results are stored in the annotation file as the image_danger object.

suggestion: indicates whether the file passes the check. pass indicates that the file passes the check and no problem occurs. review indicates that manual review is required. You can choose to bypass or block the file based on your review policy. block indicates that the file to be reviewed is problematic.

confidence: detection confidence of the model. (Note that the confidence indicates the confidence of the model-provided suggestions.) If suggestion is pass, the value is 0. If suggestion is review or block, the value ranges from 0 to 1.

label: label of the dangerous situation content detected by the model. If no dangerous situation content is detected, the value is empty.

Violent and Terrorism Image Detection

Applicable file formats:
JPG, JPEG, PNG, and BMP

tar: All images are saved as a TAR package. The images in the TAR package can be in JPG, JPEG, PNG, or BMP format.
Parameter description:
Type of content to be labeled: Labels violent and terrorism images.
Parameter configuration example:
No parameters need to be set.
Detection example:
The results are stored in the annotation file as the image_terrorism object.

suggestion: indicates whether the file passes the check. pass indicates that the file passes the check and no problem occurs. review indicates that manual review is required. You can choose to bypass or block the file based on your review policy. block indicates that the file to be reviewed is problematic.

confidence: detection confidence of the model. (Note that the confidence indicates the confidence of the model-provided suggestions.) If suggestion is pass, the value is 0. If suggestion is review or block, the value ranges from 0 to 1.

label: label of the violent and terrorism content detected by the model. If no violent or terrorism content is detected, the value is empty.
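
When several detection operators are run on the same image, their results can be combined into one decision. The sketch below takes the strictest suggestion across the image_porn, image_danger, and image_terrorism objects. The assumption that all three objects sit side by side in one annotation record, and the severity ordering pass < review < block, are choices made for this illustration rather than documented platform behavior.

```python
SEVERITY = {"pass": 0, "review": 1, "block": 2}
DETECTOR_KEYS = ("image_porn", "image_danger", "image_terrorism")


def strictest_suggestion(annotation):
    """Return the most severe suggestion among the detectors that were run."""
    worst = "pass"
    for key in DETECTOR_KEYS:
        result = annotation.get(key)
        if result is None:
            continue  # this detection operator was not run for the image
        if SEVERITY[result["suggestion"]] > SEVERITY[worst]:
            worst = result["suggestion"]
    return worst
```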

Filtering of Abnormal Characters in Images and Text

Applicable file formats:
tar+jsonl: All images are saved as a TAR package. The images can be in JPG, JPEG, PNG, or BMP format.

The image text is saved as a JSONL file. The image name in the JSONL file must be the same as that in the TAR package.
Parameter description:
Type of content to be converted: Replaces abnormal characters in the text data with null values, with data entries unchanged.
Parameter configuration example:
No parameters need to be set.
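
The kind of cleaning this operator performs can be approximated with two regular expressions, as in the sketch below. The patterns and the function name are assumptions: the control character range U+0000 to U+001F and the tag example <p> come from Table 1, but the platform's full list of abnormal characters is not given here, and "null value" is interpreted as an empty string.

```python
import re

# Invisible control characters U+0000 to U+001F.
# Note that this range also covers tab (U+0009) and newline (U+000A).
CONTROL_CHARS = re.compile(r"[\u0000-\u001f]")
# Simple opening or closing web page tags such as <p> or </p>.
HTML_TAGS = re.compile(r"</?[A-Za-z][^<>]*>")


def clean_text(text):
    """Remove web page tags and control characters; the entry itself is kept."""
    without_tags = HTML_TAGS.sub("", text)
    return CONTROL_CHARS.sub("", without_tags)
```

Only the text of each entry changes; no entry is dropped, so the number of data entries stays the same.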

Adding Watermarks

Applicable file formats:
JPG, JPEG, PNG, and BMP

tar: All images are saved as a TAR package. The images in the TAR package can be in JPG, JPEG, PNG, or BMP format.
Parameter description:
Watermark text: The value is of the string type.
Conversion example
The labeling score is stored in the element field in the JSONL file.