Image Dataset Processing Operators
The data processing operators provide multiple data operation capabilities, including data extraction, filtering, conversion, and labeling. These operators help you extract useful information from massive data and perform deep processing to generate high-quality training data.
The platform provides image and text processing operators. For details about the operator capabilities, see Table 1.
Category | Operator Name | Operator Description |
---|---|---|
Data extraction | Image and Text Extraction | Extracts JSON text and images from the compressed image-text package and performs structured parsing (Base64 encoding) on the images so that the image-text processing operators can use them. |
Data filtering | Image Metadata Filtering | Cleans image-text data based on thresholds for image width and height, file size, and aspect ratio. |
Data filtering | Text Length Filtering in Image-Text Pairs | Filters out image-text pairs whose text length is outside the specified range. Each Chinese character or English letter counts as 1. |
Data filtering | Text Language Filtering in Image-Text Pairs | Detects the language of each image-text pair with a language detection model and filters out pairs that are not in the language to be retained. Note: The language detection model may occasionally misjudge the language. |
Data filtering | Image Deduplication | |
Data filtering | Image/Text Deduplication | Filters out duplicate image-text pairs after image structuring. |
Data labeling | Pornographic Image Detection | Labels pornographic images. |
Data labeling | Dangerous Situation Image Detection | Labels dangerous situation images. |
Data labeling | Violent and Terrorism Image Detection | Filters out violent and terrorism images. |
Data conversion | Filtering of Abnormal Characters in Images and Text | Replaces abnormal characters in the text data with null values. The number of data entries is unchanged. |
Data conversion | | Covers the image with translucent watermark text arranged at a fixed spacing and a specific angle. |
Image and Text Extraction
- Applicable file formats:
tar+jsonl: All images are saved as a TAR package. Images can be in JPG, JPEG, PNG, or BMP format. The image text is saved as a JSONL file. The image name in the JSONL file must be the same as that in the TAR package.
- Parameter description:
Type of content to be extracted: Extract the JSON text and images from the image-text package and perform structured parsing on the images.
- Parameter configuration example:
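To make the tar+jsonl layout and the Base64 structuring concrete, here is a minimal local sketch in Python. It is not the platform's parameter configuration or implementation. The JSONL keys `image` and `text`, the output field names, and the helper names are assumptions made for illustration only.

```python
import base64
import io
import json
import tarfile


def build_sample_package() -> tuple[bytes, str]:
    """Create a tiny in-memory TAR archive and a matching JSONL string."""
    # Placeholder bytes that only start with the PNG signature; not a real image.
    image_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name="cat.png")
        info.size = len(image_bytes)
        tar.addfile(info, io.BytesIO(image_bytes))
    # Assumed record layout: one JSON object per line with "image" and "text".
    jsonl = json.dumps({"image": "cat.png", "text": "A cat on a sofa"}) + "\n"
    return buf.getvalue(), jsonl


def extract_pairs(tar_bytes: bytes, jsonl_text: str) -> list[dict]:
    """Join JSONL records to TAR members by image name and Base64-encode images."""
    images = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar.getmembers():
            if member.isfile():
                images[member.name] = tar.extractfile(member).read()

    pairs = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        name = record["image"]
        if name not in images:
            # The image name in the JSONL file must match a name in the TAR package.
            continue
        pairs.append(
            {
                "image_name": name,
                "text": record["text"],
                "image_base64": base64.b64encode(images[name]).decode("ascii"),
            }
        )
    return pairs


tar_bytes, jsonl = build_sample_package()
pairs = extract_pairs(tar_bytes, jsonl)
```

Records whose image name has no counterpart in the TAR package are skipped here. How the platform treats such mismatches is not stated in this document.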
Image Metadata Filtering
- Applicable file formats:
tar: All images are saved as a TAR package. The images in the TAR package can be in JPG, JPEG, PNG, or BMP format.
- Parameter description:
Type of content to be filtered:
Minimum width and height: If the width or height of an image is less than the value of this parameter, the image will be filtered out.
Minimum file size (B): If the file size is less than the minimum file size, the file will be filtered out.
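The two thresholds above can be sketched as a simple predicate. This is an illustration only: the `ImageMeta` type, the function name, and the threshold values are hypothetical, and the sketch assumes the width, height, and file size of each image are already known.

```python
from dataclasses import dataclass


@dataclass
class ImageMeta:
    name: str
    width: int       # pixels
    height: int      # pixels
    size_bytes: int  # file size in bytes


def keep_image(meta: ImageMeta, min_side: int, min_size_bytes: int) -> bool:
    """Return False if the image would be filtered out by either threshold."""
    # Filtered out when the width OR the height is below the minimum.
    if meta.width < min_side or meta.height < min_side:
        return False
    # Filtered out when the file is smaller than the minimum file size.
    if meta.size_bytes < min_size_bytes:
        return False
    return True


images = [
    ImageMeta("a.jpg", 640, 480, 52_000),
    ImageMeta("b.png", 64, 480, 9_000),  # width below the minimum
    ImageMeta("c.bmp", 800, 600, 512),   # file size below the minimum
]
kept = [m.name for m in images if keep_image(m, min_side=100, min_size_bytes=1024)]
print(kept)  # ['a.jpg']
```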
Text Length Filtering in Image-Text Pairs
- Applicable file formats:
tar+jsonl: All images are saved as a TAR package. The images can be in JPG, JPEG, PNG, or BMP format.
The image text is saved as a JSONL file. The image name in the JSONL file must be the same as that in the TAR package.
- Parameter description:
Type of content to be filtered: Filters out the image-text pairs whose text length is not within the specified text length range. The length of a Chinese character or an English letter is counted as 1.
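The counting rule maps naturally onto Python's `len()`, which counts Unicode code points, so a Chinese character and an English letter each count as 1. The sketch below is illustrative only: the record keys and the bounds are hypothetical, and it counts every character, including spaces and punctuation, which this document does not specify.

```python
def within_length(text: str, min_len: int, max_len: int) -> bool:
    """True if the text length (in characters) is inside the inclusive range."""
    return min_len <= len(text) <= max_len


pairs = [
    {"image": "1.jpg", "text": "一只猫"},            # 3 characters
    {"image": "2.jpg", "text": "cat"},               # 3 characters
    {"image": "3.jpg", "text": "a"},                 # 1 character, too short
    {"image": "4.jpg", "text": "一只猫 on a sofa"},  # 13 characters, too long
]

kept = [p["image"] for p in pairs if within_length(p["text"], min_len=2, max_len=10)]
print(kept)  # ['1.jpg', '2.jpg']
```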
Text Language Filtering in Image-Text Pairs
- Applicable file formats:
tar+jsonl: All images are saved as a TAR package. The images can be in JPG, JPEG, PNG, or BMP format.
The image text is saved as a JSONL file. The image name in the JSONL file must be the same as that in the TAR package.
- Parameter description:
Type of content to be filtered: The language of each image-text pair is identified by a language detection model. Image-text pairs that are not in the language to be retained are filtered out. Note: The language detection model may occasionally misjudge the language.
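For the language filtering that this operator performs, the overall flow can be sketched as follows. The platform uses a language detection model whose details are not given here, so `guess_language` is a deliberately crude stand-in based on character ranges. The function names, the language codes, and the record keys are all assumptions.

```python
def guess_language(text: str) -> str:
    """Rough stand-in for a language detection model: 'zh', 'en', or 'unknown'."""
    # Count CJK Unified Ideographs and ASCII letters; the larger group wins.
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if cjk == 0 and latin == 0:
        return "unknown"
    return "zh" if cjk >= latin else "en"


def filter_by_language(pairs: list[dict], keep: str) -> list[dict]:
    """Keep only the pairs whose text is detected as the language to be retained."""
    return [p for p in pairs if guess_language(p["text"]) == keep]


pairs = [
    {"image": "1.jpg", "text": "一只猫在沙发上"},
    {"image": "2.jpg", "text": "A cat on a sofa"},
    {"image": "3.jpg", "text": "12345"},
]
english_only = filter_by_language(pairs, keep="en")
```

A heuristic like this misclassifies mixed-language text far more often than a trained model would, which is one reason a real detector is used in practice.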
Image/Text Deduplication
- Applicable file formats:
tar+jsonl: All images are saved as a TAR package. The images can be in JPG, JPEG, PNG, or BMP format.
The image text is saved as a JSONL file. The image name in the JSONL file must be the same as that in the TAR package.
- Parameter description:
Type of content to be filtered:
- Deduplicates images and text based on structured images.
- Checks whether the number of images corresponding to the same text exceeds the threshold. If yes, the system randomly deletes redundant images and retains only the images and text within the threshold.
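The two steps above can be sketched as follows. This is a simplified illustration, not the platform's algorithm: it treats two pairs as duplicates only when the Base64 image string and the text match exactly, and the field names, function name, and threshold are hypothetical.

```python
import hashlib
import random
from collections import defaultdict


def deduplicate(pairs: list[dict], max_images_per_text: int, seed=None) -> list[dict]:
    rng = random.Random(seed)

    # Step 1: drop exact duplicate image-text pairs, keyed by a hash of
    # the structured (Base64) image plus the text.
    seen = set()
    unique = []
    for pair in pairs:
        raw = (pair["image_base64"] + "\x00" + pair["text"]).encode("utf-8")
        key = hashlib.sha256(raw).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(pair)

    # Step 2: group by text and, where a text has more images than the
    # threshold, randomly keep only `max_images_per_text` of them.
    by_text = defaultdict(list)
    for pair in unique:
        by_text[pair["text"]].append(pair)

    result = []
    for group in by_text.values():
        if len(group) > max_images_per_text:
            group = rng.sample(group, max_images_per_text)
        result.extend(group)
    return result


pairs = [
    {"image_base64": "AAA=", "text": "cat"},
    {"image_base64": "AAA=", "text": "cat"},  # exact duplicate of the first pair
    {"image_base64": "BBB=", "text": "cat"},
    {"image_base64": "CCC=", "text": "cat"},
    {"image_base64": "DDD=", "text": "dog"},
]
deduped = deduplicate(pairs, max_images_per_text=2, seed=0)
```

Because the surplus images are chosen at random, which two "cat" images survive depends on the seed; only the counts are deterministic.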
Image Deduplication
Pornographic Image Detection
- Applicable file formats:
tar: All images are saved as a TAR package. The images in the TAR package can be in JPG, JPEG, PNG, or BMP format.
- Parameter description:
Type of content to be labeled: Scores the pornographic content of each image. A higher score indicates a higher risk. The score range is (0, 100). Images whose score is greater than or equal to 50 are considered pornographic.
- Parameter configuration example:
- Detection example:
The results are stored in the annotation file as the image_porn object.
suggestion: indicates whether the file passes the check. pass indicates that the file passes the check and no problem occurs. review indicates that manual review is required. You can choose to bypass or block the file based on your review policy. block indicates that the file to be reviewed is problematic.
confidence: detection confidence of the model. (Note that the confidence indicates the confidence of the model-provided suggestions.) If suggestion is pass, the value is 0. If suggestion is review or block, the value ranges from 0 to 1.
label: label of the pornographic content detected by the model. If no pornographic content is detected, the value is empty.
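As a hedged illustration of how the `image_porn` object might be consumed downstream, the sketch below routes a record according to its `suggestion` value. The field names `suggestion`, `confidence`, and `label` come from the description above; the surrounding record layout, the `image` key, the label string, and the routing outcomes are assumptions.

```python
import json


def route_by_suggestion(record: dict, block_on_review: bool = False) -> str:
    """Map the model's suggestion to an action: 'keep', 'manual_review', or 'drop'."""
    suggestion = record["image_porn"]["suggestion"]
    if suggestion == "pass":
        return "keep"
    if suggestion == "block":
        return "drop"
    if suggestion == "review":
        # The review policy decides whether reviewed files are blocked or queued.
        return "drop" if block_on_review else "manual_review"
    raise ValueError(f"unexpected suggestion: {suggestion!r}")


# Hypothetical annotation lines; real files may be laid out differently.
annotation_lines = [
    '{"image": "1.jpg", "image_porn": {"suggestion": "pass", "confidence": 0, "label": ""}}',
    '{"image": "2.jpg", "image_porn": {"suggestion": "review", "confidence": 0.62, "label": "example_label"}}',
    '{"image": "3.jpg", "image_porn": {"suggestion": "block", "confidence": 0.97, "label": "example_label"}}',
]
actions = {}
for line in annotation_lines:
    rec = json.loads(line)
    actions[rec["image"]] = route_by_suggestion(rec)
```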
Dangerous Situation Image Detection
- Applicable file formats:
tar: All images are saved as a TAR package. The images in the TAR package can be in JPG, JPEG, PNG, or BMP format.
- Parameter description:
Type of content to be labeled: Labels the content of dangerous situation images.
- Parameter configuration example:
- Detection example: The results are stored in the annotation file as the image_danger object.
suggestion: indicates whether the file passes the check. pass indicates that the file passes the check and no problem occurs. review indicates that manual review is required. You can choose to bypass or block the file based on your review policy. block indicates that the file to be reviewed is problematic.
confidence: detection confidence of the model. (Note that the confidence indicates the confidence of the model-provided suggestions.) If suggestion is pass, the value is 0. If suggestion is review or block, the value ranges from 0 to 1.
label: label of the dangerous situation content detected by the model. If no dangerous situation content is detected, the value is empty.
Violent and Terrorism Image Detection
- Applicable file formats:
tar: All images are saved as a TAR package. The images in the TAR package can be in JPG, JPEG, PNG, or BMP format.
- Parameter description:
Type of content to be labeled: Filters out violent and terrorism images.
- Parameter configuration example:
- Detection example: The results are stored in the annotation file as the image_terrorism object.
suggestion: indicates whether the file passes the check. pass indicates that the file passes the check and no problem occurs. review indicates that manual review is required. You can choose to bypass or block the file based on your review policy. block indicates that the file to be reviewed is problematic.
confidence: detection confidence of the model. (Note that the confidence indicates the confidence of the model-provided suggestions.) If suggestion is pass, the value is 0. If suggestion is review or block, the value ranges from 0 to 1.
label: label of the violent and terrorism content detected by the model. If no violent or terrorism content is detected, the value is empty.
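When several detection operators have run on the same image, one possible way to combine their results is to take the strictest suggestion. The sketch below assumes that the `image_porn`, `image_danger`, and `image_terrorism` objects appear side by side in one annotation record, which this document does not confirm; the function name and the label strings are hypothetical.

```python
SEVERITY = {"pass": 0, "review": 1, "block": 2}
DETECTION_KEYS = ("image_porn", "image_danger", "image_terrorism")


def strictest_suggestion(record: dict) -> tuple[str, list[str]]:
    """Return the most severe suggestion and all non-empty labels in the record."""
    worst = "pass"
    labels = []
    for key in DETECTION_KEYS:
        result = record.get(key)
        if result is None:
            continue  # this operator was not run on the record
        if SEVERITY[result["suggestion"]] > SEVERITY[worst]:
            worst = result["suggestion"]
        if result.get("label"):
            labels.append(result["label"])
    return worst, labels


record = {
    "image": "1.jpg",
    "image_porn": {"suggestion": "pass", "confidence": 0, "label": ""},
    "image_danger": {"suggestion": "review", "confidence": 0.7, "label": "example_danger_label"},
    "image_terrorism": {"suggestion": "block", "confidence": 0.9, "label": "example_terrorism_label"},
}
overall, labels = strictest_suggestion(record)
```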
Filtering of Abnormal Characters in Images and Text
- Applicable file formats:
tar+jsonl: All images are saved as a TAR package. The images can be in JPG, JPEG, PNG, or BMP format.
The image text is saved as a JSONL file. The image name in the JSONL file must be the same as that in the TAR package.
- Parameter description:
Type of content to be converted: Replaces abnormal characters in the text data with null values, with data entries unchanged.
- Parameter configuration example:
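The conversion described for this operator can be sketched locally as follows. This is not the platform's parameter configuration or implementation. The document does not define "abnormal characters", so the sketch assumes they are control characters other than newline and tab, private-use, unassigned, and surrogate code points, plus the replacement character U+FFFD. The function names and record keys are hypothetical.

```python
import unicodedata

# Assumed definition of "abnormal": these Unicode general categories plus U+FFFD.
ABNORMAL_CATEGORIES = {"Cc", "Co", "Cn", "Cs"}
ALLOWED_CONTROLS = {"\n", "\t"}


def strip_abnormal(text: str) -> str:
    """Replace each abnormal character with nothing and keep the rest."""
    cleaned = []
    for ch in text:
        if ch in ALLOWED_CONTROLS:
            cleaned.append(ch)
        elif ch == "\ufffd" or unicodedata.category(ch) in ABNORMAL_CATEGORIES:
            continue
        else:
            cleaned.append(ch)
    return "".join(cleaned)


def clean_records(records: list[dict]) -> list[dict]:
    """Clean the text of every record; no record is ever dropped."""
    return [{**r, "text": strip_abnormal(r["text"])} for r in records]


records = [
    {"image": "1.jpg", "text": "cat\x00 on\ufffd sofa"},
    {"image": "2.jpg", "text": "\ue000"},  # becomes empty, but the entry is kept
]
cleaned = clean_records(records)
```

Keeping records whose text becomes empty mirrors the statement that the data entries remain unchanged.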