Format Requirements for Image Datasets

ModelArts Studio supports the creation of image datasets. During the creation, you can import data in various formats. Table 1 lists the format requirements.

**Table 1** Image dataset format requirements
File Content	File Format	File Requirements
Image only	TAR and image directory	Image: JPG, JPEG, PNG, and BMP. TAR: The images in the TAR package can be in JPG, JPEG, PNG, or BMP format. Import from OBS: The size of a single compressed package cannot exceed 50 GB (only .tar packages are supported). The size of a single file cannot exceed 50 GB. The number of files is not limited. Local upload: The size of a single compressed package cannot exceed 10 MB (only .tar packages are supported). The size of a single file cannot exceed 10 MB. A maximum of 100 files are supported.
Image + Caption	Image: TAR; Caption: JSONL	Image: TAR. Multiple TAR packages are supported. The TAR package stores original images. Each image name must be unique, for example, abc.jpg. Image: JPG, JPEG, PNG, and BMP. JSONL: The image description JSONL file is stored in the outermost directory. One TAR package corresponds to one JSONL file. Each line in the file represents one text record. The format is as follows: {"image_name":"Image name (abc.jpg)","tar_name":"TAR package name (1.tar)","caption":"Text description of the image"} Import from OBS: The size of a single compressed package cannot exceed 50 GB (only .tar packages are supported). The size of a single file cannot exceed 50 GB. The number of files is not limited. Local upload: The size of a single compressed package cannot exceed 10 MB (only .tar packages are supported). The size of a single file cannot exceed 10 MB. A maximum of 100 files are supported.
Image + QA Pair	Image: TAR; QA pair: JSONL	Image: TAR. Multiple TAR packages are supported. The TAR package stores original images. Each image name must be unique, for example, abc.jpg. Image: JPG, JPEG, PNG, and BMP. JSONL: The image description JSONL file is stored in the outermost directory. One TAR package corresponds to one JSONL file. Each line in the file represents one text record. The format is as follows: {"image_name":"Image name (abc.jpg)","tar_name":"TAR package name (1.tar)","conversations":[{"question":"Question 1","answer":"Answer 1"},{"question":"Question 2","answer":"Answer 2"}]} Import from OBS: The size of a single compressed package cannot exceed 50 GB (only .tar packages are supported). The size of a single file cannot exceed 50 GB. The number of files is not limited. Local upload: The size of a single compressed package cannot exceed 10 MB (only .tar packages are supported). The size of a single file cannot exceed 10 MB. A maximum of 100 files are supported.
Object detection	PASCAL VOC	The dataset consists of image files and corresponding annotation files. The annotation files must be in PASCAL VOC format. Labeled objects and their annotation files (mapped to the labeled objects) must be in the same directory. For example, if the name of the labeled object file is IMG_2.jpg, the name of the annotation file must be IMG_2.xml. Images can be in JPG, JPEG, PNG, BMP, TIF, or TIFF format. Annotation files must be in XML format. For details, see Specifications of Annotation Files in an Object Detection Dataset. Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited.
Image classification	Image + TXT	The dataset consists of image files and corresponding annotation files. Labeled objects and their annotation files (mapped to the labeled objects) must be in the same directory. Images can be in JPG, JPEG, PNG, BMP, TIF, or TIFF format. Annotation files must be in TXT format. For details, see Description of an Annotation File for an Image Classification Dataset. Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited.
Instance segmentation	Image + XML	The file storage mode must meet the format required by Segment Anything/Instance Segmentation. Supported image formats: JPG, JPEG, PNG, and BMP; Supported annotation file format: XML. Annotations use bounding boxes in the PASCAL VOC format. Annotations and images must have the same name and must be stored in the same folder. For details about annotation files in XML format, see Description of an Annotation File for an Instance Segmentation Dataset. Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited.
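
The JSONL layouts in Table 1 can be sketched in Python as follows. This is only an illustration: the helper names (caption_record, qa_record, to_jsonl) and the sample caption are assumptions made for this example and are not part of ModelArts Studio.

```python
import json


def caption_record(image_name, tar_name, caption):
    """Build one line of an Image + Caption JSONL file."""
    return {"image_name": image_name, "tar_name": tar_name, "caption": caption}


def qa_record(image_name, tar_name, qa_pairs):
    """Build one line of an Image + QA Pair JSONL file.

    qa_pairs is a list of (question, answer) tuples.
    """
    conversations = [{"question": q, "answer": a} for q, a in qa_pairs]
    return {
        "image_name": image_name,
        "tar_name": tar_name,
        "conversations": conversations,
    }


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


# Sample values are hypothetical; abc.jpg and 1.tar follow the names used in Table 1.
caption_lines = to_jsonl([caption_record("abc.jpg", "1.tar", "A dog on a beach")])
qa_lines = to_jsonl([qa_record("abc.jpg", "1.tar", [("Question 1", "Answer 1")])])
```

Writing each record with json.dumps rather than by hand avoids punctuation slips such as a comma in place of a colon between a key and its value.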

Specifications of Annotation Files in an Object Detection Dataset

The following description follows the annotation file format for object detection in Table 1.

The object detection dataset supports annotation files in ModelArts PASCAL VOC 1.0 format.

Labeled objects and their annotation files (in one-to-one relationship with the labeled objects) must be in the same directory. For example, if the name of the labeled object file is IMG_20180919_114745.jpg, the name of the annotation file must be IMG_20180919_114745.xml.

The annotation files must be in PASCAL VOC format, a standardized XML annotation format used for labeling image datasets. A PASCAL VOC file records the image directory, image file name, image size, and object information. For details about the format, see Table 2.

Example of a file uploaded to OBS:

├─dataset-import-example 
│      IMG_20180919_114732.jpg 
│      IMG_20180919_114732.xml 
│      IMG_20180919_114745.jpg 
│      IMG_20180919_114745.xml 
│      IMG_20180919_114945.jpg 
│      IMG_20180919_114945.xml
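
Before uploading, the one-to-one pairing of images and annotation files can be checked locally. The following is a minimal sketch, assuming a flat directory like the one above; find_unpaired_images is a hypothetical helper, not a ModelArts Studio API.

```python
from pathlib import Path

# Image formats listed for object detection datasets in Table 1.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def find_unpaired_images(directory):
    """Return names of images in `directory` that lack a same-named .xml file."""
    directory = Path(directory)
    missing = []
    for path in sorted(directory.iterdir()):
        if path.suffix.lower() in IMAGE_SUFFIXES:
            # IMG_2.jpg must be accompanied by IMG_2.xml in the same directory.
            if not path.with_suffix(".xml").is_file():
                missing.append(path.name)
    return missing
```

An empty list means every image in the directory has a matching annotation file.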

An XML annotation file example is as follows:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<annotation>
    <folder>NA</folder>
    <filename>bike_1_1593531469339.png</filename>
    <source>
        <database>Unknown</database>
    </source>
    <size>
        <width>554</width>
        <height>606</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>Dog</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <occluded>0</occluded>
        <bndbox>
            <xmin>279</xmin>
            <ymin>52</ymin>
            <xmax>474</xmax>
            <ymax>278</ymax>
        </bndbox>
    </object>
    <object>
        <name>Cat</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <occluded>0</occluded>
        <bndbox>
            <xmin>279</xmin>
            <ymin>198</ymin>
            <xmax>456</xmax>
            <ymax>421</ymax>
        </bndbox>
    </object>
</annotation>

**Table 2** PASCAL VOC format description
Field	Mandatory (Yes/No)	Description
folder	Yes	Name of the directory where the image is located
filename	Yes	Name of the labeled file
size	Yes	Image pixel size. width: image width. This parameter is mandatory. height: image height. This parameter is mandatory. depth: number of image channels. This parameter is mandatory.
segmented	Yes	Segmented or not. The value can be 0 or 1. The value 0 means no segmentation, and 1 means segmentation.
object	Yes	Target object information, which includes the category, pose, truncation status, occlusion status, identification difficulty, and bounding box of an object. An image may contain more than one object. name: category of the labeled object. This parameter is mandatory. pose: shooting angle of the labeled object. This parameter is mandatory. truncated: whether the labeled object is truncated. This parameter is mandatory. The value can be 0 or 1. The value 0 indicates that the labeled object is not truncated, and 1 indicates that it is truncated. occluded: whether the labeled object is occluded. This parameter is mandatory. The value can be 0 or 1. The value 0 indicates that the labeled object is not occluded, and 1 indicates that it is occluded. difficult: whether the labeled object is difficult to recognize. This parameter is mandatory. The value can be 0 or 1. The value 0 indicates that the labeled object is easy to recognize, and 1 indicates that it is difficult to recognize. confidence: confidence score of the labeled object. This parameter is optional. The value ranges from 0 to 1. A value closer to 1 indicates a higher level of confidence. bndbox: bounding box type. This parameter is mandatory. For details about the possible values, see Table 3.

**Table 3** Bounding box types
type	Shape	Labeling Information
point	Point	Coordinates of a point <x>100</x> <y>100</y>
line	Line	Coordinates of points <x1>100</x1> <y1>100</y1> <x2>200</x2> <y2>200</y2>
bndbox	Rectangle	Coordinates of the upper left and lower right points <xmin>100</xmin> <ymin>100</ymin> <xmax>200</xmax> <ymax>200</ymax>
polygon	Polygon	Coordinates of points <x1>100</x1> <y1>100</y1> <x2>200</x2> <y2>100</y2> <x3>250</x3> <y3>150</y3> <x4>200</x4> <y4>200</y4> <x5>100</x5> <y5>200</y5> <x6>50</x6> <y6>150</y6>
circle	Circle	Center coordinates and radius <cx>100</cx> <cy>100</cy> <r>50</r>
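
To illustrate how the fields in Table 2 fit together, the following sketch reads the rectangular bounding boxes from an annotation using Python's standard xml.etree.ElementTree module. The function name read_voc_boxes is an assumption for this example, and the sample is a shortened copy of the XML shown earlier; other shapes from Table 3 are skipped.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the annotation example, without the XML declaration.
SAMPLE_XML = """<annotation>
    <folder>NA</folder>
    <filename>bike_1_1593531469339.png</filename>
    <size><width>554</width><height>606</height><depth>3</depth></size>
    <segmented>0</segmented>
    <object>
        <name>Dog</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <occluded>0</occluded>
        <bndbox><xmin>279</xmin><ymin>52</ymin><xmax>474</xmax><ymax>278</ymax></bndbox>
    </object>
</annotation>"""


def read_voc_boxes(xml_text):
    """Return (filename, [(label, xmin, ymin, xmax, ymax), ...]) from VOC XML text."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        if box is None:
            # Objects labeled with point, line, polygon, or circle are skipped here.
            continue
        coords = tuple(
            int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((obj.findtext("name"),) + coords)
    return root.findtext("filename"), boxes


filename, boxes = read_voc_boxes(SAMPLE_XML)
```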

Description of an Annotation File for an Image Classification Dataset

The following description follows the annotation file format for image classification in Table 1.

The image classification dataset supports annotation files in ModelArts image classification 1.0 format.

Labeled objects and their annotation files (in one-to-one relationship with the labeled objects) must be in the same directory. An annotation file in TXT format can contain a single label or multiple labels.

The content of each annotation file is used as the label of the image with the same name.
In the following example, import-dir-1 and import-dir-2 are the imported subdirectories.
```
dataset-import-example 
├─import-dir-1
│      10.jpg
│      10.txt    
│      11.jpg 
│      11.txt
│      12.jpg 
│      12.txt
└─import-dir-2
        1.jpg 
        1.txt
        2.jpg 
        2.txt
```
The following shows an annotation file for a single label, for example, the 1.txt file:
```
Cat
```
The following shows an annotation file for multiple labels, for example, the 2.txt file:
```
Cat
Dog
```
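
Reading such an annotation file can be sketched as follows, assuming one label per line as in the examples above. The helper name parse_labels is hypothetical, and ignoring blank lines and surrounding whitespace is an assumption of this sketch rather than documented behavior.

```python
def parse_labels(txt_content):
    """Return the non-empty, whitespace-stripped lines of an annotation file."""
    return [line.strip() for line in txt_content.splitlines() if line.strip()]


single = parse_labels("Cat\n")        # content of 1.txt
multi = parse_labels("Cat\nDog\n")    # content of 2.txt
```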

Specifications of Annotation Files in an Anomaly Detection Dataset

The following description follows the annotation file format for anomaly detection in Table 1.

The image and annotation files must be stored in the same directory, with the content in the annotation file used as the label of the image (normal or abnormal).
Example:
```
dataset-import-example 
│      IMG_20180919_114732.jpg
│      IMG_20180919_114732.txt    
│      IMG_20180919_114745.jpg 
│      IMG_20180919_114745.txt
```
The following shows an annotation file for the "abnormal" label, for example, the IMG_20180919_114732.txt file:
```
abnormal
```
The following shows an annotation file for the "normal" label, for example, the IMG_20180919_114745.txt file:
```
normal
```

Description of JSON Annotation Files for a Posture Estimation Dataset

The following description follows the annotation file format for posture estimation in Table 1.

Posture estimation datasets are labeled in the open-source COCO human keypoint format. The dataset must contain the annotations, train, and val folders. In the annotations folder, train.json and val.json contain the annotations of the training set and validation set, respectively. The train and val folders store the images. The following is an example:

├─annotations
│      train.json 
│      val.json
├─train
│      IMG_20180919_114745.jpg 
├─val
│      IMG_20180919_114945.jpg

The following is an example of a JSON annotation file:

{
    "images": [
        {
            "license": 2,
            "file_name": "000000000139.jpg",
            "coco_url": "",
            "height": 426,
            "width": 640,
            "date_captured": "2013-11-21 01:34:01",
            "flickr_url": "",
            "id": 139
        }
    ],
    "annotations": [
        {
            "num_keypoints": 15,
            "area": 2913.1104,
            "iscrowd": 0,
            "keypoints": [
                427,
                170,
                1,
                429,
                169,
                2,
                0,
                0,
                0,
                434,
                168,
                2,
                0,
                0,
                0,
                441,
                177,
                2,
                446,
                177,
                2,
                437,
                200,
                2,
                430,
                206,
                2,
                430,
                220,
                2,
                420,
                215,
                2,
                445,
                226,
                2,
                452,
                223,
                2,
                447,
                260,
                2,
                454,
                257,
                2,
                455,
                290,
                2,
                459,
                286,
                2
            ],
            "image_id": 139,
            "bbox": [
                412.8,
                157.61,
                53.05,
                138.01
            ],
            "category_id": 1,
            "id": 230831
        },
    ],
    "categories": [
        {
            "supercategory": "person",
            "id": 1,
            "name": "person",
            "keypoints": [
                "nose",
                "left_eye",
                "right_eye",
                "left_ear",
                "right_ear",
                "left_shoulder",
                "right_shoulder",
                "left_elbow",
                "right_elbow",
                "left_wrist",
                "right_wrist",
                "left_hip",
                "right_hip",
                "left_knee",
                "right_knee",
                "left_ankle",
                "right_ankle"
            ],
            "skeleton": [
                [
                    16,
                    14
                ],
                [
                    14,
                    12
                ],
                [
                    17,
                    15
                ],
                [
                    15,
                    13
                ],
                [
                    12,
                    13
                ],
                [
                    6,
                    12
                ],
                [
                    7,
                    13
                ],
                [
                    6,
                    7
                ],
                [
                    6,
                    8
                ],
                [
                    7,
                    9
                ],
                [
                    8,
                    10
                ],
                [
                    9,
                    11
                ],
                [
                    2,
                    3
                ],
                [
                    1,
                    2
                ],
                [
                    1,
                    3
                ],
                [
                    2,
                    4
                ],
                [
                    3,
                    5
                ],
                [
                    4,
                    6
                ],
                [
                    5,
                    7
                ]
            ]
        }
    ]
}

**Table 4** COCO format description
Field	Mandatory (Yes/No)	Description
images	Yes	Image information.
license	No	License identifier of an image.
file_name	Yes	Image file name.
coco_url	No	URL of an image in the official COCO dataset.
height	Yes	Image height (in pixels).
width	Yes	Image width (in pixels).
date_captured	No	Date and time when an image is captured.
flickr_url	No	URL of an image on the Flickr website.
id	Yes	Unique identifier of an image.
annotations	Yes	Labeling information.
num_keypoints	Yes	Number of labeled key points.
area	Yes	Area of the bounding box, in square pixels.
iscrowd	Yes	Whether the scenario is a complex group scenario (for example, crowded people). The value 0 indicates that the scenario is not a crowded scenario, and the value 1 indicates that the scenario is a crowded scenario.
keypoints	Yes	Coordinates and visibility of labeled key points. All key points are listed in sequence. Each key point is represented by three numbers: [x, y, v]. x and y are pixel coordinates of the key point, and v is visibility (0: invisible and not in the image; 1: invisible but in the image; 2: visible and in the image).
image_id	Yes	ID of the image associated with the annotation. The value must be the same as the value of id in the images field.
bbox	Yes	Bounding box of the target object, represented by [x, y, width, height], where x and y are the coordinates of the upper left corner of the bounding box, and width and height are the width and height of the bounding box.
category_id	Yes	ID of a label category. For human posture estimation, the value is usually 1 (indicating person).
id	Yes	Unique identifier of an annotation.
categories	Yes	Label type information.
supercategory	Yes	Upper-level category of a category, which is usually person.
id	Yes	Unique identifier of a category, usually 1 for human posture estimation.
name	Yes	Name of a category, which is usually person.
keypoints	Yes	List of key point names. Generally, 17 key points are defined in the COCO format, such as nose, left_eye, right_eye, left_ear, right_ear, left_shoulder, right_shoulder, left_elbow, right_elbow, left_wrist, right_wrist, left_hip, right_hip, left_knee, right_knee, left_ankle, and right_ankle.
skeleton	Yes	List of skeleton connections, which are used to indicate the connection relationships between key points. Each connection is represented by a pair of key point indexes, for example, [1, 2], indicating a connection line from a nose (nose) to a left eye (left_eye).
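
The relationships among keypoints, num_keypoints, image_id, and skeleton in Table 4 can be sketched in Python. The function names below are hypothetical, and the check assumes that num_keypoints counts the key points whose visibility v is greater than 0, as in the public COCO convention. The sample annotation is shortened to three key points.

```python
def split_keypoints(flat):
    """Group a flat COCO keypoints list into (x, y, v) triples."""
    if len(flat) % 3 != 0:
        raise ValueError("keypoints length must be a multiple of 3")
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


def count_labeled(triples):
    """Count key points whose visibility flag v is greater than 0."""
    return sum(1 for _, _, v in triples if v > 0)


def check_annotation(annotation, image_ids):
    """Return a list of problems found in one annotation entry."""
    problems = []
    triples = split_keypoints(annotation["keypoints"])
    if count_labeled(triples) != annotation["num_keypoints"]:
        problems.append("num_keypoints does not match labeled key points")
    if annotation["image_id"] not in image_ids:
        problems.append("image_id not found in images")
    return problems


def skeleton_names(skeleton, keypoint_names):
    """Translate 1-based skeleton index pairs into key point name pairs."""
    return [(keypoint_names[a - 1], keypoint_names[b - 1]) for a, b in skeleton]


# Shortened sample: two labeled key points and one unlabeled key point.
annotation = {
    "image_id": 139,
    "num_keypoints": 2,
    "keypoints": [427, 170, 1, 429, 169, 2, 0, 0, 0],
}
problems = check_annotation(annotation, image_ids={139})
links = skeleton_names([[1, 2], [1, 3]], ["nose", "left_eye", "right_eye"])
```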

Description of an Annotation File for an Instance Segmentation Dataset

The following description follows the annotation file format for instance segmentation in Table 1.

Example of a file uploaded to OBS:

├─dataset-import-example 
│      IMG_20180919_114732.jpg 
│      IMG_20180919_114732.xml 
│      IMG_20180919_114745.jpg 
│      IMG_20180919_114745.xml

Example of an XML annotation file:

<annotation>
 <folder>NA</folder>
 <filename>0001.jpg</filename>
 <source>
  <database>Unknown</database>
 </source>
 <size>
  <width>2560</width>
  <height>1440</height>
  <depth>3</depth>
 </size>
 <segmented>1</segmented>
 <mask_source></mask_source>
 <object>
  <name>aggregate</name>
  <pose>Unspecified</pose>
  <truncated>0</truncated>
  <difficult>0</difficult>
  <mask_color>238,130,238</mask_color>
  <occluded>0</occluded>
  <polygon>
   <x1>657.0</x1>
   <y1>357.0</y1>
   <x2>645.0</x2>
   <y2>351.0</y2>
   <x3>624.0</x3>
   <y3>352.0</y3>
   <x4>616.0</x4>
   <y4>353.0</y4>
  </polygon>
 </object>
</annotation>

**Table 5** PASCAL VOC format description
Field	Mandatory (Yes/No)	Description
folder	Yes	Name of the directory where the image is located
filename	Yes	Name of the labeled file
size	Yes	Image pixel size. width: image width. This parameter is mandatory. height: image height. This parameter is mandatory. depth: number of image channels. This parameter is mandatory.
segmented	Yes	Segmented or not. The value can be 0 or 1. The value 0 means no segmentation, and 1 means segmentation.
object	Yes	Target object information, which includes the category, pose, truncation status, occlusion status, identification difficulty, and bounding box of an object. An image may contain more than one object. name: category of the labeled object. This parameter is mandatory. pose: shooting angle of the labeled object. This parameter is mandatory. truncated: whether the labeled object is truncated. This parameter is mandatory. The value can be 0 or 1. The value 0 indicates that the labeled object is not truncated, and 1 indicates that it is truncated. occluded: whether the labeled object is occluded. This parameter is mandatory. The value can be 0 or 1. The value 0 indicates that the labeled object is not occluded, and 1 indicates that it is occluded. difficult: whether the labeled object is difficult to recognize. This parameter is mandatory. The value can be 0 or 1. The value 0 indicates that the labeled object is easy to recognize, and 1 indicates that it is difficult to recognize. confidence: confidence score of the labeled object. This parameter is optional. The value ranges from 0 to 1. A value closer to 1 indicates a higher level of confidence. bndbox: bounding box type. This parameter is mandatory. For details about the possible values, see Table 6.

**Table 6** Bounding box types
type	Shape	Labeling Information
point	Point	Coordinates of a point <x>100</x> <y>100</y>
line	Line	Coordinates of points <x1>100</x1> <y1>100</y1> <x2>200</x2> <y2>200</y2>
bndbox	Rectangle	Coordinates of the upper left and lower right points <xmin>100</xmin> <ymin>100</ymin> <xmax>200</xmax> <ymax>200</ymax>
polygon	Polygon	Coordinates of points <x1>100</x1> <y1>100</y1> <x2>200</x2> <y2>100</y2> <x3>250</x3> <y3>150</y3> <x4>200</x4> <y4>200</y4> <x5>100</x5> <y5>200</y5> <x6>50</x6> <y6>150</y6>
circle	Circle	Center coordinates and radius <cx>100</cx> <cy>100</cy> <r>50</r>
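
The numbered x1/y1, x2/y2, ... elements of a polygon can be collected into a list of points as sketched below. The helper name read_polygon is an assumption for this example, and the sample is a shortened copy of the object in the XML annotation example of this section.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <object> element from the annotation example.
SAMPLE_OBJECT = """<object>
  <name>aggregate</name>
  <polygon>
    <x1>657.0</x1><y1>357.0</y1>
    <x2>645.0</x2><y2>351.0</y2>
    <x3>624.0</x3><y3>352.0</y3>
    <x4>616.0</x4><y4>353.0</y4>
  </polygon>
</object>"""


def read_polygon(object_element):
    """Return [(x1, y1), (x2, y2), ...] from an <object> with a <polygon> child."""
    polygon = object_element.find("polygon")
    points = []
    index = 1
    # Walk x1/y1, x2/y2, ... until the next numbered pair is missing.
    while polygon is not None:
        x = polygon.findtext(f"x{index}")
        y = polygon.findtext(f"y{index}")
        if x is None or y is None:
            break
        points.append((float(x), float(y)))
        index += 1
    return points


points = read_polygon(ET.fromstring(SAMPLE_OBJECT))
```

An object without a polygon child yields an empty list in this sketch.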