Updated on 2025-07-28 GMT+08:00

Format Requirements for Image Datasets

ModelArts Studio supports the creation of image datasets. During the creation, you can import data in various formats. Table 1 lists the format requirements.

Table 1 Image dataset format requirements

File Content

File Format

File Requirements

Image only

TAR and image directory

  • Image: JPG, JPEG, PNG, and BMP

  • TAR: The images in the TAR package can be in JPG, JPEG, PNG, or BMP format.

  • Import from OBS: The size of a single compressed package cannot exceed 50 GB (only .tar packages are supported). The size of a single file cannot exceed 50 GB. The number of files is not limited.

    Local upload: The size of a single compressed package cannot exceed 10 MB (only .tar packages are supported). The size of a single file cannot exceed 10 MB. A maximum of 100 files are supported.

Image + Caption

Image: TAR; Caption: JSONL

  • Image: TAR. Multiple TAR packages are supported. Each TAR package stores the original images, which can be in JPG, JPEG, PNG, or BMP format. Each image name must be unique, for example, abc.jpg.
  • JSONL: The image description JSONL file is stored in the outermost directory. One TAR package corresponds to one JSONL file. Each line of the file is a JSON object that describes one image. The format is as follows:
    {"image_name":"Image name (abc.jpg)","tar_name":"TAR package name (1.tar)","caption":"Text description of the image"}
  • Import from OBS: The size of a single compressed package cannot exceed 50 GB (only .tar packages are supported). The size of a single file cannot exceed 50 GB. The number of files is not limited.

    Local upload: The size of a single compressed package cannot exceed 10 MB (only .tar packages are supported). The size of a single file cannot exceed 10 MB. A maximum of 100 files are supported.

Image + QA Pair

Image: TAR; QA pair: JSONL

  • Image: TAR. Multiple TAR packages are supported. Each TAR package stores the original images, which can be in JPG, JPEG, PNG, or BMP format. Each image name must be unique, for example, abc.jpg.
  • JSONL: The QA pair JSONL file is stored in the outermost directory. One TAR package corresponds to one JSONL file. Each line of the file is a JSON object that holds the QA pairs of one image. The format is as follows:
    {"image_name":"Image name (abc.jpg)","tar_name":"TAR package name (1.tar)","conversations":[{"question":"Question 1","answer":"Answer 1"},{"question":"Question 2","answer":"Answer 2"}]}
  • Import from OBS: The size of a single compressed package cannot exceed 50 GB (only .tar packages are supported). The size of a single file cannot exceed 50 GB. The number of files is not limited.

    Local upload: The size of a single compressed package cannot exceed 10 MB (only .tar packages are supported). The size of a single file cannot exceed 10 MB. A maximum of 100 files are supported.

Object detection

PASCAL VOC

  • The dataset consists of image files and corresponding annotation files. The annotation files must be in PASCAL VOC format. Labeled objects and their annotation files (mapped to the labeled objects) must be in the same directory. For example, if the name of the labeled object file is IMG_2.jpg, the name of the annotation file must be IMG_2.xml.
  • Images can be in JPG, JPEG, PNG, BMP, TIF, or TIFF format. Annotation files must be in XML format. For details, see Specifications of Annotation Files in an Object Detection Dataset.
  • Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited.

Image classification

Image + TXT

  • The dataset consists of image files and corresponding annotation files. Labeled objects and their annotation files (mapped to the labeled objects) must be in the same directory.
  • Images can be in JPG, JPEG, PNG, BMP, TIF, or TIFF format. Annotation files must be in TXT format. For details, see Description of an Annotation File for an Image Classification Dataset.
  • Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited.

Instance segmentation

Image + XML

  • The file storage mode must meet the format required by Segment Anything/Instance Segmentation.
  • Supported image formats: JPG, JPEG, PNG, and BMP; Supported annotation file format: XML.
  • Annotations use bounding boxes in the PASCAL VOC format. Annotations and images must have the same name and must be stored in the same folder.
  • For details about annotation files in XML format, see Description of an Annotation File for an Instance Segmentation Dataset.
  • Import from OBS: The size of a single file cannot exceed 50 GB, and the number of files is not limited.
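
As an illustration of the two JSONL layouts in Table 1, the following Python sketch checks a single line of a caption or QA pair file before upload. The function name check_record and the exact rule set (for example, requiring a non-empty conversations list) are assumptions made for this example and are not part of ModelArts Studio.

```python
import json

# Image formats that Table 1 lists for images inside a TAR package.
IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".bmp")


def check_record(line: str) -> list[str]:
    """Return the problems found in one JSONL line (empty list if none)."""
    try:
        rec = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(rec, dict):
        return ["record is not a JSON object"]

    problems = []
    image_name = rec.get("image_name")
    if not isinstance(image_name, str) or not image_name.lower().endswith(IMAGE_EXTS):
        problems.append("image_name must end in .jpg, .jpeg, .png, or .bmp")
    tar_name = rec.get("tar_name")
    if not isinstance(tar_name, str) or not tar_name.lower().endswith(".tar"):
        problems.append("tar_name must end in .tar")

    if "conversations" in rec:  # Image + QA Pair layout
        convs = rec["conversations"]
        # Assumption: an empty conversations list is treated as an error.
        if not isinstance(convs, list) or not convs:
            problems.append("conversations must be a non-empty list")
        else:
            for i, turn in enumerate(convs):
                if not isinstance(turn, dict) or not {"question", "answer"} <= turn.keys():
                    problems.append(f"conversations[{i}] needs question and answer")
    elif not isinstance(rec.get("caption"), str):  # Image + Caption layout
        problems.append("caption must be a string")
    return problems
```

Running check_record on every line of a JSONL file and collecting the non-empty results gives a quick pre-upload report. A line with a malformed key-value pair, such as "answer","Answer 2", is reported as invalid JSON.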

Specifications of Annotation Files in an Object Detection Dataset

The following description follows the annotation file format for object detection in Table 1.

The object detection dataset supports annotation files in ModelArts PASCAL VOC 1.0 format.

Labeled objects and their annotation files (in one-to-one relationship with the labeled objects) must be in the same directory. For example, if the name of the labeled object file is IMG_20180919_114745.jpg, the name of the annotation file must be IMG_20180919_114745.xml.

The annotation files must be in PASCAL VOC format, a standardized XML annotation format used for labeling image datasets. A PASCAL VOC file contains the image directory, image file name, image size, and object information. For details about the format, see Table 2.

Example of a file uploaded to OBS:

├─dataset-import-example 
│      IMG_20180919_114732.jpg 
│      IMG_20180919_114732.xml 
│      IMG_20180919_114745.jpg 
│      IMG_20180919_114745.xml 
│      IMG_20180919_114945.jpg 
│      IMG_20180919_114945.xml
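
The one-to-one naming rule can be checked before upload. The following Python sketch is one possible way to do that, assuming the file names of a single directory are already available as a list; the helper name find_unpaired is hypothetical.

```python
from pathlib import Path

# Image formats that Table 1 lists for object detection datasets.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def find_unpaired(names: list[str]) -> tuple[list[str], list[str]]:
    """Return (image stems without an .xml file, .xml stems without an image)."""
    image_stems = {Path(n).stem for n in names if Path(n).suffix.lower() in IMAGE_EXTS}
    xml_stems = {Path(n).stem for n in names if Path(n).suffix.lower() == ".xml"}
    return sorted(image_stems - xml_stems), sorted(xml_stems - image_stems)
```

For a local copy of the directory, the list of names could be built with [p.name for p in Path("dataset-import-example").iterdir()]. Two empty lists mean that every image has an annotation file with the same base name and vice versa.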

An XML annotation file example is as follows:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<annotation>
    <folder>NA</folder>
    <filename>bike_1_1593531469339.png</filename>
    <source>
        <database>Unknown</database>
    </source>
    <size>
        <width>554</width>
        <height>606</height>
        <depth>3</depth>
    </size>
    <segmented>0</segmented>
    <object>
        <name>Dog</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <occluded>0</occluded>
        <bndbox>
            <xmin>279</xmin>
            <ymin>52</ymin>
            <xmax>474</xmax>
            <ymax>278</ymax>
        </bndbox>
    </object>
    <object>
        <name>Cat</name>
        <pose>Unspecified</pose>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <occluded>0</occluded>
        <bndbox>
            <xmin>279</xmin>
            <ymin>198</ymin>
            <xmax>456</xmax>
            <ymax>421</ymax>
        </bndbox>
    </object>
</annotation>
Table 2 PASCAL VOC format description

Field

Mandatory (Yes/No)

Description

folder

Yes

Name of the directory where the image is located

filename

Yes

Name of the labeled file

size

Yes

Image size, in pixels

  • width: image width. This parameter is mandatory.
  • height: image height. This parameter is mandatory.
  • depth: number of image channels. This parameter is mandatory.

segmented

Yes

Segmented or not. The value can be 0 or 1. The value 0 means no segmentation, and 1 means segmentation.

object

Yes

Target object information, which includes the category, pose, truncation status, identification difficulty, and bounding box of an object. An image may contain more than one object.

  • name: type of the labeled object. This parameter is mandatory.
  • pose: shooting angle of the labeled object. This parameter is mandatory.
  • truncated: This parameter is mandatory. The value can be 0 or 1. The value 0 indicates that the labeled object is not truncated, and 1 indicates that it is truncated.
  • occluded: This parameter is mandatory. The value can be 0 or 1. The value 0 indicates that the labeled object is not occluded, and 1 indicates that it is occluded.
  • difficult: This parameter is mandatory. The value can be 0 or 1. The value 0 indicates that the labeled object is easy to recognize, and 1 indicates the opposite.
  • confidence: This parameter is optional. The value ranges from 0 to 1. A value closer to 1 indicates a higher level of confidence.
  • bndbox: bounding box type. This parameter is mandatory. For details about the possible values, see Table 3.
Table 3 Bounding box types

type

Shape

Labeling Information

point

Point

Coordinates of a point

<x>100</x>

<y>100</y>

line

Line

Coordinates of points

<x1>100</x1>

<y1>100</y1>

<x2>200</x2>

<y2>200</y2>

bndbox

Rectangle

Coordinates of the upper left and lower right points

<xmin>100</xmin>

<ymin>100</ymin>

<xmax>200</xmax>

<ymax>200</ymax>

polygon

Polygon

Coordinates of points

<x1>100</x1>

<y1>100</y1>

<x2>200</x2>

<y2>100</y2>

<x3>250</x3>

<y3>150</y3>

<x4>200</x4>

<y4>200</y4>

<x5>100</x5>

<y5>200</y5>

<x6>50</x6>

<y6>150</y6>

circle

Circle

Center coordinates and radius

<cx>100</cx>

<cy>100</cy>

<r>50</r>
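
To show how the fields in Table 2 and the rectangle type in Table 3 fit together, the following Python sketch reads the label and rectangle of each object from an annotation file with the standard library. The function name read_voc_objects is hypothetical, and the sketch deliberately skips objects that use the point, line, polygon, or circle shapes.

```python
import xml.etree.ElementTree as ET


def read_voc_objects(xml_text: str) -> list[dict]:
    """Extract the name and rectangle of every <object> that has a <bndbox>."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        if box is None:  # other shape types are ignored in this sketch
            continue
        objects.append({
            "name": obj.findtext("name"),
            # Coordinates are (xmin, ymin, xmax, ymax): upper left and lower right points.
            "box": tuple(int(float(box.findtext(tag)))
                         for tag in ("xmin", "ymin", "xmax", "ymax")),
        })
    return objects
```

Applied to the XML example in this section, the function returns one entry for Dog and one for Cat, each with its four rectangle coordinates.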

Description of an Annotation File for an Image Classification Dataset

The following description follows the annotation file format for image classification in Table 1.

The image classification dataset supports annotation files in ModelArts image classification 1.0 format.

Labeled objects and their annotation files (in one-to-one relationship with the labeled objects) must be in the same directory. An annotation file in TXT format can contain a single label or multiple labels.

  • The image and annotation files must be stored in the same directory, with the content in the annotation file used as the label of the image.

    In the following example, import-dir-1 and import-dir-2 are the imported subdirectories.

    dataset-import-example 
    ├─import-dir-1
    │      10.jpg
    │      10.txt    
    │      11.jpg 
    │      11.txt
    │      12.jpg 
    │      12.txt
    └─import-dir-2
            1.jpg 
            1.txt
            2.jpg 
            2.txt

    The following shows an annotation file for a single label, for example, the 1.txt file:

    Cat

    The following shows an annotation file for multiple labels, for example, the 2.txt file:

    Cat
    Dog
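
Reading such an annotation file takes only a few lines. The following Python sketch assumes one label per line, as in the 1.txt and 2.txt examples above; the function name parse_labels is hypothetical.

```python
def parse_labels(text: str) -> list[str]:
    """Split the content of a TXT annotation file into labels, one per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```

A file with a single label yields a one-element list, and a file with several labels yields them in file order.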

Specifications of Annotation Files in an Anomaly Detection Dataset

The following description follows the annotation file format for anomaly detection in Table 1.

The labeling files and images must be stored in the same folder.

  • The image and annotation files must be stored in the same directory, with the content in the annotation file used as the label of the image (normal or abnormal).

    Example:

    dataset-import-example 
    │      IMG_20180919_114732.jpg
    │      IMG_20180919_114732.txt    
    │      IMG_20180919_114745.jpg 
    │      IMG_20180919_114745.txt

    The following shows an annotation file for the "abnormal" label, for example, the IMG_20180919_114732.txt file:

    abnormal

    The following shows an annotation file for the "normal" label, for example, the IMG_20180919_114745.txt file:

    normal

Description of JSON Annotation Files for a Posture Estimation Dataset

The following description follows the annotation file format for posture estimation in Table 1.

Posture estimation datasets are labeled in the open-source COCO human keypoint annotation format. The dataset must include the annotations, train, and val folders. In the annotations folder, train.json and val.json contain the annotations of the training set and validation set, respectively. The train and val folders store the images. The following is an example:

├─annotations
│      train.json 
│      val.json
├─train
│      IMG_20180919_114745.jpg 
├─val
│      IMG_20180919_114945.jpg 

The following is an example of a JSON annotation file:

{
    "images": [
        {
            "license": 2,
            "file_name": "000000000139.jpg",
            "coco_url": "",
            "height": 426,
            "width": 640,
            "date_captured": "2013-11-21 01:34:01",
            "flickr_url": "",
            "id": 139
        }
    ],
    "annotations": [
        {
            "num_keypoints": 15,
            "area": 2913.1104,
            "iscrowd": 0,
            "keypoints": [
                427,
                170,
                1,
                429,
                169,
                2,
                0,
                0,
                0,
                434,
                168,
                2,
                0,
                0,
                0,
                441,
                177,
                2,
                446,
                177,
                2,
                437,
                200,
                2,
                430,
                206,
                2,
                430,
                220,
                2,
                420,
                215,
                2,
                445,
                226,
                2,
                452,
                223,
                2,
                447,
                260,
                2,
                454,
                257,
                2,
                455,
                290,
                2,
                459,
                286,
                2
            ],
            "image_id": 139,
            "bbox": [
                412.8,
                157.61,
                53.05,
                138.01
            ],
            "category_id": 1,
            "id": 230831
        }
    ],
    "categories": [
        {
            "supercategory": "person",
            "id": 1,
            "name": "person",
            "keypoints": [
                "nose",
                "left_eye",
                "right_eye",
                "left_ear",
                "right_ear",
                "left_shoulder",
                "right_shoulder",
                "left_elbow",
                "right_elbow",
                "left_wrist",
                "right_wrist",
                "left_hip",
                "right_hip",
                "left_knee",
                "right_knee",
                "left_ankle",
                "right_ankle"
            ],
            "skeleton": [
                [
                    16,
                    14
                ],
                [
                    14,
                    12
                ],
                [
                    17,
                    15
                ],
                [
                    15,
                    13
                ],
                [
                    12,
                    13
                ],
                [
                    6,
                    12
                ],
                [
                    7,
                    13
                ],
                [
                    6,
                    7
                ],
                [
                    6,
                    8
                ],
                [
                    7,
                    9
                ],
                [
                    8,
                    10
                ],
                [
                    9,
                    11
                ],
                [
                    2,
                    3
                ],
                [
                    1,
                    2
                ],
                [
                    1,
                    3
                ],
                [
                    2,
                    4
                ],
                [
                    3,
                    5
                ],
                [
                    4,
                    6
                ],
                [
                    5,
                    7
                ]
            ]
        }
    ]
}
Table 4 COCO format description

Field

Mandatory (Yes/No)

Description

images

Yes

Image information.

license

No

License identifier of an image.

file_name

Yes

Image file name.

coco_url

No

URL of an image in the official COCO dataset.

height

Yes

Image height (in pixels).

width

Yes

Image width (in pixels).

date_captured

No

Date and time when an image is captured.

flickr_url

No

URL of an image on the Flickr website.

id

Yes

Unique identifier of an image.

annotations

Yes

Labeling information.

num_keypoints

Yes

Number of labeled key points.

area

Yes

Area of the bounding box, in square pixels.

iscrowd

Yes

Whether the annotation covers a crowded scene (for example, a dense group of people). The value 0 indicates a non-crowded scene, and the value 1 indicates a crowded scene.

keypoints

Yes

Coordinates and visibility of labeled key points. All key points are listed in sequence. Each key point is represented by three numbers: [x, y, v]. x and y are pixel coordinates of the key point, and v is visibility (0: invisible and not in the image; 1: invisible but in the image; 2: visible and in the image).

image_id

Yes

ID of the image associated with the annotation. The value must be the same as the value of id in the images field.

bbox

Yes

Bounding box of the target object, represented by [x, y, width, height], where x and y are the coordinates of the upper left corner of the bounding box, and width and height are the width and height of the bounding box.

category_id

Yes

ID of a label category. For human posture estimation, the value is usually 1 (indicating person).

id

Yes

Unique identifier of an annotation.

categories

Yes

Label type information.

supercategory

Yes

Upper-level category of a category, which is usually person.

id

Yes

Unique identifier of a category, usually 1 for human posture estimation.

name

Yes

Name of a category, which is usually person.

keypoints

Yes

List of key point names. Generally, 17 key points are defined in the COCO format, such as nose, left_eye, right_eye, left_ear, right_ear, left_shoulder, right_shoulder, left_elbow, right_elbow, left_wrist, right_wrist, left_hip, right_hip, left_knee, right_knee, left_ankle, and right_ankle.

skeleton

Yes

List of skeleton connections, which indicate how key points are connected. Each connection is a pair of key point indexes that count from 1 in the order of the keypoints list. For example, [1, 2] indicates a line from the nose (nose) to the left eye (left_eye).
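
Several fields in Table 4 constrain one another, so a basic consistency check is possible before upload. The following Python sketch checks one annotation against its category. It follows the COCO convention that key points with a visibility value greater than 0 count as labeled; the function name check_keypoint_annotation and the selection of checks are assumptions made for this example.

```python
def check_keypoint_annotation(ann: dict, category: dict) -> list[str]:
    """Check one COCO keypoint annotation against its category definition."""
    problems = []
    n_names = len(category["keypoints"])
    flat = ann["keypoints"]

    # Each key point takes three values: x, y, and visibility v.
    if len(flat) != 3 * n_names:
        problems.append(f"expected {3 * n_names} keypoint values, got {len(flat)}")

    # Visibility is every third value; v > 0 means the key point is labeled.
    labeled = sum(1 for v in flat[2::3] if v > 0)
    if labeled != ann["num_keypoints"]:
        problems.append(
            f"num_keypoints is {ann['num_keypoints']}, but {labeled} key points are labeled"
        )

    # Skeleton pairs use indexes that count from 1.
    for a, b in category.get("skeleton", []):
        if not (1 <= a <= n_names and 1 <= b <= n_names):
            problems.append(f"skeleton pair [{a}, {b}] is out of range")
    return problems
```

For the JSON example in this section, the check passes: the keypoints list holds 51 values for 17 named key points, and 15 of them have a visibility value greater than 0, which matches num_keypoints.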

Description of an Annotation File for an Instance Segmentation Dataset

The following description follows the annotation file format for instance segmentation in Table 1.

Labeled objects and their annotation files (in one-to-one relationship with the labeled objects) must be in the same directory. For example, if the name of the labeled object file is IMG_20180919_114745.jpg, the name of the annotation file must be IMG_20180919_114745.xml.

The annotation files must be in PASCAL VOC format, a standardized XML annotation format used for labeling image datasets. A PASCAL VOC file contains the image directory, image file name, image size, and object information. For details about the format, see Table 5.

Example of a file uploaded to OBS:

├─dataset-import-example 
│      IMG_20180919_114732.jpg 
│      IMG_20180919_114732.xml 
│      IMG_20180919_114745.jpg 
│      IMG_20180919_114745.xml 

Example of an XML annotation file:

<annotation>
 <folder>NA</folder>
 <filename>0001.jpg</filename>
 <source>
  <database>Unknown</database>
 </source>
 <size>
  <width>2560</width>
  <height>1440</height>
  <depth>3</depth>
 </size>
 <segmented>1</segmented>
 <mask_source></mask_source>
 <object>
  <name>aggregate</name>
  <pose>Unspecified</pose>
  <truncated>0</truncated>
  <difficult>0</difficult>
  <mask_color>238,130,238</mask_color>
  <occluded>0</occluded>
  <polygon>
   <x1>657.0</x1>
   <y1>357.0</y1>
   <x2>645.0</x2>
   <y2>351.0</y2>
   <x3>624.0</x3>
   <y3>352.0</y3>
   <x4>616.0</x4>
   <y4>353.0</y4>
  </polygon>
 </object>
</annotation>
Table 5 PASCAL VOC format description

Field

Mandatory (Yes/No)

Description

folder

Yes

Name of the directory where the image is located

filename

Yes

Name of the labeled file

size

Yes

Image size, in pixels

  • width: image width. This parameter is mandatory.
  • height: image height. This parameter is mandatory.
  • depth: number of image channels. This parameter is mandatory.

segmented

Yes

Segmented or not. The value can be 0 or 1. The value 0 means no segmentation, and 1 means segmentation.

object

Yes

Target object information, which includes the category, pose, truncation status, identification difficulty, and bounding box of an object. An image may contain more than one object.

  • name: type of the labeled object. This parameter is mandatory.
  • pose: shooting angle of the labeled object. This parameter is mandatory.
  • truncated: This parameter is mandatory. The value can be 0 or 1. The value 0 indicates that the labeled object is not truncated, and 1 indicates that it is truncated.
  • occluded: This parameter is mandatory. The value can be 0 or 1. The value 0 indicates that the labeled object is not occluded, and 1 indicates that it is occluded.
  • difficult: This parameter is mandatory. The value can be 0 or 1. The value 0 indicates that the labeled object is easy to recognize, and 1 indicates the opposite.
  • confidence: This parameter is optional. The value ranges from 0 to 1. A value closer to 1 indicates a higher level of confidence.
  • bndbox: bounding box type. This parameter is mandatory. For details about the possible values, see Table 6.
Table 6 Bounding box types

type

Shape

Labeling Information

point

Point

Coordinates of a point

<x>100</x>

<y>100</y>

line

Line

Coordinates of points

<x1>100</x1>

<y1>100</y1>

<x2>200</x2>

<y2>200</y2>

bndbox

Rectangle

Coordinates of the upper left and lower right points

<xmin>100</xmin>

<ymin>100</ymin>

<xmax>200</xmax>

<ymax>200</ymax>

polygon

Polygon

Coordinates of points

<x1>100</x1>

<y1>100</y1>

<x2>200</x2>

<y2>100</y2>

<x3>250</x3>

<y3>150</y3>

<x4>200</x4>

<y4>200</y4>

<x5>100</x5>

<y5>200</y5>

<x6>50</x6>

<y6>150</y6>

circle

Circle

Center coordinates and radius

<cx>100</cx>

<cy>100</cy>

<r>50</r>
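
The polygon type stores its vertices as numbered elements (x1, y1, x2, y2, and so on), as in the instance segmentation XML example. The following Python sketch collects them into a list of points. The function name read_polygon is hypothetical, and the sketch assumes that the numbering starts at 1 and has no gaps.

```python
import xml.etree.ElementTree as ET


def read_polygon(obj: ET.Element) -> list[tuple[float, float]]:
    """Collect (x, y) vertices from <x1>/<y1>, <x2>/<y2>, ... under <polygon>."""
    polygon = obj.find("polygon")
    points: list[tuple[float, float]] = []
    if polygon is None:  # the object uses another shape type
        return points
    i = 1
    # Stop at the first index for which either coordinate is missing.
    while polygon.find(f"x{i}") is not None and polygon.find(f"y{i}") is not None:
        points.append((float(polygon.findtext(f"x{i}")), float(polygon.findtext(f"y{i}"))))
        i += 1
    return points
```

A typical call iterates over the object elements of a parsed annotation, for example for obj in ET.parse(path).getroot().iter("object"), where path is the location of a local annotation file.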