Updated on 2024-04-30 GMT+08:00

Data Feature Analysis

Images or target bounding boxes are analyzed based on image features, such as blur and brightness, and the results are plotted as curves to help you process datasets.

You can also select multiple versions of a dataset to view their curves for comparison and analysis.

Background

  • Data feature analysis is only available for image datasets of the image classification and object detection types.
  • Data feature analysis is available only for published dataset versions in Default format.
  • The data scope for feature analysis varies with the dataset type.
    • For an object detection dataset, if the number of labeled samples is 0, the View Data Feature tab page is unavailable after a version is published and no data features are displayed. After images are labeled and a version is published, the data features of the labeled images are displayed.
    • For an image classification dataset, if the number of labeled samples is 0, the View Data Feature tab page is unavailable after a version is published and no data features are displayed. After images are labeled and a version is published, the data features of all images are displayed.
  • The analysis result is valid only when the number of images in a dataset reaches a certain level. Generally, more than 1,000 images are required.
  • Image classification supports the following data feature metrics: Resolution, Aspect Ratio, Brightness, Saturation, Blur Score, and Colorfulness. Object detection supports all data feature metrics. Supported Data Feature Metrics lists all metrics supported by ModelArts.
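As a rough illustration of what the image-classification metrics measure, the sketch below computes resolution, aspect ratio, brightness, and saturation for an RGB image with NumPy. The formulas (BT.601 luminance, HSV-style saturation) are common approximations assumed here, not ModelArts' documented internals.

```python
import numpy as np

def image_metrics(img):
    """Approximate per-image feature metrics for an RGB uint8 array (H, W, 3).

    The formulas are illustrative approximations, not the exact ones
    ModelArts uses internally.
    """
    h, w = img.shape[:2]
    r = img[..., 0].astype(float)
    g = img[..., 1].astype(float)
    b = img[..., 2].astype(float)
    # Brightness: mean luminance with ITU-R BT.601 weights.
    brightness = (0.299 * r + 0.587 * g + 0.114 * b).mean()
    # Saturation: mean of the HSV S channel, S = (max - min) / max.
    mx = img.astype(float).max(axis=-1)
    mn = img.astype(float).min(axis=-1)
    saturation = np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-9), 0.0).mean()
    return {
        "resolution": h * w,      # area is used as the statistical value
        "aspect_ratio": w / h,
        "brightness": brightness,
        "saturation": saturation,
    }

# A mid-gray 200x100 image: zero saturation, brightness 128.
gray = np.full((100, 200, 3), 128, dtype=np.uint8)
m = image_metrics(gray)
```

Running a function like this over every image in a dataset and histogramming the values reproduces the kind of distribution curves the console draws.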

Data Feature Analysis

  1. Log in to the ModelArts management console. In the navigation pane, choose Data Management > Datasets.
  2. Locate the target dataset, click More in the Operation column, and select View Data Feature. The View Data Feature tab of the dataset is displayed.

    You can also click a dataset name to go to the dataset page and click the View Data Feature tab.

  3. By default, feature analysis is not started for published datasets. You need to manually start a feature analysis task for each dataset version. On the View Data Feature tab, click Analyze Features.
  4. In the dialog box that is displayed, configure the dataset version for feature analysis and click Yes to start analysis.
    Version: Select a published version of the dataset.
    Figure 1 Starting a data feature analysis task
  5. After a data feature analysis task starts, it takes a certain period of time to complete, depending on the data volume. When the selected version appears in the Version drop-down list and can be selected, the analysis is complete.
  6. View the data feature analysis result.

    Version: Select the versions to be compared from the drop-down list. You can also select only one version.

    Type: Select the type to be analyzed. The value can be all, train, eval, or inference.

    Data Feature Metric: Select metrics to be displayed from the drop-down list. For details, see Supported Data Feature Metrics.

    Then, the selected version and metrics are displayed on the page. The displayed chart helps you understand data distribution for better data processing.

  7. View historical records of the analysis task.

    After data feature analysis is complete, you can click Task History on the right of the View Data Feature tab page to view historical analysis tasks and their statuses in the displayed dialog box.

Supported Data Feature Metrics

Table 1 Data feature metrics

For each metric, the table lists a description of what is measured and an explanation of how to interpret the result.

Resolution

  Description: Image resolution. The image area (width × height) is used as the statistical value.

  Explanation: Use the analysis result to check for outliers. If an outlier exists, you can resize the image or delete it.

Aspect Ratio

  Description: The proportional relationship between an image's width and its height.

  Explanation: The chart of this metric generally follows a normal distribution and is typically used to compare the difference between the training set and the dataset used in the real scenario.

Brightness

  Description: The perceived luminance of an image. A larger value indicates a brighter image.

  Explanation: The chart of this metric generally follows a normal distribution. You can determine whether the overall dataset is bright or dark from the distribution center and adjust the brightness based on your application scenario. For example, for a nighttime scenario, the brightness should be lower.

Saturation

  Description: Color saturation of an image. A larger value indicates that colors in the image are easier to distinguish.

  Explanation: The chart of this metric generally follows a normal distribution and is typically used to compare the difference between the training set and the dataset used in the real scenario.

Blur Score (Clarity)

  Description: Image clarity, calculated using the Laplace operator. A larger value indicates sharper edges and higher clarity.

  Explanation: Determine whether the clarity meets your requirements based on the application scenario. For example, data collected from HD cameras requires higher clarity. You can sharpen or blur the dataset and add noise to adjust the clarity.
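The description says clarity is calculated using the Laplace operator. A common concrete realization, assumed here (the exact kernel ModelArts uses is not documented in this page), is the variance of a discrete Laplacian response:

```python
import numpy as np

def blur_score(gray):
    """Variance of the Laplacian response of a grayscale float array.

    Higher values indicate sharper edges. The 4-neighbor Laplacian kernel
    is one common choice; it is an assumption here.
    """
    g = gray.astype(float)
    # 4-neighbor discrete Laplacian evaluated on the interior pixels.
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return lap.var()

# A flat image has no edges, so its score is 0; a hard edge scores higher.
flat = np.full((32, 32), 7.0)
sharp = np.zeros((32, 32))
sharp[:, 16:] = 255.0  # hard vertical edge
```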

Colorfulness

  Description: Horizontal coordinate: colorfulness of an image; a larger value indicates richer colors. Vertical coordinate: number of images.

  Explanation: Perceived colorfulness, typically used to compare the difference between the training set and the dataset used in the real scenario.
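The page does not name a formula for colorfulness; a widely used definition, assumed here, is the Hasler-Süsstrunk metric built from the opponent channels rg = R - G and yb = (R + G)/2 - B:

```python
import numpy as np

def colorfulness(img):
    """Hasler-Süsstrunk colorfulness of an RGB array (H, W, 3).

    Larger values mean richer colors; a pure grayscale image scores 0.
    This formula is an assumption, not ModelArts' documented internal one.
    """
    r = img[..., 0].astype(float)
    g = img[..., 1].astype(float)
    b = img[..., 2].astype(float)
    rg = r - g                    # red-green opponent channel
    yb = 0.5 * (r + g) - b        # yellow-blue opponent channel
    std = np.hypot(rg.std(), yb.std())
    mean = np.hypot(rg.mean(), yb.mean())
    return std + 0.3 * mean

gray = np.full((16, 16, 3), 90, dtype=np.uint8)     # colorless
red = np.zeros((16, 16, 3), dtype=np.uint8)
red[..., 0] = 255                                   # strongly colored
```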

Bounding Box Number

  Description: Horizontal coordinate: number of bounding boxes in an image. Vertical coordinate: number of images.

  Explanation: It is difficult for a model to detect a large number of bounding boxes in a single image, so more images containing many bounding boxes are required for training.

Std of Bounding Boxes Area Per Image

  Description: Horizontal coordinate: standard deviation of the bounding box areas in an image. If an image has only one bounding box, the standard deviation is 0; a larger value indicates greater variation in bounding box size within the image. Vertical coordinate: number of images.

  Explanation: It is difficult for a model to detect many bounding boxes of widely varying sizes in a single image. Add training data for such scenarios, or delete the data if these scenarios do not occur in practice.
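This per-image statistic follows directly from the definition above. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (an assumed convention):

```python
import numpy as np

def bbox_area_std(boxes):
    """Standard deviation of bounding-box areas within one image.

    boxes: list of (x1, y1, x2, y2) tuples. A single box yields 0,
    matching the metric description; larger values mean more size
    variation among the boxes in the image.
    """
    areas = np.array([(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes],
                     dtype=float)
    return areas.std()
```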

Aspect Ratio of Bounding Boxes

  Description: Horizontal coordinate: aspect ratio of the target bounding boxes. Vertical coordinate: number of bounding boxes across all images.

  Explanation: The chart of this metric generally follows a Poisson distribution and is closely related to the application scenario. It is mainly used to compare differences between the training set and the validation set. For example, if bounding boxes in the training set are rectangular while those in the validation set are close to square, detection results will be significantly affected.

Area Ratio of Bounding Boxes

  Description: Horizontal coordinate: ratio of a bounding box area to the entire image area; a larger value indicates that the object occupies more of the image. Vertical coordinate: number of bounding boxes across all images.

  Explanation: Use this metric to determine the distribution of anchors used in the model. If the target bounding boxes are large, set the anchors to large values.
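The ratio follows directly from the definition. A minimal sketch, assuming (x1, y1, x2, y2) pixel coordinates:

```python
def bbox_area_ratio(box, img_w, img_h):
    """Ratio of a bounding-box area to the whole image area, in [0, 1].

    box: (x1, y1, x2, y2) corner coordinates (an assumed convention).
    """
    x1, y1, x2, y2 = box
    return ((x2 - x1) * (y2 - y1)) / (img_w * img_h)
```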

Marginalization Value of Bounding Boxes

  Description: Horizontal coordinate: marginalization degree, that is, the distance between the center of a target bounding box and the center of the image, divided by the total distance of the image; a larger value indicates that the object is closer to the edge. (The total distance is measured from the image center, along the ray through the bounding box center, to the point where that ray crosses the image border.) Vertical coordinate: number of bounding boxes across all images.

  Explanation: The chart of this metric generally follows a normal distribution. Use it to determine whether objects sit at the edge of images. If part of an object lies at the image edge, you can add more data or leave that object unlabeled.
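The geometric definition above can be sketched as follows; the (x1, y1, x2, y2) box convention is an assumption:

```python
def marginalization(box, img_w, img_h):
    """Marginalization degree of a bounding box, per the definition above.

    Distance from the image center to the box center, divided by the
    distance from the image center to the image border along the same ray.
    0 means the box is centered; values near 1 mean it sits at the edge.
    """
    x1, y1, x2, y2 = box
    # Box-center offset from the image center.
    cx = (x1 + x2) / 2 - img_w / 2
    cy = (y1 + y2) / 2 - img_h / 2
    if cx == 0 and cy == 0:
        return 0.0
    # Scale factor t that stretches (cx, cy) to the first border it hits:
    # the ray meets either a vertical or a horizontal border first.
    tx = (img_w / 2) / abs(cx) if cx else float("inf")
    ty = (img_h / 2) / abs(cy) if cy else float("inf")
    t = min(tx, ty)
    # |offset| / (t * |offset|) simplifies to 1 / t.
    return 1.0 / t
```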

Overlap Score of Bounding Boxes

  Description: Horizontal coordinate: overlap degree, that is, the fraction of a single bounding box that is overlapped by other bounding boxes; the value ranges from 0 to 1, and a larger value indicates heavier overlap. Vertical coordinate: number of bounding boxes across all images.

  Explanation: Use this metric to determine how heavily the objects to be detected overlap. Overlapped objects are difficult to detect; you can add more data or leave some objects unlabeled based on your needs.
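A simple way to realize this definition is to rasterize the other boxes onto a pixel grid and measure what fraction of the target box they cover. This is an illustrative approximation, not necessarily how ModelArts computes the score:

```python
import numpy as np

def overlap_score(boxes, idx, img_h, img_w):
    """Fraction of box `idx` covered by the union of the other boxes, in [0, 1].

    boxes: list of integer (x1, y1, x2, y2) tuples (assumed convention).
    Rasterizing on a pixel grid handles overlapping covers without
    double-counting, at integer-pixel precision.
    """
    mask = np.zeros((img_h, img_w), dtype=bool)
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        if i != idx:
            mask[y1:y2, x1:x2] = True  # union of all other boxes
    x1, y1, x2, y2 = boxes[idx]
    area = (x2 - x1) * (y2 - y1)
    return mask[y1:y2, x1:x2].sum() / area
```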

Brightness of Bounding Boxes

  Description: Horizontal coordinate: brightness of the image region inside a target bounding box; a larger value indicates a brighter region. Vertical coordinate: number of bounding boxes across all images.

  Explanation: The chart of this metric generally follows a normal distribution. Use it to determine the brightness of the objects to be detected. In some special scenarios, object brightness is low and may not meet detection requirements.

Blur Score of Bounding Boxes (Clarity)

  Description: Horizontal coordinate: clarity of the image region inside a target bounding box; a larger value indicates higher clarity. Vertical coordinate: number of bounding boxes across all images.

  Explanation: Use this metric to determine whether the objects to be detected are blurred. For example, a moving object may blur during collection, and its data needs to be collected again.