Updated on 2024-04-30 GMT+08:00

Data Feature Analysis

Images or target bounding boxes are analyzed based on image features, such as blur and brightness, and the results are plotted as curves to help you process datasets.

You can also select multiple versions of a dataset to view their curves for comparison and analysis.

Background

  • Data feature analysis is only available for image datasets of the image classification and object detection types.
  • Data feature analysis is available only for published dataset versions in Default format.
  • The data scope for feature analysis varies with the dataset type.
    • For an object detection dataset, if the number of labeled samples is 0, the View Data Feature tab page is unavailable after a version is published and no data features are displayed. After images are labeled and a version is published, the data features of the labeled images are displayed.
    • For an image classification dataset, if the number of labeled samples is 0, the View Data Feature tab page is unavailable after a version is published and no data features are displayed. After images are labeled and a version is published, the data features of all images are displayed.
  • The analysis result is valid only when the number of images in a dataset reaches a certain level. Generally, more than 1,000 images are required.
  • Image classification supports the following data feature metrics: Resolution, Aspect Ratio, Brightness, Saturation, Blur Score, and Colorfulness. Object detection supports all data feature metrics. Supported Data Feature Metrics lists all metrics supported by ModelArts.
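As a rough illustration of what the image-classification metrics measure, the sketch below computes resolution, aspect ratio, brightness, and saturation for an RGB image with NumPy. The formulas (BT.601 luminance, HSV-style saturation) are common approximations assumed here, not ModelArts' documented internals.

```python
import numpy as np

def image_metrics(img):
    """Approximate per-image feature metrics for an RGB uint8 array (H, W, 3).

    The formulas are illustrative approximations, not the exact ones
    ModelArts uses internally.
    """
    h, w = img.shape[:2]
    r = img[..., 0].astype(float)
    g = img[..., 1].astype(float)
    b = img[..., 2].astype(float)
    # Brightness: mean luminance with ITU-R BT.601 weights.
    brightness = (0.299 * r + 0.587 * g + 0.114 * b).mean()
    # Saturation: mean of the HSV S channel, S = (max - min) / max.
    mx = img.astype(float).max(axis=-1)
    mn = img.astype(float).min(axis=-1)
    saturation = np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-9), 0.0).mean()
    return {
        "resolution": h * w,      # area is used as the statistical value
        "aspect_ratio": w / h,
        "brightness": brightness,
        "saturation": saturation,
    }

# A mid-gray 200x100 image: zero saturation, brightness 128.
gray = np.full((100, 200, 3), 128, dtype=np.uint8)
m = image_metrics(gray)
```

Running a function like this over every image in a dataset and histogramming the values reproduces the kind of distribution curves the console draws.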

Data Feature Analysis

  1. Log in to the ModelArts management console. In the navigation pane, choose Data Management > Datasets.
  2. Locate the target dataset, click More in the Operation column, and select View Data Feature. The View Data Feature tab of the dataset is displayed.

    You can also click a dataset name to go to the dataset page and click the View Data Feature tab.

  3. By default, feature analysis is not started for published datasets. You need to manually start a feature analysis task for each dataset version. On the View Data Feature tab, click Analyze Features.
  4. In the dialog box that is displayed, configure the dataset version for feature analysis and click Yes to start analysis.
    Version: Select a published version of the dataset.
    Figure 1 Starting a data feature analysis task
  5. After a data feature analysis task starts, it takes a certain period of time to complete, depending on the data volume. When the selected version appears in the Version drop-down list and can be selected, the analysis is complete.
  6. View the data feature analysis result.

    Version: Select the versions to be compared from the drop-down list. You can also select only one version.

    Type: Select the type to be analyzed. The value can be all, train, eval, or inference.

    Data Feature Metric: Select metrics to be displayed from the drop-down list. For details, see Supported Data Feature Metrics.

    Then, the selected version and metrics are displayed on the page. The displayed chart helps you understand data distribution for better data processing.

  7. View historical records of the analysis task.

    After data feature analysis is complete, you can click Task History on the right of the View Data Feature tab page to view historical analysis tasks and their statuses in the displayed dialog box.

Supported Data Feature Metrics

Table 1 Data feature metrics

For each metric, the table lists a description of what is measured and an explanation of how to interpret the result.

Resolution

  Description: Image resolution. The image area (width × height) is used as the statistical value.

  Explanation: Use the analysis result to check for outliers. If an outlier exists, you can resize the image or delete it.

Aspect Ratio

  Description: The proportional relationship between an image's width and its height.

  Explanation: The chart of this metric generally follows a normal distribution and is typically used to compare the difference between the training set and the dataset used in the real scenario.

Brightness

  Description: The perceived luminance of an image. A larger value indicates a brighter image.

  Explanation: The chart of this metric generally follows a normal distribution. You can determine whether the overall dataset is bright or dark from the distribution center and adjust the brightness based on your application scenario. For example, for a nighttime scenario, the brightness should be lower.

Saturation

  Description: Color saturation of an image. A larger value indicates that colors in the image are easier to distinguish.

  Explanation: The chart of this metric generally follows a normal distribution and is typically used to compare the difference between the training set and the dataset used in the real scenario.

Blur Score (Clarity)

  Description: Image clarity, calculated using the Laplace operator. A larger value indicates sharper edges and higher clarity.

  Explanation: Determine whether the clarity meets your requirements based on the application scenario. For example, data collected from HD cameras requires higher clarity. You can sharpen or blur the dataset and add noise to adjust the clarity.
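The description says clarity is calculated using the Laplace operator. A common concrete realization, assumed here (the exact kernel ModelArts uses is not documented in this page), is the variance of a discrete Laplacian response:

```python
import numpy as np

def blur_score(gray):
    """Variance of the Laplacian response of a grayscale float array.

    Higher values indicate sharper edges. The 4-neighbor Laplacian kernel
    is one common choice; it is an assumption here.
    """
    g = gray.astype(float)
    # 4-neighbor discrete Laplacian evaluated on the interior pixels.
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return lap.var()

# A flat image has no edges, so its score is 0; a hard edge scores higher.
flat = np.full((32, 32), 7.0)
sharp = np.zeros((32, 32))
sharp[:, 16:] = 255.0  # hard vertical edge
```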

Colorfulness

  Description: Horizontal coordinate: colorfulness of an image; a larger value indicates richer colors. Vertical coordinate: number of images.

  Explanation: Perceived colorfulness, typically used to compare the difference between the training set and the dataset used in the real scenario.
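The page does not name a formula for colorfulness; a widely used definition, assumed here, is the Hasler-Süsstrunk metric built from the opponent channels rg = R - G and yb = (R + G)/2 - B:

```python
import numpy as np

def colorfulness(img):
    """Hasler-Süsstrunk colorfulness of an RGB array (H, W, 3).

    Larger values mean richer colors; a pure grayscale image scores 0.
    This formula is an assumption, not ModelArts' documented internal one.
    """
    r = img[..., 0].astype(float)
    g = img[..., 1].astype(float)
    b = img[..., 2].astype(float)
    rg = r - g                    # red-green opponent channel
    yb = 0.5 * (r + g) - b        # yellow-blue opponent channel
    std = np.hypot(rg.std(), yb.std())
    mean = np.hypot(rg.mean(), yb.mean())
    return std + 0.3 * mean

gray = np.full((16, 16, 3), 90, dtype=np.uint8)     # colorless
red = np.zeros((16, 16, 3), dtype=np.uint8)
red[..., 0] = 255                                   # strongly colored
```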

Bounding Box Number

  Description: Horizontal coordinate: number of bounding boxes in an image. Vertical coordinate: number of images.

  Explanation: It is difficult for a model to detect a large number of bounding boxes in a single image, so more images containing many bounding boxes are required for training.

Std of Bounding Boxes Area Per Image

  Description: Horizontal coordinate: standard deviation of the bounding box areas in an image. If an image has only one bounding box, the standard deviation is 0; a larger value indicates greater variation in bounding box size within the image. Vertical coordinate: number of images.

  Explanation: It is difficult for a model to detect many bounding boxes of widely varying sizes in a single image. Add training data for such scenarios, or delete the data if these scenarios do not occur in practice.
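This per-image statistic follows directly from the definition above. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (an assumed convention):

```python
import numpy as np

def bbox_area_std(boxes):
    """Standard deviation of bounding-box areas within one image.

    boxes: list of (x1, y1, x2, y2) tuples. A single box yields 0,
    matching the metric description; larger values mean more size
    variation among the boxes in the image.
    """
    areas = np.array([(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes],
                     dtype=float)
    return areas.std()
```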

Aspect Ratio of Bounding Boxes

  Description: Horizontal coordinate: aspect ratio of the target bounding boxes. Vertical coordinate: number of bounding boxes across all images.

  Explanation: The chart of this metric generally follows a Poisson distribution and is closely related to the application scenario. It is mainly used to compare differences between the training set and the validation set. For example, if bounding boxes in the training set are rectangular while those in the validation set are close to square, detection results will be significantly affected.

Area Ratio of Bounding Boxes

  Description: Horizontal coordinate: ratio of a bounding box area to the entire image area; a larger value indicates that the object occupies more of the image. Vertical coordinate: number of bounding boxes across all images.

  Explanation: Use this metric to determine the distribution of anchors used in the model. If the target bounding boxes are large, set the anchors to large values.
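The ratio follows directly from the definition. A minimal sketch, assuming (x1, y1, x2, y2) pixel coordinates:

```python
def bbox_area_ratio(box, img_w, img_h):
    """Ratio of a bounding-box area to the whole image area, in [0, 1].

    box: (x1, y1, x2, y2) corner coordinates (an assumed convention).
    """
    x1, y1, x2, y2 = box
    return ((x2 - x1) * (y2 - y1)) / (img_w * img_h)
```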

Marginalization Value of Bounding Boxes

  Description: Horizontal coordinate: marginalization degree, that is, the distance between the center of a target bounding box and the center of the image, divided by the total distance of the image; a larger value indicates that the object is closer to the edge. (The total distance is measured from the image center, along the ray through the bounding box center, to the point where that ray crosses the image border.) Vertical coordinate: number of bounding boxes across all images.

  Explanation: The chart of this metric generally follows a normal distribution. Use it to determine whether objects sit at the edge of images. If part of an object lies at the image edge, you can add more data or leave that object unlabeled.
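The geometric definition above can be sketched as follows; the (x1, y1, x2, y2) box convention is an assumption:

```python
def marginalization(box, img_w, img_h):
    """Marginalization degree of a bounding box, per the definition above.

    Distance from the image center to the box center, divided by the
    distance from the image center to the image border along the same ray.
    0 means the box is centered; values near 1 mean it sits at the edge.
    """
    x1, y1, x2, y2 = box
    # Box-center offset from the image center.
    cx = (x1 + x2) / 2 - img_w / 2
    cy = (y1 + y2) / 2 - img_h / 2
    if cx == 0 and cy == 0:
        return 0.0
    # Scale factor t that stretches (cx, cy) to the first border it hits:
    # the ray meets either a vertical or a horizontal border first.
    tx = (img_w / 2) / abs(cx) if cx else float("inf")
    ty = (img_h / 2) / abs(cy) if cy else float("inf")
    t = min(tx, ty)
    # |offset| / (t * |offset|) simplifies to 1 / t.
    return 1.0 / t
```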

Overlap Score of Bounding Boxes

  Description: Horizontal coordinate: overlap degree, that is, the fraction of a single bounding box that is overlapped by other bounding boxes; the value ranges from 0 to 1, and a larger value indicates heavier overlap. Vertical coordinate: number of bounding boxes across all images.

  Explanation: Use this metric to determine how heavily the objects to be detected overlap. Overlapped objects are difficult to detect; you can add more data or leave some objects unlabeled based on your needs.
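A simple way to realize this definition is to rasterize the other boxes onto a pixel grid and measure what fraction of the target box they cover. This is an illustrative approximation, not necessarily how ModelArts computes the score:

```python
import numpy as np

def overlap_score(boxes, idx, img_h, img_w):
    """Fraction of box `idx` covered by the union of the other boxes, in [0, 1].

    boxes: list of integer (x1, y1, x2, y2) tuples (assumed convention).
    Rasterizing on a pixel grid handles overlapping covers without
    double-counting, at integer-pixel precision.
    """
    mask = np.zeros((img_h, img_w), dtype=bool)
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        if i != idx:
            mask[y1:y2, x1:x2] = True  # union of all other boxes
    x1, y1, x2, y2 = boxes[idx]
    area = (x2 - x1) * (y2 - y1)
    return mask[y1:y2, x1:x2].sum() / area
```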

Brightness of Bounding Boxes

  Description: Horizontal coordinate: brightness of the image region inside a target bounding box; a larger value indicates a brighter region. Vertical coordinate: number of bounding boxes across all images.

  Explanation: The chart of this metric generally follows a normal distribution. Use it to determine the brightness of the objects to be detected. In some special scenarios, object brightness is low and may not meet detection requirements.

Blur Score of Bounding Boxes (Clarity)

  Description: Horizontal coordinate: clarity of the image region inside a target bounding box; a larger value indicates higher clarity. Vertical coordinate: number of bounding boxes across all images.

  Explanation: Use this metric to determine whether the objects to be detected are blurred. For example, a moving object may blur during collection, and its data needs to be collected again.