
Updated on 2023-09-06 GMT+08:00


Data Selection (Hard Examples)

Algorithm Overview

In production, model maintenance is a long-term process. For example, a model may be retrained weekly or monthly, or whenever the accumulated data reaches a certain volume. Retraining on the full dataset often consumes a large amount of labeling effort and training time. To maintain the model more efficiently, you can use hard example-based retraining.

A hard example filtering algorithm analyzes the full dataset and outputs only a small amount of valuable data for model maintenance. Retraining on the filtered data reduces the labeling effort and training time.

The hard example filtering algorithm integrates multiple methods. For best results, select some or all of these methods and adjust their weights to suit your needs.

Parameters

Parameter	Mandatory	Default Value	Description
source_service	Y	inference	Preset data source of a hard example filtering task. Only inference is supported. The parameter value cannot be changed.
filter_func	Y	comprehensive_mining	Set the hard example filtering algorithm to comprehensive_mining. The parameter value cannot be changed.
checkpoint_path	Y	/home/work/user-job-dir/data_filter/resnet_v1_50	Model directory used for feature extraction. Only the resnet_v1_50 model pre-trained on ImageNet is supported. The parameter value cannot be changed.
model_serving_url	N	None	Inference model path, that is, the output path of a training job. This model is used for inference after data augmentation in the aug_consistent_mining algorithm. Enter an existing OBS directory, for example, obs://obs_bucket_name/folder_name/.
train_data_path	N	None	Training dataset, that is, the data used to train the model specified by model_serving_url. Enter the manifest file generated for the dataset version. The file must already exist in OBS, for example, obs://obs_bucket_name/folder_name/v001.manifest.
comprehensive_algo_config	N	clustering_mining:0.2020+aug_consistent_mining:0.4265+feature_distribution_mining:0.0451+sequential_mining:0.425+image_similarity_mining:0.0949+predict_score_mining:0.3900+anomaly_detection_mining:0.2020	Algorithms and their weights, written as algorithm:weight pairs joined with plus signs (+). By default, the optimal values found in system experiments are used. You can also specify a subset of the algorithms or adjust the weights to suit your data. Example: predict_score_mining:0.3900+anomaly_detection_mining:0.2020
algo_hard_threshold	N	0.1	Threshold of the filtering coefficient. The value ranges from 0 to 1. If the threshold is set too high, no samples may be output. Set this parameter to a proper value.
aug_op_config	N	crop:0.1+fliplr:0.1+gaussianblur:0.1	Data augmentation method used in the aug_consistent_mining algorithm. The value can be crop, fliplr, gaussianblur, flipud, scale, translate, shear, superpixels, sharpen, add, or invert.
feature_op_config	N	image_aspect_ratio:0.5+image_brightness:1.0+image_saturation:0.5+image_resolution:0.5+image_colorfulness:0.5+ambiguity:1.0+bbox_num:1.0+bbox_iou:1.0+bbox_std:0.5+bbox_bright:0.5+bbox_ambiguity:0.5+bbox_aspect_ratio:1.0+bbox_area_ratio:0.5+bbox_edge_value:0.5	Feature defined in the feature_distribution_mining algorithm. The weight can be modified.
score_threshold_up	N	0.6	Maximum confidence value defined in the predict_score_mining algorithm. The value ranges from 0 to 1.
score_threshold_low	N	0.3	Minimum confidence value defined in the predict_score_mining algorithm. The value ranges from 0 to 1.
margin	N	0.8	Top 2 confidence difference. If the difference is greater than this parameter value, this sample is a hard example. The value ranges from 0 to 1.
similarity_sample_ratio	N	1.0	Similarity ratio in the image_similarity_mining algorithm. The value ranges from 0 to 1.
task_summary_file	N	None	Output path and file name of the simplified algorithm log. Enter a path that starts with obs:// and points to an existing OBS directory. The file name can be customized. Example: obs://obs_bucket_name/folder_name/xxx.log
output_dataset_type	N	manifest	Output type. The options are as follows: directory: outputs the raw images and labels to the Data folder in the result directory. manifest: outputs only the manifest file. This parameter is automatically filled in on the data processing page based on your configuration.
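
The comprehensive_algo_config, aug_op_config, and feature_op_config values in the table share a name:weight format joined with plus signs. The following is a minimal sketch of reading and rebuilding such a string, for example to check a custom value before submitting a task. The helper names parse_weighted_config and build_weighted_config are hypothetical and are not part of ModelArts; the format is inferred from the default values above, and whether the service tolerates extra whitespace is not stated in this document.

```python
def parse_weighted_config(config: str) -> dict[str, float]:
    """Parse a 'name:weight+name:weight' string into a {name: weight} dict."""
    weights: dict[str, float] = {}
    for item in config.split("+"):
        name, sep, value = item.strip().partition(":")
        if not sep or not name:
            raise ValueError(f"expected 'name:weight', got {item!r}")
        weights[name] = float(value)  # raises ValueError on a non-numeric weight
    return weights


def build_weighted_config(weights: dict[str, float]) -> str:
    """Serialize a {name: weight} dict back into 'name:weight+name:weight'."""
    return "+".join(f"{name}:{weight}" for name, weight in weights.items())


# Example value taken from the comprehensive_algo_config row.
algo_weights = parse_weighted_config(
    "predict_score_mining:0.3900+anomaly_detection_mining:0.2020"
)
print(algo_weights)

# Default value of aug_op_config.
aug_ops = parse_weighted_config("crop:0.1+fliplr:0.1+gaussianblur:0.1")
print(build_weighted_config(aug_ops))
```

Parsing into a dictionary makes it easy to drop an algorithm or change a single weight and then rebuild the string, rather than editing the long default value by hand.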

Operator Input Requirements

The following two types of operator input are available:

Datasets: Select a dataset and its version created on the ModelArts console from the drop-down list. Ensure that the dataset type is the same as the scenario type selected in this task.

OBS Catalog: The directory must contain the raw images for inference and the inference result files.

The directory structure is as follows:

input_path/
    --images/                # The folder name must be images.
        ----1.jpg
        ----2.jpg
    --inference_results/     # The folder name must be inference_results.
        ----1.jpg_result.txt
        ----2.jpg_result.txt
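
Before starting a task, it may help to confirm that every image has a matching result file. The sketch below assumes the OBS directory has been copied to a local path, and it assumes the naming rule image name plus _result.txt, which is inferred from the example above. The function name find_missing_results is hypothetical.

```python
from pathlib import Path
import tempfile


def find_missing_results(input_path: Path) -> list[str]:
    """Return names of images that have no <image name>_result.txt file."""
    images_dir = input_path / "images"
    results_dir = input_path / "inference_results"
    for required in (images_dir, results_dir):
        if not required.is_dir():
            raise FileNotFoundError(f"missing required folder: {required.name}")
    return sorted(
        image.name
        for image in images_dir.iterdir()
        if image.is_file()
        and not (results_dir / f"{image.name}_result.txt").is_file()
    )


# Build a small throwaway copy of the layout to demonstrate the check.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "images").mkdir()
    (root / "inference_results").mkdir()
    for name in ("1.jpg", "2.jpg"):
        (root / "images" / name).touch()
    # Only 1.jpg gets a result file, so 2.jpg should be reported.
    (root / "inference_results" / "1.jpg_result.txt").touch()
    missing = find_missing_results(root)

print(missing)
```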

Each .txt inference result file must use the format shown below for its task type. If you use a model trained by a ModelArts built-in algorithm for inference, the default inference result already meets these requirements.

Image classification

{
    "predicted_label": "dog",
    "scores": [
        [
            "dog",
            "0.589"
        ],
        [
            "cat",
            "0.411"
        ]
    ]
}
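
The margin parameter refers to the difference between the top 2 confidence values. As a rough illustration of how that value could be read from a classification result file, the sketch below parses the sample above and computes the difference. It is not the operator's implementation; the function name top2_gap is hypothetical, and the string-to-float conversion reflects the sample, where scores are stored as strings.

```python
import json

# Sample taken from the image classification example above.
result_text = """
{
    "predicted_label": "dog",
    "scores": [["dog", "0.589"], ["cat", "0.411"]]
}
"""


def top2_gap(result: dict) -> float:
    """Return the difference between the two highest confidence scores."""
    # Scores are strings in the sample, so convert them before sorting.
    confidences = sorted(
        (float(score) for _, score in result["scores"]), reverse=True
    )
    if len(confidences) < 2:
        raise ValueError("need at least two classes to compute a gap")
    return confidences[0] - confidences[1]


result = json.loads(result_text)
gap = top2_gap(result)
print(result["predicted_label"], round(gap, 3))
```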

Object detection

{
    "detection_classes": [
        "cat",
        "cat"
    ],
    "detection_boxes": [
        [
            117.56356048583984,
            335.9902648925781,
            270.50848388671875,
            469.0136413574219
        ],
        [
            18.747316360473633,
            13.10757064819336,
            217.25146484375,
            108.3551025390625
        ]
    ],
    "detection_scores": [
        0.5179755091667175,
        0.46941104531288147
    ]
}
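
A basic consistency check for an object detection result file might look like the following sketch. It only verifies what the sample above shows: three parallel lists with four coordinates per box. The coordinate order is not stated in this document, so the code does not interpret it. The 0 to 1 score range is an assumption based on the confidence ranges in the parameter table, and the function name check_detection_result is hypothetical.

```python
import json

# Sample taken from the object detection example above.
result_text = """
{
    "detection_classes": ["cat", "cat"],
    "detection_boxes": [
        [117.56356048583984, 335.9902648925781, 270.50848388671875, 469.0136413574219],
        [18.747316360473633, 13.10757064819336, 217.25146484375, 108.3551025390625]
    ],
    "detection_scores": [0.5179755091667175, 0.46941104531288147]
}
"""


def check_detection_result(result: dict) -> int:
    """Validate field consistency and return the number of detections."""
    classes = result["detection_classes"]
    boxes = result["detection_boxes"]
    scores = result["detection_scores"]
    if not (len(classes) == len(boxes) == len(scores)):
        raise ValueError("classes, boxes, and scores differ in length")
    for box in boxes:
        if len(box) != 4:
            raise ValueError(f"expected 4 coordinates per box, got {len(box)}")
    for score in scores:
        # Assumed range; see the note above the code.
        if not 0.0 <= float(score) <= 1.0:
            raise ValueError(f"score out of range: {score}")
    return len(classes)


num_detections = check_detection_result(json.loads(result_text))
print(num_detections)
```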

Output Description

Object detection

The output directory structure is as follows:

output_path :
    --Data
        ----1.jpg
        ----1.xml     # Export the filtering result to this directory.
    --output.manifest

A manifest file example is as follows:

{"source":"/tmp/test_out/object_detection/images/be462ea9c5abc09f.jpg",
"hard":"True",
"hard-reasons":"0", # Reason why the sample is determined as a hard example. The specific reason is displayed only in the auto labeling module.
"hard-coefficient":"1.0", # Hard example coefficient obtained using the hard example algorithm. A larger value indicates a higher probability that the sample is a hard example.
"annotation":[
{"annotation-loc":"/tmp/test_out/object_detection/annotations/be462ea9c5abc09f.xml",
"type":"modelarts/object_detection",
"annotation-format":"PASCAL VOC",
"annotated-by":"modelarts/hard_example_algo"}]}

Image classification

The output directory structure is as follows:

output_path :
    --Data
        ----class1
            ------1.jpg
        ----class2
            ------2.jpg
    --output.manifest

A manifest file example is as follows:

{"source":"obs://obs_bucket_name/folder_name/catDog/5.jpg",
"hard":true,
"hard-reasons":"1-20-2-19-21-3",
"hard-coefficient":1.0,
"annotation":[
{"name":"cat",
"type":"modelarts/image_classification",
"confidence":0.599,
"annotated-by":"modelarts/hard_example_algo"}]}
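
To pick the hard examples out of output.manifest for relabeling, a script along the following lines could be used. This is a sketch under two assumptions: that the manifest stores one JSON object per line (the samples above are wrapped and commented for readability), and that hard and hard-coefficient may appear either as strings or as native JSON values, as the two samples differ. The function names and the records in the demo are hypothetical.

```python
import json


def is_hard(record: dict) -> bool:
    """Treat both the boolean true and the string "True" as a hard example."""
    return str(record.get("hard", "")).lower() == "true"


def select_hard_examples(manifest_lines, min_coefficient=0.0):
    """Return hard-example records sorted by descending hard-coefficient."""
    selected = []
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # The coefficient is a string in one sample and a number in the other.
        coefficient = float(record.get("hard-coefficient", 0.0))
        if is_hard(record) and coefficient >= min_coefficient:
            selected.append((coefficient, record))
    selected.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in selected]


# Hypothetical records shaped like the samples above (paths are placeholders).
lines = [
    '{"source": "obs://obs_bucket_name/folder_name/catDog/5.jpg", "hard": true, "hard-coefficient": 1.0}',
    '{"source": "obs://obs_bucket_name/folder_name/catDog/6.jpg", "hard": "True", "hard-coefficient": "0.4"}',
    '{"source": "obs://obs_bucket_name/folder_name/catDog/7.jpg", "hard": false, "hard-coefficient": 0.05}',
]
hard = select_hard_examples(lines, min_coefficient=0.5)
print([record["source"] for record in hard])
```

Sorting by hard-coefficient puts the samples most likely to be hard examples first, which is convenient when only a limited number can be relabeled.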

Log File Description

task_summary_file is the output file path of the simplified log. The file contains one of the following, depending on whether the algorithm execution succeeded or failed:

{
"task_status": 'SUCCEED', # Algorithm execution status
"total_sample": integer,  # Total input samples
"hard_sample": integer # Total output samples
}

{
"task_status": 'FAILED',
"error_message": 'xxxxxx' # Error information that causes the algorithm execution failure
}
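
The sketch below shows one way to turn the simplified log into a one-line status message. The exact serialization of the file is not specified here; because the samples use single-quoted strings, the code reads the content with ast.literal_eval instead of a JSON parser, which is an assumption. The function name summarize_task and the numbers in the demo are hypothetical; only the keys come from the samples above.

```python
import ast


def summarize_task(summary_text: str) -> str:
    """Turn the simplified log content into a one-line status message."""
    # literal_eval accepts the single-quoted strings shown in the samples;
    # it would not accept JSON-only literals such as true or null.
    summary = ast.literal_eval(summary_text.strip())
    if summary["task_status"] == "SUCCEED":
        total = summary["total_sample"]
        hard = summary["hard_sample"]
        ratio = hard / total if total else 0.0
        return f"SUCCEED: {hard} of {total} samples selected ({ratio:.1%})"
    message = summary.get("error_message", "no error message")
    return f"{summary['task_status']}: {message}"


# Hypothetical values; only the keys come from the samples above.
ok_text = "{'task_status': 'SUCCEED', 'total_sample': 200, 'hard_sample': 25}"
failed_text = "{'task_status': 'FAILED', 'error_message': 'xxxxxx'}"
ok_line = summarize_task(ok_text)
failed_line = summarize_task(failed_text)
print(ok_line)
print(failed_line)
```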
