Data Selection (Hard Examples)
Algorithm Overview
In real service scenarios, model maintenance is a long-term process. For example, the model is retrained weekly or monthly, or whenever the accumulated data reaches a certain volume. Retraining on the full dataset consumes a large amount of labeling manpower and training time. To improve model maintenance efficiency, you can retrain on hard examples instead.
The hard example filtering algorithm analyzes the full dataset and outputs only a small amount of valuable data for model maintenance. Retraining with the filtered data therefore significantly reduces labeling manpower and training time.
The hard example filtering algorithm integrates multiple mining methods. To achieve the best result, select some or all of the methods and adjust their weights based on your data, as sketched below.
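Conceptually, the filtering result is a weighted combination of the per-sample scores produced by the enabled mining methods. The following is a minimal sketch of this idea, assuming each method returns a score in [0, 1]; the function and the example values are illustrative and do not reproduce the actual ModelArts implementation.

```python
# Minimal sketch of weighted hard-example scoring (illustrative only; not the
# actual ModelArts implementation). Each mining method is assumed to return a
# per-sample score in [0, 1]; the weighted average is compared against
# algo_hard_threshold to decide whether the sample is kept as a hard example.

def combine_mining_scores(method_scores, weights, algo_hard_threshold=0.1):
    """method_scores: {method_name: score} for one sample.
    weights: {method_name: weight}, e.g. from comprehensive_algo_config.
    Returns (hard_coefficient, is_hard)."""
    total_weight = sum(weights.get(name, 0.0) for name in method_scores)
    if total_weight == 0:
        return 0.0, False
    coefficient = sum(
        weights.get(name, 0.0) * score for name, score in method_scores.items()
    ) / total_weight
    return coefficient, coefficient >= algo_hard_threshold


# Example with two methods enabled, matching the comprehensive_algo_config format.
scores = {"predict_score_mining": 0.7, "anomaly_detection_mining": 0.2}
weights = {"predict_score_mining": 0.3900, "anomaly_detection_mining": 0.2020}
print(combine_mining_scores(scores, weights))
```

In practice, you only choose the methods and weights through comprehensive_algo_config (see the Parameters table below); the combination itself is performed by the operator.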
Parameters
| Parameter | Mandatory | Default Value | Description |
|---|---|---|---|
| source_service | Y | inference | Preset data source of a hard example filtering task. Only inference is supported. The value cannot be changed. |
| filter_func | Y | comprehensive_mining | Hard example filtering algorithm. Only comprehensive_mining is supported. The value cannot be changed. |
| checkpoint_path | Y | /home/work/user-job-dir/data_filter/resnet_v1_50 | Model directory used for feature extraction. Only the resnet_v1_50 model pre-trained on ImageNet is supported. The value cannot be changed. |
| model_serving_url | N | None | Inference model path, that is, the output path of a training job. This model is used for inference after data augmentation in the aug_consistent_mining algorithm. Enter an existing OBS directory, for example, obs://obs_bucket_name/folder_name/. |
| train_data_path | N | None | Training dataset, that is, the training data used by the model specified in model_serving_url. Enter the manifest file generated by the dataset version, for example, obs://obs_bucket_name/folder_name/v001.manifest. |
| comprehensive_algo_config | N | clustering_mining:0.2020+aug_consistent_mining:0.4265+feature_distribution_mining:0.0451+sequential_mining:0.425+image_similarity_mining:0.0949+predict_score_mining:0.3900+anomaly_detection_mining:0.2020 | Mining algorithms and their weights. By default, the optimal values obtained from system experiments are used. You can also configure this parameter for different data. Example: predict_score_mining:0.3900+anomaly_detection_mining:0.2020 |
| algo_hard_threshold | N | 0.1 | Threshold of the filtering coefficient. The value ranges from 0 to 1. If the threshold is set too high, the number of output samples may be 0. Set this parameter to a proper value. |
| aug_op_config | N | crop:0.1+fliplr:0.1+gaussianblur:0.1 | Data augmentation methods used in the aug_consistent_mining algorithm. The value can be crop, fliplr, gaussianblur, flipud, scale, translate, shear, superpixels, sharpen, add, or invert. |
| feature_op_config | N | image_aspect_ratio:0.5+image_brightness:1.0+image_saturation:0.5+image_resolution:0.5+image_colorfulness:0.5+ambiguity:1.0+bbox_num:1.0+bbox_iou:1.0+bbox_std:0.5+bbox_bright:0.5+bbox_ambiguity:0.5+bbox_aspect_ratio:1.0+bbox_area_ratio:0.5+bbox_edge_value:0.5 | Features defined in the feature_distribution_mining algorithm. The weights can be modified. |
| score_threshold_up | N | 0.6 | Maximum confidence threshold defined in the predict_score_mining algorithm. The value ranges from 0 to 1. |
| score_threshold_low | N | 0.3 | Minimum confidence threshold defined in the predict_score_mining algorithm. The value ranges from 0 to 1. |
| margin | N | 0.8 | Difference between the top two confidence values. If the difference is greater than this value, the sample is a hard example. The value ranges from 0 to 1. The default value is 0.8. |
| similarity_sample_ratio | N | 1.0 | Similarity ratio used in the image_similarity_mining algorithm. The value ranges from 0 to 1. The default value is 1.0. |
| task_summary_file | N | None | Output path and file name of the simplified algorithm log. Enter an existing OBS directory that starts with obs://. The file name can be custom. Example: obs://obs_bucket_name/folder_name/xxx.log |
| output_dataset_type | N | manifest | Output dataset type. This parameter is automatically filled on the data processing page based on your configurations. |
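The comprehensive_algo_config, aug_op_config, and feature_op_config parameters share the same name:weight format, with entries joined by plus signs. The following sketch only illustrates how such a string maps to names and weights; ModelArts parses these values internally, and this helper is hypothetical.

```python
# Illustration of the "name:weight+name:weight" format used by
# comprehensive_algo_config, aug_op_config, and feature_op_config.
# ModelArts parses these strings internally; this helper is hypothetical.

def parse_weighted_config(config):
    """Parse a string such as 'predict_score_mining:0.3900+anomaly_detection_mining:0.2020'."""
    weights = {}
    for item in config.split("+"):
        name, weight = item.split(":")
        weights[name] = float(weight)
    return weights


print(parse_weighted_config("predict_score_mining:0.3900+anomaly_detection_mining:0.2020"))
# {'predict_score_mining': 0.39, 'anomaly_detection_mining': 0.202}
```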
Operator Input Requirements
The following two types of operator input are available:
- Datasets: Select a dataset and its version created on the ModelArts console from the drop-down list. Ensure that the dataset type is the same as the scenario type selected for this task.
- OBSCatalog: The directory must contain the raw images used for inference and the inference result files.
The directory structure is as follows:
```
input_path/
--images/                 # The folder name must be images.
----1.jpg
----2.jpg
--inference_results/      # The folder name must be inference_results.
----1.jpg_result.txt
----2.jpg_result.txt
```
The .txt inference result files must meet the following format requirements. If you use a model trained by a ModelArts built-in algorithm for inference, the default inference result already meets them. A sketch of producing such files follows the two examples below.
- Image classification
{ "predicted_label": "dog", "scores": [ [ "dog", "0.589" ], [ "cat", "0.411" ] ] }
- Object detection
{ "detection_classes": [ "cat", "cat" ], "detection_boxes": [ [ 117.56356048583984, 335.9902648925781, 270.50848388671875, 469.0136413574219 ], [ 18.747316360473633, 13.10757064819336, 217.25146484375, 108.3551025390625 ] ], "detection_scores": [ 0.5179755091667175, 0.46941104531288147 ] }
Output Description
- Object detection
The output directory structure is as follows:
```
output_path:
--Data
----1.jpg
----1.xml            # The filtering result is exported to this directory.
--output.manifest
```
A manifest file example is as follows:
{"source":"/tmp/test_out/object_detection/images/be462ea9c5abc09f.jpg", "hard":"True", "hard-reasons":"0", # Reason why the sample is determined as a hard example. The specific reason is displayed only in the auto labeling module. "hard-coefficient":"1.0", # Hard example coefficient obtained using the hard example algorithm. A larger value indicates a higher probability that the sample is a hard example. "annotation":[ {"annotation-loc":"/tmp/test_out/object_detection/annotations/be462ea9c5abc09f.xml", "type":"modelarts/object_detection", "annotation-format":"PASCAL VOC", "annotated-by":"modelarts/hard_example_algo"}]}
- Image classification
The output directory structure is as follows:
```
output_path:
--Data
----class1
------1.jpg
----class2
------2.jpg
--output.manifest
```
A manifest file example is as follows:
{"source":"obs://obs_bucket_name/folder_name/catDog/5.jpg", "hard":true, "hard-reasons":"1-20-2-19-21-3", "hard-coefficient":1.0, "annotation":[ {"name":"cat", "type":"modelarts/image_classification", "confidence":0.599, "annotated-by":"modelarts/hard_example_algo"}]}
Log File Description
task_summary_file specifies the output path of the simplified algorithm log. The content is as follows:
{ "task_status": 'SUCCEED', # Algorithm execution status "total_sample": integer, # Total input samples "hard_sample": integer # Total output samples }
or
{ "task_status": 'FAILED', "error_message": 'xxxxxx' # Error information that causes the algorithm execution failure }