Updated on 2023-09-06 GMT+08:00

Data Selection (Hard Examples)

Algorithm Overview

In production, model maintenance is a long-term process. For example, a model may be retrained weekly or monthly, or whenever the accumulated data reaches a certain volume. Retraining on the full dataset often consumes a large amount of labeling effort and training time. To make model maintenance more efficient, you can use hard example-based retraining instead.

A hard example filtering algorithm is used to analyze and filter full data and output only a small amount of valuable data for model maintenance. In this case, retraining with the filtered data effectively reduces the labeling manpower and training time.

The hard example filtering algorithm integrates multiple methods. For the best results, select some or all of the methods and adjust their weights based on your needs.

Parameters

Parameter

Mandatory

Default Value

Description

source_service

Y

inference

Preset data source of a hard example filtering task. Only inference is supported. The parameter value cannot be changed.

filter_func

Y

comprehensive_mining

Set the hard example filtering algorithm to comprehensive_mining. The parameter value cannot be changed.

checkpoint_path

Y

/home/work/user-job-dir/data_filter/resnet_v1_50

Model directory used for feature extraction. Only the resnet_v1_50 model pre-trained on ImageNet is supported. The parameter value cannot be changed.

model_serving_url

N

None

Inference model path, that is, the output path of a training job. This model is used for inference after data augmentation in the aug_consistent_mining algorithm.

Enter an existing OBS directory, for example, obs://obs_bucket_name/folder_name/.

train_data_path

N

None

Training dataset, that is, the data used to train the model specified by model_serving_url. Specify the manifest file generated for the dataset version.

Enter the path of an existing manifest file in OBS, for example, obs://obs_bucket_name/folder_name/v001.manifest.

comprehensive_algo_config

N

clustering_mining:0.2020+aug_consistent_mining:0.4265+feature_distribution_mining:0.0451+sequential_mining:0.425+image_similarity_mining:0.0949+predict_score_mining:0.3900+anomaly_detection_mining:0.2020

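Algorithms to use and their weights, written as algorithm:weight pairs joined by plus signs (+). The default value uses the weights that performed best in the system's experiments. You can specify a subset of the algorithms or different weights to suit your data.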

Example: predict_score_mining:0.3900+anomaly_detection_mining:0.2020

algo_hard_threshold

N

0.1

Threshold of the filtering coefficient. The value ranges from 0 to 1.

If the threshold is set too high, the task may output no samples. Set this parameter to a value appropriate for your data.

aug_op_config

N

crop:0.1+fliplr:0.1+gaussianblur:0.1

Data augmentation methods used in the aug_consistent_mining algorithm. The supported methods are crop, fliplr, gaussianblur, flipud, scale, translate, shear, superpixels, sharpen, add, and invert.

feature_op_config

N

image_aspect_ratio:0.5+image_brightness:1.0+image_saturation:0.5+image_resolution:0.5+image_colorfulness:0.5+ambiguity:1.0+bbox_num:1.0+bbox_iou:1.0+bbox_std:0.5+bbox_bright:0.5+bbox_ambiguity:0.5+bbox_aspect_ratio:1.0+bbox_area_ratio:0.5+bbox_edge_value:0.5

Features used in the feature_distribution_mining algorithm and their weights. The weights can be modified.

score_threshold_up

N

0.6

Maximum confidence value defined in the predict_score_mining algorithm. The value ranges from 0 to 1.

score_threshold_low

N

0.3

Minimum confidence value defined in the predict_score_mining algorithm. The value ranges from 0 to 1.

margin

N

0.8

Difference between the two highest confidence scores of a sample. If the difference is greater than this parameter value, this sample is a hard example. The value ranges from 0 to 1. The default value is 0.8.

similarity_sample_ratio

N

1.0

Similarity ratio in the image_similarity_mining algorithm. The value ranges from 0 to 1. The default value is 1.0.

task_summary_file

N

None

Output path and file name of the simplified algorithm log. Enter a path in an existing OBS directory that starts with obs://. The file name can be customized.

Example: obs://obs_bucket_name/folder_name/xxx.log

output_dataset_type

N

manifest

The options are as follows:

  • directory: outputs the raw image and label to the Data folder in the result directory.
  • manifest: only outputs the manifest file.

This parameter is automatically filled on the data processing page based on your configurations.
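
The comprehensive_algo_config, aug_op_config, and feature_op_config parameters all take a string of name:value pairs joined by plus signs. The following Python snippet is a minimal sketch for building and checking such strings before submitting a task. The helper names parse_weight_config and format_weight_config are hypothetical and are not part of ModelArts, and the set of algorithm names is simply copied from the default value listed above.

```python
def parse_weight_config(config: str) -> dict[str, float]:
    """Parse 'name:value+name:value' into a {name: value} dict."""
    result: dict[str, float] = {}
    for item in config.split("+"):
        name, sep, value = item.strip().partition(":")
        if not sep or not name:
            raise ValueError(f"expected 'name:value', got {item!r}")
        if name in result:
            raise ValueError(f"duplicate entry: {name}")
        result[name] = float(value)
    return result


def format_weight_config(weights: dict[str, float]) -> str:
    """Serialize a {name: value} dict back into 'name:value+name:value'."""
    return "+".join(f"{name}:{value}" for name, value in weights.items())


# Algorithm names copied from the default value of comprehensive_algo_config.
KNOWN_ALGORITHMS = {
    "clustering_mining",
    "aug_consistent_mining",
    "feature_distribution_mining",
    "sequential_mining",
    "image_similarity_mining",
    "predict_score_mining",
    "anomaly_detection_mining",
}

config = parse_weight_config(
    "predict_score_mining:0.3900+anomaly_detection_mining:0.2020"
)
unknown = set(config) - KNOWN_ALGORITHMS  # empty when every name is recognized
print(config)
print(format_weight_config({"crop": 0.1, "fliplr": 0.1}))
```

Checking the names locally catches typos such as a misspelled algorithm before the task runs. How the service itself treats unknown names is not covered here.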

Operator Input Requirements

The following two types of operator input are available:

  • Datasets: Select a dataset and its version created on the ModelArts console from the drop-down list. Ensure that the dataset type is the same as the scenario type selected in this task.
  • OBS Catalog: The directory must contain the raw images for inference and the inference result files.

    The directory structure is as follows:

    input_path/
        --images/                 # The folder name must be images.
            ----1.jpg
            ----2.jpg
        --inference_results/      # The folder name must be inference_results.
            ----1.jpg_result.txt
            ----2.jpg_result.txt

    Each .txt inference result file must use one of the following formats. If you use a model trained by a ModelArts built-in algorithm for inference, the default inference result already meets these requirements.

    • Image classification
      {
          "predicted_label": "dog",
          "scores": [
              [
                  "dog",
                  "0.589"
              ],
              [
                  "cat",
                  "0.411"
              ]
          ]
      }
    • Object detection
      {
          "detection_classes": [
              "cat",
              "cat"
          ],
          "detection_boxes": [
              [
                  117.56356048583984,
                  335.9902648925781,
                  270.50848388671875,
                  469.0136413574219
              ],
              [
                  18.747316360473633,
                  13.10757064819336,
                  217.25146484375,
                  108.3551025390625
              ]
          ],
          "detection_scores": [
              0.5179755091667175,
              0.46941104531288147
          ]
      }
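
The OBS Catalog requirements above can be checked on a local copy of the directory before the data is uploaded. The following Python snippet is a sketch under that assumption: it works on a local folder rather than on OBS, and the function names missing_results and result_kind are hypothetical. It reports images without a matching result file and identifies which of the two result formats a parsed file uses.

```python
import json
import tempfile
from pathlib import Path


def missing_results(input_path: str) -> list[str]:
    """Return names of images that lack a matching <image>_result.txt file."""
    root = Path(input_path)
    results = root / "inference_results"
    return sorted(
        image.name
        for image in (root / "images").iterdir()
        if image.is_file() and not (results / f"{image.name}_result.txt").is_file()
    )


def result_kind(result: dict) -> str:
    """Return 'image_classification' or 'object_detection' for a parsed result."""
    if "predicted_label" in result and "scores" in result:
        for _label, score in result["scores"]:
            float(score)  # scores are strings such as "0.589" in the example
        return "image_classification"
    keys = ("detection_classes", "detection_boxes", "detection_scores")
    if all(key in result for key in keys):
        if len({len(result[key]) for key in keys}) != 1:
            raise ValueError("detection lists must have the same length")
        if any(len(box) != 4 for box in result["detection_boxes"]):
            raise ValueError("each detection box must have four values")
        return "object_detection"
    raise ValueError("unrecognized inference result format")


sample = json.loads(
    '{"predicted_label": "dog", "scores": [["dog", "0.589"], ["cat", "0.411"]]}'
)
kind = result_kind(sample)

# Build a throwaway directory in which 2.jpg has no result file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "images").mkdir()
    (Path(tmp) / "inference_results").mkdir()
    (Path(tmp) / "images" / "1.jpg").touch()
    (Path(tmp) / "images" / "2.jpg").touch()
    result_file = Path(tmp) / "inference_results" / "1.jpg_result.txt"
    result_file.write_text(json.dumps(sample))
    missing = missing_results(tmp)

print(kind)     # image_classification
print(missing)  # ['2.jpg']
```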

Output Description

  • Object detection

    The output directory structure is as follows:

    output_path :
        --Data
            ----1.jpg
            ----1.xml     # Export the filtering result to this directory.
        --output.manifest

    A manifest file example is as follows:

    {"source":"/tmp/test_out/object_detection/images/be462ea9c5abc09f.jpg",
    "hard":"True",
    "hard-reasons":"0", # Reason why the sample is determined as a hard example. The specific reason is displayed only in the auto labeling module.
    "hard-coefficient":"1.0", # Hard example coefficient obtained using the hard example algorithm. A larger value indicates a higher probability that the sample is a hard example.
    "annotation":[
    {"annotation-loc":"/tmp/test_out/object_detection/annotations/be462ea9c5abc09f.xml",
    "type":"modelarts/object_detection",
    "annotation-format":"PASCAL VOC",
    "annotated-by":"modelarts/hard_example_algo"}]}
  • Image classification

    The output directory structure is as follows:

    output_path :
        --Data
            ----class1
                ------1.jpg
            ----class2
                ------2.jpg
        --output.manifest

    A manifest file example is as follows:

    {"source":"obs://obs_bucket_name/folder_name/catDog/5.jpg",
    "hard":true,
    "hard-reasons":"1-20-2-19-21-3",
    "hard-coefficient":1.0,
    "annotation":[
    {"name":"cat",
    "type":"modelarts/image_classification",
    "confidence":0.599,
    "annotated-by":"modelarts/hard_example_algo"}]}
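
To review the selected samples, the output manifest can be read and ranked by hard-coefficient. The following Python snippet is a sketch that assumes the manifest stores one JSON object per line (the examples above are wrapped for readability) and that the inline # comments do not appear in the real file. Because the two examples above show the hard and hard-coefficient fields both as strings and as native JSON values, the sketch accepts either form. The function names are hypothetical, and the 6.jpg and 7.jpg records are made-up illustration data.

```python
import json


def as_bool(value) -> bool:
    """Accept both JSON booleans and strings such as "True"."""
    if isinstance(value, str):
        return value.strip().lower() == "true"
    return bool(value)


def hard_examples(manifest_text: str) -> list[tuple[str, float]]:
    """Return (source, hard-coefficient) pairs for hard samples, hardest first."""
    rows = []
    for line in manifest_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if as_bool(record.get("hard", False)):
            coefficient = float(record.get("hard-coefficient", 0.0))
            rows.append((record["source"], coefficient))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical manifest content; only the 5.jpg path comes from the example above.
manifest_text = "\n".join([
    '{"source": "obs://obs_bucket_name/folder_name/catDog/5.jpg",'
    ' "hard": true, "hard-coefficient": 1.0}',
    '{"source": "obs://obs_bucket_name/folder_name/catDog/6.jpg",'
    ' "hard": "True", "hard-coefficient": "0.4"}',
    '{"source": "obs://obs_bucket_name/folder_name/catDog/7.jpg",'
    ' "hard": false, "hard-coefficient": 0.05}',
])

ranked = hard_examples(manifest_text)
for source, coefficient in ranked:
    print(f"{coefficient:.2f}  {source}")
```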

Log File Description

task_summary_file specifies the output path of the simplified log file. The file content is as follows:

{
"task_status": "SUCCEED", # Algorithm execution status
"total_sample": integer,  # Total number of input samples
"hard_sample": integer    # Total number of output samples
}

or

{
"task_status": "FAILED",
"error_message": "xxxxxx" # Error message that explains why the algorithm execution failed
}
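
A script that runs after the filtering task can read this file to decide whether to continue. The following Python snippet is a sketch that assumes the file has already been downloaded from OBS and that its content is valid JSON with the fields shown above. The summarize function is hypothetical, and the sample counts are made-up illustration values.

```python
import json


def summarize(summary_text: str) -> str:
    """Return a one-line report, or raise if the task did not succeed."""
    summary = json.loads(summary_text)
    if summary["task_status"] != "SUCCEED":
        raise RuntimeError(summary.get("error_message", "unknown error"))
    total = summary["total_sample"]
    hard = summary["hard_sample"]
    ratio = hard / total if total else 0.0  # avoid division by zero
    return f"{hard} of {total} samples selected ({ratio:.1%})"


# Hypothetical log content for a successful run.
report = summarize(
    '{"task_status": "SUCCEED", "total_sample": 200, "hard_sample": 25}'
)
print(report)  # 25 of 200 samples selected (12.5%)
```

If the share of selected samples is zero or close to zero, lowering algo_hard_threshold is one setting to revisit, as noted in the parameter description.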