Collecting Data

ModelArts provides an automatic hard example identification function that filters hard examples from the inference data input to an existing model based on built-in rules. This improves model precision and effectively reduces the labeling workload required for a model update. The function mines as much precision-improving data as possible; you only need to confirm and label the useful data and add it to a training dataset to obtain a model with higher precision after retraining.

You can input data to a model deployed as a real-time service by calling its URL or using the console. Then, use the data collection function to collect the data or filter out hard examples, and output them to a dataset for future model training.

For real-time services, data collection involves the following scenarios, as shown in Figure 1.

  • Data Collection: Enable a data collection task to collect and store data generated when a real-time service is invoked based on the configured rules.
  • Synchronizing Data to a Dataset: Synchronize the collected data to a dataset for unified management and application.
  • Data Collection and Hard Example Filtering: Enable the hard example filtering function in addition to the data collection task to filter hard examples from the collected data using built-in algorithms. Finally, store hard examples and the collected data in a corresponding dataset for retraining.
  • Hard Example Feedback: When calling a real-time service for prediction, report inaccurately predicted image data as hard examples and store them to the corresponding dataset.
Figure 1 Data collection for real-time services

Prerequisites

  • A trained model has been deployed as a real-time service, and the real-time service is in the Running status.
  • The real-time service is of a supported type. Data collection and hard example filtering are available only for the object detection and image classification types.

Data Collection

You can enable a data collection task when deploying a model as a real-time service or on the service details page after the real-time service is deployed. If only a data collection task is enabled, data generated during service invoking is merely collected and stored to OBS. If you want to filter hard examples, follow instructions in Data Collection and Hard Example Filtering. If you want to synchronize the collected data to a dataset and do not require hard example filtering, follow instructions in Synchronizing Data to a Dataset.

  1. Log in to the ModelArts management console and choose Service Deployment > Real-Time Services.
  2. Enable a data collection task.
    • When deploying a model as a real-time service, enable Data Collection on the Deploy page.
      Figure 2 Enabling the data collection function on the Deploy page
    • After a real-time service is deployed, click the service name to go to the service details page. In the Sample area, enable a data collection task.
      Figure 3 Enabling the data collection function on the details page
  3. Set parameters related to the data collection task. Table 1 describes the parameters.
    Table 1 Data collection parameters

    • Sample Collection Rule: Possible values are Full collection and By confidence score. Currently, only Full collection is available.
    • Sample Output Path: Path for storing the collected data. Only OBS paths are supported. Select an existing OBS path or create one.
    • Retention Period: Possible values are 1 day, 1 week, Permanent, and Custom.
      • 1 day: Only data generated during service running within one day is collected.
      • 1 week: Only data generated during service running within one week is collected.
      • Permanent: All data generated after the service is started is collected.
      • Custom: Can only be set to X days, indicating that data generated during service running within X days is collected.
    Figure 4 Data collection configuration

    After data collection is enabled, the uploaded data is collected to the corresponding OBS path based on the configured rules when the service is invoked for prediction either through the console or URL APIs.
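
    As a rough illustration of invoking the service through its URL, the sketch below posts a base64-encoded image to a real-time service endpoint. The endpoint URL, the X-Auth-Token authentication header, and the "images" field name are assumptions standing in for your own service's configuration; the helper simply picks the top detection from a response in the object detection format.

    ```python
    import base64
    import json
    import urllib.request

    # Placeholder endpoint; copy the real API URL from your service details page.
    SERVICE_URL = "https://<modelarts-endpoint>/v1/infers/<service-id>"

    def predict(image_path, token):
        """Send one image to the real-time service for prediction.
        The JSON input schema ("images" + base64) is an assumption; use the
        input definition of your own model. Collected samples are stored to
        the configured OBS path by the data collection task."""
        with open(image_path, "rb") as f:
            body = json.dumps(
                {"images": base64.b64encode(f.read()).decode()}).encode()
        req = urllib.request.Request(
            SERVICE_URL, data=body,
            headers={"X-Auth-Token": token, "Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    def top_detection(result):
        """Return the highest-confidence (label, score) pair from an
        object detection response."""
        pairs = zip(result["detection_classes"], result["detection_scores"])
        return max(pairs, key=lambda p: p[1])
    ```

    For example, a response containing detection_classes ["cat", "dog"] with detection_scores [0.4, 0.9] would yield ("dog", 0.9) from top_detection.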

Synchronizing Data to a Dataset

For real-time services with data collection enabled, the collected data can be synchronized to a dataset. The synchronization operation merely stores the collected data to the dataset without filtering hard examples. You can select an existing dataset or create one to store the data.

  1. Enable a data collection task. For details, see Data Collection.

    If no data has been collected in the data collection task (for example, because the service has not been invoked for prediction), the data cannot be synchronized to a dataset.

  2. Click the service name to go to the service details page. In the Synchronize Data area, click Synchronize Data to Dataset.
    Figure 5 Synchronizing data to a dataset
  3. In the displayed dialog box, select a labeling type and a dataset, and click OK to synchronize the collected data to the dataset. The synchronized data will be displayed on the Unlabeled tab page of the dataset.

    Data to be synchronized is the data collected by the system based on the rules configured in the data collection task. If no data has been collected, synchronization cannot be performed.

    Figure 6 Synchronizing data to a dataset

Data Collection and Hard Example Filtering

If you only enable a data collection task, hard examples cannot be automatically identified. To filter hard examples from the collected data and store filtering results to a dataset, you need to enable both data collection and hard example filtering tasks.

The hard example filtering function has requirements on the prediction output format, which varies depending on the model source.

  • For models trained by ExeML or by built-in algorithms, no modification is needed. Their prediction output formats are built into the system and already meet the requirements of hard example filtering.
  • For models you trained yourself, ensure that the output format in the inference code meets the requirements. The requirements for object detection differ from those for image classification; for details, see Prediction Output Format Requirements. For example, if you use a frequently-used framework or a custom image to train a model, the prediction output format must meet the requirements of the corresponding type when you import the model and compile the inference code.
  1. Enable a data collection task. For details, see Data Collection.

    A data collection task must be enabled before hard example filtering. If a data collection task was previously enabled for a real-time service and its data is still stored in the OBS path, you can enable only the hard example filtering function. In this case, hard examples are filtered only from the data stored in the OBS path.

  2. Enable hard example filtering on the same page where you configure the data collection task. For details about the parameters, see Table 2.
    Table 2 Hard example filtering parameters

    • Model Type: Model application type. Currently, only Image classification and Object detection are supported.
    • Training Dataset: A model is trained based on a dataset and can then be deployed as a real-time service. When filtering hard examples, you can import the dataset corresponding to the real-time service to find data problems underlying the model.
      The model training and deployment process is as follows: input a training script and a dataset > train to obtain a model > deploy the model as a real-time service.
      This parameter is optional. You are advised to import the dataset to improve the precision of hard example filtering. If your dataset is not managed on ModelArts, see Creating a Dataset.
    • Filtering Policy: The options are By duration and By sample quantity.
      • By duration: Filters the not-yet-filtered data stored in the OBS path by duration. Possible values are 1 hour, 1 day, 2 days, and Custom. The value of Custom can only be XX hours.
        NOTE: The filtering duration must not exceed the Retention Period specified for data collection. For example, if Retention Period is set to 1 day, By duration must be set to a value equal to or less than 1 day. If the duration is greater than the retention period, the system filters only the data within the retention period.
      • By sample quantity: When the collected data reaches the configured sample quantity, the system performs hard example filtering once. Possible values are 100, 500, 1000, and Custom. If fewer samples than the configured quantity are collected within the data collection period, hard example filtering is not triggered. For example, if Retention Period is set to 1 day and only 100 images are generated within one day, hard example filtering is not triggered when By sample quantity is set to 500. Because OBS deletes data stored for longer than the retention period, the sample count cannot keep growing and the filtering criterion may never be met. Therefore, evaluate the service invoking volume and set the sample quantity accordingly.
    • Hard Example Output: Dataset for storing the filtered hard example data. You can select an existing dataset or create one. The dataset type must match the model type. For example, if the model type is image classification, hard examples must be output to an image classification dataset.

    Figure 7 Enabling hard example filtering
  3. After data collection and hard example filtering tasks are configured, the system collects data and filters hard examples based on the configured rules. You can view the task status on the Filter tab page of the real-time service. After the task is complete, its status changes to Dataset imported. You can click the dataset link to quickly access the corresponding dataset. The collected data is stored on the Unlabeled tab page. The filtered hard examples are stored on the To Be Confirmed tab page of the dataset.
    Figure 8 Task status
    Figure 9 Hard example filtering result

Hard Example Feedback

On the ModelArts management console, if the prediction result of a real-time service is inaccurate, you can directly report it as a hard example to the corresponding dataset on the Prediction tab page.

  1. Log in to the ModelArts management console and choose Service Deployment > Real-Time Services. Click the service name to go to the service details page.
  2. Click the Prediction tab, upload the image for prediction, and click Predict.
  3. If the prediction result is inaccurate, click Feed Back.
    Figure 10 Hard example feedback for a real-time service
  4. In the displayed dialog box, select a labeling type and a dataset, and click OK to report the hard example to the dataset. The hard example will be displayed on the Unlabeled tab page of the dataset. This helps improve model training precision.
    Figure 11 Hard example feedback

Prediction Output Format Requirements

For a custom model, infer_output in the inference code, that is, the JSON body returned by the inference engine, must be in the same format as the following examples.

  • Object detection

    The prediction output format is as follows:

    {
      "detection_classes": [
        "<label-name-1>",
        "<label-name-2>"
      ],
      "detection_boxes": [
        [
          <y_min>,
          <x_min>,
          <y_max>,
          <x_max>
        ],
        [
          <y_min>,
          <x_min>,
          <y_max>,
          <x_max>
        ]
      ],
      "detection_scores": [
        <label-1-score>,
        <label-2-score>
      ]
    }
    Table 3 Parameters in the prediction result

    • detection_classes: Label of each detection box.
    • detection_boxes: Coordinates of the four points (y_min, x_min, y_max, and x_max) of each detection box, as shown in Figure 12.
    • detection_scores: Confidence of each detection box.

    Figure 12 Illustration for coordinates of four points of a detection box
  • Image classification
    The prediction output format is as follows:
    {
      "predicted_label": "<label-name-1>",
      "scores": [
        [
          "<label-name-1>",
          "<label-1-score>"
        ],
        [
          "<label-name-2>",
          "<label-2-score>"
        ]
      ]
    }
    Table 4 Parameters in the prediction result

    • predicted_label: Image prediction label.
    • scores: Prediction confidence of the top 5 labels.
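
As a minimal sketch (not ModelArts's own code), custom inference code can assemble and sanity-check its return value against the two formats above. The helper names below are hypothetical; only the dictionary keys come from the documented formats.

```python
def detection_result(classes, boxes, scores):
    """Assemble an object detection response in the documented format.
    Each box is a [y_min, x_min, y_max, x_max] list, one per detected object."""
    assert len(classes) == len(boxes) == len(scores)
    return {
        "detection_classes": list(classes),
        "detection_boxes": [list(b) for b in boxes],
        "detection_scores": list(scores),
    }

def classification_result(label_scores):
    """Assemble an image classification response from (label, score) pairs.
    Scores are serialized as strings, matching the documented example;
    only the top 5 labels by confidence are kept."""
    top = sorted(label_scores, key=lambda p: p[1], reverse=True)[:5]
    return {
        "predicted_label": top[0][0],
        "scores": [[label, str(score)] for label, score in top],
    }
```

For example, classification_result([("cat", 0.2), ("dog", 0.7)]) sets predicted_label to "dog" and lists the two labels in descending confidence order.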