Submitting Feature Engineering Jobs

Function

This API is used to submit feature engineering jobs, including data preprocessing, feature extraction, and the generation of ranking training samples.

URI

POST /v1/{project_id}/etl-job

Table 1 describes the URI parameters.

Table 1 URI parameters

Parameter

Mandatory

Type

Description

project_id

Yes

String

Project ID, which is used for resource isolation. For details about how to obtain the project ID, see Obtaining a Project ID.
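As a minimal sketch of how the URI above is assembled, the following Python helper substitutes project_id into the path. The endpoint host is a placeholder and the helper name is hypothetical; neither comes from this API reference.

```python
from urllib.parse import quote

# Hypothetical endpoint host; replace it with the real service endpoint.
ENDPOINT = "https://res.example.com"


def etl_job_url(project_id: str, endpoint: str = ENDPOINT) -> str:
    """Build the URL for POST /v1/{project_id}/etl-job."""
    if not project_id:
        raise ValueError("project_id is mandatory")
    # Percent-encode the ID so it stays a single path segment.
    return f"{endpoint.rstrip('/')}/v1/{quote(project_id, safe='')}/etl-job"
```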

Request

Table 2 describes the request parameters.

Table 2 Request parameters

Parameter

Mandatory

Type

Description

workspace_id

No

String

Workspace ID. The default value is 0.

job_name

Yes

String

Training job name. The value can contain a maximum of 20 characters.

job_description

No

String

Training job description. The value can contain a maximum of 256 characters.

algorithm_type

Yes

String

Algorithm type. The options are as follows:

  • INITIAL_PROFILES_GENERATION
  • BUILD_RANK_UNIFORM_DATA_FROM_JSON

algorithm_parameters

Yes

JSON

Algorithm parameters. Each algorithm type has its own parameters.

  • Table 8 describes the parameters of INITIAL_PROFILES_GENERATION.
  • Table 9 describes the parameters of BUILD_RANK_UNIFORM_DATA_FROM_JSON.

data_source

Yes

List

Algorithm data source. For details, see Table 5.

  • INITIAL_PROFILES_GENERATION: Select the general template data as the data source.
  • BUILD_RANK_UNIFORM_DATA_FROM_JSON: Select the general data as the data source.

storage

Yes

JSON

Storage platform. For details, see Table 6.

offline_platform

Yes

JSON

Offline computing platform. For details, see Table 3.
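The constraints in Table 2 can be checked on the client before a job is submitted. The following Python sketch covers only the rules stated in that table (mandatory fields, length limits, and the two algorithm types). The function name is hypothetical, and the service may apply further checks that are not described here.

```python
REQUIRED_FIELDS = (
    "job_name", "algorithm_type", "algorithm_parameters",
    "data_source", "storage", "offline_platform",
)
ALGORITHM_TYPES = (
    "INITIAL_PROFILES_GENERATION",
    "BUILD_RANK_UNIFORM_DATA_FROM_JSON",
)


def check_request_body(body: dict) -> list:
    """Return a list of problems found in the top-level request body."""
    problems = [
        f"missing mandatory field: {name}"
        for name in REQUIRED_FIELDS
        if name not in body
    ]
    if len(body.get("job_name", "")) > 20:
        problems.append("job_name exceeds 20 characters")
    if len(body.get("job_description", "")) > 256:
        problems.append("job_description exceeds 256 characters")
    algorithm_type = body.get("algorithm_type")
    if algorithm_type is not None and algorithm_type not in ALGORITHM_TYPES:
        problems.append(f"unknown algorithm_type: {algorithm_type}")
    return problems
```

An empty list means that the body satisfies the Table 2 rules; nested objects such as data_source are not inspected by this sketch.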

Table 3 offline_platform parameters

Parameter

Mandatory

Type

Description

platform

Yes

String

Platform name. The value can contain a maximum of 64 characters. Currently, only DLI is supported.

platform_parameter

Yes

JSON

Platform parameter. For details, see Table 4.

computing_resource

No

String

Resource specifications required to run the DLI jobs.

config_load_path

Yes

String

Path to read the configuration sources.

Table 4 platform_parameter parameters

Parameter

Mandatory

Type

Description

cluster_name

Yes

String

Cluster name

cluster_id

No

String

Cluster ID

Table 5 data_source parameters

Parameter

Mandatory

Type

Description

table_type_id

Yes

String

General data templates:

  • USER_META: User feature list
  • ITEM_META: Item feature list
  • USER_BEHAVIOR: User behavior list

For details about the data format, see Offline Data Sources.

General format

  • GENERAL_FORMAT

data_source_url

Yes

String

Data source path. The value can contain a maximum of 1000 characters.

data_format

Yes

String

Input data format. The value can be csv, parquet, json, or orc.

data_param

No

JSON

Data parameter. For details, see Table 7. This parameter is mandatory when the data format is csv and optional for other data formats.

start_time

No

String

Start time for collecting the source data. This parameter is mandatory when the data format is json and optional for other data formats.

end_time

No

String

End time for collecting the source data. This parameter is mandatory when the data format is json and optional for other data formats.
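The conditional rules in Table 5 (data_param is mandatory for csv, and start_time and end_time are mandatory for json) are easy to overlook. A minimal Python sketch of a client-side check follows. The function name is hypothetical, and the check reflects only what Table 5 states.

```python
DATA_FORMATS = {"csv", "parquet", "json", "orc"}


def check_data_source(entry: dict) -> list:
    """Return a list of problems found in one data_source entry."""
    problems = []
    for name in ("table_type_id", "data_source_url", "data_format"):
        if name not in entry:
            problems.append(f"missing mandatory field: {name}")
    data_format = entry.get("data_format")
    if data_format is not None and data_format not in DATA_FORMATS:
        problems.append(f"unsupported data_format: {data_format}")
    if len(entry.get("data_source_url", "")) > 1000:
        problems.append("data_source_url exceeds 1000 characters")
    # Conditional rules from Table 5.
    if data_format == "csv" and "data_param" not in entry:
        problems.append("data_param is mandatory for csv")
    if data_format == "json":
        for name in ("start_time", "end_time"):
            if name not in entry:
                problems.append(f"{name} is mandatory for json")
    return problems
```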

Table 6 storage parameters

Parameter

Mandatory

Type

Description

user_profiles_table

No

JSON

User attribute storage table. For details, see Table 8.

This parameter is mandatory when algorithm_type is set to INITIAL_PROFILES_GENERATION.

item_profiles_table

No

JSON

Item attribute storage table. For details, see Table 8.

This parameter is mandatory when algorithm_type is set to INITIAL_PROFILES_GENERATION.

Table 7 data_param parameters

Parameter

Mandatory

Type

Description

header

Yes

Boolean

Whether to display the table header

delimiter

Yes

String

Delimiter. The value can contain a maximum of 10 characters.

quote

Yes

String

Quotation character. The value can contain a maximum of 10 characters.

escape

Yes

String

Escape character. The value can contain a maximum of 10 characters.
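A data_param object for a csv data source could be assembled as in the following Python sketch. The helper name and the sample delimiter, quote, and escape values are illustrative assumptions; only the field names and the 10-character limits come from Table 7.

```python
def csv_data_param(header: bool, delimiter: str, quote: str, escape: str) -> dict:
    """Build a data_param object, enforcing the 10-character limits of Table 7."""
    for name, value in (("delimiter", delimiter), ("quote", quote), ("escape", escape)):
        if len(value) > 10:
            raise ValueError(f"{name} exceeds 10 characters")
    return {"header": header, "delimiter": delimiter, "quote": quote, "escape": escape}


# Illustrative values only; use the characters that match your CSV files.
param = csv_data_param(header=True, delimiter=",", quote='"', escape="\\")
```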

Table 8 algorithm_parameters parameters (INITIAL_PROFILES_GENERATION operator)

Parameter

Mandatory

Type

Description

result_path

Yes

String

Path or folder that stores all output data (user and item attributes, feature maps, field features, training sets, and test sets).

global_features_information_path

Yes

String

Global feature file (JSON) that contains the feature names, feature types, and feature value types. For details about the global feature file, see Viewing Global Feature File Configurations.

writer_parameters

No

JSON

Advanced settings. For details, see Table 10.

Table 9 algorithm_parameters parameters (BUILD_RANK_UNIFORM_DATA_FROM_JSON operator)

Parameter

Mandatory

Type

Description

result_path

Yes

String

Path or folder that stores all output data (user and item attributes, feature maps, field features, training sets, and test sets).

global_features_information_path

Yes

String

Global feature file (JSON) that contains the feature names, feature types, and feature value types. For details about the global feature file, see Viewing Global Feature File Configurations.

rank_etl_type

Yes

Enum

Operator type for processing ranking data.

Each ranking algorithm requires specific data processing, so select the processing type that matches the ranking algorithm you use.

Data processing results of the LR, FM, FFM, DeepFM, and PIN algorithms can be shared.

rank_etl_parameters

Yes

JSON

Data preprocessing parameter of the ranking algorithm. For details, see Table 11.

Table 10 writer_parameters parameters

Parameter

Mandatory

Type

Description

save_mode

No

String

Mode of retaining the existing wide table data in the result save path.

  • New: No existing data is retained.
  • Append: All existing data is retained.
  • Overwrite: Data of the same date is overwritten and data of different dates is retained.

Table 11 rank_etl_parameters parameters (LR, FM, FFM, DeepFM, and PIN)

Parameter

Mandatory

Type

Description

(divide_by_time_or_rate)

Yes

String

Whether the training set and the test set are split by time or by rate.

The value can be TIME or RATE.

(training_data_start_time)

No

Long

Start time of training data.

This parameter is mandatory when divide_by_time_or_rate is set to TIME. The value must be less than both the maximum time in the behavior data and the value of training_data_end_time. For example, 1541987933.

(training_data_end_time)

No

Long

End time of training data.

This parameter is mandatory when divide_by_time_or_rate is set to TIME. The value must be less than the maximum time in the behavior data and greater than the value of training_data_start_time. For example, 1541987933.

(test_data_start_time)

No

Long

Start time of test data.

This parameter is mandatory when divide_by_time_or_rate is set to TIME. The value must be less than both the maximum time in the behavior data and the value of test_data_end_time. For example, 1541987933.

(test_data_end_time)

No

Long

End time of test data.

This parameter is mandatory when divide_by_time_or_rate is set to TIME. The value must be less than the maximum time in the behavior data and greater than the value of test_data_start_time. For example, 1541987933.

(training_data_rate)

No

Double

Proportion of training data in the input data. This parameter is mandatory when divide_by_time_or_rate is set to RATE. The value ranges from 0 to 1.

(test_data_rate)

No

Double

Proportion of test data in the input data. This parameter is mandatory when divide_by_time_or_rate is set to RATE. The value ranges from 0 to 1.

(user_features)

Yes

JSONArray

Input user features extracted from the global feature file. After being processed, they can be used for ranking model training.

The features must be defined in the User Attribute Configuration Table. Example:

    [
      {
        "feature_name": "age",
        "feature_type": "BASIC_INFO",
        "feature_value_type": "numerical",
        "feature_process_parameters": {
          "discrete_method": "equal_distance_discrete",
          "lower_limit": 0.0,
          "upper_limit": 120.0,
          "distance": 20
        }
      },
      {
        "feature_name": "user_tag",
        "feature_type": "TAGS",
        "feature_value_type": "map",
        "feature_process_parameters": {
          "value_preserve_number": 4
        }
      }
    ]

(item_features)

Yes

JSONArray

Input item features extracted from the global feature file. After being processed, they can be used for ranking model training. The features must be defined in the Item Attribute Configuration List. Example:

    [
      {
        "feature_name": "product_name",
        "feature_type": "BASIC_INFO",
        "feature_value_type": "string",
        "feature_process_parameters": {}
      },
      {
        "feature_name": "categories",
        "feature_type": "BASIC_INFO",
        "feature_value_type": "strArray",
        "feature_process_parameters": {
          "value_preserve_number": 3
        }
      }
    ]

(positive_behaviors)

Yes

List[String]

Behaviors that are converted into positive samples in the ranking data. Each value must be the same as a value of actionType in the User Operation Behavior Table. For example, [click,collect,purchase,share].

(negative_behaviors)

Yes

List[String]

Behaviors that are converted into negative samples in the ranking data. Each value must be the same as a value of actionType in the User Operation Behavior Table. For example, [view,dislike].
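The TIME and RATE rules in Table 11 can be summarized as "start < end < maximum behavior time" for each of the two sets, or two rates between 0 and 1. The following Python sketch checks those rules on the client. It assumes that the time and rate values are numbers (Long and Double, as the table lists them) and that the caller already knows the maximum time in the behavior data. The function name is hypothetical.

```python
def check_split(params: dict, max_behavior_time: int) -> list:
    """Return a list of problems with the train/test split settings."""
    problems = []
    mode = params.get("divide_by_time_or_rate")
    if mode == "TIME":
        for prefix in ("training", "test"):
            start = params.get(f"{prefix}_data_start_time")
            end = params.get(f"{prefix}_data_end_time")
            if start is None or end is None:
                problems.append(f"{prefix} start and end times are mandatory for TIME")
                continue
            if not start < end:
                problems.append(f"{prefix} start time must be less than end time")
            if not end < max_behavior_time:
                problems.append(f"{prefix} end time must be less than the maximum behavior time")
    elif mode == "RATE":
        for name in ("training_data_rate", "test_data_rate"):
            rate = params.get(name)
            if rate is None:
                problems.append(f"{name} is mandatory for RATE")
            elif not 0 <= rate <= 1:
                problems.append(f"{name} must be between 0 and 1")
    else:
        problems.append("divide_by_time_or_rate must be TIME or RATE")
    return problems
```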

Table 12 Features and their processing modes

Parameter

Mandatory

Type

Description

(feature_name)

Yes

String

Feature name

(feature_type)

Yes

String

User feature types:

  • BASIC_INFO
  • TAGS
  • CONTEXT

Item feature types:

  • BASIC_INFO
  • TAGS

(feature_value_type)

Yes

String

Feature value type. The options are as follows:

  • Single-value enumeration (string): Character string type. Each value is processed as a character string. Most feature values belong to this type.
  • Single-value number (numerical): Numerical type. Generally, feature values of this type need discretization to reduce feature dimensions.
  • Multi-value enumeration (strArray): strArray type. Each feature value has variable length, for example, features of commodity categories and user interests. The ranking preprocessing operator normalizes all feature values to a unified length for subsequent processing.
  • KV number (map): Map[String,Double] type. Each feature value is a variable-length key-value pair, for example, a user profile and an item profile. The ranking preprocessing operator normalizes all feature values to a unified length for subsequent processing.

(feature_process_parameters)

Yes

JSON

Each type of feature has a corresponding processing method whose parameters are provided by users. Example:

    {
      "discrete_method": "equal_distance_discrete",
      "lower_limit": 0.0,
      "upper_limit": 120.0,
      "distance": 20
    }
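Each entry in user_features and item_features follows Table 12. As a minimal sketch, the following Python function checks one entry against the feature types and feature value types listed in that table. The function name is hypothetical, and the contents of feature_process_parameters are not inspected.

```python
USER_FEATURE_TYPES = {"BASIC_INFO", "TAGS", "CONTEXT"}
ITEM_FEATURE_TYPES = {"BASIC_INFO", "TAGS"}
VALUE_TYPES = {"string", "numerical", "strArray", "map"}
FEATURE_FIELDS = (
    "feature_name", "feature_type",
    "feature_value_type", "feature_process_parameters",
)


def check_feature(feature: dict, is_user: bool) -> list:
    """Return a list of problems found in one feature entry."""
    problems = [
        f"missing mandatory field: {name}"
        for name in FEATURE_FIELDS
        if name not in feature
    ]
    allowed = USER_FEATURE_TYPES if is_user else ITEM_FEATURE_TYPES
    feature_type = feature.get("feature_type")
    if feature_type is not None and feature_type not in allowed:
        problems.append(f"feature_type not allowed: {feature_type}")
    value_type = feature.get("feature_value_type")
    if value_type is not None and value_type not in VALUE_TYPES:
        problems.append(f"unknown feature_value_type: {value_type}")
    return problems
```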

Table 13 Discrete methods and parameters

Parameter

Mandatory

Type

Description

(discrete_method)

(equal_distance_discrete)

(lower_limit)

No

Double

If the feature value is less than the value of this parameter, the value is regarded as abnormal.

You can specify this parameter based on business experience. If you do not specify this parameter, the minimum feature value in the data is used. The value range is [Double.MinValue, Double.MaxValue).

The value must be smaller than the value of upper_limit.

(upper_limit)

No

Double

If the feature value is greater than the value of this parameter, the value is regarded as abnormal.

You can specify this parameter based on business experience. If you do not specify this parameter, the maximum feature value in the data is used. The value range is (Double.MinValue, Double.MaxValue].

The value must be greater than the value of lower_limit.

(distance)

Yes

Double

The feature range is divided into segments of this width, and each segment corresponds to a discrete value. The value range is (0, Double.MaxValue).

(equal_frequency_discrete)

(lower_limit)

No

Double

If the feature value is less than the value of this parameter, the value is regarded as abnormal.

You can specify this parameter based on business experience. If you do not specify this parameter, the minimum feature value in the data is used. The value range is [Double.MinValue, Double.MaxValue).

The value must be smaller than the value of upper_limit.

(upper_limit)

No

Double

If the feature value is greater than the value of this parameter, the value is regarded as abnormal.

You can specify this parameter based on business experience. If you do not specify this parameter, the maximum feature value in the data is used. The value range is (Double.MinValue, Double.MaxValue].

The value must be greater than the value of lower_limit.

(frequency)

Yes

Int

The feature values are sorted in ascending order and split into segments, and each segment corresponds to a discrete value. The value range is (0, Int.MaxValue).

(user_define_discrete)

(period_list)

Yes

JSONArray

The minimum value, maximum value, and discrete value of each period are defined by users.

A feature value that falls within a period is assigned the discrete value of that period.

A feature value that does not fall within any user-defined period is treated as an abnormal value. Each period is a left-closed, right-open interval: the minimum value is included and the maximum value is excluded. Periods cannot overlap. Example:

    [
      {
        "period_name": "young",
        "lower_limit": 0.0,
        "upper_limit": 18.0
      },
      {
        "period_name": "mid",
        "lower_limit": 18.0,
        "upper_limit": 60.0
      },
      {
        "period_name": "old",
        "lower_limit": 60.0,
        "upper_limit": 120.0
      }
    ]
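To illustrate the discretization rules in Table 13, the following Python sketch maps a numerical feature value to a segment. It is an illustration of the described behavior, not the service's implementation. In particular, numbering equal-distance segments from 0 and returning None for abnormal values are assumptions, and both function names are hypothetical.

```python
import math
from typing import Optional


def equal_distance_bucket(value: float, lower_limit: float,
                          upper_limit: float, distance: float) -> Optional[int]:
    """Return the index of the fixed-width segment that contains value."""
    if distance <= 0:
        raise ValueError("distance must be greater than 0")
    # Values outside [lower_limit, upper_limit] are regarded as abnormal.
    if value < lower_limit or value > upper_limit:
        return None
    return math.floor((value - lower_limit) / distance)


def period_name(value: float, period_list: list) -> Optional[str]:
    """Return the name of the user-defined period that contains value."""
    for period in period_list:
        # Each period includes its lower_limit and excludes its upper_limit.
        if period["lower_limit"] <= value < period["upper_limit"]:
            return period["period_name"]
    return None  # Not within any period: abnormal value.


AGE_PERIODS = [
    {"period_name": "young", "lower_limit": 0.0, "upper_limit": 18.0},
    {"period_name": "mid", "lower_limit": 18.0, "upper_limit": 60.0},
    {"period_name": "old", "lower_limit": 60.0, "upper_limit": 120.0},
]
```

With these periods, a value of 18.0 belongs to "mid" rather than "young" because the upper limit of each period is excluded.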

Table 14 Custom discrete parameters

Parameter

Mandatory

Type

Description

(lower_limit)

Yes

Double

Minimum value of a period. The value ranges from Double.MinValue to Double.MaxValue.

The value must be smaller than the value of upper_limit.

(upper_limit)

Yes

Double

Maximum value of a period. The value ranges from Double.MinValue to Double.MaxValue.

The value must be greater than the value of lower_limit.

(period_name)

Yes

String

Name of a period

Table 15 strArray parameters

Parameter

Mandatory

Type

Description

(value_preserve_number)

No

Int

Number of strArray feature values to preserve. If a feature has more values than this number, the extra values are deleted. If it has fewer values, all values are retained. If this parameter is not specified, the maximum number of values of the strArray feature in the data is used. The value ranges from 1 to 100.
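The effect of value_preserve_number on a single strArray value can be sketched in Python as follows. Keeping the first values in their original order is an assumption, because this reference does not state which values count as extra, and the function name is hypothetical. Normalization of all values to a unified length is not shown.

```python
from typing import Optional


def preserve_values(values: list, value_preserve_number: Optional[int] = None) -> list:
    """Keep at most value_preserve_number entries of one strArray value."""
    if value_preserve_number is None:
        # Not specified: nothing is deleted from this value.
        return list(values)
    if not 1 <= value_preserve_number <= 100:
        raise ValueError("value_preserve_number must be between 1 and 100")
    # Assumption: the values beyond the first N are the "extra" ones.
    return list(values[:value_preserve_number])
```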

Table 16 KV number parameters

Parameter

Mandatory

Type

Description

(value_preserve_number)

No

Int

Number of KV number feature values to preserve. If a feature has more values than this number, the extra values are deleted. If it has fewer values, all values are retained. If this parameter is not specified, the maximum number of values of the KV number feature in the data is used. The value ranges from 1 to 100.

Response

Table 17 describes the response parameters.

Table 17 Response parameters

Parameter

Type

Description

job_name

String

Job name

job_id

String

Job ID

is_success

Boolean

Whether the request is successful

error_message

String

Error message that indicates a request has failed. This parameter is unavailable when a request is successful.

error_code

String

Error code that indicates a request has failed. This parameter is unavailable when a request is successful.

create_time

Long

Time when a job is created

etl_uuid

String

Candidate set ID
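A client typically branches on is_success when it reads the response. The following Python sketch returns the job identifiers on success and raises an error otherwise. Table 17 lists the error text as error_message, while the failed-response example in this section uses error_msg, so the sketch accepts either name. The function name is hypothetical.

```python
import json


def parse_submit_response(text: str) -> dict:
    """Return job_id and etl_uuid on success; raise RuntimeError on failure."""
    data = json.loads(text)
    if data.get("is_success"):
        return {"job_id": data["job_id"], "etl_uuid": data.get("etl_uuid")}
    # Accept both spellings of the error text field.
    message = data.get("error_message") or data.get("error_msg")
    raise RuntimeError(f"{data.get('error_code')}: {message}")
```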

Example

  • Example request
    {
      "job_name": "ETL-rank_test1",
      "job_description": "hhx test",
      "algorithm_type": "BUILD_RANK_UNIFORM_DATA_FROM_JSON",
      "data_source": [
        {
          "table_type_id": "GENERAL_FORMAT",
          "data_format": "json",
          "data_source_url": "<Path for storing the data sources>",
          "start_time": ""
        }
      ],
      "algorithm_parameters": {
        "result_path": "<Path for storing all output data>",
        "global_features_information_path": "<Path for storing the global feature files>",
        "rank_etl_type": "LR",
        "rank_etl_parameters": {
          "divide_by_time_or_rate": "RATE",
          "training_data_start_time": "1552117770165",
          "training_data_end_time": "1517414400000",
          "test_data_start_time": "1517414400000",
          "test_data_end_time": "1519217998000",
          "training_data_rate": "0.8",
          "test_data_rate": "0.2",
          "user_features": [
            {
              "feature_name": "provinceId",
              "feature_type": "BASIC_INFO",
              "feature_value_type": "numerical",
              "feature_process_parameters": {
                "discrete_method": "no_discrete"
              }
            },
            {
              "feature_name": "cityId",
              "feature_type": "BASIC_INFO",
              "feature_value_type": "numerical",
              "feature_process_parameters": {
                "discrete_method": "equal_distance_discrete",
                "lower_limit": 0,
                "upper_limit": 10000,
                "distance": 1000
              }
            },
            {
              "feature_name": "districtId",
              "feature_type": "BASIC_INFO",
              "feature_value_type": "numerical",
              "feature_process_parameters": {
                "discrete_method": "no_discrete"
              }
            },
            {
              "feature_name": "payment_type",
              "feature_type": "CONTEXT",
              "feature_value_type": "numerical",
              "feature_process_parameters": {
                "discrete_method": "no_discrete"
              }
            },
            {
              "feature_name": "payment_method",
              "feature_type": "CONTEXT",
              "feature_value_type": "string",
              "feature_process_parameters": {}
            },
            {
              "feature_name": "payment_channel",
              "feature_type": "CONTEXT",
              "feature_value_type": "numerical",
              "feature_process_parameters": {
                "discrete_method": "no_discrete"
              }
            },
            {
              "feature_name": "salary",
              "feature_type": "BASIC_INFO",
              "feature_value_type": "numerical",
              "feature_process_parameters": {
                "discrete_method": "user_define_discrete",
                "period_list": [
                  {
                    "period_name": "low",
                    "lower_limit": 0,
                    "upper_limit": 5000
                  },
                  {
                    "period_name": "mid",
                    "lower_limit": 5000,
                    "upper_limit": 30000
                  },
                  {
                    "period_name": "high",
                    "lower_limit": 30000,
                    "upper_limit": 100000
                  }
                ]
              }
            },
            {
              "feature_name": "user_tags",
              "feature_type": "TAGS",
              "feature_value_type": "map",
              "feature_process_parameters": {
                "process_method": "map_format",
                "value_preserve_number": 4
              }
            },
            {
              "feature_name": "hobbies",
              "feature_type": "BASIC_INFO",
              "feature_value_type": "strArray",
              "feature_process_parameters": {
                "process_method": "string_array_format",
                "value_preserve_number": 3
              }
            }
          ],
          "item_features": [
            {
              "feature_name": "product_name",
              "feature_type": "BASIC_INFO",
              "feature_value_type": "string",
              "feature_process_parameters": {}
            },
            {
              "feature_name": "order_price",
              "feature_type": "BASIC_INFO",
              "feature_value_type": "numerical",
              "feature_process_parameters": {
                "discrete_method": "equal_frequency_discrete",
                "frequency": 10
              }
            },
            {
              "feature_name": "weight",
              "feature_type": "BASIC_INFO",
              "feature_value_type": "string",
              "feature_process_parameters": {}
            },
            {
              "feature_name": "volume",
              "feature_type": "BASIC_INFO",
              "feature_value_type": "string",
              "feature_process_parameters": {}
            },
            {
              "feature_name": "categories",
              "feature_type": "BASIC_INFO",
              "feature_value_type": "strArray",
              "feature_process_parameters": {
                "process_method": "string_array_format",
                "value_preserve_number": 3
              }
            },
            {
              "feature_name": "item_tags",
              "feature_type": "TAGS",
              "feature_value_type": "map",
              "feature_process_parameters": {
                "process_method": "map_format",
                "value_preserve_number": 3
              }
            }
          ],
          "positive_behaviors": [
            "consume"
          ],
          "negative_behaviors": [
            "uncollect",
            "dislike"
          ]
        }
      },
      "offline_platform": {
        "platform": "DLI",
        "platform_parameter": {
          "cluster_name": "res_two"
        },
        "config_load_path": "<Path for storing the configuration sources>"
      },
      "storage": {}
    }
  • Example of a successful response
    {
        "is_success": true,
        "job_id": "d832b07540594ea980c140fea5a10849",
        "job_name": "gggggggggggggggggg",
        "create_time": "1543891781990",
        "etl_uuid": "a53a685c52f4476f833d256620b6fc80"
    }
  • Example of a failed response
    {
        "is_success": false,
        "error_code": "res.2006",
        "error_msg": "The datasourceUrl(<Path for storing the data sources>) is not match Bucket structure."
    }
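As a minimal sketch, a request body such as the one in the example request could be wrapped in a POST request with the Python standard library. Authentication headers are omitted because this section does not describe them, the function name is hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_submit_request(url: str, body: dict) -> urllib.request.Request:
    """Wrap a job definition in a POST request with a JSON payload."""
    data = json.dumps(body).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=data, headers=headers, method="POST")
```

Passing the returned object to urllib.request.urlopen would send the request; the response body can then be decoded as JSON.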

Status Code

For details about status codes, see Status Codes.