Submitting Feature Engineering Jobs
Function
This API is used to submit feature engineering jobs, including data preprocessing, feature extraction, and the generation of ranking training samples.
URI
POST /v1/{project_id}/etl-job
Table 1 describes the URI parameters.
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| project_id | Yes | String | Project ID, which is used for resource isolation. For details about how to obtain the project ID, see Obtaining a Project ID. |
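
As a minimal sketch of how the URI above could be assembled, the following Python builds (but does not send) the POST request. The endpoint host `res.example.com` and the helper name `build_submit_request` are assumptions for illustration only, and authentication headers are omitted because this section does not describe them.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Hypothetical endpoint host, used only for illustration.
ENDPOINT = "https://res.example.com"


def build_submit_request(project_id: str, body: dict, endpoint: str = ENDPOINT) -> Request:
    """Build (but do not send) the POST request for /v1/{project_id}/etl-job."""
    if not project_id:
        raise ValueError("project_id is mandatory")
    # Percent-encode the path segment so it cannot alter the URI structure.
    url = f"{endpoint.rstrip('/')}/v1/{quote(project_id, safe='')}/etl-job"
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_submit_request("my-project-id", {"job_name": "demo"})
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would send it; a real call would also need whatever authentication the service requires.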
Request
Table 2 describes the request parameters.
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| workspace_id | No | String | Workspace ID. The default value is 0. |
| job_name | Yes | String | Training job name. The value can contain a maximum of 20 characters. |
| job_description | No | String | Training job description. The value can contain a maximum of 256 characters. |
| algorithm_type | Yes | String | Algorithm type. Values that appear in this section include INITIAL_PROFILES_GENERATION and BUILD_RANK_UNIFORM_DATA_FROM_JSON. |
| algorithm_parameters | Yes | JSON | Algorithm parameter. Each kind of algorithm has specified parameters. |
| data_source | Yes | List | Algorithm data sources. For details, see Table 5. |
| storage | Yes | JSON | Storage platform. For details, see Table 6. |
| offline_platform | Yes | JSON | Offline computing platform. For details, see Table 3. |
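
The top-level request parameters can be assembled and checked on the client side before submission. The sketch below is one possible way to do that; the helper name `build_job_request` is hypothetical, and only the length limits stated in the table are enforced.

```python
def build_job_request(job_name, algorithm_type, algorithm_parameters,
                      data_source, storage, offline_platform,
                      job_description=None, workspace_id=None):
    """Assemble the request body, enforcing the documented length limits."""
    if not job_name or len(job_name) > 20:
        raise ValueError("job_name is mandatory and limited to 20 characters")
    if job_description is not None and len(job_description) > 256:
        raise ValueError("job_description is limited to 256 characters")
    body = {
        "job_name": job_name,
        "algorithm_type": algorithm_type,
        "algorithm_parameters": algorithm_parameters,
        "data_source": list(data_source),
        "storage": storage,
        "offline_platform": offline_platform,
    }
    # Optional parameters are only sent when the caller provides them.
    if job_description is not None:
        body["job_description"] = job_description
    if workspace_id is not None:
        body["workspace_id"] = workspace_id
    return body
```

Optional fields are left out of the body when not supplied, so the service can apply its own defaults (for example, workspace_id 0).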

Table 3 describes the offline_platform parameters.
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| platform | Yes | String | Platform name. The value can contain a maximum of 64 characters. Currently, only DLI is supported. |
| platform_parameter | Yes | JSON | Platform parameter. For details, see Table 4. |
| computing_resource | No | String | Resource specifications required for the normal running of the DLI jobs. |
| config_load_path | Yes | String | Path to read the configuration sources. |

Table 4 describes the platform_parameter parameters.
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| cluster_name | Yes | String | Cluster name |
| cluster_id | No | String | Cluster ID |

Table 5 describes the data_source parameters.
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| table_type_id | Yes | String | General data template, for example GENERAL_FORMAT (general format). For details about the data format, see Offline Data Sources. |
| data_source_url | Yes | String | Data source path. The value can contain a maximum of 1000 characters. |
| data_format | Yes | String | Input data format. The value can be csv, parquet, json, or orc. |
| data_param | No | JSON | Data parameter. For details, see Table 7. This parameter is mandatory when the data format is csv and optional for other data formats. |
| start_time | No | String | Start time for collecting the source data. This parameter is mandatory when the data format is json and optional for other data formats. |
| end_time | No | String | End time for collecting the source data. This parameter is mandatory when the data format is json and optional for other data formats. |
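
The conditional rules for a data_source entry (allowed formats, the URL length limit, and the fields that become mandatory for csv or json) can be checked before a job is submitted. This is a sketch under the rules stated in the table only; the function name `check_data_source` is hypothetical.

```python
ALLOWED_FORMATS = {"csv", "parquet", "json", "orc"}


def check_data_source(entry: dict) -> list:
    """Return a list of problems found in one data_source entry."""
    problems = []
    for key in ("table_type_id", "data_source_url", "data_format"):
        if key not in entry:
            problems.append(f"{key} is mandatory")
    fmt = entry.get("data_format")
    if fmt is not None and fmt not in ALLOWED_FORMATS:
        problems.append("data_format must be csv, parquet, json, or orc")
    if len(entry.get("data_source_url", "")) > 1000:
        problems.append("data_source_url is limited to 1000 characters")
    # data_param becomes mandatory for csv input.
    if fmt == "csv" and "data_param" not in entry:
        problems.append("data_param is mandatory when data_format is csv")
    # start_time and end_time become mandatory for json input.
    if fmt == "json":
        for key in ("start_time", "end_time"):
            if key not in entry:
                problems.append(f"{key} is mandatory when data_format is json")
    return problems
```

An empty list means the entry satisfies the documented rules; the service may apply further checks that are not covered here.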

Table 6 describes the storage parameters.
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| user_profiles_table | No | JSON | User attribute storage table. For details, see Table 8. This parameter is mandatory when algorithm_type is set to INITIAL_PROFILES_GENERATION. |
| item_profiles_table | No | JSON | Item attribute storage table. For details, see Table 8. This parameter is mandatory when algorithm_type is set to INITIAL_PROFILES_GENERATION. |

Table 7 describes the data_param parameters.
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| header | Yes | Boolean | Whether to display the table header |
| delimiter | Yes | String | Delimiter. The value can contain a maximum of 10 characters. |
| quote | Yes | String | Quotation character. The value can contain a maximum of 10 characters. |
| escape | Yes | String | Escape character. The value can contain a maximum of 10 characters. |
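
For csv input, the data_param object could be built as follows. This is a minimal sketch; the helper name `csv_data_param` and its default values (comma, double quote, backslash) are assumptions chosen for illustration, not defaults documented by the service.

```python
def csv_data_param(header=True, delimiter=",", quote='"', escape="\\"):
    """Build a data_param object for csv data, enforcing the 10-character limits."""
    for name, value in (("delimiter", delimiter), ("quote", quote), ("escape", escape)):
        if not isinstance(value, str) or len(value) > 10:
            raise ValueError(f"{name} must be a string of at most 10 characters")
    return {
        "header": bool(header),
        "delimiter": delimiter,
        "quote": quote,
        "escape": escape,
    }
```

The result can be placed under the data_param key of a data_source entry whose data_format is csv.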

Table 8 describes the algorithm_parameters parameters.
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| result_path | Yes | String | Path or folder that stores all output data (user and item attributes, feature maps, field features, training sets, and test sets). |
| global_features_information_path | Yes | String | Global feature file (JSON) that contains the feature names, feature types, and feature value types. For details about the global feature file, see Viewing Global Feature File Configurations. |
| writer_parameters | No | JSON | Advanced settings. For details, see Table 10. |

Table 9 describes the algorithm_parameters parameters for ranking data processing.
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| result_path | Yes | String | Path or folder that stores all output data (user and item attributes, feature maps, field features, training sets, and test sets). |
| global_features_information_path | Yes | String | Global feature file (JSON) that contains the feature names, feature types, and feature value types. For details about the global feature file, see Viewing Global Feature File Configurations. |
| rank_etl_type | Yes | Enum | Operator type for processing ranking data. Each ranking algorithm requires specific data processing, and the ranking data processing type needs to be selected according to the used ranking algorithms. Data processing results of the LR, FM, FFM, DeepFM, and PIN algorithms can be shared. |
| rank_etl_parameters | Yes | JSON | Data preprocessing parameter of the ranking algorithm. For details, see Table 11. |

Table 10 describes the writer_parameters parameters.
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| save_mode | No | String | Mode of retaining the existing wide table data in the result save path. |

Table 11 describes the rank_etl_parameters parameters.
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| (divide_by_time_or_rate) | Yes | String | The training set and the test set are differentiated by TIME or RATE. The value can be TIME or RATE. |
| (training_data_start_time) | No | Long | Start time of training data. This parameter is mandatory when divide_by_time_or_rate is set to TIME. The value must be less than both the maximum time in the behavior data and the value of training_data_end_time. For example, 1541987933. |
| (training_data_end_time) | No | Long | End time of training data. This parameter is mandatory when divide_by_time_or_rate is set to TIME. The value must be less than the maximum time in the behavior data and greater than the value of training_data_start_time. For example, 1541987933. |
| (test_data_start_time) | No | Long | Start time of test data. This parameter is mandatory when divide_by_time_or_rate is set to TIME. The value must be less than both the maximum time in the behavior data and the value of test_data_end_time. For example, 1541987933. |
| (test_data_end_time) | No | Long | End time of test data. This parameter is mandatory when divide_by_time_or_rate is set to TIME. The value must be less than the maximum time in the behavior data and greater than the value of test_data_start_time. For example, 1541987933. |
| (training_data_rate) | No | Double | Percentage of training data in the input data. This parameter is mandatory when divide_by_time_or_rate is set to RATE. The value ranges from 0 to 1. |
| (test_data_rate) | No | Double | Percentage of test data in the input data. This parameter is mandatory when divide_by_time_or_rate is set to RATE. The value ranges from 0 to 1. |
| (user_features) | Yes | JSONArray | Input user features extracted from the global feature file, which can be used for ranking model training after being processed properly. Each feature must be defined in the User Attribute Configuration Table. For details about each element, see Table 12. Example: [{ "feature_name": "age", "feature_value_type": "numerical", "feature_type": "BASIC_INFO", "feature_process_parameters": { "discrete_method": "equal_distance_discrete", "lower_limit": 0.0, "upper_limit": 120.0, "distance": 20 } }, { "feature_name": "user_tag", "feature_value_type": "map", "feature_type": "TAGS", "feature_process_parameters": { "value_preserve_number": 4 } }] |
| (item_features) | Yes | JSONArray | Input item features extracted from the global feature file, which can be used for ranking model training after being processed properly. Each feature must be defined in the Item Attribute Configuration List. For details about each element, see Table 12. Example: [{ "feature_name": "product_name", "feature_value_type": "string", "feature_type": "BASIC_INFO", "feature_process_parameters": { } }, { "feature_name": "categories", "feature_value_type": "strArray", "feature_type": "BASIC_INFO", "feature_process_parameters": { "value_preserve_number": 3 } }] |
| (positive_behaviors) | Yes | List[String] | Behaviors that are converted into positive samples in the ranking data. Each value must match an actionType value in the User Operation Behavior Table. For example, [click,collect,purchase,share]. |
| (negative_behaviors) | Yes | List[String] | Behaviors that are converted into negative samples in the ranking data. Each value must match an actionType value in the User Operation Behavior Table. For example, [view,dislike]. |
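
The TIME and RATE rules above lend themselves to a small pre-submission check. The sketch below covers only the constraints stated in the table (mandatory fields per mode, start before end, rates between 0 and 1); it cannot check values against the maximum time in the behavior data. The function name `check_split` is hypothetical, and values are coerced with `int` and `float` because the example request in this section passes them as strings.

```python
TIME_KEYS = (
    "training_data_start_time", "training_data_end_time",
    "test_data_start_time", "test_data_end_time",
)
RATE_KEYS = ("training_data_rate", "test_data_rate")


def check_split(params: dict) -> list:
    """Return a list of problems in the training/test split settings."""
    problems = []
    mode = params.get("divide_by_time_or_rate")
    if mode == "TIME":
        missing = [k for k in TIME_KEYS if k not in params]
        problems += [f"{k} is mandatory when dividing by TIME" for k in missing]
        if not missing:
            # Accept numbers or numeric strings.
            t = {k: int(params[k]) for k in TIME_KEYS}
            if not t["training_data_start_time"] < t["training_data_end_time"]:
                problems.append("training_data_start_time must be less than training_data_end_time")
            if not t["test_data_start_time"] < t["test_data_end_time"]:
                problems.append("test_data_start_time must be less than test_data_end_time")
    elif mode == "RATE":
        for k in RATE_KEYS:
            if k not in params:
                problems.append(f"{k} is mandatory when dividing by RATE")
            elif not 0 <= float(params[k]) <= 1:
                problems.append(f"{k} must be between 0 and 1")
    else:
        problems.append("divide_by_time_or_rate must be TIME or RATE")
    return problems
```

In RATE mode the sketch ignores any time fields that happen to be present, which is an assumption about how the service treats them.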

Table 12 describes the parameters of each user_features and item_features element.
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| (feature_name) | Yes | String | Feature name |
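| (feature_type) | Yes | String | Feature type. User features and item features each have their own set of types. Values used in this section include BASIC_INFO, TAGS, and CONTEXT. |
| (feature_value_type) | Yes | String | Feature value type. Values used in this section include numerical, string, strArray, and map. |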
| (feature_process_parameters) | Yes | JSON | Each type of feature has a corresponding processing method whose parameters are provided by users. Example: { "discrete_method":"equal_distance_discrete", "lower_limit":0.0, "upper_limit":120.0, "distance":20 } |

Table 13 describes the feature_process_parameters parameters for discretizing numerical features.
| Parameter | Method | Method parameter | Mandatory | Type | Description |
|---|---|---|---|---|---|
| (discrete_method) | (equal_distance_discrete) | (lower_limit) | No | Double | Feature values less than this value are regarded as abnormal. Specify this parameter based on business experience. If it is not specified, the minimum feature value in the data is used. Value range: [Double.MinValue, Double.MaxValue). The value must be smaller than upper_limit. |
| (discrete_method) | (equal_distance_discrete) | (upper_limit) | No | Double | Feature values greater than this value are regarded as abnormal. Specify this parameter based on business experience. If it is not specified, the maximum feature value in the data is used. Value range: (Double.MinValue, Double.MaxValue]. The value must be greater than lower_limit. |
| (discrete_method) | (equal_distance_discrete) | (distance) | Yes | Double | The feature range is divided into segments of this width, and each segment corresponds to a discrete value. Value range: (0, Double.MaxValue). |
| (discrete_method) | (equal_frequency_discrete) | (lower_limit) | No | Double | Feature values less than this value are regarded as abnormal. Specify this parameter based on business experience. If it is not specified, the minimum feature value in the data is used. Value range: [Double.MinValue, Double.MaxValue). The value must be smaller than upper_limit. |
| (discrete_method) | (equal_frequency_discrete) | (upper_limit) | No | Double | Feature values greater than this value are regarded as abnormal. Specify this parameter based on business experience. If it is not specified, the maximum feature value in the data is used. Value range: (Double.MinValue, Double.MaxValue]. The value must be greater than lower_limit. |
| (discrete_method) | (equal_frequency_discrete) | (frequency) | Yes | Int | The feature values are sorted in ascending order and divided into segments by frequency, and each segment corresponds to a discrete value. Value range: (0, Int.MaxValue). |
| (discrete_method) | (user_define_discrete) | (period_list) | Yes | JSONArray | User-defined periods, each with a minimum value, a maximum value, and a discrete value. A feature value that falls within a period is assigned the discrete value of that period. A feature value that falls within none of the periods is treated as abnormal. Each period is a half-open interval that includes its minimum value but not its maximum value, that is, [lower_limit, upper_limit). Periods cannot overlap. For details about each period, see Table 14. Example: [ { "period_name": "young", "lower_limit": 0.0, "upper_limit": 18.0 }, { "period_name": "mid", "lower_limit": 18.0, "upper_limit": 60.0 }, { "period_name": "old", "lower_limit": 60.0, "upper_limit": 120.0 } ] |
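
To make the discretization rules more concrete, the following sketch shows one way equal-distance and user-defined discretization could behave on a single value. It is an illustration of the rules as described, not the service's implementation: the function names are hypothetical, abnormal values are represented as `None`, and the zero-based segment numbering (with a value equal to upper_limit placed in the last segment) is an assumption.

```python
import math


def equal_distance_bucket(value, lower_limit, upper_limit, distance):
    """Map a value to a zero-based segment index, or None if it is abnormal."""
    if distance <= 0:
        raise ValueError("distance must be greater than 0")
    if value < lower_limit or value > upper_limit:
        return None  # outside [lower_limit, upper_limit]: abnormal
    segments = max(1, math.ceil((upper_limit - lower_limit) / distance))
    # Assumption: a value equal to upper_limit falls into the last segment.
    return min(int((value - lower_limit) // distance), segments - 1)


def user_defined_period(value, period_list):
    """Return the period_name whose [lower_limit, upper_limit) contains value."""
    for period in period_list:
        if period["lower_limit"] <= value < period["upper_limit"]:
            return period["period_name"]
    return None  # not within any period: abnormal


periods = [
    {"period_name": "young", "lower_limit": 0.0, "upper_limit": 18.0},
    {"period_name": "mid", "lower_limit": 18.0, "upper_limit": 60.0},
    {"period_name": "old", "lower_limit": 60.0, "upper_limit": 120.0},
]
print(equal_distance_bucket(35.0, 0.0, 120.0, 20))
print(user_defined_period(18.0, periods))
```

Because each period includes its minimum but not its maximum, the value 18.0 belongs to "mid" rather than "young", and 120.0 belongs to no period.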

Table 14 describes the parameters of each period_list element.
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| (lower_limit) | Yes | Double | Minimum value of a period. The value ranges from Double.MinValue to Double.MaxValue. The value must be smaller than upper_limit. |
| (upper_limit) | Yes | Double | Maximum value of a period. The value ranges from Double.MinValue to Double.MaxValue. The value must be greater than lower_limit. |
| (period_name) | Yes | String | Name of a period |

Table 15 describes the feature_process_parameters parameters for strArray features.
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| (value_preserve_number) | No | Int | Number of strArray feature values to preserve. If a feature has more values than this number, the extra values are deleted. If it has fewer, all values are retained. If this parameter is not specified, the maximum number of strArray values found in the data is used. The value ranges from 1 to 100. |

Table 16 describes the feature_process_parameters parameters for map (KV) features.
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| (value_preserve_number) | No | Int | Number of KV feature values to preserve. If a feature has more values than this number, the extra values are deleted. If it has fewer, all values are retained. If this parameter is not specified, the maximum number of KV values found in the data is used. The value ranges from 1 to 100. |
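
The effect of value_preserve_number can be pictured as a simple truncation. The sketch below is illustrative only: the function name `preserve_values` is hypothetical, and keeping the first entries in their existing order is an assumption, since this section does not say which values count as extra.

```python
def preserve_values(values, value_preserve_number=None):
    """Keep at most value_preserve_number entries of a strArray (list) or map (dict)."""
    if value_preserve_number is None:
        # Not specified: nothing is cut off in this sketch.
        return dict(values) if isinstance(values, dict) else list(values)
    if not 1 <= value_preserve_number <= 100:
        raise ValueError("value_preserve_number must be between 1 and 100")
    if isinstance(values, dict):
        # Assumption: keep the first entries in insertion order.
        return dict(list(values.items())[:value_preserve_number])
    return list(values)[:value_preserve_number]


print(preserve_values(["reading", "music", "travel", "sports"], 3))
print(preserve_values({"sports": 0.9, "news": 0.4}, 4))
```

When a feature has fewer values than the limit, as in the second call, all of them are kept.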
Response
Table 17 describes the response parameters.
| Parameter | Type | Description |
|---|---|---|
| job_name | String | Job name |
| job_id | String | Job ID |
| is_success | Boolean | Whether the request is successful |
| error_message | String | Error message that indicates a request has failed. This parameter is unavailable when a request is successful. |
| error_code | String | Error code that indicates a request has failed. This parameter is unavailable when a request is successful. |
| create_time | Long | Time when a job is created |
| etl_uuid | String | Candidate set ID |
Example
- Example request
{
  "job_name": "ETL-rank_test1",
  "job_description": "hhx test",
  "algorithm_type": "BUILD_RANK_UNIFORM_DATA_FROM_JSON",
  "data_source": [
    { "table_type_id": "GENERAL_FORMAT", "data_format": "json", "data_source_url": "<Path for storing the data sources>", "start_time": "" }
  ],
  "algorithm_parameters": {
    "result_path": "<Path for storing all output data>",
    "global_features_information_path": "<Path for storing the global feature files>",
    "rank_etl_type": "LR",
    "rank_etl_parameters": {
      "divide_by_time_or_rate": "RATE",
      "training_data_start_time": "1552117770165",
      "training_data_end_time": "1517414400000",
      "test_data_start_time": "1517414400000",
      "test_data_end_time": "1519217998000",
      "training_data_rate": "0.8",
      "test_data_rate": "0.2",
      "user_features": [
        { "feature_name": "provinceId", "feature_type": "BASIC_INFO", "feature_value_type": "numerical", "feature_process_parameters": { "discrete_method": "no_discrete" } },
        { "feature_name": "cityId", "feature_type": "BASIC_INFO", "feature_value_type": "numerical", "feature_process_parameters": { "discrete_method": "equal_distance_discrete", "lower_limit": 0, "upper_limit": 10000, "distance": 1000 } },
        { "feature_name": "districtId", "feature_type": "BASIC_INFO", "feature_value_type": "numerical", "feature_process_parameters": { "discrete_method": "no_discrete" } },
        { "feature_name": "payment_type", "feature_type": "CONTEXT", "feature_value_type": "numerical", "feature_process_parameters": { "discrete_method": "no_discrete" } },
        { "feature_name": "payment_method", "feature_type": "CONTEXT", "feature_value_type": "string", "feature_process_parameters": {} },
        { "feature_name": "payment_channel", "feature_type": "CONTEXT", "feature_value_type": "numerical", "feature_process_parameters": { "discrete_method": "no_discrete" } },
        { "feature_name": "salary", "feature_type": "BASIC_INFO", "feature_value_type": "numerical", "feature_process_parameters": { "discrete_method": "user_define_discrete", "period_list": [
          { "period_name": "low", "lower_limit": 0, "upper_limit": 5000 },
          { "period_name": "mid", "lower_limit": 5000, "upper_limit": 30000 },
          { "period_name": "high", "lower_limit": 30000, "upper_limit": 100000 }
        ] } },
        { "feature_name": "user_tags", "feature_type": "TAGS", "feature_value_type": "map", "feature_process_parameters": { "process_method": "map_format", "value_preserve_number": 4 } },
        { "feature_name": "hobbies", "feature_type": "BASIC_INFO", "feature_value_type": "strArray", "feature_process_parameters": { "process_method": "string_array_format", "value_preserve_number": 3 } }
      ],
      "item_features": [
        { "feature_name": "product_name", "feature_type": "BASIC_INFO", "feature_value_type": "string", "feature_process_parameters": {} },
        { "feature_name": "order_price", "feature_type": "BASIC_INFO", "feature_value_type": "numerical", "feature_process_parameters": { "discrete_method": "equal_frequency_discrete", "frequency": 10 } },
        { "feature_name": "weight", "feature_type": "BASIC_INFO", "feature_value_type": "string", "feature_process_parameters": {} },
        { "feature_name": "volume", "feature_type": "BASIC_INFO", "feature_value_type": "string", "feature_process_parameters": {} },
        { "feature_name": "categories", "feature_type": "BASIC_INFO", "feature_value_type": "strArray", "feature_process_parameters": { "process_method": "string_array_format", "value_preserve_number": 3 } },
        { "feature_name": "item_tags", "feature_type": "TAGS", "feature_value_type": "map", "feature_process_parameters": { "process_method": "map_format", "value_preserve_number": 3 } }
      ],
      "positive_behaviors": [ "consume" ],
      "negative_behaviors": [ "uncollect", "dislike" ]
    }
  },
  "offline_platform": {
    "platform": "DLI",
    "platform_parameter": { "cluster_name": "res_two" },
    "config_load_path": "<Path for storing the configuration sources>"
  },
  "storage": {}
}
- Example of a successful response
{ "is_success": true, "job_id": "d832b07540594ea980c140fea5a10849", "job_name": "gggggggggggggggggg", "create_time": "1543891781990", "etl_uuid": "a53a685c52f4476f833d256620b6fc80" }
- Example of a failed response
{ "is_success": false, "error_code": "res.2006", "error_msg": "The datasourceUrl(<Path for storing the data sources>) is not match Bucket structure." }
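
A client might handle the two response shapes above as in the following sketch. The function name `parse_submit_response` is hypothetical. Because the response table names the field error_message while the failed-response example uses error_msg, the sketch reads whichever of the two is present.

```python
import json


def parse_submit_response(text: str) -> str:
    """Return the job_id of a successful submission, or raise on failure."""
    payload = json.loads(text)
    if payload.get("is_success"):
        return payload["job_id"]
    # The table lists error_message; the example response uses error_msg.
    message = payload.get("error_message") or payload.get("error_msg") or "unknown error"
    raise RuntimeError(f"{payload.get('error_code')}: {message}")


ok = '{"is_success": true, "job_id": "d832b07540594ea980c140fea5a10849"}'
print(parse_submit_response(ok))
```

Raising an exception on failure keeps the error code and message together for logging or retry decisions.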
Status Code
For details about status codes, see Status Codes.