Updated: 2023-05-05 GMT+08:00

LightGBM Classification

Overview

A wrapper around the LightGBM classifier in the mmlspark Python package.

Input

| Parameter | Sub-parameter | Description |
| --- | --- | --- |
| inputs | dataframe | inputs is a dict; dataframe is a pyspark DataFrame object |

Output

A Spark pipeline model (PipelineModel).

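Parameter description

| Parameter | Sub-parameter | Description |
| --- | --- | --- |
| input_features_str | - | Input column names joined by commas into a single string, for example "column_a" or "column_a,column_b" |
| label_col | - | Target column |
| classifier_label_index_col | - | Name of the new column produced by label-encoding the target column. Default: "label_index" |
| classifier_feature_vector_col | - | Name of the feature vector column fed to the operator. Default: "model_features" |
| prediction_index_col | - | Name of the output column holding the label index that corresponds to the predicted label. Default: "prediction_index" |
| prediction_col | - | Name of the output column holding the predicted label. Default: "prediction" |
| probability_col | - | Name of the output probability column. Default: "probability" |
| is_unbalance | - | Whether the dataset is imbalanced. Default: False |
| timeout | - | Timeout in seconds. Default: 1200 |
| objective | - | Objective function. Supported values: binary, multiclass, multiclassova. Default: "binary" |
| max_depth | - | Maximum tree depth. Default: -1 |
| num_iteration | - | Number of iterations. Default: 100 |
| learning_rate | - | Learning rate. Default: 0.1 |
| num_leaves | - | Number of leaves. Default: 31 |
| max_bin | - | Maximum number of bins. Default: 255 |
| bagging_fraction | - | Bagging fraction. Default: 1.0 |
| bagging_freq | - | Bagging frequency. Default: 0 |
| bagging_seed | - | Random seed used for bagging. Default: 3 |
| early_stopping_round | - | Number of rounds after which iteration stops early. Default: 0 |
| feature_fraction | - | Fraction of features. Default: 1.0 |
| min_sum_hessian_in_leaf | - | Minimum sum of Hessians in one leaf. Value range: [0, 1]. Default: 1e-3 |
| boost_from_average | - | Whether to adjust the initial score to the mean of the labels for faster convergence. Default: True |
| boosting_type | - | Boosting type. Supported values: gbdt, gbrt, rf, dart, goss. Default: "gbdt" |
| lambda_l1 | - | L1 regularization coefficient. Default: 0.0 |
| lambda_l2 | - | L2 regularization coefficient. Default: 0.0 |
| num_batches | - | If greater than 0, the dataset is split into separate batches during training. Default: 0 |
| parallelism | - | Parallelization method for tree learning. Supported values: data_parallel, voting_parallel. Default: "data_parallel" |
| thresholds_str | - | Used for multiclass classification. A preset array of probability thresholds, one per class, written as a comma-separated string |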
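
Both input_features_str and thresholds_str are passed as comma-separated strings. The following is a minimal sketch of building them from Python lists. The helper names `join_columns` and `join_thresholds` are hypothetical and are not part of the operator, and rejecting column names that contain a comma is an assumption made here to keep the string unambiguous.

```python
def join_columns(columns):
    """Join column names into the comma-separated form used by input_features_str."""
    cleaned = [c.strip() for c in columns]
    if any(not c for c in cleaned):
        raise ValueError("column names must be non-empty")
    if any("," in c for c in cleaned):
        # Assumption: a comma inside a name would make the string ambiguous.
        raise ValueError("column names must not contain commas")
    return ",".join(cleaned)


def join_thresholds(thresholds):
    """Join per-class thresholds into the comma-separated form used by thresholds_str."""
    return ",".join(repr(float(t)) for t in thresholds)


input_features_str = join_columns(["column_a", "column_b"])
thresholds_str = join_thresholds([0.5, 0.3, 0.2])
print(input_features_str)  # column_a,column_b
print(thresholds_str)      # 0.5,0.3,0.2
```

The resulting strings can be assigned to the "input_features_str" and "thresholds_str" keys of the params dict.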

Example

inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "b_output_action": True,
    "outer_pipeline_stages": None,
    "input_features_str": "",  # @param {"label":"input_features_str","type":"string","required":"false","helpTip":""}
    "label_col": "",  # @param {"label":"label_col","type":"string","required":"true","helpTip":""}
    "classifier_label_index_col": "label_index",  # @param {"label":"classifier_label_index_col","type":"string","required":"false","helpTip":""}
    "classifier_feature_vector_col": "model_features",  # @param {"label":"classifier_feature_vector_col","type":"string","required":"false","helpTip":""}
    "prediction_index_col": "prediction_index",  # @param {"label":"prediction_index_col","type":"string","required":"false","helpTip":""}
    "prediction_col": "prediction",  # @param {"label":"prediction_col","type":"string","required":"false","helpTip":""}
    "probability_col": "probability",  # @param {"label":"probability_col","type":"string","required":"false","helpTip":""}
    "is_unbalance": False,  # @param {"label":"is_unbalance","type":"boolean","required":"false","helpTip":""}
    "timeout": 1200.0,  # @param {"label":"timeout","type":"number","required":"false","helpTip":""}
    "objective": "binary",  # @param {"label":"objective","type":"string","required":"false","helpTip":""}
    "max_depth": -1,  # @param {"label":"max_depth","type":"integer","required":"false","range":"[-1,2147483647]","helpTip":""}
    "num_iteration": 100,  # @param {"label":"num_iteration","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "learning_rate": 0.1,  # @param {"label":"learning_rate","type":"number","required":"false","helpTip":""}
    "num_leaves": 31,  # @param {"label":"num_leaves","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "max_bin": 255,  # @param {"label":"max_bin","type":"integer","required":"false","range":"(0,2147483647]","helpTip":""}
    "bagging_fraction": 1.0,  # @param {"label":"bagging_fraction","type":"number","required":"false","helpTip":""}
    "bagging_freq": 0,  # @param {"label":"bagging_freq","type":"integer","required":"false","range":"[0,2147483647]","helpTip":""}
    "bagging_seed": 3,  # @param {"label":"bagging_seed","type":"integer","required":"false","range":"[0,2147483647]","helpTip":""}
    "early_stopping_round": 0,  # @param {"label":"early_stopping_round","type":"integer","required":"false","range":"[0,2147483647]","helpTip":""}
    "feature_fraction": 1.0,  # @param {"label":"feature_fraction","type":"number","required":"false","helpTip":""}
    "min_sum_hessian_in_leaf": 1e-3,  # @param {"label":"min_sum_hessian_in_leaf","type":"number","required":"false","helpTip":""}
    "boost_from_average": True,  # @param {"label":"boost_from_average","type":"boolean","required":"false","helpTip":""}
    "boosting_type": "gbdt",  # @param {"label":"boosting_type","type":"string","required":"false","helpTip":""}
    "lambda_l1": 0.0,  # @param {"label":"lambda_l1","type":"number","required":"false","helpTip":""}
    "lambda_l2": 0.0,  # @param {"label":"lambda_l2","type":"number","required":"false","helpTip":""}
    "num_batches": 0,  # @param {"label":"num_batches","type":"integer","required":"false","range":"[0,2147483647]","helpTip":""}
    "parallelism": "data_parallel",  # @param {"label":"parallelism","type":"string","required":"false","helpTip":""}
    "thresholds_str": ""  # @param {"label":"thresholds_str","type":"string","required":"false","helpTip":""}
}
lightgbm_classifier____id___ = MLSLightGBMClassifier(**params)
lightgbm_classifier____id___.run()
# @output {"label":"pipeline_model","name":"lightgbm_classifier____id___.get_outputs()['output_port_1']","type":"PipelineModel"}
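
The @param annotations in the example declare a range for each integer parameter and mark label_col as required, and the parameter descriptions list the accepted values of objective, boosting_type, and parallelism. The following is a minimal sketch of checking a params dict against those documented constraints before constructing the operator. The function `check_params` is hypothetical and not part of the operator, and whether the operator performs equivalent checks itself is not stated in this page.

```python
INT_MAX = 2147483647

# Inclusive lower bounds derived from the "range" annotations in the example.
# An annotated range of "(0,...]" becomes a lower bound of 1 for integers.
INT_LOWER_BOUNDS = {
    "max_depth": -1,
    "num_iteration": 1,
    "num_leaves": 1,
    "max_bin": 1,
    "bagging_freq": 0,
    "bagging_seed": 0,
    "early_stopping_round": 0,
    "num_batches": 0,
}

# Accepted values from the parameter descriptions
# (dart and goss are LightGBM's standard boosting names).
ALLOWED_VALUES = {
    "objective": {"binary", "multiclass", "multiclassova"},
    "boosting_type": {"gbdt", "gbrt", "rf", "dart", "goss"},
    "parallelism": {"data_parallel", "voting_parallel"},
}


def check_params(params):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if not params.get("label_col"):
        problems.append("label_col is required")
    for name, low in INT_LOWER_BOUNDS.items():
        if name in params:
            value = params[name]
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or not low <= value <= INT_MAX:
                problems.append(f"{name} must be an integer in [{low}, {INT_MAX}]")
    for name, allowed in ALLOWED_VALUES.items():
        if name in params and params[name] not in allowed:
            problems.append(f"{name} must be one of {sorted(allowed)}")
    return problems


print(check_params({"label_col": "target", "objective": "multiclass", "num_leaves": 63}))  # []
print(check_params({"label_col": "", "boosting_type": "dartgoss", "max_depth": -2}))
```

Returning a list instead of raising on the first failure lets all problems in a configuration be reported at once.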
