Last updated: 2024-05-27 GMT+08:00

Chi-Square Selection

Overview

This component selects features using the chi-squared test.

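The chi-squared test (χ² test) selects the feature variables most strongly associated with the target variable by measuring the deviation between them. It first assumes that the two variables are independent, then measures how far the actual values deviate from the theoretical values expected under that assumption. This deviation indicates the strength of the association between the two variables: the larger the deviation between a feature variable and the target variable, the stronger their association. Finally, the feature variables are ranked by this measure and those most strongly associated with the target variable are selected. If the theoretical value is E and the actual value of the i-th sample is x_i, the deviation over n samples is calculated as follows:

$$\chi^2 = \sum_{i=1}^{n} \frac{(x_i - E)^2}{E}$$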
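As a rough illustration of the deviation described above, the following sketch computes the chi-squared statistic between one categorical feature and the target from their contingency table. The function name `chi_square_statistic` and the toy data are hypothetical and are not part of this component. The sketch assumes that the theoretical value of each cell is the count expected if the feature and the target were independent.

```python
from collections import Counter


def chi_square_statistic(feature_values, labels):
    """Chi-squared statistic between a categorical feature and a label."""
    n = len(feature_values)
    # Actual (observed) count of every (feature value, label) pair.
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)

    statistic = 0.0
    for feature_value, feature_count in feature_totals.items():
        for label, label_count in label_totals.items():
            # Theoretical count if the two variables were independent.
            expected = feature_count * label_count / n
            actual = observed.get((feature_value, label), 0)
            statistic += (actual - expected) ** 2 / expected
    return statistic


# A feature that fully determines the label deviates strongly from independence.
dependent = chi_square_statistic(["a", "a", "b", "b"], [0, 0, 1, 1])
# A feature that is unrelated to the label shows no deviation.
independent = chi_square_statistic(["a", "a", "b", "b"], [0, 1, 0, 1])
print(dependent, independent)
```

A feature with a larger statistic would rank higher when features are sorted by their association with the target.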

Input

| Parameter | Sub-parameter | Description |
| --- | --- | --- |
| inputs | dataframe | inputs is a dictionary; dataframe is a PySpark DataFrame object. |

Output

Dataset

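Parameters

| Parameter | Sub-parameter | Description |
| --- | --- | --- |
| input_features_str | - | Formatted string of input column names, separated by commas. For example: "column_a" or "column_a,column_b". |
| label_col | - | Target column. The chi-squared test is performed against this column. |
| chi_label_index_col | - | Name of the new column produced by label-encoding the target column. Default: label_index. |
| chi_features_col | - | Name of the input feature vector column required by the Spark chi-square selector. Default: input_features. |
| chi_output_col | - | Name of the output feature vector column produced by the Spark chi-square selector. Default: output_features. |
| selector_type | - | Selection method used by the chi-square selector. Supported values: numTopFeatures, percentile, fpr, fdr, fwe. |
| num_top_features | - | Number of features to select. Default: 50. |
| percentile | - | Number of features to select, as a fraction of the original number of features. Default: 0.1. |
| fpr | - | Highest p-value at which a feature is still selected. Default: 0.05. |
| fdr | - | Upper bound of the expected false discovery rate. Default: 0.05. |
| fwe | - | Upper bound of the expected family-wise error rate. Default: 0.05. |
| max_categories | - | Maximum number of categories of a feature. Default: 1000. |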
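To make the five selector_type values more concrete, the sketch below applies each method to a list of per-feature p-values. It follows the semantics that Spark documents for its chi-square selector (top-N by p-value, top fraction, p-value threshold, Benjamini-Hochberg for fdr, and a threshold scaled by 1/number of features for fwe). Whether this component handles ties and rounding in exactly the same way is an assumption. The function `select_features` and the sample p-values are hypothetical.

```python
def select_features(p_values, selector_type="numTopFeatures",
                    num_top_features=50, percentile=0.1,
                    fpr=0.05, fdr=0.05, fwe=0.05):
    """Return the sorted indices of the features kept by one selection method."""
    n = len(p_values)
    # Feature indices ordered from smallest to largest p-value.
    ranked = sorted(range(n), key=lambda i: p_values[i])

    if selector_type == "numTopFeatures":
        kept = ranked[:num_top_features]
    elif selector_type == "percentile":
        kept = ranked[:int(n * percentile)]
    elif selector_type == "fpr":
        kept = [i for i in ranked if p_values[i] < fpr]
    elif selector_type == "fdr":
        # Benjamini-Hochberg: largest rank k with p_(k) <= fdr * k / n.
        cutoff = 0
        for rank, i in enumerate(ranked, start=1):
            if p_values[i] <= fdr * rank / n:
                cutoff = rank
        kept = ranked[:cutoff]
    elif selector_type == "fwe":
        # Threshold scaled by 1 / number of features.
        kept = [i for i in ranked if p_values[i] < fwe / n]
    else:
        raise ValueError(f"unsupported selector_type: {selector_type}")
    return sorted(kept)


# Hypothetical p-values for five features.
p_values = [0.001, 0.2, 0.03, 0.04, 0.6]
for method in ("numTopFeatures", "percentile", "fpr", "fdr", "fwe"):
    print(method, select_features(p_values, method,
                                  num_top_features=2, percentile=0.4))
```

Only the parameter that matches the chosen selector_type influences the result; the other thresholds are ignored.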

Example

inputs = {
    "dataframe": None  # @input {"label":"dataframe","type":"DataFrame"}
}
params = {
    "inputs": inputs,
    "b_output_action": True,
    "b_use_default_encoder": True,
    "input_features_str": "",  # @param {"label":"input_features_str","type":"string","required":"false","helpTip":""}
    "outer_pipeline_stages": None,
    "label_col": "",  # @param {"label":"label_col","type":"string","required":"true","helpTip":""}
    "chi_label_index_col": "label_index",  # @param {"label":"chi_label_index_col","type":"string","required":"true","helpTip":""}
    "chi_features_col": "input_features",  # @param {"label":"chi_features_col","type":"string","required":"true","helpTip":""}
    "chi_output_col": "output_features",  # @param {"label":"chi_output_col","type":"string","required":"true","helpTip":""}
    "selector_type": "numTopFeatures",  # @param {"label":"selector_type","type":"enum","required":"true","options":"numTopFeatures,percentile,fpr,fdr,fwe","helpTip":""}
    "num_top_features": 50,  # @param {"label":"num_top_features","type":"integer","required":"true","range":"(0,2147483647]","helpTip":""}
    "percentile": 0.1,  # @param {"label":"percentile","type":"number","required":"true","range":"[0,1]","helpTip":""}
    "fpr": 0.05,  # @param {"label":"fpr","type":"number","required":"true","range":"[0,1]","helpTip":""}
    "fdr": 0.05,  # @param {"label":"fdr","type":"number","required":"true","range":"[0,1]","helpTip":""}
    "fwe": 0.05,  # @param {"label":"fwe","type":"number","required":"true","range":"[0,1]","helpTip":""}
    "max_categories": 1000  # @param {"label":"max_categories","type":"number","required":"true","range":"(0,2147483647]","helpTip":""}
}
chi_square_selector____id___ = MLSChiSquareSelector(**params)
chi_square_selector____id___.run()
# @output {"label":"dataframe","name":"chi_square_selector____id___.get_outputs()['output_port_1']","type":"DataFrame"}
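The sample above leaves label_col and input_features_str empty. As a minimal sketch of how the params dictionary might be filled in and checked before it is passed to MLSChiSquareSelector, the helper below starts from the documented defaults and validates values against the ranges given in the @param annotations. The helper `build_chi_square_params` is hypothetical and not part of the platform. The entries b_output_action, b_use_default_encoder, and outer_pipeline_stages are copied from the sample unchanged because their meaning is not documented here.

```python
SELECTOR_TYPES = ("numTopFeatures", "percentile", "fpr", "fdr", "fwe")

# Defaults copied from the sample in this document.
DEFAULT_PARAMS = {
    "b_output_action": True,
    "b_use_default_encoder": True,
    "outer_pipeline_stages": None,
    "chi_label_index_col": "label_index",
    "chi_features_col": "input_features",
    "chi_output_col": "output_features",
    "selector_type": "numTopFeatures",
    "num_top_features": 50,
    "percentile": 0.1,
    "fpr": 0.05,
    "fdr": 0.05,
    "fwe": 0.05,
    "max_categories": 1000,
}


def build_chi_square_params(dataframe, label_col, feature_columns=(), **overrides):
    """Hypothetical helper: assemble and validate the params dictionary."""
    if not label_col:
        raise ValueError("label_col is required")
    params = dict(DEFAULT_PARAMS)
    params.update(overrides)
    params["inputs"] = {"dataframe": dataframe}
    params["label_col"] = label_col
    # Column names are joined into the comma-separated format, e.g. "column_a,column_b".
    params["input_features_str"] = ",".join(feature_columns)

    if params["selector_type"] not in SELECTOR_TYPES:
        raise ValueError(f"unsupported selector_type: {params['selector_type']}")
    for name in ("num_top_features", "max_categories"):
        if not 0 < params[name] <= 2147483647:  # annotated range (0,2147483647]
            raise ValueError(f"{name} is out of range")
    for name in ("percentile", "fpr", "fdr", "fwe"):
        if not 0 <= params[name] <= 1:  # annotated range [0,1]
            raise ValueError(f"{name} is out of range")
    return params


# dataframe is None here only for illustration; pass a PySpark DataFrame in practice.
params = build_chi_square_params(None, "label", ["column_a", "column_b"],
                                 selector_type="fpr", fpr=0.01)
print(params["input_features_str"], params["selector_type"], params["fpr"])
```

The resulting dictionary has the same keys as the sample, so it could be passed as `MLSChiSquareSelector(**params)` in an environment where that class is available.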