Last updated: 2022-04-16 GMT+08:00

Preparing Data

(Optional) Preparing an MRS Hive Data Source

If your data is to be published to TICS through MRS Hive, prepare the MRS Hive data source in advance.

To prepare the data:

  1. Purchase the MRS service by following the steps in the "Creating a Cluster" section. The VPC of the MRS service must be the same VPC where the compute node is deployed.

    Precautions:

    • "Region" must be the same region as the CCE cluster.
      Figure 1 Region configuration
    • The current MRS Hive connector is supported whether or not "Kerberos" authentication is selected.
    • "VPC" must be the same VPC as the CCE cluster to be created later.
    • "Security Group": you are advised to use the same security group and open the necessary ports to nodes in that group.

  2. Prepare an MRS Hive user by following the steps in "Preparing a Development User". Note that the user must have Hive permissions as well as access permissions on the target databases and tables.

    To create a data connection to an MRS security cluster, do not use the Admin user. The Admin user is the default management console user and cannot serve as the authentication user of a security cluster. You can create a new MRS user as follows:

    1. Log in to MRS Manager with the Admin account.
    2. In System Settings, click "Role Management", choose Add Role, click "Hive", click the "Hive Read Write Privileges" view, and select the read or write permissions on the Hive databases and tables to be published later.
      Figure 2 Adding role permissions
    3. Log in to MRS Manager, choose "System Settings" > "User Management", and add a dedicated user as the Kerberos authentication user. Add a user group and assign roles to this user: the user group must include at least the Hive group, and the roles must include at least the newly created Hive role. Then complete the user creation as prompted on the page.
      Figure 3 Creating a user
    4. Log in to MRS Manager with the new user and update the initial password.

  3. Import the data resources into Hive on MRS by following the data import instructions in "Using Hive from Scratch".
  4. Configure the security group by following "How Do I Configure a Security Group?".

    Security group configuration example

    This step ensures that the node where the compute node is deployed can communicate with the MRS cluster to obtain Hive data.

    One approach is to place the compute node in the same security group as the master nodes of the MRS cluster.

    Another approach is to configure the security group rules of the MRS cluster and open specific ports to the compute node.

    IP addresses and ports that must be reachable:

    • KrbServer IP addresses, TCP port 21730, and UDP ports 21731 and 21732
    • ZooKeeper IP addresses and port (2181)
    • HiveServer IP addresses and port (10000)
    • MRS Manager TCP port (9022)

    For reference:

    Figure 4 Adding an inbound rule
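Before creating the connector, you can roughly verify from the compute node's host that the TCP ports above are reachable. The following is a minimal sketch using Python's standard socket module; the host names are placeholders that you must replace with your MRS node IP addresses, and the UDP ports (21731/21732) cannot be checked with a plain TCP connect.

```python
import socket

# Ports that must be reachable from the compute node (per the list above).
# Host values are placeholders; replace them with your MRS node IP addresses.
REQUIRED_TCP_PORTS = [
    ("krb-server-ip", 21730),   # KrbServer (TCP; UDP 21731/21732 not probed here)
    ("zookeeper-ip", 2181),     # ZooKeeper
    ("hive-server-ip", 10000),  # HiveServer
    ("mrs-manager-ip", 9022),   # MRS Manager
]

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Usage:
#   for host, port in REQUIRED_TCP_PORTS:
#       print(host, port, tcp_port_open(host, port))
```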

(Optional) Preparing an RDS (MySQL) Data Source

If your data is to be published to TICS through RDS (MySQL), prepare the RDS (MySQL) data source in advance.

The JDBC data source supports connections to both native MySQL and RDS (MySQL). To prepare data with RDS (MySQL):

  1. Purchase the RDS service by following the steps in "Buying an RDS (MySQL) DB Instance". The VPC of the RDS service must be the same VPC where the compute node is deployed.

    Parameter configuration precautions:

    • "Region" must be the same region as the CCE cluster to be created later.
    • "VPC" must be the same VPC as the CCE cluster.
    • "Security Group": you are advised to use the same security group and open the database port to nodes in that group.
    • "SSL connections" are not supported yet; do not enable SSL.

  2. Prepare the database data and an access user by following "Creating a Database and a User". Note that the access user must have access permissions on the target databases and tables.
  3. Import the data into the RDS databases and tables.
  4. Go to the RDS instance and choose "Connection Management" > "Security Group Rules" to configure the security group. Make sure the database port is open to the compute node.
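Before publishing, you can confirm that the access user actually reaches the target table. The sketch below is a minimal illustration, not part of TICS: the helper names, hostnames, and credentials are made up, and the connection itself would require a third-party MySQL client such as pymysql.

```python
# Sketch: verify the RDS (MySQL) access user can reach the target table.
# All names below (tics_user, mydb, mytable, host) are placeholders.

def grants_query(user: str, host: str = "%") -> str:
    """Build the SHOW GRANTS statement for the given user."""
    return f"SHOW GRANTS FOR '{user}'@'{host}'"

def check_access(conn, database: str, table: str) -> bool:
    """Return True if a trivial SELECT on database.table succeeds."""
    try:
        with conn.cursor() as cur:
            cur.execute(f"SELECT 1 FROM `{database}`.`{table}` LIMIT 1")
            cur.fetchall()
        return True
    except Exception:
        return False

# Usage (requires `pip install pymysql`; values are placeholders):
#   import pymysql
#   conn = pymysql.connect(host="rds-instance-ip", port=3306,
#                          user="tics_user", password="***")
#   print(grants_query("tics_user"))
#   print(check_access(conn, "mydb", "mytable"))
```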

(Optional) Preparing a DWS Data Source

If your data is to be published to TICS through DWS, prepare the DWS data source in advance.

The JDBC data source supports connections to DWS (GaussDB SQL). Currently, only DWS data sources whose default database is postgres are supported. To prepare data with DWS (GaussDB SQL):

  1. Purchase the DWS service, select a data warehouse whose default database is postgres, and create a DWS cluster by following "Creating a DWS Cluster".

    Parameter configuration precautions:

    • "Security Group": you are advised to create a security group automatically, or select the same security group as the compute node and open the database port to nodes in that group.
    • "SSL connections" are not supported yet; do not enable SSL.
    • Purchase "public network access" based on your actual bandwidth requirements.

  2. Prepare the database data and an access user. Note that the access user must have access permissions on the target databases and tables.
  3. Import the data into the DWS databases and tables.
  4. Choose "DWS instance > Basic Information > Network > Security Group" and check the security group configuration. Make sure the database port is open to the compute node.
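DWS speaks the PostgreSQL protocol, so a connection check analogous to the MySQL one can be sketched with a libpq-style DSN targeting the default postgres database. The helper below is an illustration under assumptions: the host and credentials are placeholders, and actually connecting would require a PostgreSQL client such as psycopg2. sslmode=disable matches the "SSL not supported" note above.

```python
def dws_dsn(host, port=8000, user="dbadmin", password="***", dbname="postgres"):
    """Build a libpq-style DSN for a DWS cluster (default database: postgres).

    Host, user, and password here are placeholders; 8000 is the default
    DWS database port.
    """
    return (f"host={host} port={port} dbname={dbname} "
            f"user={user} password={password} sslmode=disable")

# Usage (requires `pip install psycopg2-binary`; values are placeholders):
#   import psycopg2
#   conn = psycopg2.connect(dws_dsn("dws-cluster-ip"))
```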

Preparing Local Horizontal Federated Data Resources

  1. Upload dataset files (job participants)

    Upload the dataset files to the compute node's mount path so that the scripts executed by the compute node can read them. For host mounting, upload them to the mount path on the host machine. For OBS mounting, use the Huawei Cloud Object Storage Service and upload them to the OBS bucket used by the current compute node.

    Figure 5 OBS bucket name

    The following uses host mounting as an example:

    1. Create a host-mounted compute node Agent1 with the mount path /tmp/tics1/.
    2. Use a file upload tool to upload the dataset folder containing the dataset iris1.csv to the /tmp/tics1/ directory on the host machine.
      Content of iris1.csv:
      sepal_length,sepal_width,petal_length,petal_width,class
      5.1,3.5,1.4,0.3,Iris-setosa
      5.7,3.8,1.7,0.3,Iris-setosa
      5.1,3.8,1.5,0.3,Iris-setosa
      5.4,3.4,1.7,0.2,Iris-setosa
      5.1,3.7,1.5,0.4,Iris-setosa
      4.6,3.6,1,0.2,Iris-setosa
      5.1,3.3,1.7,0.5,Iris-setosa
      4.8,3.4,1.9,0.2,Iris-setosa
      5,3,1.6,0.2,Iris-setosa
      5,3.4,1.6,0.4,Iris-setosa
      5.2,3.5,1.5,0.2,Iris-setosa
      5.2,3.4,1.4,0.2,Iris-setosa
      4.7,3.2,1.6,0.2,Iris-setosa
      4.8,3.1,1.6,0.2,Iris-setosa
      5.4,3.4,1.5,0.4,Iris-setosa
      5.2,4.1,1.5,0.1,Iris-setosa
      5.5,4.2,1.4,0.2,Iris-setosa
      4.9,3.1,1.5,0.1,Iris-setosa
      5,3.2,1.2,0.2,Iris-setosa
      5.5,3.5,1.3,0.2,Iris-setosa
      4.9,3.1,1.5,0.1,Iris-setosa
      4.4,3,1.3,0.2,Iris-setosa
      5.1,3.4,1.5,0.2,Iris-setosa
      5,3.5,1.3,0.3,Iris-setosa
      4.5,2.3,1.3,0.3,Iris-setosa
      4.4,3.2,1.3,0.2,Iris-setosa
      5,3.5,1.6,0.6,Iris-setosa
      5.1,3.8,1.9,0.4,Iris-setosa
      4.8,3,1.4,0.3,Iris-setosa
      5.1,3.8,1.6,0.2,Iris-setosa
      4.6,3.2,1.4,0.2,Iris-setosa
      5.3,3.7,1.5,0.2,Iris-setosa
      5,3.3,1.4,0.2,Iris-setosa
      6.8,2.8,4.8,1.4,Iris-versicolor
      6.7,3,5,1.7,Iris-versicolor
      6,2.9,4.5,1.5,Iris-versicolor
      5.7,2.6,3.5,1,Iris-versicolor
      5.5,2.4,3.8,1.1,Iris-versicolor
      5.5,2.4,3.7,1,Iris-versicolor
      5.8,2.7,3.9,1.2,Iris-versicolor
      6,2.7,5.1,1.6,Iris-versicolor
      5.4,3,4.5,1.5,Iris-versicolor
      6,3.4,4.5,1.6,Iris-versicolor
      6.7,3.1,4.7,1.5,Iris-versicolor
      6.3,2.3,4.4,1.3,Iris-versicolor
      5.6,3,4.1,1.3,Iris-versicolor
      5.5,2.5,4,1.3,Iris-versicolor
      5.5,2.6,4.4,1.2,Iris-versicolor
      6.1,3,4.6,1.4,Iris-versicolor
      5.8,2.6,4,1.2,Iris-versicolor
      5,2.3,3.3,1,Iris-versicolor
      5.6,2.7,4.2,1.3,Iris-versicolor
      5.7,3,4.2,1.2,Iris-versicolor
      5.7,2.9,4.2,1.3,Iris-versicolor
      6.2,2.9,4.3,1.3,Iris-versicolor
      5.1,2.5,3,1.1,Iris-versicolor
      5.7,2.8,4.1,1.3,Iris-versicolor
      6.3,3.3,6,2.5,Iris-virginica
      5.8,2.7,5.1,1.9,Iris-virginica
      7.1,3,5.9,2.1,Iris-virginica
      6.3,2.9,5.6,1.8,Iris-virginica
      6.5,3,5.8,2.2,Iris-virginica
      7.6,3,6.6,2.1,Iris-virginica
      4.9,2.5,4.5,1.7,Iris-virginica
      7.3,2.9,6.3,1.8,Iris-virginica
      6.7,2.5,5.8,1.8,Iris-virginica
      7.2,3.6,6.1,2.5,Iris-virginica
      6.5,3.2,5.1,2,Iris-virginica
      6.4,2.7,5.3,1.9,Iris-virginica
      6.8,3,5.5,2.1,Iris-virginica
      5.7,2.5,5,2,Iris-virginica
      5.8,2.8,5.1,2.4,Iris-virginica
      6.4,3.2,5.3,2.3,Iris-virginica
      6.5,3,5.5,1.8,Iris-virginica
      7.7,3.8,6.7,2.2,Iris-virginica
      7.7,2.6,6.9,2.3,Iris-virginica
      6,2.2,5,1.5,Iris-virginica
      6.9,3.2,5.7,2.3,Iris-virginica
      5.6,2.8,4.9,2,Iris-virginica
      7.7,2.8,6.7,2,Iris-virginica
      6.3,2.7,4.9,1.8,Iris-virginica
      6.7,3.3,5.7,2.1,Iris-virginica
      7.2,3.2,6,1.8,Iris-virginica
    3. To give the compute node program inside the container permission to read the files, run chown -R 1000:1000 /tmp/tics1/ to change the owner and group of the files in the mount directory to 1000:1000.
    4. On a second host, create compute node Agent2 with the mount path /tmp/tics2/. Upload the dataset folder containing the dataset iris2.csv to the host directory and change the file ownership in the same way.
      Content of iris2.csv:
      sepal_length,sepal_width,petal_length,petal_width,class
      5.1,3.5,1.4,0.2,Iris-setosa
      4.9,3,1.4,0.2,Iris-setosa
      4.7,3.2,1.3,0.2,Iris-setosa
      4.6,3.1,1.5,0.2,Iris-setosa
      5,3.6,1.4,0.2,Iris-setosa
      5.4,3.9,1.7,0.4,Iris-setosa
      4.6,3.4,1.4,0.3,Iris-setosa
      5,3.4,1.5,0.2,Iris-setosa
      4.4,2.9,1.4,0.2,Iris-setosa
      4.9,3.1,1.5,0.1,Iris-setosa
      5.4,3.7,1.5,0.2,Iris-setosa
      4.8,3.4,1.6,0.2,Iris-setosa
      4.8,3,1.4,0.1,Iris-setosa
      4.3,3,1.1,0.1,Iris-setosa
      5.8,4,1.2,0.2,Iris-setosa
      5.7,4.4,1.5,0.4,Iris-setosa
      5.4,3.9,1.3,0.4,Iris-setosa
      7,3.2,4.7,1.4,Iris-versicolor
      6.4,3.2,4.5,1.5,Iris-versicolor
      6.9,3.1,4.9,1.5,Iris-versicolor
      5.5,2.3,4,1.3,Iris-versicolor
      6.5,2.8,4.6,1.5,Iris-versicolor
      5.7,2.8,4.5,1.3,Iris-versicolor
      6.3,3.3,4.7,1.6,Iris-versicolor
      4.9,2.4,3.3,1,Iris-versicolor
      6.6,2.9,4.6,1.3,Iris-versicolor
      5.2,2.7,3.9,1.4,Iris-versicolor
      5,2,3.5,1,Iris-versicolor
      5.9,3,4.2,1.5,Iris-versicolor
      6,2.2,4,1,Iris-versicolor
      6.1,2.9,4.7,1.4,Iris-versicolor
      5.6,2.9,3.6,1.3,Iris-versicolor
      6.7,3.1,4.4,1.4,Iris-versicolor
      5.6,3,4.5,1.5,Iris-versicolor
      5.8,2.7,4.1,1,Iris-versicolor
      6.2,2.2,4.5,1.5,Iris-versicolor
      5.6,2.5,3.9,1.1,Iris-versicolor
      5.9,3.2,4.8,1.8,Iris-versicolor
      6.1,2.8,4,1.3,Iris-versicolor
      6.3,2.5,4.9,1.5,Iris-versicolor
      6.1,2.8,4.7,1.2,Iris-versicolor
      6.4,2.9,4.3,1.3,Iris-versicolor
      6.6,3,4.4,1.4,Iris-versicolor
      6.8,2.8,4.8,1.4,Iris-versicolor
      6.2,2.8,4.8,1.8,Iris-virginica
      6.1,3,4.9,1.8,Iris-virginica
      6.4,2.8,5.6,2.1,Iris-virginica
      7.2,3,5.8,1.6,Iris-virginica
      7.4,2.8,6.1,1.9,Iris-virginica
      7.9,3.8,6.4,2,Iris-virginica
      6.4,2.8,5.6,2.2,Iris-virginica
      6.3,2.8,5.1,1.5,Iris-virginica
      6.1,2.6,5.6,1.4,Iris-virginica
      7.7,3,6.1,2.3,Iris-virginica
      6.3,3.4,5.6,2.4,Iris-virginica
      6.4,3.1,5.5,1.8,Iris-virginica
      6,3,4.8,1.8,Iris-virginica
      6.9,3.1,5.4,2.1,Iris-virginica
      6.7,3.1,5.6,2.4,Iris-virginica
      6.9,3.1,5.1,2.3,Iris-virginica
      5.8,2.7,5.1,1.9,Iris-virginica
      6.8,3.2,5.9,2.3,Iris-virginica
      6.7,3.3,5.7,2.5,Iris-virginica
      6.7,3,5.2,2.3,Iris-virginica
      6.3,2.5,5,1.9,Iris-virginica
      6.5,3,5.2,2,Iris-virginica
      6.2,3.4,5.4,2.3,Iris-virginica
      5.9,3,5.1,1.8,Iris-virginica
  2. Prepare the model file and initial weights (job initiator)

    The job initiator needs to provide the model and, optionally, the initial weights, upload them to the mount directory of Agent1, and run chown -R 1000:1000 /tmp/tics1/ to change the owner and group of the files in the mount directory.

    Create the model file with Python code and save it as the binary file model.h5. Taking the iris dataset as an example, the following code generates the model:

    import tensorflow as tf
    import keras
     
    model = keras.Sequential([
        keras.layers.Dense(4, activation=tf.nn.relu, input_shape=(4,)),
        keras.layers.Dense(6, activation=tf.nn.relu),
        keras.layers.Dense(3, activation='softmax')
    ])
     
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.save("d:/model.h5")

    The initial weights are an array of floating-point numbers matching the model. The result result_1 produced by federated learning training can be used as the initial weights, for example:

    -0.23300957679748535,0.7804553508758545,0.0064492723904550076,0.5866460800170898,0.676144003868103,-0.7883696556091309,0.5472091436386108,-0.20961782336235046,0.58524489402771,-0.5079598426818848,-0.47474920749664307,-0.3519996106624603,-0.10822880268096924,-0.5457949042320251,-0.28117161989212036,-0.7369481325149536,-0.04728877171874046,0.003856887575238943,0.051739662885665894,0.033792052417993546,-0.31878742575645447,0.7511205673217773,0.3158722519874573,-0.7290999293327332,0.7187696695327759,0.09846954792737961,-0.06735057383775711,0.7165604829788208,-0.730293869972229,0.4473201036453247,-0.27151209115982056,-0.6971480846405029,0.7360773086547852,0.819558322429657,0.4984433054924011,0.05300116539001465,-0.6597640514373779,0.7849202156066895,0.6896201372146606,0.11731931567192078,-0.5380218029022217,0.18895208835601807,-0.18693888187408447,0.357051283121109,0.05440644919872284,0.042556408792734146,-0.04341210797429085,0.0,-0.04367709159851074,-0.031455427408218384,0.24731603264808655,-0.062861368060112,-0.4265706539154053,0.32981523871421814,-0.021271884441375732,0.15228557586669922,0.1818728893995285,0.4162319302558899,-0.22432318329811096,0.7156463861465454,-0.13709741830825806,0.7237883806228638,-0.5489991903305054,0.47034209966659546,-0.04692812263965607,0.7690137028694153,0.40263476967811584,-0.4405142068862915,0.016018997877836227,-0.04845477640628815,0.037553105503320694
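    When you supply initial weights, the length of the flat array has to match the model's parameter count. As a sanity check (plain arithmetic, not a TICS API), the count for the Dense(4) → Dense(6) → Dense(3) iris model defined above can be computed by hand:

```python
# Parameter count of the iris model defined above:
# Dense(4) on 4 inputs, then Dense(6), then Dense(3).
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus biases (n_out) of one Dense layer."""
    return n_in * n_out + n_out

layer_sizes = [(4, 4), (4, 6), (6, 3)]  # (inputs, units) per Dense layer
total = sum(dense_params(i, o) for i, o in layer_sizes)
print(total)  # 20 + 30 + 21 = 71 floats expected in the flat weight array
```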
  3. Write the training script (job initiator)

    The job initiator also needs to write the federated learning training script, implementing the logic for loading data, training the model, evaluating the model, and collecting the evaluation metrics. The compute node passes the path attribute from the dataset configuration file to the training script as a parameter.

    The JobParam attributes are as follows:

    class JobParam:
        """Training script parameters
        """
        # Job ID
        job_id = ''
        # Current round
        round = 0
        # Number of epochs
        epoch = 0
        # Model file path
        model_file = ''
        # Dataset path
        dataset_path = ''
        # Whether to run evaluation only
        eval_only = False
        # Weights file
        weights_file = ''
        # Output path
        output = ''
        # Other parameters as a JSON string
        param = ''

    A sample iris training script, iris_train.py, is as follows:

    # -*- coding: utf-8 -*-
    
    import getopt
    import sys
    
    import keras
    
    import horizontal.horizontallearning as hl
    
    
    def train():
        # Parse the command-line input
        jobParam = JobParam()
        jobParam.parse_from_command_line()
        job_type = 'evaluation' if jobParam.eval_only else 'training'
        print(f"Starting round {jobParam.round} {job_type}")
    
        # Load the model and set the initial weights
        model = keras.models.load_model(jobParam.model_file)
        hl.set_model_weights(model, jobParam.weights_file)
    
        # Load data, train, and evaluate -- implemented by the user
        print(f"Load data {jobParam.dataset_path}")
        train_x, test_x, train_y, test_y, class_dict = load_data(jobParam.dataset_path)
    
        if not jobParam.eval_only:
            b_size = 1
            model.fit(train_x, train_y, batch_size=b_size, epochs=jobParam.epoch, shuffle=True, verbose=1)
            print(f"Training job [{jobParam.job_id}] finished")
        eval = model.evaluate(test_x, test_y, verbose=0)
        print("Evaluation on test data: loss = %0.6f accuracy = %0.2f%% \n" % (eval[0], eval[1] * 100))
    
        # Save the results in JSON format -- the user collects the evaluation metrics
        result = {}
        result['loss'] = eval[0]
        result['accuracy'] = eval[1]
    
        # Generate the result file
        hl.save_train_result(jobParam, model, result)
    
    
    # Read the CSV dataset and split it into a training set and a test set
    # Parameter CSV_FILE_PATH: path to the CSV file
    def load_data(CSV_FILE_PATH):
        import pandas as pd
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import LabelBinarizer
    
        IRIS = pd.read_csv(CSV_FILE_PATH)
        target_var = 'class'  # target variable
        # Features of the dataset
        features = list(IRIS.columns)
        features.remove(target_var)
        # Classes of the target variable
        Class = IRIS[target_var].unique()
        # Dictionary mapping target classes to indices
        Class_dict = dict(zip(Class, range(len(Class))))
        # Add a target column that encodes the target variable
        IRIS['target'] = IRIS[target_var].apply(lambda x: Class_dict[x])
        # One-hot encode the target variable
        lb = LabelBinarizer()
        lb.fit(list(Class_dict.values()))
        transformed_labels = lb.transform(IRIS['target'])
        y_bin_labels = []  # one-hot encoded label columns for the multi-class target
        for i in range(transformed_labels.shape[1]):
            y_bin_labels.append('y' + str(i))
            IRIS['y' + str(i)] = transformed_labels[:, i]
        # Split the dataset into a training set and a test set
        train_x, test_x, train_y, test_y = train_test_split(IRIS[features], IRIS[y_bin_labels],
                                                            train_size=0.7, test_size=0.3, random_state=0)
        return train_x, test_x, train_y, test_y, Class_dict
    
    
    class JobParam:
        """Training script parameters
        """
        # required parameters
        job_id = ''
        round = 0
        epoch = 0
        model_file = ''
        dataset_path = ''
        eval_only = False
    
        # optional parameters
        weights_file = ''
        output = ''
        param = ''
    
        def parse_from_command_line(self):
            """Parse job parameters from the command line
            """
            opts, args = getopt.getopt(sys.argv[1:], 'hn:w:',
                                       ['round=', 'epoch=', 'model_file=', 'eval_only', 'dataset_path=',
                                        'weights_file=', 'output=', 'param=', 'job_id='])
            for key, value in opts:
                if key in ['--round']:
                    self.round = int(value)
                if key in ['--epoch']:
                    self.epoch = int(value)
                if key in ['--model_file']:
                    self.model_file = value
                if key in ['--eval_only']:
                    self.eval_only = True
                if key in ['--dataset_path']:
                    self.dataset_path = value
                if key in ['--weights_file']:
                    self.weights_file = value
                if key in ['--output']:
                    self.output = value
                if key in ['--param']:
                    self.param = value
                if key in ['--job_id']:
                    self.job_id = value
    
    
    if __name__ == '__main__':
        train()
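The command line that the compute node passes to the script can be exercised locally. The sketch below re-implements the same getopt parsing used by parse_from_command_line above on a sample argument list; the argument values are illustrative, not from a real job.

```python
import getopt

# Sample command line (values are illustrative, not from a real job).
argv = ['--job_id', 'job-001', '--round', '3', '--epoch', '5',
        '--model_file', '/tmp/tics1/model.h5',
        '--dataset_path', '/tmp/tics1/dataset/iris1.csv',
        '--output', '/tmp/tics1/out']

# Same option spec as parse_from_command_line in iris_train.py.
opts, args = getopt.getopt(argv, 'hn:w:',
                           ['round=', 'epoch=', 'model_file=', 'eval_only',
                            'dataset_path=', 'weights_file=', 'output=',
                            'param=', 'job_id='])
parsed = dict(opts)
print(parsed['--job_id'], parsed['--round'], parsed['--epoch'])
```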

Preparing Local Vertical Federated Data Resources

In vertical federated learning, the data parties are divided into the label party (the party whose dataset contains the label column) and feature parties (parties whose datasets contain no label column). Currently, only text files in CSV format are supported. Unless otherwise specified, the files in the following examples are CSV files.

For example, the label party has 12 records, each with one ID column, one feature column, and one label column:
ID,f1,LABEL
ff4e60d87b394f7189657d1c9392c8cb,1,0
47a45426fb7d47d08fe1bbca3ce63f46,1,0
78f6c1bf399942c7ae740d360557e638,2,1
3204f44517ba4dddba6638435aee1346,3,0
d695e7c3058a476a9bf1e8188c5e340b,5,0
69b38278dab647df99b246db7d772589,8,1
2cf720857caa4a19a7a0b18016f98dc0,13,0
3a2bb99f47154187a58c17f0ac5a2b96,21,1
3bbd81ded02a43f990b7a4324d1ac116,34,0
4a022350bfe7460797110b8a840736a9,55,1
eb12bdcb69d14b09ac81f9b99b6e5579,89,0
bbff1d6dfc854ad9b3a8b73d58421aef,144,0
The feature party has 20 records, each with one ID column and 14 feature columns:
ID,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14
041b93f8bea04b57af81cf5edb3bde9c,9047897,51862,84434,58990796,51,99063,250853443,82572689,40,98527709,9244,0,446726236,964
ff4e60d87b394f7189657d1c9392c8cb,450,5415312,63,1129630,675,39751,5680577,21919201,198604,962046,644,510927,49,9116025
47a45426fb7d47d08fe1bbca3ce63f46,999742,822275264,969459,56054153,344,16406,638428,2233,53814,374238,691,4,96,0137
78f6c1bf399942c7ae740d360557e638,458467249,10909614,518,679148004,682,90,0,2,4530,15,2927238,975193,8,14
3204f44517ba4dddba6638435aee1346,2967,8357,905,291746,494260,753289592,5050,60149,701,2,78,88,6,394
d695e7c3058a476a9bf1e8188c5e340b,698151,56030950,288516,138,62594,820260,369101761,6199,499700165,490877269,6478535,7698306,941142,8518818
69b38278dab647df99b246db7d772589,88,66,1020893,3746457,54246917,6229,4,3,76,4,95153694,2327597,689472,463
2cf720857caa4a19a7a0b18016f98dc0,110927262,12,3702556,1,54186,783742,188948,5,1622726,89645,927,6748821,0082130,2883108
b36d1e4766f5477f9ff611820b9f15b2,4,20993,940,303957,943339669,06,6926836,50,3,1,1,761416739,282541036,2
a83632f7b9cc4216adc564e964b52537,143707,3382707,662392417,309248525,5787916,9,35921,2422627,00883996,560,5896678,64703,14,45373760
fd8e7802304a4da6abeff908e4337040,89,09,66,302037,10252,8668994,28713,91765,0650,778656553,8,130448,387,53
910483041de347279ccb1683cd185349,76,466,364727097,55864362,676210583,640092992,92890595,4,08091626,704,59781940,698,730292,480561519
d480251c37a341a0a125af63e45228eb,6,080095,6339,795858545,4530,370,8,668750969,79636,8,0397,42456714,7876007,829482100
ba010104797147b99ef52f254d6a2067,506,60074,12,77112,44677416,3854400,0056,87723,27,0,5641,7454,313,507
97ebde6e9fc14671816bd63b1b6ff105,56745940,013,7681257,6777,5623,116,0,0952,320,08084066,315775,572,288269113,456933
3a2bb99f47154187a58c17f0ac5a2b96,429214373,5703,2057,54815522,7,8791,74033124,8574,8589778,59419948,1,704729913,9,3710987
3bbd81ded02a43f990b7a4324d1ac116,2323,76,9,48099469,987,42,958965,08975,9,81949,903,737744876,401664160,966763
4a022350bfe7460797110b8a840736a9,034,6251413,24058,405,8,5519,74026810,187613940,996638433,73244,76606425,555363,16938548,74975
eb12bdcb69d14b09ac81f9b99b6e5579,0,21833,5595812,150317,9968,70922486,9423136,814866054,825430087,56925,83542247,96954,13878,055823844
bbff1d6dfc854ad9b3a8b73d58421aef,32722,05646618,1079,94836,05099464,099201,2305157,2422,685,310,7729507,51,97,396413

Referring to Preparing Local Horizontal Federated Data Resources > Upload dataset files, upload the files to the mount paths of two different compute nodes. This completes the vertical federated dataset configuration.
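During sample alignment, only records whose IDs appear on both sides can be used. As a rough illustration of the idea (with short made-up IDs rather than the real hashes above), the overlap is a set intersection of the two ID columns:

```python
# Toy sample alignment: intersect the ID columns of the two parties.
# IDs are shortened placeholders, not the real hashes from the example above.
label_ids = {"id01", "id02", "id03", "id04"}            # label party
feature_ids = {"id02", "id03", "id04", "id05", "id06"}  # feature party

aligned = sorted(label_ids & feature_ids)
print(aligned)  # records both parties can train on
```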

If a dataset file does not contain a CSV header, you must provide an additional configuration file that describes each column of the dataset. Taking the label party dataset above as an example, a headerless dataset file and its data configuration file look as follows:

Dataset file:
041b93f8bea04b57af81cf5edb3bde9c,9047897,51862,84434,58990796,51,99063,250853443,82572689,40,98527709,9244,0,446726236,964
ff4e60d87b394f7189657d1c9392c8cb,450,5415312,63,1129630,675,39751,5680577,21919201,198604,962046,644,510927,49,9116025
47a45426fb7d47d08fe1bbca3ce63f46,999742,822275264,969459,56054153,344,16406,638428,2233,53814,374238,691,4,96,0137
78f6c1bf399942c7ae740d360557e638,458467249,10909614,518,679148004,682,90,0,2,4530,15,2927238,975193,8,14
3204f44517ba4dddba6638435aee1346,2967,8357,905,291746,494260,753289592,5050,60149,701,2,78,88,6,394
d695e7c3058a476a9bf1e8188c5e340b,698151,56030950,288516,138,62594,820260,369101761,6199,499700165,490877269,6478535,7698306,941142,8518818
69b38278dab647df99b246db7d772589,88,66,1020893,3746457,54246917,6229,4,3,76,4,95153694,2327597,689472,463
2cf720857caa4a19a7a0b18016f98dc0,110927262,12,3702556,1,54186,783742,188948,5,1622726,89645,927,6748821,0082130,2883108
b36d1e4766f5477f9ff611820b9f15b2,4,20993,940,303957,943339669,06,6926836,50,3,1,1,761416739,282541036,2
a83632f7b9cc4216adc564e964b52537,143707,3382707,662392417,309248525,5787916,9,35921,2422627,00883996,560,5896678,64703,14,45373760
fd8e7802304a4da6abeff908e4337040,89,09,66,302037,10252,8668994,28713,91765,0650,778656553,8,130448,387,53
910483041de347279ccb1683cd185349,76,466,364727097,55864362,676210583,640092992,92890595,4,08091626,704,59781940,698,730292,480561519
d480251c37a341a0a125af63e45228eb,6,080095,6339,795858545,4530,370,8,668750969,79636,8,0397,42456714,7876007,829482100
ba010104797147b99ef52f254d6a2067,506,60074,12,77112,44677416,3854400,0056,87723,27,0,5641,7454,313,507
97ebde6e9fc14671816bd63b1b6ff105,56745940,013,7681257,6777,5623,116,0,0952,320,08084066,315775,572,288269113,456933
3a2bb99f47154187a58c17f0ac5a2b96,429214373,5703,2057,54815522,7,8791,74033124,8574,8589778,59419948,1,704729913,9,3710987
3bbd81ded02a43f990b7a4324d1ac116,2323,76,9,48099469,987,42,958965,08975,9,81949,903,737744876,401664160,966763
4a022350bfe7460797110b8a840736a9,034,6251413,24058,405,8,5519,74026810,187613940,996638433,73244,76606425,555363,16938548,74975
eb12bdcb69d14b09ac81f9b99b6e5579,0,21833,5595812,150317,9968,70922486,9423136,814866054,825430087,56925,83542247,96954,13878,055823844
bbff1d6dfc854ad9b3a8b73d58421aef,32722,05646618,1079,94836,05099464,099201,2305157,2422,685,310,7729507,51,97,396413

Configuration file (.json):

{
  "schema": [
    {
      "name": "ID",
      "type": "STRING",
      "label_type": "UNIQUE_ID"
    },
    {
      "name": "f1",
      "type": "FLOAT",
      "label_type": "FEATURE"
    },
    {
      "name": "LABEL",
      "type": "INTEGER",
      "label_type": "LABEL",
      "description": "this is a label column"
    }
  ]
}
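The schema file can be sanity-checked before upload, for example confirming that it parses and that the expected label_type values are present. A minimal sketch with the schema above inlined (the check itself is an illustration, not a TICS validation rule):

```python
import json

# The column schema from the configuration file above, inlined.
schema_json = """
{
  "schema": [
    {"name": "ID", "type": "STRING", "label_type": "UNIQUE_ID"},
    {"name": "f1", "type": "FLOAT", "label_type": "FEATURE"},
    {"name": "LABEL", "type": "INTEGER", "label_type": "LABEL",
     "description": "this is a label column"}
  ]
}
"""

cfg = json.loads(schema_json)
columns = [c["name"] for c in cfg["schema"]]
label_types = {c["label_type"] for c in cfg["schema"]}

# A label-party schema should carry a unique ID column and a label column.
assert "UNIQUE_ID" in label_types and "LABEL" in label_types
print(columns)
```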

If a dataset file does not contain IDs, the dataset cannot be used for sample alignment. In addition, feature selection, federated training, and evaluation check whether the feature party and the label party have the same number of records; if the counts differ, the job fails. You can provide an additional data ID file that specifies the ID of each row. Taking the label dataset above as an example, a dataset file with a header but without IDs and its data ID file look as follows:

Dataset file content:
f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14
9047897,51862,84434,58990796,51,99063,250853443,82572689,40,98527709,9244,0,446726236,964
450,5415312,63,1129630,675,39751,5680577,21919201,198604,962046,644,510927,49,9116025
999742,822275264,969459,56054153,344,16406,638428,2233,53814,374238,691,4,96,0137
458467249,10909614,518,679148004,682,90,0,2,4530,15,2927238,975193,8,14
2967,8357,905,291746,494260,753289592,5050,60149,701,2,78,88,6,394
698151,56030950,288516,138,62594,820260,369101761,6199,499700165,490877269,6478535,7698306,941142,8518818
88,66,1020893,3746457,54246917,6229,4,3,76,4,95153694,2327597,689472,463
110927262,12,3702556,1,54186,783742,188948,5,1622726,89645,927,6748821,0082130,2883108
4,20993,940,303957,943339669,06,6926836,50,3,1,1,761416739,282541036,2
143707,3382707,662392417,309248525,5787916,9,35921,2422627,00883996,560,5896678,64703,14,45373760
89,09,66,302037,10252,8668994,28713,91765,0650,778656553,8,130448,387,53
76,466,364727097,55864362,676210583,640092992,92890595,4,08091626,704,59781940,698,730292,480561519
6,080095,6339,795858545,4530,370,8,668750969,79636,8,0397,42456714,7876007,829482100
506,60074,12,77112,44677416,3854400,0056,87723,27,0,5641,7454,313,507
56745940,013,7681257,6777,5623,116,0,0952,320,08084066,315775,572,288269113,456933
429214373,5703,2057,54815522,7,8791,74033124,8574,8589778,59419948,1,704729913,9,3710987
2323,76,9,48099469,987,42,958965,08975,9,81949,903,737744876,401664160,966763
034,6251413,24058,405,8,5519,74026810,187613940,996638433,73244,76606425,555363,16938548,74975
0,21833,5595812,150317,9968,70922486,9423136,814866054,825430087,56925,83542247,96954,13878,055823844
32722,05646618,1079,94836,05099464,099201,2305157,2422,685,310,7729507,51,97,396413
Data ID file content:
041b93f8bea04b57af81cf5edb3bde9c
ff4e60d87b394f7189657d1c9392c8cb
47a45426fb7d47d08fe1bbca3ce63f46
78f6c1bf399942c7ae740d360557e638
3204f44517ba4dddba6638435aee1346
d695e7c3058a476a9bf1e8188c5e340b
69b38278dab647df99b246db7d772589
2cf720857caa4a19a7a0b18016f98dc0
b36d1e4766f5477f9ff611820b9f15b2
a83632f7b9cc4216adc564e964b52537
fd8e7802304a4da6abeff908e4337040
910483041de347279ccb1683cd185349
d480251c37a341a0a125af63e45228eb
ba010104797147b99ef52f254d6a2067
97ebde6e9fc14671816bd63b1b6ff105
3a2bb99f47154187a58c17f0ac5a2b96
3bbd81ded02a43f990b7a4324d1ac116
4a022350bfe7460797110b8a840736a9
eb12bdcb69d14b09ac81f9b99b6e5579
bbff1d6dfc854ad9b3a8b73d58421aef
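Pairing the two files above amounts to attaching the ID file as an extra column, row by row, so the row order of the two files must match. A dependency-free sketch with tiny stand-in contents (two rows only, mimicking the files above):

```python
import csv
import io

# Stand-in contents (two rows) mimicking the dataset and ID files above.
dataset_text = "f1,f2\n450,5415312\n999742,822275264\n"
id_text = "ff4e60d87b394f7189657d1c9392c8cb\n47a45426fb7d47d08fe1bbca3ce63f46\n"

rows = list(csv.reader(io.StringIO(dataset_text)))
ids = [line.strip() for line in io.StringIO(id_text) if line.strip()]

# Row i of the ID file is the ID of data row i; the header gains an ID column.
header = ["ID"] + rows[0]
joined = [[rid] + row for rid, row in zip(ids, rows[1:])]
print(header)
print(joined[0])
```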