Preparing Data
(Optional) Preparing an MRS Hive Data Source
If your data is to be published to TICS through MRS Hive, you need to prepare an MRS Hive data source in advance.
The procedure for preparing the data is as follows:
- Purchase the MRS service. For details, see the section on creating a cluster. The VPC of the MRS service must be the same VPC in which the compute node is deployed.
Notes:
- The region must be the same as the region of the CCE cluster. Figure 1 Region configuration
- The current MRS Hive connector is supported whether or not Kerberos authentication is enabled.
- The VPC must be the same VPC as the CCE cluster to be created later.
- It is recommended that the nodes be in the same security group, with the necessary ports open to nodes in that group.
- Prepare an MRS Hive user. For details, see the section on preparing a development user. Note that the user must have Hive permissions as well as access permissions on the corresponding databases and tables.
To create a data connection to an MRS security cluster, do not use the Admin user: Admin is the default management console user and cannot serve as the authentication user of a security cluster. You can create a new MRS user as follows:
- Log in to the MRS Manager page with the Admin account.
- In System Settings, click Role Management, choose to add a role, click Hive, click the Hive Read Write Privileges view, and select the read or write permissions on the Hive databases and tables to be published later. Figure 2 Adding role permissions
- Log in to MRS Manager, click User Management under System Settings on the page, and on the user management page add a dedicated user to act as the Kerberos authentication user. Assign a user group and role permissions to this user: select at least the Hive group, and select at least the newly created Hive role. Then complete the user creation as prompted. Figure 3 Creating a user
- Log in to the MRS Manager page with the new user and update the initial password.
- Import the data resources into Hive in MRS. For details, see the description of importing data in Getting Started with Hive.
- Configure the security group. For details, see how to configure a security group.
Security group configuration example
This step ensures that the node where the compute node is deployed can communicate with the MRS cluster to obtain Hive data.
One approach is to place the compute node in the same security group as the master node of the MRS cluster.
Alternatively, configure the security group policy of the MRS cluster to open specific ports to the compute node.
The following IP addresses and ports must be mutually reachable:
- The KrbServer IP address, TCP port 21730, and UDP ports 21731 and 21732
- The ZooKeeper IP address and port (2181)
- The HiveServer IP address and port (10000)
- The MRS Manager TCP port (9022)
For reference:
Figure 4 Adding an inbound rule
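Before creating the connector, you can verify from the deployment node that the TCP ports listed above are reachable. The following is a minimal sketch using only the Python standard library; the host addresses are placeholders that you must replace with the actual IP addresses of your MRS nodes (the UDP ports of KrbServer cannot be probed this way):

```python
import socket

def is_tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder addresses -- replace with the real MRS node IPs.
checks = {
    "KrbServer (TCP 21730)": ("192.0.2.10", 21730),
    "ZooKeeper (2181)": ("192.0.2.11", 2181),
    "HiveServer (10000)": ("192.0.2.12", 10000),
    "MRS Manager (9022)": ("192.0.2.13", 9022),
}
```

Call is_tcp_port_open for each entry in checks from the deployment node; any port that reports unreachable indicates a missing inbound rule.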
(Optional) Preparing an RDS (MySQL) Data Source
If your data is to be published to TICS through RDS (MySQL), you need to prepare an RDS (MySQL) data source in advance.
The JDBC data source supports connections to both native MySQL and RDS (MySQL). The steps for preparing data with RDS (MySQL) are as follows:
- Purchase the RDS service. For details, see the section on purchasing an RDS (MySQL) database instance. The VPC of the RDS service must be the same VPC in which the compute node is deployed.
Notes on parameter configuration:
- The region must be the same as the region of the CCE cluster to be created later.
- The VPC must be the same VPC as the CCE cluster.
- It is recommended that the nodes be in the same security group, with the database port open to nodes in that group.
- Enabling SSL connections is not currently supported.
- Prepare the database data and an access user. For details, see the section on creating databases and users. Note that the access user must have access permissions on the corresponding databases and tables.
- Import the data into the RDS databases and tables.
- Go to , and configure the security group. Ensure that the database port is open to the compute node.
(Optional) Preparing a DWS Data Source
If your data is to be published to TICS through DWS, you need to prepare a DWS data source in advance.
The JDBC data source supports connections to DWS (GaussDB SQL); currently, only DWS data sources whose default database is postgres are supported. The steps for preparing data with DWS (GaussDB SQL) are as follows:
- Purchase the DWS service, select a data warehouse whose default database is postgres, and create a DWS cluster. For details, see the section on creating a DWS cluster.
Notes on parameter configuration:
- For the security group, it is recommended to have one created automatically, or to choose the same security group as the compute node with the database port open to nodes in that group.
- Enabling SSL connections is not currently supported.
- If you purchase public network access, choose the bandwidth based on your actual needs.
- Prepare the database data and an access user. Note that the access user must have access permissions on the corresponding databases and tables.
- Import the data into the DWS databases and tables.
- Go to DWS instance -> Basic Information -> Network -> Security Group and check the security group configuration. Ensure that the database port is open to the compute node.
Preparing Local Horizontal Federated Data Resources
- Upload the dataset file (job participant)
Upload the dataset file to the mount path of the compute node so that the scripts executed by the compute node can read it. For a host mount, upload the file to the mount path on the host machine. For an OBS mount, use the Object Storage Service provided by Huawei Cloud and upload the file to the object bucket used by the current compute node.
Figure 5 Object bucket name
The following uses a host mount as an example:
- Create a compute node Agent1 with a host mount whose mount path is /tmp/tics1/.
- Use a file upload tool to upload the dataset folder containing the dataset iris1.csv to the /tmp/tics1/ directory on the host machine. The content of iris1.csv is as follows:
sepal_length,sepal_width,petal_length,petal_width,class
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5,3,1.6,0.2,Iris-setosa
5,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5,3.3,1.4,0.2,Iris-setosa
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3,5,1.7,Iris-versicolor
6,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6,2.7,5.1,1.6,Iris-versicolor
5.4,3,4.5,1.5,Iris-versicolor
6,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3,4.1,1.3,Iris-versicolor
5.5,2.5,4,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3,4.6,1.4,Iris-versicolor
5.8,2.6,4,1.2,Iris-versicolor
5,2.3,3.3,1,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3,5.8,2.2,Iris-virginica
7.6,3,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3,5.5,2.1,Iris-virginica
5.7,2.5,5,2,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6,2.2,5,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2,Iris-virginica
7.7,2.8,6.7,2,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6,1.8,Iris-virginica
- To ensure that the compute node program inside the container has permission to read the files, run chown -R 1000:1000 /tmp/tics1/ to change the owner and group of the files under the mount directory to 1000:1000.
- On a second host, create a compute node Agent2 with the mount path /tmp/tics2/. Upload the dataset folder containing the dataset iris2.csv to the host directory and change the owner. The content of iris2.csv is as follows:
sepal_length,sepal_width,petal_length,petal_width,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3,1.4,0.1,Iris-setosa
4.3,3,1.1,0.1,Iris-setosa
5.8,4,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
7,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5,2,3.5,1,Iris-versicolor
5.9,3,4.2,1.5,Iris-versicolor
6,2.2,4,1,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6,3,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3,5.2,2.3,Iris-virginica
6.3,2.5,5,1.9,Iris-virginica
6.5,3,5.2,2,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica
5.9,3,5.1,1.8,Iris-virginica
- Prepare the model file and initial weights (job initiator)
The job initiator needs to provide the model and, optionally, initial weights. Upload them to the mount directory of Agent1 and run chown -R 1000:1000 /tmp/tics1/ to change the owner and group of the files under the mount directory.
Use Python code to create the model file and save it as the binary file model.h5. Taking the iris example, generate the following model:
import tensorflow as tf
import keras

model = keras.Sequential([
    keras.layers.Dense(4, activation=tf.nn.relu, input_shape=(4,)),
    keras.layers.Dense(6, activation=tf.nn.relu),
    keras.layers.Dense(3, activation='softmax')
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.save("d:/model.h5")
The initial weights are an array of floating-point numbers matching the model. The result result_1 produced by federated learning training can be used as the initial weights. A sample is as follows:
-0.23300957679748535,0.7804553508758545,0.0064492723904550076,0.5866460800170898,0.676144003868103,-0.7883696556091309,0.5472091436386108,-0.20961782336235046,0.58524489402771,-0.5079598426818848,-0.47474920749664307,-0.3519996106624603,-0.10822880268096924,-0.5457949042320251,-0.28117161989212036,-0.7369481325149536,-0.04728877171874046,0.003856887575238943,0.051739662885665894,0.033792052417993546,-0.31878742575645447,0.7511205673217773,0.3158722519874573,-0.7290999293327332,0.7187696695327759,0.09846954792737961,-0.06735057383775711,0.7165604829788208,-0.730293869972229,0.4473201036453247,-0.27151209115982056,-0.6971480846405029,0.7360773086547852,0.819558322429657,0.4984433054924011,0.05300116539001465,-0.6597640514373779,0.7849202156066895,0.6896201372146606,0.11731931567192078,-0.5380218029022217,0.18895208835601807,-0.18693888187408447,0.357051283121109,0.05440644919872284,0.042556408792734146,-0.04341210797429085,0.0,-0.04367709159851074,-0.031455427408218384,0.24731603264808655,-0.062861368060112,-0.4265706539154053,0.32981523871421814,-0.021271884441375732,0.15228557586669922,0.1818728893995285,0.4162319302558899,-0.22432318329811096,0.7156463861465454,-0.13709741830825806,0.7237883806228638,-0.5489991903305054,0.47034209966659546,-0.04692812263965607,0.7690137028694153,0.40263476967811584,-0.4405142068862915,0.016018997877836227,-0.04845477640628815,0.037553105503320694
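The exact weight-file format consumed by the TICS runtime is internal to the service; as an illustration of how such a flat float array corresponds to the model above, the following sketch (unflatten_weights is a hypothetical helper, not a TICS API) splits a flat vector into per-layer arrays by shape. For the iris model, each Dense layer contributes a kernel and a bias: Dense(4→4), Dense(4→6), and Dense(6→3) give 16+4+24+6+18+3 = 71 values.

```python
import numpy as np

def unflatten_weights(flat, shapes):
    """Split a flat float vector into arrays matching the given layer shapes."""
    arrays, offset = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        chunk = np.asarray(flat[offset:offset + size], dtype=np.float32)
        arrays.append(chunk.reshape(shape))
        offset += size
    if offset != len(flat):
        raise ValueError(f"weight count mismatch: expected {offset}, got {len(flat)}")
    return arrays

# Kernel and bias shapes of the iris model: Dense(4->4), Dense(4->6), Dense(6->3).
iris_shapes = [(4, 4), (4,), (4, 6), (6,), (6, 3), (3,)]
```

The resulting arrays could then be applied with Keras's model.set_weights.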
- Write the training script (job initiator)
The job initiator also needs to write the federated learning training script, implementing the logic for reading data, training the model, evaluating the model, and obtaining the evaluation metrics. The compute node passes the path attribute from the dataset configuration file to the training script as an argument.
The JobParam attributes are as follows:
class JobParam:
    """Training script parameters"""
    # job ID
    job_id = ''
    # current round
    round = 0
    # number of epochs
    epoch = 0
    # model file path
    model_file = ''
    # dataset path
    dataset_path = ''
    # whether to run evaluation only
    eval_only = False
    # weights file
    weights_file = ''
    # output path
    output = ''
    # JSON string of other parameters
    param = ''
A sample iris training script iris_train.py is as follows:
# -*- coding: utf-8 -*-
import getopt
import sys
import keras
import horizontal.horizontallearning as hl


def train():
    # parse the command-line input
    jobParam = JobParam()
    jobParam.parse_from_command_line()
    job_type = 'evaluation' if jobParam.eval_only else 'training'
    print(f"Starting round {jobParam.round} {job_type}")

    # load the model and set the initial weight parameters
    model = keras.models.load_model(jobParam.model_file)
    hl.set_model_weights(model, jobParam.weights_file)

    # load data, train, evaluate -- implemented by the user
    print(f"Load data {jobParam.dataset_path}")
    train_x, test_x, train_y, test_y, class_dict = load_data(jobParam.dataset_path)
    if not jobParam.eval_only:
        b_size = 1
        model.fit(train_x, train_y, batch_size=b_size, epochs=jobParam.epoch, shuffle=True, verbose=1)
        print(f"Training job [{jobParam.job_id}] finished")
    eval = model.evaluate(test_x, test_y, verbose=0)
    print("Evaluation on test data: loss = %0.6f accuracy = %0.2f%% \n" % (eval[0], eval[1] * 100))

    # save the results as JSON -- the user reads the evaluation metrics from here
    result = {}
    result['loss'] = eval[0]
    result['accuracy'] = eval[1]

    # generate the result file
    hl.save_train_result(jobParam, model, result)


# Read the CSV dataset and split it into a training set and a test set.
# The parameter CSV_FILE_PATH is the path of the csv file.
def load_data(CSV_FILE_PATH):
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelBinarizer
    IRIS = pd.read_csv(CSV_FILE_PATH)
    target_var = 'class'  # target variable
    # features of the dataset
    features = list(IRIS.columns)
    features.remove(target_var)
    # classes of the target variable
    Class = IRIS[target_var].unique()
    # dictionary of the target variable classes
    Class_dict = dict(zip(Class, range(len(Class))))
    # add a target column that encodes the target variable
    IRIS['target'] = IRIS[target_var].apply(lambda x: Class_dict[x])
    # one-hot encode the target variable
    lb = LabelBinarizer()
    lb.fit(list(Class_dict.values()))
    transformed_labels = lb.transform(IRIS['target'])
    y_bin_labels = []  # one-hot encoded variables for the multi-class labels
    for i in range(transformed_labels.shape[1]):
        y_bin_labels.append('y' + str(i))
        IRIS['y' + str(i)] = transformed_labels[:, i]
    # split the dataset into a training set and a test set
    train_x, test_x, train_y, test_y = train_test_split(IRIS[features], IRIS[y_bin_labels], train_size=0.7,
                                                        test_size=0.3, random_state=0)
    return train_x, test_x, train_y, test_y, Class_dict


class JobParam:
    """Training script parameters"""
    # required parameters
    job_id = ''
    round = 0
    epoch = 0
    model_file = ''
    dataset_path = ''
    eval_only = False
    # optional parameters
    weights_file = ''
    output = ''
    param = ''

    def parse_from_command_line(self):
        """Parse the job parameters from the command line"""
        opts, args = getopt.getopt(sys.argv[1:], 'hn:w:',
                                   ['round=', 'epoch=', 'model_file=', 'eval_only', 'dataset_path=',
                                    'weights_file=', 'output=', 'param=', 'job_id='])
        for key, value in opts:
            if key in ['--round']:
                self.round = int(value)
            if key in ['--epoch']:
                self.epoch = int(value)
            if key in ['--model_file']:
                self.model_file = value
            if key in ['--eval_only']:
                self.eval_only = True
            if key in ['--dataset_path']:
                self.dataset_path = value
            if key in ['--weights_file']:
                self.weights_file = value
            if key in ['--output']:
                self.output = value
            if key in ['--param']:
                self.param = value
            if key in ['--job_id']:
                self.job_id = value


if __name__ == '__main__':
    train()
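The compute node invokes the training script with long-option arguments, which the getopt call in parse_from_command_line interprets. The following is a minimal sketch of what the parsing sees; the argument values and paths here are illustrative, not an exact record of a real invocation:

```python
import getopt

# Arguments as the compute node might pass them (values are illustrative).
argv = ['--job_id=job-001', '--round=1', '--epoch=5',
        '--model_file=/tmp/tics1/model.h5',
        '--dataset_path=/tmp/tics1/dataset/iris1.csv',
        '--output=/tmp/tics1/out']

# Same option specification as in parse_from_command_line.
opts, args = getopt.getopt(argv, 'hn:w:',
                           ['round=', 'epoch=', 'model_file=', 'eval_only',
                            'dataset_path=', 'weights_file=', 'output=',
                            'param=', 'job_id='])
parsed = dict(opts)
# '--eval_only' takes no value: when it is absent, evaluation-only mode stays off.
```

Note that getopt returns all values as strings, which is why the script converts round and epoch with int().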
Preparing Local Vertical Federated Data Resources
In vertical federated learning, the data parties are divided into the label party (the party whose dataset contains the label column) and the feature party (the party whose dataset does not). Currently, only text files in csv format are supported. Unless otherwise specified, the files in the following examples are in CSV format.
ID,f1,LABEL
ff4e60d87b394f7189657d1c9392c8cb,1,0
47a45426fb7d47d08fe1bbca3ce63f46,1,0
78f6c1bf399942c7ae740d360557e638,2,1
3204f44517ba4dddba6638435aee1346,3,0
d695e7c3058a476a9bf1e8188c5e340b,5,0
69b38278dab647df99b246db7d772589,8,1
2cf720857caa4a19a7a0b18016f98dc0,13,0
3a2bb99f47154187a58c17f0ac5a2b96,21,1
3bbd81ded02a43f990b7a4324d1ac116,34,0
4a022350bfe7460797110b8a840736a9,55,1
eb12bdcb69d14b09ac81f9b99b6e5579,89,0
bbff1d6dfc854ad9b3a8b73d58421aef,144,0
ID,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14
041b93f8bea04b57af81cf5edb3bde9c,9047897,51862,84434,58990796,51,99063,250853443,82572689,40,98527709,9244,0,446726236,964
ff4e60d87b394f7189657d1c9392c8cb,450,5415312,63,1129630,675,39751,5680577,21919201,198604,962046,644,510927,49,9116025
47a45426fb7d47d08fe1bbca3ce63f46,999742,822275264,969459,56054153,344,16406,638428,2233,53814,374238,691,4,96,0137
78f6c1bf399942c7ae740d360557e638,458467249,10909614,518,679148004,682,90,0,2,4530,15,2927238,975193,8,14
3204f44517ba4dddba6638435aee1346,2967,8357,905,291746,494260,753289592,5050,60149,701,2,78,88,6,394
d695e7c3058a476a9bf1e8188c5e340b,698151,56030950,288516,138,62594,820260,369101761,6199,499700165,490877269,6478535,7698306,941142,8518818
69b38278dab647df99b246db7d772589,88,66,1020893,3746457,54246917,6229,4,3,76,4,95153694,2327597,689472,463
2cf720857caa4a19a7a0b18016f98dc0,110927262,12,3702556,1,54186,783742,188948,5,1622726,89645,927,6748821,0082130,2883108
b36d1e4766f5477f9ff611820b9f15b2,4,20993,940,303957,943339669,06,6926836,50,3,1,1,761416739,282541036,2
a83632f7b9cc4216adc564e964b52537,143707,3382707,662392417,309248525,5787916,9,35921,2422627,00883996,560,5896678,64703,14,45373760
fd8e7802304a4da6abeff908e4337040,89,09,66,302037,10252,8668994,28713,91765,0650,778656553,8,130448,387,53
910483041de347279ccb1683cd185349,76,466,364727097,55864362,676210583,640092992,92890595,4,08091626,704,59781940,698,730292,480561519
d480251c37a341a0a125af63e45228eb,6,080095,6339,795858545,4530,370,8,668750969,79636,8,0397,42456714,7876007,829482100
ba010104797147b99ef52f254d6a2067,506,60074,12,77112,44677416,3854400,0056,87723,27,0,5641,7454,313,507
97ebde6e9fc14671816bd63b1b6ff105,56745940,013,7681257,6777,5623,116,0,0952,320,08084066,315775,572,288269113,456933
3a2bb99f47154187a58c17f0ac5a2b96,429214373,5703,2057,54815522,7,8791,74033124,8574,8589778,59419948,1,704729913,9,3710987
3bbd81ded02a43f990b7a4324d1ac116,2323,76,9,48099469,987,42,958965,08975,9,81949,903,737744876,401664160,966763
4a022350bfe7460797110b8a840736a9,034,6251413,24058,405,8,5519,74026810,187613940,996638433,73244,76606425,555363,16938548,74975
eb12bdcb69d14b09ac81f9b99b6e5579,0,21833,5595812,150317,9968,70922486,9423136,814866054,825430087,56925,83542247,96954,13878,055823844
bbff1d6dfc854ad9b3a8b73d58421aef,32722,05646618,1079,94836,05099464,099201,2305157,2422,685,310,7729507,51,97,396413
To complete the vertical federated dataset configuration, upload the files to the mount paths of two different compute nodes as described in Preparing Local Horizontal Federated Data Resources -> Upload the dataset file.
If a dataset file does not contain a csv header, you need to provide an additional configuration file that describes each column of the dataset. Taking the data above as an example, a headerless dataset file and the corresponding data configuration file look as follows:
041b93f8bea04b57af81cf5edb3bde9c,9047897,51862,84434,58990796,51,99063,250853443,82572689,40,98527709,9244,0,446726236,964
ff4e60d87b394f7189657d1c9392c8cb,450,5415312,63,1129630,675,39751,5680577,21919201,198604,962046,644,510927,49,9116025
47a45426fb7d47d08fe1bbca3ce63f46,999742,822275264,969459,56054153,344,16406,638428,2233,53814,374238,691,4,96,0137
78f6c1bf399942c7ae740d360557e638,458467249,10909614,518,679148004,682,90,0,2,4530,15,2927238,975193,8,14
3204f44517ba4dddba6638435aee1346,2967,8357,905,291746,494260,753289592,5050,60149,701,2,78,88,6,394
d695e7c3058a476a9bf1e8188c5e340b,698151,56030950,288516,138,62594,820260,369101761,6199,499700165,490877269,6478535,7698306,941142,8518818
69b38278dab647df99b246db7d772589,88,66,1020893,3746457,54246917,6229,4,3,76,4,95153694,2327597,689472,463
2cf720857caa4a19a7a0b18016f98dc0,110927262,12,3702556,1,54186,783742,188948,5,1622726,89645,927,6748821,0082130,2883108
b36d1e4766f5477f9ff611820b9f15b2,4,20993,940,303957,943339669,06,6926836,50,3,1,1,761416739,282541036,2
a83632f7b9cc4216adc564e964b52537,143707,3382707,662392417,309248525,5787916,9,35921,2422627,00883996,560,5896678,64703,14,45373760
fd8e7802304a4da6abeff908e4337040,89,09,66,302037,10252,8668994,28713,91765,0650,778656553,8,130448,387,53
910483041de347279ccb1683cd185349,76,466,364727097,55864362,676210583,640092992,92890595,4,08091626,704,59781940,698,730292,480561519
d480251c37a341a0a125af63e45228eb,6,080095,6339,795858545,4530,370,8,668750969,79636,8,0397,42456714,7876007,829482100
ba010104797147b99ef52f254d6a2067,506,60074,12,77112,44677416,3854400,0056,87723,27,0,5641,7454,313,507
97ebde6e9fc14671816bd63b1b6ff105,56745940,013,7681257,6777,5623,116,0,0952,320,08084066,315775,572,288269113,456933
3a2bb99f47154187a58c17f0ac5a2b96,429214373,5703,2057,54815522,7,8791,74033124,8574,8589778,59419948,1,704729913,9,3710987
3bbd81ded02a43f990b7a4324d1ac116,2323,76,9,48099469,987,42,958965,08975,9,81949,903,737744876,401664160,966763
4a022350bfe7460797110b8a840736a9,034,6251413,24058,405,8,5519,74026810,187613940,996638433,73244,76606425,555363,16938548,74975
eb12bdcb69d14b09ac81f9b99b6e5579,0,21833,5595812,150317,9968,70922486,9423136,814866054,825430087,56925,83542247,96954,13878,055823844
bbff1d6dfc854ad9b3a8b73d58421aef,32722,05646618,1079,94836,05099464,099201,2305157,2422,685,310,7729507,51,97,396413
Configuration file (.json):
{
    "schema": [
        {
            "name": "ID",
            "type": "STRING",
            "label_type": "UNIQUE_ID"
        },
        {
            "name": "f1",
            "type": "FLOAT",
            "label_type": "FEATURE"
        },
        {
            "name": "LABEL",
            "type": "INTEGER",
            "label_type": "LABEL",
            "description": "this is a label column"
        }
    ]
}
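A small sketch of checking such a configuration file before uploading it. The helper check_schema is hypothetical (not a TICS API), and the allowed type sets below cover only the values appearing in this example; the service may accept additional types:

```python
import json

# Only the values seen in the example above -- the full supported set may be larger.
ALLOWED_TYPES = {"STRING", "FLOAT", "INTEGER"}
ALLOWED_LABEL_TYPES = {"UNIQUE_ID", "FEATURE", "LABEL"}

def check_schema(config_text):
    """Parse a dataset configuration JSON string and validate each column entry."""
    config = json.loads(config_text)
    for col in config["schema"]:
        if "name" not in col:
            raise ValueError("column entry is missing 'name'")
        if col.get("type") not in ALLOWED_TYPES:
            raise ValueError(f"column {col['name']}: unexpected type {col.get('type')}")
        if col.get("label_type") not in ALLOWED_LABEL_TYPES:
            raise ValueError(f"column {col['name']}: unexpected label_type")
    return [col["name"] for col in config["schema"]]
```

The schema entries must appear in the same order as the columns of the headerless dataset file.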
If a dataset file does not contain IDs, that dataset cannot be used for sample alignment. In addition, during feature selection, federated training, and evaluation, the system checks whether the feature party and the label party have the same number of records; if they differ, the job reports an error. You can provide an additional data ID file that specifies the ID of each data row. Taking the dataset above as an example, a dataset file with a header but no IDs and the corresponding data ID file look as follows:
f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14
9047897,51862,84434,58990796,51,99063,250853443,82572689,40,98527709,9244,0,446726236,964
450,5415312,63,1129630,675,39751,5680577,21919201,198604,962046,644,510927,49,9116025
999742,822275264,969459,56054153,344,16406,638428,2233,53814,374238,691,4,96,0137
458467249,10909614,518,679148004,682,90,0,2,4530,15,2927238,975193,8,14
2967,8357,905,291746,494260,753289592,5050,60149,701,2,78,88,6,394
698151,56030950,288516,138,62594,820260,369101761,6199,499700165,490877269,6478535,7698306,941142,8518818
88,66,1020893,3746457,54246917,6229,4,3,76,4,95153694,2327597,689472,463
110927262,12,3702556,1,54186,783742,188948,5,1622726,89645,927,6748821,0082130,2883108
4,20993,940,303957,943339669,06,6926836,50,3,1,1,761416739,282541036,2
143707,3382707,662392417,309248525,5787916,9,35921,2422627,00883996,560,5896678,64703,14,45373760
89,09,66,302037,10252,8668994,28713,91765,0650,778656553,8,130448,387,53
76,466,364727097,55864362,676210583,640092992,92890595,4,08091626,704,59781940,698,730292,480561519
6,080095,6339,795858545,4530,370,8,668750969,79636,8,0397,42456714,7876007,829482100
506,60074,12,77112,44677416,3854400,0056,87723,27,0,5641,7454,313,507
56745940,013,7681257,6777,5623,116,0,0952,320,08084066,315775,572,288269113,456933
429214373,5703,2057,54815522,7,8791,74033124,8574,8589778,59419948,1,704729913,9,3710987
2323,76,9,48099469,987,42,958965,08975,9,81949,903,737744876,401664160,966763
034,6251413,24058,405,8,5519,74026810,187613940,996638433,73244,76606425,555363,16938548,74975
0,21833,5595812,150317,9968,70922486,9423136,814866054,825430087,56925,83542247,96954,13878,055823844
32722,05646618,1079,94836,05099464,099201,2305157,2422,685,310,7729507,51,97,396413
041b93f8bea04b57af81cf5edb3bde9c
ff4e60d87b394f7189657d1c9392c8cb
47a45426fb7d47d08fe1bbca3ce63f46
78f6c1bf399942c7ae740d360557e638
3204f44517ba4dddba6638435aee1346
d695e7c3058a476a9bf1e8188c5e340b
69b38278dab647df99b246db7d772589
2cf720857caa4a19a7a0b18016f98dc0
b36d1e4766f5477f9ff611820b9f15b2
a83632f7b9cc4216adc564e964b52537
fd8e7802304a4da6abeff908e4337040
910483041de347279ccb1683cd185349
d480251c37a341a0a125af63e45228eb
ba010104797147b99ef52f254d6a2067
97ebde6e9fc14671816bd63b1b6ff105
3a2bb99f47154187a58c17f0ac5a2b96
3bbd81ded02a43f990b7a4324d1ac116
4a022350bfe7460797110b8a840736a9
eb12bdcb69d14b09ac81f9b99b6e5579
bbff1d6dfc854ad9b3a8b73d58421aef
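Conceptually, line i of the ID file supplies the identifier for row i of the dataset file. The following minimal stdlib sketch (attach_ids is a hypothetical helper, not a TICS API) pairs the two files up and performs the same record-count check described above:

```python
import csv
import io

def attach_ids(id_text, data_text):
    """Pair each line of the ID file with the corresponding dataset row.

    Raises ValueError if the number of IDs and data rows differ, mirroring
    the row-count check performed before alignment and training.
    """
    ids = [line.strip() for line in id_text.splitlines() if line.strip()]
    reader = csv.reader(io.StringIO(data_text))
    header = next(reader)              # the dataset file keeps its header row
    rows = [row for row in reader if row]
    if len(ids) != len(rows):
        raise ValueError(f"row count mismatch: {len(ids)} ids vs {len(rows)} rows")
    return header, list(zip(ids, rows))
```

This also makes the failure mode concrete: a dataset and ID file with different record counts are rejected, just as a vertical federated job would report an error.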
