Prepare the data
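First, enterprise A and big data vendor B need to agree on the scope of the data each will provide and on the corresponding metadata. For example, the two parties initially decide to use the existing user conversion data from the most recent three months as the training set and evaluation set for federated training.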
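Enterprise A's data is as follows, with 10 records in total.

Field name | Field type | Description |
---|---|---|
id | string | Hashed phone number string |
col0-col4 | float | Enterprise A's data features |
label | int | Enterprise A's label for the user |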
industry1.csv
id,col0,col1,col2,col3,col4,label
19581e27de7ced00ff1ce50b2047e7a567c76b1cbaebabe5ef03f7c3017bb5b7,-0.823913755,0.787712038,0.429635596,-1.315646486,-1.652321611,1
2c624232cdd221771294dfbb310aca000a0df6ac8b66b696d90ef06fdefb64a3,3.041881096,-0.651684341,3.661649955,0.035548734,3.477873904,0
3fdba35f04dc8c462986c992bcf875546257113072a909c162f7e470e581e278,-1.847252571,0.496981447,1.654416521,-1.945006902,0.394151993,1
4523540f1504cd17100c4835e85b7eefd49911580f8efff0599a8f283be6b9e3,-0.593556893,-0.351750558,0.964512256,-0.017390132,0.092562565,1
4a44dc15364204a80fe80e9039455cc1608281820fe2b24f1e5233ade6af1dd5,0.241505219,-0.219114719,1.51438745,-0.665234511,0.178575706,0
4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a,0.372607556,-0.29194018,0.080862655,0.391501604,-0.012276428,1
4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce,1.544282251,-0.203027285,3.076050022,-0.530666302,2.156693386,0
4ec9599fc203d176a301536c2e091a19bc852759b255bd6818810a42c5fed14a,1.006651366,-0.972403786,1.314115256,0.363296291,5.171128738,0
4fc82b26aecb47d2868c4efbe3581732a3e7cbcc6c2efb32062c08170a05eeb8,-2.859681221,-1.465959913,-0.930994729,-0.773533542,-3.673734138,0
5feceb66ffc86f38d952786c6d696c79c2dbc239dd4e91b46729d73a27fb57e9,-1.409250598,-0.589367921,-4.467396693,1.370376188,-1.2368325,1
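Before handing a file such as industry1.csv to a federated training job, it can help to check it against the agreed schema. The following is a minimal sketch, not part of any platform API: the function name `parse_enterprise_rows`, the inline `SAMPLE` text (two rows copied from the file), and the rule that an id is exactly 64 lowercase hex characters are all assumptions made for illustration.

```python
import csv
import io
import re

# Two rows copied from industry1.csv, kept inline so the sketch is self-contained.
SAMPLE = """id,col0,col1,col2,col3,col4,label
19581e27de7ced00ff1ce50b2047e7a567c76b1cbaebabe5ef03f7c3017bb5b7,-0.823913755,0.787712038,0.429635596,-1.315646486,-1.652321611,1
2c624232cdd221771294dfbb310aca000a0df6ac8b66b696d90ef06fdefb64a3,3.041881096,-0.651684341,3.661649955,0.035548734,3.477873904,0
"""

EXPECTED_HEADER = ["id", "col0", "col1", "col2", "col3", "col4", "label"]
# Assumption: ids are 64 lowercase hex characters, as in the sample rows.
ID_PATTERN = re.compile(r"[0-9a-f]{64}")


def parse_enterprise_rows(text):
    """Parse CSV text into typed rows; raise ValueError on a schema violation."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = []
    for raw in reader:
        line_no = reader.line_num
        # DictReader stores extra cells under the key None and fills
        # missing cells with the value None.
        if None in raw or None in raw.values():
            raise ValueError(f"line {line_no}: wrong number of columns")
        if not ID_PATTERN.fullmatch(raw["id"]):
            raise ValueError(f"line {line_no}: id is not a 64-character hex string")
        features = [float(raw[f"col{i}"]) for i in range(5)]  # ValueError if not numeric
        label = int(raw["label"])  # ValueError if not an integer
        rows.append({"id": raw["id"], "features": features, "label": label})
    return rows


rows = parse_enterprise_rows(SAMPLE)
print(len(rows), rows[0]["label"], rows[0]["features"][0])
```

The same check could be adapted to vendor B's file by changing the expected header to `id,f0,...,f4` and dropping the label conversion.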
Big data vendor B's data is as follows. The stated total is 10 records; the bigdata1.csv listing below shows 9 rows.
Field name | Field type | Description |
---|---|---|
id | string | Hashed phone number string |
f0-f4 | float | Big data vendor B's data features |
bigdata1.csv
id,f0,f1,f2,f3,f4
2c624232cdd221771294dfbb310aca000a0df6ac8b66b696d90ef06fdefb64a3,0.390064223,0.664175034,3.20228741,0.380574513,0.017733811
3fdba35f04dc8c462986c992bcf875546257113072a909c162f7e470e581e278,-0.483250226,0.616586578,3.001851708,2.407914633,0.856369412
4a44dc15364204a80fe80e9039455cc1608281820fe2b24f1e5233ade6af1dd5,-0.070919538,-2.219653517,1.461645551,1.66185096,0.778770954
4ec9599fc203d176a301536c2e091a19bc852759b255bd6818810a42c5fed14a,0.024227451,-1.087235302,3.67470964,-2.420729037,-3.132456573
4fc82b26aecb47d2868c4efbe3581732a3e7cbcc6c2efb32062c08170a05eeb8,-0.771151327,-1.184821181,-0.674077615,-0.379858223,0.158957184
6b51d431df5d7f141cbececcf79edf3dd861c3b4069f0b11661a3eefacbba918,-0.738091802,-1.474822882,2.93475295,-3.763763721,-1.817301398
6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b,-1.216062821,-1.093614452,-1.632396806,0.887601314,-4.40930101
8527a891e224136950ff32ca212b45bc93f69fbb801c3b1ebedac52775f99e61,-0.789268594,1.071733834,3.763254446,-3.760298263,0.49776472
e7f6c011776e8db7cd330b54174fd76f7d0216b612387a5ffcfb81e6f0919683,-2.759963795,0.405262468,1.264947591,1.027350049,1.293868423
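To protect data security, enterprise A and big data vendor B agreed to use the hashed phone number as the unique identifier (the id field) of the existing data, and to use this identifier as the basis for data alignment.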
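The idea of aligning on a hashed id can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure the platform actually runs: the text only says the phone number is "hashed", so SHA-256 is an assumption (the ids in both files are 64 hex characters long, which is consistent with it), the helper names `hash_phone` and `align_ids` are hypothetical, and the phone number is a made-up placeholder. The id lists are copied from industry1.csv and bigdata1.csv.

```python
import hashlib


def hash_phone(phone: str) -> str:
    """Hash a phone number into a 64-character hex id.

    Assumption: SHA-256 over the UTF-8 encoded digits; the source does not
    name the hash function.
    """
    return hashlib.sha256(phone.strip().encode("utf-8")).hexdigest()


def align_ids(ids_a, ids_b):
    """Return the ids present on both sides, sorted for a stable order."""
    return sorted(set(ids_a) & set(ids_b))


# id column of industry1.csv (enterprise A)
ENTERPRISE_A_IDS = [
    "19581e27de7ced00ff1ce50b2047e7a567c76b1cbaebabe5ef03f7c3017bb5b7",
    "2c624232cdd221771294dfbb310aca000a0df6ac8b66b696d90ef06fdefb64a3",
    "3fdba35f04dc8c462986c992bcf875546257113072a909c162f7e470e581e278",
    "4523540f1504cd17100c4835e85b7eefd49911580f8efff0599a8f283be6b9e3",
    "4a44dc15364204a80fe80e9039455cc1608281820fe2b24f1e5233ade6af1dd5",
    "4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a",
    "4e07408562bedb8b60ce05c1decfe3ad16b72230967de01f640b7e4729b49fce",
    "4ec9599fc203d176a301536c2e091a19bc852759b255bd6818810a42c5fed14a",
    "4fc82b26aecb47d2868c4efbe3581732a3e7cbcc6c2efb32062c08170a05eeb8",
    "5feceb66ffc86f38d952786c6d696c79c2dbc239dd4e91b46729d73a27fb57e9",
]

# id column of bigdata1.csv (big data vendor B)
VENDOR_B_IDS = [
    "2c624232cdd221771294dfbb310aca000a0df6ac8b66b696d90ef06fdefb64a3",
    "3fdba35f04dc8c462986c992bcf875546257113072a909c162f7e470e581e278",
    "4a44dc15364204a80fe80e9039455cc1608281820fe2b24f1e5233ade6af1dd5",
    "4ec9599fc203d176a301536c2e091a19bc852759b255bd6818810a42c5fed14a",
    "4fc82b26aecb47d2868c4efbe3581732a3e7cbcc6c2efb32062c08170a05eeb8",
    "6b51d431df5d7f141cbececcf79edf3dd861c3b4069f0b11661a3eefacbba918",
    "6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b",
    "8527a891e224136950ff32ca212b45bc93f69fbb801c3b1ebedac52775f99e61",
    "e7f6c011776e8db7cd330b54174fd76f7d0216b612387a5ffcfb81e6f0919683",
]

common_ids = align_ids(ENTERPRISE_A_IDS, VENDOR_B_IDS)
print(f"{len(common_ids)} ids appear in both files")
for sample_id in common_ids:
    print(sample_id[:8])

# Hypothetical phone number, used only to show the shape of a hashed id.
example_id = hash_phone("13800000000")
print(len(example_id))
```

Both parties have to apply the same normalization and the same hash function to the phone number; otherwise identical numbers produce different ids and the alignment silently finds no match. In this sketch the intersection is computed locally on plain id lists purely for illustration.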