步骤1:数据准备
使用DataArts Studio前的准备
如果您是第一次使用DataArts Studio,请参考购买并配置DataArts Studio章节完成注册华为账号、购买DataArts Studio实例、创建工作空间等一系列操作。然后进入到对应的工作空间,即可开始使用DataArts Studio。
数据源准备
本示例演示数据来自:https://grouplens.org/datasets/movielens/100k/,即1000名用户对1700部电影的100,000个评分数据。获取链接中的zip数据包并解压,其中的“u.item”和“u.data”文件分别为电影信息和评分信息。
为方便演示,本示例提供了用于模拟原始数据的部分数据。为了方便将源数据集成到云上,我们需要先将样例数据存储为CSV文件,将CSV文件上传至OBS服务中。
- 创建CSV文件(UTF-8无bom格式),文件名称为对应的数据表名,将后文提供的各样例数据分别复制粘贴到不同CSV文件中,然后保存CSV文件。
以下是Windows下生成.csv文件的办法之一:
- 使用文本编辑工具(例如记事本等)新建一个txt文档,将后文提供的样例数据复制进文档中。注意复制后检查数据的行数及数据分行的正确性(注意,如果是从PDF文档中复制样例数据,单行的数据过长时会产生换行,需手动重新调整为单行)。
- 单击 ,在弹出的对话框中,“保存类型”选择为“所有文件(*.*)”,在“文件名”处输入文件名和.csv后缀,选择 编码格式(不能带BOM),则能以CSV格式保存该文件。
- 将源数据CSV文件上传到OBS服务。
- movies.csv:
movieId,movieTitle,videoReleaseDate,IMDbURL,unknown,Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,FilmNoir,Horror,Musical,Mystery,Romance,SciFi,Thriller,War,Western 1,Toy Story (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Toy%20Story%20(1995),0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0 2,GoldenEye (1995),1-Jan-95,http://us.imdb.com/M/title-exact?GoldenEye%20(1995),0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0 3,Four Rooms (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995),0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0 4,Get Shorty (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995),0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0 5,Copycat (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0 6,Shanghai Triad (Yao a yao yao dao waipo qiao) (1995),1-Jan-95,http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995),0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 7,Twelve Monkeys (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Twelve%20Monkeys%20(1995),0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0 8,Babe (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Babe%20(1995),0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0 9,Dead Man Walking (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Dead%20Man%20Walking%20(1995),0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 10,Richard III (1995),22-Jan-96,http://us.imdb.com/M/title-exact?Richard%20III%20(1995),0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0 11,Seven (Se7en) (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Se7en%20(1995),0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0 12,"Usual Suspects, The (1995)",14-Aug-95,"http://us.imdb.com/M/title-exact?Usual%20Suspects,%20The%20(1995)",0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0 13,Mighty Aphrodite (1995),30-Oct-95,http://us.imdb.com/M/title-exact?Mighty%20Aphrodite%20(1995),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0 14,"Postino, Il (1994)",1-Jan-94,"http://us.imdb.com/M/title-exact?Postino,%20Il%20(1994)",0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0 15,Mr. Holland's Opus (1995),29-Jan-96,http://us.imdb.com/M/title-exact?Mr.%20Holland's%20Opus%20(1995),0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 16,French Twist (Gazon maudit) (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Gazon%20maudit%20(1995),0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0 17,From Dusk Till Dawn (1996),5-Feb-96,http://us.imdb.com/M/title-exact?From%20Dusk%20Till%20Dawn%20(1996),0,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,1,0,0 18,"White Balloon, The (1995)",1-Jan-95,http://us.imdb.com/M/title-exact?Badkonake%20Sefid%20(1995),0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 19,Antonia's Line (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Antonia%20(1995),0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 20,Angels and Insects (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Angels%20and%20Insects%20(1995),0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0 21,Muppet Treasure Island (1996),16-Feb-96,http://us.imdb.com/M/title-exact?Muppet%20Treasure%20Island%20(1996),0,1,1,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0 22,Braveheart (1995),16-Feb-96,http://us.imdb.com/M/title-exact?Braveheart%20(1995),0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0 23,Taxi Driver (1976),16-Feb-96,http://us.imdb.com/M/title-exact?Taxi%20Driver%20(1976),0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0 24,Rumble in the Bronx (1995),23-Feb-96,http://us.imdb.com/M/title-exact?Hong%20Faan%20Kui%20(1995),0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0 25,"Birdcage, The (1996)",8-Mar-96,"http://us.imdb.com/M/title-exact?Birdcage,%20The%20(1996)",0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0 26,"Brothers McMullen, The (1995)",1-Jan-95,"http://us.imdb.com/M/title-exact?Brothers%20McMullen,%20The%20(1995)",0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0 27,Bad Boys (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Bad%20Boys%20(1995),0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 28,Apollo 13 (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Apollo%2013%20(1995),0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0 29,Batman Forever (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Batman%20Forever%20(1995),0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0 30,Belle de jour (1967),1-Jan-67,http://us.imdb.com/M/title-exact?Belle%20de%20jour%20(1967),0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 31,Crimson Tide (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Crimson%20Tide%20(1995),0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0 32,Crumb (1994),1-Jan-94,http://us.imdb.com/M/title-exact?Crumb%20(1994),0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0 33,Desperado (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Desperado%20(1995),0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0 34,"Doom Generation, The (1995)",1-Jan-95,"http://us.imdb.com/M/title-exact?Doom%20Generation,%20The%20(1995)",0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0 35,Free Willy 2: The Adventure Home (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Free%20Willy%202:%20The%20Adventure%20Home%20(1995),0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0 36,Mad Love (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Mad%20Love%20(1995),0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0 37,Nadja (1994),1-Jan-94,http://us.imdb.com/M/title-exact?Nadja%20(1994),0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 38,"Net, The (1995)",1-Jan-95,"http://us.imdb.com/M/title-exact?Net,%20The%20(1995)",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0 39,Strange Days (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Strange%20Days%20(1995),0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0 40,"To Wong Foo, Thanks for Everything! Julie Newmar (1995)",1-Jan-95,"http://us.imdb.com/M/title-exact?To%20Wong%20Foo,%20Thanks%20for%20Everything!%20Julie%20Newmar%20(1995)",0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0 41,Billy Madison (1995),1-Jan-95,http://us.imdb.com/M/title-exact?Billy%20Madison%20(1995),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0 42,Clerks (1994),1-Jan-94,http://us.imdb.com/M/title-exact?Clerks%20(1994),0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0 43,Disclosure (1994),1-Jan-94,http://us.imdb.com/M/title-exact?Disclosure%20(1994),0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0 44,Dolores Claiborne (1994),1-Jan-94,http://us.imdb.com/M/title-exact?Dolores%20Claiborne%20(1994),0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0 45,Eat Drink Man Woman (1994),1-Jan-94,http://us.imdb.com/M/title-exact?Yinshi%20Nan%20Nu%20(1994),0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0 46,Exotica (1994),1-Jan-94,http://us.imdb.com/M/title-exact?Exotica%20(1994),0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0 47,Ed Wood (1994),1-Jan-94,http://us.imdb.com/M/title-exact?Ed%20Wood%20(1994),0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0 48,Hoop Dreams (1994),1-Jan-94,http://us.imdb.com/M/title-exact?Hoop%20Dreams%20(1994),0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0 49,I.Q. (1994),1-Jan-94,http://us.imdb.com/M/title-exact?I.Q.%20(1994),0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0 50,Star Wars (1977),1-Jan-77,http://us.imdb.com/M/title-exact?Star%20Wars%20(1977),0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0
数据说明如下:
表1 电影数据说明 字段名称
字段类型
字段说明
movieId
INT
电影ID
movieTitle
VARCHAR
电影名
videoReleaseDate
VARCHAR
电影发行日期
IMDbURL
VARCHAR
IMDb链接
unknown
INT
未知,是则为1,否为0
Action
INT
动作,是则为1,否为0
Adventure
INT
冒险,是则为1,否为0
Animation
INT
动画,是则为1,否为0
Children
INT
儿童,是则为1,否为0
Comedy
INT
喜剧,是则为1,否为0
Crime
INT
犯罪,是则为1,否为0
Documentary
INT
纪录片,是则为1,否为0
Drama
INT
戏剧,是则为1,否为0
Fantasy
INT
幻想,是则为1,否为0
FilmNoir
INT
黑色,是则为1,否为0
Horror
INT
恐怖,是则为1,否为0
Musical
INT
音乐,是则为1,否为0
Mystery
INT
神秘,是则为1,否为0
Romance
INT
浪漫,是则为1,否为0
SciFi
INT
科幻,是则为1,否为0
Thriller
INT
惊悚,是则为1,否为0
War
INT
战争,是则为1,否为0
Western
INT
西部,是则为1,否为0
- ratings.csv:
userId,movieId,rating,timestamp 210,40,3,891035994 224,29,3,888104457 308,1,4,887736532 7,32,4,891350932 10,16,4,877888877 99,4,5,886519097 115,20,3,881171009 138,26,5,879024232 243,15,3,879987440 293,5,3,888906576 162,25,4,877635573 135,23,4,879857765 62,21,3,879373460 59,23,5,888205300 43,14,2,883955745 19,4,4,885412840 5,2,3,875636053 72,48,4,880036718 224,26,3,888104153 299,14,4,877877775 151,10,5,879524921 6,14,5,883599249 250,7,4,878089716 268,2,2,875744173 292,11,5,881104093 181,3,2,878963441 145,15,2,875270655 1,33,4,878542699 276,2,4,874792436 18,26,4,880129731 87,40,3,879876917 272,12,5,879455254 296,20,5,884196921 5,17,4,875636198 128,15,4,879968827 287,1,5,875334088 65,47,2,879216672 1,20,4,887431883 290,50,5,880473582 45,25,4,881014015 109,8,3,880572642 157,25,3,886890787 301,33,4,882078228 62,12,4,879373613 276,40,3,874791871 269,22,1,891448072 10,7,4,877892210 244,17,2,880607205 222,26,3,878183043 185,23,4,883524249 207,13,3,875506839 8,22,5,879362183 222,49,3,878183512 200,11,5,884129542 90,25,5,891384789 15,25,3,879456204 234,10,3,891227851 295,39,4,879518279 217,2,3,889069782 189,20,5,893264466 42,44,3,881108548 268,21,3,875742822 262,28,3,879792220 90,22,4,891384357 270,25,5,876954456 194,23,4,879522819 161,48,1,891170745 58,9,4,884304328 79,50,4,891271545 221,48,5,875245462 223,11,3,891550649 292,9,4,881104148 16,8,5,877722736 17,13,3,885272654 148,1,4,877019411 280,1,4,891700426 110,38,3,886988574 90,12,5,891383241 239,9,5,889180446 311,9,4,884963365 151,13,3,879542688 2,50,5,888552084 8,50,5,879362124 286,44,3,877532173 85,25,2,879452769 274,50,5,878944679 217,27,1,889070011 181,14,1,878962392 297,25,4,874954497 1,47,4,875072125 6,23,4,883601365 222,22,5,878183285 314,28,5,877888346 291,15,5,874833668 94,24,4,885873423 83,43,4,880308690 43,40,3,883956468 44,15,4,878341343 158,24,4,880134261 151,12,5,879524368 66,1,3,883601324 5,1,4,875635748 207,25,4,876079113 109,1,4,880563619 227,50,4,879035347 181,1,3,878962392 213,13,4,878955139 121,14,5,891390014 117,15,5,880125887 85,13,3,879452866 313,22,3,891014870 43,5,4,875981421 11,38,3,891905936 72,28,4,880036824 115,8,5,881171982 95,1,5,879197329 145,22,5,875273021 66,7,3,883601355 267,17,4,878971773 25,25,5,885853415 103,24,4,880415847 87,9,4,879877931 49,47,5,888068715 135,39,3,879857931 269,13,4,891446662 99,50,5,885679998 306,14,5,876503995 291,7,5,874834481 312,28,4,891698300 184,36,3,889910195 305,11,1,886323237 198,7,4,884205317 104,7,3,888465972 293,39,3,888906804 256,25,5,882150552 92,15,3,875640189 1,17,3,875073198 214,42,5,892668130 82,14,4,876311280 305,50,5,886321799 223,8,2,891550684 91,28,4,891439243 315,13,4,879821158 269,9,4,891446246 217,7,4,889069741 49,7,4,888067307 87,2,4,879876074 268,1,3,875742341 262,47,2,879794599 84,12,5,883452874 264,33,3,886122644 224,20,1,888104487 200,24,2,884127370 92,24,3,875640448 276,38,3,874792574 286,34,5,877534701 49,38,1,888068289 311,5,3,884365853 269,47,4,891448386 194,4,4,879521397 57,28,4,883698324 108,50,4,879879739 207,4,4,876198457 181,16,1,878962996 94,9,5,885872684 234,20,4,891227979 68,7,3,876974096 13,14,4,884538727 98,47,4,880498898 53,24,3,879442538 239,10,5,889180338 63,20,3,875748004 276,43,1,874791383 272,48,4,879455143 116,7,2,876453915 26,25,3,891373727 62,24,4,879372633 295,47,5,879518166 63,50,4,875747292 49,17,2,888068651 310,24,4,879436242 7,44,5,891351728 326,22,4,879874989 213,12,5,878955409 222,29,3,878184571 249,11,5,879640868 217,22,5,889069741 189,1,5,893264174 234,50,4,892079237 296,48,5,884197091 81,3,4,876592546 151,15,4,879524879 59,12,5,888204260 246,8,3,884921245 276,34,2,877934264 97,50,5,884239471 244,7,4,880602558 298,8,5,884182748 7,28,5,891352341 41,28,4,890687353
数据说明如下:
表2 评分数据说明 字段名称
字段类型
字段说明
userId
INT
用户ID
movieId
INT
电影ID
rating
INT
评分,5分制
timestamp
VARCHAR
时间戳
数据湖准备
在本示例中,选择数据仓库服务(DWS)服务作为数据底座。
- DataArts Studio实例(指DataArts Studio实例中的CDM集群)与DWS集群处于不同区域的情况下,需要通过公网或者专线打通网络。
- DataArts Studio实例(指DataArts Studio实例中的CDM集群)与DWS集群同区域情况下,同虚拟私有云、同子网、同安全组的不同实例默认网络互通;如果同虚拟私有云但是子网或安全组不同,还需配置路由规则及安全组规则,配置路由规则请参见如何配置路由规则章节,配置安全组规则请参见如何配置安全组规则章节。
- 此外,您还必须确保DWS集群与DataArts Studio工作空间所属的企业项目必须相同,如果不同,您需要修改工作空间的企业项目。
创建DWS集群后,您需要在管理中心创建DWS连接,然后通过数据开发组件新建数据库、数据库模式,再执行SQL来创建DWS表。操作步骤如下:
- 参考访问DataArts Studio实例控制台,登录DataArts Studio管理控制台。
- 在DataArts Studio控制台首页,选择对应工作空间的“管理中心”模块,进入管理中心页面。
- 在“数据连接”页面,单击“创建数据连接”按钮。
图1 数据连接
- 在弹出窗口中,配置数据连接参数,完成配置后,单击“确定”完成数据连接的创建。参数配置如图2所示。
- 数据连接类型:数据仓库服务(DWS)
- 数据连接名称:dws_link
- 标签:可选参数。您可以输入新的标签名称,也可以在下拉列表中选择已有的标签。
- 适用组件:保持默认即可。
- SSL加密:是否开启,请与源端DWS集群的配置保持一致。
- 连接方式:选择“通过代理连接”。
- 手动:选择“集群名”模式,“IP”和“端口”不需要手动填写。
- DWS集群名:选择所创建的DWS集群。
- KMS密钥:选择一个KMS密钥,使用KMS密钥对敏感数据进行加密。如果未创建KMS密钥,请单击“访问KMS”进入KMS控制台创建一个密钥。
- 绑定Agent:需选择一个数据集成集群作为连接代理,该集群和DWS集群必须网络互通。本示例可选择创建DataArts Studio实例时自动创建的数据集成集群。
- 用户名:数据库的用户名,创建DWS集群时指定的用户名,默认为dbadmin。
- 密码:数据库的访问密码,创建DWS集群时指定的密码。
- DWS连接创建完成后,跳转到数据开发页面。
图3 跳转到数据开发页面
- 创建DWS数据库和数据库模式。
- 在数据开发界面,在DWS连接上右键单击,选择“新建数据库”,创建一个数据库用于存放数据表,数据库名称为“demo”。
图4 创建数据库
- 展开DWS连接目录至demo数据库的数据库模式层级,然后再右键单击,选择“新建模式”,创建数据库模式用于存放数据表,数据库模式名称为“dgc”。
图5 创建数据库模式
- 在数据开发界面,在DWS连接上右键单击,选择“新建数据库”,创建一个数据库用于存放数据表,数据库名称为“demo”。
- 创建一个DWS SQL脚本,以通过DWS SQL语句来创建数据表。
图6 新建脚本
- 在新建脚本弹出的SQL编辑器中输入如下SQL语句,并单击“运行”来创建数据表。其中,movies_item、ratings_item为原始数据表,具体数据将在之后通过CDM由OBS迁移到表中;top_rating_movie和top_active_movie为结果表,用于存放分析结果。
SET SEARCH_PATH TO dgc; CREATE TABLE IF NOT EXISTS movies_item( movieId INT, movieTitle VARCHAR, videoReleaseDate VARCHAR, IMDbURL VARCHAR, unknown INT, Action INT, Adventure INT, Animation INT, Children INT, Comedy INT, Crime INT, Documentary INT, Drama INT, Fantasy INT, FilmNoir INT, Horror INT, Musical INT, Mystery INT, Romance INT, SciFi INT, Thriller INT, War INT, Western INT ); CREATE TABLE IF NOT EXISTS ratings_item( userId INT, movieId INT, rating INT, timestamp VARCHAR ); CREATE TABLE IF NOT EXISTS top_rating_movie( movieTitle VARCHAR, avg_rating float, rating_user_number int ); CREATE TABLE IF NOT EXISTS top_active_movie( movieTitle VARCHAR, avg_rating float, rating_user_number int );
图7 创建数据表
- 脚本运行成功后,可以通过如下脚本检查数据表是否创建成功。确认数据表创建成功后,该脚本后续无需使用,可直接关闭。
SELECT * FROM pg_tables;
认证数据准备
- Access Key Id(AK):访问密钥ID。与私有访问密钥关联的唯一标识符;访问密钥ID和私有访问密钥一起使用,对请求进行加密签名。
- Secret Access Key(SK):与访问密钥ID结合使用的密钥,对请求进行加密签名,可标识发送方,并防止请求被修改。
- 登录控制台,在用户名下拉列表中选择“我的凭证”。
- 进入“我的凭证”页面,选择,如图8所示。
- 单击“确定”,根据浏览器提示,保存密钥文件。密钥文件会直接保存到浏览器默认的下载文件夹中。打开名称为“credentials.csv”的文件,即可查看访问密钥(Access Key Id和Secret Access Key)。
- 每个用户仅允许新增两个访问密钥。
- 为保证访问密钥的安全,访问密钥仅在初次生成时自动下载,后续不可再次通过管理控制台界面获取。请在生成后妥善保管。