文档首页/
MapReduce服务 MRS/
开发指南(普通版_3.x)/
Spark2x开发指南(普通模式)/
开发Spark应用/
Spark SQL样例程序/
Spark SQL样例程序(Python)
更新时间:2024-08-05 GMT+08:00
Spark SQL样例程序(Python)
功能简介
统计日志文件中本周末网购停留总时间超过2个小时的女性网民信息。
代码样例
下面代码片段仅为演示,具体代码参见SparkSQLPythonExample:
# -*- coding:utf-8 -*- import sys from pyspark.sql import SparkSession from pyspark.sql import SQLContext def contains(str1, substr1): if substr1 in str1: return True return False if __name__ == "__main__": if len(sys.argv) < 2: print "Usage: SparkSQLPythonExample.py <file>" exit(-1) # 初始化SparkSession和SQLContext sc = SparkSession.builder.appName("CollectFemaleInfo").getOrCreate() sqlCtx = SQLContext(sc) # RDD转换为DataFrame inputPath = sys.argv[1] inputRDD = sc.read.text(inputPath).rdd.map(lambda r: r[0])\ .map(lambda line: line.split(","))\ .map(lambda dataArr: (dataArr[0], dataArr[1], int(dataArr[2])))\ .collect() df = sqlCtx.createDataFrame(inputRDD) # 注册表 df.registerTempTable("FemaleInfoTable") # 执行SQL查询并显示结果 FemaleTimeInfo = sqlCtx.sql("SELECT * FROM " + "(SELECT _1 AS Name,SUM(_3) AS totalStayTime FROM FemaleInfoTable " + "WHERE _2 = 'female' GROUP BY _1 )" + " WHERE totalStayTime >120").show() sc.stop()
父主题: Spark SQL样例程序