文档首页/
    
      
      MapReduce服务 MRS/
      
      
        
        
        开发指南(LTS版)/
        
        
        Spark2x开发指南(安全模式)/
        
        
        开发Spark应用/
        
        
        Spark SQL样例程序/
        
      
      Spark SQL样例程序(Python)
    
  
  
    
        更新时间:2024-08-03 GMT+08:00
        
          
          
        
      
      
      
      
      
      
      
      
  
      
      
      
        
Spark SQL样例程序(Python)
功能简介
统计日志文件中本周末网购停留总时间超过2个小时的女性网民信息。
代码样例
下面代码片段仅为演示,具体代码参见SparkSQLPythonExample:
# -*- coding:utf-8 -*-
import sys
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
def contains(str1, substr1):
    if substr1 in str1:
        return True
    return False
if __name__ == "__main__":
    if len(sys.argv) < 2:
        print "Usage: SparkSQLPythonExample.py <file>"
        exit(-1)
    # 初始化SparkSession和SQLContext
    sc = SparkSession.builder.appName("CollectFemaleInfo").getOrCreate()
    sqlCtx = SQLContext(sc)
    # RDD转换为DataFrame
    inputPath = sys.argv[1]
    inputRDD = sc.read.text(inputPath).rdd.map(lambda r: r[0])\
        .map(lambda line: line.split(","))\
        .map(lambda dataArr: (dataArr[0], dataArr[1], int(dataArr[2])))\
        .collect()
    df = sqlCtx.createDataFrame(inputRDD)
    # 注册表
    df.registerTempTable("FemaleInfoTable")
    # 执行SQL查询并显示结果
    FemaleTimeInfo = sqlCtx.sql("SELECT * FROM " +
               "(SELECT _1 AS Name,SUM(_3) AS totalStayTime FROM FemaleInfoTable " +
               "WHERE _2 = 'female' GROUP BY _1 )" +
               " WHERE totalStayTime >120").show()
    sc.stop()
 
   父主题: Spark SQL样例程序