pyspark样例代码

开发说明

mongo只支持增强型跨源。

DDS即文档数据库服务，兼容MongoDB协议。

前提条件
 在DLI管理控制台上已完成创建增强跨源连接，并绑定队列。具体操作请参考《数据湖探索用户指南》。

认证用的password硬编码到代码中或者明文存储都有很大的安全风险，建议在配置文件或者环境变量中密文存放，使用时解密，确保安全。

通过DataFrame API 访问

import相关依赖

from __future__ import print_function
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import SparkSession

创建session

     sparkSession = SparkSession.builder.appName("datasource-mongo").getOrCreate()

设置连接参数

     url = "192.168.4.62:8635,192.168.5.134:8635/test?authSource=admin"
uri = "mongodb://username:pwd@host:8635/db"
user = "rwuser"
database = "test"
collection = "test"
password = "######"
 
 
  

详细的参数说明请参考表1。

创建DataFrame

     dataList = sparkSession.sparkContext.parallelize([(1, "Katie", 19),(2,"Tom",20)])
schema = StructType([StructField("id", IntegerType(), False),          
                     StructField("name", StringType(), False),
                     StructField("age", IntegerType(), False)])
dataFrame = sparkSession.createDataFrame(dataList, schema)
 
 
  

导入数据到mongo

     dataFrame.write.format("mongo")
  .option("url", url)\
  .option("uri", uri)\
  .option("user",user)\
  .option("password",password)\
  .option("database",database)\
  .option("collection",collection)\
  .mode("Overwrite")\
  .save()
 
 
  

读取MongoDB上的数据

     jdbcDF = sparkSession.read
  .format("mongo")\
  .option("url", url)\
  .option("uri", uri)\
  .option("user",user)\
  .option("password",password)\
  .option("database",database)\
  .option("collection",collection)\
  .load()
jdbcDF.show()
 
 
  

操作结果

通过SQL API 访问

创建DLI关联跨源访问 MongoDB的关联表。

sparkSession.sql(
      "create table test_dds(id string, name string, age int) using mongo options(
      'url' = '192.168.4.62:8635,192.168.5.134:8635/test?authSource=admin',
      'uri' = 'mongodb://username:pwd@host:8635/db',
      'database' = 'test',
      'collection' = 'test', 
      'user' = 'rwuser', 
      'password' = '######')")

详细的参数说明请参考表1。

插入数据

     sparkSession.sql("insert into test_dds values('3', 'Ann',23)")

查询数据

     sparkSession.sql("select * from test_dds").show()

提交Spark作业
1. 将写好的python代码文件上传至OBS桶中。
2. 在Spark作业编辑器中选择对应的Module模块并执行Spark作业。
  - 如果选择spark版本为2.3.2（即将下线）或2.4.5提交作业时，需要指定Module模块，名称为：sys.datasource.mongo。
  - 如果选择Spark版本为3.1.1及以上版本时，无需选择Module模块，需在 'Spark参数（--conf)' 配置
     spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/mongo/*
    
    spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/mongo/*
  - 通过控制台提交作业请参考《数据湖探索用户指南》中的“选择依赖资源参数说明”。
  - 通过API提交作业请参考《数据湖探索API参考》>《创建批处理作业》中“表2-请求参数说明”关于“modules”参数的说明。

完整示例代码

通过DataFrame API 访问

from __future__ import print_function
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import SparkSession

if __name__ == "__main__":
  # Create a SparkSession session.
  sparkSession = SparkSession.builder.appName("datasource-mongo").getOrCreate()

  # Create a DataFrame and initialize the DataFrame data.  
  dataList = sparkSession.sparkContext.parallelize([("1", "Katie", 19),("2","Tom",20)])
  
  # Setting schema  
  schema = StructType([StructField("id", IntegerType(), False),StructField("name", StringType(), False), StructField("age", IntegerType(), False)])

  # Create a DataFrame from RDD and schema  
  dataFrame = sparkSession.createDataFrame(dataList, schema)

  # Setting connection parameters  
  url = "192.168.4.62:8635,192.168.5.134:8635/test?authSource=admin"
  uri = "mongodb://username:pwd@host:8635/db"
  user = "rwuser"
  database = "test"
  collection = "test"
  password = "######"
 
  # Write data to the mongodb table  
  dataFrame.write.format("mongo")
    .option("url", url)\
    .option("uri", uri)\
    .option("user",user)\
    .option("password",password)\
    .option("database",database)\
    .option("collection",collection)
    .mode("Overwrite").save()

  # Read data  
  jdbcDF = sparkSession.read.format("mongo")
    .option("url", url)\
    .option("uri", uri)\
    .option("user",user)\
    .option("password",password)\
    .option("database",database)\
    .option("collection",collection)\
    .load()   
  jdbcDF.show()
 
  # close session  
  sparkSession.stop()

通过SQL API 访问

from __future__ import print_function
from pyspark.sql import SparkSession

if __name__ == "__main__":
  # Create a SparkSession session.  
  sparkSession = SparkSession.builder.appName("datasource-mongo").getOrCreate()

  # Create a data table for DLI - associated mongo
    sparkSession.sql(
      "create table test_dds(id string, name string, age int) using mongo options(\
      'url' = '192.168.4.62:8635,192.168.5.134:8635/test?authSource=admin',\
      'uri' = 'mongodb://username:pwd@host:8635/db',\
      'database' = 'test',\
      'collection' = 'test', \
      'user' = 'rwuser', \
      'password' = '######')")

  # Insert data into the DLI-table  
  sparkSession.sql("insert into test_dds values('3', 'Ann',23)")

  # Read data from DLI-table  
  sparkSession.sql("select * from test_dds").show()

  # close session  
  sparkSession.stop()

父主题： 对接MongoDB

上一篇：scala样例代码

下一篇：java样例代码

意见反馈

文档内容是否对您有帮助？

有帮助没帮助

提供反馈

提交成功！非常感谢您的反馈，我们会继续努力做到更好！

系统繁忙，请稍后重试