Updated on 2026-03-10 GMT+08:00

PySpark Example Code

Scenario

This section provides PySpark example code that demonstrates how to use a Spark job to access data from the DWS data source.

Before you start, ensure that a datasource connection has been created and bound to a queue on the DLI management console. For details, see Enhanced Datasource Connections.

Hard-coding passwords or storing them in plaintext in code poses significant security risks. You are advised to store them encrypted in configuration files or environment variables and decrypt them only when needed.

You can also use DEW to manage access credentials for data sources.
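
As a minimal sketch of the advice above, the credentials can be read from environment variables at run time instead of being written into the script. The variable names DWS_USER and DWS_PASSWORD and the helper load_dws_credentials are hypothetical, not names defined by DLI. How the variables are set (and decrypted) in the job environment depends on your deployment.

```python
import os


def load_dws_credentials(environ=None):
    """Return (user, password) read from environment variables.

    DWS_USER and DWS_PASSWORD are hypothetical names chosen for this sketch.
    """
    env = os.environ if environ is None else environ
    missing = [name for name in ("DWS_USER", "DWS_PASSWORD") if not env.get(name)]
    if missing:
        # Fail early with a clear message rather than connecting with empty credentials.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["DWS_USER"], env["DWS_PASSWORD"]
```

In the examples that follow, a call such as user, password = load_dws_credentials() could then replace the hard-coded user and password assignments.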

Preparations

Import dependency packages.

     from __future__ import print_function
     from pyspark.sql.types import StructType, StructField, IntegerType, StringType
     from pyspark.sql import SparkSession

Create a session.

     sparkSession = SparkSession.builder.appName("datasource-dws").getOrCreate()

Accessing a Data Source Using a DataFrame API

Set connection parameters.

     url = "jdbc:postgresql://to-dws-1174404951-W8W4cW8I.datasource.com:8000/postgres"
     dbtable = "customer"
     user = "dbadmin"
     password = "######"
     driver = "org.postgresql.Driver"

Set data.

     dataList = sparkSession.sparkContext.parallelize([(1, "Katie", 19)])

Configure the schema.

     schema = StructType([StructField("id", IntegerType(), False),
                          StructField("name", StringType(), False),
                          StructField("age", IntegerType(), False)])

Create a DataFrame.

     dataFrame = sparkSession.createDataFrame(dataList, schema)

Save the data to DWS.

     dataFrame.write \
         .format("jdbc") \
         .option("url", url) \
         .option("dbtable", dbtable) \
         .option("user", user) \
         .option("password", password) \
         .option("driver", driver) \
         .mode("Overwrite") \
         .save()

The mode option can be set to one of the following values:

ErrorIfExists: If the data already exists, the system throws an exception.
Overwrite: If the data already exists, it is overwritten with the new data.
Append: If the data already exists, the new data is appended to it.
Ignore: If the data already exists, the write is skipped and the existing data is left unchanged. This is similar to the SQL statement CREATE TABLE IF NOT EXISTS.
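
The modes differ only in what happens when the target already holds data. The following plain-Python model makes the difference concrete. It is not Spark code and does not reflect how Spark implements the write. It treats a table as a list of rows, or None when the table does not exist. The function name apply_save_mode is hypothetical.

```python
def apply_save_mode(existing, new_rows, mode):
    """Model the table contents after a write; existing is None if the table is absent."""
    mode = mode.lower()
    if mode not in ("errorifexists", "overwrite", "append", "ignore"):
        raise ValueError("unknown save mode: " + mode)
    if existing is None:
        # No table yet: every mode simply creates it with the new rows.
        return list(new_rows)
    if mode == "errorifexists":
        raise RuntimeError("table already exists")
    if mode == "overwrite":
        return list(new_rows)
    if mode == "append":
        return list(existing) + list(new_rows)
    # mode == "ignore": keep the existing rows and drop the new ones.
    return list(existing)


old = [(1, "Katie", 19)]
new = [(2, "John", 24)]
print(apply_save_mode(old, new, "Overwrite"))  # [(2, 'John', 24)]
print(apply_save_mode(old, new, "Append"))     # [(1, 'Katie', 19), (2, 'John', 24)]
print(apply_save_mode(old, new, "Ignore"))     # [(1, 'Katie', 19)]
```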

Read data from DWS.

     jdbcDF = sparkSession.read \
         .format("jdbc") \
         .option("url", url) \
         .option("dbtable", dbtable) \
         .option("user", user) \
         .option("password", password) \
         .option("driver", driver) \
         .load()
     jdbcDF.show()

View the operation result.
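
The write and read steps above pass the same five options. One way to avoid the repetition is to collect them in a dictionary once and pass it with options(**...), which both the DataFrame reader and writer accept. The helper name build_jdbc_options is hypothetical, and the values are the placeholders used in this section.

```python
def build_jdbc_options(url, dbtable, user, password,
                       driver="org.postgresql.Driver"):
    """Collect the JDBC options shared by the write and the read step."""
    return {
        "url": url,
        "dbtable": dbtable,
        "user": user,
        "password": password,
        "driver": driver,
    }


jdbc_options = build_jdbc_options(
    url="jdbc:postgresql://to-dws-1174404951-W8W4cW8I.datasource.com:8000/postgres",
    dbtable="customer",
    user="dbadmin",
    password="######",  # placeholder; do not hard-code a real password
)

# With the sparkSession and dataFrame created in the steps above, the calls become:
#   dataFrame.write.format("jdbc").options(**jdbc_options).mode("Overwrite").save()
#   jdbcDF = sparkSession.read.format("jdbc").options(**jdbc_options).load()
```

Keeping the options in one place also means that a changed URL or credential source only has to be updated once.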

Accessing a Data Source Using a SQL API

Create a table to connect to a DWS data source.

     sparkSession.sql(
         "CREATE TABLE IF NOT EXISTS dli_to_dws USING JDBC OPTIONS (\
         'url'='jdbc:postgresql://to-dws-1174404951-W8W4cW8I.datasource.com:8000/postgres',\
         'dbtable'='customer',\
         'user'='dbadmin',\
         'password'='######',\
         'driver'='org.postgresql.Driver')")

For details about table creation parameters, see Table 1.
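
Because the statement is an ordinary Python string, it can be assembled from variables so that the password is not written into the SQL text in the script. The following is a rough sketch, assuming a hypothetical helper build_create_table_sql. It only builds the string, and it rejects values containing a single quote rather than attempting to escape them.

```python
def build_create_table_sql(table, options):
    """Return a CREATE TABLE ... USING JDBC statement for the given options."""
    for key, value in options.items():
        # Keep the sketch simple: refuse values that would need SQL escaping.
        if "'" in key or "'" in value:
            raise ValueError("single quote not allowed in option: " + key)
    option_list = ",\n    ".join(
        "'{}'='{}'".format(key, value) for key, value in options.items()
    )
    return "CREATE TABLE IF NOT EXISTS {} USING JDBC OPTIONS (\n    {})".format(
        table, option_list
    )


sql = build_create_table_sql("dli_to_dws", {
    "url": "jdbc:postgresql://to-dws-1174404951-W8W4cW8I.datasource.com:8000/postgres",
    "dbtable": "customer",
    "user": "dbadmin",
    "password": "######",  # placeholder; load the real value at run time
    "driver": "org.postgresql.Driver",
})
print(sql)
# With the sparkSession created earlier, run it with: sparkSession.sql(sql)
```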

Insert data.

     sparkSession.sql("insert into dli_to_dws values(2,'John',24)")

Query data.

     jdbcDF = sparkSession.sql("select * from dli_to_dws")
     jdbcDF.show()

View the operation result.

Submitting a Spark Job

Upload the Python code file to the OBS bucket.
In the Spark job editor, select the corresponding dependency module and execute the Spark job.
- For Spark 2.3.2 (soon to be taken offline) or 2.4.5, set Module to sys.datasource.dws when submitting a job.
- If the Spark version is 3.1.1 or later, you do not need to select a module. Instead, configure the following Spark parameters (--conf):
  spark.driver.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/dws/*
  spark.executor.extraClassPath=/usr/share/extension/dli/spark-jar/datasource/dws/*
- For how to submit a job on the console, see table "Parameters for selecting dependency resources" in Creating a Spark Job.
- For details about how to submit a job through an API, see the description of the modules parameter in Table 2 "Request parameters" in Creating a Batch Processing Job.
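
For Spark 3.1.1 or later, the two classpath settings listed above are plain key=value pairs. As an illustration only, the following sketch renders them as --conf arguments. The helper name format_conf_args is hypothetical, and how the parameters are actually entered depends on the console or API you use.

```python
DWS_JAR_PATH = "/usr/share/extension/dli/spark-jar/datasource/dws/*"

spark_conf = {
    "spark.driver.extraClassPath": DWS_JAR_PATH,
    "spark.executor.extraClassPath": DWS_JAR_PATH,
}


def format_conf_args(conf):
    """Render a dict of Spark settings as a flat list of --conf arguments."""
    args = []
    for key in sorted(conf):
        args.extend(["--conf", "{}={}".format(key, conf[key])])
    return args


print(" ".join(format_conf_args(spark_conf)))
```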

Complete Example Code

Connecting to data sources through DataFrame APIs

     # -*- coding: utf-8 -*-
     from __future__ import print_function
     from pyspark.sql.types import StructType, StructField, IntegerType, StringType
     from pyspark.sql import SparkSession

     if __name__ == "__main__":
         # Create a SparkSession.
         sparkSession = SparkSession.builder.appName("datasource-dws").getOrCreate()

         # Set the datasource connection parameters.
         url = "jdbc:postgresql://to-dws-1174404951-W8W4cW8I.datasource.com:8000/postgres"
         dbtable = "customer"
         user = "dbadmin"
         password = "######"
         driver = "org.postgresql.Driver"

         # Create an RDD that holds the initial data.
         dataList = sparkSession.sparkContext.parallelize([(1, "Katie", 19)])

         # Set the schema.
         schema = StructType([StructField("id", IntegerType(), False),
                              StructField("name", StringType(), False),
                              StructField("age", IntegerType(), False)])

         # Create a DataFrame from the RDD and the schema.
         dataFrame = sparkSession.createDataFrame(dataList, schema)

         # Write data to the DWS table.
         dataFrame.write \
             .format("jdbc") \
             .option("url", url) \
             .option("dbtable", dbtable) \
             .option("user", user) \
             .option("password", password) \
             .option("driver", driver) \
             .mode("Overwrite") \
             .save()

         # Read data from the DWS table.
         jdbcDF = sparkSession.read \
             .format("jdbc") \
             .option("url", url) \
             .option("dbtable", dbtable) \
             .option("user", user) \
             .option("password", password) \
             .option("driver", driver) \
             .load()
         jdbcDF.show()

         # Close the session.
         sparkSession.stop()

Connecting to data sources through SQL APIs

     # -*- coding: utf-8 -*-
     from __future__ import print_function
     from pyspark.sql import SparkSession

     if __name__ == "__main__":
         # Create a SparkSession.
         sparkSession = SparkSession.builder.appName("datasource-dws").getOrCreate()

         # Create a DLI table that is associated with the DWS table.
         sparkSession.sql(
             "CREATE TABLE IF NOT EXISTS dli_to_dws USING JDBC OPTIONS (\
             'url'='jdbc:postgresql://to-dws-1174404951-W8W4cW8I.datasource.com:8000/postgres',\
             'dbtable'='customer',\
             'user'='dbadmin',\
             'password'='######',\
             'driver'='org.postgresql.Driver')")

         # Insert data into the DLI table.
         sparkSession.sql("insert into dli_to_dws values(2,'John',24)")

         # Read data from the DLI table. show() returns None, so keep the DataFrame first.
         jdbcDF = sparkSession.sql("select * from dli_to_dws")
         jdbcDF.show()

         # Close the session.
         sparkSession.stop()
