Updated on 2025-04-30 GMT+08:00

Using MRS Spark SQL to Access GaussDB(DWS)

Application Scenarios

You can use MRS to quickly build and operate a full-stack cloud-native big data platform on Huawei Cloud. Big data components, such as HDFS, Hive, HBase, and Spark, are available on the platform for analyzing enterprise data at scale.

Spark SQL lets you process structured data using SQL-like syntax. With Spark SQL, you can access different databases, extract and process their data, and load the results into other data stores.

This practice demonstrates how to use MRS Spark SQL to access GaussDB(DWS) data.

Solution Architecture

Figure 1 shows the Spark application running architecture.

  1. An application runs in the cluster as a collection of processes, coordinated by the Driver.
  2. To run an application, the Driver connects to the cluster manager (such as Standalone, Mesos, or YARN), applies for executor resources, and starts ExecutorBackend processes. The cluster manager schedules resources between different applications. Meanwhile, the Driver schedules DAGs, divides them into stages, and generates tasks for the application.
  3. Spark then sends the application code (the code passed to SparkContext, defined in a JAR or Python files) to the executors.
  4. After all tasks finish, the application stops running.
Figure 1 Spark application running architecture

Notes and Constraints

  • This practice applies only to MRS 3.x and later.
  • To ensure network connectivity, the AZ, VPC, and security group of the GaussDB(DWS) cluster must be the same as those of the MRS cluster.

Prerequisites

  • A GaussDB(DWS) cluster has been created. For details, see Creating a GaussDB(DWS) Cluster.
  • You have obtained the IP address, port number, database name, username, and password for connecting to the GaussDB(DWS) database. The user must have the read and write permissions on GaussDB(DWS) tables.

Creating an MRS Cluster

  1. Create an MRS cluster.

    Create an MRS cluster that contains the Spark component. For details, see Buying a Custom Cluster.

  2. If Kerberos authentication is enabled for the cluster, log in to FusionInsight Manager, choose System > Permission > User, and add the human-machine user sparkuser to the user groups hadoop (primary) and hive.

    Add the ADD JAR permission by referring to Adding a Ranger Access Permission Policy for Spark2x.

    If Kerberos authentication is disabled for the MRS cluster, you do not need to add the user.

  3. Install the MRS cluster client.

    For details, see Installing a Client.

Configuring MRS Spark SQL to Access GaussDB(DWS) Tables

  1. Prepare data and create databases and tables in the GaussDB(DWS) cluster.

    1. Log in to the GaussDB(DWS) console and click Log In in the Operation column of the cluster.
    2. Log in to the default database gaussdb of the cluster and run the following command to create the dws_test database:

      CREATE DATABASE dws_test;

    3. Connect to the created database and run the following command to create the dws_order table:

      CREATE SCHEMA dws_data;

      CREATE TABLE dws_data.dws_order
      (
        order_id VARCHAR,
        order_channel VARCHAR,
        order_time VARCHAR,
        cust_code VARCHAR,
        pay_amount DOUBLE PRECISION,
        real_pay DOUBLE PRECISION
      );

    4. Run the following command to insert data to the dws_order table:

      INSERT INTO dws_data.dws_order VALUES ('202306270001', 'webShop', '2023-06-27 10:00:00', 'CUST1', 1000, 1000);
      INSERT INTO dws_data.dws_order VALUES ('202306270002', 'webShop', '2023-06-27 11:00:00', 'CUST2', 5000, 5000);

    5. Run the following command to query the table data to check whether the data is inserted:

      SELECT * FROM dws_data.dws_order;
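      As an extra sanity check (a sketch, assuming only the two rows above have been inserted), an aggregate query should report 2 orders totaling 6000:

```sql
-- Run in the GaussDB(DWS) client against the dws_test database.
SELECT count(*) AS order_count,
       sum(pay_amount) AS total_pay
FROM dws_data.dws_order;
-- With only the two rows inserted above: order_count = 2, total_pay = 6000
```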

  2. Download the JDBC driver of the GaussDB(DWS) database and upload it to the MRS cluster.

    1. Log in to the GaussDB(DWS) console, choose Management > Client Connections in the navigation pane on the left, and download the JDBC driver.
      Figure 2 Downloading the JDBC driver
    2. Decompress the package to obtain the gsjdbc200.jar file and upload it to the active Master node of the MRS cluster, for example, to the /tmp directory.
    3. Log in to the active Master node as user root and run the following commands:

      cd Client installation directory

      source bigdata_env

      kinit sparkuser (Change the password upon the first authentication. If Kerberos authentication is disabled, you do not need to run this command.)

      hdfs dfs -put /tmp/gsjdbc200.jar /tmp

  3. Create a data source table in MRS Spark and access the GaussDB(DWS) table.

    1. Log in to the Spark client node and run the following commands:

      cd Client installation directory

      source ./bigdata_env

      kinit sparkuser (If Kerberos authentication is disabled for the cluster, you do not need to run this command.)

      spark-sql --master yarn

    2. Run the following command to add the driver Jar package:

      add jar hdfs://hacluster/tmp/gsjdbc200.jar;

    3. Run the following commands to create a data source table in Spark and access GaussDB(DWS) data:

      CREATE TABLE IF NOT EXISTS spk_dws_order
      USING JDBC OPTIONS (
        'url'='jdbc:gaussdb://192.168.0.228:8000/dws_test',
        'driver'='com.huawei.gauss200.jdbc.Driver',
        'dbtable'='dws_data.dws_order',
        'user'='dbadmin',
        'password'='xxx');

    4. Run the following command to query the Spark table. Check whether the displayed data is the same as the GaussDB(DWS) data.

      SELECT * FROM spk_dws_order;

      Verify that the returned data is the same as the data queried in GaussDB(DWS) in step 1.
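    Because the spk_dws_order source table maps directly to dws_data.dws_order, writes issued through Spark SQL are also pushed down to GaussDB(DWS). The following sketch (assuming the dbadmin user has write permission on the table, as required in the prerequisites; the order values are made up for illustration) inserts a test row from Spark and verifies it on both sides:

```sql
-- In spark-sql: write a test row through the JDBC source table.
INSERT INTO spk_dws_order VALUES ('202306270003', 'miniProgram', '2023-06-27 12:00:00', 'CUST3', 600, 600);

-- Query from Spark; the new row should be returned.
SELECT * FROM spk_dws_order WHERE order_id = '202306270003';

-- In the GaussDB(DWS) client, the same row should now exist in the source table:
SELECT * FROM dws_data.dws_order WHERE order_id = '202306270003';
```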