
Using Spark on CCE

You can use spark-submit to submit Spark applications to a Kubernetes cluster, which runs them using Spark's native Kubernetes scheduler. For details, see Running Spark on Kubernetes. The submission mechanism works as follows:

  • Spark creates a pod that runs the Spark driver.
  • The driver creates executor pods, connects to them, and executes the application code.
  • After the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in the completed state until it is eventually garbage collected or manually deleted. In the completed state, the driver pod does not use any computing or memory resources.
Figure 1 Submission mechanism

Running SparkPi on CCE

  1. Install kubectl on the node where Spark is running. For details, see Connecting to a Cluster Using kubectl.
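
    Optionally, run the following command to confirm that kubectl can reach the cluster and to obtain the API server address that is used later for --master:

    kubectl cluster-info
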
  2. Run the following commands to create a service account and grant it cluster-level permissions:

    # Create a service account.
    kubectl create serviceaccount spark
    # Create a ClusterRoleBinding named spark-role in the default namespace, which grants the edit ClusterRole to the service account created in the previous step.
    kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
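
    Optionally, verify that the service account and the ClusterRoleBinding created above exist:

    kubectl get serviceaccount spark
    kubectl describe clusterrolebinding spark-role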

  3. Submit a SparkPi job to CCE. The following shows an example (a sketch for checking the job status follows the parameter descriptions):

    spark-submit \
      --master k8s://https://**.**.**.**:5443 \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=2 \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
      --conf spark.kubernetes.container.image=swr.ap-southeast-1.myhuaweicloud.com/dev-container/spark:3.1.3-obs \
      local:///root/spark-obs/examples/jars/spark-examples_2.12-3.1.1.jar

    Parameters:

    • --master: indicates the API server address of the cluster. https://**.**.**.**:5443 is the cluster address used in ~/.kube/config and can be obtained by running kubectl cluster-info.
    • --deploy-mode:
      • cluster: a mode in which the driver is deployed on the worker nodes.
      • client: (default value) a mode in which the driver is deployed locally as an external client.
    • --name: indicates the name of a job. It is used to name the pods in the cluster.
    • --class: indicates the main class of the application, for example, org.apache.spark.examples.SparkPi.
    • --conf: indicates Spark configuration parameters in key-value pair format. All parameters that can be specified using --conf are also read from the ~/spark-obs/conf/spark-defaults.conf file by default, so common settings can be written there as defaults (a minimal sketch follows this parameter list), the same way as in Interconnecting Spark with OBS.
      • spark.executor.instances: indicates the number of executor pods.
      • spark.kubernetes.authenticate.driver.serviceAccountName: indicates the service account used by the driver to obtain cluster-level permissions. Use the service account created in 2.
      • spark.kubernetes.container.image: indicates the image path of the image pushed to SWR in Pushing an Image to SWR.
    • local: indicates the path to the application JAR package. In this example, the JAR package is stored as a local file. The scheme of this path can be file, http, or local. For details, see the official documentation.
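
    As noted for --conf, common settings can be kept in ~/spark-obs/conf/spark-defaults.conf so that they do not have to be repeated on every submission. The following is a minimal sketch: the spark.hadoop.fs.obs.* keys are the standard hadoop-obs settings and the placeholder values are assumptions; use the actual values from Interconnecting Spark with OBS.

    spark.executor.instances                                   2
    spark.kubernetes.authenticate.driver.serviceAccountName    spark
    spark.kubernetes.container.image                           swr.ap-southeast-1.myhuaweicloud.com/dev-container/spark:3.1.3-obs
    # The following OBS credentials and endpoint are illustrative placeholders.
    spark.hadoop.fs.obs.access.key                             <your-access-key>
    spark.hadoop.fs.obs.secret.key                             <your-secret-key>
    spark.hadoop.fs.obs.endpoint                               <your-obs-endpoint>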
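
    After the job is submitted, you can check its status and result from the driver pod. The following is a minimal sketch; the exact driver pod name is generated from the job name, so list the pods first (Spark on Kubernetes labels the driver pod with spark-role=driver):

    # List the driver pod created for the job.
    kubectl get pods -l spark-role=driver
    # View the driver log. For SparkPi, the computed value of Pi is printed in the log. Replace the placeholder with the pod name listed above.
    kubectl logs <driver-pod-name>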

Accessing OBS

Use spark-submit to submit an HdfsTest job that reads a file from OBS. Change obs://bucket-name/filename at the end of the command to the tenant's actual bucket name and file name.

spark-submit \
  --master k8s://https://**.**.**.**:5443 \
  --deploy-mode cluster \
  --name spark-hdfs-test \
  --class org.apache.spark.examples.HdfsTest \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=swr.ap-southeast-1.myhuaweicloud.com/dev-container/spark:3.1.3-obs \
  local:///root/spark-obs/examples/jars/spark-examples_2.12-3.1.1.jar obs://bucket-name/filename

Using Spark Shell Commands to Interact with Spark in Scala

Run the following command to start spark-shell against the CCE cluster:

spark-shell \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=swr.ap-southeast-1.myhuaweicloud.com/dev-container/spark:3.1.3-obs \
  --master k8s://https://**.**.**.**:5443

Run the following commands to define the linecount and wordcount computing jobs:

// linecount: count the non-empty lines in the input Dataset.
def linecount(input: org.apache.spark.sql.Dataset[String]): Long = input.filter(line => line.length() > 0).count()
// wordcount: split each line on whitespace, group by word, and count the number of distinct words.
def wordcount(input: org.apache.spark.sql.Dataset[String]): Long = input.flatMap(value => value.split("\\s+")).groupByKey(value => value).count().count()

Run the following commands to define data sources:

var alluxio = spark.read.textFile("alluxio://alluxio-master:19998/sample-1g")
var obs = spark.read.textFile("obs://gene-container-gtest/sample-1g")
var hdfs = spark.read.textFile("hdfs://192.168.1.184:9000/user/hadoop/books/sample-1g")

Run the following commands to start the computing jobs:

spark.time(wordcount(obs))
spark.time(linecount(obs))