Using Spark on CCE

Running SparkPi on CCE

The following describes how to submit a SparkPi job to CCE.

spark-submit \
  --master k8s://https://aa.bb.cc.dd:5443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=swr.cn-east-3.myhuaweicloud.com/batch/spark:2.4.5-obs \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar

Configuration description:

  1. aa.bb.cc.dd is the master address specified in ~/.kube/config. You can run the kubectl cluster-info command to obtain the master address.
  2. spark.kubernetes.container.image is the address of the pushed image. If the image is a private image, you also need to configure spark.kubernetes.container.image.pullSecrets.
  3. All parameters specified using --conf are read from the ~/spark-obs/conf/spark-defaults.conf file by default. Therefore, common settings, such as the OBS access configuration, can be written to that file as defaults.
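For example, shared settings could be kept in ~/spark-obs/conf/spark-defaults.conf so that every spark-submit invocation picks them up automatically. The sketch below is illustrative only: the fs.obs.* key names come from the Hadoop OBS connector, and the credential and endpoint values are placeholders you must replace with your own.

```
# ~/spark-obs/conf/spark-defaults.conf (illustrative placeholders)
spark.hadoop.fs.obs.access.key    YOUR_ACCESS_KEY
spark.hadoop.fs.obs.secret.key    YOUR_SECRET_KEY
spark.hadoop.fs.obs.endpoint      obs.cn-east-3.myhuaweicloud.com
spark.kubernetes.container.image  swr.cn-east-3.myhuaweicloud.com/batch/spark:2.4.5-obs
```

With these defaults in place, the corresponding --conf flags can be omitted from the command line.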

Accessing OBS

Use spark-submit to submit an HdfsTest job that reads from OBS. Change obs://bucket-name/filename at the end of the command to the actual file path in your bucket.

spark-submit \
  --master k8s://https://aa.bb.cc.dd:5443 \
  --deploy-mode cluster \
  --name spark-hdfs-test \
  --class org.apache.spark.examples.HdfsTest \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=swr.cn-east-3.myhuaweicloud.com/batch/spark:2.4.5-obs \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar obs://bucket-name/filename

Using Spark Shell Commands to Interact with Spark in Scala

spark-shell \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=swr.cn-east-3.myhuaweicloud.com/batch/spark:2.4.5-obs \
  --master k8s://https://aa.bb.cc.dd:5443

Run the following commands to define the linecount and wordcount Spark computing jobs:

def linecount(input: org.apache.spark.sql.Dataset[String]): Long = input.filter(line => line.length() > 0).count()
def wordcount(input: org.apache.spark.sql.Dataset[String]): Long = input.flatMap(value => value.split("\\s+")).groupByKey(value => value).count().count()
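The two definitions are easier to read once unpacked: linecount counts the non-empty lines, and wordcount splits every line on whitespace, groups equal words, and counts the number of groups, i.e. the number of distinct words. A local, Spark-free sketch of the same logic (plain Python lists standing in for Spark Datasets, purely for illustration):

```python
def linecount(lines):
    # Mirrors input.filter(line => line.length() > 0).count()
    return sum(1 for line in lines if len(line) > 0)

def wordcount(lines):
    # Mirrors flatMap(split on whitespace).groupByKey(identity).count().count():
    # the final count is the number of distinct words, not total words.
    words = [w for line in lines for w in line.split()]
    return len(set(words))

sample = ["to be or not to be", "", "that is the question"]
print(linecount(sample))  # 2 non-empty lines
print(wordcount(sample))  # 8 distinct words
```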

Run the following commands to define data sources:

var alluxio = spark.read.textFile("alluxio://alluxio-master:19998/sample-1g")
var obs = spark.read.textFile("obs://gene-container-gtest/sample-1g")
var hdfs = spark.read.textFile("hdfs://192.168.1.184:9000/user/hadoop/books/sample-1g")

Run the following commands to start the computing jobs:

spark.time(wordcount(obs))
spark.time(linecount(obs))
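spark.time evaluates the given expression, prints the wall-clock time it took, and returns the result. A local analogue in plain Python (an illustration of the behavior, not Spark API):

```python
import time

def spark_time(thunk):
    # Run the expression, print the elapsed wall-clock time, return the result,
    # similar in spirit to Spark's spark.time(...).
    start = time.perf_counter()
    result = thunk()
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"Time taken: {elapsed_ms:.0f} ms")
    return result

total = spark_time(lambda: sum(range(1_000_000)))
```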