Installing Spark

Prerequisites

JDK 1.8 or later must be configured in the environment.
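As a quick sanity check, a small helper (illustrative only, not part of any Spark tooling) can parse the version string printed by java -version and confirm the JDK is 1.8 or later:

```shell
#!/bin/sh
# jdk_major: extract the major version from a JDK version string.
# Pre-JDK 9 strings look like "1.8.0_292"; JDK 9+ strings look like "11.0.2".
jdk_major() {
    case "$1" in
        1.*) echo "$1" | cut -d. -f2 ;;   # "1.8.0_292" -> 8
        *)   echo "$1" | cut -d. -f1 ;;   # "11.0.2"    -> 11
    esac
}

# On the operation node, check the installed JDK (java must be on PATH):
# ver=$(java -version 2>&1 | awk -F '"' '/version/ {print $2}')
# [ "$(jdk_major "$ver")" -ge 8 ] && echo "JDK OK" || echo "JDK too old"
```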

Obtaining the SDK Package

The OBS adapter supports Hadoop 2.8.3 and 3.1.1, so the spark-2.4.5-bin-hadoop2.8.tgz package (built against Hadoop 2.8.3) is used in this example. Build it from the Spark source as follows:

    git clone -b v2.4.5 https://github.com/apache/spark.git

    cd spark

    ./dev/make-distribution.sh --name hadoop2.8 --tgz -Pkubernetes -Pyarn -Dhadoop.version=2.8.3

Obtaining the HUAWEI CLOUD OBS JAR Package

The hadoop-huaweicloud-2.8.3-hw-40.jar package is used, which can be obtained from https://github.com/huaweicloud/obsa-hdfs/tree/master/release.

Configuring Spark Running Environment

To simplify the operation, use the root user to place spark-2.4.5-bin-hadoop2.8.tgz in the /root directory on the operation node.

Run the following command to install Spark:

    tar -zxvf spark-2.4.5-bin-hadoop2.8.tgz
    mv spark-2.4.5-bin-hadoop2.8 spark-obs
    cat >> ~/.bashrc <<EOF
PATH=/root/spark-obs/bin:\$PATH
PATH=/root/spark-obs/sbin:\$PATH
export SPARK_HOME=/root/spark-obs
EOF
 
    source ~/.bashrc

The spark-submit script is now available. Run the spark-submit --version command to check the Spark version.
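As a sanity check, a tiny helper (illustrative only) confirms that a command resolved on PATH after sourcing ~/.bashrc:

```shell
#!/bin/sh
# have_cmd NAME: succeed if NAME resolves on PATH; used here to confirm
# that the Spark bin/sbin directories were added correctly.
have_cmd() { command -v "$1" >/dev/null 2>&1; }

# After `source ~/.bashrc` on the operation node:
# have_cmd spark-submit && spark-submit --version
```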

Interconnecting Spark with OBS

  1. Copy the HUAWEI CLOUD OBS JAR package to the corresponding directory.

    cp hadoop-huaweicloud-2.8.3-hw-40.jar /root/spark-obs/jars/

  2. Modify the Spark configuration.

    To interconnect Spark with OBS, append the following OBS settings to spark-defaults.conf:

        cp ~/spark-obs/conf/spark-defaults.conf.template ~/spark-obs/conf/spark-defaults.conf
        cat >> ~/spark-obs/conf/spark-defaults.conf <<EOF
spark.hadoop.fs.obs.readahead.inputstream.enabled=true
spark.hadoop.fs.obs.buffer.max.range=6291456
spark.hadoop.fs.obs.buffer.part.size=2097152
spark.hadoop.fs.obs.threads.read.core=500
spark.hadoop.fs.obs.threads.read.max=1000
spark.hadoop.fs.obs.write.buffer.size=8192
spark.hadoop.fs.obs.read.buffer.size=8192
spark.hadoop.fs.obs.connection.maximum=1000
spark.hadoop.fs.obs.access.key=******
spark.hadoop.fs.obs.secret.key=******
spark.hadoop.fs.obs.endpoint=******
spark.hadoop.fs.obs.buffer.dir=/root/hadoop-obs/obs-cache
spark.hadoop.fs.obs.impl=org.apache.hadoop.fs.obs.OBSFileSystem
spark.hadoop.fs.obs.connection.ssl.enabled=false
spark.hadoop.fs.obs.fast.upload=true
spark.hadoop.fs.obs.socket.send.buffer=65536
spark.hadoop.fs.obs.socket.recv.buffer=65536
spark.hadoop.fs.obs.max.total.tasks=20
spark.hadoop.fs.obs.threads.max=20
EOF
        
        vim ~/spark-obs/conf/spark-defaults.conf

    Replace the ****** placeholders of spark.hadoop.fs.obs.access.key, spark.hadoop.fs.obs.secret.key, and spark.hadoop.fs.obs.endpoint with your actual AK, SK, and OBS endpoint.
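Instead of editing the file by hand in vim, the placeholders can be filled in with a small idempotent helper (a hypothetical sketch, not part of Spark): it replaces a key's value if the key already exists in the file and appends it otherwise.

```shell
#!/bin/sh
# set_conf FILE KEY VALUE: set KEY=VALUE in a spark-defaults style file,
# replacing an existing line for KEY or appending a new one.
set_conf() {
    file=$1 key=$2 value=$3
    if grep -q "^${key}=" "$file" 2>/dev/null; then
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example usage (substitute your real AK, SK, and endpoint):
# set_conf ~/spark-obs/conf/spark-defaults.conf spark.hadoop.fs.obs.access.key "<your-ak>"
# set_conf ~/spark-obs/conf/spark-defaults.conf spark.hadoop.fs.obs.secret.key "<your-sk>"
# set_conf ~/spark-obs/conf/spark-defaults.conf spark.hadoop.fs.obs.endpoint  "<your-obs-endpoint>"
```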

Pushing an Image to SWR

Running Spark on Kubernetes requires a Spark image of the same version. A Dockerfile is generated during compilation (kubernetes/dockerfiles/spark/Dockerfile). Use it to build an image and push it to SWR.

  1. Create an image.

    cd ~/spark-obs

    docker build -t spark:2.4.5-obs -f kubernetes/dockerfiles/spark/Dockerfile .

  2. Push the image.

    Log in to the SWR console and obtain the login command.

    Log in to the node where the image is created and run the login command.

    docker tag {Image name}:{Image tag} swr.cn-east-3.myhuaweicloud.com/{Organization name}/{Image name}:{Image tag}

    docker push swr.cn-east-3.myhuaweicloud.com/{Organization name}/{Image name}:{Image tag}

    Record the image access address for later use, for example, swr.cn-east-3.myhuaweicloud.com/batch/spark:2.4.5-obs.
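The pushed image is typically consumed by spark-submit in cluster mode on Kubernetes. The sketch below assembles such a command as a dry run and prints it for review; the API server address is a placeholder, and the service account and namespace settings may need to be added for your cluster.

```shell
#!/bin/sh
# Assemble a spark-submit invocation for Kubernetes as a dry run.
# APISERVER is a placeholder; IMAGE is the address recorded above.
APISERVER="k8s://https://<apiserver-ip>:<port>"
IMAGE="swr.cn-east-3.myhuaweicloud.com/batch/spark:2.4.5-obs"

CMD="spark-submit \
 --master $APISERVER \
 --deploy-mode cluster \
 --name spark-pi \
 --class org.apache.spark.examples.SparkPi \
 --conf spark.executor.instances=2 \
 --conf spark.kubernetes.container.image=$IMAGE \
 local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar"

# Print the command for review; run it with `eval "$CMD"` once the
# placeholder values are filled in.
echo "$CMD"
```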

Configuring Spark History Server

    cat >> ~/spark-obs/conf/spark-defaults.conf <<EOF
spark.eventLog.enabled=true
spark.eventLog.dir=obs://******
EOF

Ensure that the bucket name and directory in the preceding command are valid.

For example, obs://spark-sh1/history-obs/ is a valid OBS directory.

Modify the ~/spark-obs/conf/spark-env.sh file. If it does not exist, copy the template and append the history option:

    cp ~/spark-obs/conf/spark-env.sh.template ~/spark-obs/conf/spark-env.sh

    cat >> ~/spark-obs/conf/spark-env.sh <<EOF
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=obs://******"
EOF

The OBS directory must be the same as the one configured in spark-defaults.conf.

Run the following command to start Spark History Server:

    start-history-server.sh

After the startup, you can access the server over port 18080.
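To verify that the server is up, the History Server also exposes a REST API under /api/v1 on the same port. A small helper (illustrative only) builds the endpoint URL:

```shell
#!/bin/sh
# history_api_url HOST [PORT]: URL of the History Server applications
# endpoint. The REST API lives under /api/v1 on the web UI port
# (18080 by default).
history_api_url() {
    printf 'http://%s:%s/api/v1/applications' "$1" "${2:-18080}"
}

# Once the server is running, list the logged applications:
# curl -s "$(history_api_url localhost)"
```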