Updated on 2023-12-20 GMT+08:00

Connecting Spark to OBS

Overview

Apache Spark is a fast, general-purpose compute engine for processing large-scale data sets. This section describes how to configure Spark so that it can access data stored in OBS.

Prerequisites

Hadoop has been installed. For details, see Connecting Hadoop to OBS.

Precautions

To reduce log output, add the following configuration to the /opt/spark-2.3.3/conf/log4j.properties file (the file is created in step 3 of the procedure):

log4j.logger.com.obs=ERROR
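
The setting can also be appended from the command line. The following is a minimal sketch, not part of the official procedure: the function name add_obs_log_level is hypothetical, and the function assumes the path passed to it is the log4j.properties file described above. It adds the line only if no log4j.logger.com.obs entry is present, so running it twice does not duplicate the setting.

```shell
#!/bin/sh
# Hypothetical helper: append the OBS logger level to a log4j.properties file
# unless an entry for log4j.logger.com.obs already exists.
add_obs_log_level() {
    conf_file="$1"
    if grep -q '^log4j\.logger\.com\.obs[[:space:]]*=' "$conf_file" 2>/dev/null; then
        echo "already configured: $conf_file"
    else
        # The leading newline guards against a file that does not end with one.
        printf '\nlog4j.logger.com.obs=ERROR\n' >> "$conf_file"
        echo "updated: $conf_file"
    fi
}
```

For the installation used in this section, the call would be add_obs_log_level /opt/spark-2.3.3/conf/log4j.properties.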

Procedure

The following uses Spark 2.3.3 as an example.

  1. Download spark-2.3.3-bin-without-hadoop.tgz and decompress it to the /opt/spark-2.3.3 directory.
  2. Add the following content to the /etc/profile file:

    export SPARK_HOME=/opt/spark-2.3.3
    export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH

  3. Configure Spark.

    1. Rename spark-env.sh.template under /opt/spark-2.3.3/conf/ to spark-env.sh and add the following configuration:
      export SPARK_DIST_CLASSPATH=$(hadoop classpath)

      For more configuration options, see Apache Hadoop.

    2. Rename log4j.properties.template under /opt/spark-2.3.3/conf/ to log4j.properties.

  4. Check whether the connection is successful by running the JavaWordCount example against a file stored in OBS:

    $SPARK_HOME/bin/run-example org.apache.spark.examples.JavaWordCount obs://obs-bucket/input/test.txt
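
Step 3 of the procedure can be scripted so that it is safe to repeat. The following is a rough sketch under stated assumptions: the function name prepare_spark_conf is hypothetical, and the function expects the Spark installation directory (for example /opt/spark-2.3.3) as its only argument. It renames the two templates only when the target files do not exist yet and adds the SPARK_DIST_CLASSPATH line only once.

```shell
#!/bin/sh
# Hypothetical helper: apply step 3 to the conf directory of a Spark installation.
prepare_spark_conf() {
    conf_dir="$1/conf"
    for name in spark-env.sh log4j.properties; do
        # Rename the template only if the target file does not exist yet.
        if [ ! -f "$conf_dir/$name" ] && [ -f "$conf_dir/$name.template" ]; then
            mv "$conf_dir/$name.template" "$conf_dir/$name"
        fi
    done
    # Add the Hadoop classpath export unless an uncommented one is already present.
    if ! grep -q '^[[:space:]]*export SPARK_DIST_CLASSPATH=' "$conf_dir/spark-env.sh" 2>/dev/null; then
        # Single quotes keep $(hadoop classpath) literal, so it is evaluated
        # when Spark sources spark-env.sh rather than when this helper runs.
        echo 'export SPARK_DIST_CLASSPATH=$(hadoop classpath)' >> "$conf_dir/spark-env.sh"
    fi
}
```

After calling prepare_spark_conf /opt/spark-2.3.3, the check in step 4 can be run as shown. The check assumes that obs://obs-bucket/input/test.txt already exists in the bucket.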