Connecting Spark to OBS

Apache Spark is a fast and general compute engine for processing large-scale data sets.

Before you start, make sure that Hadoop has been installed. For details, see Connecting Hadoop to OBS.

To reduce log output, add the following line to the /opt/spark-2.3.3/conf/log4j.properties file (created from its template when you configure Spark):

log4j.logger.com.obs=ERROR

The following uses Spark 2.3.3 as an example.

Download spark-2.3.3-bin-without-hadoop.tgz and decompress it to the /opt/spark-2.3.3 directory.
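The extraction step could be scripted roughly as follows. This is a minimal sketch, not an official installer: the function name `install_spark` is hypothetical, and it assumes that the archive has already been downloaded and that your `tar` supports `--strip-components` (GNU tar and bsdtar do).

```shell
# Hypothetical helper: unpack a Spark archive into a target directory.
# $1 = path to the downloaded .tgz, $2 = destination (e.g. /opt/spark-2.3.3)
install_spark() {
    archive="$1"
    dest="$2"
    mkdir -p "$dest" || return 1
    # --strip-components=1 drops the archive's top-level folder so that
    # bin/, conf/, jars/ and so on land directly in $dest.
    tar -xzf "$archive" -C "$dest" --strip-components=1
}

# Example call (run as a user with write access to /opt):
# install_spark spark-2.3.3-bin-without-hadoop.tgz /opt/spark-2.3.3
```

Stripping the top-level folder avoids a separate rename from spark-2.3.3-bin-without-hadoop to spark-2.3.3.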

Add the following content to the /etc/profile file:

export SPARK_HOME=/opt/spark-2.3.3
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
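The new variables only take effect in shells that have re-read /etc/profile. A quick sanity check might look like the sketch below; `check_spark_env` is a hypothetical helper name, not part of Spark.

```shell
# Hypothetical check: confirm SPARK_HOME is set and its bin directory is on PATH.
check_spark_env() {
    [ -n "$SPARK_HOME" ] || { echo "SPARK_HOME is not set" >&2; return 1; }
    case ":$PATH:" in
        *":$SPARK_HOME/bin:"*) echo "Spark environment looks OK: $SPARK_HOME" ;;
        *) echo "$SPARK_HOME/bin is not on PATH" >&2; return 1 ;;
    esac
}

# Typical use in the current shell after editing /etc/profile:
# . /etc/profile && check_spark_env
```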

Configure Spark.
1. Rename spark-env.sh.template under /opt/spark-2.3.3/conf/ to spark-env.sh and add the following line to it:
```
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```
  For more configuration options, see the Apache Hadoop documentation.
2. Rename log4j.properties.template under /opt/spark-2.3.3/conf/ to log4j.properties.
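
The two configuration steps above can be sketched as a single shell function. This is an illustrative sketch only: `configure_spark_conf` is a hypothetical name, and it assumes both template files are still present in the conf directory.

```shell
# Hypothetical helper: apply the two renames and the classpath setting.
# $1 = Spark conf directory (e.g. /opt/spark-2.3.3/conf)
configure_spark_conf() {
    conf_dir="$1"
    mv "$conf_dir/spark-env.sh.template" "$conf_dir/spark-env.sh" || return 1
    # Single quotes keep $(hadoop classpath) unexpanded here, so it is
    # evaluated when Spark sources spark-env.sh, not when this function runs.
    echo 'export SPARK_DIST_CLASSPATH=$(hadoop classpath)' >> "$conf_dir/spark-env.sh"
    mv "$conf_dir/log4j.properties.template" "$conf_dir/log4j.properties"
}

# Example call:
# configure_spark_conf /opt/spark-2.3.3/conf
```

Writing the line unexpanded means a later change to the Hadoop classpath is picked up without editing spark-env.sh again.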
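
Check whether the connection is successful by running the word count example on a file in your bucket (replace obs-bucket and the object path with your own):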

$SPARK_HOME/bin/run-example org.apache.spark.examples.JavaWordCount obs://obs-bucket/input/test.txt
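
The command above expects the input object to exist already. One way to stage it and run the check in one go is sketched below; `run_wordcount_check` is a hypothetical helper, and it assumes the `hadoop` command is on PATH and already configured for OBS as described in Connecting Hadoop to OBS.

```shell
# Hypothetical helper: upload a local file to OBS, then run the word count on it.
# $1 = local file, $2 = OBS URI (e.g. obs://obs-bucket/input/test.txt)
run_wordcount_check() {
    local_file="$1"
    obs_uri="$2"
    # Create the parent "directory" first; ${obs_uri%/*} strips the file name.
    hadoop fs -mkdir -p "${obs_uri%/*}" || return 1
    # -f overwrites the object if it already exists.
    hadoop fs -put -f "$local_file" "$obs_uri" || return 1
    "$SPARK_HOME/bin/run-example" org.apache.spark.examples.JavaWordCount "$obs_uri"
}

# Example call:
# run_wordcount_check ./test.txt obs://obs-bucket/input/test.txt
```

If the example prints word counts for the file, Spark is reading from OBS through the Hadoop classpath configured earlier.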