Connecting Spark to OBS

Apache Spark is a fast and general compute engine for processing large-scale data sets.

Before you begin, make sure that Hadoop has been installed. For details, see Connecting Hadoop to OBS.

To reduce output logs, add the following configuration to the /opt/spark-2.3.3/conf/log4j.properties file (this file is created from its template in the Configure Spark step):

log4j.logger.com.obs=ERROR

The following uses Spark 2.3.3 as an example.

Download spark-2.3.3-bin-without-hadoop.tgz and decompress it to the /opt/spark-2.3.3 directory.
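
The extraction step can be sketched as follows. The function name `extract_spark` is hypothetical, and the sketch assumes the archive unpacks into a single top-level folder (as Spark release tarballs normally do), which is why `--strip-components=1` is used to place the files directly in the target directory.

```shell
# Hypothetical helper: unpack a Spark tarball so its contents land directly in the target directory.
# $1 = path to the downloaded .tgz, $2 = destination (defaults to the path used in this guide).
extract_spark() {
    spark_tarball="$1"
    spark_dest="${2:-/opt/spark-2.3.3}"
    mkdir -p "$spark_dest" || return 1
    # Drop the archive's top-level folder so that bin/, conf/, and so on sit under $spark_dest.
    tar -xzf "$spark_tarball" -C "$spark_dest" --strip-components=1
}
```

For example, `extract_spark spark-2.3.3-bin-without-hadoop.tgz /opt/spark-2.3.3` should leave `/opt/spark-2.3.3/bin` and `/opt/spark-2.3.3/conf` in place. Writing to /opt typically requires root privileges.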

Add the following content to the /etc/profile file:

export SPARK_HOME=/opt/spark-2.3.3
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
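
After editing /etc/profile, it may help to confirm that the variables took effect in the current shell. The following is a minimal sketch; `check_spark_env` is a hypothetical helper name, not part of Spark.

```shell
# Hypothetical helper: confirm SPARK_HOME is set and that spark-submit resolves on PATH.
check_spark_env() {
    if [ -z "${SPARK_HOME:-}" ]; then
        echo "SPARK_HOME is not set" >&2
        return 1
    fi
    if ! command -v spark-submit >/dev/null 2>&1; then
        echo "spark-submit not found on PATH" >&2
        return 1
    fi
    echo "SPARK_HOME=$SPARK_HOME"
    echo "spark-submit=$(command -v spark-submit)"
}
```

A typical use would be `. /etc/profile && check_spark_env`, since changes to /etc/profile are not picked up by an already running shell until the file is sourced again or a new login session is started.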

Configure Spark.
1. Rename spark-env.sh.template under /opt/spark-2.3.3/conf/ to spark-env.sh and add the following configuration:
```
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```
  For more configurations, see Apache Hadoop.
2. Rename log4j.properties.template under /opt/spark-2.3.3/conf/ to log4j.properties.
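
The two configuration steps, together with the optional log-level setting mentioned earlier, can be sketched as one repeatable shell function. This is an illustration under assumptions: `configure_spark_conf` is a hypothetical name, and the sketch copies the templates rather than renaming them so that the originals remain available.

```shell
# Hypothetical helper: create spark-env.sh and log4j.properties from their templates.
# $1 = Spark conf directory (defaults to the path used in this guide).
configure_spark_conf() {
    conf_dir="${1:-/opt/spark-2.3.3/conf}"
    # Copy instead of rename so the templates stay in place (a choice of this sketch).
    [ -f "$conf_dir/spark-env.sh" ] || cp "$conf_dir/spark-env.sh.template" "$conf_dir/spark-env.sh" || return 1
    [ -f "$conf_dir/log4j.properties" ] || cp "$conf_dir/log4j.properties.template" "$conf_dir/log4j.properties" || return 1
    # Append each setting only if an active (uncommented) line is not already present.
    grep -q '^export SPARK_DIST_CLASSPATH=' "$conf_dir/spark-env.sh" ||
        echo 'export SPARK_DIST_CLASSPATH=$(hadoop classpath)' >> "$conf_dir/spark-env.sh"
    grep -q '^log4j\.logger\.com\.obs=' "$conf_dir/log4j.properties" ||
        echo 'log4j.logger.com.obs=ERROR' >> "$conf_dir/log4j.properties"
}
```

Calling `configure_spark_conf /opt/spark-2.3.3/conf` more than once should leave the files unchanged after the first run, because each setting is appended only when missing.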
Check whether the connection is successful by running the JavaWordCount example against a text file stored in an OBS bucket:

$SPARK_HOME/bin/run-example org.apache.spark.examples.JavaWordCount obs://obs-bucket/input/test.txt
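
The example above expects the input file to already exist in the bucket. A minimal sketch of staging a local file and then running the check is shown below. `run_obs_wordcount` is a hypothetical helper, `obs-bucket` is a placeholder bucket name, and the sketch assumes that `hadoop fs` can already reach OBS as described in Connecting Hadoop to OBS.

```shell
# Hypothetical helper: upload a local text file to OBS, then run the word-count example on it.
# $1 = local file, $2 = OBS directory, for example obs://obs-bucket/input (placeholder bucket).
run_obs_wordcount() {
    wc_local_file="$1"
    wc_obs_dir="$2"
    hadoop fs -mkdir -p "$wc_obs_dir" || return 1
    # -f overwrites an existing object with the same name.
    hadoop fs -put -f "$wc_local_file" "$wc_obs_dir/" || return 1
    "$SPARK_HOME/bin/run-example" org.apache.spark.examples.JavaWordCount \
        "$wc_obs_dir/$(basename "$wc_local_file")"
}
```

For example, `run_obs_wordcount ./test.txt obs://obs-bucket/input` would upload test.txt and then run the same command as shown above. If the connection works, the job should print word counts for the file rather than a file system or authentication error.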
