Connecting Spark to OBS
Overview
Apache Spark is a fast, general-purpose compute engine for processing large-scale data sets. This section describes how to configure Spark so that it can access data stored in OBS.
Prerequisites
Hadoop has been installed. For details, see Connecting Hadoop to OBS.
Precautions
To reduce log output, add the following line to the /opt/spark-2.3.3/conf/log4j.properties file (this file is created from log4j.properties.template during the procedure):
log4j.logger.com.obs=ERROR
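The edit above can also be scripted. The following is a minimal sketch, not part of the official procedure: set_obs_log_level is a hypothetical helper name, and it assumes the properties file path is passed as the first argument. It appends the setting only if no com.obs logger line is already present, so it is safe to run more than once.

```shell
# Hypothetical helper (name is an assumption, not part of Spark or OBS):
# append the OBS logger setting to a log4j properties file unless a
# log4j.logger.com.obs entry already exists.
set_obs_log_level() {
  conf_file="$1"
  # Match the key with optional whitespace before '=' so an existing
  # "log4j.logger.com.obs= ERROR" style line is also detected.
  if ! grep -q '^log4j\.logger\.com\.obs[[:space:]]*=' "$conf_file" 2>/dev/null; then
    echo 'log4j.logger.com.obs=ERROR' >> "$conf_file"
  fi
}

# Usage with the path from this guide:
# set_obs_log_level /opt/spark-2.3.3/conf/log4j.properties
```

Running the helper a second time leaves the file unchanged, which avoids duplicate entries when the setup is repeated.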
Procedure
The following uses Spark 2.3.3 as an example.
- Download spark-2.3.3-bin-without-hadoop.tgz and decompress it to the /opt/spark-2.3.3 directory.
- Add the following content to the /etc/profile file:
export SPARK_HOME=/opt/spark-2.3.3
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
- Configure Spark.
- Rename spark-env.sh.template under /opt/spark-2.3.3/conf/ to spark-env.sh and add the following configuration:
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
For more configuration options, see Apache Hadoop.
- Rename log4j.properties.template under /opt/spark-2.3.3/conf/ to log4j.properties.
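- Check whether the connection is successful by running the word count example against a file in your OBS bucket (replace the bucket name and object path with your own):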
$SPARK_HOME/bin/run-example org.apache.spark.examples.JavaWordCount obs://obs-bucket/input/test.txt
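If the example fails to start, a common cause is a missing or incomplete spark-env.sh. The following is a minimal pre-flight sketch, not part of the official procedure: check_spark_env is a hypothetical helper name, and it assumes the Spark conf directory is passed as the first argument. It only checks that spark-env.sh exists and exports SPARK_DIST_CLASSPATH; it does not validate the Hadoop or OBS configuration itself.

```shell
# Hypothetical pre-flight check (name is an assumption): returns 0 only if
# spark-env.sh exists in the given conf directory and contains an
# "export SPARK_DIST_CLASSPATH=" line.
check_spark_env() {
  conf_dir="$1"
  env_file="$conf_dir/spark-env.sh"
  if [ ! -f "$env_file" ]; then
    echo "missing: $env_file (rename spark-env.sh.template first)" >&2
    return 1
  fi
  if ! grep -q '^export SPARK_DIST_CLASSPATH=' "$env_file"; then
    echo "SPARK_DIST_CLASSPATH is not set in $env_file" >&2
    return 1
  fi
  echo "ok: $env_file"
}

# Usage with the paths from this guide:
# check_spark_env "$SPARK_HOME/conf" && \
#   "$SPARK_HOME/bin/run-example" org.apache.spark.examples.JavaWordCount obs://obs-bucket/input/test.txt
```

Chaining the check with && means the example job is launched only when the basic configuration is in place.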