Connecting Spark to OBS
Overview
Apache Spark is a fast and general compute engine for processing large-scale data sets.
Prerequisites
Hadoop has been installed. For details, see Connecting Hadoop to OBS.
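A quick way to confirm this prerequisite is to check that the hadoop command is on the PATH, because the procedure below runs hadoop classpath. The following is a minimal sketch; the helper name require_cmd is hypothetical and not part of Hadoop or Spark.

```shell
# Hypothetical helper (not part of Hadoop or Spark): report whether a
# command is available on PATH. Returns 0 if found, 1 otherwise.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "missing prerequisite: $1" >&2
  return 1
}

# Example use: the procedure later evaluates `hadoop classpath`,
# so check for hadoop before continuing.
# require_cmd hadoop && hadoop classpath
```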
Precautions
To reduce output logs, add the following configuration to the /opt/spark-2.3.3/conf/log4j.properties file:
log4j.logger.com.obs=ERROR
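If you prefer to script this change, the setting can be appended only when it is not already present, so that repeated runs do not duplicate the line. This is a sketch under that assumption; the function name set_obs_log_level is hypothetical.

```shell
# Hypothetical helper: add the OBS logger setting to a log4j.properties
# file only if no log4j.logger.com.obs entry exists yet.
set_obs_log_level() {
  conf_file="$1"
  if ! grep -q '^log4j\.logger\.com\.obs=' "$conf_file" 2>/dev/null; then
    printf '%s\n' 'log4j.logger.com.obs=ERROR' >> "$conf_file"
  fi
}

# Example call, assuming the installation path used in this guide:
# set_obs_log_level /opt/spark-2.3.3/conf/log4j.properties
```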
Procedure
The following uses Spark 2.3.3 as an example.
- Download spark-2.3.3-bin-without-hadoop.tgz and decompress it to the /opt/spark-2.3.3 directory.
- Add the following content to the /etc/profile file:
export SPARK_HOME=/opt/spark-2.3.3
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
- Configure Spark.
- Rename spark-env.sh.template under /opt/spark-2.3.3/conf/ to spark-env.sh and add the following configuration:
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
For more configurations, see Apache Hadoop.
- Rename log4j.properties.template under /opt/spark-2.3.3/conf/ to log4j.properties.
- Check whether the connection is successful:
$SPARK_HOME/bin/run-example org.apache.spark.examples.JavaWordCount obs://obs-bucket/input/test.txt
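The "Configure Spark" sub-steps above can also be scripted. The following is a minimal sketch, not an official installer: the function name configure_spark_conf is hypothetical, and it assumes the template files are still in the conf directory of the Spark home passed as the first argument.

```shell
# Hypothetical helper: perform the "Configure Spark" sub-steps for the
# Spark home directory given as $1 (for example, /opt/spark-2.3.3).
configure_spark_conf() {
  conf_dir="$1/conf"

  # Rename the templates; skip a rename if its target already exists.
  [ -f "$conf_dir/spark-env.sh" ] || \
    mv "$conf_dir/spark-env.sh.template" "$conf_dir/spark-env.sh"
  [ -f "$conf_dir/log4j.properties" ] || \
    mv "$conf_dir/log4j.properties.template" "$conf_dir/log4j.properties"

  # Single quotes keep $(hadoop classpath) literal in the file, so it is
  # evaluated when spark-env.sh is sourced, not when this function runs.
  if ! grep -q '^export SPARK_DIST_CLASSPATH=' "$conf_dir/spark-env.sh"; then
    printf '%s\n' 'export SPARK_DIST_CLASSPATH=$(hadoop classpath)' \
      >> "$conf_dir/spark-env.sh"
  fi
}

# Example call, assuming the installation path used in this guide:
# configure_spark_conf /opt/spark-2.3.3
```

Because each step checks the current state first, running the function a second time leaves the files unchanged.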