Using the External Shuffle Service to Improve Performance
Scenario
When Spark runs an application that includes a shuffle stage, each executor not only runs tasks but also writes shuffle data and serves that data to other executors. If an executor is heavily loaded and enters garbage collection (GC), it cannot serve shuffle data to other executors, which delays their tasks.
The external shuffle service is an auxiliary service in NodeManager. It serves shuffle data on behalf of executors, which reduces the load on them. If GC occurs on an executor, tasks on other executors are not affected.
Procedure
- Enable the external shuffle service on NodeManager.
- On MRS Manager (for details about how to log in to MRS Manager, see Logging in to MRS Manager), choose Services > Yarn > Service Configuration and choose Yarn > Customize to add the following configuration items to yarn-site.xml:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
| Parameter | Description |
| --- | --- |
| yarn.nodemanager.aux-services | A long-term auxiliary service in NodeManager for improving shuffle computing performance |
| yarn.nodemanager.aux-services.spark_shuffle.class | Class of an auxiliary service in NodeManager |
- Add a dependency JAR file.
Copy ${SPARK_HOME}/lib/spark-1.5.1-yarn-shuffle.jar to the ${HADOOP_HOME}/share/hadoop/yarn/lib/ directory.
- Restart the NodeManager process so that the external shuffle service is started.
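The yarn-site.xml change in this step can be sketched in Python, including the case where yarn.nodemanager.aux-services already has a value and spark_shuffle must be appended with a comma. This is only an illustration of the resulting values, not the way MRS Manager applies them. The function names are hypothetical, and mapreduce_shuffle in the usage note is an assumed example of a pre-existing value.

```python
import xml.etree.ElementTree as ET

AUX_KEY = "yarn.nodemanager.aux-services"
CLASS_KEY = "yarn.nodemanager.aux-services.spark_shuffle.class"
CLASS_VALUE = "org.apache.spark.network.yarn.YarnShuffleService"


def merge_aux_services(existing, service="spark_shuffle"):
    """Return the comma-separated value with `service` present exactly once."""
    names = [n.strip() for n in (existing or "").split(",") if n.strip()]
    if service not in names:
        names.append(service)
    return ",".join(names)


def get_property(root, name):
    """Return the <value> text of the named property, or None if absent."""
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def set_property(root, name, value):
    """Create or update a <property> under a Hadoop <configuration> root."""
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            return
    prop = ET.SubElement(root, "property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value


def enable_spark_shuffle(root):
    """Apply both settings from this step to a parsed yarn-site.xml tree."""
    set_property(root, AUX_KEY, merge_aux_services(get_property(root, AUX_KEY)))
    set_property(root, CLASS_KEY, CLASS_VALUE)
```

For example, calling enable_spark_shuffle on a tree whose yarn.nodemanager.aux-services value is mapreduce_shuffle leaves that entry in place and appends spark_shuffle to it; calling it a second time changes nothing.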
- Apply the external shuffle service to Spark applications.
- Add the following configuration items to the client installation directory /Spark/spark/conf/spark-defaults.conf:
spark.shuffle.service.enabled true
spark.shuffle.service.port 7337
| Parameter | Description |
| --- | --- |
| spark.shuffle.service.enabled | Whether Spark applications use the external shuffle service, a long-term auxiliary service in NodeManager for improving shuffle computing performance. The default value is false, indicating that this function is disabled. |
| spark.shuffle.service.port | Port on which the shuffle service listens for requests to fetch shuffle data. This parameter is optional and its default value is 7337. |
1. If the yarn.nodemanager.aux-services configuration item exists, add spark_shuffle to its value. Use a comma to separate this value from other values.
2. The value of spark.shuffle.service.port must be the same as that in the yarn-site.xml file.
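The two notes above lend themselves to a quick consistency check before submitting applications. The following is a minimal sketch, not an official tool: the function names are hypothetical, it assumes the port appears in yarn-site.xml under the same key spark.shuffle.service.port, and it only handles whitespace-separated entries in spark-defaults.conf.

```python
import xml.etree.ElementTree as ET

PORT_KEY = "spark.shuffle.service.port"
DEFAULT_PORT = "7337"  # default stated in the parameter table


def parse_spark_defaults(text):
    """Parse 'key value' lines, skipping blank lines and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def parse_yarn_site(xml_text):
    """Return a dict of name -> value from a Hadoop <configuration> document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): (prop.findtext("value") or "").strip()
        for prop in root.findall("property")
    }


def check_shuffle_settings(spark_text, yarn_xml):
    """Return a list of human-readable problems; an empty list means consistent."""
    spark = parse_spark_defaults(spark_text)
    yarn = parse_yarn_site(yarn_xml)
    problems = []
    if spark.get("spark.shuffle.service.enabled", "false").lower() != "true":
        problems.append("spark.shuffle.service.enabled is not true")
    aux = [s.strip() for s in yarn.get("yarn.nodemanager.aux-services", "").split(",")]
    if "spark_shuffle" not in aux:
        problems.append("spark_shuffle is missing from yarn.nodemanager.aux-services")
    if spark.get(PORT_KEY, DEFAULT_PORT) != yarn.get(PORT_KEY, DEFAULT_PORT):
        problems.append("spark.shuffle.service.port differs between the two files")
    return problems
```

In practice, the two arguments would be the contents of spark-defaults.conf and yarn-site.xml read from disk. When either file omits the port, the sketch falls back to the default 7337 for that side, so a port set only on the client is compared against the default.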