Configuring the Spark Native Engine
This section applies only to MRS 3.3.0 or later.
Scenarios
The Spark Native engine uses a vectorized C++ acceleration library to accelerate Spark operators. Traditional Spark SQL processes row-based data and relies on JVM codegen to speed up queries. The JVM imposes a range of restrictions on the generated Java code, such as the method length and the number of parameters, and row-based data makes poor use of memory bandwidth, which leaves room for performance improvement. With the mature vectorized C++ acceleration library, data is stored in memory in columnar (vectorized) format, which improves bandwidth utilization and speeds up queries by processing columns in batches.
You can enable the Spark Native engine to accelerate Spark SQL queries.
Constraints
- The Scan operator supports the following data types: Boolean, Integer, Long, Float, Double, String, Date, and Decimal.
- Parquet and ORC data formats are supported.
- OBS and HDFS file systems are supported.
- x86_64 (AMD64) and Arm architectures are supported.
- Spark SQL mode is supported.
Parameters
- Modify the following parameters in the {Client installation directory}/Spark/spark/conf/spark-defaults.conf file on the Spark client. A complete example configuration is provided after the parameter list.
- spark.plugins
  Description: Plug-in used by Spark. Set this parameter to io.glutenproject.GlutenPlugin.
  NOTE: If spark.plugins has already been configured, append io.glutenproject.GlutenPlugin to the existing value, separated by a comma (,).
  Default value: N/A
- spark.memory.offHeap.enabled
  Description: Whether to use JVM off-heap memory. Set this parameter to true, because Native acceleration requires JVM off-heap memory.
  Default value: false
- spark.memory.offHeap.size
  Description: Size of the off-heap memory. Set the value based on site requirements. You can start with 1 GB and adjust as needed.
  Default value: -1
- spark.yarn.dist.files
  Description: Distributes libch.so and libjsig.so to all nodes so that every executor can preload these libraries through the spark.executorEnv.LD_PRELOAD parameter.
  - For the x86 architecture, set this parameter to {Client installation directory}/Spark/spark/native/libch.so,{Client installation directory}/JDK/jdk1.8.0_372/jre/lib/amd64/libjsig.so.
  - For the Arm architecture, set this parameter to {Client installation directory}/Spark/spark/native/libch.so,{Client installation directory}/JDK/jdk1.8.0_372/jre/lib/aarch64/libjsig.so.
  NOTE: If spark.yarn.dist.files has already been configured, append these paths to the existing value, separated by commas (,). The libch.so and libjsig.so files must be the same ones referenced by export LD_PRELOAD in spark-env.sh (see the spark-env.sh step below).
  Default value: None
- spark.executorEnv.LD_PRELOAD
  Description: LD_PRELOAD environment variable for the executor. Set this parameter to $PWD/libch.so $PWD/libjsig.so.
  NOTE: This parameter makes the executor preload libch.so and libjsig.so. If spark.executorEnv.LD_PRELOAD has already been configured, append the preceding values, separated by spaces.
  Default value: None
- spark.gluten.sql.columnar.libpath
  Description: Path of the Native acceleration library on the server. If the acceleration library is not deployed on the server, this file does not exist and the parameter is left empty.
  Default value: The Spark installation directory in the cluster, for example, ${BIGDATA_HOME}/FusionInsight_Spark_xxx/install/FusionInsight-Spark-*/spark/native/libch.so.
- spark.sql.orc.impl
  Description: native: The native ORC reader of Spark is used to read data. hive: Hive is used to process ORC data. Set this parameter to native.
  Default value: hive
- spark.gluten.sql.columnar.scanOnly
  Description: Whether to enable scanOnly mode for acceleration. Set this parameter to true to enable scanOnly mode.
  Default value: false
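The following is a minimal sketch of these settings in spark-defaults.conf for an x86 client. The client installation directory /opt/hadoopclient is only an illustration; replace it with your actual client installation directory, and use the aarch64 libjsig.so path on Arm nodes.
# Enable the Gluten plug-in and the off-heap memory it requires.
spark.plugins                  io.glutenproject.GlutenPlugin
spark.memory.offHeap.enabled   true
spark.memory.offHeap.size      1g
# Distribute the native libraries to all nodes and preload them in executors.
spark.yarn.dist.files          /opt/hadoopclient/Spark/spark/native/libch.so,/opt/hadoopclient/JDK/jdk1.8.0_372/jre/lib/amd64/libjsig.so
spark.executorEnv.LD_PRELOAD   $PWD/libch.so $PWD/libjsig.so
# Read ORC data with Spark's native reader.
spark.sql.orc.impl             native
# Optionally, set spark.gluten.sql.columnar.scanOnly to true to enable scanOnly mode.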
- Modify the following configuration in the {Client installation directory}/Spark/spark/conf/spark-env.sh file on the Spark client. An example is provided after this list.
- For the x86 architecture:
Set export LD_PRELOAD to {Client installation directory}/Spark/spark/native/libch.so {Client installation directory}/JDK/jdk1.8.0_372/jre/lib/amd64/libjsig.so.
- For the Arm architecture:
Set export LD_PRELOAD to {Client installation directory}/Spark/spark/native/libch.so {Client installation directory}/JDK/jdk1.8.0_372/jre/lib/aarch64/libjsig.so.
Note: Use the same libch.so and libjsig.so files as those specified in the spark.yarn.dist.files parameter. If there are multiple .so files, separate them with spaces and enclose the entire value in double quotation marks (").
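For reference, the corresponding line in spark-env.sh on an x86 client might look as follows, again assuming the client is installed in the illustrative directory /opt/hadoopclient (replace with your actual paths; use the aarch64 libjsig.so on Arm nodes):
export LD_PRELOAD="/opt/hadoopclient/Spark/spark/native/libch.so /opt/hadoopclient/JDK/jdk1.8.0_372/jre/lib/amd64/libjsig.so"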