
Configuring the Spark Native Engine

Scenario

The Spark Native engine uses a vectorized C++ acceleration library to accelerate Spark operators. Traditional Spark SQL processes data row by row and relies on JVM codegen to speed up queries, but the JVM places restrictions on the generated Java code, such as the method length and the number of parameters, and row-based data makes poor use of memory bandwidth, which limits performance. With the mature vectorized C++ acceleration library, data is stored in memory in vectorized format and columns are processed in batches, which improves bandwidth utilization and speeds up queries.

You can enable the Spark Native engine to accelerate SparkSQL queries.

Constraints

  • The Scan operator supports the following data types: Boolean, Integer, Long, Float, Double, String, Date, and Decimal.
  • Parquet and ORC data formats are supported.
  • OBS and HDFS file systems are supported.
  • AMD64 and Arm architectures are supported.
  • Spark SQL mode is supported.

Parameters

  1. Modify the following parameters in the Client installation directory/Spark/spark/conf/spark-defaults.conf file on the Spark client. (A combined example configuration follows the parameter list.)

    Parameter: spark.plugins
    Description: Plug-in used by Spark. Set this parameter to io.glutenproject.GlutenPlugin.
    NOTE: If spark.plugins is already configured, append io.glutenproject.GlutenPlugin to the existing value and separate the entries with a comma (,).
    Default value: N/A

    Parameter: spark.memory.offHeap.enabled
    Description: Whether to use JVM off-heap memory. Native acceleration requires off-heap memory, so set this parameter to true.
    Default value: false

    Parameter: spark.memory.offHeap.size
    Description: Size of the off-heap memory. Set the value based on site requirements; 1 GB is a reasonable initial value.
    Default value: -1

    Parameter: spark.yarn.dist.files
    Description: Distributes libch.so and libjsig.so to all nodes so that every executor can preload these libraries through the spark.executorEnv.LD_PRELOAD parameter.
    • For the x86 architecture, set this parameter to {Client installation directory}/Spark/spark/native/libch.so,{Client installation directory}/JDK/jdk1.8.0_372/jre/lib/amd64/libjsig.so.
    • For the Arm architecture, set this parameter to {Client installation directory}/Spark/spark/native/libch.so,{Client installation directory}/JDK/jdk1.8.0_372/jre/lib/aarch64/libjsig.so.
    NOTE: If spark.yarn.dist.files is already configured, append these paths to the existing value and separate the entries with commas (,). Use the same libch.so and libjsig.so as those referenced by export LD_PRELOAD in spark-env.sh in step 2.
    Default value: None

    Parameter: spark.executorEnv.LD_PRELOAD
    Description: LD_PRELOAD environment variable for the executor. Set this parameter to $PWD/libch.so $PWD/libjsig.so.
    NOTE: The executor uses this parameter to preload libch.so and libjsig.so. If spark.executorEnv.LD_PRELOAD is already configured, append these entries to the existing value and separate them with spaces.
    Default value: None

    Parameter: spark.gluten.sql.columnar.libpath
    Description: Path of the Native acceleration library on the server. If this file does not exist, leave the parameter blank.
    Default value: Spark installation directory in the cluster, for example, ${BIGDATA_HOME}/FusionInsight_Spark_xxx/install/FusionInsight-Spark-*/spark/native/libch.so

    Parameter: spark.sql.orc.impl
    Description: Implementation used to process ORC data. native: Spark's native ORC reader is used. hive: Hive is used to process ORC data. Set this parameter to native.
    Default value: hive

    Parameter: spark.gluten.sql.columnar.scanOnly
    Description: Whether to enable the scanOnly mode for acceleration. Set this parameter to true to enable the scanOnly mode.
    Default value: false
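
    For reference, a combined spark-defaults.conf fragment with all of the preceding settings might look as follows. This is a sketch assuming an x86 client installed in /opt/client; adjust the paths and the off-heap size to the actual environment.

      # Enable the Gluten plug-in (append to any existing spark.plugins value with a comma).
      spark.plugins io.glutenproject.GlutenPlugin
      # Native acceleration requires JVM off-heap memory.
      spark.memory.offHeap.enabled true
      spark.memory.offHeap.size 1g
      # Distribute the native libraries to all nodes (x86 paths; /opt/client is assumed).
      spark.yarn.dist.files /opt/client/Spark/spark/native/libch.so,/opt/client/JDK/jdk1.8.0_372/jre/lib/amd64/libjsig.so
      # Preload the distributed libraries in each executor (space-separated).
      spark.executorEnv.LD_PRELOAD $PWD/libch.so $PWD/libjsig.so
      # Read ORC data with Spark's native reader.
      spark.sql.orc.impl native
      # Optional: accelerate only the Scan operator.
      spark.gluten.sql.columnar.scanOnly true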

  2. Modify the following parameters in the Client installation directory/Spark/spark/conf/spark-env.sh file on the Spark client.
    • On x86, set the parameter as follows:

      Set export LD_PRELOAD to {Client installation directory}/Spark/spark/native/libch.so {Client installation directory}/JDK/jdk1.8.0_372/jre/lib/amd64/libjsig.so.

    • On Arm, set the parameter as follows:

      Set export LD_PRELOAD to {Client installation directory}/Spark/spark/native/libch.so {Client installation directory}/JDK/jdk1.8.0_372/jre/lib/aarch64/libjsig.so.

      Note: Use the libch.so and libjsig.so that are in the same paths as those set in the spark.yarn.dist.files parameter. Separate the .so libraries with a space and enclose the entire value in double quotation marks, as shown in the example below.
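
      The corresponding spark-env.sh entry might look as follows. This is a sketch for the x86 case, assuming the client is installed in /opt/client; adjust the paths to the actual environment.

        # Preload the native libraries; the paths must match those in spark.yarn.dist.files.
        export LD_PRELOAD="/opt/client/Spark/spark/native/libch.so /opt/client/JDK/jdk1.8.0_372/jre/lib/amd64/libjsig.so"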