Updated on 2024-12-13 GMT+08:00

Configuring Spark Parameters Rapidly

Overview

This section describes how to quickly configure common parameters and lists parameters that are not recommended to be modified when Spark2x is used.

Common Parameters to Be Configured

Some parameters have been adapted during cluster installation. However, the following parameters need to be adjusted based on application scenarios. Unless otherwise specified, the following parameters are configured in the spark-defaults.conf file on the Spark2x client.
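As a rough illustration of the file format, the following Python sketch reads spark-defaults.conf-style text into a dictionary so that current values can be inspected before they are changed. It assumes one property per line, with the key and value separated by whitespace or an equal sign, and treats lines starting with # as comments. The function name parse_spark_defaults and the sample text are hypothetical and are not part of Spark or the Spark2x client.

```python
import re

# Key, then an optional "=" or ":" separator, then the rest of the line as the value.
_LINE = re.compile(r"^([^\s=:]+)\s*[=:]?\s*(.*)$")


def parse_spark_defaults(text):
    """Parse spark-defaults.conf-style text into a dict of property -> value."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        match = _LINE.match(line)
        if match:
            conf[match.group(1)] = match.group(2).strip()
    return conf


# Hypothetical sample content, for illustration only.
sample = """
# client-side settings
spark.executor.memory 4G
spark.yarn.queue = default
spark.sql.autoBroadcastJoinThreshold 10485760
"""

conf = parse_spark_defaults(sample)
for key in sorted(conf):
    print(key, "->", conf[key])
```

Values that themselves contain an equal sign, such as JVM options, are kept intact because only the first separator after the key is consumed.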

Table 1 Common parameters to be configured

Configuration Item

Description

Default Value

spark.sql.parquet.compression.codec

Sets the compression format of a non-partitioned Parquet table.

Set this parameter in the spark-defaults.conf configuration file on the JDBCServer server.

snappy

spark.dynamicAllocation.enabled

Indicates whether to use dynamic resource scheduling, which adjusts the number of executors registered with the application according to the workload. Currently, this parameter is valid only in Yarn mode.

The default value for JDBCServer is true, and that for the client is false.

false

spark.executor.memory

Indicates the memory size used by each executor process. The value is a string in the same format as JVM memory strings (for example, 512M or 2G).

4G

spark.sql.autoBroadcastJoinThreshold

Indicates the maximum size, in bytes, of a table that can be broadcast when two tables are joined.

  • When the size of a table involved in a join is less than the value of this parameter, the system broadcasts the table to all executors.
  • If the value is set to -1, broadcast is not performed.

10485760

spark.yarn.queue

Specifies the Yarn queue where JDBCServer resides.

Set the queue in the spark-defaults.conf configuration file on the JDBCServer server.

default

spark.driver.memory

Indicates the memory used by the driver process, that is, the process in which SparkContext is initialized (for example, 512M or 2G). In a large cluster, you are advised to set this parameter to a value from 32 GB to 64 GB.

4G

spark.yarn.security.credentials.hbase.enabled

Indicates whether to enable the function of obtaining HBase tokens. If the Spark on HBase function is required and a security cluster is configured, set this parameter to true. Otherwise, set this parameter to false.

false

spark.serializer

Used to serialize the objects that are sent over the network or need to be cached.

The default, Java serialization, works with any Serializable Java object but is slow. Therefore, you are advised to use org.apache.spark.serializer.KryoSerializer and configure Kryo serialization. The value can be any subclass of org.apache.spark.serializer.Serializer.

org.apache.spark.serializer.JavaSerializer

spark.executor.cores

Indicates the number of cores used by each executor.

In standalone mode and Mesos coarse-grained mode, setting this parameter allows an application to run multiple executors on the same worker when the worker has sufficient cores. Otherwise, each application can run only one executor on each worker.

1

spark.shuffle.service.enabled

Indicates whether to enable the external shuffle service, a long-running auxiliary service in NodeManager that improves shuffle performance.

false

spark.sql.adaptive.enabled

Indicates whether to enable the adaptive execution framework.

false

spark.executor.memoryOverhead

Indicates the off-heap memory to be allocated to each executor, in MB.

This memory accounts for VM overheads, interned strings, and other native overheads. The value increases with the executor size (usually 6% to 10%).

1 GB

spark.streaming.kafka.direct.lifo

Indicates whether to enable the LIFO function of Kafka.

false
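The size-related values in Table 1 can be cross-checked with a little arithmetic. The following Python sketch converts JVM-style size strings to bytes and estimates the 6% to 10% range mentioned for spark.executor.memoryOverhead. It assumes binary units (1M = 1024 x 1024 bytes), and the helper names parse_size and overhead_range_mb are hypothetical. It is a planning aid only and does not reproduce how Spark derives its own defaults.

```python
import re

# Binary multipliers for JVM-style size suffixes.
_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_size(text):
    """Convert a size string such as '512M', '4G', or '10485760' to bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMGT]?)B?\s*", text.upper())
    if not match:
        raise ValueError("unrecognized size string: %r" % text)
    return int(match.group(1)) * _UNITS[match.group(2)]


def overhead_range_mb(executor_memory, low=0.06, high=0.10):
    """Return the (low, high) overhead estimate in MB for an executor size."""
    size_mb = parse_size(executor_memory) / 1024 ** 2
    return round(size_mb * low), round(size_mb * high)


# The default spark.sql.autoBroadcastJoinThreshold of 10485760 bytes is 10 MB.
print(parse_size("10M"))

# 6% to 10% of the default 4G executor memory, in MB.
print(overhead_range_mb("4G"))
```

When a value in spark-defaults.conf must be given in bytes, such as spark.sql.autoBroadcastJoinThreshold, a helper like parse_size avoids hand-multiplying by 1024.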

Parameters Not Recommended to Be Modified

The following parameters have been adapted during cluster installation. You are not advised to modify them.

Table 2 Parameters not recommended to be modified

Configuration Item

Description

Default Value or Configuration Example

spark.password.factory

Selects the password parsing mode.

org.apache.spark.om.util.FIPasswordFactory

spark.ssl.ui.protocol

Sets the SSL protocol of the UI.

TLSv1.2

spark.yarn.archive

Specifies an archive that contains the Spark JAR files to be distributed to the Yarn cache. If this parameter is set, it replaces spark.yarn.jars and the archive is used in the containers of all applications. The archive should contain the JAR files in its root directory. The archive can also be hosted on HDFS to speed up file distribution.

hdfs://hacluster/user/spark2x/jars/xxx/spark-archive-2x.zip

NOTE:

The version xxx is used as an example. Replace it with the actual version number.

spark.yarn.am.extraJavaOptions

Indicates a string of extra JVM options to pass to the YARN ApplicationMaster in client mode. Use spark.driver.extraJavaOptions in cluster mode.

-Dlog4j.configuration=./__spark_conf__/__hadoop_conf__/log4j-executor.properties -Djava.security.auth.login.config=./__spark_conf__/__hadoop_conf__/jaas-zk.conf -Dzookeeper.server.principal=zookeeper/hadoop.<system domain name> -Djava.security.krb5.conf=./__spark_conf__/__hadoop_conf__/kdc.conf -Djdk.tls.ephemeralDHKeySize=2048

spark.shuffle.servicev2.port

Indicates the port on which the shuffle service listens for requests to obtain data.

27338

spark.ssl.historyServer.enabled

Sets whether the history server uses SSL.

true

spark.files.overwrite

Indicates whether to overwrite a file added through SparkContext.addFile() when the target file exists and its content does not match that of the source file.

false

spark.yarn.cluster.driver.extraClassPath

Indicates the extraClassPath of the driver in Yarn-cluster mode. Set the parameter to the path and parameters of the server.

${BIGDATA_HOME}/common/runtime/security

spark.driver.extraClassPath

Indicates the extra class path entries attached to the class path of the driver.

${BIGDATA_HOME}/common/runtime/security

spark.yarn.dist.innerfiles

Sets the files that need to be uploaded to HDFS from Spark in Yarn mode.

/Spark_path/spark/conf/s3p.file,/Spark_path/spark/conf/locals3.jceks

Spark_path is the installation path of the Spark client.

spark.sql.bigdata.register.dialect

Registers the SQL parser.

org.apache.spark.sql.hbase.HBaseSQLParser

spark.shuffle.manager

Indicates the shuffle implementation. There are two implementations: sort and hash. Sort-based shuffle uses memory more efficiently and is the default in Spark 1.2 and later versions. Spark 2.x and later versions do not support hash.

SORT

spark.deploy.zookeeper.url

Indicates the address of ZooKeeper. Multiple addresses are separated by commas (,).

For example:

host1:2181,host2:2181,host3:2181

spark.broadcast.factory

Indicates the broadcast mode.

org.apache.spark.broadcast.TorrentBroadcastFactory

spark.sql.session.state.builder

Specifies the session state builder.

org.apache.spark.sql.hive.FIHiveACLSessionStateBuilder

spark.executor.extraLibraryPath

Sets the special library path used when the executor JVM is started.

${BIGDATA_HOME}/FusionInsight_HD_xxx/install/FusionInsight-Hadoop-*/hadoop/lib/native

spark.ui.customErrorPage

Indicates whether to display the custom error information page when an error occurs on the page.

true

spark.httpdProxy.enable

Indicates whether to use the httpd proxy.

true

spark.ssl.ui.enabledAlgorithms

Sets the SSL cipher suites enabled for the UI, separated by commas (,).

TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_DHE_RSA_WITH_AES_256_GCM_SHA384,TLS_DHE_DSS_WITH_AES_256_GCM_SHA384,TLS_DHE_RSA_WITH_AES_128_GCM_SHA256,TLS_DHE_DSS_WITH_AES_128_GCM_SHA256

spark.ui.logout.enabled

Indicates whether to enable the logout button on the web UI of the Spark component.

true

spark.security.hideInfo.enabled

Indicates whether to hide sensitive information on the UI.

true

spark.yarn.cluster.driver.extraLibraryPath

Indicates the extraLibraryPath of the driver in Yarn-cluster mode. Set this parameter to the path and parameters of the server.

${BIGDATA_HOME}/FusionInsight_HD_xxx/install/FusionInsight-Hadoop-*/hadoop/lib/native

spark.driver.extraLibraryPath

Sets a special library path for starting the driver JVM.

${DATA_NODE_INSTALL_HOME}/hadoop/lib/native

spark.ui.killEnabled

Allows stages and jobs to be stopped on the web UI.

true

spark.yarn.access.hadoopFileSystems

Spark can access multiple NameService instances. If there are multiple NameService instances, set this parameter to all the NameService instances and separate them with commas (,).

hdfs://hacluster,hdfs://hacluster

spark.yarn.cluster.driver.extraJavaOptions

Indicates extra JVM options passed to the driver in Yarn-cluster mode, for example, GC settings and logging. Do not set Spark attributes or heap size using this option. Instead, set Spark attributes using the SparkConf object or the spark-defaults.conf file specified when the spark-submit script is called. Set heap size using spark.driver.memory.

-Xloggc:<LOG_DIR>/gc.log -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=10M -Dlog4j.configuration=./__spark_conf__/__hadoop_conf__/log4j-executor.properties -Djava.security.auth.login.config=./__spark_conf__/__hadoop_conf__/jaas-zk.conf -Dzookeeper.server.principal=zookeeper/hadoop.<system domain name> -Djava.security.krb5.conf=./__spark_conf__/__hadoop_conf__/kdc.conf -Djetty.version=x.y.z -Dorg.xerial.snappy.tempdir=${BIGDATA_HOME}/tmp/spark2x_app -Dcarbon.properties.filepath=./__spark_conf__/__hadoop_conf__/carbon.properties -Djdk.tls.ephemeralDHKeySize=2048

spark.driver.extraJavaOptions

Indicates a series of extra JVM options passed to the driver.

-Xloggc:${SPARK_LOG_DIR}/indexserver-omm-%p-gc.log -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:MaxDirectMemorySize=512M -XX:MaxMetaspaceSize=512M -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=10M -XX:OnOutOfMemoryError='kill -9 %p' -Djetty.version=x.y.z -Dorg.xerial.snappy.tempdir=${BIGDATA_HOME}/tmp/spark2x/JDBCServer/snappy_tmp -Djava.io.tmpdir=${BIGDATA_HOME}/tmp/spark2x/JDBCServer/io_tmp -Dcarbon.properties.filepath=${SPARK_CONF_DIR}/carbon.properties -Djdk.tls.ephemeralDHKeySize=2048 -Dspark.ssl.keyStore=${SPARK_CONF_DIR}/child.keystore #{java_stack_prefer}

spark.eventLog.overwrite

Indicates whether to overwrite any existing file.

false

spark.eventLog.dir

Indicates the directory for logging Spark events if spark.eventLog.enabled is set to true. In this directory, Spark creates a subdirectory for each application and logs events of the application in the subdirectory. You can also set this parameter to a unified location, such as an HDFS directory, so that the History Server can read the history files.

hdfs://hacluster/spark2xJobHistory2x

spark.random.port.min

Sets the minimum random port.

22600

spark.authenticate

Indicates whether Spark authenticates its internal connections. If the application is not running on Yarn, see spark.authenticate.secret.

true

spark.random.port.max

Sets the maximum random port.

22899

spark.eventLog.enabled

Indicates whether to log Spark events, which are used to reconstruct the web UI after the application execution is complete.

true

spark.executor.extraJavaOptions

Indicates extra JVM options passed to the executor, for example, GC settings and logging. Do not set Spark attributes or heap size using this option.

-Xloggc:<LOG_DIR>/gc.log -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=10M -Dlog4j.configuration=./log4j-executor.properties -Djava.security.auth.login.config=./jaas-zk.conf -Dzookeeper.server.principal=zookeeper/hadoop.<system domain name> -Djava.security.krb5.conf=./kdc.conf -Dcarbon.properties.filepath=./carbon.properties

-Xloggc:<LOG_DIR>/gc.log -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=10M -Dlog4j.configuration=./__spark_conf__/__hadoop_conf__/log4j-executor.properties -Djava.security.auth.login.config=./__spark_conf__/__hadoop_conf__/jaas-zk.conf -Dzookeeper.server.principal=zookeeper/hadoop.<system domain name> -Djava.security.krb5.conf=./__spark_conf__/__hadoop_conf__/kdc.conf -Dcarbon.properties.filepath=./__spark_conf__/__hadoop_conf__/carbon.properties -Djdk.tls.ephemeralDHKeySize=2048

spark.sql.authorization.enabled

Indicates whether to enable authentication for the Hive client.

true
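One way to act on Table 2 is to check a set of planned overrides against it before applying them. The following Python sketch flags any override whose key appears in a protected list. The list contains only a subset of the Table 2 parameters for brevity, and the names PROTECTED and find_protected_overrides are hypothetical. It is an illustration, not a tool shipped with the cluster.

```python
# A subset of the parameters from Table 2; extend it to cover the full table.
PROTECTED = frozenset({
    "spark.password.factory",
    "spark.ssl.ui.protocol",
    "spark.yarn.archive",
    "spark.shuffle.manager",
    "spark.broadcast.factory",
    "spark.authenticate",
    "spark.eventLog.dir",
    "spark.eventLog.enabled",
    "spark.random.port.min",
    "spark.random.port.max",
    "spark.sql.authorization.enabled",
})


def find_protected_overrides(overrides):
    """Return the sorted keys in overrides that should not be modified."""
    return sorted(key for key in overrides if key in PROTECTED)


# Hypothetical overrides that a user plans to apply.
planned = {
    "spark.executor.memory": "8G",
    "spark.authenticate": "false",
    "spark.shuffle.manager": "HASH",
}

for key in find_protected_overrides(planned):
    print("Not recommended to modify:", key)
```

In this example, spark.executor.memory passes the check because it belongs to Table 1, while the other two keys are reported.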