
Configuring Parameters Rapidly

Updated on 2022-08-12 GMT+08:00

Overview

This section describes how to quickly configure common parameters and lists the parameters that you are advised not to modify when Spark2x is used.

Common Parameters to Be Configured

Some parameters have been adapted during cluster installation. However, the following parameters need to be adjusted based on application scenarios. Unless otherwise specified, the following parameters are configured in the spark-defaults.conf file on the Spark2x client.
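For reference, the Scala sketch below shows the per-application equivalent of editing spark-defaults.conf. It is only an illustration: the values shown are examples rather than tuning recommendations, and settings applied this way affect only that application.

import org.apache.spark.sql.SparkSession

// Minimal sketch (illustrative values): setting some Table 1 parameters per
// application instead of in the client's spark-defaults.conf file.
object ParameterExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parameter-example")
      .config("spark.sql.parquet.compression.codec", "snappy")
      .config("spark.executor.memory", "4g")
      .config("spark.executor.cores", "1")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    // ... application logic ...

    spark.stop()
  }
}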

Table 1 Common parameters to be configured

spark.sql.parquet.compression.codec
Description: Sets the compression format used for non-partitioned Parquet tables. Set this parameter in the spark-defaults.conf configuration file on the JDBCServer server.
Default value: snappy

spark.dynamicAllocation.enabled
Description: Indicates whether to use dynamic resource scheduling, which adjusts the number of executors registered with the application based on the workload (see the sketch after this table). Currently, this parameter is valid only in Yarn mode. The default value for JDBCServer is true, and that for the client is false.
Default value: false

spark.executor.memory
Description: Indicates the memory size used by each executor process. The value is a string in the same format as JVM memory settings (for example, 512m or 2g).
Default value: 4G

spark.sql.autoBroadcastJoinThreshold
Description: Indicates the maximum size of a table that is broadcast when two tables are joined.
  • When the size of a table involved in an SQL statement is less than the value of this parameter, the system broadcasts that table.
  • If the value is set to -1, broadcast is not performed.
Default value: 10485760

spark.yarn.queue
Description: Specifies the Yarn queue where JDBCServer resides. Set the queue in the spark-defaults.conf configuration file on the JDBCServer server.
Default value: default

spark.driver.memory
Description: Indicates the memory used by the driver process, that is, the process in which SparkContext is initialized. The value is a string in the same format as JVM memory settings (for example, 512m or 2g). In a large cluster, you are advised to configure 32 GB to 64 GB.
Default value: 4G

spark.yarn.security.credentials.hbase.enabled
Description: Indicates whether to enable the function of obtaining HBase tokens. If the Spark on HBase function is required and a security cluster is configured, set this parameter to true. Otherwise, set this parameter to false.
Default value: false

spark.serializer
Description: Used to serialize the objects that are sent over the network or need to be cached. The default Java serialization works with any Serializable Java object but is slow, so you are advised to use org.apache.spark.serializer.KryoSerializer and configure Kryo serialization instead. The value can be any subclass of org.apache.spark.serializer.Serializer.
Default value: org.apache.spark.serializer.JavaSerializer

spark.executor.cores
Description: Indicates the number of cores used by each executor. Set this parameter in standalone mode and Mesos coarse-grained mode. When there are sufficient cores, the application is allowed to run multiple executors on the same worker. Otherwise, each application can run only one executor on each worker.
Default value: 1

spark.shuffle.service.enabled
Description: Enables the shuffle service, a long-running auxiliary service in NodeManager that improves shuffle computing performance.
Default value: false

spark.sql.adaptive.enabled
Description: Indicates whether to enable the adaptive execution framework.
Default value: false

spark.executor.memoryOverhead
Description: Indicates the amount of off-heap memory to be allocated to each executor, in MB. This memory accounts for VM overheads such as interned strings and other native overheads, and it grows with the executor size (usually 6% to 10%).
Default value: 1 GB

spark.streaming.kafka.direct.lifo
Description: Indicates whether to enable the LIFO function of Kafka.
Default value: false
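As noted for spark.dynamicAllocation.enabled in Table 1, dynamic resource scheduling in open-source Spark on Yarn also relies on the shuffle service (spark.shuffle.service.enabled). The sketch below combines the two. Note that spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors are standard open-source Spark parameters not listed in Table 1, and the bounds shown are arbitrary examples.

import org.apache.spark.SparkConf

// Sketch: dynamic resource scheduling together with the shuffle service it
// relies on. The min/max executor bounds are examples only.
// (Paste into spark-shell, or wrap in an object for a compiled application.)
val dynConf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "20")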

Parameters Not Recommended to Be Modified

The following parameters have been adapted during cluster installation. You are not advised to modify them.

Table 2 Parameters not recommended to be modified

spark.password.factory
Description: Selects the password parsing mode.
Default value or example: org.apache.spark.om.util.FIPasswordFactory

spark.ssl.ui.protocol
Description: Sets the SSL protocol of the UI.
Default value or example: TLSv1.2

spark.yarn.archive
Description: Specifies an archive of Spark JAR files, which is distributed to the Yarn cache. If this parameter is set, its value replaces spark.yarn.jars and the archive is used in the containers of all applications. The archive should contain the JAR files in its root directory. The archive can also be hosted on HDFS to speed up file distribution.
Default value or example: hdfs://hacluster/user/spark2x/jars/8.1.0.1/spark-archive-2x.zip
NOTE: The version 8.1.0.1 is used as an example. Replace it with the actual version number.

spark.yarn.am.extraJavaOptions
Description: Indicates a string of extra JVM options to pass to the YARN ApplicationMaster in client mode. Use spark.driver.extraJavaOptions in cluster mode.
Default value or example: -Dlog4j.configuration=./__spark_conf__/__hadoop_conf__/log4j-executor.properties -Djava.security.auth.login.config=./__spark_conf__/__hadoop_conf__/jaas-zk.conf -Dzookeeper.server.principal=zookeeper/hadoop.<system domain name> -Djava.security.krb5.conf=./__spark_conf__/__hadoop_conf__/kdc.conf -Djdk.tls.ephemeralDHKeySize=2048

spark.shuffle.servicev2.port
Description: Indicates the port on which the shuffle service listens for data fetch requests.
Default value or example: 27338

spark.ssl.historyServer.enabled
Description: Sets whether the history server uses SSL.
Default value or example: true

spark.files.overwrite
Description: Indicates whether to overwrite a file added through SparkContext.addFile() when the target file exists and its content does not match that of the source file.
Default value or example: false

spark.yarn.cluster.driver.extraClassPath
Description: Indicates the extraClassPath of the driver in Yarn-cluster mode. Set this parameter to the path and parameters of the server.
Default value or example: ${BIGDATA_HOME}/common/runtime/security

spark.driver.extraClassPath
Description: Indicates the extra class path entries attached to the class path of the driver.
Default value or example: ${BIGDATA_HOME}/common/runtime/security

spark.yarn.dist.innerfiles
Description: Sets the files that need to be uploaded to HDFS by Spark in Yarn mode.
Default value or example: /Spark_path/spark/conf/s3p.file,/Spark_path/spark/conf/locals3.jceks
Spark_path is the installation path of the Spark client.

spark.sql.bigdata.register.dialect
Description: Registers the SQL parser.
Default value or example: org.apache.spark.sql.hbase.HBaseSQLParser

spark.shuffle.manager
Description: Indicates the data processing mode. There are two implementations: sort and hash. Sort-based shuffle uses memory more efficiently and has been the default option since Spark 1.2.
Default value or example: SORT

spark.deploy.zookeeper.url
Description: Indicates the ZooKeeper address. Multiple addresses are separated by commas (,).
Default value or example: host1:2181,host2:2181,host3:2181

spark.broadcast.factory
Description: Indicates the broadcast mode.
Default value or example: org.apache.spark.broadcast.TorrentBroadcastFactory

spark.sql.session.state.builder
Description: Specifies the session state constructor.
Default value or example: org.apache.spark.sql.hive.FIHiveACLSessionStateBuilder

spark.executor.extraLibraryPath
Description: Sets the special library path used when the executor JVM is started.
Default value or example: ${BIGDATA_HOME}/FusionInsight_HD_8.1.0.1/install/FusionInsight-Hadoop-3.1.1/hadoop/lib/native

spark.ui.customErrorPage
Description: Indicates whether to display a custom error page when an error occurs on the web UI.
Default value or example: true

spark.httpdProxy.enable
Description: Indicates whether to use the httpd proxy.
Default value or example: true

spark.ssl.ui.enabledAlgorithms
Description: Sets the SSL algorithms of the UI.
Default value or example: TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_DHE_RSA_WITH_AES_256_GCM_SHA384,TLS_DHE_DSS_WITH_AES_256_GCM_SHA384,TLS_DHE_RSA_WITH_AES_128_GCM_SHA256,TLS_DHE_DSS_WITH_AES_128_GCM_SHA256

spark.ui.logout.enabled
Description: Sets the logout button for the web UI of the Spark component.
Default value or example: true

spark.security.hideInfo.enabled
Description: Indicates whether to hide sensitive information on the UI.
Default value or example: true

spark.yarn.cluster.driver.extraLibraryPath
Description: Indicates the extraLibraryPath of the driver in Yarn-cluster mode. Set this parameter to the path and parameters of the server.
Default value or example: ${BIGDATA_HOME}/FusionInsight_HD_8.1.0.1/install/FusionInsight-Hadoop-3.1.1/hadoop/lib/native

spark.driver.extraLibraryPath
Description: Sets a special library path used when the driver JVM is started.
Default value or example: ${DATA_NODE_INSTALL_HOME}/hadoop/lib/native

spark.ui.killEnabled
Description: Allows stages and jobs to be stopped from the web UI.
Default value or example: true

spark.yarn.access.hadoopFileSystems
Description: Specifies the NameService instances that Spark can access. If there are multiple NameService instances, set this parameter to all of them, separated by commas (,).
Default value or example: hdfs://hacluster,hdfs://hacluster

spark.yarn.cluster.driver.extraJavaOptions
Description: Indicates extra JVM options passed to the driver in Yarn-cluster mode, for example, GC settings and logging. Do not set Spark attributes or heap size using this option. Instead, set Spark attributes using the SparkConf object or the spark-defaults.conf file specified when the spark-submit script is called, and set heap size using spark.driver.memory.
Default value or example: -Xloggc:<LOG_DIR>/gc.log -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=10M -Dlog4j.configuration=./__spark_conf__/__hadoop_conf__/log4j-executor.properties -Djava.security.auth.login.config=./__spark_conf__/__hadoop_conf__/jaas-zk.conf -Dzookeeper.server.principal=zookeeper/hadoop.<system domain name> -Djava.security.krb5.conf=./__spark_conf__/__hadoop_conf__/kdc.conf -Djetty.version=x.y.z -Dorg.xerial.snappy.tempdir=${BIGDATA_HOME}/tmp/spark2x_app -Dcarbon.properties.filepath=./__spark_conf__/__hadoop_conf__/carbon.properties -Djdk.tls.ephemeralDHKeySize=2048

spark.driver.extraJavaOptions
Description: Indicates a series of extra JVM options passed to the driver, for example, GC settings and logging.
Default value or example: -Xloggc:${SPARK_LOG_DIR}/indexserver-omm-%p-gc.log -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:MaxDirectMemorySize=512M -XX:MaxMetaspaceSize=512M -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=10M -XX:OnOutOfMemoryError='kill -9 %p' -Djetty.version=x.y.z -Dorg.xerial.snappy.tempdir=${BIGDATA_HOME}/tmp/spark2x/JDBCServer/snappy_tmp -Djava.io.tmpdir=${BIGDATA_HOME}/tmp/spark2x/JDBCServer/io_tmp -Dcarbon.properties.filepath=${SPARK_CONF_DIR}/carbon.properties -Djdk.tls.ephemeralDHKeySize=2048 -Dspark.ssl.keyStore=${SPARK_CONF_DIR}/child.keystore #{java_stack_prefer}

spark.eventLog.overwrite
Description: Indicates whether to overwrite any existing event log files.
Default value or example: false

spark.eventLog.dir
Description: Indicates the directory for logging Spark events when spark.eventLog.enabled is set to true. In this directory, Spark creates a subdirectory for each application and logs the application's events there. You can also set the value to a unified address, such as an HDFS directory, so that the History Server can read the history files.
Default value or example: hdfs://hacluster/spark2xJobHistory2x

spark.random.port.min
Description: Sets the minimum random port.
Default value or example: 22600

spark.authenticate
Description: Indicates whether Spark authenticates its internal connections. If the application is not running on Yarn, see spark.authenticate.secret.
Default value or example: true

spark.random.port.max
Description: Sets the maximum random port.
Default value or example: 22899

spark.eventLog.enabled
Description: Indicates whether to log Spark events, which are used to reconstruct the web UI after the application execution is complete.
Default value or example: true

spark.executor.extraJavaOptions
Description: Indicates extra JVM options passed to the executor, for example, GC settings and logging. Do not set Spark attributes or heap size using this option.
Default value or example:
-Xloggc:<LOG_DIR>/gc.log -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=10M -Dlog4j.configuration=./log4j-executor.properties -Djava.security.auth.login.config=./jaas-zk.conf -Dzookeeper.server.principal=zookeeper/hadoop.<system domain name> -Djava.security.krb5.conf=./kdc.conf -Dcarbon.properties.filepath=./carbon.properties
-Xloggc:<LOG_DIR>/gc.log -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=20 -XX:GCLogFileSize=10M -Dlog4j.configuration=./__spark_conf__/__hadoop_conf__/log4j-executor.properties -Djava.security.auth.login.config=./__spark_conf__/__hadoop_conf__/jaas-zk.conf -Dzookeeper.server.principal=zookeeper/hadoop.<system domain name> -Djava.security.krb5.conf=./__spark_conf__/__hadoop_conf__/kdc.conf -Dcarbon.properties.filepath=./__spark_conf__/__hadoop_conf__/carbon.properties -Djdk.tls.ephemeralDHKeySize=2048

spark.sql.authorization.enabled
Description: Indicates whether to enable authorization for the Hive client.
Default value or example: true
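Because the parameters in Table 2 are preset during cluster installation, it is usually enough to verify their effective values rather than change them. The following is a minimal sketch for spark-shell, assuming an active SparkSession named spark (which spark-shell provides):

// Read one effective value; spark.conf.get throws NoSuchElementException
// if the key has no value and no default.
println("spark.eventLog.dir = " + spark.conf.get("spark.eventLog.dir"))

// List every explicitly set configuration entry, sorted by key.
spark.sparkContext.getConf.getAll.sortBy(_._1).foreach {
  case (key, value) => println(s"$key=$value")
}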

