Optimizing Spark Performance

Overview

Spark is an in-memory distributed computing framework. In iterative computation scenarios, Spark can run 10 to 100 times faster than MapReduce because intermediate data is kept in memory during processing. Spark can use HDFS as the underlying storage system, enabling users to switch quickly from MapReduce to Spark. Spark provides one-stop data analysis capabilities, including micro-batch stream processing, offline batch processing, SQL queries, and data mining. Users can seamlessly combine these functions in the same application.

Features of Spark are as follows:

Improves data processing capability through distributed in-memory computing and a directed acyclic graph (DAG) execution engine. The delivered performance can be 10 to 100 times higher than that of MapReduce.
Supports multiple development languages (Scala/Java/Python) and dozens of highly abstract operators to facilitate the construction of distributed data processing applications.
Builds data processing stacks using SQL, Streaming, MLlib, and GraphX to provide one-stop data processing capabilities.
Fits into the Hadoop ecosystem. Spark applications can run on Standalone, Mesos, or YARN and access multiple data sources such as HDFS, HBase, and Hive, which supports smooth migration of MapReduce applications to Spark.

Cluster service deployment planning

Service scale and capacity parameter configuration

Spark, as a memory-based computing engine, requires large amounts of memory and CPU resources. Plan memory and CPU resources based on the current service capacity and its growth rate. Consider the following aspects:
If the program runs in yarn-client mode, pay attention to the volume of data returned to the driver and set the driver memory accordingly.
Plan CPU and memory resources based on service objectives. Set executor-memory, executor-cores, and num-executors according to the data distribution and service complexity, and make sure the cluster provides enough CPU cores and memory to satisfy them.
Reserve some memory space (usually 20% of the total memory) as the buffer cache of the operating system.
Consider data expansion after data blocks read from HDFS are decompressed.
Plan some disks for storing cached data, logs, and shuffle data.
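
The planning points above can be turned into a rough sizing calculation. The following Python sketch is illustrative only: the function name plan_executors, the ExecutorPlan type, and the 10% per-executor overhead allowance are assumptions made for this example, not values from this guide. Only the 20% operating system reserve comes from the list above. It estimates how many executors fit on each node and how much memory each can request.

```python
from dataclasses import dataclass


@dataclass
class ExecutorPlan:
    executors_per_node: int   # executors that fit on one worker node
    executor_memory_gb: int   # candidate value for executor-memory
    num_executors: int        # candidate value for num-executors


def plan_executors(nodes, cores_per_node, memory_gb_per_node, executor_cores,
                   os_reserve_fraction=0.2, overhead_fraction=0.10):
    """Rough executor sizing for a homogeneous cluster (hypothetical helper).

    os_reserve_fraction: share of node memory left to the OS buffer cache
                         (20%, as suggested in the planning list).
    overhead_fraction:   assumed off-heap allowance per executor container;
                         adjust it to match your resource manager settings.
    """
    executors_per_node = cores_per_node // executor_cores
    if executors_per_node < 1:
        raise ValueError("executor_cores exceeds the cores available on a node")

    # Memory left for Spark after reserving the OS buffer cache.
    usable_gb = memory_gb_per_node * (1 - os_reserve_fraction)
    # Each container holds the executor heap plus the overhead allowance.
    container_gb = usable_gb / executors_per_node
    executor_memory_gb = int(container_gb / (1 + overhead_fraction))
    if executor_memory_gb < 1:
        raise ValueError("not enough memory per executor; use fewer executors")

    return ExecutorPlan(executors_per_node, executor_memory_gb,
                        executors_per_node * nodes)


plan = plan_executors(nodes=10, cores_per_node=32,
                      memory_gb_per_node=128, executor_cores=4)
print(plan)
```

Treat the result as a starting point. The calculation ignores the driver, data expansion after decompression, and disk space for shuffle data, all of which still need to be planned separately.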

Optimization principles

Increase CPU usage while reducing extra performance overhead.

Increase memory usage.

Optimize the service logic to reduce the calculation workload and I/O operations.

Typical service tuning

Before tuning Spark parameters, you must optimize code logic through planning and design.
Spark jobs run slowly and CPU utilization is low because executor threads cannot be kept fully busy. Reduce the number of cores per executor, add more executors, and increase the number of partitions.
Memory overflow may occur if a single task processes too much data because the partitions are too large, or if the memory available to each task is insufficient because too many tasks run concurrently in the executor. Reduce the concurrency of each executor and split the data into more partitions.
The data volume is small but is spread across a large number of small files. Reduce the number of partitions: run the reduce operator and then the coalesce operator to lower the number of tasks and the CPU load.
Spark SQL queries a large table that has many columns but reads only a small number of them. Use a columnar format such as RCFile or Parquet to reduce file read costs. Select a proper compression format to reduce memory load.
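
As an illustration of the first case above (slow jobs with low CPU utilization), the following Python sketch keeps the total core count unchanged, spreads it over more and smaller executors, and raises the partition count. The helper name rebalance_executors and the default of three tasks per core are assumptions for this example. The spark-submit options it emits (--num-executors, --executor-cores, spark.default.parallelism, spark.sql.shuffle.partitions) are standard Spark settings, but the values should be validated against your own workload.

```python
def rebalance_executors(num_executors, executor_cores, new_executor_cores,
                        tasks_per_core=3):
    """Return spark-submit options for more, smaller executors (hypothetical helper).

    The total number of cores stays the same; only the layout changes.
    tasks_per_core is an assumed multiplier used to derive the partition count.
    """
    if not 1 <= new_executor_cores <= executor_cores:
        raise ValueError("new_executor_cores must be between 1 and executor_cores")

    total_cores = num_executors * executor_cores
    new_num_executors = total_cores // new_executor_cores
    # Several tasks per core help keep executor threads busy.
    partitions = new_num_executors * new_executor_cores * tasks_per_core

    return [
        "--num-executors", str(new_num_executors),
        "--executor-cores", str(new_executor_cores),
        "--conf", f"spark.default.parallelism={partitions}",
        "--conf", f"spark.sql.shuffle.partitions={partitions}",
    ]


# Example: 10 executors with 8 cores each become executors with 4 cores each.
print(" ".join(rebalance_executors(10, 8, new_executor_cores=4)))
```

Splitting executors this way also divides the memory available on each node among more processes, so executor-memory usually needs to be reduced in proportion.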

Metrics monitoring

Performance metrics include throughput, resource usage, and scalability.

Throughput: Run the same computing tasks in the same resource environment and check how quickly they complete.
Resource usage: Run computing tasks and view the CPU, memory, and network usage under different loads.
Scalability:

− Performance improvement curve upon horizontal scaling: Run the same computing tasks before and after resource scaling and then compare system performance.

− Performance decrease curve upon increase of system load: Increase system load under the same resource environment and compare system performance before and after the load increase.
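
The comparisons described above can be reduced to a few ratios. The following Python sketch shows one possible way to compute them from measured run times. The function names and the sample numbers are made up for illustration and are not measurements from any real cluster.

```python
def throughput(records, seconds):
    """Records processed per second for one run."""
    return records / seconds


def speedup(seconds_before, seconds_after):
    """How many times faster the same task runs after scaling out."""
    return seconds_before / seconds_after


def scaling_efficiency(seconds_before, seconds_after,
                       resources_before, resources_after):
    """Speedup divided by the resource increase; 1.0 means linear scaling."""
    return speedup(seconds_before, seconds_after) / (resources_after / resources_before)


def slowdown(seconds_baseline, seconds_under_load):
    """How many times slower the same task runs when system load increases."""
    return seconds_under_load / seconds_baseline


# Hypothetical measurements: the same job on 10 nodes and then on 40 nodes.
print(throughput(1_000_000, 200))
print(speedup(600, 200))
print(scaling_efficiency(600, 200, 10, 40))
print(slowdown(200, 300))
```

Plotting the speedup against the resource count, or the slowdown against the load level, gives the improvement and decrease curves mentioned above.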
