Spark Core Memory Tuning

Scenario

Spark is an in-memory computing framework. Insufficient memory during computation degrades Spark execution efficiency. To determine whether memory is a performance bottleneck, monitor garbage collection (GC) and evaluate the size of the resilient distributed datasets (RDDs) held in memory, then optimize accordingly.

To monitor GC of node processes, add the -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps parameters to the spark.driver.extraJavaOptions and spark.executor.extraJavaOptions configuration items in the conf/spark-defaults.conf configuration file of the client.
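As a rough illustration, the two configuration items could look like the fragment below. It assumes that neither item already has a value; if one does, append the flags to the existing value rather than replacing it. The flags are the Java 8 style named in the text; on Java 9 and later, GC logging is configured through unified logging (-Xlog:gc*) instead.

```properties
# GC logging for the driver JVM
spark.driver.extraJavaOptions    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
# GC logging for every executor JVM
spark.executor.extraJavaOptions  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
```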

If "Full GC" is frequently reported in the GC log, GC needs to be optimized. To evaluate RDD memory usage, cache the RDD and check its size in the log. If the RDD occupies a large amount of memory, change its storage level.
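The check can also be done programmatically. The following is a minimal sketch, not the only way to do it: the application name, the local[2] master, and the sample data are placeholders, and getRDDStorageInfo is a Spark developer API whose figures may lag slightly behind the action that filled the cache. The Storage page of the Spark web UI reports the same sizes.

```scala
import org.apache.spark.sql.SparkSession

object CachedRddSize {
  def main(args: Array[String]): Unit = {
    // App name and local[2] master are placeholders for illustration only.
    val spark = SparkSession.builder()
      .appName("cached-rdd-size")
      .master("local[2]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical sample data standing in for the application's RDD.
    val testRDD = sc.parallelize(1 to 1000000).map(i => (i, s"value-$i"))
    testRDD.setName("testRDD")

    testRDD.cache() // equivalent to persist(StorageLevel.MEMORY_ONLY)
    testRDD.count() // caching is lazy; an action materializes the blocks

    // List the RDDs that Spark currently reports as cached, with their sizes.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} partitions, " +
        s"memory=${info.memSize} bytes, disk=${info.diskSize} bytes")
    }

    spark.stop()
  }
}
```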

Procedure

To optimize GC, adjust the size ratio of the old generation to the young generation. In the conf/spark-defaults.conf configuration file of the client, add the -XX:NewRatio parameter to the spark.driver.extraJavaOptions and spark.executor.extraJavaOptions configuration items. For example, if you add -XX:NewRatio=2, the old generation is twice the size of the young generation: the young generation accounts for 1/3 of the heap space, and the old generation accounts for 2/3.
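The arithmetic behind -XX:NewRatio can be sketched as follows. NewRatio is the ratio of old generation to young generation, so the young generation gets heap / (NewRatio + 1). The 6 GiB heap is a hypothetical value, and the helper name split is made up for this example. The JVM aligns generation sizes internally, so real values can differ slightly.

```scala
object NewRatioSplit {
  // Returns (youngBytes, oldBytes) for a heap of heapBytes with -XX:NewRatio=newRatio.
  // NewRatio is old:young, so young = heap / (newRatio + 1).
  def split(heapBytes: Long, newRatio: Int): (Long, Long) = {
    require(newRatio > 0, "newRatio must be positive")
    val young = heapBytes / (newRatio + 1)
    (young, heapBytes - young)
  }

  def main(args: Array[String]): Unit = {
    val heap = 6L * 1024 * 1024 * 1024 // hypothetical 6 GiB heap
    for (ratio <- Seq(1, 2, 3)) {
      val (young, old) = split(heap, ratio)
      println(s"NewRatio=$ratio: young=${young >> 20} MiB, old=${old >> 20} MiB")
    }
  }
}
```

With these inputs, NewRatio=2 gives a 2048 MiB young generation and a 4096 MiB old generation, which matches the 1/3 and 2/3 split.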
Optimize the RDD data structure when developing Spark applications.
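- Use arrays of primitive types or fastutil collections instead of standard Java or Scala collection classes.
- Avoid nested structures that contain many small objects and pointers.
- Use numeric IDs or enumeration objects instead of String for keys.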
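The effect of these guidelines can be estimated with Spark's SizeEstimator utility (a developer API in spark-core). The sketch below compares a pointer-heavy record with a compact one. The record types, key values, and element count are invented for illustration, and the estimated sizes depend on the JVM and its settings.

```scala
import org.apache.spark.util.SizeEstimator

object CompactRecords {
  // Pointer-heavy layout: String key and boxed values in a linked list.
  case class LooseRecord(key: String, values: List[java.lang.Integer])

  // Compact layout: numeric key and a primitive array.
  case class CompactRecord(key: Int, values: Array[Int])

  def main(args: Array[String]): Unit = {
    val n = 1000 // hypothetical number of values per record
    val loose = LooseRecord("user-000042", List.tabulate(n)(i => Integer.valueOf(i + 1000)))
    val compact = CompactRecord(42, Array.tabulate(n)(_ + 1000))

    // estimate() walks the object graph and returns an approximate size in bytes.
    println(s"loose:   ${SizeEstimator.estimate(loose)} bytes")
    println(s"compact: ${SizeEstimator.estimate(compact)} bytes")
  }
}
```

The compact record should come out considerably smaller, because each boxed Integer and each list cell carries its own object header and reference.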
Serialize RDDs when developing Spark applications.
By default, cached RDDs are stored as deserialized Java objects. Setting a serialized storage level reduces memory usage, at the cost of extra CPU time to deserialize the data on access. The following is an example.
```
import org.apache.spark.storage.StorageLevel
testRDD.persist(StorageLevel.MEMORY_ONLY_SER)
```