Updated on 2022-06-01 GMT+08:00

Data Serialization

Scenario

Spark supports the following types of serialization:

  • JavaSerializer
  • KryoSerializer

Data serialization greatly affects the Spark application performance. In specific data format, KryoSerializer offers 10 times higher performance than JavaSerializer. For Int data, performance optimization can be ignored.

KryoSerializer depends on Chill of Twitter. Not all Java Serializable objects support KryoSerializer. Therefore, a class must be manually registered.

Serialization involves task serialization and data serialization. Only JavaSerializer can be used for Spark task serialization. JavaSerializer and KryoSerializer can be used for data serialization.

Procedure

When the Spark application is running, a large volume of data needs to be serialized during the shuffle and RDD cache procedures. By default, JavaSerializer is used. You can also configure KryoSerializer as the data serializer to improve serialization performance.

When developing an application, add the following code to enable KryoSerializer as data serializer:

  • Implement the class register and manually register classes.
    package com.etl.common;
    
    import com.esotericsoftware.kryo.Kryo;
    import org.apache.spark.serializer.KryoRegistrator; 
    
    public class DemoRegistrator implements KryoRegistrator
    {
        @Override
        public void registerClasses(Kryo kryo)
        {
            // The following is an example class. Please register a customized class.
            kryo.register(AggrateKey.class);
            kryo.register(AggrateValue.class);
        }
    }

    You can configure spark.kryo.registrationRequired on a Spark client to determine whether registration with KryoSerializer is required.

    If the parameter is set to true, an exception is thrown if a project has classes that are not serialized. If the parameter is set to false (default value), KryoSerializer automatically writes unregistered classes to the corresponding objects. This operation affects system performance. If the parameter is set to true, you must manually register classes. The system does not write classes that are not serialized but throws exceptions. System performance is not affected.

  • Configure KryoSerializer as the data serializer and class register.
    val conf = new SparkConf()
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "com.etl.common.DemoRegistrator")