
Differences in General-Purpose Queues Between Spark 2.4.x and Spark 3.3.x

Updated on 2025-02-27 GMT+08:00

DLI has summarized the differences in general-purpose queues between Spark 2.4.x and Spark 3.3.x to help you understand how upgrading the Spark engine version affects jobs running in general-purpose queues on the new engine.

Log4j Dependency Updated from 1.x to 2.x

  • Explanation:

    The Log4j dependency is updated from version 1.x to 2.x.

    • Spark 2.4.x: Uses Log4j 1.x (no longer supported by the community).
    • Spark 3.3.x: Uses Log4j 2.x.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact.
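
  • Sample code:

    User JARs that call the Log4j 1.x API directly should be migrated to the Log4j 2.x API. The following is a minimal sketch (the package and class names follow the standard Log4j layouts and are not DLI-specific):

    // Spark 2.4.x style (Log4j 1.x):
    //   import org.apache.log4j.Logger
    //   private val logger = Logger.getLogger(getClass)
    // Spark 3.3.x style (Log4j 2.x):
    import org.apache.logging.log4j.LogManager

    object LoggingExample {
      private val logger = LogManager.getLogger(getClass)

      def main(args: Array[String]): Unit = {
        logger.info("job started with the Log4j 2.x API")
      }
    }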

Spark 3.3.x Does Not Support v1 Tables

  • Explanation:

    Spark 2.4.x supports datasource v1 and v2 tables. Spark 3.3.x does not support datasource v1 tables.

    For details, see DLI Datasource V1 Table and Datasource V2 Table.

    • Spark 2.4.x supports datasource v1 and v2 tables.
    • Spark 3.3.x does not support datasource v1 tables.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact. You are advised to migrate to v2 tables in Spark 2.4.5 before upgrading to Spark 3.3.1. For details, refer to the examples in DLI Datasource V1 Table and Datasource V2 Table.
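
  • Sample code:

    A minimal sketch of migrating data into a datasource v2 table before the upgrade (the table and column names are illustrative only; see DLI Datasource V1 Table and Datasource V2 Table for the exact syntax that applies to your tables):

    spark.sql("CREATE TABLE demo_table_v2 (id INT, name STRING) USING parquet");
    spark.sql("INSERT INTO demo_table_v2 SELECT id, name FROM demo_table_v1");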

Empty Input Splits Do Not Create Partitions by Default

  • Explanation:
    • Spark 2.4.x: Empty input splits create partitions by default.
    • Spark 3.3.x: Empty input splits do not create partitions by default.

      In Spark 3.3.x, spark.hadoopRDD.ignoreEmptySplits defaults to true.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; check whether partition names are used in service logic.
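
  • Sample code:

    If a job relies on partitions being created for empty input splits, the Spark 2.4.x behavior can be restored by overriding the default. Configure the following information in the job (a minimal sketch):

    spark.hadoopRDD.ignoreEmptySplits=false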

Event Log Compression Format Set to zstd

  • Explanation:

    In Spark 3.3.x, the default value for spark.eventLog.compression.codec is set to zstd. Spark will no longer use the value of spark.io.compression.codec for compressing event logs.

    • Spark 2.4.x: Uses the value of spark.io.compression.codec for event log compression format.
    • Spark 3.3.x: spark.eventLog.compression.codec is set to zstd by default.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; the event log compression format changes.
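
  • Sample code:

    If downstream tooling that parses event logs cannot handle zstd, the codec can be pinned explicitly instead of relying on the new default. Configure the following information in the job (a minimal sketch; lz4 is only an example of a standard Spark codec name):

    spark.eventLog.compress=true
    spark.eventLog.compression.codec=lz4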

Change in spark.launcher.childConectionTimeout Configuration

  • Explanation:
    • Spark 2.4.x: The configuration name is spark.launcher.childConectionTimeout.
    • Spark 3.3.x: The configuration name is changed to spark.launcher.childConnectionTimeout.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; configuration parameter names change.
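
  • Sample code:

    Job configurations that set this timeout need to use the corrected parameter name after the upgrade (the value shown is illustrative only):

    Spark 2.4.x:

    spark.launcher.childConectionTimeout=10000

    Spark 3.3.x:

    spark.launcher.childConnectionTimeout=10000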

Spark 3.3.x No Longer Supports Apache Mesos as a Resource Manager

  • Explanation:
    • Spark 2.4.x: Uses Apache Mesos as a resource manager.
    • Spark 3.3.x: No longer supports using Apache Mesos as a resource manager.
  • Is there any impact on jobs after the engine version upgrade?

    Functional change. If you were using Apache Mesos as a resource manager in Spark 2.4.x, you need to switch to another supported resource manager after upgrading to Spark 3.3.x.

Spark 3.3.x Deletes the Kubernetes Driver When the Application Terminates on Its Own

  • Explanation: Spark 3.3.x deletes the Kubernetes driver when the application terminates on its own.
  • Is there any impact on jobs after the engine version upgrade?

    Functional enhancement. This affects jobs that use Kubernetes as the resource manager: after the upgrade, Spark 3.3.x automatically deletes the driver pod when the application terminates, which may affect resource management and cleanup processes.

Spark 3.3.x Supports Custom Kubernetes Schedulers

  • Explanation:
    • Spark 2.4.x: Does not support using a specified Kubernetes scheduler to manage resource allocation and scheduling for Spark jobs.
    • Spark 3.3.x: Supports custom Kubernetes schedulers.
  • Is there any impact on jobs after the engine version upgrade?

    Functional enhancement; supports custom schedulers for resource allocation and scheduling management.
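
  • Sample code:

    In Spark 3.3.x, a custom scheduler can be selected through the Kubernetes scheduler configuration item (a minimal sketch; volcano is only an example and assumes such a scheduler is deployed in the cluster):

    spark.kubernetes.scheduler.name=volcano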

Spark Converts Non-Nullable Schemas to Nullable

  • Explanation:

    In Spark 2.4.x, when the user-specified schema contains non-nullable fields, Spark converts these non-nullable schemas to nullable.

    In Spark 3.3.x, Spark respects the nullability specified in the user schema, that is, if a field is defined as non-nullable, Spark retains this requirement and does not automatically convert it to a nullable field.

    • Spark 2.4.x: In Spark 2.4.x, when the user-specified schema contains non-nullable fields, Spark converts these non-nullable schemas to nullable.
    • Spark 3.3.x: Does not automatically convert non-nullable fields to nullable.

      To revert to the behavior of Spark 2.4.x in Spark 3.3.x, set spark.sql.legacy.respectNullabilityInTextDatasetConversion to true.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact.

  • Sample code:

    Run the following code (the types and implicit conversions used below require these imports):

    import org.apache.spark.sql.types._
    import spark.implicits._

    spark.read.schema(StructType(
      StructField("f1", LongType, nullable = false) ::
      StructField("f2", LongType, nullable = false) :: Nil)
    ).option("mode", "DROPMALFORMED").json(Seq("""{"f1": 1}""").toDS).show(false);
    • Spark 2.4.5
      +---+---+
      |f1 |f2 |
      +---+---+
      |1  |0  |
      +---+---+
    • Spark 3.3.1
      +---+----+
      |f1 |f2  |
      +---+----+
      |1  |null|
      +---+----+
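
    To obtain the Spark 2.4.x behavior in Spark 3.3.1, set the legacy switch named in the explanation before reading, for example:

    spark.conf.set("spark.sql.legacy.respectNullabilityInTextDatasetConversion", "true")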

Change in Spark Scala Version

  • Explanation:

    The Spark Scala version changes.

    • Spark 2.4.x: Uses Scala 2.11.
    • Spark 3.3.x: Upgrades to Scala 2.12.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; JAR packages need to be recompiled with the updated Scala version.
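
  • Sample code:

    A minimal build.sbt sketch of the rebuild (the artifact versions are assumptions; use the Spark and Scala versions that match your engine):

    // Spark 2.4.x build settings:
    //   scalaVersion := "2.11.12"
    //   libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.5" % "provided"
    // Spark 3.3.x build settings:
    scalaVersion := "2.12.15"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.1" % "provided"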

Change in Supported Python Versions for PySpark

  • Explanation:

    The supported Python versions for PySpark change.

    • Spark 2.4.x: PySpark supports Python 2.6+ to 3.7+.
    • Spark 3.3.x: PySpark supports Python 3.6 or later.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact due to dependency version changes. Check whether your jobs are affected.

Change in Supported Pandas Versions for PySpark

  • Explanation:
    • Spark 2.4.x: PySpark does not specify a Pandas version.
    • Spark 3.3.x: From Spark 3.3.x, PySpark requires Pandas 0.23.2 or later to use Pandas-related functions like toPandas and createDataFrame from Pandas DataFrame.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact due to dependency version changes. Check whether your jobs are affected.

Change in Supported PyArrow Versions for PySpark

  • Explanation:
    • Spark 2.4.x: PySpark does not specify a PyArrow version.
    • Spark 3.3.x: From Spark 3.3.x, PySpark requires PyArrow 0.12.1 or later to use PyArrow-related functions like pandas_udf and toPandas.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact due to dependency version changes. Check whether your jobs are affected.

DataFrameWriter Triggered Queries Named as Command

Starting with Spark 3.2.x, queries triggered by DataFrameWriter are always named command when they are sent to QueryExecutionListener. In Spark 3.1 and earlier, these queries may be named save, insertInto, or saveAsTable, depending on the specific operation.

  • Explanation:

    When a query execution is triggered by DataFrameWriter, it is always named as command when sent to QueryExecutionListener.

    • Spark 2.4.x: Named as save, insertInto, or saveAsTable.
    • Spark 3.3.x: Named as command.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact.
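
  • Sample code:

    A minimal sketch of a QueryExecutionListener that observes the query name, which is where the change is visible (the listener itself is illustrative and not part of DLI):

    import org.apache.spark.sql.execution.QueryExecution
    import org.apache.spark.sql.util.QueryExecutionListener

    spark.listenerManager.register(new QueryExecutionListener {
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
        // Spark 2.4.x: funcName may be "save", "insertInto", or "saveAsTable".
        // Spark 3.3.x: funcName is always "command" for DataFrameWriter-triggered queries.
        println(s"query finished: $funcName")
      }
      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = {
        println(s"query failed: $funcName")
      }
    })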

Differences in Reading DATE and TIMESTAMP Fields

  • Explanation:
    Reading DATE and TIMESTAMP fields differs for the Asia/Shanghai time zone. For values earlier than 1900-01-01 08:05:43, a value written by Spark 2.4.5 may be read differently by Spark 3.3.1 than by Spark 2.4.5.
    • Spark 2.4.x:

      For the Asia/Shanghai time zone, 1900-01-01 00:00:00 written by Spark 2.4.5 and read by Spark 2.4.5 returns 1900-01-01 00:00:00.

    • Spark 3.3.x:

      For the Asia/Shanghai time zone, 1900-01-01 00:00:00 written by Spark 2.4.5 and read by Spark 3.3.1 with spark.sql.parquet.int96RebaseModeInRead=LEGACY returns 1900-01-01 00:00:00. With spark.sql.parquet.int96RebaseModeInRead=CORRECTED, the resulting value is 1900-01-01 00:05:43.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; the usage of DATE and TIMESTAMP fields needs to be evaluated.

  • Sample code:

    Configure the following information in the job:

    spark.sql.session.timeZone=Asia/Shanghai
    • Spark 2.4.5
      spark.sql("create table parquet_timestamp_test (id int, col0 string, col1 timestamp) using parquet");
      spark.sql("insert into parquet_timestamp_test values (1, '245', '1900-01-01 00:00:00')");

      Execute SQL to read data:

      spark.sql("select * from parquet_timestamp_test").show();

      Query results:

      +---+----+-------------------+
      | id|col0|               col1|
      +---+----+-------------------+
      |  1| 245|1900-01-01 00:00:00|
      +---+----+-------------------+
    • Spark 3.3.1

      Set the following configuration:

      spark.sql.parquet.int96RebaseModeInRead=LEGACY

      Execute a job to read data:

      spark.sql("select * from parquet_timestamp_test").show();

      Query results

      +---+----+-------------------+
      | id|col0|               col1|
      +---+----+-------------------+
      |  1| 245|1900-01-01 00:00:00|
      +---+----+-------------------+

      Modify the configuration:

      spark.sql.parquet.int96RebaseModeInRead=CORRECTED

      Re-run the following SQL statement to read data:

      spark.sql("select * from parquet_timestamp_test").show();

      Query results:

      +---+----+-------------------+
      | id|col0|               col1|
      +---+----+-------------------+
      |  1| 245|1900-01-01 00:05:43|
      +---+----+-------------------+
  • Configuration description:

    These configuration items specify how Spark handles DATE and TIMESTAMP fields for specific times (times on which the proleptic Gregorian calendar and the hybrid Julian + Gregorian calendar disagree). For example, in the Asia/Shanghai time zone, this affects how values before 1900-01-01 08:05:43 are processed.

    • Configuration items in a datasource Parquet table
      Table 1 Configuration items in a Spark 3.3.1 datasource Parquet table

      spark.sql.parquet.int96RebaseModeInRead

        Default value: EXCEPTION (default for Spark jobs)

        Description: Takes effect when reading INT96 type TIMESTAMP fields in Parquet files.

        • EXCEPTION: Throws an error for specific times, causing the read operation to fail.
        • CORRECTED: Reads the date/timestamp as is without adjustment.
        • LEGACY: Adjusts the date/timestamp from the traditional hybrid calendar (Julian + Gregorian) to the proleptic Gregorian calendar.

        This setting only takes effect when the write information for the Parquet file (e.g., Spark, Hive) is unknown.

      spark.sql.parquet.int96RebaseModeInWrite

        Default value: EXCEPTION (default for Spark jobs)

        Description: Takes effect when writing INT96 type TIMESTAMP fields in Parquet files.

        • EXCEPTION: Throws an error for specific times, causing the write operation to fail.
        • CORRECTED: Writes the date/timestamp as is without adjustment.
        • LEGACY: Adjusts the date/timestamp from the proleptic Gregorian calendar to the traditional hybrid calendar (Julian + Gregorian) when writing Parquet files.

      spark.sql.parquet.datetimeRebaseModeInRead

        Default value: EXCEPTION (default for Spark jobs)

        Description: Takes effect when reading DATE, TIMESTAMP_MILLIS, TIMESTAMP_MICROS logical type fields.

        • EXCEPTION: Throws an error for specific times, causing the read operation to fail.
        • CORRECTED: Reads the date/timestamp as is without adjustment.
        • LEGACY: Adjusts the date/timestamp from the traditional hybrid calendar (Julian + Gregorian) to the proleptic Gregorian calendar.

        This setting only takes effect when the write information for the Parquet file (e.g., Spark, Hive) is unknown.

      spark.sql.parquet.datetimeRebaseModeInWrite

        Default value: EXCEPTION (default for Spark jobs)

        Description: Takes effect when writing DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS logical type fields.

        • EXCEPTION: Throws an error for specific times, causing the write operation to fail.
        • CORRECTED: Writes the date/timestamp as is without adjustment.
        • LEGACY: Adjusts the date/timestamp from the proleptic Gregorian calendar to the traditional hybrid calendar (Julian + Gregorian) when writing Parquet files.
    • Configuration items in a datasource Avro table
      Table 2 Configuration items in a Spark 3.3.1 datasource Avro table

      spark.sql.avro.datetimeRebaseModeInRead

        Default value: EXCEPTION (default for Spark jobs)

        Description: Takes effect when reading DATE, TIMESTAMP_MILLIS, TIMESTAMP_MICROS logical type fields.

        • EXCEPTION: Throws an error for specific times, causing the read operation to fail.
        • CORRECTED: Reads the date/timestamp as is without adjustment.
        • LEGACY: Adjusts the date/timestamp from the traditional hybrid calendar (Julian + Gregorian) to the proleptic Gregorian calendar.

        This setting only takes effect when the write information for the Avro file (e.g., Spark, Hive) is unknown.

      spark.sql.avro.datetimeRebaseModeInWrite

        Default value: EXCEPTION (default for Spark jobs)

        Description: Takes effect when writing DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS logical type fields.

        • EXCEPTION: Throws an error for specific times, causing the write operation to fail.
        • CORRECTED: Writes the date/timestamp as is without adjustment.
        • LEGACY: Adjusts the date/timestamp from the proleptic Gregorian calendar to the traditional hybrid calendar (Julian + Gregorian) when writing Avro files.

Difference in from_unixtime Function

  • Explanation:
    • Spark 2.4.x:

      For the Asia/Shanghai time zone, -2209017600 returns 1900-01-01 00:00:00.

    • Spark 3.3.x:

      For the Asia/Shanghai time zone, -2209017943 returns 1900-01-01 00:00:00.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; usage of this function needs to be checked.

  • Sample code:

    Configure the following information in the job:

    spark.sql.session.timeZone=Asia/Shanghai
    • Spark 2.4.5

      Run the following statement to read data:

      select from_unixtime(-2209017600);
      Query results:
      +-----------------------------------------------+
      |from_unixtime(-2209017600, yyyy-MM-dd HH:mm:ss)|
      +-----------------------------------------------+
      |                            1900-01-01 00:00:00|
      +-----------------------------------------------+
    • Spark 3.3.1

      Run the following statement to read data:

      select from_unixtime(-2209017600);

      Query results

      +-----------------------------------------------+
      |from_unixtime(-2209017600, yyyy-MM-dd HH:mm:ss)|
      +-----------------------------------------------+
      |                            1900-01-01 00:05:43|
      +-----------------------------------------------+

Difference in unix_timestamp Function

  • Explanation:
    This difference applies to values earlier than 1900-01-01 08:05:43 in the Asia/Shanghai time zone.
    • Spark 2.4.x:

      For the Asia/Shanghai time zone, 1900-01-01 00:00:00 returns -2209017600.

    • Spark 3.3.x:

      For the Asia/Shanghai time zone, 1900-01-01 00:00:00 returns -2209017943.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; usage of this function needs to be checked.

  • Sample code:

    Configure the following information in the job:

    spark.sql.session.timeZone=Asia/Shanghai
    • Spark 2.4.5

      Run the following statement to read data:

      select unix_timestamp('1900-01-01 00:00:00');
      Query results:
      +--------------------------------------------------------+
      |unix_timestamp(1900-01-01 00:00:00, yyyy-MM-dd HH:mm:ss)|
      +--------------------------------------------------------+
      |                                             -2209017600|
      +--------------------------------------------------------+
    • Spark 3.3.1

      Run the following statement to read data:

      select unix_timestamp('1900-01-01 00:00:00');

      Query results

      +--------------------------------------------------------+
      |unix_timestamp(1900-01-01 00:00:00, yyyy-MM-dd HH:mm:ss)|
      +--------------------------------------------------------+
      |                                             -2209017943|
      +--------------------------------------------------------+
