Differences in SQL Queues Between Spark 2.4.x and Spark 3.3.x
DLI has summarized the differences in SQL queues between Spark 2.4.x and Spark 3.3.x to help you understand the impact of upgrading the Spark version on jobs running in the SQL queues with the new engine.
Difference in the Return Type of the histogram_numeric Function
- Explanation:
The histogram_numeric function in Spark SQL returns an array of structs (x, y), where the type of x varies between different engine versions.
- Spark 2.4.x: x is of type double (this is the behavior in Spark 3.2 and earlier).
- Spark 3.3.x: The type of x is equal to the input value type of the function.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; related usages need to be adapted.
- Sample code:
Prepare data:
create table test_histogram_numeric(val int);
INSERT INTO test_histogram_numeric VALUES(1),(5),(8),(6),(7),(9),(8),(9);
Execute SQL:
select histogram_numeric(val,3) from test_histogram_numeric;
- Spark 2.4.5
[{"x":1.0,"y":1.0},{"x":5.5,"y":2.0},{"x":8.200000000000001,"y":5.0}]
- Spark 3.3.1
[{"x":1,"y":1.0},{"x":5,"y":2.0},{"x":8,"y":5.0}]
Spark 3.3.x No Longer Supports Using "0$" to Specify the First Argument
- Explanation:
In format_string(strfmt, obj, ...) and printf(strfmt, obj, ...), strfmt will no longer support using 0$ to specify the first argument; the first argument should always be referenced by 1$ when using argument indexing to indicate the position of the argument in the parameter list.
- Spark 2.4.x: Both %0$ and %1$ can be used to reference the first argument.
- Spark 3.3.x: %0$ is no longer supported.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; usages involving %0$ need to be modified to adapt to Spark 3.3.x.
- Sample code 1:
Execute SQL:
SELECT format_string('Hello, %0$s! I\'m %1$s!', 'Alice', 'Lilei');
- Spark 2.4.5
Hello, Alice! I'm Alice!
- Spark 3.3.1
DLI.0005: The value of parameter(s) 'strfmt' in `format_string` is invalid: expects %1$, %2$ and so on, but got %0$.
- Sample code 2:
Execute SQL:
SELECT format_string('Hello, %1$s! I\'m %2$s!', 'Alice', 'Lilei');
- Spark 2.4.5
Hello, Alice! I'm Lilei!
- Spark 3.3.1
Hello, Alice! I'm Lilei!
Empty Strings Without Quotes in Spark 3.3.x
- Explanation:
By default, in the CSV data source, empty strings are represented as "" in Spark 2.4.5. After upgrading to Spark 3.3.1, empty strings have no quotes.
- Spark 2.4.x: Empty strings in the CSV data source are represented as "".
- Spark 3.3.x: Empty strings in the CSV data source have no quotes.
To restore the format of Spark 2.4.x in Spark 3.3.x, you can set spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv to true.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; the representation of null values in exported CSV files is different.
- Sample code:
Prepare data:
create table test_null(id int,name string) stored as parquet;
insert into test_null values(1,null);
Export a CSV file and check the file content:
- Spark 2.4.5
1,""
- Spark 3.3.1
1,
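- Illustrative example (a minimal sketch; reuses the test_null table above):
To restore the Spark 2.4.x format, configure in the SQL interface:
spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv=true
Export the query result again; with this setting, the null value is expected to be written as "" as in Spark 2.4.5.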
Different Return Results of the describe function
- Explanation:
If the function does not exist, describe function fails in the new version.
- Spark 2.4.x: The DESCRIBE FUNCTION command still runs and prints Function: func_name not found.
- Spark 3.3.x: The command fails with an error if the function does not exist.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; the results returned by describe function and its related APIs are different.
- Sample code:
Execute SQL:
describe function dli_no (dli_no does not exist)
Clear Indication That Specified External Table Property is Not Supported
- Explanation:
The external property of the table becomes reserved. If the external property is specified, certain commands will fail.
- Spark 2.4.x: Commands succeed when specifying the external property via CREATE TABLE ... TBLPROPERTIES and ALTER TABLE ... SET TBLPROPERTIES, but the external property is silently ignored, and the table remains a managed table.
- Spark 3.3.x:
Commands will fail when specifying the external property via CREATE TABLE ... TBLPROPERTIES and ALTER TABLE ... SET TBLPROPERTIES.
To restore the Spark 2.4.x usage in Spark 3.3.x, set spark.sql.legacy.notReserveProperties to true.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; related usages need to be adapted.
- Sample code:
Execute SQL:
CREATE TABLE test_external(id INT,name STRING) TBLPROPERTIES('external'=true);
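- Illustrative example (a minimal sketch showing the legacy switch; the table definition reuses the sample above):
Configure in the SQL interface:
spark.sql.legacy.notReserveProperties=true
Execute SQL:
CREATE TABLE test_external(id INT,name STRING) TBLPROPERTIES('external'=true);
With the switch enabled, the statement is expected to succeed as in Spark 2.4.x, and the external property is still silently ignored, leaving the table as a managed table.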
New Support for Parsing the Strings "+Infinity", "+INF", and "-INF"
- Explanation:
- Spark 2.4.x: When reading values from JSON properties defined as FloatType or DoubleType, Spark 2.4.x only supports parsing Infinity and -Infinity.
- Spark 3.3.x: In addition to supporting Infinity and -Infinity, Spark 3.3.x also supports parsing strings of +Infinity, +INF, and -INF.
- Is there any impact on jobs after the engine version upgrade?
Function enhancement, no impact.
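- Illustrative example (a minimal sketch; the JSON literal and field name are examples, and the displayed result may vary by client):
Execute SQL:
select from_json('{"val":"+INF"}', 'val DOUBLE');
Based on the explanation above, Spark 3.3.x is expected to parse "+INF" as positive infinity, while Spark 2.4.x cannot parse this spelling and yields a null field.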
Default Configuration spark.sql.adaptive.enabled = true
- Explanation:
- Spark 2.4.x: In Spark 2.4.x, the default value of the spark.sql.adaptive.enabled configuration item is false, meaning that the Adaptive Query Execution (AQE) feature is disabled.
- Spark 3.3.x: Starting from Spark 3.2.0, AQE is enabled by default, that is, spark.sql.adaptive.enabled is set to true.
- Is there any impact on jobs after the engine version upgrade?
DLI functionality is enhanced, and the default value of spark.sql.adaptive.enabled has changed.
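- Illustrative example (a minimal sketch for jobs that need the old behavior; whether AQE should stay enabled depends on the workload):
Configure in the SQL interface:
spark.sql.adaptive.enabled=false
Setting the item to false restores the Spark 2.4.x behavior of running without adaptive query execution; keeping the new default true leaves AQE on.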
Change in the Schema of SHOW TABLES Output
- Explanation:
The schema of the SHOW TABLES output changes from database: string to namespace: string.
- Spark 2.4.x: The schema of the SHOW TABLES output is database: string.
- Spark 3.3.x:
The schema of the SHOW TABLES output changes from database: string to namespace: string.
For the built-in catalog, the namespace field is named database; for the v2 catalog, there is no isTemporary field.
To revert to the style of Spark 2.4.x in Spark 3.3.x, set spark.sql.legacy.keepCommandOutputSchema to true.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; check the usages related to SHOW TABLES in jobs and adapt them to meet the new version's usage requirements.
- Sample code:
Execute SQL:
show tables;
- Spark 2.4.5
database  tableName  isTemporary
db1  table1  false
- Spark 3.3.1
namespace  tableName  isTemporary
db1  table1  false
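- Illustrative example (a minimal sketch for jobs that parse the output columns by name):
Configure in the SQL interface:
spark.sql.legacy.keepCommandOutputSchema=true
Execute SQL:
show tables;
With the switch enabled, the first output column is expected to be named database again, as in Spark 2.4.5.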
Schema Change in SHOW TABLE EXTENDED Output
- Explanation:
The schema of the SHOW TABLE EXTENDED output changes from database: string to namespace: string.
- Spark 2.4.x: The schema of the SHOW TABLE EXTENDED output is database: string.
- Spark 3.3.x:
- The schema of the SHOW TABLE EXTENDED output changes from database: string to namespace: string.
For the built-in catalog, the namespace field is named database; there is no change for the v2 catalog.
To revert to the style of Spark 2.4.x in Spark 3.3.x, set spark.sql.legacy.keepCommandOutputSchema to true.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; check the usages related to SHOW TABLE EXTENDED in jobs and adapt them to meet the new version's usage requirements.
- Sample code:
Execute SQL:
show table extended like 'table%';
- Spark 2.4.5
database  tableName  isTemporary  information
db1  table1  false  Database:db1...
- Spark 3.3.1
namespace  tableName  isTemporary  information
db1  table1  false  Database:db1...
Impact of Table Refresh on Dependent Item Cache
- Explanation:
After upgrading to Spark 3.3.x, a table refresh clears the table's cached data but keeps the cached data of dependent items (for example, views). This applies to the following operations:
ALTER TABLE .. ADD PARTITION
ALTER TABLE .. RENAME PARTITION
ALTER TABLE .. DROP PARTITION
ALTER TABLE .. RECOVER PARTITIONS
MSCK REPAIR TABLE
LOAD DATA
REFRESH TABLE
TRUNCATE TABLE
spark.catalog.refreshTable
- Spark 2.4.x: In Spark 2.4.x, when a table refresh operation (for example, REFRESH TABLE) is performed, the cached data of dependent items (for example, views) is not retained.
- Spark 3.3.x: After upgrading to Spark 3.3.x, a table refresh clears the table's cached data but keeps the cached data of dependent items.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; after the upgrade, a table refresh keeps the cached data of dependent items, which increases the amount of cached data compared with the original behavior.
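- Illustrative example (a minimal sketch; the table and view names are examples):
Execute SQL:
cache table t1;
create view v1 as select * from t1;
cache table v1;
refresh table t1;
Based on the explanation above, Spark 3.3.x clears the cached data of t1 but keeps the cache of the dependent view v1, whereas Spark 2.4.x does not retain the cache of v1.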
Impact of Table Refresh on Other Cached Operations Dependent on the Table
- Explanation:
- Spark 2.4.x: In Spark 2.4.x, refreshing a table will trigger the uncache operation for all other caches referencing the table only if the table itself is cached.
- Spark 3.3.x: After upgrading to the new engine version, refreshing the table will trigger the uncache operation for other caches dependent on the table, regardless of whether the table itself is cached.
- Is there any impact on jobs after the engine version upgrade?
DLI functionality is enhanced to ensure that the table refresh operation can affect the cache, improving program robustness.
New Support for Using Typed Literals in ADD PARTITION
- Explanation:
- Spark 2.4.x:
In Spark 2.4.x, using typed literals (for example, date'2020-01-01') in ADD PARTITION will parse the partition value as a string date'2020-01-01', resulting in an invalid date value and adding a partition with a null value.
The correct approach is to use a string value, such as ADD PARTITION(dt = '2020-01-01').
- Spark 3.3.x: In Spark 3.3.x, partition operations support using typed literals, supporting ADD PARTITION(dt = date'2020-01-01') and correctly parsing the partition value as a date type instead of a string.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; the handling of typed literals in ADD PARTITION has changed.
- Sample code:
Prepare data:
create table test_part_type (id int,name string,pt date) PARTITIONED by (pt);
insert into test_part_type partition (pt = '2021-01-01') select 1,'name1';
insert into test_part_type partition (pt = date'2021-01-01') select 1,'name1';
Execute SQL:
select id,name,pt from test_part_type; (Set the parameter spark.sql.forcePartitionPredicatesOnPartitionedTable.enabled to false.)
- Spark 2.4.5
1  name1  2021-01-01
1  name1
- Spark 3.3.1
1  name1  2021-01-01
1  name1  2021-01-01
Mapping Type Change of DayTimeIntervalType to Duration
- Explanation:
In the ArrowWriter and ArrowColumnVector developer APIs, starting from Spark 3.3.x, the DayTimeIntervalType in Spark SQL is mapped to Apache Arrow's Duration type.
- Spark 2.4.x: DayTimeIntervalType is mapped to Apache Arrow's Interval type.
- Spark 3.3.x: DayTimeIntervalType is mapped to Apache Arrow's Duration type.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; the mapping type of DayTimeIntervalType has changed.
Type Change in the Return Result of Date Difference
- Explanation:
The date subtraction expression (e.g., date1 - date2) returns a DayTimeIntervalType value.
- Spark 2.4.x: Returns CalendarIntervalType.
- Spark 3.3.x: Returns DayTimeIntervalType.
To restore the previous behavior, set spark.sql.legacy.interval.enabled to true.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; the default type of the date difference return result has changed.
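- Illustrative example (a minimal sketch; the dates are examples):
Execute SQL:
select date'2021-03-31' - date'2021-01-01';
Based on the explanation above, Spark 2.4.x returns a CalendarIntervalType value and Spark 3.3.x returns a DayTimeIntervalType value; setting spark.sql.legacy.interval.enabled to true in Spark 3.3.x is expected to restore the previous type.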
Mapping Type Change of Unit-to-Unit Interval
- Explanation:
- Spark 2.4.x: In Spark 2.4.x, unit-to-unit intervals (e.g., INTERVAL '1-1' YEAR TO MONTH) and unit list intervals (e.g., INTERVAL '3' DAYS '1' HOUR) are converted to CalendarIntervalType.
- Spark 3.3.x: In Spark 3.3.x, unit-to-unit intervals and unit list intervals are converted to ANSI interval types: YearMonthIntervalType or DayTimeIntervalType.
To restore the mapping type to that before Spark 2.4.x in Spark 3.3.x, set the configuration item spark.sql.legacy.interval.enabled to true.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; the mapped data type has changed.
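- Illustrative example (a minimal sketch; the literal is an example):
Execute SQL:
select INTERVAL '1-1' YEAR TO MONTH;
Based on the explanation above, Spark 2.4.x maps the literal to CalendarIntervalType, whereas Spark 3.3.x maps it to the ANSI type YearMonthIntervalType; setting spark.sql.legacy.interval.enabled to true restores the old mapping.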
Return Value Type Change in timestamps Subtraction Expression
- Explanation:
- Spark 2.4.x: In Spark 2.4.x, the timestamps subtraction expression (for example, select timestamp'2021-03-31 23:48:00' - timestamp'2021-01-01 00:00:00') returns a CalendarIntervalType value.
- Spark 3.3.x: In Spark 3.3.x, the timestamps subtraction expression (for example, select timestamp'2021-03-31 23:48:00' - timestamp'2021-01-01 00:00:00') returns a DayTimeIntervalType value.
To restore the mapping type to that before Spark 2.4.x in Spark 3.3.x, set spark.sql.legacy.interval.enabled to true.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; the mapped data type has changed.
Mixed Use of Year-Month and Day-Time Fields No Longer Supported
- Explanation:
- Spark 2.4.x: Unit list interval literals can mix year-month fields (YEAR and MONTH) and day-time fields (WEEK, DAY, ..., MICROSECOND).
- Spark 3.3.x: Unit list interval literals cannot mix year-month fields (YEAR and MONTH) and day-time fields (WEEK, DAY, ..., MICROSECOND). An invalid input error is reported.
To restore the usage to that before Spark 2.4.x in Spark 3.3.x, set spark.sql.legacy.interval.enabled to true.
- Is there any impact on jobs after the engine version upgrade?
There is an impact.
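- Illustrative example (a minimal sketch; the literal is an example):
Execute SQL:
select INTERVAL 1 YEAR 2 DAYS;
Based on the explanation above, Spark 2.4.x accepts the mix of year-month and day-time fields, while Spark 3.3.x reports an invalid input error unless spark.sql.legacy.interval.enabled is set to true.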
Reserved Properties Cannot Be Used in CREATE TABLE .. LIKE .. Command
- Explanation:
Reserved properties cannot be used in the CREATE TABLE .. LIKE .. command.
- Spark 2.4.x: In Spark 2.4.x, the CREATE TABLE .. LIKE .. command can use reserved properties.
For example, TBLPROPERTIES('location'='/tmp') does not change the table location but creates an invalid property.
- Spark 3.3.x: In Spark 3.3.x, the CREATE TABLE .. LIKE .. command cannot use reserved properties.
For example, using TBLPROPERTIES('location'='/tmp') or TBLPROPERTIES('owner'='yao') will fail.
- Is there any impact on jobs after the engine version upgrade?
There is an impact.
- Sample code 1:
Prepare data:
CREATE TABLE test0(id int, name string);
CREATE TABLE test_like_properties LIKE test0 LOCATION 'obs://bucket1/test/test_like_properties';
Execute SQL:
DESCRIBE FORMATTED test_like_properties;
- Sample code 2:
Prepare data:
CREATE TABLE test_like_properties0(id int) using parquet LOCATION 'obs://bucket1/dbgms/test_like_properties0';
CREATE TABLE test_like_properties1 like test_like_properties0 tblproperties('location'='obs://bucket1/dbgms/test_like_properties1');
Execute SQL:
DESCRIBE FORMATTED test_like_properties1;
- Spark 2.4.5
DLI.0005: mismatched input 'tblproperties' expecting {<EOF>, 'LOCATION'}
- Spark 3.3.1
The feature is not supported: location is a reserved table property, please use the LOCATION clause to specify it.
Failure to Create a View with Auto-Generated Aliases
- Explanation:
- Spark 2.4.x: If the statement contains an auto-generated alias, it will execute normally without any prompt.
- Spark 3.3.x: If the statement contains an auto-generated alias, creating/changing the view will fail.
To restore the usage to that before Spark 2.4.x in Spark 3.3.x, set spark.sql.legacy.allowAutoGeneratedAliasForView to true.
- Is there any impact on jobs after the engine version upgrade?
There is an impact.
- Sample code:
Prepare data:
create table test_view_alis(id1 int,id2 int);
INSERT INTO test_view_alis VALUES(1,2);
Execute SQL:
create view view_alis as select id1 + id2 from test_view_alis;
- Spark 2.4.5
The SQL executes successfully.
- Spark 3.3.1
Error Not allowed to create a permanent view `view_alis` without explicitly assigning an alias for expression (id1 + id2)
If the following parameter is added in Spark 3.3.1, the SQL will execute successfully:
spark.sql.legacy.allowAutoGeneratedAliasForView = true
Change in Return Value Type After Adding/Subtracting Time Field Intervals to/from Dates
- Explanation:
The return type changes when a time interval (for example, 12 hours) is added to or subtracted from a date (for example, date'2011-11-11').
- Spark 2.4.x: In Spark 2.4.x, adding or subtracting a time interval (such as 12 hours) to or from a date such as date'2011-11-11' returns a DateType value.
- Spark 3.3.x: The return type changes to a timestamp (TimestampType) to maintain compatibility with Hive.
- Is there any impact on jobs after the engine version upgrade?
There is an impact.
- Sample code:
Execute SQL:
select date '2011-11-11' - interval 12 hour
- Spark 2.4.5
2011-11-10
- Spark 3.3.1
1320897600000
Support for Char/Varchar Types in Spark SQL
- Explanation:
- Spark 2.4.x: Spark SQL table columns do not support Char/Varchar types; when specified as Char or Varchar, they are forcibly converted to the String type.
- Spark 3.3.x: Spark SQL table columns support CHAR/CHARACTER and VARCHAR types.
- Is there any impact on jobs after the engine version upgrade?
There is no impact.
- Sample code:
Prepare data:
create table test_char(id int,name varchar(24),name2 char(24));
Execute SQL:
show create table test_char;
- Spark 2.4.5
create table `test_char`(`id` INT,`name` STRING,`name2` STRING) ROW FORMAT...
- Spark 3.3.1
create table test_char(id INT,name VARCHAR(24),name2 VARCHAR(24)) ROW FORMAT...
Different Query Syntax for Null Partitions
- Explanation:
- Spark 2.4.x:
In Spark 3.0.1 or earlier, if the partition column is of type string, it is parsed as its text representation, such as the string "null".
Null partitions are queried using part_col='null'.
- Spark 3.3.x:
PARTITION(col=null) always parses to null in partition specification, even if the partition column is of type string.
Null partitions are queried using part_col is null.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; queries for null partitions need to be adapted.
- Sample code:
Prepare data:
CREATE TABLE test_part_null (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1);
INSERT INTO TABLE test_part_null PARTITION (p1 = null) SELECT 0;
Execute SQL:
select * from test_part_null;
- Spark 2.4.5
0 null
Executing select * from test_part_null where p1='null' can find partition data in Spark 2.4.5.
- Spark 3.3.1
0
Executing select * from test_part_null where p1 is null can find data in Spark 3.3.1.
Different Handling of Partitioned Table Data
- Explanation:
This applies to datasource v1 partitioned external tables that already contain data under non-UUID partition paths.
An insert overwrite partition operation in Spark 3.3.x clears the existing non-UUID partition data, whereas Spark 2.4.x does not.
- Spark 2.4.x: Retains data under non-UUID partition paths.
- Spark 3.3.x: Deletes data under non-UUID partition paths.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; it will clean up dirty data.
- Sample code:
Prepare data:
Create a directory named pt=pt1 under obs://bucket1/test/overwrite_datasource and import a Parquet data file into it.
create table overwrite_datasource(id int,name string,pt string) using parquet PARTITIONED by(pt) LOCATION 'obs://bucket1/test/overwrite_datasource';
SELECT * FROM overwrite_datasource where pt='pt1';
Neither version returns data for this query.
Execute SQL:
insert OVERWRITE table overwrite_datasource partition(pt='pt1') values(2,'aa2');
Retaining Quotes for Special Characters When Exporting CSV Files
- Explanation:
- Spark 2.4.x:
When exporting CSV files in Spark 2.4.x, if field values contain special characters such as newline (\n) and carriage return (\r), and these special characters are surrounded by quotes (e.g., double quotes "), Spark automatically handles these quotes, omitting them in the exported CSV file.
For example, the field value "a\rb" will not include quotes when exported.
- Spark 3.3.x:
In Spark 3.3.x, the handling of exporting CSV files is optimized; if field values contain special characters and are surrounded by quotes, Spark retains these quotes in the final CSV file.
For example, the field value "a\rb" retains quotes when exported.
- Is there any impact on jobs after the engine version upgrade?
No impact on query results, but it affects the export file format.
- Sample code:
Prepare data:
create table test_null2(str1 string,str2 string,str3 string,str4 string);
insert into test_null2 select "a\rb", null, "1\n2", "ab";
Execute SQL:
SELECT * FROM test_null2;
- Spark 2.4.5
a b 1 2 ab
- Spark 3.3.1
a b 1 2 ab
Export query results to OBS and check the CSV file content:
- Spark 2.4.5
a b,"","1 2",ab
- Spark 3.3.1
"a b",,"1 2",ab
New Support for Adaptive Skip Partial Agg Configuration
- Explanation:
Spark 3.3.x introduces support for adaptive Skip partial aggregation. When partial aggregation is ineffective, it can be skipped to avoid additional performance overhead. Related parameters:
- spark.sql.aggregate.adaptivePartialAggregationEnabled: controls whether to enable adaptive Skip partial aggregation. When set to true, Spark dynamically decides whether to skip partial aggregation based on runtime statistics.
- spark.sql.aggregate.adaptivePartialAggregationInterval: configures the analysis interval, that is, after processing how many rows, Spark will analyze whether to skip partial aggregation.
- spark.sql.aggregate.adaptivePartialAggregationRatio: Threshold to determine whether to skip, based on the ratio of Processed groups/Processed rows. If the ratio exceeds the configured threshold, Spark considers pre-aggregation ineffective and may choose to skip it to avoid further performance loss.
During usage, the system first analyzes at intervals configured by spark.sql.aggregate.adaptivePartialAggregationInterval. When the processed rows reach the interval, it calculates Processed groups/Processed rows. If the ratio exceeds the threshold, it considers pre-aggregation ineffective and can directly skip it.
- Is there any impact on jobs after the engine version upgrade?
Enhances DLI functionality.
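- Illustrative example (a minimal sketch; the interval and ratio values are examples, not recommendations):
Configure in the SQL interface:
spark.sql.aggregate.adaptivePartialAggregationEnabled=true
spark.sql.aggregate.adaptivePartialAggregationInterval=100000
spark.sql.aggregate.adaptivePartialAggregationRatio=0.8
With these settings, after every 100000 processed rows Spark compares Processed groups/Processed rows against 0.8 and skips partial aggregation when the ratio is exceeded.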
New Support for Parallel Multi-Insert
- Explanation:
Spark 3.3.x adds support for Parallel Multi-Insert. In multi-insert scenarios, where multiple tables are inserted in the same SQL statement, open-source Spark executes the inserts serially, limiting performance. Spark 3.3.x on DLI introduces multi-insert parallelization, allowing all inserts to be executed concurrently and improving performance.
Enable the following features by setting them to true (default false):
spark.sql.lazyExecutionForDDL.enabled=true
spark.sql.parallelMultiInsert.enabled=true
- Is there any impact on jobs after the engine version upgrade?
Enhances DLI functionality, improving the reliability of jobs with multi-insert parallelization features.
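- Illustrative example (a minimal sketch; the table names are examples):
Configure in the SQL interface:
spark.sql.lazyExecutionForDDL.enabled=true
spark.sql.parallelMultiInsert.enabled=true
Execute SQL:
from source_table
insert into table target_a select id, name
insert into table target_b select id, name;
With both switches enabled, the inserts into target_a and target_b are expected to run concurrently instead of serially.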
New Support for Enhance Reuse Exchange
- Explanation:
Spark 3.3.x introduces support for Enhance Reuse Exchange. When the SQL plan includes reusable sort merge join conditions, setting spark.sql.execution.enhanceReuseExchange.enabled to true allows reuse of SMJ plan nodes.
Enable the following features by setting them to true (default false):
spark.sql.execution.enhanceReuseExchange.enabled=true
- Is there any impact on jobs after the engine version upgrade?
Enhances DLI functionality.
Difference in Reading TIMESTAMP Fields
- Explanation:
Differences in reading TIMESTAMP fields for the Asia/Shanghai time zone. For values before 1900-01-01 08:05:43, values written by Spark 2.4.5 and read by Spark 3.3.1 differ from values read by Spark 2.4.5.
- Spark 2.4.x:
For the Asia/Shanghai time zone, 1900-01-01 00:00:00, written in Spark 2.4.5 and read by Spark 2.4.5 returns -2209017600000.
- Spark 3.3.x:
For the Asia/Shanghai time zone, 1900-01-01 00:00:00, written in Spark 2.4.5, read by Spark 3.3.1 with spark.sql.parquet.int96RebaseModeInRead=LEGACY returns -2209017943000.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; the usage of TIMESTAMP fields needs to be evaluated.
- Sample code:
Configure in the SQL interface:
spark.sql.session.timeZone=Asia/Shanghai
- Spark 2.4.5
create table parquet_timestamp_test (id int, col0 string, col1 timestamp) using parquet;
insert into parquet_timestamp_test values (1, "245", "1900-01-01 00:00:00");
Execute SQL to read data:
select * from parquet_timestamp_test;
Query results:
id  col0  col1
1  245  -2209017600000
- Spark 3.3.1
spark.sql.parquet.int96RebaseModeInRead=LEGACY
Execute SQL to read data:
select * from parquet_timestamp_test;
Query results:
id  col0  col1
1  245  -2209017943000
- Configuration description:
These configuration items specify how Spark handles DATE and TIMESTAMP type fields for specific times (times that are contentious between the proleptic Gregorian and Julian calendars). For example, in the Asia/Shanghai time zone, it refers to how values before 1900-01-01 08:05:43 are processed.
- Configuration items in a datasource Parquet table
Table 1 Configuration items in a Spark 3.3.1 datasource Parquet table
- spark.sql.parquet.int96RebaseModeInRead
Default value: LEGACY (default for Spark SQL jobs)
Description: Takes effect when reading INT96 type TIMESTAMP fields in Parquet files.
- EXCEPTION: Throws an error for specific times, causing the read operation to fail.
- CORRECTED: Reads the date/timestamp as is without adjustment.
- LEGACY: Adjusts the date/timestamp from the traditional hybrid calendar (Julian + Gregorian) to the proleptic Gregorian calendar.
This setting only takes effect when the write information for the Parquet file (e.g., Spark, Hive) is unknown.
- spark.sql.parquet.int96RebaseModeInWrite
Default value: LEGACY (default for Spark SQL jobs)
Description: Takes effect when writing INT96 type TIMESTAMP fields in Parquet files.
- EXCEPTION: Throws an error for specific times, causing the write operation to fail.
- CORRECTED: Writes the date/timestamp as is without adjustment.
- LEGACY: Adjusts the date/timestamp from the proleptic Gregorian calendar to the traditional hybrid calendar (Julian + Gregorian) when writing Parquet files.
- spark.sql.parquet.datetimeRebaseModeInRead
Default value: LEGACY (default for Spark SQL jobs)
Description: Takes effect when reading DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS logical type fields.
- EXCEPTION: Throws an error for specific times, causing the read operation to fail.
- CORRECTED: Reads the date/timestamp as is without adjustment.
- LEGACY: Adjusts the date/timestamp from the traditional hybrid calendar (Julian + Gregorian) to the proleptic Gregorian calendar.
This setting only takes effect when the write information for the Parquet file (e.g., Spark, Hive) is unknown.
- spark.sql.parquet.datetimeRebaseModeInWrite
Default value: LEGACY (default for Spark SQL jobs)
Description: Takes effect when writing DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS logical type fields.
- EXCEPTION: Throws an error for specific times, causing the write operation to fail.
- CORRECTED: Writes the date/timestamp as is without adjustment.
- LEGACY: Adjusts the date/timestamp from the proleptic Gregorian calendar to the traditional hybrid calendar (Julian + Gregorian) when writing Parquet files.
- Configuration items in a datasource Avro table
Table 2 Configuration items in a Spark 3.3.1 datasource Avro table
- spark.sql.avro.datetimeRebaseModeInRead
Default value: LEGACY (default for Spark SQL jobs)
Description: Takes effect when reading DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS logical type fields.
- EXCEPTION: Throws an error for specific times, causing the read operation to fail.
- CORRECTED: Reads the date/timestamp as is without adjustment.
- LEGACY: Adjusts the date/timestamp from the traditional hybrid calendar (Julian + Gregorian) to the proleptic Gregorian calendar.
This setting only takes effect when the write information for the Avro file (e.g., Spark, Hive) is unknown.
- spark.sql.avro.datetimeRebaseModeInWrite
Default value: LEGACY (default for Spark SQL jobs)
Description: Takes effect when writing DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS logical type fields.
- EXCEPTION: Throws an error for specific times, causing the write operation to fail.
- CORRECTED: Writes the date/timestamp as is without adjustment.
- LEGACY: Adjusts the date/timestamp from the proleptic Gregorian calendar to the traditional hybrid calendar (Julian + Gregorian) when writing Avro files.
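- Illustrative example (a minimal sketch; choose the mode according to how the files were written):
Configure in the SQL interface:
spark.sql.parquet.int96RebaseModeInRead=CORRECTED
With CORRECTED, INT96 timestamps are read as is without calendar adjustment; with LEGACY (the default for Spark SQL jobs), they are rebased as described in Table 1.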
Difference in from_unixtime Function
- Explanation:
- Spark 2.4.x:
For the Asia/Shanghai time zone, -2209017600 returns 1900-01-01 00:00:00.
- Spark 3.3.x:
For the Asia/Shanghai time zone, -2209017943 returns 1900-01-01 00:00:00.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; usage of this function needs to be checked.
- Sample code:
Configure in the SQL interface:
spark.sql.session.timeZone=Asia/Shanghai
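- Illustrative example (a minimal sketch based on the values in the explanation above):
Execute SQL:
select from_unixtime(-2209017600);
According to the explanation above, Spark 2.4.5 returns 1900-01-01 00:00:00 for this input, while Spark 3.3.1 returns a different wall-clock time because -2209017943, not -2209017600, maps to 1900-01-01 00:00:00 in the new version.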
Difference in unix_timestamp Function
- Explanation:
The behavior differs for values earlier than 1900-01-01 08:05:43 in the Asia/Shanghai time zone.
- Spark 2.4.x:
For the Asia/Shanghai time zone, 1900-01-01 00:00:00 returns -2209017600.
- Spark 3.3.x:
For the Asia/Shanghai time zone, 1900-01-01 00:00:00 returns -2209017943.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; usage of this function needs to be checked.
- Sample code:
Configure in the SQL interface:
spark.sql.session.timeZone=Asia/Shanghai
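- Illustrative example (a minimal sketch based on the values in the explanation above):
Execute SQL:
select unix_timestamp('1900-01-01 00:00:00');
According to the explanation above, Spark 2.4.5 returns -2209017600, while Spark 3.3.1 returns -2209017943.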