Differences in SQL Queues Between Spark 2.4.x and Spark 3.3.x
DLI has summarized the differences in SQL queues between Spark 2.4.x and Spark 3.3.x to help you understand the impact of upgrading the Spark version on jobs running in the SQL queues with the new engine.
Difference in the Return Type of the histogram_numeric Function
- Explanation:
The histogram_numeric function in Spark SQL returns an array of structs (x, y), where the type of x varies between different engine versions.
- Spark 2.4.x: x is of type double (this is the behavior in Spark 3.2 and earlier).
- Spark 3.3.x: The type of x is equal to the input value type of the function.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; related usages need to be adapted.
- Sample code:
Prepare data:
create table test_histogram_numeric(val int);
INSERT INTO test_histogram_numeric VALUES(1),(5),(8),(6),(7),(9),(8),(9);
Execute SQL:
select histogram_numeric(val,3) from test_histogram_numeric;
- Spark 2.4.5
[{"x":1.0,"y":1.0},{"x":5.5,"y":2.0},{"x":8.200000000000001,"y":5.0}]
- Spark 3.3.1
[{"x":1,"y":1.0},{"x":5,"y":2.0},{"x":8,"y":5.0}]
Spark 3.3.x No Longer Supports Using "0$" to Specify the First Argument
- Explanation:
In format_string(strfmt, obj, ...) and printf(strfmt, obj, ...), strfmt will no longer support using 0$ to specify the first argument; the first argument should always be referenced by 1$ when using argument indexing to indicate the position of the argument in the parameter list.
- Spark 2.4.x: Both %0$ and %1$ can be used to reference the first argument.
- Spark 3.3.x: %0$ is no longer supported.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; usages involving %0$ need to be modified to adapt to Spark 3.3.x.
- Sample code 1:
Execute SQL:
SELECT format_string('Hello, %0$s! I\'m %1$s!', 'Alice', 'Lilei');
- Spark 2.4.5
Hello, Alice! I'm Alice!
- Spark 3.3.1
DLI.0005: The value of parameter(s) 'strfmt' in `format_string` is invalid: expects %1$, %2$ and so on, but got %0$.
- Sample code 2:
Execute SQL:
SELECT format_string('Hello, %1$s! I\'m %2$s!', 'Alice', 'Lilei');
- Spark 2.4.5
Hello, Alice! I'm Lilei!
- Spark 3.3.1
Hello, Alice! I'm Lilei!
Empty Strings Without Quotes in Spark 3.3.x
- Explanation:
By default, in the CSV data source, empty strings are represented as "" in Spark 2.4.5. After upgrading to Spark 3.3.1, empty strings have no quotes.
- Spark 2.4.x: Empty strings in the CSV data source are represented as "".
- Spark 3.3.x: Empty strings in the CSV data source have no quotes.
To restore the format of Spark 2.4.x in Spark 3.3.x, you can set spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv to true.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; the representation of null values in exported CSV files is different.
- Sample code:
Prepare data:
create table test_null(id int,name string) stored as parquet;
insert into test_null values(1,null);
Export a CSV file and check the file content:
- Spark 2.4.5
1,""
- Spark 3.3.1
1,
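- Illustrative example (a minimal sketch; reuses the test_null table above):
To restore the Spark 2.4.x format, configure in the SQL interface:
spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv=true
Export the query result again; with this setting, the null value is expected to be written as "" as in Spark 2.4.5.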
Different Return Results of the describe function
- Explanation:
If the function does not exist, describe function fails in the new version.
- Spark 2.4.x: The DESCRIBE FUNCTION command still runs and prints Function: func_name not found.
- Spark 3.3.x: The command fails with an error if the function does not exist.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; the results returned by describe function and its related APIs are different.
- Sample code:
Execute SQL:
describe function dli_no (dli_no does not exist)
Clear Indication That Specified External Table Property is Not Supported
- Explanation:
The external property of the table becomes reserved. If the external property is specified, certain commands will fail.
- Spark 2.4.x: Commands succeed when specifying the external property via CREATE TABLE ... TBLPROPERTIES and ALTER TABLE ... SET TBLPROPERTIES, but the external property is silently ignored, and the table remains a managed table.
- Spark 3.3.x:
Commands will fail when specifying the external property via CREATE TABLE ... TBLPROPERTIES and ALTER TABLE ... SET TBLPROPERTIES.
To restore the Spark 2.4.x usage in Spark 3.3.x, set spark.sql.legacy.notReserveProperties to true.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; related usages need to be adapted.
- Sample code:
Execute SQL:
CREATE TABLE test_external(id INT,name STRING) TBLPROPERTIES('external'=true);
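- Illustrative example (a minimal sketch showing the legacy switch; the table definition reuses the sample above):
Configure in the SQL interface:
spark.sql.legacy.notReserveProperties=true
Execute SQL:
CREATE TABLE test_external(id INT,name STRING) TBLPROPERTIES('external'=true);
With the switch enabled, the statement is expected to succeed as in Spark 2.4.x, and the external property is still silently ignored, leaving the table as a managed table.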
New Support for Parsing the Strings "+Infinity", "+INF", and "-INF"
- Explanation:
- Spark 2.4.x: When reading values from JSON properties defined as FloatType or DoubleType, Spark 2.4.x only supports parsing Infinity and -Infinity.
- Spark 3.3.x: In addition to supporting Infinity and -Infinity, Spark 3.3.x also supports parsing strings of +Infinity, +INF, and -INF.
- Is there any impact on jobs after the engine version upgrade?
Function enhancement, no impact.
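- Illustrative example (a minimal sketch; the JSON literal and field name are examples, and the displayed result may vary by client):
Execute SQL:
select from_json('{"val":"+INF"}', 'val DOUBLE');
Based on the explanation above, Spark 3.3.x is expected to parse "+INF" as positive infinity, while Spark 2.4.x cannot parse this spelling and yields a null field.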
Default Configuration spark.sql.adaptive.enabled = true
- Explanation:
- Spark 2.4.x: In Spark 2.4.x, the default value of the spark.sql.adaptive.enabled configuration item is false, meaning that the Adaptive Query Execution (AQE) feature is disabled.
- Spark 3.3.x: Starting from Spark 3.2.0, AQE is enabled by default, that is, spark.sql.adaptive.enabled is set to true.
- Is there any impact on jobs after the engine version upgrade?
DLI functionality is enhanced, and the default value of spark.sql.adaptive.enabled has changed.
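- Illustrative example (a minimal sketch for jobs that need the old behavior; whether AQE should stay enabled depends on the workload):
Configure in the SQL interface:
spark.sql.adaptive.enabled=false
Setting the item to false restores the Spark 2.4.x behavior of running without adaptive query execution; keeping the new default true leaves AQE on.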
Change in the Schema of SHOW TABLES Output
- Explanation:
The schema of the SHOW TABLES output changes from database: string to namespace: string.
- Spark 2.4.x: The schema of the SHOW TABLES output is database: string.
- Spark 3.3.x:
The schema of the SHOW TABLES output changes from database: string to namespace: string.
For the built-in catalog, the namespace field is named database; for the v2 catalog, there is no isTemporary field.
To revert to the style of Spark 2.4.x in Spark 3.3.x, set spark.sql.legacy.keepCommandOutputSchema to true.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; check the usages related to SHOW TABLES in jobs and adapt them to meet the new version's usage requirements.
- Sample code:
Execute SQL:
show tables;
- Spark 2.4.5
database  tableName  isTemporary
db1  table1  false
- Spark 3.3.1
namespace  tableName  isTemporary
db1  table1  false
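- Illustrative example (a minimal sketch for jobs that parse the output columns by name):
Configure in the SQL interface:
spark.sql.legacy.keepCommandOutputSchema=true
Execute SQL:
show tables;
With the switch enabled, the first output column is expected to be named database again, as in Spark 2.4.5.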
Schema Change in SHOW TABLE EXTENDED Output
- Explanation:
The schema of the SHOW TABLE EXTENDED output changes from database: string to namespace: string.
- Spark 2.4.x: The schema of the SHOW TABLE EXTENDED output is database: string.
- Spark 3.3.x:
- The schema of the SHOW TABLE EXTENDED output changes from database: string to namespace: string.
For the built-in catalog, the namespace field is named database; there is no change for the v2 catalog.
To revert to the style of Spark 2.4.x in Spark 3.3.x, set spark.sql.legacy.keepCommandOutputSchema to true.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; check the usages related to SHOW TABLE EXTENDED in jobs and adapt them to meet the new version's usage requirements.
- Sample code:
Execute SQL:
show table extended like 'table%';
- Spark 2.4.5
database  tableName  isTemporary  information
db1  table1  false  Database:db1...
- Spark 3.3.1
namespace  tableName  isTemporary  information
db1  table1  false  Database:db1...
Impact of Table Refresh on Dependent Item Cache
- Explanation:
After upgrading to Spark 3.3.x, a table refresh clears the table's cached data but keeps the cached data of dependent items (for example, views). This applies to the following operations:
ALTER TABLE .. ADD PARTITION
ALTER TABLE .. RENAME PARTITION
ALTER TABLE .. DROP PARTITION
ALTER TABLE .. RECOVER PARTITIONS
MSCK REPAIR TABLE
LOAD DATA
REFRESH TABLE
TRUNCATE TABLE
spark.catalog.refreshTable
- Spark 2.4.x: In Spark 2.4.x, when a table refresh operation (for example, REFRESH TABLE) is performed, the cached data of dependent items (for example, views) is not retained.
- Spark 3.3.x: After upgrading to Spark 3.3.x, a table refresh clears the table's cached data but keeps the cached data of dependent items.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; after the upgrade, a table refresh keeps the cached data of dependent items, which increases the amount of cached data compared with the original behavior.
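- Illustrative example (a minimal sketch; the table and view names are examples):
Execute SQL:
cache table t1;
create view v1 as select * from t1;
cache table v1;
refresh table t1;
Based on the explanation above, Spark 3.3.x clears the cached data of t1 but keeps the cache of the dependent view v1, whereas Spark 2.4.x does not retain the cache of v1.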
Impact of Table Refresh on Other Cached Operations Dependent on the Table
- Explanation:
- Spark 2.4.x: In Spark 2.4.x, refreshing a table will trigger the uncache operation for all other caches referencing the table only if the table itself is cached.
- Spark 3.3.x: After upgrading to the new engine version, refreshing the table will trigger the uncache operation for other caches dependent on the table, regardless of whether the table itself is cached.
- Is there any impact on jobs after the engine version upgrade?
DLI functionality is enhanced to ensure that the table refresh operation can affect the cache, improving program robustness.
New Support for Using Typed Literals in ADD PARTITION
- Explanation:
- Spark 2.4.x:
In Spark 2.4.x, using typed literals (for example, date'2020-01-01') in ADD PARTITION will parse the partition value as a string date'2020-01-01', resulting in an invalid date value and adding a partition with a null value.
The correct approach is to use a string value, such as ADD PARTITION(dt = '2020-01-01').
- Spark 3.3.x: In Spark 3.3.x, partition operations support using typed literals, supporting ADD PARTITION(dt = date'2020-01-01') and correctly parsing the partition value as a date type instead of a string.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; the handling of typed literals in ADD PARTITION has changed.
- Sample code:
Prepare data:
create table test_part_type (id int,name string,pt date) PARTITIONED by (pt);
insert into test_part_type partition (pt = '2021-01-01') select 1,'name1';
insert into test_part_type partition (pt = date'2021-01-01') select 1,'name1';
Execute SQL:
select id,name,pt from test_part_type; (Set the parameter spark.sql.forcePartitionPredicatesOnPartitionedTable.enabled to false.)
- Spark 2.4.5
1  name1  2021-01-01
1  name1
- Spark 3.3.1
1  name1  2021-01-01
1  name1  2021-01-01
Mapping Type Change of DayTimeIntervalType to Duration
- Explanation:
In the ArrowWriter and ArrowColumnVector developer APIs, starting from Spark 3.3.x, the DayTimeIntervalType in Spark SQL is mapped to Apache Arrow's Duration type.
- Spark 2.4.x: DayTimeIntervalType is mapped to Apache Arrow's Interval type.
- Spark 3.3.x: DayTimeIntervalType is mapped to Apache Arrow's Duration type.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; the mapping type of DayTimeIntervalType has changed.
Type Change in the Return Result of Date Difference
- Explanation:
The date subtraction expression (e.g., date1 - date2) returns a DayTimeIntervalType value.
- Spark 2.4.x: Returns CalendarIntervalType.
- Spark 3.3.x: Returns DayTimeIntervalType.
To restore the previous behavior, set spark.sql.legacy.interval.enabled to true.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; the default type of the date difference return result has changed.
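- Illustrative example (a minimal sketch; the dates are examples):
Execute SQL:
select date'2021-03-31' - date'2021-01-01';
Based on the explanation above, Spark 2.4.x returns a CalendarIntervalType value and Spark 3.3.x returns a DayTimeIntervalType value; setting spark.sql.legacy.interval.enabled to true in Spark 3.3.x is expected to restore the previous type.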
Mapping Type Change of Unit-to-Unit Interval
- Explanation:
- Spark 2.4.x: In Spark 2.4.x, unit-to-unit intervals (e.g., INTERVAL '1-1' YEAR TO MONTH) and unit list intervals (e.g., INTERVAL '3' DAYS '1' HOUR) are converted to CalendarIntervalType.
- Spark 3.3.x: In Spark 3.3.x, unit-to-unit intervals and unit list intervals are converted to ANSI interval types: YearMonthIntervalType or DayTimeIntervalType.
To restore the mapping type to that before Spark 2.4.x in Spark 3.3.x, set the configuration item spark.sql.legacy.interval.enabled to true.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; the mapped data type has changed.
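- Illustrative example (a minimal sketch; the literal is an example):
Execute SQL:
select INTERVAL '1-1' YEAR TO MONTH;
Based on the explanation above, Spark 2.4.x maps the literal to CalendarIntervalType, whereas Spark 3.3.x maps it to the ANSI type YearMonthIntervalType; setting spark.sql.legacy.interval.enabled to true restores the old mapping.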
Return Value Type Change in timestamps Subtraction Expression
- Explanation:
- Spark 2.4.x: In Spark 2.4.x, the timestamps subtraction expression (for example, select timestamp'2021-03-31 23:48:00' - timestamp'2021-01-01 00:00:00') returns a CalendarIntervalType value.
- Spark 3.3.x: In Spark 3.3.x, the timestamps subtraction expression (for example, select timestamp'2021-03-31 23:48:00' - timestamp'2021-01-01 00:00:00') returns a DayTimeIntervalType value.
To restore the mapping type to that before Spark 2.4.x in Spark 3.3.x, set spark.sql.legacy.interval.enabled to true.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; the mapped data type has changed.
Mixed Use of Year-Month and Day-Time Fields No Longer Supported
- Explanation:
- Spark 2.4.x: Unit list interval literals can mix year-month fields (YEAR and MONTH) and day-time fields (WEEK, DAY, ..., MICROSECOND).
- Spark 3.3.x: Unit list interval literals cannot mix year-month fields (YEAR and MONTH) and day-time fields (WEEK, DAY, ..., MICROSECOND). An invalid input error is reported.
To restore the usage to that before Spark 2.4.x in Spark 3.3.x, set spark.sql.legacy.interval.enabled to true.
- Is there any impact on jobs after the engine version upgrade?
There is an impact.
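- Illustrative example (a minimal sketch; the literal is an example):
Execute SQL:
select INTERVAL 1 YEAR 2 DAYS;
Based on the explanation above, Spark 2.4.x accepts the mix of year-month and day-time fields, while Spark 3.3.x reports an invalid input error unless spark.sql.legacy.interval.enabled is set to true.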
Reserved Properties Cannot Be Used in CREATE TABLE .. LIKE .. Command
- Explanation:
Reserved properties cannot be used in the CREATE TABLE .. LIKE .. command.
- Spark 2.4.x: In Spark 2.4.x, the CREATE TABLE .. LIKE .. command can use reserved properties.
For example, TBLPROPERTIES('location'='/tmp') does not change the table location but creates an invalid property.
- Spark 3.3.x: In Spark 3.3.x, the CREATE TABLE .. LIKE .. command cannot use reserved properties.
For example, using TBLPROPERTIES('location'='/tmp') or TBLPROPERTIES('owner'='yao') will fail.
- Is there any impact on jobs after the engine version upgrade?
There is an impact.
- Sample code 1:
Prepare data:
CREATE TABLE test0(id int, name string);
CREATE TABLE test_like_properties LIKE test0 LOCATION 'obs://bucket1/test/test_like_properties';
Execute SQL:
DESCRIBE FORMATTED test_like_properties;
- Sample code 2:
Prepare data:
CREATE TABLE test_like_properties0(id int) using parquet LOCATION 'obs://bucket1/dbgms/test_like_properties0';
CREATE TABLE test_like_properties1 like test_like_properties0 tblproperties('location'='obs://bucket1/dbgms/test_like_properties1');
Execute SQL:
DESCRIBE FORMATTED test_like_properties1;
- Spark 2.4.5
DLI.0005: mismatched input 'tblproperties' expecting {<EOF>, 'LOCATION'}
- Spark 3.3.1
The feature is not supported: location is a reserved table property, please use the LOCATION clause to specify it.
Failure to Create a View with Auto-Generated Aliases
- Explanation:
- Spark 2.4.x: If the statement contains an auto-generated alias, it will execute normally without any prompt.
- Spark 3.3.x: If the statement contains an auto-generated alias, creating/changing the view will fail.
To restore the usage to that before Spark 2.4.x in Spark 3.3.x, set spark.sql.legacy.allowAutoGeneratedAliasForView to true.
- Is there any impact on jobs after the engine version upgrade?
There is an impact.
- Sample code:
Prepare data:
create table test_view_alis(id1 int,id2 int);
INSERT INTO test_view_alis VALUES(1,2);
Execute SQL:
create view view_alis as select id1 + id2 from test_view_alis;
- Spark 2.4.5
The SQL executes successfully.
- Spark 3.3.1
Error Not allowed to create a permanent view `view_alis` without explicitly assigning an alias for expression (id1 + id2)
If the following parameter is added in Spark 3.3.1, the SQL will execute successfully:
spark.sql.legacy.allowAutoGeneratedAliasForView = true
Change in Return Value Type After Adding/Subtracting Time Field Intervals to/from Dates
- Explanation:
The return type changes when a time interval (for example, 12 hours) is added to or subtracted from a date (for example, date'2011-11-11').
- Spark 2.4.x: In Spark 2.4.x, adding or subtracting a time interval (such as 12 hours) to or from a date such as date'2011-11-11' returns a DateType value.
- Spark 3.3.x: The return type changes to a timestamp (TimestampType) to maintain compatibility with Hive.
- Is there any impact on jobs after the engine version upgrade?
There is an impact.
- Sample code:
Execute SQL:
select date '2011-11-11' - interval 12 hour
- Spark 2.4.5
2011-11-10
- Spark 3.3.1
1320897600000
Support for Char/Varchar Types in Spark SQL
- Explanation:
- Spark 2.4.x: Spark SQL table columns do not support Char/Varchar types; when specified as Char or Varchar, they are forcibly converted to the String type.
- Spark 3.3.x: Spark SQL table columns support CHAR/CHARACTER and VARCHAR types.
- Is there any impact on jobs after the engine version upgrade?
There is no impact.
- Sample code:
Prepare data:
create table test_char(id int,name varchar(24),name2 char(24));
Execute SQL:
show create table test_char;
- Spark 2.4.5
create table `test_char`(`id` INT,`name` STRING,`name2` STRING) ROW FORMAT...
- Spark 3.3.1
create table test_char(id INT,name VARCHAR(24),name2 VARCHAR(24)) ROW FORMAT...
Different Query Syntax for Null Partitions
- Explanation:
- Spark 2.4.x:
In Spark 3.0.1 or earlier, if the partition column is of type string, it is parsed as its text representation, such as the string "null".
Null partitions are queried using part_col='null'.
- Spark 3.3.x:
PARTITION(col=null) always parses to null in partition specification, even if the partition column is of type string.
Null partitions are queried using part_col is null.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; queries for null partitions need to be adapted.
- Sample code:
Prepare data:
CREATE TABLE test_part_null (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1);
INSERT INTO TABLE test_part_null PARTITION (p1 = null) SELECT 0;
Execute SQL:
select * from test_part_null;
- Spark 2.4.5
0 null
Executing select * from test_part_null where p1='null' can find partition data in Spark 2.4.5.
- Spark 3.3.1
0
Executing select * from test_part_null where p1 is null can find data in Spark 3.3.1.
Different Handling of Partitioned Table Data
- Explanation:
This applies to datasource v1 partitioned external tables that already contain data under non-UUID partition paths.
An insert overwrite partition operation in Spark 3.3.x clears the existing non-UUID partition data, whereas Spark 2.4.x does not.
- Spark 2.4.x: Retains data under non-UUID partition paths.
- Spark 3.3.x: Deletes data under non-UUID partition paths.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; it will clean up dirty data.
- Sample code:
Prepare data:
Create a directory named pt=pt1 under obs://bucket1/test/overwrite_datasource and import a Parquet data file into it.
create table overwrite_datasource(id int,name string,pt string) using parquet PARTITIONED by(pt) LOCATION 'obs://bucket1/test/overwrite_datasource';
SELECT * FROM overwrite_datasource where pt='pt1';
Neither version returns data for this query.
Execute SQL:
insert OVERWRITE table overwrite_datasource partition(pt='pt1') values(2,'aa2');
Retaining Quotes for Special Characters When Exporting CSV Files
- Explanation:
- Spark 2.4.x:
When exporting CSV files in Spark 2.4.x, if field values contain special characters such as newline (\n) and carriage return (\r), and these special characters are surrounded by quotes (e.g., double quotes "), Spark automatically handles these quotes, omitting them in the exported CSV file.
For example, the field value "a\rb" will not include quotes when exported.
- Spark 3.3.x:
In Spark 3.3.x, the handling of exporting CSV files is optimized; if field values contain special characters and are surrounded by quotes, Spark retains these quotes in the final CSV file.
For example, the field value "a\rb" retains quotes when exported.
- Is there any impact on jobs after the engine version upgrade?
No impact on query results, but it affects the export file format.
- Sample code:
Prepare data:
create table test_null2(str1 string,str2 string,str3 string,str4 string);
insert into test_null2 select "a\rb", null, "1\n2", "ab";
Execute SQL:
SELECT * FROM test_null2;
- Spark 2.4.5
a b 1 2 ab
- Spark 3.3.1
a b 1 2 ab
Export query results to OBS and check the CSV file content:
- Spark 2.4.5
a b,"","1 2",ab
- Spark 3.3.1
"a b",,"1 2",ab
New Support for Adaptive Skip Partial Agg Configuration
- Explanation:
Spark 3.3.x introduces support for adaptive Skip partial aggregation. When partial aggregation is ineffective, it can be skipped to avoid additional performance overhead. Related parameters:
- spark.sql.aggregate.adaptivePartialAggregationEnabled: controls whether to enable adaptive Skip partial aggregation. When set to true, Spark dynamically decides whether to skip partial aggregation based on runtime statistics.
- spark.sql.aggregate.adaptivePartialAggregationInterval: configures the analysis interval, that is, after processing how many rows, Spark will analyze whether to skip partial aggregation.
- spark.sql.aggregate.adaptivePartialAggregationRatio: Threshold to determine whether to skip, based on the ratio of Processed groups/Processed rows. If the ratio exceeds the configured threshold, Spark considers pre-aggregation ineffective and may choose to skip it to avoid further performance loss.
During usage, the system first analyzes at intervals configured by spark.sql.aggregate.adaptivePartialAggregationInterval. When the processed rows reach the interval, it calculates Processed groups/Processed rows. If the ratio exceeds the threshold, it considers pre-aggregation ineffective and can directly skip it.
- Is there any impact on jobs after the engine version upgrade?
Enhances DLI functionality.
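- Illustrative example (a minimal sketch; the interval and ratio values are examples, not recommendations):
Configure in the SQL interface:
spark.sql.aggregate.adaptivePartialAggregationEnabled=true
spark.sql.aggregate.adaptivePartialAggregationInterval=100000
spark.sql.aggregate.adaptivePartialAggregationRatio=0.8
With these settings, after every 100000 processed rows Spark compares Processed groups/Processed rows against 0.8 and skips partial aggregation when the ratio is exceeded.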
New Support for Parallel Multi-Insert
- Explanation:
Spark 3.3.x adds support for Parallel Multi-Insert. In multi-insert scenarios, where multiple tables are inserted in the same SQL statement, open-source Spark executes the inserts serially, limiting performance. Spark 3.3.x on DLI introduces multi-insert parallelization, allowing all inserts to be executed concurrently and improving performance.
Enable the following features by setting them to true (default false):
spark.sql.lazyExecutionForDDL.enabled=true
spark.sql.parallelMultiInsert.enabled=true
- Is there any impact on jobs after the engine version upgrade?
Enhances DLI functionality, improving the reliability of jobs with multi-insert parallelization features.
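- Illustrative example (a minimal sketch; the table names are examples):
Configure in the SQL interface:
spark.sql.lazyExecutionForDDL.enabled=true
spark.sql.parallelMultiInsert.enabled=true
Execute SQL:
from source_table
insert into table target_a select id, name
insert into table target_b select id, name;
With both switches enabled, the inserts into target_a and target_b are expected to run concurrently instead of serially.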
New Support for Enhance Reuse Exchange
- Explanation:
Spark 3.3.x introduces support for Enhance Reuse Exchange. When the SQL plan includes reusable sort merge join conditions, setting spark.sql.execution.enhanceReuseExchange.enabled to true allows reuse of SMJ plan nodes.
Enable the following features by setting them to true (default false):
spark.sql.execution.enhanceReuseExchange.enabled=true
- Is there any impact on jobs after the engine version upgrade?
Enhances DLI functionality.
Difference in Reading TIMESTAMP Fields
- Explanation:
Differences in reading TIMESTAMP fields for the Asia/Shanghai time zone. For values before 1900-01-01 08:05:43, values written by Spark 2.4.5 and read by Spark 3.3.1 differ from values read by Spark 2.4.5.
- Spark 2.4.x:
For the Asia/Shanghai time zone, 1900-01-01 00:00:00, written in Spark 2.4.5 and read by Spark 2.4.5 returns -2209017600000.
- Spark 3.3.x:
For the Asia/Shanghai time zone, 1900-01-01 00:00:00, written in Spark 2.4.5, read by Spark 3.3.1 with spark.sql.parquet.int96RebaseModeInRead=LEGACY returns -2209017943000.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; the usage of TIMESTAMP fields needs to be evaluated.
- Sample code:
Configure in the SQL interface:
spark.sql.session.timeZone=Asia/Shanghai
- Spark 2.4.5
create table parquet_timestamp_test (id int, col0 string, col1 timestamp) using parquet;
insert into parquet_timestamp_test values (1, "245", "1900-01-01 00:00:00");
Execute SQL to read data:
select * from parquet_timestamp_test;
Query results:
id  col0  col1
1  245  -2209017600000
- Spark 3.3.1
spark.sql.parquet.int96RebaseModeInRead=LEGACY
Execute SQL to read data:
select * from parquet_timestamp_test;
Query results:
id  col0  col1
1  245  -2209017943000
- Configuration description:
These configuration items specify how Spark handles DATE and TIMESTAMP type fields for specific times (times that are contentious between the proleptic Gregorian and Julian calendars). For example, in the Asia/Shanghai time zone, it refers to how values before 1900-01-01 08:05:43 are processed.
- Configuration items in a datasource Parquet table
Table 1 Configuration items in a Spark 3.3.1 datasource Parquet table
- spark.sql.parquet.int96RebaseModeInRead
Default value: LEGACY (default for Spark SQL jobs)
Description: Takes effect when reading INT96 type TIMESTAMP fields in Parquet files.
- EXCEPTION: Throws an error for specific times, causing the read operation to fail.
- CORRECTED: Reads the date/timestamp as is without adjustment.
- LEGACY: Adjusts the date/timestamp from the traditional hybrid calendar (Julian + Gregorian) to the proleptic Gregorian calendar.
This setting only takes effect when the write information for the Parquet file (e.g., Spark, Hive) is unknown.
- spark.sql.parquet.int96RebaseModeInWrite
Default value: LEGACY (default for Spark SQL jobs)
Description: Takes effect when writing INT96 type TIMESTAMP fields in Parquet files.
- EXCEPTION: Throws an error for specific times, causing the write operation to fail.
- CORRECTED: Writes the date/timestamp as is without adjustment.
- LEGACY: Adjusts the date/timestamp from the proleptic Gregorian calendar to the traditional hybrid calendar (Julian + Gregorian) when writing Parquet files.
- spark.sql.parquet.datetimeRebaseModeInRead
Default value: LEGACY (default for Spark SQL jobs)
Description: Takes effect when reading DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS logical type fields.
- EXCEPTION: Throws an error for specific times, causing the read operation to fail.
- CORRECTED: Reads the date/timestamp as is without adjustment.
- LEGACY: Adjusts the date/timestamp from the traditional hybrid calendar (Julian + Gregorian) to the proleptic Gregorian calendar.
This setting only takes effect when the write information for the Parquet file (e.g., Spark, Hive) is unknown.
- spark.sql.parquet.datetimeRebaseModeInWrite
Default value: LEGACY (default for Spark SQL jobs)
Description: Takes effect when writing DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS logical type fields.
- EXCEPTION: Throws an error for specific times, causing the write operation to fail.
- CORRECTED: Writes the date/timestamp as is without adjustment.
- LEGACY: Adjusts the date/timestamp from the proleptic Gregorian calendar to the traditional hybrid calendar (Julian + Gregorian) when writing Parquet files.
- Configuration items in a datasource Avro table
Table 2 Configuration items in a Spark 3.3.1 datasource Avro table
- spark.sql.avro.datetimeRebaseModeInRead
Default value: LEGACY (default for Spark SQL jobs)
Description: Takes effect when reading DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS logical type fields.
- EXCEPTION: Throws an error for specific times, causing the read operation to fail.
- CORRECTED: Reads the date/timestamp as is without adjustment.
- LEGACY: Adjusts the date/timestamp from the traditional hybrid calendar (Julian + Gregorian) to the proleptic Gregorian calendar.
This setting only takes effect when the write information for the Avro file (e.g., Spark, Hive) is unknown.
- spark.sql.avro.datetimeRebaseModeInWrite
Default value: LEGACY (default for Spark SQL jobs)
Description: Takes effect when writing DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS logical type fields.
- EXCEPTION: Throws an error for specific times, causing the write operation to fail.
- CORRECTED: Writes the date/timestamp as is without adjustment.
- LEGACY: Adjusts the date/timestamp from the proleptic Gregorian calendar to the traditional hybrid calendar (Julian + Gregorian) when writing Avro files.
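- Illustrative example (a minimal sketch; choose the mode according to how the files were written):
Configure in the SQL interface:
spark.sql.parquet.int96RebaseModeInRead=CORRECTED
With CORRECTED, INT96 timestamps are read as is without calendar adjustment; with LEGACY (the default for Spark SQL jobs), they are rebased as described in Table 1.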
Difference in from_unixtime Function
- Explanation:
- Spark 2.4.x:
For the Asia/Shanghai time zone, -2209017600 returns 1900-01-01 00:00:00.
- Spark 3.3.x:
For the Asia/Shanghai time zone, -2209017943 returns 1900-01-01 00:00:00.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; usage of this function needs to be checked.
- Sample code:
Configure in the SQL interface:
spark.sql.session.timeZone=Asia/Shanghai
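- Illustrative example (a minimal sketch based on the values in the explanation above):
Execute SQL:
select from_unixtime(-2209017600);
According to the explanation above, Spark 2.4.5 returns 1900-01-01 00:00:00 for this input, while Spark 3.3.1 returns a different wall-clock time because -2209017943, not -2209017600, maps to 1900-01-01 00:00:00 in the new version.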
Difference in unix_timestamp Function
- Explanation:
The behavior differs for values earlier than 1900-01-01 08:05:43 in the Asia/Shanghai time zone.
- Spark 2.4.x:
For the Asia/Shanghai time zone, 1900-01-01 00:00:00 returns -2209017600.
- Spark 3.3.x:
For the Asia/Shanghai time zone, 1900-01-01 00:00:00 returns -2209017943.
- Is there any impact on jobs after the engine version upgrade?
There is an impact; usage of this function needs to be checked.
- Sample code:
Configure in the SQL interface:
spark.sql.session.timeZone=Asia/Shanghai
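- Illustrative example (a minimal sketch based on the values in the explanation above):
Execute SQL:
select unix_timestamp('1900-01-01 00:00:00');
According to the explanation above, Spark 2.4.5 returns -2209017600, while Spark 3.3.1 returns -2209017943.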