Differences in SQL Queues Between Spark 2.4.x and Spark 3.3.x

Updated on 2025-02-27 GMT+08:00

DLI has summarized the differences in SQL queues between Spark 2.4.x and Spark 3.3.x to help you understand how upgrading the engine version affects jobs running on SQL queues that use the new engine.

Difference in the Return Type of the histogram_numeric Function

  • Explanation:

    The histogram_numeric function in Spark SQL returns an array of structs (x, y), where the type of x varies between different engine versions.

    • Spark 2.4.x: In Spark 3.2 or earlier (including Spark 2.4.x), x is of type double.
    • Spark 3.3.x: The type of x matches the type of the function's input values.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; related usages need to be adapted.

  • Sample code:

    Prepare data:

    create table test_histogram_numeric(val int);
    INSERT INTO test_histogram_numeric VALUES(1),(5),(8),(6),(7),(9),(8),(9);

    Execute SQL:

    select histogram_numeric(val,3) from test_histogram_numeric;
    • Spark 2.4.5
      [{"x":1.0,"y":1.0},{"x":5.5,"y":2.0},{"x":8.200000000000001,"y":5.0}]
    • Spark 3.3.1
      [{"x":1,"y":1.0},{"x":5,"y":2.0},{"x":8,"y":5.0}]

Spark 3.3.x No Longer Supports Using "0$" to Specify the First Argument

  • Explanation:

    In format_string(strfmt, obj, ...) and printf(strfmt, obj, ...), strfmt no longer supports using 0$ to specify the first argument. When argument indexes are used to indicate the position of an argument in the parameter list, the first argument must always be referenced by 1$.

    • Spark 2.4.x: Both %0$ and %1$ can reference the first argument.
    • Spark 3.3.x: %0$ is no longer supported.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; usages involving %0 need to be modified to adapt to Spark 3.3.x.

  • Sample code 1:

    Execute SQL:

    SELECT format_string('Hello, %0$s! I\'m %1$s!', 'Alice', 'Lilei');
    • Spark 2.4.5
      Hello, Alice! I'm Alice!
    • Spark 3.3.1
      DLI.0005: The value of parameter(s) 'strfmt' in `format_string` is invalid: expects %1$, %2$ and so on, but got %0$.
  • Sample code 2:

    Execute SQL:

    SELECT format_string('Hello, %1$s! I\'m %2$s!', 'Alice', 'Lilei');
    • Spark 2.4.5
      Hello, Alice! I'm Lilei!
    • Spark 3.3.1
      Hello, Alice! I'm Lilei!

Spark 3.3.x Empty String Without Quotes

  • Explanation:

    By default, Spark 2.4.5 writes empty (null) values to the CSV data source as quoted empty strings (""). After the upgrade to Spark 3.3.1, such values are written without quotes.

    • Spark 2.4.x: Empty (null) values in the CSV data source are written as "".
    • Spark 3.3.x: Empty (null) values in the CSV data source are written without quotes.

      To restore the format of Spark 2.4.x in Spark 3.3.x, you can set spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv to true.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; the format in which null values are stored in exported CSV files is different.

  • Sample code:

    Prepare data:

    create table test_null(id int,name string) stored as parquet;
    insert into test_null values(1,null);

    Export a CSV file and check the file content:

    • Spark 2.4.5
      1,""
    • Spark 3.3.1
      1,

Different Return Results of DESCRIBE FUNCTION

  • Explanation:

    If the function does not exist, describe function will fail.

    • Spark 2.4.x: DESCRIBE FUNCTION still runs successfully and prints Function: func_name not found.
    • Spark 3.3.x: DESCRIBE FUNCTION fails with an error if the function does not exist.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; the information returned by DESCRIBE FUNCTION and its related APIs is different.

  • Sample code:

    Execute SQL:

    describe function dli_no (dli_no does not exist)
    • Spark 2.4.5

      Successfully executed, function_desc content:

      Function:func_name not found
    • Spark 3.3.1
      Execution failed, DLI.0005:
      Undefined function: dli_no...

Clear Indication That Specified External Table Property is Not Supported

  • Explanation:

    The external property of the table becomes reserved. If the external property is specified, certain commands will fail.

    • Spark 2.4.x: Commands succeed when specifying the external property via CREATE TABLE ... TBLPROPERTIES and ALTER TABLE ... SET TBLPROPERTIES, but the external property is silently ignored, and the table remains a managed table.
    • Spark 3.3.x:

      Commands will fail when specifying the external property via CREATE TABLE ... TBLPROPERTIES and ALTER TABLE ... SET TBLPROPERTIES.

      To restore the Spark 2.4.x usage in Spark 3.3.x, set spark.sql.legacy.notReserveProperties to true.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; related usages need to be adapted.

  • Sample code:

    Execute SQL:

    CREATE TABLE test_external(id INT,name STRING) TBLPROPERTIES('external'=true);
    • Spark 2.4.5

      Successfully executed.

    • Spark 3.3.1
      DLI.0005: The feature is not supported: external is a reserved table property, please use CREATE EXTERNAL TABLE.

New Support for Parsing the Strings "+Infinity", "+INF", and "-INF"

  • Explanation:
    • Spark 2.4.x: When reading values from JSON properties defined as FloatType or DoubleType, Spark 2.4.x only supports parsing Infinity and -Infinity.
    • Spark 3.3.x: In addition to supporting Infinity and -Infinity, Spark 3.3.x also supports parsing strings of +Infinity, +INF, and -INF.
  • Is there any impact on jobs after the engine version upgrade?

    Function enhancement, no impact.
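
  • Sample code (for reference):

    A minimal sketch using from_json; the column name v is illustrative and not part of the original examples:

    SELECT from_json('{"v":"+INF"}', 'v DOUBLE') AS parsed;
    -- Spark 3.3.x: the string +INF is parsed, and parsed.v is Infinity
    -- Spark 2.4.x: only Infinity and -Infinity are recognized, so parsed.v is typically null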

Default Configuration spark.sql.adaptive.enabled = true

  • Explanation:
    • Spark 2.4.x: In Spark 2.4.x, the default value of the spark.sql.adaptive.enabled configuration item is false, meaning that the Adaptive Query Execution (AQE) feature is disabled.
    • Spark 3.3.x: Starting from Spark 3.2.0 (and therefore in Spark 3.3.x), AQE is enabled by default, that is, spark.sql.adaptive.enabled is set to true.
  • Is there any impact on jobs after the engine version upgrade?

    DLI functionality is enhanced, and the default value of spark.sql.adaptive.enabled has changed.
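
  • Sample configuration (for reference):

    A minimal sketch; checking or reverting the setting per job is optional:

    SET spark.sql.adaptive.enabled;        -- shows the current value (true by default in Spark 3.3.x)
    SET spark.sql.adaptive.enabled=false;  -- restores the Spark 2.4.x default if a job depends on it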

Change in the Schema of SHOW TABLES Output

  • Explanation:

    The schema of the SHOW TABLES output changes from database: string to namespace: string.

    • Spark 2.4.x: The schema of the SHOW TABLES output is database: string.
    • Spark 3.3.x:

      The schema of the SHOW TABLES output changes from database: string to namespace: string.

      For the built-in catalog, the namespace field is named database; for the v2 catalog, there is no isTemporary field.

      To revert to the style of Spark 2.4.x in Spark 3.3.x, set spark.sql.legacy.keepCommandOutputSchema to true.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; check the usages related to SHOW TABLES in jobs and adapt them to meet the new version's usage requirements.

  • Sample code:

    Execute SQL:

    show tables;
    • Spark 2.4.5
      database     tableName    isTemporary
      db1          table1       false
    • Spark 3.3.1
      namespace    tableName    isTemporary
      db1          table1       false

Schema Change in SHOW TABLE EXTENDED Output

  • Explanation:

    The schema of the SHOW TABLE EXTENDED output changes from database: string to namespace: string.

    • Spark 2.4.x: The schema of the SHOW TABLE EXTENDED output is database: string.
    • Spark 3.3.x:

      The schema of the SHOW TABLE EXTENDED output changes from database: string to namespace: string.

      For the built-in catalog, the namespace field is named database; there is no change for the v2 catalog.

      To revert to the style of Spark 2.4.x in Spark 3.3.x, set spark.sql.legacy.keepCommandOutputSchema to true.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; check the usages related to SHOW TABLE EXTENDED in jobs and adapt them to meet the new version's usage requirements.

  • Sample code:

    Execute SQL:

    show table extended like 'table%';
    • Spark 2.4.5
      database     tableName    isTemporary    information
      db1          table1       false          Database:db1...
    • Spark 3.3.1
      namespace    tableName    isTemporary    information
      db1          table1       false          Database:db1...

Impact of Table Refresh on Dependent Item Cache

  • Explanation:

    After upgrading to Spark 3.3.x, table refresh will clear the table's cache data but keep the dependent item cache.

    • Spark 2.4.x: In Spark 2.4.x, when a table refresh is performed using any of the following operations, the cache data of dependent items (for example, views) is not retained:
      ALTER TABLE .. ADD PARTITION
      ALTER TABLE .. RENAME PARTITION
      ALTER TABLE .. DROP PARTITION
      ALTER TABLE .. RECOVER PARTITIONS
      MSCK REPAIR TABLE
      LOAD DATA
      REFRESH TABLE
      TRUNCATE TABLE
      spark.catalog.refreshTable
    • Spark 3.3.x: After upgrading to Spark 3.3.x, table refresh will clear the table's cache data but keep the dependent item cache.
  • Is there any impact on jobs after the engine version upgrade?

    After the upgrade, the caches of dependent items are retained when a table is refreshed, so the amount of cached data kept for those dependencies may increase.
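
  • Sample code (for reference):

    An illustrative sketch; the table and view names are hypothetical:

    CACHE TABLE base_tab;
    CREATE VIEW dep_view AS SELECT * FROM base_tab;
    CACHE TABLE dep_view;

    REFRESH TABLE base_tab;
    -- Spark 2.4.x: the cache of dep_view is cleared together with that of base_tab
    -- Spark 3.3.x: only the cache of base_tab is cleared; dep_view stays cached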

Impact of Table Refresh on Other Cached Operations Dependent on the Table

  • Explanation:
    • Spark 2.4.x: In Spark 2.4.x, refreshing a table will trigger the uncache operation for all other caches referencing the table only if the table itself is cached.
    • Spark 3.3.x: After upgrading to the new engine version, refreshing the table will trigger the uncache operation for other caches dependent on the table, regardless of whether the table itself is cached.
  • Is there any impact on jobs after the engine version upgrade?

    DLI functionality is enhanced to ensure that the table refresh operation can affect the cache, improving program robustness.
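
  • Sample code (for reference):

    An illustrative sketch; the names are hypothetical, and base_tab itself is not cached here:

    CREATE VIEW dep_view2 AS SELECT * FROM base_tab;
    CACHE TABLE dep_view2;      -- the view is cached, the base table is not

    REFRESH TABLE base_tab;
    -- Spark 2.4.x: dep_view2 stays cached because base_tab itself was not cached
    -- Spark 3.3.x: dep_view2 is uncached even though base_tab itself was not cached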

New Support for Using Typed Literals in ADD PARTITION

  • Explanation:
    • Spark 2.4.x:

      In Spark 2.4.x, using typed literals (for example, date'2020-01-01') in ADD PARTITION will parse the partition value as a string date'2020-01-01', resulting in an invalid date value and adding a partition with a null value.

      The correct approach is to use a string value, such as ADD PARTITION(dt = '2020-01-01').

    • Spark 3.3.x: In Spark 3.3.x, partition operations support typed literals, for example ADD PARTITION(dt = date'2020-01-01'), and the partition value is correctly parsed as a date type instead of a string.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; the handling of typed literals in ADD PARTITION has changed.

  • Sample code:

    Prepare data:

    create table test_part_type (id int,name string,pt date) PARTITIONED by (pt);
    insert into test_part_type partition (pt = '2021-01-01') select 1,'name1';
    insert into test_part_type partition (pt = date'2021-01-01') select 1,'name1';

    Execute SQL:

    select id,name,pt from test_part_type;
    (Set the parameter spark.sql.forcePartitionPredicatesOnPartitionedTable.enabled to false.)
    • Spark 2.4.5
      1 name1 2021-01-01
      1 name1
    • Spark 3.3.1
      1 name1 2021-01-01
      1 name1 2021-01-01

Mapping Type Change of DayTimeIntervalType to Duration

  • Explanation:

    In the ArrowWriter and ArrowColumnVector developer APIs, starting from Spark 3.3.x, the DayTimeIntervalType in Spark SQL is mapped to Apache Arrow's Duration type.

    • Spark 2.4.x: DayTimeIntervalType is mapped to Apache Arrow's Interval type.
    • Spark 3.3.x: DayTimeIntervalType is mapped to Apache Arrow's Duration type.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; the mapping type of DayTimeIntervalType has changed.

Type Change in the Return Result of Date Difference

  • Explanation:

    The date subtraction expression (for example, date1 - date2) returns a DayTimeIntervalType value.

    • Spark 2.4.x: Returns CalendarIntervalType.
    • Spark 3.3.x: Returns DayTimeIntervalType.

      To restore the previous behavior, set spark.sql.legacy.interval.enabled to true.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; the default type of the date difference return result has changed.
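
  • Sample code (for reference):

    A minimal sketch; the displayed interval values are illustrative:

    SELECT date'2021-03-31' - date'2021-01-01' AS diff;
    -- Spark 2.4.x: diff is a CalendarIntervalType value (displayed like interval 12 weeks 5 days)
    -- Spark 3.3.x: diff is a DayTimeIntervalType value (displayed like INTERVAL '89' DAY)
    SET spark.sql.legacy.interval.enabled=true;  -- restores the Spark 2.4.x behavior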

Mapping Type Change of Unit-to-Unit Interval

  • Explanation:
    • Spark 2.4.x: In Spark 2.4.x, unit-to-unit intervals (e.g., INTERVAL '1-1' YEAR TO MONTH) and unit list intervals (e.g., INTERVAL '3' DAYS '1' HOUR) are converted to CalendarIntervalType.
    • Spark 3.3.x: In Spark 3.3.x, unit-to-unit intervals and unit list intervals are converted to ANSI interval types: YearMonthIntervalType or DayTimeIntervalType.

      To restore the Spark 2.4.x mapping in Spark 3.3.x, set the configuration item spark.sql.legacy.interval.enabled to true.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; the mapped data type has changed.
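
  • Sample code (for reference):

    A minimal sketch of the two literal forms:

    SELECT INTERVAL '1-1' YEAR TO MONTH AS ym, INTERVAL '3' DAY AS dt;
    -- Spark 2.4.x: both ym and dt are CalendarIntervalType values
    -- Spark 3.3.x: ym is a YearMonthIntervalType value and dt is a DayTimeIntervalType value
    SET spark.sql.legacy.interval.enabled=true;  -- restores the Spark 2.4.x mapping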

Return Value Type Change in timestamps Subtraction Expression

  • Explanation:
    • Spark 2.4.x: In Spark 2.4.x, the timestamps subtraction expression (for example, select timestamp'2021-03-31 23:48:00' - timestamp'2021-01-01 00:00:00') returns a CalendarIntervalType value.
    • Spark 3.3.x: In Spark 3.3.x, the timestamps subtraction expression (for example, select timestamp'2021-03-31 23:48:00' - timestamp'2021-01-01 00:00:00') returns a DayTimeIntervalType value.

      To restore the Spark 2.4.x mapping type in Spark 3.3.x, set spark.sql.legacy.interval.enabled to true.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; the mapped data type has changed.
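
  • Sample code (for reference):

    A minimal sketch of the expression described above:

    SELECT timestamp'2021-03-31 23:48:00' - timestamp'2021-01-01 00:00:00' AS diff;
    -- Spark 2.4.x: diff is a CalendarIntervalType value
    -- Spark 3.3.x: diff is a DayTimeIntervalType value
    SET spark.sql.legacy.interval.enabled=true;  -- restores the Spark 2.4.x behavior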

Mixed Use of Year-Month and Day-Time Fields No Longer Supported

  • Explanation:
    • Spark 2.4.x: Unit list interval literals can mix year-month fields (YEAR and MONTH) and day-time fields (WEEK, DAY, ..., MICROSECOND).
    • Spark 3.3.x: Unit list interval literals cannot mix year-month fields (YEAR and MONTH) and day-time fields (WEEK, DAY, ..., MICROSECOND); such literals are reported as invalid input.

      To restore the Spark 2.4.x behavior in Spark 3.3.x, set spark.sql.legacy.interval.enabled to true.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact.
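
  • Sample code (for reference):

    A minimal sketch of a mixed unit list literal:

    SELECT INTERVAL 1 YEAR 2 DAYS AS mixed;
    -- Spark 2.4.x: returns a CalendarIntervalType value of 1 year 2 days
    -- Spark 3.3.x: fails because year-month and day-time fields cannot be mixed
    SET spark.sql.legacy.interval.enabled=true;  -- restores the Spark 2.4.x behavior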

Reserved Properties Cannot Be Used in CREATE TABLE .. LIKE .. Command

  • Explanation:

    Reserved properties cannot be used in the CREATE TABLE .. LIKE .. command.

    • Spark 2.4.x: In Spark 2.4.x, the CREATE TABLE .. LIKE .. command can use reserved properties.

      For example, TBLPROPERTIES('location'='/tmp') does not change the table location but creates an invalid property.

    • Spark 3.3.x: In Spark 3.3.x, the CREATE TABLE .. LIKE .. command cannot use reserved properties.

      For example, using TBLPROPERTIES('location'='/tmp') or TBLPROPERTIES('owner'='yao') will fail.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact.

  • Sample code 1:

    Prepare data:

    CREATE TABLE test0(id int, name string);
    CREATE TABLE test_like_properties LIKE test0 LOCATION 'obs://bucket1/test/test_like_properties';

    Execute SQL:

    DESCRIBE FORMATTED test_like_properties;
    • Spark 2.4.5

      The location is properly displayed.

    • Spark 3.3.1

      The location is properly displayed.

  • Sample code 2:

    Prepare data:

    CREATE TABLE test_like_properties0(id int) using parquet LOCATION 'obs://bucket1/dbgms/test_like_properties0';
    CREATE TABLE test_like_properties1 like test_like_properties0 tblproperties('location'='obs://bucket1/dbgms/test_like_properties1');

    Execute SQL:

    DESCRIBE FORMATTED test_like_properties1;
    • Spark 2.4.5
      DLI.0005:
      mismatched input 'tblproperties' expecting {<EOF>, 'LOCATION'}
    • Spark 3.3.1
      The feature is not supported: location is a reserved table property, please use the LOCATION clause to specify it.

Failure to Create a View with Auto-Generated Aliases

  • Explanation:
    • Spark 2.4.x: If the statement contains an auto-generated alias, it will execute normally without any prompt.
    • Spark 3.3.x: If the statement contains an auto-generated alias, creating/changing the view will fail.

      To restore the Spark 2.4.x behavior in Spark 3.3.x, set spark.sql.legacy.allowAutoGeneratedAliasForView to true.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact.

  • Sample code:

    Prepare data:

    create table test_view_alis(id1 int,id2 int);
    INSERT INTO test_view_alis VALUES(1,2);

    Execute SQL:

    create view view_alis as select id1 + id2 from test_view_alis;
    • Spark 2.4.5

      Successfully executed.

    • Spark 3.3.1
      Error
      Not allowed to create a permanent view `view_alis` without explicitly assigning an alias for expression (id1 + id2)

      If the following parameter is added in Spark 3.3.1, the SQL will execute successfully:

      spark.sql.legacy.allowAutoGeneratedAliasForView = true

Change in Return Value Type After Adding/Subtracting Time Field Intervals to/from Dates

  • Explanation:

    The return type changes when adding/subtracting a time interval (for example, 12 hours) to/from a date-time field (for example, date'2011-11-11').

    • Spark 2.4.x: In Spark 2.4.x, when a time interval (such as 12 hours) is added to or subtracted from a date (such as date'2011-11-11'), the return type is DateType.
    • Spark 3.3.x: The return type changes to a timestamp (TimestampType) to maintain compatibility with Hive.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact.

  • Sample code:

    Execute SQL:

    select date '2011-11-11' - interval 12 hour
    • Spark 2.4.5
      2011-11-10
    • Spark 3.3.1
      1320897600000

Support for Char/Varchar Types in Spark SQL

  • Explanation:
    • Spark 2.4.x: Spark SQL table columns do not support Char/Varchar types; when specified as Char or Varchar, they are forcibly converted to the String type.
    • Spark 3.3.x: Spark SQL table columns support CHAR/CHARACTER and VARCHAR types.
  • Is there any impact on jobs after the engine version upgrade?

    There is no impact.

  • Sample code:

    Prepare data:

    create table test_char(id int,name varchar(24),name2 char(24));

    Execute SQL:

    show create table test_char;
    • Spark 2.4.5
      create table `test_char`(`id` INT,`name` STRING,`name2` STRING)
      ROW FORMAT...
    • Spark 3.3.1
      create table test_char(id INT,name VARCHAR(24),name2 VARCHAR(24))
      ROW FORMAT...

Different Query Syntax for Null Partitions

  • Explanation:
    • Spark 2.4.x:

      In Spark 3.0.1 or earlier, if the partition column is of type string, a null value in the partition specification is parsed as its text representation, that is, the string "null".

      Null partitions are queried using part_col='null'.

    • Spark 3.3.x:

      PARTITION(col=null) is always parsed as a null value in the partition specification, even if the partition column is of type string.

      Null partitions are queried using part_col is null.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; queries for null partitions need to be adapted.

  • Sample code:

    Prepare data:

    CREATE TABLE test_part_null (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1);
    INSERT INTO TABLE test_part_null PARTITION (p1 = null) SELECT 0;

    Execute SQL:

    select * from test_part_null;
    • Spark 2.4.5
      0 null

      Executing select * from test_part_null where p1='null' can find partition data in Spark 2.4.5.

    • Spark 3.3.1
      0

      Executing select * from test_part_null where p1 is null can find data in Spark 3.3.1.

Different Handling of Partitioned Table Data

  • Explanation:

    This applies to datasource v1 partitioned external tables in which data already exists under non-UUID partition paths.

    Performing the insert overwrite partition operation in Spark 3.3.x will clear previous non-UUID partition data, whereas Spark 2.4.x will not.
    • Spark 2.4.x:

      Retains data under non-UUID partition paths.

    • Spark 3.3.x:

      Deletes data under non-UUID partition paths.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; it will clean up dirty data.

  • Sample code:

    Prepare data:

    Create a directory named pt=pt1 under obs://bucket1/test/overwrite_datasource and import a Parquet data file into it.

    create table overwrite_datasource(id int,name string,pt string) using parquet PARTITIONED by(pt) LOCATION 'obs://bucket1/test/overwrite_datasource';
    SELECT * FROM overwrite_datasource where pt='pt1';
    (Neither version returns data for this query.)

    Execute SQL:

    insert OVERWRITE table overwrite_datasource partition(pt='pt1') values(2,'aa2');
    • Spark 2.4.5

      Retains the pt=pt1 directory.

    • Spark 3.3.1

      Deletes the pt=pt1 directory.

Retaining Quotes for Special Characters When Exporting CSV Files

  • Explanation:
    • Spark 2.4.x:

      When exporting CSV files in Spark 2.4.x, if a field value contains special characters such as newline (\n) or carriage return (\r) and the value is enclosed in quotes (for example, double quotes "), Spark strips these quotes, so they are omitted from the exported CSV file.

      For example, the field value "a\rb" will not include quotes when exported.

    • Spark 3.3.x:

      In Spark 3.3.x, the handling of exporting CSV files is optimized; if field values contain special characters and are surrounded by quotes, Spark retains these quotes in the final CSV file.

      For example, the field value "a\rb" retains quotes when exported.

  • Is there any impact on jobs after the engine version upgrade?

    No impact on query results, but it affects the export file format.

  • Sample code:

    Prepare data:

    create table test_null2(str1 string,str2 string,str3 string,str4 string);
    insert into test_null2 select "a\rb", null, "1\n2", "ab";

    Execute SQL:

    SELECT * FROM test_null2;
    • Spark 2.4.5
      a b  1 2 ab
    • Spark 3.3.1
      a b  1 2 ab

    Export query results to OBS and check the CSV file content:

    • Spark 2.4.5
      a
      b,"","1
      2",ab
    • Spark 3.3.1
      "a
      b",,"1
      2",ab

New Support for Adaptive Skip Partial Agg Configuration

  • Explanation:

    Spark 3.3.x adds support for adaptively skipping partial aggregation. When partial aggregation is ineffective, it can be skipped to avoid additional performance overhead. Related parameters:

    • spark.sql.aggregate.adaptivePartialAggregationEnabled: controls whether to enable adaptive Skip partial aggregation. When set to true, Spark dynamically decides whether to skip partial aggregation based on runtime statistics.
    • spark.sql.aggregate.adaptivePartialAggregationInterval: configures the analysis interval, that is, after processing how many rows, Spark will analyze whether to skip partial aggregation.
    • spark.sql.aggregate.adaptivePartialAggregationRatio: Threshold to determine whether to skip, based on the ratio of Processed groups/Processed rows. If the ratio exceeds the configured threshold, Spark considers pre-aggregation ineffective and may choose to skip it to avoid further performance loss.

    During usage, the system first analyzes at intervals configured by spark.sql.aggregate.adaptivePartialAggregationInterval. When the processed rows reach the interval, it calculates Processed groups/Processed rows. If the ratio exceeds the threshold, it considers pre-aggregation ineffective and can directly skip it.

  • Is there any impact on jobs after the engine version upgrade?

    Enhances DLI functionality.
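
  • Sample configuration (for reference):

    An illustrative sketch; the parameter values and the table events are hypothetical and should be tuned per workload:

    SET spark.sql.aggregate.adaptivePartialAggregationEnabled=true;
    SET spark.sql.aggregate.adaptivePartialAggregationInterval=100000;  -- analyze after every 100000 processed rows
    SET spark.sql.aggregate.adaptivePartialAggregationRatio=0.8;        -- skip when processed groups/processed rows exceeds 0.8

    SELECT user_id, count(*) AS cnt FROM events GROUP BY user_id;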

New Support for Parallel Multi-Insert

  • Explanation:

    Spark 3.3.x adds support for parallel multi-insert. In a multi-insert statement, multiple tables are inserted within the same SQL statement; open-source Spark executes these inserts serially, which limits performance. DLI on Spark 3.3.x optimizes multi-insert parallelization so that all inserts are executed concurrently, improving performance.

    Enable the following features by setting them to true (default false):

    spark.sql.lazyExecutionForDDL.enabled=true

    spark.sql.parallelMultiInsert.enabled=true

  • Is there any impact on jobs after the engine version upgrade?

    Enhances DLI functionality, improving the reliability of jobs with multi-insert parallelization features.
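
  • Sample code (for reference):

    An illustrative multi-insert statement that can benefit from this optimization; the table names are hypothetical:

    SET spark.sql.lazyExecutionForDDL.enabled=true;
    SET spark.sql.parallelMultiInsert.enabled=true;

    FROM src_orders
    INSERT INTO TABLE orders_even SELECT id, amount WHERE id % 2 = 0
    INSERT INTO TABLE orders_odd  SELECT id, amount WHERE id % 2 = 1;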

New Support for Enhance Reuse Exchange

  • Explanation:

    Spark 3.3.x introduces support for Enhance Reuse Exchange. When the SQL plan contains duplicate sort merge join (SMJ) subplans, setting spark.sql.execution.enhanceReuseExchange.enabled to true allows the SMJ plan nodes to be reused.

    Enable the following features by setting them to true (default false):

    spark.sql.execution.enhanceReuseExchange.enabled=true

  • Is there any impact on jobs after the engine version upgrade?

    Enhances DLI functionality.
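
  • Sample code (for reference):

    An illustrative query shape in which the same sort merge join subplan appears in both branches and may be reused; the table names are hypothetical:

    SET spark.sql.execution.enhanceReuseExchange.enabled=true;

    SELECT 'total' AS metric, sum(b.amount) AS val
    FROM t_order a JOIN t_pay b ON a.id = b.order_id
    UNION ALL
    SELECT 'max' AS metric, max(b.amount) AS val
    FROM t_order a JOIN t_pay b ON a.id = b.order_id;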

Difference in Reading TIMESTAMP Fields

  • Explanation:

    The reading of TIMESTAMP fields differs in the Asia/Shanghai time zone. For values earlier than 1900-01-01 08:05:43, a value written by Spark 2.4.5 and read by Spark 3.3.1 differs from the same value read back by Spark 2.4.5.
    • Spark 2.4.x:

      For the Asia/Shanghai time zone, 1900-01-01 00:00:00, written in Spark 2.4.5 and read by Spark 2.4.5 returns -2209017600000.

    • Spark 3.3.x:

      For the Asia/Shanghai time zone, 1900-01-01 00:00:00, written in Spark 2.4.5, read by Spark 3.3.1 with spark.sql.parquet.int96RebaseModeInRead=LEGACY returns -2209017943000.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; the usage of TIMESTAMP fields needs to be evaluated.

  • Sample code:

    Configure in the SQL interface:

    spark.sql.session.timeZone=Asia/Shanghai
    • Spark 2.4.5
      create table parquet_timestamp_test (id int, col0 string, col1 timestamp) using parquet;
      insert into parquet_timestamp_test values (1, "245", "1900-01-01 00:00:00");

      Execute SQL to read data:

      select * from parquet_timestamp_test;

      Query results:

      id   col0    col1
      1    245     -2209017600000
    • Spark 3.3.1
      spark.sql.parquet.int96RebaseModeInRead=LEGACY

      Execute SQL to read data:

      select * from parquet_timestamp_test;

      Query results:

      id   col0    col1
      1    245     -2209017943000
  • Configuration description:

    These configuration items specify how Spark handles DATE and TIMESTAMP type fields for specific times (times for which the proleptic Gregorian calendar and the Julian calendar disagree). For example, in the Asia/Shanghai time zone, this refers to how values earlier than 1900-01-01 08:05:43 are processed.

    • Configuration items in a datasource Parquet table
      Table 1 Configuration items in a Spark 3.3.1 datasource Parquet table

      spark.sql.parquet.int96RebaseModeInRead
        Default value: LEGACY (default for Spark SQL jobs)
        Takes effect when reading INT96 type TIMESTAMP fields in Parquet files.
        • EXCEPTION: Throws an error for specific times, causing the read operation to fail.
        • CORRECTED: Reads the date/timestamp as is without adjustment.
        • LEGACY: Adjusts the date/timestamp from the traditional hybrid calendar (Julian + Gregorian) to the proleptic Gregorian calendar.
        This setting only takes effect when the write information for the Parquet file (e.g., Spark, Hive) is unknown.

      spark.sql.parquet.int96RebaseModeInWrite
        Default value: LEGACY (default for Spark SQL jobs)
        Takes effect when writing INT96 type TIMESTAMP fields in Parquet files.
        • EXCEPTION: Throws an error for specific times, causing the write operation to fail.
        • CORRECTED: Writes the date/timestamp as is without adjustment.
        • LEGACY: Adjusts the date/timestamp from the proleptic Gregorian calendar to the traditional hybrid calendar (Julian + Gregorian) when writing Parquet files.

      spark.sql.parquet.datetimeRebaseModeInRead
        Default value: LEGACY (default for Spark SQL jobs)
        Takes effect when reading DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS logical type fields.
        • EXCEPTION: Throws an error for specific times, causing the read operation to fail.
        • CORRECTED: Reads the date/timestamp as is without adjustment.
        • LEGACY: Adjusts the date/timestamp from the traditional hybrid calendar (Julian + Gregorian) to the proleptic Gregorian calendar.
        This setting only takes effect when the write information for the Parquet file (e.g., Spark, Hive) is unknown.

      spark.sql.parquet.datetimeRebaseModeInWrite
        Default value: LEGACY (default for Spark SQL jobs)
        Takes effect when writing DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS logical type fields.
        • EXCEPTION: Throws an error for specific times, causing the write operation to fail.
        • CORRECTED: Writes the date/timestamp as is without adjustment.
        • LEGACY: Adjusts the date/timestamp from the proleptic Gregorian calendar to the traditional hybrid calendar (Julian + Gregorian) when writing Parquet files.
    • Configuration items in a datasource Avro table
      Table 2 Configuration items in a Spark 3.3.1 datasource Avro table

      spark.sql.avro.datetimeRebaseModeInRead
        Default value: LEGACY (default for Spark SQL jobs)
        Takes effect when reading DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS logical type fields.
        • EXCEPTION: Throws an error for specific times, causing the read operation to fail.
        • CORRECTED: Reads the date/timestamp as is without adjustment.
        • LEGACY: Adjusts the date/timestamp from the traditional hybrid calendar (Julian + Gregorian) to the proleptic Gregorian calendar.
        This setting only takes effect when the write information for the Avro file (e.g., Spark, Hive) is unknown.

      spark.sql.avro.datetimeRebaseModeInWrite
        Default value: LEGACY (default for Spark SQL jobs)
        Takes effect when writing DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS logical type fields.
        • EXCEPTION: Throws an error for specific times, causing the write operation to fail.
        • CORRECTED: Writes the date/timestamp as is without adjustment.
        • LEGACY: Adjusts the date/timestamp from the proleptic Gregorian calendar to the traditional hybrid calendar (Julian + Gregorian) when writing Avro files.

Difference in from_unixtime Function

  • Explanation:
    • Spark 2.4.x:

      For the Asia/Shanghai time zone, -2209017600 returns 1900-01-01 00:00:00.

    • Spark 3.3.x:

      For the Asia/Shanghai time zone, -2209017943 returns 1900-01-01 00:00:00.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; usage of this function needs to be checked.

  • Sample code:

    Configure in the SQL interface:

    spark.sql.session.timeZone=Asia/Shanghai
    • Spark 2.4.5

      Execute SQL to read data:

      select from_unixtime(-2209017600);

      Query results:

      1900-01-01 00:00:00
    • Spark 3.3.1

      Execute SQL to read data:

      select from_unixtime(-2209017600);

      Query results:

      1900-01-01 00:05:43

Difference in unix_timestamp Function

  • Explanation:

    This applies to values earlier than 1900-01-01 08:05:43 in the Asia/Shanghai time zone.
    • Spark 2.4.x:

      For the Asia/Shanghai time zone, 1900-01-01 00:00:00 returns -2209017600.

    • Spark 3.3.x:

      For the Asia/Shanghai time zone, 1900-01-01 00:00:00 returns -2209017943.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; usage of this function needs to be checked.

  • Sample code:

    Configure in the SQL interface:

    spark.sql.session.timeZone=Asia/Shanghai
    • Spark 2.4.5

      Execute SQL to read data:

      select unix_timestamp('1900-01-01 00:00:00');

      Query results:

      -2209017600
    • Spark 3.3.1

      Execute SQL to read data:

      select unix_timestamp('1900-01-01 00:00:00');

      Query results:

      -2209017943
