
Differences in General-Purpose Queues Between Spark 2.4.x and Spark 3.3.x

Updated on 2025-02-27 GMT+08:00

DLI has summarized the differences in general-purpose queues between Spark 2.4.x and Spark 3.3.x to help you understand how upgrading the Spark engine version affects jobs running in general-purpose queues on the new engine.

Log4j Dependency Updated from 1.x to 2.x

  • Explanation:

    The Log4j dependency is updated from version 1.x to 2.x.

    • Spark 2.4.x: Uses Log4j 1.x (no longer supported by the community).
    • Spark 3.3.x: Uses Log4j 2.x.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact.
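
  • Sample code:

    User JARs that call the Log4j 1.x API directly should be migrated to the Log4j 2.x API. The following is a minimal sketch (the package and class names follow the standard Log4j layouts and are not DLI-specific):

    // Spark 2.4.x style (Log4j 1.x):
    //   import org.apache.log4j.Logger
    //   private val logger = Logger.getLogger(getClass)
    // Spark 3.3.x style (Log4j 2.x):
    import org.apache.logging.log4j.LogManager

    object LoggingExample {
      private val logger = LogManager.getLogger(getClass)

      def main(args: Array[String]): Unit = {
        logger.info("job started with the Log4j 2.x API")
      }
    }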

Spark 3.3.x Does Not Support v1 Tables

  • Explanation:

    Spark 2.4.x supports datasource v1 and v2 tables. Spark 3.3.x does not support datasource v1 tables.

    For details, see DLI Datasource V1 Table and Datasource V2 Table.

    • Spark 2.4.x supports datasource v1 and v2 tables.
    • Spark 3.3.x does not support datasource v1 tables.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact. You are advised to migrate to v2 tables in Spark 2.4.5 before upgrading to Spark 3.3.1. For details, refer to the examples in DLI Datasource V1 Table and Datasource V2 Table.
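
  • Sample code:

    A minimal sketch of migrating data into a datasource v2 table before the upgrade (the table and column names are illustrative only; see DLI Datasource V1 Table and Datasource V2 Table for the exact syntax that applies to your tables):

    spark.sql("CREATE TABLE demo_table_v2 (id INT, name STRING) USING parquet");
    spark.sql("INSERT INTO demo_table_v2 SELECT id, name FROM demo_table_v1");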

Empty Input Splits Do Not Create Partitions by Default

  • Explanation:
    • Spark 2.4.x: Empty input splits create partitions by default.
    • Spark 3.3.x: Empty input splits do not create partitions by default.

      In Spark 3.3.x, spark.hadoopRDD.ignoreEmptySplits defaults to true.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; check whether partition names are used in service logic.
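
  • Sample code:

    If a job relies on partitions being created for empty input splits, the Spark 2.4.x behavior can be restored by overriding the default. Configure the following information in the job (a minimal sketch):

    spark.hadoopRDD.ignoreEmptySplits=false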

Event Log Compression Format Set to zstd

  • Explanation:

    In Spark 3.3.x, the default value for spark.eventLog.compression.codec is set to zstd. Spark will no longer use the value of spark.io.compression.codec for compressing event logs.

    • Spark 2.4.x: Uses the value of spark.io.compression.codec for event log compression format.
    • Spark 3.3.x: spark.eventLog.compression.codec is set to zstd by default.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; the event log compression format changes.
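
  • Sample code:

    If downstream tooling that parses event logs cannot handle zstd, the codec can be pinned explicitly instead of relying on the new default. Configure the following information in the job (a minimal sketch; lz4 is only an example of a standard Spark codec name):

    spark.eventLog.compress=true
    spark.eventLog.compression.codec=lz4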

Change in spark.launcher.childConectionTimeout Configuration

  • Explanation:
    • Spark 2.4.x: The configuration name is spark.launcher.childConectionTimeout.
    • Spark 3.3.x: The configuration name is changed to spark.launcher.childConnectionTimeout.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; configuration parameter names change.
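
  • Sample code:

    Job configurations that set this timeout need to use the corrected parameter name after the upgrade (the value shown is illustrative only):

    Spark 2.4.x:

    spark.launcher.childConectionTimeout=10000

    Spark 3.3.x:

    spark.launcher.childConnectionTimeout=10000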

Spark 3.3.x No Longer Supports Apache Mesos as a Resource Manager

  • Explanation:
    • Spark 2.4.x: Uses Apache Mesos as a resource manager.
    • Spark 3.3.x: No longer supports using Apache Mesos as a resource manager.
  • Is there any impact on jobs after the engine version upgrade?

    Functional change. If you were using Apache Mesos as a resource manager in Spark 2.4.x, you need to switch to another supported resource manager after upgrading to Spark 3.3.x.

Spark 3.3.x Deletes the Kubernetes Driver When the Application Terminates on Its Own

  • Explanation: Spark 3.3.x deletes the Kubernetes driver when the application terminates on its own.
  • Is there any impact on jobs after the engine version upgrade?

    Functional enhancement. This affects jobs that use Kubernetes as the resource manager: after the upgrade, Spark 3.3.x automatically deletes the driver pod when the application terminates, which may affect resource management and cleanup processes.

Spark 3.3.x Supports Custom Kubernetes Schedulers

  • Explanation:
    • Spark 2.4.x: Does not support using a specified Kubernetes scheduler to manage resource allocation and scheduling for Spark jobs.
    • Spark 3.3.x: Supports custom Kubernetes schedulers.
  • Is there any impact on jobs after the engine version upgrade?

    Functional enhancement; supports custom schedulers for resource allocation and scheduling management.
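
  • Sample code:

    In Spark 3.3.x, a custom scheduler can be selected through the Kubernetes scheduler configuration item (a minimal sketch; volcano is only an example and assumes such a scheduler is deployed in the cluster):

    spark.kubernetes.scheduler.name=volcano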

Spark Converts Non-Nullable Schemas to Nullable

  • Explanation:

    In Spark 2.4.x, when the user-specified schema contains non-nullable fields, Spark converts these non-nullable schemas to nullable.

    In Spark 3.3.x, Spark respects the nullability specified in the user schema, that is, if a field is defined as non-nullable, Spark retains this requirement and does not automatically convert it to a nullable field.

    • Spark 2.4.x: In Spark 2.4.x, when the user-specified schema contains non-nullable fields, Spark converts these non-nullable schemas to nullable.
    • Spark 3.3.x: Does not automatically convert non-nullable fields to nullable.

      To revert to the behavior of Spark 2.4.x in Spark 3.3.x, set spark.sql.legacy.respectNullabilityInTextDatasetConversion to true.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact.

  • Sample code:

    Run the following code (the types and implicit conversions used below require these imports):

    import org.apache.spark.sql.types._
    import spark.implicits._

    spark.read.schema(StructType(
      StructField("f1", LongType, nullable = false) ::
      StructField("f2", LongType, nullable = false) :: Nil)
    ).option("mode", "DROPMALFORMED").json(Seq("""{"f1": 1}""").toDS).show(false);
    • Spark 2.4.5
      +---+---+
      |f1 |f2 |
      +---+---+
      |1  |0  |
      +---+---+
    • Spark 3.3.1
      +---+----+
      |f1 |f2  |
      +---+----+
      |1  |null|
      +---+----+
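
    To obtain the Spark 2.4.x behavior in Spark 3.3.1, set the legacy switch named in the explanation before reading, for example:

    spark.conf.set("spark.sql.legacy.respectNullabilityInTextDatasetConversion", "true")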

Change in Spark Scala Version

  • Explanation:

    The Spark Scala version changes.

    • Spark 2.4.x: Uses Scala 2.11.
    • Spark 3.3.x: Upgrades to Scala 2.12.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; JAR packages need to be recompiled with the updated Scala version.
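
  • Sample code:

    A minimal build.sbt sketch of the rebuild (the artifact versions are assumptions; use the Spark and Scala versions that match your engine):

    // Spark 2.4.x build settings:
    //   scalaVersion := "2.11.12"
    //   libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.5" % "provided"
    // Spark 3.3.x build settings:
    scalaVersion := "2.12.15"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.1" % "provided"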

Change in Supported Python Versions for PySpark

  • Explanation:

    The supported Python versions for PySpark change.

    • Spark 2.4.x: PySpark supports Python 2.6+ to 3.7+.
    • Spark 3.3.x: PySpark supports Python 3.6 or later.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact due to dependency version changes. Check whether your jobs are affected.

Change in Supported Pandas Versions for PySpark

  • Explanation:
    • Spark 2.4.x: PySpark does not specify a Pandas version.
    • Spark 3.3.x: From Spark 3.3.x, PySpark requires Pandas 0.23.2 or later to use Pandas-related functions like toPandas and createDataFrame from Pandas DataFrame.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact due to dependency version changes. Check whether your jobs are affected.

Change in Supported PyArrow Versions for PySpark

  • Explanation:
    • Spark 2.4.x: PySpark does not specify a PyArrow version.
    • Spark 3.3.x: From Spark 3.3.x, PySpark requires PyArrow 0.12.1 or later to use PyArrow-related functions like pandas_udf and toPandas.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact due to dependency version changes. Check whether your jobs are affected.

DataFrameWriter Triggered Queries Named as Command

Starting with Spark 3.2.x, queries triggered by DataFrameWriter are always named command when they are sent to QueryExecutionListener. In Spark 3.1 and earlier, these queries may be named save, insertInto, or saveAsTable, depending on the specific operation.

  • Explanation:

    When a query execution is triggered by DataFrameWriter, it is always named as command when sent to QueryExecutionListener.

    • Spark 2.4.x: Named as save, insertInto, or saveAsTable.
    • Spark 3.3.x: Named as command.
  • Is there any impact on jobs after the engine version upgrade?

    There is an impact.
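
  • Sample code:

    A minimal sketch of a QueryExecutionListener that observes the query name, which is where the change is visible (the listener itself is illustrative and not part of DLI):

    import org.apache.spark.sql.execution.QueryExecution
    import org.apache.spark.sql.util.QueryExecutionListener

    spark.listenerManager.register(new QueryExecutionListener {
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
        // Spark 2.4.x: funcName may be "save", "insertInto", or "saveAsTable".
        // Spark 3.3.x: funcName is always "command" for DataFrameWriter-triggered queries.
        println(s"query finished: $funcName")
      }
      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = {
        println(s"query failed: $funcName")
      }
    })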

Differences in Reading DATE and TIMESTAMP Fields

  • Explanation:
    Reading DATE and TIMESTAMP fields differs for the Asia/Shanghai time zone. For values earlier than 1900-01-01 08:05:43, a value written by Spark 2.4.5 may be read differently by Spark 3.3.1 than by Spark 2.4.5.
    • Spark 2.4.x:

      For the Asia/Shanghai time zone, 1900-01-01 00:00:00 written by Spark 2.4.5 and read by Spark 2.4.5 returns 1900-01-01 00:00:00.

    • Spark 3.3.x:

      For the Asia/Shanghai time zone, 1900-01-01 00:00:00 written by Spark 2.4.5 and read by Spark 3.3.1 with spark.sql.parquet.int96RebaseModeInRead=LEGACY returns 1900-01-01 00:00:00. With spark.sql.parquet.int96RebaseModeInRead=CORRECTED, the resulting value is 1900-01-01 00:05:43.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; the usage of DATE and TIMESTAMP fields needs to be evaluated.

  • Sample code:

    Configure the following information in the job:

    spark.sql.session.timeZone=Asia/Shanghai
    • Spark 2.4.5
      spark.sql("create table parquet_timestamp_test (id int, col0 string, col1 timestamp) using parquet");
      spark.sql("insert into parquet_timestamp_test values (1, '245', '1900-01-01 00:00:00')");

      Execute SQL to read data:

      spark.sql("select * from parquet_timestamp_test").show();

      Query results:

      +---+----+-------------------+
      | id|col0|               col1|
      +---+----+-------------------+
      |  1| 245|1900-01-01 00:00:00|
      +---+----+-------------------+
    • Spark 3.3.1

      Set the following configuration:

      spark.sql.parquet.int96RebaseModeInRead=LEGACY

      Execute a job to read data:

      spark.sql("select * from parquet_timestamp_test").show();

      Query results

      +---+----+-------------------+
      | id|col0|               col1|
      +---+----+-------------------+
      |  1| 245|1900-01-01 00:00:00|
      +---+----+-------------------+

      Modify the configuration:

      spark.sql.parquet.int96RebaseModeInRead=CORRECTED

      Re-run the following SQL statement to read data:

      spark.sql("select * from parquet_timestamp_test").show();

      Query results:

      +---+----+-------------------+
      | id|col0|               col1|
      +---+----+-------------------+
      |  1| 245|1900-01-01 00:05:43|
      +---+----+-------------------+
  • Configuration description:

    These configuration items specify how Spark handles DATE and TIMESTAMP fields for specific times (times on which the proleptic Gregorian calendar and the hybrid Julian + Gregorian calendar disagree). For example, in the Asia/Shanghai time zone, this affects how values before 1900-01-01 08:05:43 are processed.

    • Configuration items in a datasource Parquet table
      Table 1 Configuration items in a Spark 3.3.1 datasource Parquet table

      spark.sql.parquet.int96RebaseModeInRead

        Default value: EXCEPTION (default for Spark jobs)

        Description: Takes effect when reading INT96 type TIMESTAMP fields in Parquet files.

        • EXCEPTION: Throws an error for specific times, causing the read operation to fail.
        • CORRECTED: Reads the date/timestamp as is without adjustment.
        • LEGACY: Adjusts the date/timestamp from the traditional hybrid calendar (Julian + Gregorian) to the proleptic Gregorian calendar.

        This setting only takes effect when the write information for the Parquet file (e.g., Spark, Hive) is unknown.

      spark.sql.parquet.int96RebaseModeInWrite

        Default value: EXCEPTION (default for Spark jobs)

        Description: Takes effect when writing INT96 type TIMESTAMP fields in Parquet files.

        • EXCEPTION: Throws an error for specific times, causing the write operation to fail.
        • CORRECTED: Writes the date/timestamp as is without adjustment.
        • LEGACY: Adjusts the date/timestamp from the proleptic Gregorian calendar to the traditional hybrid calendar (Julian + Gregorian) when writing Parquet files.

      spark.sql.parquet.datetimeRebaseModeInRead

        Default value: EXCEPTION (default for Spark jobs)

        Description: Takes effect when reading DATE, TIMESTAMP_MILLIS, TIMESTAMP_MICROS logical type fields.

        • EXCEPTION: Throws an error for specific times, causing the read operation to fail.
        • CORRECTED: Reads the date/timestamp as is without adjustment.
        • LEGACY: Adjusts the date/timestamp from the traditional hybrid calendar (Julian + Gregorian) to the proleptic Gregorian calendar.

        This setting only takes effect when the write information for the Parquet file (e.g., Spark, Hive) is unknown.

      spark.sql.parquet.datetimeRebaseModeInWrite

        Default value: EXCEPTION (default for Spark jobs)

        Description: Takes effect when writing DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS logical type fields.

        • EXCEPTION: Throws an error for specific times, causing the write operation to fail.
        • CORRECTED: Writes the date/timestamp as is without adjustment.
        • LEGACY: Adjusts the date/timestamp from the proleptic Gregorian calendar to the traditional hybrid calendar (Julian + Gregorian) when writing Parquet files.
    • Configuration items in a datasource Avro table
      Table 2 Configuration items in a Spark 3.3.1 datasource Avro table

      spark.sql.avro.datetimeRebaseModeInRead

        Default value: EXCEPTION (default for Spark jobs)

        Description: Takes effect when reading DATE, TIMESTAMP_MILLIS, TIMESTAMP_MICROS logical type fields.

        • EXCEPTION: Throws an error for specific times, causing the read operation to fail.
        • CORRECTED: Reads the date/timestamp as is without adjustment.
        • LEGACY: Adjusts the date/timestamp from the traditional hybrid calendar (Julian + Gregorian) to the proleptic Gregorian calendar.

        This setting only takes effect when the write information for the Avro file (e.g., Spark, Hive) is unknown.

      spark.sql.avro.datetimeRebaseModeInWrite

        Default value: EXCEPTION (default for Spark jobs)

        Description: Takes effect when writing DATE, TIMESTAMP_MILLIS, and TIMESTAMP_MICROS logical type fields.

        • EXCEPTION: Throws an error for specific times, causing the write operation to fail.
        • CORRECTED: Writes the date/timestamp as is without adjustment.
        • LEGACY: Adjusts the date/timestamp from the proleptic Gregorian calendar to the traditional hybrid calendar (Julian + Gregorian) when writing Avro files.

Difference in from_unixtime Function

  • Explanation:
    • Spark 2.4.x:

      For the Asia/Shanghai time zone, -2209017600 returns 1900-01-01 00:00:00.

    • Spark 3.3.x:

      For the Asia/Shanghai time zone, -2209017943 returns 1900-01-01 00:00:00.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; usage of this function needs to be checked.

  • Sample code:

    Configure the following information in the job:

    spark.sql.session.timeZone=Asia/Shanghai
    • Spark 2.4.5

      Run the following statement to read data:

      select from_unixtime(-2209017600);
      Query results:
      +-----------------------------------------------+
      |from_unixtime(-2209017600, yyyy-MM-dd HH:mm:ss)|
      +-----------------------------------------------+
      |                            1900-01-01 00:00:00|
      +-----------------------------------------------+
    • Spark 3.3.1

      Run the following statement to read data:

      select from_unixtime(-2209017600);

      Query results

      +-----------------------------------------------+
      |from_unixtime(-2209017600, yyyy-MM-dd HH:mm:ss)|
      +-----------------------------------------------+
      |                            1900-01-01 00:05:43|
      +-----------------------------------------------+

Difference in unix_timestamp Function

  • Explanation:
    This difference applies to values earlier than 1900-01-01 08:05:43 in the Asia/Shanghai time zone.
    • Spark 2.4.x:

      For the Asia/Shanghai time zone, 1900-01-01 00:00:00 returns -2209017600.

    • Spark 3.3.x:

      For the Asia/Shanghai time zone, 1900-01-01 00:00:00 returns -2209017943.

  • Is there any impact on jobs after the engine version upgrade?

    There is an impact; usage of this function needs to be checked.

  • Sample code:

    Configure the following information in the job:

    spark.sql.session.timeZone=Asia/Shanghai
    • Spark 2.4.5

      Run the following statement to read data:

      select unix_timestamp('1900-01-01 00:00:00');
      Query results:
      +--------------------------------------------------------+
      |unix_timestamp(1900-01-01 00:00:00, yyyy-MM-dd HH:mm:ss)|
      +--------------------------------------------------------+
      |                                             -2209017600|
      +--------------------------------------------------------+
    • Spark 3.3.1

      Run the following statement to read data:

      select unix_timestamp('1900-01-01 00:00:00');

      Query results

      +--------------------------------------------------------+
      |unix_timestamp(1900-01-01 00:00:00, yyyy-MM-dd HH:mm:ss)|
      +--------------------------------------------------------+
      |                                             -2209017943|
      +--------------------------------------------------------+
