Deze pagina is nog niet beschikbaar in uw eigen taal. We werken er hard aan om meer taalversies toe te voegen. Bedankt voor uw steun.

On this page

Show all

Help Center/ MapReduce Service/ Component Operation Guide (Normal)/ Using Spark2x (for MRS 3.x or Later)/ Common Issues About Spark2x/ Spark SQL and DataFrame/ Why Does 16 Terabytes of Text Data Fails to Be Converted into 4 Terabytes of Parquet Data?

Why Does 16 Terabytes of Text Data Fails to Be Converted into 4 Terabytes of Parquet Data?

Updated on 2022-09-15 GMT+08:00

Question

When the default configuration is used, 16 terabytes of text data fails to be converted into 4 terabytes of parquet data, and the error information below is displayed. Why?

Job aborted due to stage failure: Task 2866 in stage 11.0 failed 4 times, most recent failure: Lost task 2866.6 in stage 11.0 (TID 54863, linux-161, 2): java.io.IOException: Failed to connect to /10.16.1.11:23124
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:92)

Table 1 lists the default configuration.

Table 1 Parameter Description

Parameter

Description

Default Value

spark.sql.shuffle.partitions

Number of shuffle data blocks during the shuffle operation.

200

spark.shuffle.sasl.timeout

Timeout interval of SASL authentication for the shuffle operation. Unit: second

120s

spark.shuffle.io.connectionTimeout

Timeout interval for connecting to a remote node during the shuffle operation. Unit: second

120s

spark.network.timeout

Timeout interval for all network connection operations. Unit: second

360s

Answer

The current data volume is 16 TB, but the number of partitions is only 200. As a result, each task is overloaded and the preceding problem occurs.

To solve the preceding problem, you need to adjust the parameters.

  • Increase the number of partitions to divide the task into smaller ones.
  • Increase the timeout interval during task execution.

Configure the following parameters in the spark-defaults.conf file on the client:

Table 2 Parameter Description

Parameter

Description

Recommended Value

spark.sql.shuffle.partitions

Number of shuffle data blocks during the shuffle operation.

4501

spark.shuffle.sasl.timeout

Timeout interval of SASL authentication for the shuffle operation. Unit: second

2000s

spark.shuffle.io.connectionTimeout

Timeout interval for connecting to a remote node during the shuffle operation. Unit: second

3000s

spark.network.timeout

Timeout interval for all network connection operations. Unit: second

360s

Feedback

Feedback

Feedback

0/500

Selected Content

Submit selected content with the feedback