Why Does 16 Terabytes of Text Data Fails to Be Converted into 4 Terabytes of Parquet Data?
Question
When the default configuration is used, 16 terabytes of text data fails to be converted into 4 terabytes of parquet data, and the error information below is displayed. Why?
Job aborted due to stage failure: Task 2866 in stage 11.0 failed 4 times, most recent failure: Lost task 2866.6 in stage 11.0 (TID 54863, linux-161, 2): java.io.IOException: Failed to connect to /10.16.1.11:23124 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167) at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:92)
Table 1 lists the default configuration.
Parameter |
Description |
Default Value |
---|---|---|
spark.sql.shuffle.partitions |
Number of shuffle data blocks during the shuffle operation. |
200 |
spark.shuffle.sasl.timeout |
Timeout interval of SASL authentication for the shuffle operation. Unit: second |
120s |
spark.shuffle.io.connectionTimeout |
Timeout interval for connecting to a remote node during the shuffle operation. Unit: second |
120s |
spark.network.timeout |
Timeout interval for all network connection operations. Unit: second |
360s |
Answer
The current data volume is 16 TB, but the number of partitions is only 200. As a result, each task is overloaded and the preceding problem occurs.
To solve the preceding problem, you need to adjust the parameters.
- Increase the number of partitions to divide the task into smaller ones.
- Increase the timeout interval during task execution.
Configure the following parameters in the spark-defaults.conf file on the client:
Parameter |
Description |
Recommended Value |
---|---|---|
spark.sql.shuffle.partitions |
Number of shuffle data blocks during the shuffle operation. |
4501 |
spark.shuffle.sasl.timeout |
Timeout interval of SASL authentication for the shuffle operation. Unit: second |
2000s |
spark.shuffle.io.connectionTimeout |
Timeout interval for connecting to a remote node during the shuffle operation. Unit: second |
3000s |
spark.network.timeout |
Timeout interval for all network connection operations. Unit: second |
360s |
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot