Failed to Load Data to a Hive Table Using spark-shell Command

Issue

When the spark-shell command is used to execute SQL statements or the spark-submit command is used to submit Spark tasks, the load command of SQL statements exists, and the source data and target table are not stored in the same file system. An error is reported when the MapReduce task is started in the preceding two modes.

Cause Analysis

When the load command is used to import data to the Hive table across file systems (for example, the original data is stored in HDFS but the Hive table data is stored in OBS), and the file length is greater than the threshold (32 MB by default). In this case, the MapReduce task that uses DistCp is triggered to migrate data. The MapReduce task configuration is directly extracted from the Spark task configuration. However, the net.topology.node.switch.mapping.impl configuration item of the Spark task does not retain the default value of Hadoop. Therefore, the JAR package of Spark needs to be used. As a result, MapReduce reports an error indicating that the class cannot be found.

Procedure

Solution 1:

If the file size is small, set the default file size to a value greater than the maximum file size. For example, if the maximum file size is 95 MB, run the following command:

hive.exec.copyfile.maxsize=104857600

Solution 2:

If the file size is large, use DistCp to improve the data migration efficiency. Add the following parameters when starting the Spark task:

--conf spark.hadoop.net.topology.node.switch.mapping.impl=org.apache.hadoop.net.ScriptBasedMapping

Parent topic: Using Spark

Previous topic: Disk Space Is Insufficient Due to Long-Term Running of JDBCServer

Next topic: Spark Task Submission Failure

Feedback

Was this page helpful?

Helpful Not helpful

Provide feedback

Thank you very much for your feedback. We will continue working to improve the documentation.

The system is busy. Please try again later.