Failed to Load Data to a Hive Table Across File Systems by Running SQL Statements Using Spark Shell
Issue
When the spark-shell command is used to execute SQL statements or the spark-submit command is used to submit Spark tasks, the load command of SQL statements exists, and the source data and target table are not stored in the same file system. An error is reported when the MapReduce task is started in the preceding two modes.
Cause Analysis
When the load command is used to import data to the Hive table across file systems (for example, the original data is stored in HDFS but the Hive table data is stored in OBS), and the file length is greater than the threshold (32 MB by default). In this case, the MapReduce task that uses DistCp is triggered to migrate data. The MapReduce task configuration is directly extracted from the Spark task configuration. However, the net.topology.node.switch.mapping.impl configuration item of the Spark task does not retain the default value of Hadoop. Therefore, the JAR package of Spark needs to be used. As a result, MapReduce reports an error indicating that the class cannot be found.
Procedure
Solution 1:
If the file size is small, set the default file size to a value greater than the maximum file size. For example, if the maximum file size is 95 MB, run the following command:
hive.exec.copyfile.maxsize=104857600
Solution 2:
If the file size is large, use DistCp to improve the data migration efficiency. Add the following parameters when starting the Spark task:
--conf spark.hadoop.net.topology.node.switch.mapping.impl=org.apache.hadoop.net.ScriptBasedMapping
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.