Why Does Spark Fail to Export a Table with Duplicate Field Names?

Updated on 2024-12-13 GMT+08:00


Question

Why does the following code fail when it is run in spark-shell?

val acctId = List(("49562", "Amal", "Derry"), ("00000", "Fred", "Xanadu"))
val rddLeft = sc.makeRDD(acctId)
val dfLeft = rddLeft.toDF("Id", "Name", "City")
//dfLeft.show
val acctCustId = List(("Amal", "49562", "CO"), ("Dave", "99999", "ZZ"))
val rddRight = sc.makeRDD(acctCustId)
val dfRight = rddRight.toDF("Name", "CustId", "State")
//dfRight.show
val dfJoin = dfLeft.join(dfRight, dfLeft("Id") === dfRight("CustId"), "outer")
dfJoin.show
dfJoin.repartition(1).write.format("com.databricks.spark.csv").option("delimiter", "\t").option("header", "true").option("treatEmptyValuesAsNulls", "true").option("nullValue", "").save("/tmp/outputDir")

Answer

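The export fails because the DataFrame being saved contains two columns with the same name.

Both dfLeft and dfRight have a Name column, so the joined dfJoin contains two columns named Name. The join itself and dfJoin.show succeed, but Spark checks for duplicate column names when the data is written and rejects the save. Modify the code so that the saved data contains no duplicate column names, for example by renaming or dropping one of the Name columns before writing.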
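One possible fix, shown here as a minimal sketch rather than the only approach, is to rename the clashing column on one side before the join. The column name CustName and the application name are hypothetical and chosen only for this example; the data, join condition, and write options are taken from the question. The multi-line method chains assume the code is entered in spark-shell with :paste.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate reuses it.
// The application name is an assumption made for this sketch.
val spark = SparkSession.builder().appName("DuplicateColumnExport").getOrCreate()
import spark.implicits._

val dfLeft = Seq(("49562", "Amal", "Derry"), ("00000", "Fred", "Xanadu"))
  .toDF("Id", "Name", "City")

// Rename the right-hand "Name" column so it no longer clashes with dfLeft("Name").
// "CustName" is a hypothetical name, not one used in the original code.
val dfRight = Seq(("Amal", "49562", "CO"), ("Dave", "99999", "ZZ"))
  .toDF("Name", "CustId", "State")
  .withColumnRenamed("Name", "CustName")

val dfJoin = dfLeft.join(dfRight, dfLeft("Id") === dfRight("CustId"), "outer")

// Every column name in the joined result is now unique.
val names = dfJoin.columns
assert(names.distinct.length == names.length)

// Same write options as in the question.
dfJoin.repartition(1).write
  .format("com.databricks.spark.csv")
  .option("delimiter", "\t")
  .option("header", "true")
  .option("treatEmptyValuesAsNulls", "true")
  .option("nullValue", "")
  .save("/tmp/outputDir")
```

If only one of the two Name values is needed in the output, dropping the other with dfJoin.drop(dfRight("Name")) on the original, unrenamed join is an alternative that also leaves a single Name column.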
