Clearing Residual Files When a Spark Job Fails
Scenarios
If a Spark job fails during execution, the system may not promptly or completely clean up the temporary files it generated, resulting in residual files remaining on the system. Over time, these files can accumulate and consume significant disk space. If disk usage becomes excessive, it may trigger disk space alarms and potentially disrupt the normal operation of the entire system.
To prevent this issue, clear residual files regularly. Setting up a scheduled task to remove these files ensures efficient disk space usage and helps keep the system stable and reliable.
Notes and Constraints
- This section applies only to MRS 3.3.1-LTS or later.
- To use this feature, you must start the Spark JDBCServer service. This service runs a resident process and periodically deletes temporary files generated during job execution.
- This feature also requires configuring parameters on both the Spark client and the Spark JDBCServer.
- The following directories can be cleared, where User/user indicates the user who submitted the job (see the command example after this list):
- /user/User/.sparkStaging/
- /tmp/hive-scratch/user
- This feature is supported only when YARN is used as the resource scheduler.
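Before enabling the cleanup, you can check how much space these directories currently occupy. The following is a minimal sketch using standard HDFS commands; replace <username> with the name of the user who submits the Spark jobs:

    # Check the disk usage of the residual-file directories
    hdfs dfs -du -h /user/<username>/.sparkStaging/
    hdfs dfs -du -h /tmp/hive-scratch/<username>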
Configuring Parameters
- Install the Spark client. For details, see Installing a Client.
- Log in to the Spark client node as the client installation user and modify the following parameter in the {Client installation directory}/Spark/spark/conf/spark-defaults.conf file:
Parameter: spark.yarn.session.to.application.clean.enabled
Description: Whether Spark periodically deletes the residual temporary files generated by jobs that failed during execution.
- true: Spark periodically deletes residual files.
- false: This function is disabled. This is the default value.
Example Value: true
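For reference, the resulting entry in spark-defaults.conf would look like the following minimal sketch:

    # {Client installation directory}/Spark/spark/conf/spark-defaults.conf
    # Enable periodic deletion of residual temporary files (default: false)
    spark.yarn.session.to.application.clean.enabled=true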
- Log in to FusionInsight Manager.
- Choose Cluster > Services > Spark. Click Configurations and then All Configurations, choose JDBCServer(Role) > Customization, and add the following parameters as custom configurations:
Parameter: spark.yarn.session.to.application.clean.enabled
Description: Whether Spark periodically deletes the residual temporary files generated by jobs that failed during execution.
- true: Spark periodically deletes residual files.
- false: This function is disabled. This is the default value.
Example Value: true

Parameter: spark.clean.residual.tmp.dir.init.delay
Description: Initial delay, in minutes, before Spark performs the first cleanup.
Example Value: 5

Parameter: spark.clean.residual.tmp.dir.period.delay
Description: Interval, in minutes, at which Spark performs subsequent cleanups.
Example Value: 10
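For reference, the three custom entries correspond to the following key-value pairs. This is a sketch of the values described above; they are entered through the Customization panel rather than a configuration file:

    # Custom parameters for JDBCServer(Role)
    spark.yarn.session.to.application.clean.enabled=true
    # Wait 5 minutes after JDBCServer starts before the first cleanup
    spark.clean.residual.tmp.dir.init.delay=5
    # Then run the cleanup every 10 minutes
    spark.clean.residual.tmp.dir.period.delay=10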
- After modifying the parameters, click Save, perform the operations as prompted, and wait until the settings are saved successfully.
- After the Spark server configurations are updated, if Configuration Status is Expired, restart the component for the configurations to take effect.
Figure 1 Modifying Spark configurations
On the Spark dashboard page, choose More > Restart Service or Service Rolling Restart, enter the administrator password, and wait until the service restarts.
Components are unavailable during the restart, which affects upper-layer services in the cluster. To minimize the impact, perform this operation during off-peak hours or after confirming that it will not adversely affect services.