
Configuring the Clearing of Residual Files When a Spark Job Fails

This section applies only to MRS 3.3.1-LTS or later.

Scenario

When a Spark job fails, residual files may be left behind. These files accumulate over time and can trigger disk space alarms. You are advised to clean them up periodically.

Constraints

  • This feature requires the Spark JDBCServer service to be started. The resident JDBCServer process periodically deletes the residual files.
  • Both Spark client parameters and Spark JDBCServer server parameters must be configured for this feature.
  • The following directories can be cleared:
    • /user/User/.sparkStaging/
    • /tmp/sparkhive-scratch/User
  • This feature supports only the scenario where Yarn is used as the resource scheduler.

Parameter Configuration

  1. On the Spark client, modify the following parameter in the Client installation directory/Spark/spark/conf/spark-defaults.conf file (see the example after the table).

Parameter: spark.yarn.session.to.application.clean.enabled
Description: If this parameter is set to true, Spark periodically deletes residual files.
Default Value: false
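For reference, a minimal sketch of this client-side change follows, assuming the standard space-separated spark-defaults.conf format; setting the value to true simply enables the cleanup described above.

    # Enable periodic deletion of residual files left behind by failed Spark jobs
    spark.yarn.session.to.application.clean.enabled  true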

  2. Log in to FusionInsight Manager, choose Cluster > Services > Spark, click Configurations, and then click All Configurations. On the displayed page, click JDBCServer(Role) and then Custom. Add the following parameters in the custom area (see the example after the table), and restart the JDBCServer service.

Parameter: spark.yarn.session.to.application.clean.enabled
Description: If this parameter is set to true, Spark periodically deletes residual files.
Default Value: false

Parameter: spark.clean.residual.tmp.dir.init.delay
Description: Initial delay before residual files are cleared, in minutes.
Default Value: 5

Parameter: spark.clean.residual.tmp.dir.period.delay
Description: Interval at which residual files are deleted, in minutes.
Default Value: 10
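For reference, a minimal sketch of the JDBCServer custom parameters follows, written in name = value form for readability; in FusionInsight Manager each pair is entered as a separate custom parameter, and the delay values shown are simply the documented defaults.

    # Enable periodic deletion of residual files on the JDBCServer side
    spark.yarn.session.to.application.clean.enabled = true
    # Wait 5 minutes after JDBCServer starts before the first cleanup run
    spark.clean.residual.tmp.dir.init.delay = 5
    # Repeat the cleanup every 10 minutes
    spark.clean.residual.tmp.dir.period.delay = 10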