
Disk Space Is Insufficient Due to Long-Term Running of JDBCServer

Issue

When a spark-sql task is submitted to the Yarn cluster through the Spark JDBCServer service, the data disk of the Core node becomes fully occupied after the task has run for a period of time.

Symptom

A customer submits spark-sql tasks to the Yarn cluster through the Spark JDBCServer service. After the tasks run for a period of time, the data disk of the Core node becomes fully occupied.

A check of the disk usage in the background shows that the JDBCServer application has generated a large number of temporary files (shuffle files) that are never cleared, occupying a large amount of disk space.

Cause Analysis

After checking the directories that contain a large number of files on the Core node, it is found that most of them are directories similar to blockmgr-033707b6-fbbb-45b4-8e3a-128c9bcfa4bf, which store temporary shuffle files generated during computing.
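For reference, the following is a minimal sketch of how such directories can be located on a Core node. It assumes the data-disk layout used by the cleanup script later in this section (disk data1, user spark2x on a security cluster or omm on a common cluster); adjust the path to your environment.

    #!/bin/bash
    # Hypothetical inspection commands; adjust data1 and the user directory (spark2x or omm) to your cluster.
    APPCACHE=/srv/BigData/data1/nm/localdir/usercache/spark2x/appcache
    # Overall usage of the data disk.
    df -h /srv/BigData/data1
    # The 20 largest blockmgr directories, largest first.
    du -sh "$APPCACHE"/application_*/blockmgr* 2>/dev/null | sort -rh | head -n 20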

The dynamic resource allocation function of Spark is enabled on JDBCServer, and the shuffle data is hosted by the external shuffle service on NodeManager. NodeManager manages these files only based on the lifecycle of the application and does not check whether the container where a single executor runs still exists. Therefore, the temporary files are deleted only when the application stops. When a task runs for a long time, a large number of temporary files accumulate and occupy a large amount of disk space.
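A quick way to confirm that a cluster is in this situation is to check the two Spark settings involved. This is a minimal sketch, assuming the Spark2x client is installed under /opt/client; adjust the path to the actual client directory.

    # Both values are expected to be true when dynamic allocation is enabled and shuffle is hosted by NodeManager.
    grep -E 'spark.dynamicAllocation.enabled|spark.shuffle.service.enabled' \
        /opt/client/Spark2x/spark/conf/spark-defaults.conf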

Solution

Versions earlier than MRS 3.2.1-LTS:

Create a scheduled task that deletes shuffle files older than a specified period. For example, run the task every hour to delete shuffle files that have existed for more than 6 hours.

  1. Create the clean_appcache.sh script. If there are multiple data disks, change the value of data1 in BASE_LOC based on the actual situation.

    • Security cluster
      #!/bin/bash
      # Shuffle temporary directories of the spark2x user; if there are multiple data disks, adjust data1 accordingly.
      BASE_LOC=/srv/BigData/data1/nm/localdir/usercache/spark2x/appcache/application_*/blockmgr*
      # Delete empty directories that have not been modified for more than 6 hours (360 minutes).
      find $BASE_LOC/ -mmin +360 -exec rmdir {} \;
      # Delete files that have not been modified for more than 6 hours (360 minutes).
      find $BASE_LOC/ -mmin +360 -exec rm {} \;
    • Common cluster
      #!/bin/bash
      # Shuffle temporary directories of the omm user; if there are multiple data disks, adjust data1 accordingly.
      BASE_LOC=/srv/BigData/data1/nm/localdir/usercache/omm/appcache/application_*/blockmgr*
      # Delete empty directories that have not been modified for more than 6 hours (360 minutes).
      find $BASE_LOC/ -mmin +360 -exec rmdir {} \;
      # Delete files that have not been modified for more than 6 hours (360 minutes).
      find $BASE_LOC/ -mmin +360 -exec rm {} \;
    • Before executing the script, check that the script is consistent with the one provided in this document, to prevent data in other paths from being deleted by mistake.
    • Before executing the script, check whether the path specified by the BASE_LOC variable exists in the cluster. If the path does not exist, add the path. A dry run of the cleanup is shown in the sketch after this procedure.

  2. Run the following command to make the script executable:

    chmod 755 clean_appcache.sh

  3. Add a scheduled task to start the cleanup script. Change the script path to the actual path.

    Run the crontab -l command to view the scheduled task.

    Run the crontab -e command to edit the scheduled tasks and add the following line:

    0 * * * * sh /root/clean_appcache.sh > /dev/null 2>&1
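Before relying on the scheduled task, it can help to verify the path and preview what would be deleted. This is a minimal sketch, assuming the security-cluster path from the script above (use omm instead of spark2x on a common cluster).

    # Confirm that the appcache path exists on this Core node.
    ls -d /srv/BigData/data1/nm/localdir/usercache/spark2x/appcache
    # Dry run: print the entries older than 6 hours that the script would remove, without deleting anything.
    BASE_LOC=/srv/BigData/data1/nm/localdir/usercache/spark2x/appcache/application_*/blockmgr*
    find $BASE_LOC/ -mmin +360 -print
    # Confirm that the scheduled task has been registered.
    crontab -l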

MRS 3.2.1-LTS or later:

  1. Log in to Manager and choose Cluster > Services > Spark > Configurations > All Configurations. Click JDBCServer, select Customization, and add the custom configuration item spark.shuffle.service.removeShuffle with the value true to the custom parameter.

  2. Click Save. Choose Cluster > Services > Spark > Instances, select all JDBCServer instances, and choose More > Instance Rolling Restart to restart all JDBCServer instances.
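After the rolling restart, the effect can be observed on a Core node by watching the shuffle footprint of the long-running JDBCServer application. This is a minimal sketch, assuming the same directory layout as above (security cluster, disk data1); with spark.shuffle.service.removeShuffle set to true, shuffle data of released executors should be removed while the application is still running.

    #!/bin/bash
    # Hypothetical check: report the number and total size of blockmgr directories per application.
    # Re-run periodically and compare the results over time.
    APPCACHE=/srv/BigData/data1/nm/localdir/usercache/spark2x/appcache
    for app in "$APPCACHE"/application_*; do
      [ -d "$app" ] || continue
      count=$(find "$app" -maxdepth 1 -type d -name 'blockmgr*' | wc -l)
      size=$(du -sh "$app" | cut -f1)
      echo "$app: $count blockmgr directories, $size"
    done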