Disk Space Is Insufficient Due to Long-Term Running of JDBCServer
Issue
When the JDBCServer service submits spark-sql tasks to the Yarn cluster, the data disk of the Core node becomes fully occupied after the service has been running for a period of time.
Symptom
After the JDBCServer service has been submitting spark-sql tasks to the Yarn cluster for a period of time, the data disk of the Core node is fully occupied.
A check of the disk usage shows that the JDBCServer service has accumulated too many temporary application files (files generated by shuffle). These files are not cleared and occupy a large amount of disk space.
Cause Analysis
On the Core node, most of the directories that contain a large number of files have names similar to blockmgr-033707b6-fbbb-45b4-8e3a-128c9bcfa4bf. These directories store temporary shuffle files generated during computing.
The dynamic resource allocation function of Spark is enabled on JDBCServer, and shuffle is hosted by NodeManager. NodeManager manages these files only based on the lifecycle of the application and does not check whether the container of an individual executor still exists. Therefore, the temporary files are deleted only when the application stops. Because JDBCServer runs for a long time, a large number of temporary files accumulate and occupy a large amount of disk space.
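To confirm this diagnosis on a Core node, the space held by each shuffle directory can be listed. The following is a minimal sketch, assuming the NodeManager local directory layout used by the scripts in the Solution section. The function name blockmgr_usage and the APPCACHE variable are hypothetical and not part of MRS.

```shell
#!/bin/bash
# Hypothetical helper: print the size (in KB) and path of every blockmgr-*
# directory found under <appcache>/application_*/, largest first.
blockmgr_usage() {
    local appcache="$1"
    # du -s: one total per directory; -k: sizes in KB so sort -rn can order them.
    find "$appcache" -mindepth 2 -maxdepth 2 -type d -name 'blockmgr-*' \
        -exec du -sk {} + 2>/dev/null | sort -rn
}

# Assumed path (security cluster, first data disk). Adjust the user and disk as needed.
APPCACHE=${APPCACHE:-/srv/BigData/data1/nm/localdir/usercache/spark2x/appcache}
blockmgr_usage "$APPCACHE" | head -n 20
```

The sketch only reads the directory tree and deletes nothing, so it can be run before deciding on a retention period.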
Solution
Versions earlier than MRS 3.2.1-LTS:
Start a scheduled task that deletes shuffle files older than a specified age. For example, run the task every hour to delete shuffle files that have been stored for more than 6 hours.
- Create the clean_appcache.sh script. If there are multiple data disks, change the value of data1 in BASE_LOC based on the actual situation.
- Security cluster
#!/bin/bash
BASE_LOC=/srv/BigData/data1/nm/localdir/usercache/spark2x/appcache/application_*/blockmgr*
find $BASE_LOC/ -mmin +360 -exec rmdir {} \;
find $BASE_LOC/ -mmin +360 -exec rm {} \;
- Common cluster
#!/bin/bash
BASE_LOC=/srv/BigData/data1/nm/localdir/usercache/omm/appcache/application_*/blockmgr*
find $BASE_LOC/ -mmin +360 -exec rmdir {} \;
find $BASE_LOC/ -mmin +360 -exec rm {} \;
- Before executing the script, check whether the script to be executed is consistent with that in the document to prevent data in other paths from being deleted by mistake.
- Before executing the script, check whether the path specified by the BASE_LOC variable exists in the cluster. If the path does not exist, add the path.
- Run the following command to change the permissions of the script:
chmod 755 clean_appcache.sh
- Add a scheduled task to start the cleanup script. Change the script path to the actual path.
Run the crontab -l command to view the existing scheduled tasks.
Run the crontab -e command to edit the scheduled tasks and add the following line:
0 * * * * sh /root/clean_appcache.sh > /dev/null 2>&1
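Before the scheduled task is enabled, it may help to preview which files the 6-hour rule would match. The following is a minimal dry-run sketch, assuming the same directory layout as clean_appcache.sh. The function name list_stale_shuffle and the APPCACHE variable are hypothetical, and nothing is deleted.

```shell
#!/bin/bash
# Hypothetical dry-run helper: list regular files inside blockmgr* directories
# that were last modified more than <minutes> minutes ago. Nothing is deleted.
list_stale_shuffle() {
    local appcache="$1" minutes="${2:-360}"
    find "$appcache" -path '*/application_*/blockmgr*/*' -type f \
        -mmin +"$minutes" -print 2>/dev/null
}

# Assumed path (security cluster, first data disk). Adjust the user and disk as needed.
APPCACHE=${APPCACHE:-/srv/BigData/data1/nm/localdir/usercache/spark2x/appcache}
list_stale_shuffle "$APPCACHE" 360 | head -n 20
```

If the listing contains anything outside the blockmgr directories, the path in the script should be rechecked before the scheduled task is added.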
MRS 3.2.1-LTS or later:
- Log in to Manager, choose Cluster > Services > Spark > Configurations > All Configurations, click JDBCServer, select Customization then custom, and add the custom configuration item spark.shuffle.service.removeShuffle=true.

- Click Save. Choose Cluster > Services > Spark > Instances, select all JDBCServer instances, and choose More > Instance Rolling Restart to restart all JDBCServer instances.
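After the restart, the setting can be checked in the configuration file that the JDBCServer instance actually loads. The following is a minimal sketch, assuming the custom item ends up in a spark-defaults.conf style properties file. The file location differs between installations, so CONF_FILE is a placeholder, and the function name remove_shuffle_enabled is hypothetical.

```shell
#!/bin/bash
# Hypothetical check: succeed if a Spark properties file sets
# spark.shuffle.service.removeShuffle to true.
# Both "key value" and "key=value" forms are accepted; commented lines are ignored.
remove_shuffle_enabled() {
    local conf_file="$1"
    grep -Eq '^[[:space:]]*spark\.shuffle\.service\.removeShuffle[[:space:]=]+true[[:space:]]*$' "$conf_file"
}

# CONF_FILE is an assumption: point it at the file used by your JDBCServer instance.
CONF_FILE=${CONF_FILE:-./spark-defaults.conf}
if [ -f "$CONF_FILE" ] && remove_shuffle_enabled "$CONF_FILE"; then
    echo "removeShuffle is enabled in $CONF_FILE"
else
    echo "removeShuffle is not set to true in $CONF_FILE"
fi
```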