HPC Resumable Computing Solution
Scenarios
Many HPC applications, such as LAMMPS and GROMACS, support resumable (checkpoint/restart) computing. In addition, common HPC scheduling software, such as PBS, Slurm, and LSF, can integrate with resumable computing.
This section uses LAMMPS as an example to describe how to perform HPC resumable computing.
Step 1: Install FFTW
Run the following commands to install FFTW:
yum install gcc-gfortran gcc-c++
wget http://www.fftw.org/fftw-3.3.8.tar.gz
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/mpi/gcc/openmpi-2.1.2a1/lib64/
export PATH=/usr/mpi/gcc/openmpi-2.1.2a1/bin:$PATH
tar -zxvf fftw-3.3.8.tar.gz
cd fftw-3.3.8/
./configure --prefix=/opt/fftw CC=gcc MPICC=mpicc --enable-mpi --enable-openmp --enable-threads --enable-avx --enable-shared
make && make install
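Optionally, verify that FFTW was installed under the prefix specified above (a quick sanity check of the expected libraries and headers):
ls /opt/fftw/lib | grep fftw
ls /opt/fftw/include | grep fftw3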
Step 2: Install LAMMPS
- Run the following commands to install LAMMPS:
yum install libpng12-*
wget https://lammps.sandia.gov/tars/lammps-2Aug18.tar.gz
tar -zxvf lammps-2Aug18.tar.gz
cd lammps-2Aug18/src
vi MAKE/Makefile.mpi
- Modify the settings marked in red boxes in Figure 1 and Figure 2, changing the version based on site requirements; a sketch of the typical changes follows.
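The red boxes typically mark the FFT settings. A minimal sketch of the corresponding lines in MAKE/Makefile.mpi, assuming the FFTW prefix /opt/fftw from Step 1:
# hypothetical FFT settings pointing at the FFTW built in Step 1
FFT_INC = -DFFT_FFTW3 -I/opt/fftw/include
FFT_PATH = -L/opt/fftw/lib
FFT_LIB = -lfftw3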
- Run the following command to compile LAMMPS and copy the obtained lmp_mpi file to /share:
make mpi
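Then copy the resulting binary to the shared directory (this assumes /share is the shared file system mounted on all compute nodes, as the job script below expects):
cp lmp_mpi /share/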
Step 3: Configure LAMMPS
- Configure the example input file.
The melt case is used as an example to generate the input file melt.in. In this file, a checkpoint file is automatically written every 100 iterations and stored in /share. The file content is as follows:
# 3d Lennard-Jones melt
units lj
atom_style atomic
lattice fcc 0.8442
region box block 0 20 0 20 0 20
create_box 1 box
create_atoms 1 box
mass 1 1.0
velocity all create 1.44 87287 loop geom
pair_style lj/cut 2.5
pair_coeff 1 1 1.0 1.0 2.5
neighbor 0.3 bin
neigh_modify delay 5 every 1
fix 1 all nve
dump 1 all xyz 100 /share/sample.xyz
run 10000 every 100 "write_restart /share/lennard.restart"
- Create the melt.restart.in input file, which resumes the computation from the checkpoint:
# 3d Lennard-Jones melt
read_restart /share/lennard.restart
run 10000 every 100 "write_restart /share/lennard.restart"
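- (Optional) Verify the checkpoint/restart flow manually before involving the scheduler. This is a sketch; it assumes lmp_mpi and both input files are already in /share:
# start the run in the background, then interrupt it after about a minute
mpiexec --allow-run-as-root -np 2 /share/lmp_mpi -in /share/melt.in &
sleep 60 && kill %1
# a checkpoint file exists once at least 100 steps have completed
ls -l /share/lennard.restart
# resume from the checkpoint
mpiexec --allow-run-as-root -np 2 /share/lmp_mpi -in /share/melt.restart.in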
- Create the PBS job script job.pbs. The script starts a fresh computation when no checkpoint file exists and otherwise resumes from the latest checkpoint:
#!/bin/sh
#PBS -l ncpus=2
#PBS -o lammps_pbs.log
#PBS -j oe
export PATH=/usr/mpi/gcc/openmpi-2.1.2a1/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/mpi/gcc/openmpi-2.1.2a1/lib64/
if [ ! -e "/share/lennard.restart" ]; then
    echo "run at the beginning"
    mpiexec --allow-run-as-root -np 2 /share/lmp_mpi -in /share/melt.in
else
    echo "run from the last checkpoint"
    mpiexec --allow-run-as-root -np 2 /share/lmp_mpi -in /share/melt.restart.in
fi
Step 4: Submit a Job and Run It Without Interruption
Submit and run the job without interrupting it, and check the job running duration.
- Run the following command to submit a job:
qsub job.pbs
- After the job is complete, run the following command to view the job information (replace <job ID> with the ID returned by qsub):
qstat -f <job ID>
As shown in Figure 3, the job runs for 4 minutes and 10 seconds.
Step 5: Submit a Job, Emulate a Computing Interruption, and Complete the Computation Using Resumable Computing
After submitting a job, stop the compute node to emulate a computing interruption. Then, check the job running durations before and after the interruption.
- Run the following command to submit a job:
qsub job.pbs
- After the job runs for about 1 minute and 30 seconds, stop the compute node on which the job runs to emulate an unexpected node release.
- Run the following command to check the job information after the compute node is stopped:
qstat -f <job ID>
Figure 4 Job running before interruption
In this case, the job returns to the queued state (state Q in the qstat output) and waits for available compute resources.
- Start the compute node to provide available computing resources.
- After the job is complete, run the following command to view the job information (replace <job ID> with the ID returned by qsub):
qstat -f <job ID>
As shown in Figure 5, the job runs for 3 minutes and 3 seconds after the node is restarted, which indicates that the computation resumed from the last checkpoint rather than restarting from the beginning.