Checkpointing to Save Job Progress
Saving Checkpoints for Long Simulations
What is Checkpointing?
Checkpointing is the process of saving necessary data from a running simulation, usually implemented either to restart a job or as a safe point in case of system failure.
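As a rough illustration of the idea, here is a minimal Python sketch of a checkpoint and restart loop. It is not how any particular solver implements restarts: the function names, the JSON file format, and the "step"/"value" state are all assumptions made for the example. The state is written to a temporary file and then renamed, so a failure during the write should not corrupt the previous checkpoint.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path via a temporary file and an atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())  # push the data to disk before renaming
    os.replace(tmp_path, path)  # replaces any older checkpoint in one step


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path) as handle:
        return json.load(handle)


def run(path, total_steps, checkpoint_every):
    """Resume from the last checkpoint (if any) and run to total_steps."""
    state = load_checkpoint(path, {"step": 0, "value": 0.0})
    while state["step"] < total_steps:
        state["value"] += 0.5  # hypothetical stand-in for one solver iteration
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final state, so a later run can extend it
    return state
```

Calling run() again with the same path after an interruption picks up from the last saved step instead of starting from step 0.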
Good Practices for Checkpointing
Plan Ahead
- Many software applications provide options for checkpointing/restarting your simulations. Before starting your simulations, ensure that the relevant checkpointing flags are turned on.
- Every application uses different nomenclature and formatting, so ensure that you have the appropriate flags for the application in use.
- Alternatively, the Rescale platform has a native checkpointing function, Snapshot, that enables users to easily store intermediate files. Note that this method is not optimal for restarts.
Software-Based Checkpointing and Restart Procedures
- Abaqus – Abaqus Tutorial to Restart Simulation
- Converge – How do I manually restart a Converge job?
- ANSYS Fluent – How do I insert a check/exit file to my running job?
- ANSYS CFX – How do I insert a stop/restart file to my running job?
- Star-CCM+ – How do I insert a checkpoint/stop file to my running job?
- LS-DYNA – LS-DYNA Restart Tutorial
Checkpoint Relevant Information Only
- It is good practice to save only relevant information that is needed for restarting your simulations.
- Excessive writing of data could lead to Out of Memory related system failures or slow down the simulation process.
- Generally, most applications allow for the writing of restart files, which can be used to restart the simulations. For example, Abaqus writes .res files that can be used to restart the simulation from the last computed iteration/step.
Monitoring Simulation
- For long-running simulations, it is recommended that you monitor your job at regular intervals, as doing so will enable you to catch potential errors as they arise.
- In addition to identifying errors, regular monitoring will allow you to check progress and stop simulations in cases where the application does not automatically stop after an error.
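Part of this monitoring can be scripted. The sketch below assumes the application appends plain-text messages to a log file and that problems show up as lines containing words such as "error" or "fatal". Both the keywords and the function name scan_log are assumptions for the example, so check what your application actually prints. The function reads only what was appended since the previous call and returns the matching lines.

```python
def scan_log(path, offset=0, keywords=("error", "fatal")):
    """Return (matching_lines, new_offset) for text appended since offset.

    The keywords are hypothetical; adjust them to the messages your
    application writes.
    """
    with open(path, "rb") as handle:
        handle.seek(offset)
        chunk = handle.read()

    # Consume complete lines only; a partially written last line is left
    # for the next call.
    end = chunk.rfind(b"\n") + 1

    hits = []
    for raw in chunk[:end].splitlines():
        line = raw.decode("utf-8", errors="replace")
        if any(keyword in line.lower() for keyword in keywords):
            hits.append(line)
    return hits, offset + end
```

Calling scan_log every few minutes and passing back the returned offset keeps each check cheap, because lines that were already inspected are not read again.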
Try to Avoid the Following
Checkpointing Too Often
- Excessive checkpointing takes up the available storage on the cloud instance. This will interrupt the simulation and lead to system errors caused by insufficient memory.
- Excessively writing output files will also slow down the simulation process and increase the overall job time.
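One way to limit the storage used is to keep only the most recent checkpoint files. The following is a minimal sketch that assumes checkpoints are written into one directory with names matching checkpoint_*.dat. That pattern and the function name are hypothetical, so substitute whatever your application produces, and make sure the files you delete are not needed for the restart.

```python
from pathlib import Path


def prune_checkpoints(directory, pattern="checkpoint_*.dat", keep=3):
    """Delete all but the `keep` most recently modified matching files.

    Returns the names of the files that were removed.
    """
    files = sorted(
        Path(directory).glob(pattern),
        key=lambda p: (p.stat().st_mtime, p.name),
        reverse=True,  # newest first
    )
    removed = []
    for old in files[keep:]:
        old.unlink()
        removed.append(old.name)
    return removed
```

Keeping more than one file is a deliberate choice here: if a failure occurs while the newest checkpoint is being written, an older one is still available.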
No Checkpointing
- Failure to perform regular checkpointing could result in the loss of progress and data in the event of a system failure.
- For example, if you have a simulation that runs for several days, we advise checkpointing every few hours of simulation time.
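To weigh the two extremes above against each other, a quick estimate can help. The sketch below is simple arithmetic with made-up inputs (a 72-hour run, a checkpoint every 4 hours, 5 minutes to write each checkpoint, 2 GB per file), not measurements from any real job. It ignores the time spent writing when estimating lost work.

```python
def checkpoint_budget(run_hours, interval_hours, write_minutes, size_gb):
    """Estimate the cost of a checkpoint interval for a single run."""
    count = int(run_hours // interval_hours)
    return {
        "checkpoints": count,
        "write_overhead_hours": count * write_minutes / 60,
        "storage_gb_if_all_kept": count * size_gb,
        # Work since the last checkpoint is what a failure can destroy.
        "max_lost_hours": interval_hours,
    }


# Hypothetical inputs, for illustration only.
budget = checkpoint_budget(run_hours=72, interval_hours=4,
                           write_minutes=5, size_gb=2)
print(budget)
```

With these assumed numbers the run writes 18 checkpoints, spends about 1.5 hours writing them, needs 36 GB if every file is kept, and can lose at most about 4 hours of work to a failure.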