Automatic Job Checkpointing and Restart on Odyssey
This method of checkpointing does not work with most applications. We have had trouble with BLCR (Berkeley Lab Checkpoint/Restart) in general, so it is not yet a supported feature of the Odyssey cluster. The following example worked at one point, but it may not work now or in the future.
It is possible to automatically checkpoint and restart serial jobs on Odyssey. To do so, add a checkpoint directive (the #BSUB -k line) to your LSF job script, as in the following example:
#BSUB -q long_serial
#BSUB -J mrbays_test
#BSUB -n 1
#BSUB -u firstname.lastname@example.org
#BSUB -o mrbays_lsf.out
#BSUB -e mrbays_lsf.err
#BSUB -k "mrbays.ckpt 60 method=blcr"
mb test.bay 1> mb.out 2> mb.err
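This should create a checkpoint directory called mrbays.ckpt in the directory you are running in. The 60 in the -k line is the checkpoint period in minutes, so the job is checkpointed once an hour. If a node crashes or the job otherwise dies, you can restart it with brestart, and the job picks up again from its last checkpoint.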
If your job gets suspended, you can use bkill to kill it and then brestart to restart it from its checkpoint.
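The restart step mentioned above can be sketched as follows. This is only an illustration, not a tested recipe for Odyssey: it assumes brestart is called with the checkpoint directory followed by the job ID, the helper name restart_cmd is made up for this example, and 12345 is a placeholder for the job ID that bsub printed when you submitted the job. The helper only prints the command so you can check it before running it.

```shell
#!/bin/sh
# Sketch: build the brestart command for a checkpointed LSF job.
# Assumptions: brestart takes the checkpoint directory, then the job ID.
# restart_cmd is a hypothetical helper; 12345 is a placeholder job ID.

restart_cmd() {
    ckpt_dir=$1   # checkpoint directory named on the "#BSUB -k" line
    job_id=$2     # ID of the job that was checkpointed
    # Print the command instead of running it, so it can be reviewed first.
    printf 'brestart %s %s\n' "$ckpt_dir" "$job_id"
}

restart_cmd mrbays.ckpt 12345
```

Run the printed command from the directory that contains mrbays.ckpt. For a suspended job, run bkill with the same job ID first, then the brestart command.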