
Automatic Job Checkpointing and Restart on Odyssey

 

WARNING

This method of checkpointing does not work with most applications. We have had trouble with BLCR in general, so checkpointing is not yet a supported feature of the Odyssey cluster. The example below worked at one point, but it may not work now or in the future.



It is possible to automatically checkpoint and restart serial jobs on Odyssey. To do so, make the following modifications to your job script:

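#!/bin/sh
#BSUB -q long_serial
#BSUB -J mrbays_test
#BSUB -n 1
#BSUB -u hptc@fas.harvard.edu
#BSUB -o mrbays_lsf.out
#BSUB -e mrbays_lsf.err
# Checkpoint into mrbays.ckpt every 60 minutes using the blcr method.
#BSUB -k "mrbays.ckpt 60 method=blcr"

# Preload the BLCR runtime library so the program can be checkpointed.
export LD_PRELOAD=libcr_run.so.0
# Run the program, sending stdout and stderr to separate files.
mb test.bay 1> mb.out 2> mb.err
exit 0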


This should create a checkpoint directory called mrbays.ckpt in the directory where the job runs. If a node crashes or the job is otherwise interrupted, you can restart it with

brestart mrbays.ckpt

The job then restarts from its most recent checkpoint.
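
Because the restart depends on the checkpoint directory actually being there, it can help to check for it first. The following is a minimal sketch, not part of the Odyssey tooling: the function name restart_from_ckpt and the DRY_RUN variable are hypothetical, and the only LSF command used is the brestart call shown above.

```shell
#!/bin/sh
# Hypothetical helper: call brestart only if the checkpoint directory exists.
# With DRY_RUN=1 the command is printed instead of being run.
restart_from_ckpt() {
    ckpt_dir=$1
    if [ ! -d "$ckpt_dir" ]; then
        echo "no checkpoint directory: $ckpt_dir" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "brestart $ckpt_dir"
    else
        brestart "$ckpt_dir"
    fi
}
```

Calling restart_from_ckpt mrbays.ckpt from the directory where the job ran would then either restart the job or report that no checkpoint was found.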

If your job gets suspended, you can use bkill to kill it and then brestart to restart it from its checkpoint.
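
One way to tell whether a job is suspended is the STAT column of bjobs, which LSF sets to PSUSP, USUSP, or SSUSP for suspended jobs. The sketch below assumes the default bjobs output layout, where STAT is the third column. The function names is_suspended and job_state and the job ID 12345 are placeholders for illustration.

```shell
#!/bin/sh
# Sketch: decide whether a job is suspended before killing and restarting it.

# Succeed if the given LSF job state is one of the suspended states.
is_suspended() {
    case "$1" in
        PSUSP|USUSP|SSUSP) return 0 ;;
        *) return 1 ;;
    esac
}

# Read default `bjobs` output on stdin and print the STAT field
# (third column of the first line after the header).
job_state() {
    awk 'NR == 2 { print $3 }'
}

# Usage (12345 is a placeholder job ID):
#   state=$(bjobs 12345 | job_state)
#   if is_suspended "$state"; then
#       bkill 12345 && brestart mrbays.ckpt
#   fi
```

Depending on how quickly the scheduler processes the kill, it may be necessary to wait until the job has actually exited before running brestart.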

Site last updated June 7, 2013