Scheduled Cluster Maintenance: July 7th 2014

The first of our monthly scheduled maintenance windows will take place on Monday, July 7th, from 7am to 11am EDT.

Maintenance work will be conducted on:

  • All compute nodes (all running SLURM jobs will be terminated)
  • Specific storage systems: hernquistfs2, rcnfs09, fink1, ghernquist09, panlsf2, aagfs1, regal, holyscratch
  • Network connectivity between campus and MGHPCC in Holyoke, MA

We have also added downtime-aware code to the SLURM scheduler. If you do not specify -t for a job, or the time limit you request would carry the job into the downtime window, the submission is rejected with a message like the one shown below. This lets us manage the cluster carefully without having to terminate many hundreds of thousands of jobs, and it gives you an explanation and a suggested time limit rather than only the standard error "Batch job submission failed: Requested time limit is invalid".

If you have any questions, feel free to email rchelp@fas.harvard.edu or stop by our regular office hours at 38 Oxford Street, held every Wednesday from 12pm to 3pm.

Example message for a job that would run into the downtime:

[jwm@slurm-test:pts/0 ~> sbatch -p general -t 08-00:00 --wrap=hostname
sbatch: error:

Your job has not been submitted.

The Odyssey cluster has a scheduled maintenance downtime
starting at 2014-07-07 07:00:00 EDT.

Your job will not end before the downtime. Please specify
a shorter time limit for your job, such as:

-t 06-17:38

This will give your job the most possible time to run before
the downtime. If your job does not finish before the downtime
starts, it will be terminated then.

sbatch: error: Batch job submission failed: Requested time limit is
invalid (missing or exceeds some limit)
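
The suggested limit in the message above is the time remaining until the downtime, rounded down to whole minutes and written in SLURM's days-hours:minutes format. The sketch below shows one way to reproduce that arithmetic yourself. It is not the code running on the scheduler: the function name max_time_limit and the example submission time are hypothetical, and the date calls assume GNU date.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of SLURM): print the longest time limit,
# as DD-HH:MM, that still ends before a downtime.
# Arguments: current time and downtime start, both in Unix epoch seconds.
max_time_limit() {
    local now=$1 downtime=$2
    local remaining=$(( downtime - now ))
    if (( remaining < 60 )); then
        echo "less than one minute left before the downtime" >&2
        return 1
    fi
    local minutes=$(( remaining / 60 ))   # round down to whole minutes
    printf '%02d-%02d:%02d\n' \
        $(( minutes / 1440 )) $(( minutes % 1440 / 60 )) $(( minutes % 60 ))
}

# Assumed submission time of 13:21:30 EDT on 2014-06-30 (GNU date syntax).
now=$(date -d '2014-06-30 13:21:30 -0400' +%s)
downtime=$(date -d '2014-07-07 07:00:00 -0400' +%s)
max_time_limit "$now" "$downtime"   # prints 06-17:38
```

The printed value can then be passed straight to sbatch, for example: sbatch -p general -t 06-17:38 --wrap=hostname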

