# First Monday Cluster Maintenance

Starting in July 2014, Research Computing will institute a monthly scheduled maintenance window to make hardware repairs, apply software and firmware updates, and perform general manufacturer-recommended maintenance on our environment. We have avoided this practice in the past to provide as much uptime to our customers as possible. However, our infrastructure has grown so quickly that we now find it necessary to schedule monthly maintenance to ensure efficiency and avoid potential issues.

We plan to use the first Monday of every month, from 7AM to 11AM, as our maintenance window. Although we are reserving this time every month, we may not have activities planned for a given month. We will post a notice of activities in the events section of the RC website one week before each monthly window, and we will also notify users by email. Should the first Monday fall on a holiday, we will use the following Monday for any scheduled work.
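For users who want to plan job submissions around the window, the scheduling rule above can be sketched in Python. This is only an illustration, not an RC tool: the function names and the `holidays` argument are hypothetical, the caller must supply the set of holiday dates, and repeating the one-week shift while the date is still a holiday is an assumption about how the rule would be applied.

```python
from datetime import date, datetime, time, timedelta


def first_monday(year, month):
    """Return the date of the first Monday of the given month."""
    first = date(year, month, 1)
    # date.weekday() returns 0 for Monday, so this is the number of
    # days to add to reach the first Monday (0 if the 1st is a Monday).
    offset = (0 - first.weekday()) % 7
    return first + timedelta(days=offset)


def maintenance_window(year, month, holidays=frozenset()):
    """Return (start, end) datetimes of the 7AM-11AM window.

    `holidays` is a caller-supplied set of dates (hypothetical input);
    if the first Monday is in it, the following Monday is used instead.
    """
    day = first_monday(year, month)
    while day in holidays:  # assumption: shift by one week per holiday
        day += timedelta(days=7)
    return datetime.combine(day, time(7, 0)), datetime.combine(day, time(11, 0))


if __name__ == "__main__":
    # Example: treat Monday, September 1, 2014 (Labor Day) as a holiday.
    start, end = maintenance_window(2014, 9, holidays={date(2014, 9, 1)})
    print(start, "to", end)
```

The published notice on the RC website remains the authoritative source for whether any work is actually planned in a given month.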

Most work will not affect all of our services. We will announce which ones will be affected one week prior to the activity.

Thank you for your anticipated cooperation. If you have any questions/comments, please email us at rchelp@fas.harvard.edu.

Q and A

Q: Why are you starting this now? RC never had maintenance windows before and everything worked fine.

A: Due to growth in our compute environment and the increasing complexity of the systems we deploy, we felt it prudent to arrange for a regular time when we could comfortably and without pressure fix problems or update facilities with minimal impact to our customers. Most, if not all, major HPC centers have regular maintenance schedules.

Q: Why Mondays at 7AM-11AM? Why not do this late at night?

A: We have observed that our services are least busy on Monday mornings, so using this time period should not interrupt most of our users. In the unlikely event that a problem extends past the scheduled downtime, our full staff will be fresh and available to assist in repairs and quickly restore service.

Q: I have a job currently running that has been active for XX days. Since the compute nodes will be unavailable, what will happen to all that work? Help!

A: For long-running jobs, we strongly recommend checkpointing your results on a periodic basis. We will be providing checkpoint instructions on our website soon.
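Until those instructions are available, the general idea of application-level checkpointing can be sketched as follows. This is a minimal illustration and not RC's recommended procedure: the file name `checkpoint.json`, the function names, and the trivial "work" done in the loop are all hypothetical stand-ins for your own program's state and computation.

```python
import json
import os
import tempfile

CHECKPOINT = "checkpoint.json"  # hypothetical file name


def load_checkpoint(path=CHECKPOINT):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(state, path=CHECKPOINT):
    """Write the state to a temporary file, then rename it into place.

    Renaming with os.replace means an interrupted write does not
    leave a half-written checkpoint at `path`.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(n_steps, every=100, path=CHECKPOINT):
    """Resume from the last checkpoint and save progress every `every` steps."""
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        state["total"] += step  # stand-in for the real computation
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state["total"]


if __name__ == "__main__":
    print(run(1000))
```

If the job is stopped and resubmitted, `run` picks up at the last saved step instead of starting over, so at most `every` steps of work are repeated.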

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).