# Cluster Maintenance, Monday April 4th 7am – 11am

Please note that we have our regularly scheduled Cluster Maintenance happening this Monday, April 4th, from 7 am to 11 am.

As during all maintenance periods, the login nodes and NX/NoMachine nodes will be rebooted, and retention on Regal scratch will be run. Due to a SLURM upgrade, jobs will be paused during maintenance.

Please Note: Next month's maintenance will occur on May 17th (not May 2nd) as our MGHPCC Holyoke data center will be implementing planned power maintenance which will require a shutdown of all our equipment in that data center. More information will follow as the date approaches.

Subject: FASRC planned maintenance window - April 4 2016

We are now pulling into the home stretch for the spring semester - hope your teaching and research efforts are being rewarded. Much of our work this month is proactive and service-enhancing.

Important note to those of you using the Odyssey "bigmem" partition: jobs in this partition must request 250GB of memory or more to be accepted and run.

We are performing the following work this month:

* The Regal scratch filesystem hardware will receive firmware updates to fix minor bugs and improve reliability. This work will be performed in a "rolling" fashion. No outage is expected.

* We will be adding two more nodes into the login system pool, reducing individual system load and improving response.

* A minor upgrade to the SLURM scheduling system will be performed, including bug fixes and performance enhancements. Jobs will be paused during this work.

* A minor modification will be made to networking devices to allow for an increased number of virtual LANs. This will allow us to improve the RC network's logical structure.

* Storage network configurations will be updated prior to the release of a new scratch area.

* As part of a continuous upgrade strategy, we plan to replace several networking devices that have reached the end of their useful life. This activity will provide us new features and reduce the risk of outages due to networking hardware failure.

* Continued efforts to improve the response time and reliability of our authentication services.

* Upgrade firmware and restart a client storage server.

* Restart login and NX/NoMachine nodes.

* Perform Regal retention. All files not modified within the last 90 days will be deleted.
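If you would like to check ahead of time which of your scratch files fall under the 90-day rule, something like the following Python sketch may help. It is only an illustration and not the tool we run: the function name `stale_files` is hypothetical, and it assumes that "not modified" corresponds to each file's modification time (mtime) as reported by the filesystem.

```python
import os
import time

RETENTION_DAYS = 90  # retention window stated in this announcement


def stale_files(root, days=RETENTION_DAYS, now=None):
    """Yield paths of files under `root` whose mtime is older than `days` days.

    Assumption: retention is judged by modification time (mtime).
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # 86400 seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are judged by their own timestamps
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file disappeared or is unreadable; skip it
            if mtime < cutoff:
                yield path
```

To use it, call `stale_files` with the path to your own directory on scratch and print each result, then copy anything you still need to longer-term storage before the maintenance window.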

As always, should you have questions, please contact us at
rchelp@fas.harvard.edu