FASRC and Harvard work hard to decrease our environmental footprint and to leverage technologies that advance our sustainability efforts. From our LEED Platinum primary data center to efforts to reduce overall power consumption and e-waste, we actively look for ways to improve. Below are statements on the sustainability efforts of FASRC's groups:
FASRC’s Operations group is working toward sustainability on several fronts:
- We have positioned our clusters in the Massachusetts Green High Performance Computing Center (MGHPCC).
- The MGHPCC is LEED Platinum certified and derives 100% of its energy from carbon-free generation sources. It is located in Holyoke, MA, where a local hydroelectric facility provides the majority of the power.
- Year-round Power Usage Effectiveness (PUE) is 1.2, compared to a global average of between 1.5 and 1.8 (Abdilla et al., Relating Measured PUE to the Cooling Strategy and Operating Conditions Through a Review of a Number of Maltese Data Centres, 2024). PUE is the ratio of the total power used by the data center to the power delivered to the servers within it (an illustrative calculation appears after this list).
- MGHPCC employs numerous methods to minimize energy consumption, including support for direct-to-chip liquid cooling, the ability to run without chillers for 70% of the year, and high-voltage power distribution to minimize energy loss.
- For more information, see https://www.mghpcc.org/green-design/
- We use EPEAT Bronze certified Lenovo nodes that are liquid cooled (reducing power use by up to 23%) and that lower their CPU frequencies when idle.
- When nodes no longer provide sufficient compute relative to the power they draw (usually after 5-6 years), they are stripped down and either recycled or sent for precious metals recovery, rather than being shipped off to a landfill.
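To make the PUE figure above concrete, the short example below walks through the arithmetic. The wattages are invented for illustration and are not MGHPCC measurements.

```python
# Illustrative PUE arithmetic only; the wattages below are invented, not MGHPCC data.
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness = total facility power / power delivered to IT equipment."""
    return total_facility_kw / it_equipment_kw

# A facility drawing 1,200 kW in total to deliver 1,000 kW to its servers has a PUE of 1.2,
# i.e. 20% overhead for cooling, power distribution, lighting, and so on.
print(pue(1200, 1000))  # 1.2
print(pue(1700, 1000))  # 1.7, roughly the middle of the cited global range
```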
We are investigating or supporting the investigation of:
- Slurm’s power saving features to put nodes to sleep when they are unused (an illustrative slurm.conf excerpt follows this list)
- Shutting down nodes to meet power cap requirements
- Changing node configuration, such as BIOS settings, to establish the most efficient operating point
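For context on the first item, Slurm's built-in power saving is controlled by a handful of slurm.conf parameters. The excerpt below is only an illustrative sketch; the scripts, node names, and timings are placeholders and do not reflect our production configuration.

```
# Illustrative slurm.conf excerpt for Slurm power saving (placeholder values)
SuspendProgram=/usr/local/sbin/node_suspend.sh   # site-provided script that powers a node down
ResumeProgram=/usr/local/sbin/node_resume.sh     # site-provided script that powers a node back up
SuspendTime=1800                # seconds a node must sit idle before it is suspended
SuspendTimeout=120              # seconds allowed for a node to finish powering down
ResumeTimeout=600               # seconds allowed for a node to boot and rejoin the cluster
SuspendExcNodes=holygpu[01-04]  # placeholder list of nodes that should never be powered down
```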
As an extension of this effort, the Data Science and Research Facilitation (DSRF) group at FASRC is reaching out to labs that sit at either end of the Fairshare distribution, which reflects how a lab's usage compares to its allocated share of cluster resources. Labs with a Fairshare score of 0 (overutilized) or 1 (underutilized) for shared resources are being actively contacted about their workflows. To better inform these labs of their usage patterns, we have created dashboards and tools that show a lab's current job efficiency, including its average CPU and memory usage. The goal of this effort is to give lab members guidance on using shared resources efficiently and optimally, so that as little of their allocation as possible goes to waste. The results of this study will also be used to guide FASRC’s power and energy consumption as well as storage and purchasing requirements.
To that end, the group has also compiled a Job Efficiency and Optimization Best Practices document in FASRC DOCS that educates our users on best practices for using computational resources and on guiding principles for optimizing one's code and jobs for efficiency. Additionally, we have rolled out per-user job efficiency and fairshare stats, averaged over the previous day, on all our login nodes. Any user logging onto the cluster sees these stats as part of the MOTD and is aware of their usage at login.
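The efficiency numbers surfaced in these dashboards and login-time stats boil down to simple ratios of used versus allocated resources, the same ratios the standard Slurm seff utility reports per job. Below is a simplified, hypothetical sketch of the CPU-efficiency calculation based on sacct accounting fields; it illustrates the idea and is not the code behind our dashboards.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: estimate one job's CPU efficiency from Slurm accounting data.

CPU efficiency = total CPU time actually used / (elapsed wall time * allocated CPUs).
This mirrors what `seff <jobid>` reports; it is not FASRC's dashboard code.
"""
import subprocess

def to_seconds(t: str) -> float:
    """Convert sacct time strings such as '1-02:03:04' or '03:25.123' to seconds."""
    days, _, rest = t.partition("-") if "-" in t else ("0", "", t)
    parts = [float(x) for x in rest.split(":")]
    while len(parts) < 3:          # pad MM:SS or SS up to HH:MM:SS
        parts.insert(0, 0.0)
    h, m, s = parts
    return int(days) * 86400 + h * 3600 + m * 60 + s

def cpu_efficiency(jobid: str) -> float:
    """Return used/allocated CPU time for a finished job, as a fraction."""
    line = subprocess.run(
        ["sacct", "-j", jobid, "-X", "--noheader", "--parsable2",
         "--format=Elapsed,TotalCPU,AllocCPUS"],
        capture_output=True, text=True, check=True,
    ).stdout.strip().splitlines()[0]
    elapsed, totalcpu, ncpus = line.split("|")
    used = to_seconds(totalcpu)
    available = to_seconds(elapsed) * int(ncpus)
    return used / available if available else 0.0

if __name__ == "__main__":
    # "12345678" is a placeholder job ID used for illustration.
    print(f"CPU efficiency: {cpu_efficiency('12345678'):.1%}")
```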
DSRF, in collaboration with groups within and outside of FASRC, is also ramping up its user training efforts by reaching out to a wider user base. The goal of this exercise is to advertise our current training program and to add new training modules based on user survey feedback, ensuring effective knowledge dissemination about the various tools and software available on the cluster. In addition, we provide guidance on code and compiler optimization to all our users as needed. To better serve our AI/ML users and ensure safe ways to install and run AI/ML tools on the cluster, we have developed FASRC Guidelines for use of OpenAI and sample scripts to install and run popular AI/ML tools on the cluster (a minimal example follows).
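As one small, hypothetical example of the kind of sample script mentioned above, the snippet below checks that a PyTorch installation can see the GPUs Slurm has allocated to a job. It assumes PyTorch is already installed in the active environment and is not one of the official FASRC scripts.

```python
# Hypothetical sanity check; assumes PyTorch is installed in the active environment.
# Run it inside a GPU job (e.g. via srun or sbatch) to confirm the allocated GPUs are visible.
import os
import torch

print("PyTorch version:     ", torch.__version__)
print("CUDA available:      ", torch.cuda.is_available())
print("Visible GPU count:   ", torch.cuda.device_count())
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))
```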
The RC Support Services and Communication group works closely with DSRF to meet FASRC users' needs and provides services for accounts, allocations, documentation, and communication to the FASRC community so that researchers can perform their work more quickly and efficiently.
Additional details to follow.
The Systems Software Group develops, maintains, and supports the systems that FASRC relies on to function. This includes change management, monitoring, alerting, and virtual infrastructure.
Ongoing virtualization consolidation and re-architecting help reduce the power inefficiencies of single-use servers and under-utilized hardware.
Additional details to follow.
The Storage Services group architects and maintains the large storage systems necessary for the FASRC clusters and our lab groups.
One area of significant effort has been the decommissioning of inefficient stand-alone storage systems purchased by labs. Leveraging the appliance model means that FASRC can stand up and provide storage that is faster, more expandable, and more power efficient.
Additional details to follow.
FASRC Research Data Management helps labs and users deal both with the complexities of data requirements and with the management and efficient use of their storage.
A major push is for more labs to move data to 'cold' storage (tape), which reduces the need for ever-growing 'hot' storage.
Additional details to follow.
FASRC's Security group works both to keep the environment secure and to monitor our compliance, so that researchers know their data is being stored and used wisely and efficiently.
Additional details to follow.
FASRC's Project Management group helps the organization as a whole reduce duplicated effort, plan for more effective deployments and maintenance events, and plan for the future.
Additional details to follow.