Odyssey Architecture

Odyssey is a large-scale, heterogeneous, high-performance computing cluster that supports the core scientific modeling and simulation work of thousands of Harvard researchers. Assembled with the support of the Faculty of Arts and Sciences, it occupies more than 10,000 square feet, with 190 racks spanning three data centers separated by 100 miles. The Odyssey cluster and its associated storage draw over a megawatt of power.

Compute: Most of Research Computing’s computational power is housed at our partner facility, MGHPCC, in Holyoke, MA. Processing core counts per node vary from older 8-core processors to newer 64-core units, implemented in a variety of Intel- and AMD-based x86_64 architectures. Available memory per node ranges from 12 GB to 512 GB, with 4 GB per core on average. Additionally, there are over 1,000,000 NVIDIA GPU cores, which can greatly increase the speed of parallel processing jobs. In 2015, Odyssey completed 25.7 million compute jobs using 240 million CPU hours.
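
As a rough illustration of scale, the 2015 figures can be turned into per-job and per-hour averages. The sketch below uses only the two numbers quoted above; the variable names are illustrative, and the averages say nothing about how job sizes were actually distributed.

```python
# Figures quoted in the text for 2015.
jobs_completed = 25.7e6   # compute jobs
cpu_hours_used = 240e6    # CPU hours

# Average CPU hours consumed per job.
avg_cpu_hours_per_job = cpu_hours_used / jobs_completed

# Average number of cores kept busy around the clock for the year
# (2015 was not a leap year: 365 * 24 = 8760 hours).
hours_in_year = 365 * 24
avg_busy_cores = cpu_hours_used / hours_in_year

print(f"{avg_cpu_hours_per_job:.1f} CPU hours per job")    # 9.3 CPU hours per job
print(f"{avg_busy_cores:,.0f} cores busy on average")      # 27,397 cores busy on average
```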

Storage: RC maintains over 35 PB of storage spread across various form factors with differing characteristics. Examples include robust home directories on enterprise storage; performance-driven scratch space and research repositories on Lustre filesystems; and middle-tier laboratory storage on Gluster and NFS filesystems. See our storage page for more details.

Interconnect: Odyssey has two underlying networks: a traditional TCP/IP network and a low-latency 56 Gb/s FDR InfiniBand network that enables high-throughput messaging for inter-node parallel computing and fast access to Lustre-mounted storage. The IP network topology connects the three data centers and presents them as a single contiguous environment to RC users. The IP network includes over 8 miles of CAT-5/6 cabling connecting over 5,000 ports at speeds of 1 to 10 Gb/s. Most of the compute and large-scale storage at MGHPCC is interconnected using our InfiniBand fabric.
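
To give a sense of what the different link speeds mean in practice, the sketch below estimates the ideal time to move a dataset over each of the quoted rates. The 1 TB dataset size and the function name are assumptions made for illustration, and the calculation treats each quoted rate as usable bandwidth, ignoring encoding and protocol overhead, contention, and storage speed, so real transfers will be slower.

```python
def transfer_seconds(size_bytes: float, link_gbps: float) -> float:
    """Ideal time to move size_bytes over a link running at link_gbps gigabits per second."""
    bits = size_bytes * 8
    return bits / (link_gbps * 1e9)

one_terabyte = 1e12  # bytes (decimal TB); hypothetical dataset size

links = [
    ("1 Gb/s Ethernet", 1),
    ("10 Gb/s Ethernet", 10),
    ("56 Gb/s FDR InfiniBand", 56),
]

for label, gbps in links:
    minutes = transfer_seconds(one_terabyte, gbps) / 60
    print(f"{label}: {minutes:.1f} minutes")
# 1 Gb/s Ethernet: 133.3 minutes
# 10 Gb/s Ethernet: 13.3 minutes
# 56 Gb/s FDR InfiniBand: 2.4 minutes
```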

Software: Our core operating system is CentOS. We maintain the configuration of Odyssey with 300,000 lines of Puppet code. RC uses SLURM (Simple Linux Utility for Resource Management) to schedule compute jobs cluster-wide. This instance typically handles 30,000 to 40,000 jobs concurrently; see our documentation on running jobs. In addition to supporting Odyssey users, we manage over a thousand different scientific software tools and programs. The RC user portal is a good place to find more information on software modules, check on job status, and submit a help request. A license manager service is maintained for software that requires license checkout at run time. Other services to the community include the distribution of individual-use software packages and a Citrix instance that allows for remote display and execution of various commercial software packages.
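
For readers unfamiliar with SLURM, a job is usually described by a short batch script whose `#SBATCH` lines request resources. The following is a minimal sketch of generating such a script in Python, not RC's actual tooling: the function name, the partition name `example_partition`, the program `./my_program`, and the default values are all hypothetical (the 4 GB per-core default simply echoes the average quoted above). Consult the documentation on running jobs for the settings that apply on Odyssey.

```python
def make_batch_script(job_name, command, ntasks=1, mem_per_cpu_gb=4,
                      time_limit="01:00:00", partition="example_partition"):
    """Return the text of a SLURM batch script (all defaults are hypothetical)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",          # hypothetical partition name
        f"#SBATCH --ntasks={ntasks}",                # number of tasks to launch
        f"#SBATCH --mem-per-cpu={mem_per_cpu_gb}G",  # memory per allocated core
        f"#SBATCH --time={time_limit}",              # wall-clock limit, HH:MM:SS
        "",
        command,
    ]
    return "\n".join(lines) + "\n"

# Example: an 8-task job running a hypothetical program.
script = make_batch_script("demo", "srun ./my_program", ntasks=8)
print(script)
```

Saving the output to a file and passing it to `sbatch` would queue the job, subject to whatever limits the cluster enforces.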

Hosted Machines: In addition to the HPC cluster, RC manages more than 300 virtual machines for researchers. These are provisioned under one of three tiers of service, depending on the needs of the requester. Example uses include lab websites, web portals, database access points, and project development boxes. RC also provisions and manages workstations connected to the numerous lab instruments used for data collection and analysis.

Copyright © 2013. All Rights Reserved.
Information about how to reuse or republish this work may be available at Attribution.