Research Data Management (RDM) describes the management of research data throughout the entirety of a research project. This webpage offers tools, resources, and guidance to help researchers manage their data effectively and prepare it for sharing and reuse.
Data Management Planning
Create a Data Management and Storage Plan
- Data Management Plans are formalized documents outlining how research data will be collected, analyzed, stored and shared throughout a project.
- Generating a Data Management Plan at the start of a project or when entering a new lab can save time, funding, and effort in the long run, as it ensures research data is organized, findable, and shareable. Many funding agencies now require submission of a data management and/or sharing plan with grant applications.
- Harvard-specific guidance is provided in the DMPTool, a template for creating Data Management and Sharing Plans (DMSPs) offered through Harvard Library.
Review data policies and procedures
- University policies:
- Research Data Ownership Policy
- Harvard Research Data Security Policy (HRDSP)
- Research Safety Application (Sensitive Research)
- Data Use Agreements (DUA): A binding contract governing access to nonpublic data, often required by external parties.
- Retention and Maintenance of Research Records and Data Frequently Asked Questions (“FAQs”): General institutional policy stipulates that ‘essential research records’ need to be retained for a period of no fewer than seven (7) years after the end of a research project or activity.
- Harvard University General Records Schedule
- Funder requirements and policies:
- Additional policies:
Develop standardized data organizational procedures
- Establish consistent file naming conventions
- Create a framework for naming files that describes what the files contain and how they relate to one another.
- File naming conventions can help with data organization and the identification of records, especially when working in a collaborative environment.
- Example: ExperimentName_InstrumentName_CaptureTime_ImageID.tif
- Develop streamlined directory structures
- Folder structures should correspond to how the records are generated and how they function, while complementing proposed or existing workflows.
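The naming convention in the example above can be enforced with a small helper. The following is a minimal sketch, not an official FASRC tool: the timestamp layout (YYYYMMDDTHHMMSS), the zero-padded numeric image ID, the allowed characters, and the function names `build_filename` and `parse_filename` are all assumptions made for illustration.

```python
import re
from datetime import datetime

# Hypothetical pattern for ExperimentName_InstrumentName_CaptureTime_ImageID.tif.
# The timestamp layout and the numeric image ID are assumptions, not a FASRC rule.
FILENAME_PATTERN = re.compile(
    r"(?P<experiment>[A-Za-z0-9-]+)_"
    r"(?P<instrument>[A-Za-z0-9-]+)_"
    r"(?P<capture_time>\d{8}T\d{6})_"
    r"(?P<image_id>\d{4,})\.tif"
)


def build_filename(experiment, instrument, capture_time, image_id):
    """Assemble a file name from its parts, rejecting characters that break the pattern."""
    for field in (experiment, instrument):
        # Underscores separate fields, so they are not allowed inside a field.
        if not re.fullmatch(r"[A-Za-z0-9-]+", field):
            raise ValueError(f"Invalid field: {field!r}")
    stamp = capture_time.strftime("%Y%m%dT%H%M%S")
    return f"{experiment}_{instrument}_{stamp}_{image_id:04d}.tif"


def parse_filename(name):
    """Return the named fields of a conforming file name, or None if it does not conform."""
    match = FILENAME_PATTERN.fullmatch(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    name = build_filename("CellGrowth", "Confocal2", datetime(2024, 3, 5, 14, 30, 0), 17)
    print(name)
    print(parse_filename(name))
    print(parse_filename("final_v2 (copy).tif"))
```

Running `parse_filename` over an existing directory listing is one way to find files that do not follow the agreed convention before they are shared or archived.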
Assign roles and responsibilities within the lab, identifying data stewards
- PI Responsibilities on the Cluster
- FASRC recommends identifying an individual within your group or lab who can act as a primary contact with FASRC’s Research Data Manager and respond to research data management issues that arise within the lab.
- A lab Data Manager can also help to promote and support data management best practices, organize directory structures, establish file naming conventions, and identify data for retention, long term storage, or deletion.
Data Storage and Security
Review storage options
- Identify where to store research data based on its behavior, performance, and means of access. A master copy of raw data should be retained, with further changes to subsequent versions well documented.
- FASSE Cluster: A secure cluster environment providing Harvard researchers with access to a secure enclave for analysis of sensitive datasets (level 3).
- Home Directories: Every user with a FASRC account is granted a 100 GB home directory upon receiving cluster access. It is designed as a personal storage space for analysis, scripts, or documentation.
- Lab Directories: Every lab receives a collaborative group folder with a 4 TB storage limit. It is a general lab folder intended for data, scripts, or documentation. It is not designed for high performance data analysis.
- Cluster Storage (Tier 0): High performance storage with no backups, designed as a location for data analysis, as it has high read/write speeds.
- Lab Storage (Tier 1): Lab storage, ideal for general file sharing. A primary storage location for the lab, as it maintains backups. Best utilized for irreplaceable data like raw datasets.
- Lab Storage (Tier 2): Less active storage location for lab data. Not designed for high throughput jobs as it has lower read/write speeds.
- Long-Term Storage (Tier 3/Tape): Long term storage for data that needs to meet compliance, publishing, or institutional retention requirements. Data will not be accessible to the lab or researcher and should not be retrieved (except in extenuating circumstances).
- Data moved to Tape must be in the proper format. File sizes should range from 100 MB to 1 TB, and storage is allocated in 20 TB increments. Files outside this size range will need to be tarred.
- Google Drive:
- Google Drive storage request form: This form can be used to request a new personal or shared Google Drive, or request additional storage space for an existing Google Drive. The default Google Shared Drive storage limit is 5 GB. Eligible users may request additional storage using the form. Requests must be approved by FAS, which is currently responsible for the costs.
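For the Tape (Tier 3) size guidance above, files below the 100 MB lower bound can be combined into a single tar archive before migration. The sketch below uses Python's standard `tarfile` module. It is an illustration under stated assumptions, not FASRC's procedure: the function name `bundle_small_files` is hypothetical, decimal units are assumed for "100 MB", and the sketch does not check that the resulting archive stays under the 1 TB upper bound or handle files that are already larger than 1 TB.

```python
import tarfile
from pathlib import Path

# 100 MB lower bound from the guidance above; decimal units are an assumption.
MIN_FILE_BYTES = 100 * 1000**2


def bundle_small_files(source_dir, archive_path, min_bytes=MIN_FILE_BYTES):
    """Add every file under source_dir smaller than min_bytes to one uncompressed tar archive.

    Returns the relative paths that were bundled, in sorted order.
    """
    source = Path(source_dir)
    # Collect the file list before creating the archive so the archive never includes itself.
    small_files = sorted(
        p for p in source.rglob("*") if p.is_file() and p.stat().st_size < min_bytes
    )
    with tarfile.open(archive_path, "w") as archive:
        for path in small_files:
            # Store paths relative to source_dir so the archive unpacks cleanly elsewhere.
            archive.add(path, arcname=path.relative_to(source).as_posix())
    return [p.relative_to(source).as_posix() for p in small_files]
```

The original files are left in place, so the archive contents can be compared against the returned list before anything is deleted.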
Connect to your new storage folder from a desktop computer
Request a new storage allocation or increase an existing storage allocation in Coldfront
- Coldfront is a resource allocation management system FASRC adapted to manage allocations on the FASRC cluster. The platform enables the viewing and management of lab groups (Projects) and storage or cluster allocations (Allocations).
View information about storage folders associated with a group/lab
- Utilize the Starfish Zones tool to view key information about your group's or lab's storage folders
- The Starfish Zone User Interface is a self-service visual tool that enables users to view group storage amounts and locations. Users can navigate folder structures to access detailed information about files and storage. Labs and groups are strongly recommended to utilize this tool to assist with their data organization and cleanup efforts.
Storage Service Center Overview and Pricing
Review and comply with data safety and security requirements
- Harvard University Information Security
- Cluster security levels
- Harvard Data Safety Website
- Data Safety System
Data Migrations
Transferring data between research platforms can be challenging. The tools listed below can assist with data transfers; which tool to use depends on the size of the dataset, the data security level, and who will need access to the dataset.
Data transfer tools:
- Transferring data on the cluster
- Globus: Enables file sharing with external collaborators without the need for a FASRC account
- Rsync: A fast and versatile file-copying tool. Migrates only modified files from source to destination.
- FileZilla: An open-source client that is available cross-platform (i.e. Mac, Windows, Linux)
- Rclone: A convenient and performant command-line tool for transferring files and synchronizing directories directly between FASRC filesystems and Google Drive
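The "only modified files" behavior described for Rsync above can be illustrated in a few lines of Python. This is a conceptual sketch for local directories only, not a replacement for rsync: it mimics rsync's default quick check (compare file size and modification time) and omits deletion handling, remote transfers, and checksums. The function names `needs_copy` and `sync_modified` are hypothetical.

```python
import shutil
from pathlib import Path


def needs_copy(src, dst):
    """Return True if dst is missing or differs from src in size or modification time."""
    if not dst.exists():
        return True
    s, d = src.stat(), dst.stat()
    # Compare whole seconds to tolerate filesystems with coarser timestamp precision.
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)


def sync_modified(source_dir, dest_dir):
    """Copy only new or changed files from source_dir to dest_dir; return what was copied."""
    source, dest = Path(source_dir), Path(dest_dir)
    copied = []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src.relative_to(source)
        dst = dest / rel
        if needs_copy(src, dst):
            dst.parent.mkdir(parents=True, exist_ok=True)
            # copy2 preserves the modification time, so unchanged files are skipped next run.
            shutil.copy2(src, dst)
            copied.append(rel.as_posix())
    return copied
```

A second call on an unchanged source returns an empty list, which is the property that makes repeated migrations of large directories inexpensive. For real transfers on the cluster, rsync itself (for example `rsync -av source/ destination/`) remains the appropriate tool.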
Data Sharing
Researchers should make every effort to ensure research data is available to others for reuse when completing a project or study. It has become increasingly common, and often a requirement of a publisher or funder, that data be shared following the completion of a project.
There are many resources and tools available that researchers can use to share datasets, including government-sponsored repositories, disciplinary repositories, third-party repositories, and Harvard Dataverse, a free data repository open to all researchers at Harvard. A key aspect of data sharing is also making public the code and other materials that support the research. Data repositories provide a centralized place to hold, share, and organize data in a logical manner.
Review data repository options and select one for each project and publication
- Global registry of research data repositories (re3data.org)
- Harvard Dataverse
- FASRC suggestions for sharing data
Additional resources
- SEAS Research Data Management: Support and consultation on Data Management Plans
- Longwood Research Data Management (RDM): Information and resources on NIH Data Management Plans
- DMPTool: Web-based platform to assist with the creation and sharing of Data Management Plans. DMPTool provides step-by-step guidance for drafting DMPs, including NIH-specific templates and samples to address specific requirements.
- Harvard Library Research Data Management Program: Connects members of the Harvard community to services and resources that span the research data lifecycle, to help ensure that Harvard’s multi-disciplinary research data is findable, accessible, interoperable, and reusable (FAIR)
Contact
For additional support, please contact Sarah Marchese, Research Data Manager for FASRC: sarah_marchese@fas.harvard.edu