February 15, 2022
Anthony J. Dellureficio, MLS, MSc.
Associate Librarian for Research Data Management
Memorial Sloan Kettering Cancer Center Library
ORCID ID: http://orcid.org/0000-0001-8339-4989
Donna S. Gibson, MLS
Director of Library Services
Memorial Sloan Kettering Cancer Center Library
ORCID ID: http://orcid.org/0000-0003-3333-6742
The MSK Library (https://library.mskcc.org/) supports one of the oldest cancer centers in the country, an institution committed to cutting-edge research, exceptional educational programs, and innovative and exemplary patient care. The institution was one of the first to receive the National Cancer Institute’s Comprehensive Cancer Center designation, with state-of-the-art research conducted in tandem with quality patient care (National Cancer Institute, 2019). The library’s mission is aligned with the mission of the Center, and we believe that understanding users’ information needs directly relates to the value the library contributes and to our visibility as an active research partner. We proactively partner with library users to deliver targeted services that enhance and support their research and medical activities. These types of services play a key role in how users perceive, engage with, and collaborate with us.
We intend to describe the origins of our new program, the challenges we’ve faced, how we have adapted to changing needs, and which services we offer or plan to introduce to best serve our evolving research community.
In June 2015, the libraries of The Rockefeller University and Memorial Sloan Kettering Cancer Center designed a joint study to appraise the state of data management at these two highly collaborative institutions. Our goal was to provide insight into current data management practices and to make recommendations for data management policy development and implementation.
Funding agencies had already begun requiring data management plan statements in new grant applications, with the expectation that more detailed plans would be required in the future. Increasingly, data was being viewed as a valuable product of research, worthy of being reported, managed, and made accessible – as legitimate a source of knowledge as a scholarly research paper and evidence of research dollars well invested (National Institutes of Health, 2015).
The study was designed in two phases. In the first phase librarians identified and interviewed researchers at their respective institutions. Efforts were made to single out key researchers whose projects were generating large amounts of data. The information and insights gathered from these conversations were helpful for the second phase of the study, an online survey. Sent to all researchers, this survey included questions about their laboratories’ workflows, and their own personal practices for labeling, describing, storing, managing, and sharing data.
A total of 215 respondents provided usable information, at a ratio of three MSK responses for each Rockefeller response. A core of 96 respondents completed every question in the survey. Key survey findings (contact the authors for more information) included:
- Data was being generated in several dozen different file formats with four predominant format types: images, spreadsheets, text, and sequencing.
- Only one in five researchers had ever been asked to prepare a data management plan.
- Half of the journals in which they published required data sets to be submitted with the manuscript.
- Researchers appeared to have little concept of:
  - the useful life of data and the effective shelf life of digital files,
  - standard protocols for data file naming and management,
  - the necessity for accurate file descriptions and metadata to increase the discoverability of their data sets,
  - the role that data management activities play in the reproducibility crisis, and
  - the difference between data files, data formats, and the tools to manage data files.
- Seventy-five percent of the respondents said they had never received any instruction on managing data, and 45% said they would like help with this.
The results from this survey and targeted interviews helped produce the roadmap for building our service. In comparing the results from each institution, we were also able to confirm that many of the issues and pain points experienced by researchers regarding data management activities were similar.
To design a research data management program, we needed to consider its fundamental purpose and guiding tenets. In 2016, a consortium of researchers and institutional representatives defined the FAIR principles of Findability, Accessibility, Interoperability, and Reusability as a baseline for the creation and ongoing care of research data (Wilkinson et al., 2016). Our program has been founded on a commitment to this global standard, as well as to a set of socially responsible guidelines introduced by the Global Indigenous Data Alliance. The CARE principles of Collective Benefit, Authority to Control, Responsibility, and Ethics remind us that any tools and services we develop at our institution have a social context (Global Indigenous Data Alliance, n.d.). Additionally, the forthcoming NIH Policy for Data Management and Sharing (https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html) has influenced our priorities as we prepare to provide compliance support to our researchers. Ultimately, our program needs to focus on integrating with researchers’ workflows, lowering administrative burden, and forming partnerships to support researchers throughout the life of their experiments, from planning to publication and beyond.
A research data management librarian was hired to spearhead the development of an action plan based on the survey feedback, client interviews, and the principles described above. The plan has specifically included launching a data catalog with MSK-specific integrations and enhancements, offering data management planning assistance, supporting DOI minting, and developing strong ties within the institution and the RDM library community. Our implementation components fall into the following categories: data discovery and library platform integration, internal and external relationship building, engagement with our researchers, and maintaining and enhancing our services over time.
As a result of the survey, the first application we implemented was the MSK Data Catalog (https://datacatalog.mskcc.org/), a metadata-only database of records for datasets, code, analytical tools, and other research outputs not traditionally included in a bibliographic catalog. The software is open-source code developed at NYU Langone (NYU Health Sciences Library, 2016), with implementation and ongoing development at several other institutions. The records in our catalog include curated, enhanced descriptive metadata in order to supply additional access points, provide access instructions or explain data restrictions, connect researchers working on common topics, connect datasets with relevant analytical tools, track data reuse through publications, and highlight MSK datasets while still accommodating concerns over protected health information (PHI) exposure. Highlights of our instance include metadata enhanced with subject descriptors from MeSH (Medical Subject Headings) and from the OncoTree taxonomy, a widely used cancer-type classification system created at MSK (Kundra et al., 2021); persistent identifiers for data authors (ORCID) and for datasets (DOI) wherever possible; and links from data authors and associated publications to our library’s institutional author and publication database, Synapse (https://synapse.mskcc.org). The MSK Data Catalog currently includes 450+ records reflecting datasets created by or used by MSK researchers in support of publications. Our development plans for the MSK Data Catalog include:
- Harvesting the records into our library discovery platform (ExLibris’ Primo) so that datasets are discoverable alongside articles,
- Strengthening its integrations with Synapse,
- Providing analytics and data usage reports, and
- Exploring coding solutions to support augmented cataloging methods to facilitate import and enhancement of new catalog records.
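To make the idea of a metadata-only record with extra access points more concrete, here is a minimal sketch in Python. It is not the NYU data catalog's actual schema: the class names, field names, and sample values are all hypothetical and chosen only for illustration. The sample author is ORCID's well-known fictitious researcher, not an MSK author.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataAuthor:
    name: str
    orcid: str = ""  # persistent identifier for the author, when known


@dataclass
class CatalogRecord:
    """Hypothetical metadata-only record: it describes a dataset but stores no data."""

    title: str
    authors: list[DataAuthor]
    doi: str = ""  # persistent identifier for the dataset, when one exists
    mesh_terms: list[str] = field(default_factory=list)      # MeSH subject descriptors
    oncotree_codes: list[str] = field(default_factory=list)  # OncoTree cancer-type codes
    access_instructions: str = ""  # how to request the data, or why it is restricted
    publications: list[str] = field(default_factory=list)    # DOIs of papers using the data

    def access_points(self) -> set[str]:
        """Every subject term a searcher could use to reach this record, lowercased."""
        return {term.lower() for term in self.mesh_terms + self.oncotree_codes}


def find_by_subject(records: list[CatalogRecord], term: str) -> list[CatalogRecord]:
    """Return the records that carry the given MeSH term or OncoTree code."""
    wanted = term.strip().lower()
    return [record for record in records if wanted in record.access_points()]


# Illustrative values only; these are not real MSK Data Catalog entries.
records = [
    CatalogRecord(
        title="Example tumor sequencing dataset",
        authors=[DataAuthor("Carberry, Josiah", orcid="0000-0002-1825-0097")],
        mesh_terms=["Lung Neoplasms"],
        oncotree_codes=["LUAD"],
        access_instructions="Restricted: contact the data author.",
    ),
    CatalogRecord(
        title="Example imaging dataset",
        authors=[DataAuthor("Carberry, Josiah", orcid="0000-0002-1825-0097")],
        mesh_terms=["Breast Neoplasms"],
    ),
]

for match in find_by_subject(records, "luad"):
    print(match.title)
```

Because the record holds only descriptions, identifiers, and access instructions, a dataset can be surfaced by either vocabulary without exposing the underlying data, which is the property that matters where PHI is a concern.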
Internal and External Relationships
The Library is not the only part of our organization that is interested in research data, so it is important to cultivate relationships and seek collaboration with groups and departments that have overlapping interests. We met with our Core Facilities team, which provides centralized services and technology to our basic and translational research programs, to understand their workflows and to learn whether capturing their contributions to data creation and analysis in our data catalog would assist them with internal reporting and analytics. Externally, we became an institutional member of the DMPTool, an open-source online application supporting the creation and sharing of data management plans. We have also been working with our Research and Project Administration to create institutional templates and guidance for scientists in the earliest stages of their research. We recently became DataCite members, with administrative access for creating DOI-minting repositories for MSK. We hope to leverage DataCite Fabrica’s APIs to offer DOI minting options directly through our repositories, simultaneously generating stub records for our data catalog and emphasizing the value of research data.
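As a rough illustration of how one set of metadata could feed both a DOI registration and a catalog stub, the sketch below builds a request body in the JSON:API shape used by the DataCite REST API and derives a minimal stub record from it. It only constructs data and sends nothing over the network. The function names, the stub record's fields, the DOI prefix, the URL, and the sample creator (ORCID's fictitious researcher) are assumptions made for illustration. They do not describe MSK's implementation.

```python
import json


def build_draft_doi_payload(prefix, title, creators, publisher, year, landing_url):
    """Build a DataCite-style request body for a draft DOI.

    creators is a list of (name, orcid) tuples. No "event" attribute is set,
    which under DataCite's rules leaves the DOI in the draft state.
    """
    return {
        "data": {
            "type": "dois",
            "attributes": {
                "prefix": prefix,
                "titles": [{"title": title}],
                "creators": [
                    {
                        "name": name,
                        "nameIdentifiers": [
                            {
                                "nameIdentifier": f"https://orcid.org/{orcid}",
                                "nameIdentifierScheme": "ORCID",
                                "schemeUri": "https://orcid.org",
                            }
                        ],
                    }
                    for name, orcid in creators
                ],
                "publisher": publisher,
                "publicationYear": year,
                "types": {"resourceTypeGeneral": "Dataset"},
                "url": landing_url,
            },
        }
    }


def stub_catalog_record(payload, doi):
    """Derive a hypothetical stub catalog record from the same metadata."""
    attrs = payload["data"]["attributes"]
    return {
        "title": attrs["titles"][0]["title"],
        "doi": doi,
        "authors": [
            {
                "name": creator["name"],
                # Keep only the bare ORCID iD, dropping the URL prefix.
                "orcid": creator["nameIdentifiers"][0]["nameIdentifier"].rsplit("/", 1)[-1],
            }
            for creator in attrs["creators"]
        ],
        "needs_curation": True,  # subject terms and access notes still to be added
    }


# Placeholder values; the prefix and URL are not real registrations.
payload = build_draft_doi_payload(
    prefix="10.12345",
    title="Example tumor sequencing dataset",
    creators=[("Carberry, Josiah", "0000-0002-1825-0097")],
    publisher="Example Institution",
    year=2022,
    landing_url="https://example.org/datasets/example-0001",
)
stub = stub_catalog_record(payload, doi="10.12345/example-0001")
print(json.dumps(stub, indent=2))
```

Marking the stub as needing curation reflects the division of labor described above: the repository supplies the identifiers, and library staff later enrich the record with subject descriptors and access instructions.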
We recently rolled out a ‘Best Practices in Research Data Management’ class and hope to offer additional classes in the future. The decision to start with this specific topic was based on the findings from the above-mentioned survey. We expect to release an RDM-specific LibGuide by early 2022, which will include a locally developed, searchable database of funder and publisher data sharing and management policies. The Library is conducting research throughout the organization to inform a three-year strategic plan that will include data management outreach. We hope to generate engagement and enthusiasm among senior leadership while also eliciting constructive feedback from the MSK community on how we can shape the RDM program to best anticipate and fulfill the needs of our researchers.
One of the most gratifying parts of standing up a new service in a library is discovering that there is an abundance of inter-institutional community interest. In 2019, the MSK Library joined the Data Discovery Collaboration (DDC) (https://datadiscoverycollaboration.org/), a multi-institutional consortium providing a platform-agnostic community of practice to support research data discoverability through metadata, outreach, and software development. We have also found support for research data management in libraries through numerous conferences, workshops, and professional organizations. These activities will help us to maintain dynamic services and provide insights to enhance our current offerings.
The challenges we’ve faced so far in implementing an RDM program are familiar ones, likely shared by other libraries launching a novel program. Securing leadership and institutional buy-in is inevitably an early administrative hurdle, as is deciding whether to rely on in-house skills, invest in staff skillset development, or hire candidates to fill newly created positions. Likewise, it’s important to strike a balance among open-source software, in-house development, and purchased third-party software.
We made a conscious decision to focus our RDM services beginning with the Sloan Kettering Institute (SKI), our organization’s basic science research division, rather than on clinical research because of the challenges presented by working with PHI.
Given the current environment and the NIH data management and sharing mandate, research libraries need to determine the best approach for integrating a data management service at their organization. Each institution has its own unique culture and structure, so libraries must understand, navigate, and manage relationships in ways that are flexible and forward-thinking. They also need to decide which kinds of support to offer as services, and how much customization is required, based on researchers’ information and scholarly research needs.
Global Indigenous Data Alliance. (n.d.). CARE Principles for Indigenous Data Governance. Retrieved September 15, 2021, from https://www.gida-global.org/care
Kundra, R., Zhang, H., Sheridan, R., Sirintrapun, S. J., Wang, A., Ochoa, A., Wilson, M., Gross, B., Sun, Y., Madupuri, R., Satravada, B. A., Reales, D., Vakiani, E., Al-Ahmadie, H. A., Dogan, A., Arcila, M., Zehir, A., Maron, S., Berger, M. F., … Schultz, N. (2021). OncoTree: A cancer classification system for precision oncology. JCO Clinical Cancer Informatics, 5, 221–230. https://doi.org/10.1200/cci.20.00108
National Cancer Institute. (2019, October 30). Memorial Sloan-Kettering Cancer Center. Retrieved September 8, 2021, from https://www.cancer.gov/research/infrastructure/cancer-centers/find/memorialsloankettering
National Institutes of Health. (2015, February). Plan for Increasing Access to Scientific Publications and Digital Scientific Data from NIH Funded Scientific Research. Retrieved January 4, 2022, from https://grants.nih.gov/grants/NIH-Public-Access-Plan.pdf
NYU Health Sciences Library. (2016). NYUHSL Data Catalog. Retrieved September 15, 2021, from https://github.com/nyuhsl/data-catalog
Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, Article 160018. https://doi.org/10.1038/sdata.2016.18