Bespoke Research Data Management in the Age of Big Data: The Value of a Cross Unit Collaborative Data Professional Community

April 11, 2022

Jordan Wrigley, MSLS, MA
University of Colorado Boulder
ORCID: https://orcid.org/0000-0003-0176-5980

Aditya Ranganath, PhD
University of Colorado Boulder
ORCID: https://orcid.org/0000-0002-4721-2313

Ryan Caillet, MLIS
University of Colorado Boulder
ORCID: https://orcid.org/0000-0002-4025-4801

Abstract

Research involving “big data” poses novel challenges for librarians and other data professionals in research data management and support roles. To take one example, researchers increasingly view institutional repositories as viable platforms on which to archive and disseminate their datasets, in accordance with the requirements of funders and the norms of open data. However, these institutional repositories are rarely configured to host complex datasets in the realm of big data (i.e. on a terabyte scale and larger), which creates unique curation and preservation challenges. More generally, such data require bespoke approaches to data management and support across a variety of interrelated domains: storing active datasets, provisioning computing environments (for instance, to train machine learning algorithms on large training datasets), archiving and preserving large datasets over long time horizons, and articulating complex data management workflows in grants. 

In this editorial, we describe our institution’s model of collaborative consultation to develop bespoke workflows that address the data management needs of projects involving large and complex datasets. This model is underpinned by a partnership between data librarians and research computing experts, which has evolved into a platform that is able to respond to the rapidly growing and dynamic “big” data management needs of our institution. We provide case studies of how this model of collaborative consultation addresses these diverse needs. We also discuss strategies for fostering collaborative communities that can address data management challenges in the realm of big data, and explore decentralized versus centralized approaches to creating and sustaining such communities.

Introduction

Meeting the big data management needs of researchers might appear a purely technological or logistical challenge. In our experience, however, the need for bespoke solutions to the challenges of big data management underscores the importance of expert communities of practice for addressing management challenges in the big data era. It is the intrinsically social process of collaborative consultation and dialogue within communities, rather than the use of technology or codified workflows per se, that allows librarians and data professionals to successfully address the storage, analysis, curation, and preservation challenges associated with big data research. We provide three case studies demonstrating how decentralized and informal communication can nurture a sustainable community of expertise that successfully identifies and implements bespoke data management solutions. 

Case Study #1

The long-term preservation of big datasets in institutional repositories (IRs) is a more complex endeavor than the preservation of conventional datasets. It requires substantial computational resources over long time horizons, and the corresponding financial costs may put such preservation out of reach for some researchers. Moreover, integrating existing repository infrastructures with the computational and storage facilities required for the long-term preservation of large datasets poses technical challenges of its own.

In 2019, researchers from the Laboratory for Atmospheric and Space Physics at the University of Colorado Boulder contacted our Institutional Repository manager about preserving data associated with a publication under review, in accordance with FAIR principles and the publisher’s data policy. At the time, we had not curated or stored datasets on the terabyte scale, and had to develop a workflow to accommodate these researchers’ needs. This workflow emerged through an organic collaboration that required coordination among several members of Research Computing and the Libraries. In particular, a Data Librarian within CU’s Norlin Library curated the data and ensured that it adhered to FAIR principles, while a Research Computing staff member within the Office of Information Technology facilitated data transfer and storage through the PetaLibrary, which draws on CU Boulder’s supercomputing infrastructure. The head of the Data Services and Scholarly Communication unit within the Libraries and the Institutional Repository manager worked with both the librarian and the computing expert on tasks at the intersection of data curation and infrastructure provision, such as synchronizing the CU Scholar repository landing page and metadata with the underlying data stored in the PetaLibrary. Throughout this process, the researchers themselves worked with both the librarian (to answer questions about their data and metadata) and the computing specialist (to transfer their data to a new storage environment).

In short, the long-term preservation and accessibility of the large dataset was achieved through a successful effort to integrate repository services with a supercomputer-enabled data infrastructure. This success was driven by open communication, which allowed stakeholders with different (yet equally essential) skills to converge on a bespoke solution through a process of “learning by doing.” Communication and mutual adjustment allowed for the quick and flexible adaptation of existing resources to address novel researcher needs, resulting in the ingestion and publication of a “big” dataset that is open, accessible, and discoverable. There was nothing inevitable about this technological solution, however; it was only identified through a process of collaborative and informal consultation. A more structured or siloed organizational setting may not have been conducive to the creative problem solving and expertise-pooling that resulted in our success.

The solution developed in this case resulted in a basic workflow for processing and disseminating “big” datasets via our IR. This workflow uses the Globus file transfer platform as a nexus between the CU Scholar repository (which holds the relevant project’s metadata) and the PetaLibrary, where large datasets (that would strain the repository’s native storage capacity) are archived. While this workflow has provided a useful framework for our subsequent work with big data, our experience has been that it cannot be applied mechanically to other cases (such as those described below), each of which poses distinctive challenges that require novel solutions. Indeed, these solutions are best identified through the very same framework of open communication and ad hoc problem-solving that produced the curation and storage workflow in this initial case.

Case Study #2

Our campus’s grant-writing support personnel in the Research Innovation Office (RIO) referred to us a research group seeking a bespoke solution for their big data publication goals. We are currently assisting this group with their Passive and Active Spectrum Sharing (PASS) project. PASS is an interdisciplinary project that aims to enable spectrum sharing between passive and active systems, so as to promote the more efficient use of the radio frequency (RF) spectrum (a scarce resource), and thereby attenuate radio frequency noise. PASS researchers are collecting up to 5 TB of relevant RF data, and are committed to publishing this as open data via our IR. The researchers conceived their data deposits and the IR as a “prototype database for storing and retrieving RF data” that could set standards for their field.

To implement this, PASS researchers contacted us before their data collection phase, which allowed us to pre-structure both the storage and publication environments and raised important questions about user experience. In particular, the dataset’s landing page(s) on the IR needed to be configured to offer intuitive previews of the data and to let users download only the specific selections (including metadata) that are immediately relevant to them. These are important challenges, especially when a dataset is inherently unwieldy by virtue of its size. We expect to work with the researchers, as well as research computing specialists, to provide end-users with an intuitive way to navigate the collection and access the data. The early referral of these researchers to us (before data collection began) facilitated the same kind of communication and exploration that marked the previous case study, and has set the stage for the bespoke creation of the data environments envisioned by the researchers.

Case Study #3

Active data storage for computational analysis may be required for a research project. Universities often provision cloud storage options that host data with or without computational functionalities. Such storage options are viable for conventional datasets, but are impractical for datasets several to hundreds of TB in size. Customized environments may be costly to provision, especially when they require security protocols to host sensitive data in contexts requiring cross-institutional collaboration. In such cases, researchers often seek out librarians and data professionals for help finding appropriate environments. 

An Intermountain Neuroimaging Consortium researcher sought our assistance addressing these challenges in a project that involved upwards of 100 TB of sensitive neuroimaging data. This was complicated by the lack of allocated grant funding for data storage, and the need to retroactively locate a suitable data-storage infrastructure. Because our IR is not equipped to handle the storage and controlled release of sensitive data, a colleague and community member suggested we approach the Inter-university Consortium for Political and Social Research (ICPSR) about infrastructure for handling sensitive and restricted data. We arranged a meeting among ourselves, the researcher, and an ICPSR representative to further discuss this possibility. During the conversation, it became clear that ICPSR would not be appropriate for the data, since researchers will run scripts against the data over several years, and supporting such activity falls outside the scope of ICPSR services. However, the conversation generated ideas for additional avenues to pursue within CU Boulder; in particular, CU Boulder’s Research Cyber Security Program, which facilitates sensitive data computing environments that could meet this research team’s specific needs. These discussions are ongoing, and could not have occurred without our wide-ranging conversation with the ICPSR specialist and the researcher; conversation and dialogue, in our experience, are often the source of serendipitous ideas or connections that help identify bespoke solutions for big data projects’ distinctive challenges.

Final Thoughts

Our decentralized approach to big data support is only one model among many possibilities. However, it suggests broader implications for the future of libraries’ data services workflows in the world of big and complex data. There is considerable interest among academic libraries in the prospect of programmatically automating workflows to increase efficiency and facilitate the scaling of library data services. Yet meeting the unique needs of individual big data projects relies on tacit knowledge that is intrinsically difficult (perhaps even impossible) to codify in formal workflows. After all, the innovation and serendipity that arise within the social context of human communities of practice (and that are needed to identify tailored solutions to these bespoke needs) are difficult to automate. Ironically, while big data may accelerate the automation of human labor, its successful management and stewardship is an intrinsically social, intersubjective process nurtured in professionally diverse communities of practice for which there is no technological or algorithmic substitute. These communities might emerge in different ways, but the use of such communities to uncover socially sourced solutions to big data management challenges is likely to be scalable across institutions with diverse resources.

In our case, this community has emerged organically, through a process of informal knowledge-sharing. Although this community was subsequently formalized as the Center for Research Data and Digital Scholarship (CRDDS), it has its roots in decentralized efforts to generate bespoke big data solutions, and continues to grow outward to include informal members as novel challenges arise. We believe this approach will continue to help researchers discover bespoke solutions to future challenges, and that the relevance of socially embedded and collective expertise will not diminish amid ongoing efforts to automate data services and data management workflows.
