What Would it Take? Building a Topically Relevant Data Repository Amidst the COVID-19 Pandemic

January 13, 2022

Kyrani Reneau
Inter-university Consortium for Political and Social Research
University of Michigan

Introduction

From the onset of the pandemic, researchers, scientists, and journalists recognized that the ever-changing nature of the coronavirus disease (COVID-19) required capturing data across different timeframes and providing immediate access to these data for secondary analysis (Gardner, Ratcliff, Dong, & Katz, 2021; OECD Global Science Forum, 2021). In 2020, the Research Data Alliance (RDA) developed the COVID-19 Recommendations and Guidelines on Data Sharing to encourage the sharing of COVID-19 data as openly and quickly as possible across multiple disciplines. The Inter-university Consortium for Political and Social Research (ICPSR) responded to the need for open data sharing by launching the COVID-19 Data Repository, a free self-publishing repository for data examining the social, behavioral, public health, and economic impact of the novel coronavirus. This resource gives researchers across an array of disciplines a way to share COVID-19-related data, and it promotes the replication and reproducibility of studies so that we can better understand and respond to future outbreaks.

What it Took

With nearly 60 years of data stewardship and archiving experience, ICPSR possessed the infrastructure to readily build the COVID-19 Data Repository. ICPSR is an international consortium of nearly 800 members worldwide that offers data management and data curation services. It hosts 22 specialized collections — totalling over 16,000 studies — of data in education, aging, criminal justice, substance abuse, and several other fields. ICPSR funds the curation of studies through its membership or in collaboration with various government agencies and foundations. 

In addition to specialized collections, ICPSR also maintains a free self-publishing repository known as openICPSR; the COVID-19 Data Repository is housed within openICPSR. More than 5,000 projects have been published in openICPSR, comprising roughly one-third of the overall ICPSR catalog. Data published in openICPSR are not curated and therefore become available to secondary users almost immediately. OpenICPSR is well-suited for the deposit of replication datasets and for researchers who want to publish the raw data associated with a journal article. Organizations can also build their own fully branded repository for data sharing within the openICPSR data repository service. The American Economic Association, the Journal of Economic History, and the American Educational Research Association are among the organizations that use this service.

Toward the start of the pandemic in 2020, ICPSR recognized the need for COVID-19 data to be archived and shared, and quickly formed a working group of ICPSR staff members to determine how best to address this need. Because responses to the pandemic (e.g., state-level responses) were changing so quickly and frequently, the working group decided that the ability to share data quickly with other researchers was of the utmost importance. Rather than creating a specialized curated collection, which would require time to curate incoming data prior to their release, the working group opted to use the technology that already existed in openICPSR, as it would allow data to be published immediately. After making some branding and content decisions, the working group collaborated with ICPSR’s IT and web teams to see the project to fruition. By April 2020, ICPSR had created a location for researchers to share COVID-19 data with one another.

What the COVID-19 Data Repository Offers

The COVID-19 Data Repository accepts all data formats across a broad range of disciplines. Currently, there are over 60 COVID-19 data collections available on a range of topics, including mental health impact, vaccine hesitancy, and a growing database of state-level COVID-19 policies. Projects published here immediately receive a persistent identifier (DOI) and are indexed and searchable within the ICPSR catalog and major search engines (e.g., Google). Because the COVID-19 Data Repository is operated by ICPSR, its data are findable, accessible, interoperable, and reusable, or FAIR, an important criterion for a dependable repository (RDA COVID-19 Working Group, 2020).

Before data can be (re)used, they must first be found. This crucial component of the FAIR principles, or “F,” states that “metadata and data should be easy to find for both humans and computers” (GO FAIR, 2021). ICPSR, and by association the 22 active archives housed under it, makes data findable through persistent identifiers known as digital object identifiers (DOIs), which allow the data to be located easily. The next principle, or “A” for accessible, allows users to retrieve the data and metadata through universal download methods and possibly through a free and open protocol (GO FAIR, 2021). Users who come to ICPSR have access to metadata exports as well as a Creative Commons license to attribute the metadata. In order for data to be “Interoperable,” they “usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing” (GO FAIR, 2021). This principle is visible in how ICPSR shares its metadata, standardized thesauri, and controlled vocabularies. Finally, the “R,” which stands for reusable, means that “metadata and data should be well-described so that they can be replicated and/or combined in different settings” (GO FAIR, 2021). ICPSR’s descriptive metadata are complete and thorough and, at a bare minimum, supply secondary users with information on the original data supplier and the data collection.
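To make the findability point concrete, here is a minimal Python sketch of how a DOI written in a citation can be turned into a link that both people and software can follow. It is an illustration only, not a description of ICPSR's systems: the function name doi_to_url is hypothetical, and the sketch assumes the public resolver at https://doi.org/. The example DOIs are taken from the reference list of this article.

```python
from urllib.parse import quote

# Assumption: the public DOI resolver is used; any DOI appended to this
# base address redirects to the landing page registered for that DOI.
DOI_RESOLVER = "https://doi.org/"

# Prefixes commonly found in front of a DOI in citations and web pages.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Hypothetical helper: normalize a DOI string into a resolvable URL."""
    cleaned = doi.strip()
    lowered = cleaned.lower()
    for prefix in KNOWN_PREFIXES:
        if lowered.startswith(prefix):
            # Drop the prefix (matched case-insensitively) and any space after it.
            cleaned = cleaned[len(prefix):].strip()
            break
    # Every DOI starts with the directory indicator "10." and has a
    # prefix/suffix pair separated by a slash.
    if not cleaned.startswith("10.") or "/" not in cleaned:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path; keep "/" literal.
    return DOI_RESOLVER + quote(cleaned, safe="/")


if __name__ == "__main__":
    print(doi_to_url("DOI: 10.15497/rda00052"))
    print(doi_to_url("10.1016/j.patter.2020.100086"))
```

Because the identifier, rather than a particular web address, is what gets cited, the link keeps working even if the repository later reorganizes its website.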

Other benefits provided by ICPSR, particularly the COVID-19 Data Repository, include:

Meeting the Standards 

The creation of the COVID-19 Data Repository aligns with the goal of sharing COVID-19 data as openly and quickly as possible, as outlined in the Research Data Alliance’s (RDA) COVID-19 Recommendations and Guidelines on Data Sharing. RDA is an international network of 11,000 members whose mission is to build the social and technical bridges that enable open sharing and re-use of data. In March 2020, RDA formed the COVID-19 Working Group to address concerns related to the pandemic (Callaghan, 2020). ICPSR Research Scientist Amy Pienta, PhD, served as co-chair of the working group’s Social Sciences subgroup, which contributed to these guidelines. With respect to reuse and reproducibility, the subgroup suggests that social science researchers include thorough documentation about the data and data elements and be aware of metadata standards (RDA COVID-19 Working Group, 2020). As mentioned previously, the ability to link data produced over time by various entities is critical for COVID-19 data sharing. For this reason, the subgroup urges researchers to use repositories that exhibit long-term preservation practices, such as enabling data linkages (RDA COVID-19 Working Group, 2020). ICPSR is a CoreTrustSeal core certified repository. As a trustworthy repository, the consortium is well-equipped to “facilitate data sharing and increase the FAIRness of data” as recommended in the RDA guidelines (Callaghan, 2020).

Regarding ethical and privacy considerations, the RDA COVID-19 Working Group (2020) recommends that research access “should be as open as possible and as closed as necessary, to protect participant privacy and reduce the risk of data misuse” (p. 9). ICPSR has a long history of safely storing and disseminating restricted-use data to protect participant confidentiality. All of ICPSR’s deposit options allow for restricted-use data to be archived, including the COVID-19 Data Repository.  

Lessons Learned

Although the implementation of the COVID-19 Data Repository was beneficial to the research community at large, especially the social sciences community, there are certain considerations to be made when creating self-publishing repositories. When openICPSR originally launched in 2014, it was built on a different operating system than ICPSR’s other data collections in order to keep the metadata separate. Since these two systems aren’t integrated, moving a self-published project to another collection at ICPSR for curation is quite challenging. Additionally, compared to ICPSR’s curated collections, there is more variability in the quality of metadata in the COVID-19 Data Repository. This is compounded by the fact that the self-publishing system doesn’t pull from the main ICPSR Thesaurus, so the lack of controlled vocabularies can affect consistency in how terms are applied (Lyle, Goforth, & Reneau, 2021). Those planning to build a self-publishing system within an already established repository should avoid multiple deposit and dissemination streams and instead create a single stream for all collections (Lyle, Goforth, & Reneau, 2021). Another important factor to consider when building self-publishing repositories is to allow opportunities for continuous user feedback. In order to meet the needs of secondary users and to yield the most long-term benefits, archival entities must remain forward-thinking and serve as testbeds for new features that continually improve their systems. The CoreTrustSeal guidelines provide a framework for creating a sustainable and trustworthy repository. Self-assessment statements provided for certification must be accompanied by evidence. We encourage those interested in building a repository to conduct the internal self-assessment in order to gain insight into what it takes to manage a high-quality, transparent data service (CoreTrustSeal Standards and Certification Board, 2019).

While the content in ICPSR’s self-publishing repository may not be as fully described or enhanced as professionally curated studies, the trade-off is the ability to provide rapid releases and dissemination of data (Lyle, Goforth, & Reneau, 2021). Principal Investigator (PI) and summary information are required to publish all projects. Secondary users have access to these descriptive metadata and are therefore able to contact the PI with any data-related questions. Researchers may browse current holdings in the COVID-19 Data Repository, and deposit instructions are also available for those interested in sharing their COVID-19-related data. Deposits should include all data, annotated program code, command files, and documentation necessary to understand the data collection and replicate research findings. Prospective depositors are encouraged to email ICPSR User Support at ICPSR-help@umich.edu with questions about publishing data in the COVID-19 Data Repository.

In launching the COVID-19 Data Repository, ICPSR quickly answered the call to provide a central resource for data sharing, with the end goal of contributing to faster advancement of science. Without investment in the repository infrastructure that makes such access possible, our response to future pandemics will be limited (Gardner, Ratcliff, Dong, & Katz, 2021; OECD Global Science Forum, 2021).

References 

Callaghan, S. (2020). Data sharing in a time of pandemic. Patterns, 1(5), 100086. DOI: 10.1016/j.patter.2020.100086 

CoreTrustSeal Standards and Certification Board. (2019). CoreTrustSeal Trustworthy Data Repositories Requirements: Extended Guidance 2020–2022 (v02.00-2020-2022). Retrieved from https://zenodo.org/record/3632533#.YZ5zS9DMKUk 

Gardner, L., Ratcliff, J., Dong, E., & Katz, A. (2021). A need for open public data standards and sharing in light of COVID-19. The Lancet Infectious Diseases, 21(4), e80. DOI: 10.1016/S1473-3099(20)30635-6

GO FAIR. (2021). FAIR Principles. Retrieved from https://www.go-fair.org/fair-principles/

Goforth, C. (2020, November 4). ICPSR’s COVID-19 data repository. Digital Preservation Coalition. Retrieved from https://www.dpconline.org/blog/wdpd/blog-chelsea-goforth-wdpd

Lyle, J., Goforth, C., & Reneau, K. [iassistdata]. (2021, June 18). Integrating self-publishing platforms within established data repositories [Video]. YouTube. https://www.youtube.com/watch?v=yf04jUcv0kA

OECD Global Science Forum. (2021). Enhancing access to research data during crises: lessons learned from the COVID-19 Pandemic. Retrieved from https://one.oecd.org/document/DSTI/STP/GSF(2021)13/FINAL/en/pdf 

RDA COVID-19 Working Group. (2020). Recommendations and guidelines on data sharing (Research Data Alliance). DOI: 10.15497/rda00052