March 15, 2022
Limor Peer, PhD
Imagine a paper published in a prestigious social science journal. In accordance with current conventions and journal policies, the authors made the underlying data and code publicly available in a trusted repository. Two years later, another researcher accessed the data but had to spend several hours trying to decipher what the variables represent; could not use the code because she did not have a license for the statistical program in which it was written; and, once she secured a license, failed to computationally reproduce the results because an error prevented the code from running in full. These problems are not uncommon (e.g., Stodden et al., 2018).
This compiled anecdote illustrates some of the challenges of stewarding reproducible research. Being able to use other people’s data and code is hard enough. It assumes that the authors are willing to share the full research compendium (Gentleman & Lang, 2007) in the first place. The ability to use data and code rests on the expectation that the materials are deposited in a standards-based repository that will do some amount of curation and enhancement, that users have little difficulty finding and easily accessing the materials, and that users can understand and interpret the materials to allow productive use. When there is an additional expectation that using the data, the code, and their interaction will computationally reproduce a pre-specified result, the bar is higher still. And when the objective is to archive the research compendium for the long term, additional curation, review, and verification are advised.
Curators can help enable reproducibility over time with proper stewardship of the research compendium. A key principle guiding the curation and stewardship of the research compendium is the ability to reuse it and its component artifacts without help from the original author (NASEM, 2019). The idea dates back to Gary King’s “replication standard” (1995) and is echoed in the Royal Society’s concept of “intelligent openness” (2012) and in the OAIS model, whereby digital objects need to be “independently understandable to (and usable by) the Designated Community… without needing the assistance of the experts who produced the information” (CCSDS, 2012). Curation activities are imperative for extracting value from data and other research outputs and for meeting the fairly recent but widely accepted FAIR principles (Wilkinson et al., 2016).
Sharing reproducible research at ISPS
The ISPS Data Archive, launched in 2010, is a digital repository for research output produced by scholars affiliated with the Institution for Social and Policy Studies (ISPS) at Yale University. The main collection in the Archive focuses on experimental design and methods; it includes over 100 studies and about 2,500 associated digital files. Since its inception, the Archive has been curating the research artifacts that underlie the scientific claims and verifying that they are computationally reproducible, using the Data Quality Review framework (Peer et al., 2014). We take care to share only high-quality research output, by which we mean data and code that can be used by humans and machines, and a data–code interaction that can reproduce a pre-specified outcome.
Early on, we reflected on our experience with the Archive and flagged some salient issues for the research community (Peer, 2011). Among the questions we highlighted at the time: What are the criteria for deciding which data are worth sharing and preserving? What does it mean to care for these artifacts so as to facilitate reproducibility in the long term? What steps need to be taken and what standards met to ensure that the data are usable and useful? What standards should systems be required to adhere to so they can communicate effectively? What are the constraints on any such endeavors?
In 2018, ISPS implemented a workflow tool, YARD (Yale Application for Research Data), which has improved the Archive workflow immensely. YARD was developed to facilitate the data curation and code review process, as well as to help standardize the curation workflow, create high-quality FAIR data packages that can be pushed into any repository, and promote research transparency by connecting the activities of researchers, curators, and publishers through a single pipeline (Peer & Dull, 2020). YARD is successfully used by depositors, curators, and Archive administrators.
Changing landscape and positive signs
Over the past 10 years, the landscape has changed dramatically for data and code sharing and preservation. Journals increasingly require data and code sharing as a matter of course (Christian et al., 2020). Changing expectations from journals and scholarly communities around data and code sharing incentivize researchers to prepare and share high-quality data and code (see, for example, the AEA guidelines). Repository solutions, both general and discipline- and data-specific, abound, and “available by request” is no longer acceptable except when warranted by access restrictions. Particularly encouraging, there’s a movement toward trusted repositories and third-party verification services (e.g., Odum Institute, CASCaD). Concurrently, advances in digital preservation as applied to research output have been made (e.g., ReproZip, RO-Crate, EaaSI). Arguably, more options and more tools are available now than ever.
The academic community has also taken significant strides toward reproducibility in both discourse and practice. The National Academies of Sciences, Engineering, and Medicine in the United States posited that, “reproducibility is strongly associated with transparency; a study’s data and code have to be available in order for others to reproduce and confirm results” (2019, p.2). The report identified the obsolescence of these digital artifacts as an important source of non-reproducibility: “over time, the digital artifacts in the research compendium are compromised because of technological breakdown and evolution or lack of continued curation” (p.67). Several other initiatives in support of reproducibility have also advanced policies, standards, and recommendations (e.g., Baker et al., 2020; Sandve et al., 2013; Stodden, 2015; Wilkinson et al., 2016).
Corresponding to these changes, research practices have also evolved. In my work over the last decade, I have noticed that more researchers engage in practices that support open science. Researchers are more likely to plan ahead. The availability of tools such as RMarkdown, Jupyter notebooks, and GitHub makes it easier for researchers to work reproducibly throughout the research lifecycle. Policy instruments such as data management plans and pre-analysis plans incentivize researchers to be more transparent about methods and to be deliberate about the data and code they share with future users. At ISPS, this has been expressed most notably in better organization and documentation of the deposited materials (e.g., detailed readme files, commented code) and in increased awareness of the requirements of verifying computational reproducibility on independent systems.
Remaining challenges and lessons learned
All in all, significant progress has been made by researchers, curators, publishers, and funders to ensure that reproducible research is properly shared and stewarded. Still, challenges remain. The RDA CURE-FAIR working group has recently published a report on the wide-ranging challenges to curating for reproducible and FAIR research output (Peer et al., 2021a). Here I highlight three observations from our experience at the ISPS Data Archive:
- Code breaks. A given computation has many opportunities to fail. Failure can present itself shortly after the original computation or at any time in the future: Code that successfully executes on the day of deposit may break for various reasons (Ivie & Thain, 2018). Organizations involved in sharing reproducible research must make assurances about computational reproducibility that clearly state what it is they are guaranteeing and for how long, and must make the appropriate investment to deliver on those promises. For example, what should be required of a journal claiming that a computation is reproducible? Providing written assurances that the computation produced a pre-specified result at the time of the article’s publication? Sharing an image of the computational environment and associated input and output for a pre-determined length of time? Or committing to ongoing intervention to guarantee that the original materials will function as intended (see Peer et al., 2021b)? In each case, systems, best practices, policies, and staff have to be put in place to carry out the claim that the research is reproducible.
- Curation involves humans. End-to-end automation of data collection, transformation, and analysis will undoubtedly facilitate the reproducibility of these processes. The same holds for the curation process. It is true that some aspects of curation lend themselves to automation more readily than others (e.g., assigning persistent identifiers, periodic bit checks, generating a citation). In our experience, however, humans in the loop are necessary for review and verification. And while some argue that manual curation is indispensable (see Wang et al., 2021), it has also been noted that it is under-resourced (Leonelli, 2016). That said, aspects of manual curation that currently rely on human intelligence and expertise, such as assessing the disclosure risk posed by personally identifiable information in the data or associated materials, or deciphering the order of operations of various program files in the absence of a master file or documentation with that information, could be made more efficient by streamlining the process or by requiring proactive actions from researchers.
- Pre-publication is the best time for curation. Researchers are clearly the best source of information about their data collection, cleaning, and analysis, and about the code, software, and hardware in which these operations are encoded. A model whereby curators are available to work alongside researchers throughout the process is most successful. Post-publication curation is problematic for various reasons: In addition to the risk of information loss over time, crowd-sourced curation may work in some settings but not others, and professional curators themselves are often not experts in the research (Akmon et al., 2017) and may disagree on terminology (Palmer et al., 2013).
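One of the options mentioned above for backing a reproducibility claim is sharing a snapshot of the computational environment alongside the inputs and outputs. As a minimal sketch of what such a snapshot can record, the Python script below writes the interpreter version, platform, and installed package versions to a manifest file (the file name `environment-manifest.json` and the manifest structure are illustrative, not a description of ISPS or YARD tooling):

```python
import json
import platform
import sys
from importlib.metadata import distributions

# Record the interpreter and platform details alongside every installed
# package version, so a future user or verifier can see exactly what
# environment the code last ran in.
manifest = {
    "python": sys.version,
    "platform": platform.platform(),
    "packages": sorted(
        f"{dist.metadata['Name']}=={dist.version}" for dist in distributions()
    ),
}

# Save the manifest next to the research compendium (illustrative path).
with open("environment-manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```

A manifest like this only documents the environment; fuller guarantees (container images, emulation) require the kinds of sustained institutional commitments discussed above.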
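Among the readily automatable curation tasks mentioned above, periodic bit checks amount to recomputing file checksums and comparing them against a stored fixity manifest. A minimal sketch of that idea in Python (the function names and manifest format are illustrative, not actual archive tooling):

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_check(archive_dir: Path, manifest: dict) -> list:
    """Compare current checksums against a stored manifest.

    Returns the relative paths of any files that have changed or
    disappeared since the manifest was recorded.
    """
    failures = []
    for rel_path, expected in manifest.items():
        target = archive_dir / rel_path
        if not target.exists() or file_checksum(target) != expected:
            failures.append(rel_path)
    return failures
```

Running such a check on a schedule catches silent corruption early; the human-intensive tasks (disclosure review, reconstructing the order of operations) remain outside its reach.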
When we launched the ISPS Data Archive 10 years ago, we simply wanted to share carefully reviewed data and code. Along the way we learned that each choice – about what to share, how to share, what it means to carefully review – must rely on good practices, standards, protocols, and infrastructures being embraced by all the different actors involved. This one Archive’s experience indicates that sharing reproducible research is an effort worthy of sustained cooperation and coordination as well as clear governance. Now that the scientific community has turned its attention to this topic, we can expect the next 10 years to codify current best practices and highlight new issues to address.
Akmon, D., Hedstrom, M., Myers, J.D., Ovchinnikova, A., & Kouper, I. (2017). Building Tools to Support Active Curation: Lessons Learned from SEAD. International Journal of Digital Curation, 12 (2). https://doi.org/10.2218/ijdc.v12i2.552
Baker, L., Cristea, I., Errington, T., et al. (2020). Reproducibility of Scientific Results in the EU: Scoping Report. Lusoli, W. (editor), European Commission, Directorate-General for Research and Innovation, Publications Office. https://data.europa.eu/doi/10.2777/341654
Christian T-M., Gooch A., Vision T., & Hull E. (2020). Journal Data Policies: Exploring How the Understanding of Editors and Authors Corresponds to the Policies Themselves. PLoS ONE 15 (3): e0230281. https://doi.org/10.1371/journal.pone.0230281
Consultative Committee for Space Data Systems. (2012). Reference Model for an Open Archival Information System (OAIS). Washington, DC: CCSDS Secretariat. Retrieved from http://public.ccsds.org/publications/archive/650x0m2.pdf, January 7, 2022.
Gentleman, R., & D.T. Lang (2007). Statistical Analyses and Reproducible Research. Journal of Computational and Graphical Statistics 16 (1): 1–23. https://doi.org/10.1198/106186007X178663
Ivie, P. & Thain, D. (2018). Reproducibility in Scientific Computing. ACM Computing Surveys. 51. 1-36. https://doi.org/10.1145/3186266
King, G. (1995). Replication, Replication. PS: Political Science and Politics. 28 (3): 444-452. https://doi.org/10.2307/420301
Leonelli, S. (2016). Open Data: Curation is Under-Resourced. Nature, 538, 41. https://doi.org/10.1038/538041d
National Academies of Sciences, Engineering, and Medicine. (2019). Reproducibility and Replicability in Science. Washington, DC: The National Academies Press. https://doi.org/10.17226/25303
Palmer, C., Weber, N., Muñoz, T., & Renear, A. (2013). Foundations of Data Curation: The Pedagogy and Practice of “Purposeful Work” with Research Data. Archive Journal, Vol 3. http://hdl.handle.net/2142/78099
Peer, L. (2011). Building an Open Data Repository: Lessons and Challenges (September 15, 2011). Social Science Research Network (SSRN) http://dx.doi.org/10.2139/ssrn.1931048
Peer, L., Green, A., & Stephenson, E. (2014). Committing to Data Quality Review. International Journal of Digital Curation, 9 (1). https://doi.org/10.2218/ijdc.v9i1.317
Peer, L. & Dull, J. (2020). YARD: A Tool for Curating Research Outputs. Data Science Journal, 19 (1): 28. DOI: http://doi.org/10.5334/dsj-2020-028
Peer, L., Arguillas, F., Honeyman, T., Miljković, N., Peters-von Gehlen, K., & CURE-FAIR subgroup 3. (2021a). Challenges of Curating for Reproducible and FAIR Research Output (2.1). Zenodo https://doi.org/10.15497/RDA00063
Peer, L., Orr, L., & Coppock, A. (2021b). Active Maintenance: A Proposal for the Long-Term Computational Reproducibility of Scientific Results. PS: Political Science & Politics, 54 (3), 462-466. https://doi.org/10.1017/S1049096521000366
The Royal Society. (2012). Science as an Open Enterprise. Science Policy Centre report 02/12. Retrieved from royalsociety.org (2012-06-20-saoe.pdf), January 7, 2022.
Sandve, G.K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten Simple Rules for Reproducible Computational Research. PLoS Computational Biology, 9 (10): e1003285. https://doi.org/10.1371/journal.pcbi.1003285
Stodden, V. (2015). Reproducing Statistical Results. Annual Review of Statistics and Its Application, 2:1-19. https://doi.org/10.1146/annurev-statistics-010814-020127
Stodden, V., Seiler, J., & Ma, Z. (2018). An Empirical Analysis of Journal Policy Effectiveness for Computational Reproducibility. Proceedings of the National Academy of Sciences, 115 (11): 2584–2589. https://doi.org/10.1073/pnas.1708290115
Wang, D., Liao, Q.V., Zhang, Y., Khurana, U., Samulowitz, H., Park, S., Muller, M. & Amini, L. (2021). How Much Automation Does a Data Scientist Want?, ArXiv. https://arxiv.org/abs/2101.03970v1
Wilkinson, M. D., M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, et al. (2016). The FAIR Guiding Principles for Scientific Data Management and Stewardship. Scientific Data 3 (1): 160018. http://dx.doi.org/10.1038/sdata.2016.18