Emerging topics in biomedical research: data, metadata, and reproducibility

By Martin-Immanuel Bittner
Reproducibility crisis
According to PubMed/MEDLINE, there are more than 30 million peer-reviewed biomedical research papers, with around 60,000 new articles added every month. Many of these studies explore complex questions at the frontier of human knowledge. It is one of the hallmarks of research that reproducing previous findings builds confidence in their validity and allows errors and mistakes to be corrected.
For this process to work, publications must contain enough detail to allow for independent replication. Ideally, any suitably equipped and trained researcher should be able to select a result from the literature and replicate the experiment that led to it. However, achieving the level of detail needed to preserve the ‘self-correcting’ nature of science is tedious and time-consuming at best, if not outright impossible given the way lab work is currently conducted. Furthermore, incentives do not favor replication studies, which are hard to publish and rarely help scientists advance their careers, despite being of critical importance for biomedical research as a whole.
Non-reproducibility and its implications for biomedical research
A growing body of evidence suggests that this self-correcting mechanism underpinning biomedical research is not functioning properly – but it took an outside perspective to realise it. In 2011, research teams at the German pharma company Bayer reviewed four years’ worth of in-house research projects, only to find that less than 25% of the published studies they examined withstood scrutiny. In a similar exercise, the US biotechnology company Amgen selected 53 peer-reviewed studies considered potentially groundbreaking for further scrutiny. Even with the support of some of the original authors, the Amgen team was able to reproduce just six of them. In the United States alone, a recent study estimated that approximately USD 28 billion per year is spent on preclinical research that is not reproducible.
The challenge of reproducibility has been brought back into the limelight during the ongoing COVID-19 pandemic. In the past ten months, more than 100,000 papers on COVID-19 have been published. Handling such a volume of submissions requires accelerated peer review, which can undermine rigor. For example, two of the world’s leading journals, The Lancet and The New England Journal of Medicine, retracted high-profile papers on a potential COVID-19 treatment after concerns about the quality and accessibility of the underlying data could not be resolved by an independent audit.
Data and metadata can help to make research reproducible
The examples above highlight the need for more careful stewardship of scientific data. Scientific societies and publishers around the world are pushing to increase experimental rigor and reporting transparency. For instance, the Singapore Statement on Research Integrity, developed at the Second World Conference on Research Integrity in 2010, sets out shared principles and responsibilities for honest and accountable research. The UK Reproducibility Network (established in 2019) functions as a grassroots initiative of scientists and a coordinator for institutional actors that commit to best practices for reproducible research. A trailblazing approach to improving data reporting standards came from a diverse group of stakeholders with experience in academia, industry, funding agencies, and scholarly publishing, who came together to propose the FAIR Guiding Principles for scientific data management and stewardship in 2016, built on the four foundational principles of Findability, Accessibility, Interoperability, and Reusability.
Yet calling for appropriate data management is just the first step. Researchers tend to share with their peers only the data and procedures they deem critical to reproducing their results. However, most experiments involve seemingly minor pieces of information that are often omitted from the records but can nonetheless have a measurable impact on the results. Such information, for example a minor variation in a method (shaking versus stirring), a change in lab temperature during the experiment, or the reagent batch used, is often referred to as metadata and must be recorded to ensure experimental reproducibility.
A lack of metadata hampers reproducibility even between closely collaborating world-class research teams. For example, two labs from Harvard Medical School and UC Berkeley collaborated on a project on the heterogeneity of breast cancer. Despite using seemingly identical methods, reagents, and specimens, they were unable to reproduce each other’s results for almost two years. After a year of fruitless searching for the underlying cause, the two groups got together in the same lab to run the experiment in parallel. As they worked through what seemed to be an identical protocol, they discovered why their results diverged: at Berkeley, cells were prepared on a shaking platform, whereas at Harvard, more vigorous rotational stirrers were used. This cautionary tale highlights the importance of keeping records not only of data in the conventional sense, but also of the tools and protocols that generate these data, that is, the metadata.
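To make this concrete, the sketch below shows one way a lab might store such metadata alongside the primary readout. It is a minimal illustration only; the field names (agitation_method, reagent_lots, and so on) are assumptions chosen for this example, not an established schema.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime
import json

@dataclass
class ExperimentRecord:
    """One assay run, stored together with the metadata that is
    usually omitted from lab notebooks."""
    protocol_id: str                 # reference to the written protocol
    readout: dict                    # the primary data, e.g. viability per well
    agitation_method: str            # "shaking_platform" vs "rotational_stirrer"
    ambient_temp_c: float            # lab temperature during the run
    reagent_lots: dict = field(default_factory=dict)   # reagent name -> batch
    instrument_id: str = "unknown"
    timestamp: str = field(default_factory=lambda: datetime.utcnow().isoformat())

record = ExperimentRecord(
    protocol_id="breast-cancer-heterogeneity-v3",   # hypothetical identifier
    readout={"well_A1": 0.82, "well_A2": 0.79},
    agitation_method="shaking_platform",
    ambient_temp_c=21.5,
    reagent_lots={"collagenase": "LOT-4471"},
)

# Serialising the full record keeps data and metadata together for later audits.
print(json.dumps(asdict(record), indent=2))
```

Had both labs in the example above kept a record of this kind, the difference in agitation method would have been visible from the files alone, without a year of parallel experiments.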
The role of automation
Recent advances in laboratory automation make data and metadata collection possible at an unprecedented scale, and thereby foster reproducibility in the biomedical sciences. Automated workflows promote the adoption of unambiguous protocols that can be widely distributed among stakeholders for later replication.
Automated labs, by design, collect rich data and metadata records. These can be used to generate full audit trails tracking all relevant variables throughout an experiment, and they make it possible to rerun exactly the same experiment in the future. With such fine control over the conditions, and the ability to align them precisely between runs, researchers can narrow the source of any discrepancy between findings down to the underlying biology they actually want to uncover.
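As a minimal illustration of how such audit trails can be put to work, the hypothetical snippet below compares the metadata of two runs and reports every variable that differs, so that only genuinely aligned runs are attributed to biology. The record structure is assumed for this sketch and does not describe any particular platform.

```python
def diff_metadata(run_a: dict, run_b: dict) -> dict:
    """Return every metadata field that differs between two runs.

    If this comes back empty, any remaining variation in the results is
    more plausibly biological than procedural.
    """
    keys = set(run_a) | set(run_b)
    return {
        key: (run_a.get(key), run_b.get(key))
        for key in keys
        if run_a.get(key) != run_b.get(key)
    }

# Illustrative records echoing the breast cancer example above.
berkeley_run = {"agitation_method": "shaking_platform", "ambient_temp_c": 21.5,
                "reagent_lot": "LOT-4471"}
harvard_run  = {"agitation_method": "rotational_stirrer", "ambient_temp_c": 21.5,
                "reagent_lot": "LOT-4471"}

print(diff_metadata(berkeley_run, harvard_run))
# {'agitation_method': ('shaking_platform', 'rotational_stirrer')}
```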
While automation could benefit research groups in industry and academia alike, few researchers have access to in-house automation capabilities. The budgetary and expertise requirements are immense and difficult to bear for most centres and groups. This situation is analogous to High-Performance Computing (HPC), where only a handful of companies and universities in the world build their own supercomputers. Instead, when they need to perform demanding simulations, they purchase computing power from one of the HPC centres around the world.
In a similar fashion, biomedical laboratories can conduct their experimental tasks and technical work in remote automated facilities. These remote-access laboratories allow for global access to state-of-the-art equipment and resources, accessible via online configurators that don’t require programming knowledge. Automation can also increase experimental throughput, running 24/7 without human errors or variability while facilitating the data and metadata collection before, during, and at the end-point of an experiment (see Fig.1). There is an emergent ecosystem offering this type of services, including US cloud labs Strateos and Emerald Cloud Lab, Oxford and Singapore-based Arctoris, or US synthetic biology foundries Ginkgo Bioworks and Zymergen.
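From the researcher’s side, such a remote workflow can be pictured roughly as in the sketch below: an experiment request is described declaratively and submitted to the facility, which later returns results together with the full audit trail. The endpoint URL, payload fields, and behaviour are purely hypothetical and do not correspond to the interface of any of the providers named above.

```python
import json
from urllib import request

# Purely hypothetical endpoint, for illustration only; real cloud labs
# expose their own configurators and APIs.
LAB_API = "https://cloudlab.example.org/v1/runs"

# A declarative description of the requested experiment (assumed fields).
protocol = {
    "assay": "dose_response",
    "cell_line": "MCF-7",
    "compound": "compound_X",
    "concentrations_nM": [1, 10, 100, 1000],
    "replicates": 3,
    "capture_metadata": True,   # ask the facility to return the full audit trail
}

req = request.Request(
    LAB_API,
    data=json.dumps(protocol).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# In practice, the facility queues the run on its robots and, once finished,
# returns the results together with the complete data and metadata record:
# with request.urlopen(req) as resp:
#     run_record = json.load(resp)
```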
Conclusions and outlook
Drug discovery is an area that strongly depends on the quality of basic biomedical research, and with data quality problems plaguing the discipline, progress in discovering new therapeutics is stalling. When promising published results cannot be reproduced, labs should turn to reducing data ambiguity, a key issue currently holding drug development back. Collecting more complete datasets that include metadata can improve reproducibility. Introducing automation to labs helps to capture that metadata, prevents human error, and reduces ambiguity. Lab automation also holds the promise of opening new frontiers in research. This is especially relevant now during lockdowns, when such labs provide disruption-free and reliable remote access to the full range of wet lab capabilities. In the long run, it will move drug development forward with unprecedented speed, reliability, and accuracy.
About the Author
Martin-Immanuel Bittner MD DPhil is the Chief Executive Officer of Arctoris, the world’s first fully automated drug discovery platform, which he co-founded in 2016. He graduated as a medical doctor from the University of Freiburg in Germany before completing a DPhil in Oncology as a Rhodes Scholar at the University of Oxford. Martin-Immanuel Bittner has extensive research experience covering both clinical trials and preclinical drug discovery. In recognition of his research achievements, he was elected a member of the Young Academy of the German National Academy of Sciences in 2018.