The Glue of Genomics: Will Science’s Unsung Data Heroes Abandon Academia?
Advances in genomics over the last twenty years have turbocharged our ability to decode DNA and other nucleic acids. At the turn of the century, the Human Genome Project was nearing the end of its 13-year effort to produce the first reference human genome, at enormous financial cost.
In 2023, genome sequencing is a routine process that has expanded our understanding of biology and fueled a similar enhancement in other -omics disciplines.
These advances owe much to increased raw sequencing power and ever-more ingenious techniques for sampling genomic data, be it from millions of individuals in genome-wide association studies (GWAS) or from the heart of a lone cell in single-nucleus RNA sequencing assays.
But all these advances would be for naught if we weren’t able to process and analyze the data torrent that pours from these sequencing projects. Genomic data analysis pipelines have had to progress as well to support this growing field.
A flexible field
Dr. Alison Meynert is a senior research fellow and the bioinformatics analysis core manager at the MRC Institute of Genetics and Cancer (IGC) at the University of Edinburgh. Meynert and her six-person team process data from researcher–clinicians across the University. Meynert’s own background is in computer science and software development, but she has now spent two decades in bioinformatics – basically, she says, “from its infancy.”
While wet-lab researchers in individual niches might be able to laser-focus their projects, Meynert’s team has to stay flexible. “We've had some years where there have been big, nationally funded, whole-genome sequencing projects going on where we'll have hundreds of samples coming in over the course of the year across different cohorts for different research projects,” she says. Now, novel techniques like nanopore and single-cell sequencing have to be considered as well. That requires a deep and broad knowledge base. “Different sequencing machines have totally different error profiles and output formats,” Meynert explains.
One thing that connects these different data sources is that, for Meynert and her team, the goal remains to take complex, raw genomic data and turn it into a form from which the researchers that produced it can extract relevant insights. “It’s a very collaborative process,” says Meynert.
What makes bioinformatics so collaborative?
That collaboration is not just with the wet-lab researchers who come to the core team for help, but with other core labs across the UK, Europe and even globally. Meynert and her team are part of a community of bioinformaticians called nf-core. This project began at the National Genomics Infrastructure in Stockholm, Sweden, which created a set of standards for data analysis pipelines. The project is sponsored by the Chan Zuckerberg Initiative and runs on cloud credits provided by Amazon Web Services and Microsoft Azure, but the team is largely volunteers.
The IGC team is hosting nf-core’s next hackathon, Meynert tells me. “We basically get the best of everybody’s contributions across the bioinformatics community. To develop these pipelines, we have lots of arguments about how to do things. Sometimes you come up with multiple ways of doing things.”
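For readers who haven't used them, nf-core pipelines are typically launched through the Nextflow workflow manager from the command line. The snippet below is a minimal sketch of one such invocation wrapped in Python; it assumes Nextflow and Docker are installed, and the pipeline name, profiles and output directory are purely illustrative.

```python
import subprocess

# Pull and run an nf-core pipeline via Nextflow. "nf-core/rnaseq" is used here
# only as an illustration; the "test" profile runs on small bundled data and
# "docker" runs each tool inside its published container.
command = [
    "nextflow", "run", "nf-core/rnaseq",
    "-profile", "test,docker",
    "--outdir", "results",   # where the pipeline writes its output
]

subprocess.run(command, check=True)
```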
These collaborative events are commonplace across the bioinformatics world. But this underlying ethos of sharing and collaboration, while heartening, stands in stark contrast to prevailing practice in biological and biomedical science. Initiatives like the hugely influential FAIR Guiding Principles, which aimed to exploit the massive increase in digital data to make scientific information more findable, accessible, interoperable and reusable, have increased the amount of lip service paid to data sharing. But a recent study in the Journal of Clinical Epidemiology skewered the idea that science has significantly embraced open-access principles. The paper trawled through 3,556 articles from over 300 open-access journals, all published in January 2019.
Just half of these studies indicated that authors were willing to share their data, and of these, a jaw-dropping 93% of authors either did not respond to or declined requests for data access. Is the more relaxed attitude to data sharing in bioinformatics circles a sign that staff in these areas are more magnanimous and beneficent people? Meynert says that the reasons are likely to be more practical. “The bioinformatics community has always been very strongly based around open-source software, I think in large part because if we want to develop a new tool, we're going to need data to test it on. We need someone to have shared their data for us to do that.”
Researchers in genomics and other biological disciplines fear being “scooped” by rival scientists almost as much as they fear a deafening silence at the end of their symposium talk. The resulting culture of secrecy and security around data has proved difficult to rectify. But bioinformatics’ open-source approach has racked up numerous success stories that benefit genomics data analysis researchers like Meynert every day. She points to file formats like the binary alignment map (BAM), a compressed version of the text-based sequence alignment map (SAM) format. SAM and BAM (along with CRAM, a further-compressed version that stores reads against a reference sequence) are among the most widely used formats in genomics. Yet they originated from individual research groups becoming frustrated with existing formats and devising changes. Initiatives like the Global Alliance for Genomics and Health (GA4GH) have helped these formats become standardized, enabling them to be widely adopted in the field, and massive code repositories like GitHub make it easy for these innovations to be shared and reused by other research groups. The Genome Analysis Toolkit (GATK), developed at the Broad Institute, itself a collaboration between MIT and Harvard, is one of the “workhorse tools of genomics,” says Meynert.
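To make the format discussion concrete, here is a minimal sketch of reading aligned reads from a BAM file with the open-source pysam library (a Python wrapper around htslib); the file name, contig and coordinates are illustrative placeholders, and an index file is assumed to sit alongside the BAM.

```python
import pysam

# Open a coordinate-sorted, indexed BAM file for reading ("rb" = read, binary BAM).
# "example.bam" is a placeholder; CRAM files can be opened the same way with mode
# "rc" and a reference_filename argument, since CRAM stores reads against a reference.
with pysam.AlignmentFile("example.bam", "rb") as bam:
    # Fetch alignments overlapping an (illustrative) 1 kb window on chromosome 1.
    for read in bam.fetch("chr1", 1_000_000, 1_001_000):
        print(
            read.query_name,        # read identifier
            read.reference_start,   # 0-based leftmost mapping position
            read.mapping_quality,   # MAPQ score
            read.cigarstring,       # alignment described in CIGAR notation
        )
```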
The unsung heroes of genomics
The efforts that have gone into creating these resources have arguably advanced our ability to understand the genome as much as technical advances in sequencing technology themselves. But the researchers who make these innovations still work in academic circles, where the currency for grants and recognition is still counted in publications and citations. How can these scientists receive the credit they deserve for their efforts? It’s a conundrum that has motivated Neil Chue Hong, the founding director and principal investigator of the Software Sustainability Institute (SSI), a project that works with all seven of the UK’s major research councils to improve the practice of using software in science.
Chue Hong notes that one obvious barrier stopping academics who create software and analysis tools from being recognized is the lack of a formally defined role in the scientific workforce. At an SSI workshop in 2012, a group coined the term research software engineer (RSE) to codify this position. “It’s a role that has always been present in research for the last maybe four or five decades,” says Chue Hong. “Because there's always been this idea of the researcher – the one who's good at coding – that you ask, ‘How do I fix this piece of software that's not working?’ I think what was causing problems was that there wasn't enough recognition for this role, and as software use became more and more prevalent, that role became more and more important.”
But the increasing influence of RSEs in academia has not been matched by the recognition they receive. The supremacy of “the final publication” is part of the issue. It gives an undue amount of credit to a mythical figure Chue Hong calls the “lone hero principal investigator,” who is, in theory, credited with the bulk of the work that goes into a scientific paper. In reality, research is a collaborative practice, and a funding system that recognizes the contribution of each lab member to a final publication is an important first step towards reflecting this, says Chue Hong: “[Funding bodies] are moving towards recognizing things like narrative CVs and researcher resumes that show it’s not just about the publications, but about the way you disseminate knowledge and the way you pass on skills to other people,” he adds.
Progress is being made elsewhere toward recognizing the contribution of RSEs to genomics and other fields. The American National Standards Institute (ANSI) and the National Information Standards Organization (NISO) jointly announced the publication of the CRediT Contributor Roles Taxonomy as a formal standard last year, a framework that divides the practice of science into 14 contributor roles, including conceptualization, funding acquisition, investigation and software. In 2022, the American Chemical Society announced a pilot of the CRediT taxonomy in its journals. Chue Hong believes the main remaining sticking point lies in the peer review process. “The last remaining barrier is getting peer reviewers to understand that science is done very differently in 2023 than 20 years ago,” he suggests. This might involve challenging principles that are embedded within researchers’ psyches. “Survivorship bias is unhelpful. Just because someone was successful, by fighting through a particular way of doing things in academic research, doesn't mean that everyone else has to fight that same fight,” says Chue Hong.
Genomics without glue
The battle to get recognition for RSEs is one that should concern all of science. Much of genomics’ analysis toolkit wouldn’t exist without them, and one doesn’t have to look too far ahead to see what might happen if RSEs continue to be shut out of a recognized place in academia. “Increasingly, what people will do is quit academia,” says Chue Hong. “You used to have a choice between an academic role and an industry role in research. The difference was that academia was meant to give you more job security, a better pension and a more fruitful working environment, where you've got lots and lots of really interesting people to collaborate with. The tradeoff was salary.”
“Now, industry offers you possibly better working conditions, probably a better pension, definitely better people to work with and a better salary. You have to really want to work in academia now, if you're in a software role. The challenge that I think the RSE sector faces is to encourage people to stay in academic research environments and not go into industrial research environments.”
Those struggles will be familiar to many within academia. But RSEs play a vital role at the intersection between an increasingly digital, data-rich informatics environment and the wet-lab work that drives biology. Meynert says that her day job often involves taking tools created by other research groups and fitting them to the task at hand, applying mortar to make different analysis tools work in tandem. “An awful lot of informatics,” she says, “is gluing things together.” If academic research begins to lose RSEs and their contribution, it may soon find out just how much relies on that glue’s grip.