Genome data gaps could stymie search for next COVID variant

A health worker collects a throat swab from a girl for a polymerase chain reaction (PCR) test in Nepal.

Samples of the SARS-CoV-2 virus are sequenced at widely differing rates around the world. Credit: Prabin Ranabhat/SOPA Images/LightRocket/Getty

Many countries sequencing SARS-CoV-2 genomes are sharing only a fraction of them on public repositories — and many sequences are missing important information, according to a global analysis of genomic surveillance. But the study also found that despite these challenges, countries have become faster at sharing sequences over the course of the pandemic.

Openly sharing genome-sequencing data from SARS-CoV-2 samples has allowed researchers to track how the virus is evolving and has become a hallmark of the pandemic. But researchers are concerned that data gaps could make it harder to spot the next COVID-19 variant of concern, and could frustrate efforts to respond to it quickly.

“Sharing data is absolutely important for everybody’s survival,” says Neelika Malavige, an immunologist at the University of Sri Jayewardenepura in Colombo.

In the study¹, published in Nature Genetics this week, researchers collected genomic data uploaded to public repositories, including GISAID, between the beginning of the pandemic and 31 October 2021, comprising roughly 4.9 million genomes from 169 countries.

Hidden variants: Many countries shared fewer than half of their genome sequences from SARS-CoV-2 variants of concern.


They compared those sequences with official reports from individual countries and found that of 62 countries that report those data, 23 — more than one-third — had uploaded fewer than 50% of their sequences from the variants of concern Alpha, Beta, Gamma and Delta (see Hidden variants). About one-quarter of countries had uploaded fewer than 25% of their sequences. The lack of sharing is a global problem, says co-author Andrew Azman, an infectious-disease epidemiologist at Johns Hopkins University in Baltimore, Maryland. “It’s not just a rich or poor country thing.”

Punishing transparency

The authors suggest several reasons that some countries might not share all their sequences on public repositories. It’s possible that some of the samples were not sequenced in the first place, because there are ways of identifying variants of concern without sequencing the full genomes, says Azman. And depending on the sequencing technology that researchers have used, some samples were probably not of sufficient quality to upload, explains Cynthia Saloma, a molecular biologist at the University of the Philippines Diliman in Quezon City.

But a chunk of unshared sequences are probably being held back for political reasons, including the repercussions of being the first country to report a new variant of concern. “Most countries that share those data usually are made to suffer for it,” says Nnaemeka Ndodo, a molecular bioengineer at the Nigeria Center for Disease Control in Abuja. For example, when researchers in South Africa and Botswana alerted the world to the Omicron variant last November, a slew of countries responded by shutting their borders to the region.

In some countries, governments need to review and approve sequences before they are uploaded. Governments of tourism-dependent nations “might ask their labs not to share the data because of the impact it will have”, says Malavige.

But Azman says that data sharing is only part of the story. Some countries share a high proportion of their samples, but have sequenced only a handful of genomes, he says.

The researchers found that 87 countries were routinely sequencing samples, but 31 were not, and the team couldn’t find information about the genomic surveillance strategies of another 76. Overall, no more than 4.5% of confirmed COVID-19 cases were sequenced every week from September 2020 onwards, with large discrepancies across regions: 3.4% of cases were sequenced in Europe during the study period, compared with 0.1% in the eastern Mediterranean. Some countries, including Norway, the United Kingdom and Canada, have sequenced at least 10% of their cumulative cases.

Data on data

The study also assessed the quality of metadata uploaded to GISAID by 169 countries. It found that 63% of sequences did not include information about the age and sex of the person they were sampled from, and more than 95% were missing clinical information such as symptom severity and the vaccination status of the infected person. Higher-income countries tended to provide fewer metadata than did lower-income regions.

Metadata are especially important when a new variant emerges, to assess who is most at risk, how well existing vaccines and drugs will work, and the conditions that could have led to its emergence, say researchers.

Speeding up: The time taken by countries to share COVID-19 sequencing data has fallen during the pandemic.


Again, there could be many reasons for the information gaps, such as data-privacy concerns and metadata collection that can’t keep up with the pace of sequencing. Sometimes a sample might be missing metadata but have come from a remote province, making it too precious not to share, says Nino Susanto, a bioengineer who heads the COVID-19 testing laboratory GSI Lab in Jakarta.

Despite data-sharing challenges, the study also found that countries have become faster at sharing sequences during the pandemic. In 2020, it took close to three months, on average, for researchers in most countries to collect, sequence and upload genomic data to public repositories (see Speeding up). However, that fell to 20 days by the time the Delta variant emerged in 2021. “This pandemic has normalized sharing,” says Azman.
