Colin Carlson, a biologist at Georgetown University, has started to worry about mousepox.
The virus, discovered in 1930, spreads among mice, killing them with ruthless efficiency. But scientists have never considered it a potential threat to humans. Now Dr. Carlson, his colleagues and their computers aren’t so sure.
Using a technique known as machine learning, the researchers have spent the past few years programming computers to teach themselves about viruses that can infect human cells. The computers have combed through vast amounts of information about the biology and ecology of the animal hosts of those viruses, as well as the genomes and other features of the viruses themselves. Over time, the computers came to recognize certain factors that would predict whether a virus has the potential to spill over into humans.
Once the computers proved their mettle on viruses that scientists had already studied intensely, Dr. Carlson and his colleagues deployed them on the unknown, ultimately producing a short list of animal viruses with the potential to jump the species barrier and cause human outbreaks.
In the latest runs, the algorithms unexpectedly put the mousepox virus in the top ranks of risky pathogens.
“Every time we run this model, it comes up super high,” Dr. Carlson said.
Puzzled, Dr. Carlson and his colleagues rooted around in the scientific literature. They came across documentation of a long-forgotten outbreak in 1987 in rural China. Schoolchildren came down with an infection that caused sore throats and inflammation in their hands and feet.
Years later, a team of scientists ran tests on throat swabs that had been collected during the outbreak and put into storage. These samples, as the group reported in 2012, contained mousepox DNA. But their study garnered little notice, and a decade later mousepox is still not considered a threat to humans.
If the computer programmed by Dr. Carlson and his colleagues is right, the virus deserves a new look.
“It’s just crazy that this was lost in the vast pile of stuff that public health has to sift through,” he said. “This actually changes the way that we think about this virus.”
Scientists have identified about 250 human diseases that arose when an animal virus jumped the species barrier. HIV jumped from chimpanzees, for example, and the new coronavirus originated in bats.
Ideally, scientists would like to recognize the next spillover virus before it has started infecting people. But there are far too many animal viruses for virologists to study. Scientists have identified more than 1,000 viruses in mammals, but that is most likely a tiny fraction of the true number. Some researchers suspect mammals carry tens of thousands of viruses, while others put the number in the hundreds of thousands.
To identify potential new spillovers, researchers like Dr. Carlson are using computers to spot hidden patterns in scientific data. The machines can zero in on viruses that may be particularly likely to give rise to a human disease, for example, and can also predict which animals are most likely to harbor dangerous viruses we don’t yet know about.
“It feels like you have a new set of eyes,” said Barbara Han, a disease ecologist at the Cary Institute of Ecosystem Studies in Millbrook, N.Y., who collaborates with Dr. Carlson. “You just can’t see in as many dimensions as the model can.”
Dr. Han first came across machine learning in 2010. Computer scientists had been developing the technique for decades, and were starting to build powerful tools with it. These days, machine learning enables computers to spot fraudulent credit charges and recognize people’s faces.
But few researchers had applied machine learning to diseases. Dr. Han wondered if she could use it to answer open questions, such as why less than 10 percent of rodent species harbor pathogens known to infect humans.
She fed a computer information about various rodent species from an online database — everything from their age at weaning to their population density. The computer then looked for features of the rodents known to harbor high numbers of species-jumping pathogens.
Once the computer created a model, she tested it against another group of rodent species, seeing how well it could guess which ones were laden with disease-causing agents. Eventually, the computer’s model reached an accuracy of 90 percent.
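For readers who want to see the shape of that train-then-test routine, here is a minimal sketch in Python. It is not Dr. Han's model: every species below is simulated, the link between short life span and reservoir status is built into the fake data by assumption, and the "model" is the simplest possible rule, a single cutoff on a single trait. The names `make_species`, `fit_stump` and `accuracy` are hypothetical helpers invented for this illustration.

```python
import random

# Hypothetical trait names; all values are simulated, not real rodent data.
FEATURES = ["life_span_years", "age_at_weaning_days", "population_density"]

def make_species(rng):
    """Simulate one species. By assumption, short-lived species are
    made more likely to be reservoirs; the other traits are pure noise."""
    life_span = rng.uniform(0.5, 8.0)
    weaning = rng.uniform(15, 60)
    density = rng.uniform(1, 500)
    chance = 0.8 if life_span < 2.5 else 0.1
    is_reservoir = rng.random() < chance
    return [life_span, weaning, density], is_reservoir

def predict(stump, row):
    """Apply a one-trait cutoff rule to a single species."""
    j, threshold, below_is_positive = stump
    return (row[j] <= threshold) == below_is_positive

def accuracy(stump, X, y):
    """Fraction of species whose label the rule guesses correctly."""
    return sum(predict(stump, row) == t for row, t in zip(X, y)) / len(y)

def fit_stump(X, y):
    """Try every trait, cutoff and direction; keep the rule that scores
    best on the training species."""
    best_rule, best_acc = None, -1.0
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            for below_is_positive in (True, False):
                rule = (j, threshold, below_is_positive)
                acc = accuracy(rule, X, y)
                if acc > best_acc:
                    best_rule, best_acc = rule, acc
    return best_rule

rng = random.Random(0)
data = [make_species(rng) for _ in range(300)]
train, test = data[:200], data[200:]  # the model never sees the test species
X_train, y_train = [d[0] for d in train], [d[1] for d in train]
X_test, y_test = [d[0] for d in test], [d[1] for d in test]

stump = fit_stump(X_train, y_train)
held_out_accuracy = accuracy(stump, X_test, y_test)
print("trait used:", FEATURES[stump[0]])
print("accuracy on held-out species:", round(held_out_accuracy, 2))
```

The point of the split is that the score is computed only on species the rule was never fitted to, which is what makes an accuracy figure meaningful. Real models of this kind weigh many traits at once rather than a single cutoff.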
Then Dr. Han turned to rodents that have yet to be examined for spillover pathogens and put together a list of high-priority species. Dr. Han and her colleagues predicted that species such as the montane vole and northern grasshopper mouse of western North America would be particularly likely to carry worrisome pathogens.
Of all the traits Dr. Han and her colleagues provided to their computer, the one that mattered most was the life span of the rodents. Species that die young turn out to carry more pathogens, perhaps because evolution put more of their resources into reproducing than into building a strong immune system.
These results involved years of painstaking research in which Dr. Han and her colleagues combed through ecological databases and scientific studies looking for useful data. More recently, researchers have sped this work up by building databases expressly designed to teach computers about viruses and their hosts.
In March, for example, Dr. Carlson and his colleagues unveiled an open-access database called VIRION, which has amassed half a million pieces of information about 9,521 viruses and their 3,692 animal hosts — and is still growing.
Databases like VIRION are now making it possible to ask more focused questions about new pandemics. When the Covid pandemic struck, it soon became clear that it was caused by a new virus called SARS-CoV-2. Dr. Carlson, Dr. Han and their colleagues created programs to identify the animals most likely to harbor relatives of the new coronavirus.
SARS-CoV-2 belongs to a group of species called betacoronaviruses, which also includes the viruses that caused the SARS and MERS epidemics among humans. For the most part, betacoronaviruses infect bats. When SARS-CoV-2 was discovered in January 2020, 79 species of bats were known to carry betacoronaviruses.
But scientists have not systematically searched all 1,447 species of bats for betacoronaviruses, and such a project would take many years to complete.
By feeding biological data about the various types of bats — their diet, the length of their wings, and so on — into their computer, Dr. Carlson, Dr. Han and their colleagues created a model that could offer predictions about the bats most likely to harbor betacoronaviruses. They found over 300 species that fit the bill.
Since that prediction in 2020, researchers have indeed found betacoronaviruses in 47 species of bats — all of which were on the prediction lists produced by some of the computer models they had created for their study.
Daniel Becker, a disease ecologist at the University of Oklahoma who also worked on the betacoronavirus study, said it was striking the way simple features such as body size could lead to powerful predictions about viruses. “A lot of it is the low-hanging fruit of comparative biology,” he said.
Dr. Becker is now following up from his own backyard on the list of potential betacoronavirus hosts. It turns out that some bats in Oklahoma are predicted to harbor them.
If Dr. Becker does find a backyard betacoronavirus, he won’t be in a position to say immediately that it is an imminent threat to humans. Scientists would first have to carry out painstaking experiments to judge the risk.
Pranav Pandit, an epidemiologist at the University of California at Davis, cautions that these models are very much a work in progress. When tested on well-studied viruses, they do substantially better than random chance, but could do better.
“It’s not at a stage where we can just take those results and create an alert to start telling the world, ‘This is a zoonotic virus,’” he said.
Nardus Mollentze, a computational virologist at the University of Glasgow, and his colleagues have pioneered a method that could markedly increase the accuracy of the models. Rather than looking at a virus’s hosts, their models look at its genes. A computer can be taught to recognize subtle features in the genes of viruses that can infect humans.
In their first report on this technique, Dr. Mollentze and his colleagues developed a model that could correctly recognize human-infecting viruses more than 70 percent of the time. Dr. Mollentze can’t yet say why his gene-based model worked, but he has some ideas. Our cells can recognize foreign genes and send out an alarm to the immune system. Viruses that can infect our cells may have the ability to mimic our own DNA as a kind of viral camouflage.
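What might a "subtle feature" in a genome look like to a computer? The article does not say which features Dr. Mollentze's model uses, so the sketch below is only an assumption for illustration: it turns a sequence into numbers by measuring how often each pair of adjacent letters appears compared with what chance would predict. The function name `composition_features` and the toy sequence are invented here.

```python
from collections import Counter

def composition_features(seq):
    """Summarize a DNA sequence as numbers a model could learn from.

    Returns the G+C fraction plus, for each of the 16 adjacent-letter
    pairs, the ratio of its observed frequency to the frequency expected
    if the letters were arranged independently. A ratio below 1 means
    the pair is rarer than chance; above 1, more common.
    """
    seq = seq.upper()
    n = len(seq)
    base_freq = {b: seq.count(b) / n for b in "ACGT"}
    pair_counts = Counter(seq[i:i + 2] for i in range(n - 1))
    total_pairs = n - 1

    features = {"gc_content": base_freq["G"] + base_freq["C"]}
    for a in "ACGT":
        for b in "ACGT":
            expected = base_freq[a] * base_freq[b]
            observed = pair_counts[a + b] / total_pairs
            # Guard against letters that never occur in the sequence.
            features[f"{a}p{b}_bias"] = observed / expected if expected > 0 else 0.0
    return features

# Invented sequence, far shorter than any real viral genome.
toy_genome = "ATGCGCGATATATCGCGTTAACCGGATAT"
features = composition_features(toy_genome)
for name in ("gc_content", "CpG_bias", "TpA_bias"):
    print(name, round(features[name], 3))
```

A table of such numbers for many viruses, each labeled by whether it is known to infect humans, is the kind of input a learning algorithm could search for patterns.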
When they applied the model to animal viruses, they came up with a list of 272 viral species at high risk of spilling over. That’s too many for virologists to study in any depth.
“You can only work on so many viruses,” said Emmie de Wit, a virologist at Rocky Mountain Laboratories in Hamilton, Mont., who oversees research on the new coronavirus, influenza and other viruses. “On our end, we would really need to narrow it down.”
Dr. Mollentze acknowledged that he and his colleagues need to find a way to pinpoint the worst of the worst among animal viruses. “This is only a start,” he said.
To follow up on his initial study, Dr. Mollentze is working with Dr. Carlson and his colleagues to merge data about the genes of viruses with data related to the biology and ecology of their hosts. The researchers are getting some promising results from this approach, including the tantalizing mousepox lead.
Other kinds of data may make the predictions even better. One of the most important features of a virus, for example, is the coating of sugar molecules on its surface. Different viruses end up with different patterns of sugar molecules, and that arrangement can have a huge impact on their success. Some viruses can use this molecular frosting to hide from their host’s immune system. In other cases, the virus can use its sugar molecules to latch on to new cells, triggering a new infection.
This month, Dr. Carlson and his colleagues posted a commentary online asserting that machine learning may gain a lot of insights from the sugar coating of viruses and their hosts. Scientists have already gathered a lot of that knowledge, but it has yet to be put into a form that computers can learn from.
“My gut sense is that we know a lot more than we think,” Dr. Carlson said.
Dr. de Wit said that machine learning models could someday guide virologists like herself to study certain animal viruses. “There’s definitely a great benefit that’s going to come from this,” she said.
But she noted that the models so far have focused mainly on a pathogen’s potential for infecting human cells. Before causing a new human disease, a virus also has to spread from one person to another and cause serious symptoms along the way. She’s waiting for a new generation of machine learning models that can make those predictions, too.
“What we really want to know is not necessarily which viruses can infect humans, but which viruses can cause an outbreak,” she said. “So that’s really the next step that we need to figure out.”