Chemical Legos: A machine learning approach to faster drug discovery

You might think of it as the world’s most elaborate Lego set.

Scientists now have access to a virtual library of chemicals in which individual molecular pieces can be interlocked to create more than 1 billion potential new combinations, none of them ever synthesized. A tiny fraction of them might be medical gold.

As mind-boggling as it sounds, these types of virtual, on-demand chemical repositories get bigger every year. They have tantalizing potential for drug discovery, but they are also far from practical in helping scientists find the precious few hits that could actually lead to safe and effective treatments for disease.

Anthony Gitter, an investigator at the Morgridge Institute for Research, and his team have designed a new machine learning strategy that could nimbly navigate this new world, shaving both time and cost off the drug discovery process. In findings posted October 20, 2021, on the preprint server ChemRxiv, Gitter describes a new technique that streamlines the screening of these billion-chemical libraries to look just for chemical structures of interest, rather than trying to simulate complicated chemical-protein interactions that take months of computing time.

“I think one of the main points we’re trying to make in this work is that because our predictions are so fast, we now have the power to make predictions on every chemical that’s theoretically commercially available,” says Gitter, also a professor of biostatistics and medical informatics at the University of Wisconsin–Madison. “So it’s no longer screening 1,000 or 100,000 chemicals at a time. We can screen this set of over a billion and can score them all with no problem.”

For the study, Gitter’s team began with the results of a separate UW–Madison project that was searching for potential new antibacterial drugs. That project produced a database of about 400,000 chemicals that had been rigorously screened for their ability to disrupt bacteria. Gitter’s lab used that database to train its machine learning algorithm to find useful patterns that help tell the difference between promising and irrelevant chemicals.

Then they unleashed their well-trained model into the vast sea of these commercial chemical repositories. The first search was done on an “in-stock” library of more than 8 million molecules that had been previously synthesized. The second was done on an “on-demand” library where more than 1 billion different chemical combinations could be constructed.
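To make the idea of scoring every chemical in a huge library concrete, here is a minimal sketch in Python. It is not the team's model: as a stand-in for a trained machine learning scorer, it rates each candidate by its Tanimoto similarity to a handful of known active compounds, and it streams through the library so that only the top few predictions are ever held in memory. The fingerprint representation, the function names and the toy compounds are all assumptions made for illustration.

```python
import heapq
from typing import FrozenSet, Iterable, List, Tuple

# Hypothetical representation: a chemical fingerprint is the set of
# "on" bit positions describing which substructures a molecule contains.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Shared bits divided by total distinct bits (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def score(fp: Fingerprint, actives: List[Fingerprint]) -> float:
    """Stand-in scorer: similarity to the closest known active compound.

    A trained model would replace this function; it is an assumption here.
    """
    return max(tanimoto(fp, active) for active in actives)


def top_k(
    library: Iterable[Tuple[str, Fingerprint]],
    actives: List[Fingerprint],
    k: int,
) -> List[Tuple[float, str]]:
    """Score every library entry and return the k best as (score, id).

    The library is consumed lazily, so it can be a generator over a file
    far too large to load at once; heapq.nlargest keeps only k items.
    """
    scored = ((score(fp, actives), cid) for cid, fp in library)
    return heapq.nlargest(k, scored)


# Toy data: one known active and four made-up library compounds.
known_actives = [frozenset({1, 2, 3})]
toy_library = [
    ("cmpd-a", frozenset({1, 2, 3})),
    ("cmpd-b", frozenset({1, 2})),
    ("cmpd-c", frozenset({7, 8})),
    ("cmpd-d", frozenset({1, 2, 3, 4})),
]

best = top_k(iter(toy_library), known_actives, k=2)
```

Because each compound is scored independently, this kind of ranking scales by splitting the library into chunks and merging the per-chunk top lists, which is one reason fast per-chemical predictions make billion-compound screens feasible.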

The “in-stock” search yielded 701 potential compounds of interest, 48 percent of which proved to be relevant to the target of disrupting bacteria. The team then searched the billion-chemical virtual library, which produced 68 top predictions. Of those, 31 chemicals, or 46 percent, proved to be hits for their chosen target.
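The percentages above are simple hit rates: confirmed hits divided by compounds tested. The short Python check below reproduces the on-demand figure from the reported counts. The count it derives for the in-stock library is only implied by the reported 48 percent; it is not stated in the study summary, so treat it as an approximation.

```python
def hit_rate(hits: int, tested: int) -> float:
    """Fraction of tested compounds that were confirmed as hits."""
    if tested <= 0:
        raise ValueError("tested must be a positive count")
    return hits / tested


# On-demand library: 31 hits among 68 top predictions (reported counts).
on_demand_pct = round(hit_rate(31, 68) * 100)

# In-stock library: only the rate (48 percent of 701) is reported, so the
# hit count below is an inferred approximation, not a reported number.
in_stock_hits_approx = round(0.48 * 701)

print(f"On-demand hit rate: {on_demand_pct} percent")
print(f"In-stock hits implied by 48 percent of 701: about {in_stock_hits_approx}")
```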

“When I was talking to the grad students who led this work, we gave our list of predictions and people said we’ll be happy if we get one single hit back,” Gitter says. “We had a very low bar and had no idea what to expect. So we were blown away compared to what others in the field have seen.”

Using a map as an analogy, Gitter says that all drugs currently approved by the Food and Drug Administration (FDA) might represent one tiny island within a sea of unexplored chemical space. “And there could be some real home runs out there in that unexplored space,” Gitter says. “But if you want to find them, you need to know what region to look in.”

This initial exploration is essential to drug discovery, Gitter says. The better scientists get at narrowing the geography of their searches, the more likely they are to reach the targets of interest. And finding an interesting compound is only the beginning of a process that includes identifying side effects and toxicity, and ultimately clinical trials.

Fast and efficient machine learning can help sort through the billions to “find the special, special few,” Gitter says.