Machine learning our way through a billion-chemical library

Academic scientists are on the front lines of the search for promising new drugs that might disrupt a disease or fight a microbial invader. But how does a modestly funded research lab find the right match among literally a billion known chemical candidates?

Inspired by the need for smarter and cheaper ways to screen chemicals against potential therapeutic targets, Morgridge Institute for Research investigator Anthony Gitter is turning to machine learning tools to help tame that billion-chemical library. Gitter is developing a computational platform that can make chemical screening accessible to more research groups and enable drug discovery without the massive price tag.

Right now, most scientists can only afford to screen a small fraction of known chemicals, and the odds are slim of getting a medically important “hit,” Gitter says.

“Scientists are not going to get lucky and find the right candidate by testing 10,000 chemicals, or even 100,000 chemicals,” says Gitter. “But we have this really ambitious goal of considering the entire catalog of chemicals while only testing a much, much smaller fraction in the lab. To get there, you need to have a really good machine learning system that can help you figure out which of the billion to test.”

The essence of Gitter’s approach is having the machine learning algorithms train on smaller batches of chemicals that have already been tested and have well-defined results. Once the algorithm is trained to recognize the features of good chemical candidates, the system can be unleashed on massive commercial chemical databases, such as Enamine REAL, to winnow down a much smaller subset of chemicals worthy of purchasing and testing.
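In outline, that workflow looks something like the sketch below. This is not the Gitter Lab's actual code: the file names are placeholders, and RDKit Morgan fingerprints plus a scikit-learn random forest stand in for whatever featurization and models the real system uses.

```python
# Minimal sketch of "train on tested chemicals, then score a huge library."
# Placeholder file names; RDKit + scikit-learn chosen purely for illustration.
import csv
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def load_smiles_csv(path):
    """Read (SMILES, active-flag) pairs from a two-column CSV with no header."""
    smiles, labels = [], []
    with open(path) as handle:
        for smi, active in csv.reader(handle):
            smiles.append(smi)
            labels.append(int(active))
    return smiles, np.array(labels)

def featurize(smiles_list):
    """Turn SMILES strings into 2048-bit Morgan fingerprints (assumes all parse)."""
    fps = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        bits = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
        arr = np.zeros((2048,), dtype=np.int8)
        DataStructs.ConvertToNumpyArray(bits, arr)
        fps.append(arr)
    return np.vstack(fps)

# 1. Train on chemicals that have already been tested, with well-defined results.
train_smiles, train_labels = load_smiles_csv("screened_chemicals.csv")  # placeholder
model = RandomForestClassifier(n_estimators=500, n_jobs=-1)
model.fit(featurize(train_smiles), train_labels)

# 2. Score the much larger virtual library and keep only the top-ranked
#    compounds as candidates worth purchasing and testing in the lab.
with open("virtual_library.smi") as handle:                              # placeholder
    library_smiles = [line.split()[0] for line in handle if line.strip()]
scores = model.predict_proba(featurize(library_smiles))[:, 1]
shortlist = [library_smiles[i] for i in np.argsort(scores)[::-1][:100]]
```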

Gitter found the ideal test case for his algorithms in the lab of James Keck, a UW–Madison professor of biomolecular chemistry. The Keck lab in 2019 was completing a multi-year search for new antibacterial drug candidates, looking at non-traditional targets such as protein-protein interfaces. New approaches are desperately needed because of the rise of antibiotic-resistant bacteria and the limited returns from conventional techniques.

The project yielded a database of about 400,000 chemicals, most of which showed no ability to disrupt the bacteria, though a fair number did. Gitter used a subset of 100,000 chemicals, training the algorithm on 75,000 and holding out the remaining 25,000 as a blind test set. The algorithm then probed the blind set for potential hits and was remarkably accurate.
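Conceptually, that blind-set evaluation resembles the fragment below, which reuses the hypothetical load_smiles_csv and featurize helpers from the earlier sketch. The 75,000/25,000 split follows the article, but the file name and the metrics printed are illustrative assumptions, not the study's actual evaluation protocol.

```python
# Illustrative holdout test: train on 75,000 labeled chemicals and check how
# well the model recovers known hits among the 25,000 held-out (blind) ones.
# Reuses load_smiles_csv/featurize from the earlier sketch; metrics are examples.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

smiles, labels = load_smiles_csv("keck_screen_100k.csv")   # placeholder file name
X, y = featurize(smiles), labels

X_train, X_blind, y_train, y_blind = train_test_split(
    X, y, train_size=75_000, test_size=25_000, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=500, n_jobs=-1)
model.fit(X_train, y_train)

blind_scores = model.predict_proba(X_blind)[:, 1]
print("Blind-set ROC AUC:", roc_auc_score(y_blind, blind_scores))

# Precision among the top-ranked blind compounds mirrors the purchase decision:
# if only the top 100 could be bought, how many would have been true hits?
top_100 = np.argsort(blind_scores)[::-1][:100]
print("True hits in top 100:", int(y_blind[top_100].sum()))
```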

The next step was to use high-throughput computing to apply the trained algorithm to the vast virtual library of chemicals, fishing for promising candidates. The Gitter Lab ended up purchasing 68 chemicals from this virtual screen, and an astonishing 31 of them showed some ability to block the targeted bacterial protein-protein interaction. That nearly 50 percent hit rate compares with a typical hit rate of about one in 1,000 for random screens.
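Using only the numbers quoted above, the implied enrichment over random screening works out to roughly 450-fold, a back-of-the-envelope figure rather than one reported in the study:

```python
# Back-of-the-envelope enrichment implied by the numbers above.
hits, purchased = 31, 68
hit_rate = hits / purchased        # ~0.456, i.e. almost 50 percent
baseline = 1 / 1000                # typical hit rate quoted for random screens
print(f"hit rate {hit_rate:.1%}, ~{hit_rate / baseline:.0f}x enrichment over random")
```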

“This was the first big test that tells us this is more than just a crazy idea,” Gitter says. “And now the challenge is going to be how we do this with less and less data to begin with.”

Keck says the project was valuable to his lab, which is already preparing a second partnership looking at cancer chemotherapy targets.

“The unique thing that Tony did was, there were maybe a half-dozen different algorithms people have come up with to achieve this goal, but nobody has brought them all together. The genius of this is he brought them all together and let them learn from each other.”

Keck says this system really plays to the strengths of academic labs, which are very good at identifying interesting biological targets, but less capable of the expensive translational work. “This makes it so you could take a prohibitively expensive screen and essentially enrich for hits before you even have to pay a dollar, based on just an initial piece of the screen,” he says.

The median estimated cost of bringing a drug full-circle to market — from discovery to clinical trials to patient — was $2.6 billion in 2019, according to the Tufts Center for the Study of Drug Development. Chemical screening is only a small fraction of that, but it’s enough of a cost to keep many investigators out of the game.

Making this process more accessible to mainstream science could have an outsized effect on drug discovery, Gitter says. “The more failures you eliminate from the drug discovery pipeline, the more it starts to bring down net costs to develop a new drug,” he says.

The UW–Madison Center for High-Throughput Computing is helping build the machine learning system into a platform that can be used by scientists around the country. The project is supported by the National Institutes of Health.