Building better proteins with machine learning - Morgridge Institute for Research

Neural networks and machine learning once seemed like far-fetched futuristic concepts but are now proven successful tools that can help scientists approach big problems (and big datasets).

In a recent study published in the Proceedings of the National Academy of Sciences (PNAS), Morgridge investigator Anthony Gitter and his lab demonstrated that a machine learning model could predict new protein sequences that improve protein function.

“There’s this really complicated relationship between how that sequence changes, and what it’s actually going to do to the protein and the property that you care about,” says Gitter.

A protein is a sequence that can be thousands of characters long, with each character drawn from the 20 different amino acids that serve as building blocks. The sequence determines how the protein folds into a three-dimensional shape, and the shape determines its function.
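To make the idea of a protein as a string over a 20-letter alphabet concrete, here is a minimal sketch of one-hot encoding, a common way to turn such a sequence into numbers a neural network can read. This is an illustration only: the article does not say which encoding the study used, and the function name `one_hot` and the example sequence "MKT" are assumptions made for this sketch.

```python
# The 20 standard amino acids, written as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Return one row per residue: 20 zeros with a single 1 marking the amino acid.

    Hypothetical helper for illustration; not taken from the study.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1  # raises KeyError for a non-standard letter
        rows.append(row)
    return rows


# A made-up three-residue sequence, far shorter than any real protein.
encoded = one_hot("MKT")
```

Each residue becomes a row of 20 numbers, so a protein of length N becomes an N-by-20 grid that a model can take as input.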

“Almost everything that happens in a cell is because some protein has the right shape to do that job,” Gitter says.

Changing even a single amino acid in the sequence could drastically alter the shape and function of a protein. In most cases, proteins are unaffected or simply fall apart due to an unstable structure. But what if there were a way to home in on changes that would make the protein better at its job?

A fluorescent protein could shine brighter and improve the way biologists visualize cellular activity under a microscope. Or a protein receptor could bind more efficiently to important biological molecules.

“The first really basic question is, do we have machine learning tools and technology right now that can do a good job of modeling and understanding the data that we’ve already collected?” says Gitter.

Sam Gelman, a graduate student in the Gitter Lab, led the design and testing of multiple neural network models that learn about biological structure and function from existing datasets for several well-known proteins, such as green fluorescent protein (GFP) and the protein G B1 domain (GB1), which binds to immunoglobulin G (IgG) antibodies.

“Then if machine learning models are good enough, you kind of let them level up to the next set of more interesting biological questions,” Gitter adds.

Once the machine learning software could extrapolate meaningful information from the sequences, the next step was to test its ability to make predictions about how to design new protein sequences that could affect function.

A major advantage of machine learning is that it can analyze high-throughput datasets and pull the best predictions out of millions of sequence combinations, a needle-in-a-haystack search that would be impossible to do experimentally in the lab.
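The needle-in-a-haystack search can be sketched in a few lines: enumerate candidate sequences, score each one with a model, and keep the highest-scoring few. The sketch below is a toy under stated assumptions, not the study's method. `toy_score` is a hypothetical stand-in for a trained model (it simply counts one amino acid), the sequence "MKTAY" is made up, and only single-substitution variants are enumerated.

```python
import heapq

# The 20 standard amino acids, written as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_mutants(wild_type):
    """Yield every sequence that differs from wild_type at exactly one position."""
    for pos, original in enumerate(wild_type):
        for aa in AMINO_ACIDS:
            if aa != original:
                yield wild_type[:pos] + aa + wild_type[pos + 1:]


def toy_score(sequence):
    """Hypothetical stand-in for a trained model: counts tryptophans (W)."""
    return sequence.count("W")


def top_candidates(wild_type, score, k=5):
    """Return the k highest-scoring single mutants without storing them all."""
    return heapq.nlargest(k, single_mutants(wild_type), key=score)


# "MKTAY" is a made-up five-residue sequence used only for illustration.
best = top_candidates("MKTAY", toy_score, k=3)
```

Even this five-residue toy has 95 single mutants, and the count grows rapidly once several positions change at once, which is why scoring candidates in software before synthesizing a handful in the lab is attractive.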

The Gitter Lab worked in close collaboration with the Phil Romero Lab at UW–Madison to fully realize this intersection of machine learning and protein engineering.

The researchers identified five variations of the GB1 protein sequence, which the Romero Lab synthesized into proteins to test their function.

“We actually came up with a new version of a protein that works much better than anything that’s been observed naturally before or anything that’s been engineered before,” Gitter says.

Their new protein—identified as Design10 because it contained 10 amino acid mutations—had a similar structure to GB1, but could bind to IgG antibodies with more than 20 times the affinity of the natural protein.

While this research is a proof-of-concept, Gitter says that protein engineering has huge potential in biomedical research.

“We could treat disease by creating new avenues to respond to drugs or create antibody-based therapeutics,” he says.

Gitter acknowledges that this computational work is built on the foundation of basic research and the experimental work done by groups like the Romero Lab. And the machine learning models are only as good as the datasets on which they are trained.

“The best models can learn a more accurate predictive model with less example data than other models can,” he says. “Where we would really like to go with this in the future is doing less wet lab experimental work to build up these experimental datasets—let the machine learning models step in so that you’re getting to the needle in the haystack a lot sooner.”