Flip a coin ten times and get ten straight “tails,” and you might have something statistically interesting.
Flip the same coin a million times and get a few runs of ten straight tails, and you can chalk it up to random chance.
Scott Hebbring relates this coin-flip analogy to his scientific efforts to divine meaningful patterns from big data on human genetics and disease. However, there’s one notable difference: His coin needs to be flipped 64 billion times.
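The arithmetic behind the analogy can be sketched in a few lines of Python. This is an illustration only, assuming a fair coin. The Bonferroni-style threshold at the end is a common textbook correction for multiple testing; the article does not say which correction Hebbring's project uses, and the function names are hypothetical.

```python
from fractions import Fraction


def prob_all_tails(n_flips: int) -> Fraction:
    """Exact probability that n_flips tosses of a fair coin all land tails."""
    return Fraction(1, 2) ** n_flips


def expected_all_tails_windows(total_flips: int, run_length: int) -> Fraction:
    """Expected number of length-run_length windows that are all tails.

    Windows overlap, so one long streak is counted several times;
    linearity of expectation still gives the exact expected count.
    """
    windows = max(total_flips - run_length + 1, 0)
    return windows * prob_all_tails(run_length)


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance cutoff under a Bonferroni correction (assumed here)."""
    return alpha / n_tests


# Ten flips, ten tails: rare for a single experiment.
p_single = prob_all_tails(10)

# A million flips: all-tails windows are expected to show up by chance alone.
expected_runs = float(expected_all_tails_windows(1_000_000, 10))

# 8 million markers times 8,000 phenotypes, as quoted in the article.
n_tests = 8_000_000 * 8_000
threshold = bonferroni_threshold(0.05, n_tests)

print(p_single, round(expected_runs, 1), n_tests, threshold)
```

The point of the sketch is the scale: the more tests that are run, the stricter each individual test has to be before a result counts as interesting.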
Hebbring, a Research Scientist at Marshfield Clinic, is leading a massive human health data project that will produce a combined database of association results for thousands of human diseases and millions of genetic markers that may influence them. Using a unique biobank of Marshfield Clinic patients who donated DNA, Hebbring uses genome-wide data on genetic variants from about 11,000 people. That data provides about 8 million different genetic markers that could have some relevant connection to disease.
In addition, the project combines disease information from Marshfield’s extensive electronic health records, which include about 8,000 different disease phenotypes. The ultimate goal is to be able to query all of this data to identify matches between diseases and genetic variants in a timely and efficient manner.
Needless to say, collecting 8 million markers, 8,000 diseases and 64 billion potential combinations requires serious computing horsepower. Enter the HTCondor high-throughput computing technology pioneered by Miron Livny, director of core computation at the Morgridge Institute for Research and UW–Madison professor of computer science. HTCondor is a system for executing massive amounts of computing work and currently supports more than 300 research projects at UW–Madison and across the world.
Hebbring has been using HTCondor for several years, and was one of its biggest users in 2017, trailing only a couple of particle physics mega-groups. In all, Hebbring’s project has generated more than 9 million computer hours on HTCondor — the equivalent of more than 1,000 years of computer time. Hebbring estimates that a cloud-based commercial tool to complete this project would have cost $300,000 or more, while HTCondor is a free technology for scientists. Research computing facilitator Lauren Michael has provided consulting over the life of the project.
“Our goal is to better understand the genetics of human disease and this is going to be a tool to help us do that efficiently,” Hebbring says. “So anytime someone is interested in any disease or any gene we have in our database, we’ll be able to pull that information out within minutes.”
Hebbring’s dataset will contain 8,000 Genome-Wide Association Studies (known as GWAS). A GWAS study starts with one specific disease, then gathers genetic information on scores of people who either have or don’t have the disease. A GWAS can then evaluate whether a variant is more common in individuals with the disease compared to those without the disease.
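As a rough sketch of what "more common in individuals with the disease" means for a single variant, the comparison can be written as a two-by-two table of carriers and non-carriers among cases and controls. The code below computes an odds ratio and a Pearson chi-square test without continuity correction. This is an assumption made for illustration: the article does not describe the statistical model used in the project, and real GWAS pipelines typically use regression models with covariates. The class name, function names and counts are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantCounts:
    """Carrier counts for one variant and one disease (hypothetical layout)."""
    cases_with: int        # people with the disease who carry the variant
    cases_without: int     # people with the disease who do not
    controls_with: int     # people without the disease who carry the variant
    controls_without: int  # people without the disease who do not


def odds_ratio(c: VariantCounts) -> float:
    """Odds of carrying the variant in cases relative to controls."""
    return (c.cases_with * c.controls_without) / (c.cases_without * c.controls_with)


def chi_square_test(c: VariantCounts) -> tuple[float, float]:
    """Pearson chi-square statistic (1 degree of freedom) and its p-value."""
    a, b, cc, d = c.cases_with, c.cases_without, c.controls_with, c.controls_without
    n = a + b + cc + d
    margins = (a + b) * (cc + d) * (a + cc) * (b + d)
    if margins == 0:
        raise ValueError("every row and column of the table needs a nonzero total")
    chi2 = n * (a * d - b * cc) ** 2 / margins
    # For 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts, for illustration only: 20% of cases carry the variant
# versus 10% of controls.
example = VariantCounts(cases_with=100, cases_without=400,
                        controls_with=100, controls_without=900)
example_or = odds_ratio(example)
example_chi2, example_p = chi_square_test(example)
print(example_or, example_chi2, example_p)
```

A full GWAS repeats a test of this kind for every variant against one disease, which is why the number of tests grows so quickly.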
A quick look at research journals and news archives reveals thousands of published GWAS studies. In the past few months alone, different GWAS studies have focused on asthma, stroke, diabetes and Parkinson’s, and even brain-related questions such as insomnia and depression. All these conditions will be captured simultaneously in Hebbring’s dataset.
Hebbring says the major drawback of GWAS studies is that they are time-consuming and each one represents a full-circle research project, meaning scientists are reinventing the wheel over and over again. Hebbring’s approach, by contrast, incorporates all known diseases so that scientists can query, within minutes, information on phenotypes and associated genes.
In addition to the thousands of GWASs being generated, Hebbring’s dataset will also include 8 million phenome-wide association studies (PheWAS). Whereas a GWAS searches for variants associated with a disease, a PheWAS identifies diseases associated with a genetic variant. The phenome-based approach may be especially valuable for drug repurposing, Hebbring says. Some diseases will be found to share common genetics and common biology, so drugs that are effective for one disease may be repurposed to treat another disease with a related genetic footprint.
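One way to picture the relationship between the two study types is as two views of the same table of association results, indexed by variant and phenotype. A GWAS reads the table one phenotype at a time, and a PheWAS reads it one variant at a time. The sketch below assumes a simple in-memory dictionary of p-values. The names and numbers are placeholders, not real results, and the article does not describe how the project's database is actually organized.

```python
from typing import Dict, List, Tuple

# (variant, phenotype) -> p-value. Hypothetical layout for illustration.
AssociationTable = Dict[Tuple[str, str], float]


def gwas(table: AssociationTable, phenotype: str) -> List[Tuple[str, float]]:
    """All variants tested against one phenotype, strongest association first."""
    hits = [(variant, p) for (variant, pheno), p in table.items() if pheno == phenotype]
    return sorted(hits, key=lambda item: item[1])


def phewas(table: AssociationTable, variant: str) -> List[Tuple[str, float]]:
    """All phenotypes tested against one variant, strongest association first."""
    hits = [(pheno, p) for (var, pheno), p in table.items() if var == variant]
    return sorted(hits, key=lambda item: item[1])


# Placeholder values only; these are not findings from the project.
toy_table: AssociationTable = {
    ("variant_A", "disease_X"): 1e-9,
    ("variant_A", "disease_Y"): 0.4,
    ("variant_B", "disease_X"): 0.2,
    ("variant_B", "disease_Y"): 3e-8,
}

gwas_for_x = gwas(toy_table, "disease_X")
phewas_for_a = phewas(toy_table, "variant_A")
print(gwas_for_x)
print(phewas_for_a)
```

Seen this way, a table with 8,000 phenotypes and 8 million variants holds 8,000 GWAS and 8 million PheWAS at once, which matches the counts given in the article.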
“There also is the hope that eventually all of this genetic information will have predictive value and clinical utility,” he says. “Currently, most of the common variants associated with human disease have fairly weak genetic effects. This project will better show the genetic complexities of diseases, which will help us generate better models to understand and predict individual risk.”
Marshfield Clinic is a leader in research using electronic health records, given its unique history of collecting data across the lifespan of patients. Unlike many major medical centers, where patients may be referred for short-term treatments, Marshfield’s largely rural patient base often seeks care throughout life, across many generations.
“Even though we’re a small institution, we’ve been able to compete with larger health care centers because of our unique data,” Hebbring says.