The PubMed database contains more than 33 million papers that represent the ‘collective consciousness’ of what humans know about biomedicine. It is impossible for researchers to keep up with this vast literature where more than 1,000 new papers get added daily.
One challenge scientists face is winnowing down a long list of targets to a short list worth pursuing in the wet lab. For example: “Which of the 2,000 transcription factor genes are important for the cardiomyocyte cell state?” (Transcription factors are genes that turn other genes on and off, and thus are critical for establishing cell states.)
One strategy would be to perform database searches that indicate links between each of the 2,000 transcription factors (“targets”) and the key phrase “cardiomyocyte.” “If a scientist wanted to do this same thing manually with a search in PubMed, they would have to do more than 2,000 separate searches, one for each transcription factor,” says Stewart. Furthermore, a massive manual search doesn’t provide a robust method for measuring the relevance of each target.
KinderMiner Web tackles the challenge by automatically searching through the 33 million PubMed abstracts for connections between each of the targets and the key phrase. It also provides a fast and comprehensive way to prioritize the most relevant targets, steering scientists to a manageable list of the top targets for wet lab validation.
The web application is now free and available to the scientific world.
KinderMiner works by performing statistical analysis of the co-occurrence of the key phrase with each target in the target list. Users can choose from provided curated lists (such as genes, transcription factors, drugs and devices, or diseases), or the user can provide their own list of target terms.
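To make the idea of co-occurrence analysis concrete, here is a minimal sketch in Python. It does not reproduce the KinderMiner code. The article does not say which statistical test is used, so the sketch assumes a one-sided Fisher's exact test on a 2x2 table of abstract counts, and it assumes simple case-insensitive substring matching. The function names and the tiny corpus are invented for illustration only.

```python
from math import comb

def fisher_right_tail(both, target_only, phrase_only, neither):
    """One-sided Fisher's exact test: probability of seeing at least
    `both` co-occurrences by chance, given the row and column totals."""
    n_target = both + target_only      # abstracts mentioning the target
    n_phrase = both + phrase_only      # abstracts mentioning the key phrase
    total = n_target + phrase_only + neither
    upper = min(n_target, n_phrase)
    tail = sum(
        comb(n_target, k) * comb(total - n_target, n_phrase - k)
        for k in range(both, upper + 1)
    )
    return tail / comb(total, n_phrase)

def rank_targets(abstracts, targets, key_phrase):
    """Return (target, co-occurrence count, p-value) tuples sorted so the
    most significant association comes first."""
    key = key_phrase.lower()
    texts = [a.lower() for a in abstracts]
    results = []
    for target in targets:
        t = target.lower()
        both = target_only = phrase_only = neither = 0
        for text in texts:
            has_t, has_k = t in text, key in text
            if has_t and has_k:
                both += 1
            elif has_t:
                target_only += 1
            elif has_k:
                phrase_only += 1
            else:
                neither += 1
        p = fisher_right_tail(both, target_only, phrase_only, neither)
        results.append((target, both, p))
    return sorted(results, key=lambda r: r[2])

# Invented toy corpus, NOT real PubMed abstracts.
toy_abstracts = [
    "GATA4 drives cardiomyocyte differentiation",
    "GATA4 is expressed in cardiomyocyte progenitors",
    "GATA4 and cardiomyocyte survival",
    "SOX2 maintains neural stem cells",
    "SOX2 in pluripotency",
    "cardiomyocyte contraction and calcium handling",
    "liver metabolism review",
    "SOX2 regulates neural fate",
]
ranked = rank_targets(toy_abstracts, ["SOX2", "GATA4"], "cardiomyocyte")
for target, count, p in ranked:
    print(f"{target}: {count} co-occurrences, p = {p:.4f}")
```

In this toy corpus GATA4 appears with “cardiomyocyte” in three abstracts and never without it, so it ranks ahead of SOX2, which never co-occurs with the key phrase. A real run would count over millions of abstracts, and would likely need more careful term matching than plain substrings.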
“This tool gives us a more robust ranking than what humans could do on their own,” Stewart says. “Even if someone had the wherewithal and patience to do thousands of individual searches, humans will always be biased in terms of what they originally thought was important. KinderMiner performs the searches programmatically and provides a robust measure of relevance for ranking the targets.”
The team published the research behind the work in the journal F1000.
“We see the tool as particularly useful when people are coming into a new area of biomedical research,” says Stewart. “They just want to find out, what are the genes that are most important to a particular process? Or what drugs are most likely to be relevant to this condition?”
One way to test the effectiveness of KinderMiner is to let it “read” only abstracts published before a scientific discovery was made. In 2010, scientists discovered three transcription factors, GATA4, TBX5, and MEF2C, important for turning skin cells into the cells of the beating heart (cardiomyocytes) in a process called cellular reprogramming. Being able to make patient-specific cardiomyocytes from a patient’s own skin cells is crucial for personalized toxicology testing, disease modeling, and other tasks.
For this example, Stewart’s group ran KinderMiner, only allowing it to access and read abstracts from 2008 and before. GATA4, TBX5, and MEF2C were in the top 20 of the list of the more than 2,000 transcription factors. Thus, KinderMiner was able to prioritize the top factors two years before the scientific discovery. The hope is that KinderMiner can be used to prioritize targets and facilitate faster discoveries in the future, Stewart says.
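The retrospective test described above can be sketched as follows: keep only abstracts published up to a cutoff year, rank the targets on that reduced corpus, and then look up where the later-discovered factors land. This is an illustrative sketch, not the team's procedure. The `Abstract` record, the helper names, the toy records, and the use of raw co-occurrence counts as a stand-in ranking score are all assumptions made for the example.

```python
from typing import NamedTuple

class Abstract(NamedTuple):
    year: int   # publication year
    text: str   # abstract text

def abstracts_up_to(records, cutoff_year):
    """Keep only abstracts published in or before the cutoff year."""
    return [r for r in records if r.year <= cutoff_year]

def cooccurrence_counts(records, targets, key_phrase):
    """Count abstracts mentioning both the target and the key phrase
    (case-insensitive substring match, a simplifying assumption)."""
    key = key_phrase.lower()
    counts = {}
    for target in targets:
        t = target.lower()
        counts[target] = sum(
            1 for r in records
            if t in r.text.lower() and key in r.text.lower()
        )
    return counts

def positions(ranked_targets, known_hits):
    """1-based rank of each known hit within the ranked target list."""
    return {hit: ranked_targets.index(hit) + 1 for hit in known_hits}

# Invented toy records, NOT real PubMed abstracts.
records = [
    Abstract(2005, "TBX5 in cardiomyocyte development"),
    Abstract(2007, "TBX5 and cardiomyocyte gene expression"),
    Abstract(2008, "MEF2C regulates cardiomyocyte genes"),
    Abstract(2006, "SOX2 in neural stem cells"),
    Abstract(2010, "SOX2 and cardiomyocyte, published after the cutoff"),
]
targets = ["SOX2", "MEF2C", "TBX5"]

early = abstracts_up_to(records, 2008)
counts = cooccurrence_counts(early, targets, "cardiomyocyte")
ranked = sorted(targets, key=lambda t: -counts[t])
hit_ranks = positions(ranked, ["TBX5", "MEF2C"])
print(ranked, hit_ranks)
```

Because the 2010 record is filtered out before counting, nothing published after the cutoff can influence the ranking, which is what makes the check a fair test of whether the earlier literature already pointed at the later discovery.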
KinderMiner can also be useful for validating results. A UW–Madison research team was analyzing medical records for phenotypes that are associated with Fragile X premutation, a genetic condition that can result in developmental and cognitive impairment of offspring. They were able to use KinderMiner as a validating step in their work – to confirm that the phenotypes associated with Fragile X premutation were backed up by statistical analysis of the scientific literature.
Stewart and his team have been developing and using KinderMiner for years to support the work of Morgridge stem cell pioneer James Thomson and other researchers at Morgridge and on the UW campus. Stewart is looking forward to expanding the scope of KinderMiner. The web application will allow anyone to use this powerful algorithm.
Other data corpuses may be added to the system, such as the electronic archives of patents related to biology and medical technology. The team also wants to keep expanding the lexicons of prefilled lists of biomedical targets. Stewart’s group is working on a second algorithm called “Serial KinderMiner” that should be more powerful in discovering links between targets and key phrases.
Stewart thanks the Morgridge Institute for Research for funding Finn Kuusisto, the postdoctoral researcher who led this project. Stewart also thanks Marv Conney, Donna Green, and Marilyn Pinkley for providing funding for this project.