SKiM search tool goes beyond the surface to discover hidden links in biomedical data

The PubMed database contains more than 35 million peer-reviewed papers from biomedical publications — a vast resource where any one scientist is unlikely to keep up with the breadth of knowledge available.

At the Morgridge Institute, the Stewart Computational Biology Group is developing text mining tools to help researchers search and find relevant information within this database more quickly.

In work recently published in the journal BMC Bioinformatics, the team developed a tool called Serial KinderMiner (SKiM) that can make connections and uncover potentially hidden associations between a set of terms.

Building on the group's original KinderMiner web application, which finds links between two terms, the SKiM algorithm uses literature-based discovery to find unknown links between terms “A” and “C” through a “B” intermediary.

“Instead of just finding existing A-B links, you can find A-B-C links which implies that A and C might be related,” says Rob Millikin, computational biologist and co-first author of the paper. “It’s not just summarizing existing knowledge, but you can potentially discover new things.”

This application is particularly useful for investigating new uses for existing drugs.

For example, if term A is a particular disease and the C terms are the list of all existing FDA-approved drugs, the B terms might be a list of known genes or biological pathways. Some of those B terms are associated with both A (the disease) and C (a drug), linking the two together.
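The A-B-C idea described above can be sketched in a few lines of Python. This is a minimal illustration only, not the SKiM implementation: the toy corpus, the term names (`disease_x`, `gene_1`, `drug_p` and so on), and the helper functions are all hypothetical, and a plain co-occurrence check stands in for whatever statistical test the real tool applies.

```python
from collections import defaultdict

# Toy corpus: each "abstract" is reduced to the set of terms it mentions.
# All terms and documents are made up for illustration.
ABSTRACTS = [
    {"disease_x", "gene_1"},
    {"disease_x", "gene_1", "gene_2"},
    {"gene_1", "drug_p"},
    {"gene_2", "drug_q"},
    {"gene_3", "drug_r"},
    {"disease_x", "drug_q"},
]


def cooccurrence_counts(term, candidates, abstracts):
    """Count abstracts that mention `term` together with each candidate."""
    counts = defaultdict(int)
    for terms in abstracts:
        if term in terms:
            for candidate in candidates & terms:
                counts[candidate] += 1
    return dict(counts)


def abc_links(a_term, b_terms, c_terms, abstracts):
    """Return {c: [b, ...]} for C terms reachable from A through a B term."""
    links = defaultdict(list)
    for b in sorted(cooccurrence_counts(a_term, b_terms, abstracts)):
        for c in sorted(cooccurrence_counts(b, c_terms, abstracts)):
            links[c].append(b)
    return dict(links)


genes = {"gene_1", "gene_2", "gene_3"}
drugs = {"drug_p", "drug_q", "drug_r"}
links = abc_links("disease_x", genes, drugs, ABSTRACTS)

# Drugs never mentioned alongside the disease are the "hidden" candidates.
direct = cooccurrence_counts("disease_x", drugs, ABSTRACTS)
hidden = {c: bs for c, bs in links.items() if c not in direct}
print(hidden)  # {'drug_p': ['gene_1']}
```

In this toy data, `drug_q` already appears in an abstract with the disease, so only `drug_p` is reported as a candidate that is linked to the disease solely through an intermediate gene.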

“In some cases for the drug repurposing application, we might find drugs that don’t show up with a clinical trial or even a co-occurrence together with the disease,” says Morgridge Investigator Ron Stewart. “If we can give a researcher a two-year or a five-year head start on what drugs might be useful, that’s going to be potentially very helpful.”

SKiM is the product of a true collaboration over the years among many members of the Stewart Computational Biology Group. Millikin developed much of the back-end programming that makes SKiM run smoothly, while front-end developer Cannon Lock (from Miron Livny and Brian Bockelman’s Research Computing Group) helped make the web interface that users see more robust.

“It was important to us to have a free, publicly available tool where people could use this and it would be practical,” Millikin says.

The success of SKiM is due to improved indexing and efficient use of memory. Instead of reading through all 35 million papers for every single query, SKiM relies on pre-processing steps that Millikin developed, so that each search term only needs to be checked against a smaller set of abstracts. This allows SKiM to operate more quickly.
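One common way to avoid rescanning every document per query is an inverted index, which maps each term to the documents that contain it. The sketch below assumes that general technique for illustration; the article does not specify how SKiM's index is built, and the abstracts, IDs, and function names here are hypothetical.

```python
from collections import defaultdict

# Toy abstracts keyed by a made-up ID; the text is illustrative only.
ABSTRACT_TEXT = {
    101: "gene_1 is upregulated in disease_x.",
    102: "drug_p inhibits gene_1.",
    103: "disease_x patients show elevated gene_1 and gene_2.",
}


def build_index(abstracts):
    """Pre-processing step: map each word to the IDs of abstracts containing it."""
    index = defaultdict(set)
    for doc_id, text in abstracts.items():
        for word in text.lower().split():
            index[word.strip(".,;:()")].add(doc_id)
    return index


def count_cooccurrences(index, term_a, term_b):
    """Count abstracts mentioning both terms by intersecting two ID sets."""
    a_docs = index.get(term_a.lower(), set())
    b_docs = index.get(term_b.lower(), set())
    return len(a_docs & b_docs)


index = build_index(ABSTRACT_TEXT)
print(count_cooccurrences(index, "gene_1", "disease_x"))  # 2
print(count_cooccurrences(index, "drug_p", "disease_x"))  # 0
```

The index is built once up front. Each query then touches only the ID sets for its own terms rather than the full text of every abstract.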

Millikin says that with the current server architecture, the front end hands off information to the back end, much like assigning jobs to workers. The Stewart Group servers currently run 10 workers, which can handle up to 10 different jobs at once.
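The hand-off from front end to workers resembles a standard worker-pool pattern. The following is a rough sketch using Python's standard library, not the group's actual server code: `run_query` is a hypothetical stand-in for a real search job, and a thread pool is assumed purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 10  # mirrors the 10 workers mentioned in the article


def run_query(job):
    """Hypothetical stand-in for one back-end search job."""
    a_term, c_term = job
    return f"{a_term} -> {c_term}: done"


def dispatch(jobs, num_workers=NUM_WORKERS):
    """Hand jobs to a fixed pool; at most `num_workers` run at the same time."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() returns results in the same order the jobs were submitted.
        return list(pool.map(run_query, jobs))


jobs = [("disease_x", f"drug_{i}") for i in range(25)]
results = dispatch(jobs)
print(len(results), results[0])  # 25 disease_x -> drug_0: done
```

With 25 jobs and 10 workers, the extra jobs wait in a queue until a worker is free, which is why adding workers shortens the total wait.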

Stewart adds that in the future, they plan to partner with the Center for High Throughput Computing, which means they could utilize 100 workers in parallel — or even one thousand or one million workers — which would make things even faster. 

Adding to the robust capabilities of SKiM, Millikin used a transformer machine-learning model (similar to ChatGPT) to develop a “knowledge graph” to understand and label the relationships between A, B, and C search terms.

“The goal was to provide the sort of qualitative labels to make it easier for a human to sort of skim through and read (no pun intended),” says Millikin.

By using terms like positive vs. negative associations or inhibition vs. activation, SKiM can help interpret deeper relationships, taking the quantitative statistical data and then applying a qualitative descriptor to the link between the terms.

“There are co-occurrence models out there, but very few of them have any idea about what the relationship is between the term. All those models can do is say A is related to B in some way,” says Stewart. “But now we can put a word on it and say, A treats B, or A inhibits B, or A binds B. So that’s pretty promising.”
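A knowledge graph of this kind is often stored as labeled edges, each pairing two terms with a relationship word. The sketch below shows one possible shape for such data. The `Edge` type, the `describe_paths` helper, the example terms, and the numeric scores are all hypothetical and are not taken from SKiM.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    subject: str
    relation: str  # qualitative label, e.g. "treats", "inhibits", "binds"
    obj: str
    score: float   # hypothetical quantitative association score


# Made-up edges for illustration only.
EDGES = [
    Edge("drug_p", "inhibits", "gene_1", 0.92),
    Edge("gene_1", "is positively associated with", "disease_x", 0.81),
    Edge("drug_q", "binds", "gene_2", 0.40),
]


def describe_paths(edges, start, end):
    """Return readable two-hop chains of the form start -> middle -> end."""
    paths = []
    for first in edges:
        if first.subject != start:
            continue
        for second in edges:
            if second.subject == first.obj and second.obj == end:
                paths.append(
                    f"{first.subject} {first.relation} {first.obj}; "
                    f"{second.subject} {second.relation} {second.obj}"
                )
    return paths


for line in describe_paths(EDGES, "drug_p", "disease_x"):
    print(line)
# drug_p inhibits gene_1; gene_1 is positively associated with disease_x
```

Keeping the label next to the score lets a reader see both how strong a link is and what kind of link it is.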

Millikin adds that the team wants to emphasize transparency and trust, so that users feel confident in the tool and in where its information comes from.

“Rob did a great job of making the code really reproducible and solid. So much that one of our reviewers actually said ‘this is a great example of how to make the gold standard for scientific reproducibility’,” adds Stewart. “Which I think is really cool — something that we’ve been stressing.”