Our lab develops research software for analyzing large-scale biological datasets. Lab members developed or contributed to the tools below. Also see our GitHub organization.
- Manubot: Manuscripts, Open and Automated
Manubot is a workflow for writing and distributing scholarly manuscripts. It uses GitHub to coordinate large-scale collaborative writing and automates many aspects of the writing process, such as citation. Daniel Himmelstein led the development with many collaborators.
References: Himmelstein et al. 2019, Rando et al. 2021, Manubot website
- nn4dms: Neural networks for deep mutational scanning data
nn4dms is a deep learning framework for learning protein sequence-function relationships from deep mutational scanning data. The software supports retraining the models described in our manuscript, training models on new sequence-function examples, and using the pre-trained models to predict function scores for new sequence variants. Sam Gelman led the development.
References: Gelman et al. 2021
- SSPS: Sparse Signaling Pathway Sampling
SSPS learns signaling pathway structures from time series protein phosphorylation data. Its statistical model is implemented in the Gen probabilistic programming language. David Merrell led the development.
References: Merrell and Gitter 2020
- SINGE: Single-Cell Inference of Networks Using Granger Ensembles
SINGE adopts Granger causality to reconstruct transcriptional regulatory networks from pseudotime-ordered single-cell RNA-seq data. It uses a specialized form of Granger causality to smooth the irregularly spaced expression data and builds ensembles of many individual candidate networks. Atul Deshpande led the development.
References: Deshpande et al. 2022
- TPS: Temporal Pathway Synthesizer
TPS uses protein-protein interactions and time series phosphorylation data to infer signaling pathway structures. It synthesizes (generates) candidate pathways that are consistent with logical constraints. For instance, a protein activated late in a stimulus response cannot control a protein activated earlier in the response. In addition, all proteins in the pathway must be connected to the source(s) of stimulation. Ali Köksal led the development in collaboration with several other groups.
References: Köksal et al. 2018
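The two constraints described for TPS can be illustrated with a small check on a candidate pathway. This is a minimal sketch, not the TPS implementation: the function name `is_consistent`, the representation of a pathway as directed (regulator, target) pairs, and the use of a single activation time per protein are all assumptions made for illustration.

```python
from collections import deque


def is_consistent(edges, activation_time, sources):
    """Check a candidate directed pathway against the two constraints in the text.

    edges: list of (regulator, target) pairs (hypothetical representation).
    activation_time: dict mapping each protein to its activation time.
    sources: proteins where the stimulation starts.
    """
    # Constraint 1: a protein activated later cannot control
    # a protein activated earlier.
    for regulator, target in edges:
        if activation_time[regulator] > activation_time[target]:
            return False

    # Constraint 2: every protein must be connected to a source.
    # Here "connected" is interpreted as reachable by following edge directions.
    children = {}
    for regulator, target in edges:
        children.setdefault(regulator, []).append(target)

    reached = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in reached:
                reached.add(child)
                queue.append(child)

    return reached >= set(activation_time)


# Toy example with made-up proteins S (source), A, and B.
times = {"S": 0, "A": 1, "B": 2}
ok = is_consistent([("S", "A"), ("A", "B")], times, ["S"])
bad_order = is_consistent([("S", "B"), ("B", "A")], times, ["S"])
disconnected = is_consistent([("S", "A")], times, ["S"])
```

In this toy example, the first pathway satisfies both constraints, the second violates the timing constraint because B is activated after A, and the third leaves B unconnected to the source.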
- LPWC: Lag Penalized Weighted Correlation
LPWC is a clustering algorithm specialized for time series data. Unlike general clustering methods, it detects related temporal patterns that occur at similar times even if they are not perfectly synchronized. Thevaa Chandereng led the development.
References: Chandereng and Gitter 2020
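The idea of scoring similar patterns that are slightly out of sync can be sketched as a correlation computed over several time shifts, with larger shifts penalized. The code below is a simplified illustration and not the LPWC implementation: it assumes evenly spaced time points, uses plain Pearson correlation, and the function names, the Gaussian-style penalty `exp(-penalty * lag**2)`, and the default parameter values are assumptions chosen for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def lag_penalized_correlation(x, y, max_lag=1, penalty=0.5):
    """Return (best_score, best_lag) over shifts of y relative to x.

    A positive lag means the pattern in y occurs `lag` steps after x.
    The penalty shrinks the correlation as the absolute lag grows.
    """
    best_score, best_lag = -math.inf, 0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[: len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[: len(y) + lag]
        score = math.exp(-penalty * lag ** 2) * pearson(a, b)
        if score > best_score:
            best_score, best_lag = score, lag
    return best_score, best_lag


# Two toy profiles with the same pulse, one time step apart.
x = [0, 1, 0, 0, 0, 0]
y = [0, 0, 1, 0, 0, 0]
score, lag = lag_penalized_correlation(x, y)
```

With these toy profiles, the unshifted correlation is negative, while shifting by one step aligns the pulses, so the penalized score is highest at a lag of 1.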
- Omics Integrator
Omics Integrator is a suite of tools for integrating and building network models from multiple types of omic data (transcriptomic, epigenomic, proteomic, genomic, etc.). The Garnet module combines epigenomic and transcriptomic data to determine which transcription factors are relevant in a biological condition. The Forest module uses the prize-collecting Steiner forest (PCSF) algorithm to connect proteins of interest in a protein-protein interaction network, which may optionally include transcription factors from Garnet. The software was primarily developed by Ernest Fraenkel’s lab.
References: Tuncbag et al. 2016
- ML4Bio: Machine Learning for Biologists
ML4Bio is a Python package used to introduce machine learning concepts to a biology audience in a workshop format. It focuses on classification and wraps the scikit-learn Python package. The workshop includes example datasets and guides to the machine learning pipeline and different classifiers. Chris Magnano, Fangzhou Mu, and Milica Cvetkovic led the development.
References: Magnano et al. 2022
- SDREM: Signaling and Dynamic Regulatory Events Miner
SDREM reconstructs the signaling pathways and transcriptional regulatory networks that cells use to respond to external stimuli. It takes as input time series gene expression data following stimulation, a list of proteins that initially detect the stimulation, and optional prior knowledge about the relevance of other proteins on the signaling pathway. These condition-specific data are combined with generic protein-protein interactions and protein-DNA interactions (e.g. from ChIP-seq, ChIP-chip, or inferred from DNA binding motifs). The resulting model predicts which transcription factors control the response, when they are active, and how they are activated by upstream signaling pathways. The Gitter et al. 2015 reference below is a step-by-step guide for using the SDREM software (PDF available upon request). MT-SDREM is a multi-task learning extension of SDREM that jointly models multiple responses. The software was developed with Ziv Bar-Joseph’s lab.
References: Gitter et al. 2013a, Gitter et al. 2013b, Jain et al. 2014, and Gitter et al. 2015
- MEO: Maximum Edge Orientation
MEO orients an undirected graph by finding the edge directions that maximize the high-confidence connections between a set of source nodes and a set of target nodes. This approach can be used to find signaling pathways embedded in a protein-protein interaction network given their starting points (e.g. receptors) and end points (e.g. transcription factors). The software was developed with Ziv Bar-Joseph’s lab.
References: Gitter et al. 2011
- DREM 2.0: Dynamic Regulatory Events Miner
DREM identifies the transcription factors that drive temporal changes in gene expression by predicting which regulators cause groups of genes that are co-expressed up until a particular time point to diverge. It integrates time series gene expression data with protein-DNA interactions (e.g. from ChIP-seq, ChIP-chip, or inferred from DNA binding motifs). DREM 2.0 extends the original DREM by supporting protein-DNA interactions that change over time, incorporating motif finding, and improving the visualization. The software was developed with Jason Ernst and Ziv Bar-Joseph’s labs.
References: Ernst et al. 2007 and Schulz et al. 2012
- Multi-PCSF: Multi-Sample Prize Collecting Steiner Forest
Multi-PCSF extends the PCSF algorithm to jointly model multiple samples or patients. PCSF combines scores on proteins with a weighted protein-protein interaction network to identify low-cost connections between high-scoring proteins. The multi-sample extension learns networks for all samples simultaneously, constraining the networks to be similar for different samples. The software was developed with Ernest Fraenkel’s lab.
References: Gitter et al. 2014
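The trade-off PCSF makes between protein scores (prizes) and interaction costs can be sketched as an objective evaluated on a candidate subnetwork. This is a simplified illustration of the general prize-collecting idea, not the objective used by Multi-PCSF or Omics Integrator: the function name `pcsf_objective`, the `beta` weight, and the toy proteins and values are assumptions, and terms such as the multi-sample similarity constraint are omitted.

```python
def pcsf_objective(selected_edges, edge_cost, prize, beta=1.0):
    """Score a candidate subnetwork; lower is better.

    selected_edges: edges kept in the candidate network.
    edge_cost: dict mapping each edge (a pair of proteins) to its cost.
    prize: dict mapping each protein to its score.
    beta: hypothetical weight on prizes relative to edge costs.
    """
    # Proteins touched by at least one selected edge are included.
    included = {node for edge in selected_edges for node in edge}
    # Pay a penalty for every prize protein left out of the network.
    missed_prizes = sum(p for node, p in prize.items() if node not in included)
    # Pay the cost of every interaction used to connect proteins.
    total_edge_cost = sum(edge_cost[edge] for edge in selected_edges)
    return beta * missed_prizes + total_edge_cost


# Toy network with made-up proteins and values.
prize = {"A": 5.0, "B": 4.0, "C": 0.5}
edge_cost = {("A", "B"): 1.0, ("B", "C"): 2.0}

only_ab = pcsf_objective([("A", "B")], edge_cost, prize)
all_edges = pcsf_objective([("A", "B"), ("B", "C")], edge_cost, prize)
no_edges = pcsf_objective([], edge_cost, prize)
```

In this toy example, connecting the two high-scoring proteins A and B gives the lowest objective, because the expensive edge to C costs more than the small prize it recovers.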