A Meta-Motif Based Statistical Framework and Pipeline to Predict SELEX-derived Binding Aptamers Introduction
Aptamers are chemically synthesized short single-stranded DNA or RNA oligonucleotides, which have the ability to bind to a variety of targets, such as small molecules (Yang and Bowser, 2013), proteins (Ng, et al., 2006) and the surface of cells (Cerchia, et al., 2005; Sefah, et al., 2010). It is regarded as a promising way to generate large-scale affinity reagents, which can be an alternative of antibody. To generate high affinity aptamers, it starts with a random oligonucleotide pools. Then those oligonucleotides are evolved through a process called systematic evolution of ligands by exponential enrichment (SELEX), which involves several rounds of selection. Despite of its widespread applications, SELEX derived aptamers are suffering from high false positive rates (Cho, et al., 2010).
MPBind is a meta-motif based statistical framework and pipeline to predict SELEX derived binding aptamers. Briefly, MPBind calculates four kinds of p-values (1-sided) for each motif, representing different features. The p-values are then transformed to Z-scores (Z1, Z4, Z3 and Z4) via Z = Φ−1(1−p), where Φ is the standard normal cumulative distribution function. For each motif, MPBind used Stouffer’s method to combine those four Z-scores into one combined Z-score. For any given aptamer sequence, MPBind uses an n-mer window to scan it. The binding potential is inferred from the combinations of those motifs and estimated by Meta-Z-Score. MPBind provides several options for users, such as whether to use unique reads or redundant reads to train parameters, motif length etc. It also provides data preprocess functions to users, e.g., transform FASTQ/FASTA to plain text format, primer trimming and transform antisense sequences to sense sequences based on primer sequence matching.
New Features of MPBind (v2.1):
- This version allows less stringent input. For example, it allows rounds to be defined with commas with or without spaces in between.
- This version allows users to run MPBind in any folder (the previous version assumed that user should run program within the data folder).
- It gives more information for the screen output (e.g., status checking, running time, and summary of output).
Python (version >=2.4.3) and R (version >=2.13.0) are required to be installed.
- Download MPBind (Linux or MacOS)
- tar –xzf
- Add MPBind directory to the $PATH environment variable (Optional) or you need to type the absolute path of MPBind directory before you run this program.
Step 1: Preprocess SELEX-Seq reads (Optional):
MPBind requires input files should be in plain text format with each row only contains sense aptamer sequences. To this end, MPBind provides MPBind_ Preprocess.py script to transform raw sequencing reads formats (FASTQ or FASTA) to plain text format. It will also automatically transform antisense reads to sense reads based on matching primer sequences.
python MPBind_ Preprocess.py < Parameters> Required Parameters: -Infile: Input file name -t: input file format (FASTA or FASTQ) -Forward_primer: Forward primer sequence -Reverse_primer: Reverse primer sequence -primer_max_mismatch: The maximal mismatches allowed to match primers -Outfile: Output file name
python MPBind_ Preprocess.py –Infile Test.fastq –t FASTQ -Forward_primer AGCAGCACAGAGGTCAGATG -Reverse_primer TTCACGGTAGCACGCATAGG -primer_max_mismatch 1 –Outfile Test_sequence.txt
Step 2: MPBind (training)
MPBind requires the input sequences should be in plain text format
Input file Example (plain text):
CTTTGCCACCGGGTTGTAGTTACGGCTGA CTTTGCCACCGGGTTGTAGTTACGGCTGA TTATGTTTTTTTTTTTTTTTAATGCCCTG GTTTTCAAAGAGGCTCGACCTGACTTCTA GGTTTGCTGAGGTGGGCTCTGTTTAACCT GCAGGTGTGGTTTGCTGAGGTGGGCCCTG TTCCCCAATAACATCGTATACCCGCGCCC
Required Parameters: -R0: Initial library file [plain text format] -RS: SELEX round files (e.g., R1, R2, R3, …) [Plain text format] -RC: Control Seq round (No target and just control PCR amplification) -mer: Motif length (e.g., 5,6,7) -U: <1: Unique reads only; 2: Redundant reads only; 3 Both> (default=1) # 1: Unique reads only: merged duplicates to one read # 2: Redundant reads only: Using all reads # 3: Both: MPBind will generate two sub-folders for ‘Unique reads only’ and ‘Redundant reads only’, respectively. -Out: Output file folder
Python MPBind_Train.py -R0 R0.txt -RS R1.txt,R2.txt,R3.txt,R4.txt,R5.txt,R6.txt,R7.txt -RC Control.txt -nmer 6 -U 3 -Out MPBind_Out_R01234567_Unique_and_Redundant
It will generate *.train.nmer (e.g., Test.train.6mer) files under the output file folder.
Step 3: MPBind (Prediction)
Required Parameters: -Train: *.train.nmer (e.g., Test.train.6mer) files generated by MPBind_Train.py -Aptamer: Aptamer sequences to be predicted [Plain text format] -Sort: Sort Aptamer sequences based on combined meta-Z-score < default=FALSE> -Out: Output file
python MPBind_Predict.py -Train Test.train.6mer -Aptamer To_be_predicted.txt -Sort TRUE -Out Predicted_Aptamers.txt
Output file (columns):
- Aptamer.Seq: Aptamer sequences (e.g., TTTTGTTTTTTGTTTTCTTTTCCCCCCTC)
- Z1.Scan: Z1-Scores for each scanned position using n-mer window
- Z1.MetaScore: Combined Z-Score using Z1 only
- Z2.Scan: Z2-Score for each scanned position using n-mer window
- Z2.MetaScore: Combined Z-Score using Z2 only
- Z3.Scan: Z3-Score for each scanned position using n-mer window
- Z3.MetaScore: Combined Z-Score using Z3 only
- Z4.Scan: Z4-Score for each scanned position using n-mer window
- Z4.MetaScore: Combined Z-Score using Z4 only
- Z_Combined.Scan: Combined Z-Score for each scanned position using n-mer window (e.g., 8.0,7.4,3.4 …)
- Z_Combined.MetaScore: Meta-Combined Z-Score
Correspondences regarding the MPBind methods/software should be directed to
Jiang P., Meyer S., Hou Z., Nicholas E. Propson, Soh H.T., Thomson J.A., Stewart R., MPBind: A Meta-Motif Based Statistical Framework and Pipeline to Predict Binding Potential of SELEX-derived Aptamers. (2014), Bioinformatics 30 (18): 2665-2667