A Method and Tool to Control Single-cell RNA-seq Data Quality
Single-cell RNA-seq (scRNA-seq) is emerging as a promising technology for profiling cell-to-cell variability in cell populations. However, the combination of technical noise and intrinsic biological variability makes detecting technical artifacts in scRNA-seq samples particularly challenging. Proper detection of technical artifacts is critical to prevent spurious results during downstream analysis. SinQC is a method and tool for detecting technical artifacts in scRNA-seq samples. SinQC assumes that if gene expression outliers are also associated with poor sequencing library quality, then they are more likely to be technical artifacts than cells with real biological variation. First, SinQC classifies cells as either gene expression outliers or cells of the main population based on their gene expression patterns. For each cell, SinQC then calculates two types of data quality meta-scores by integrating a set of quality metrics (mapping rates, total number of mapped reads, and read complexity). The two data quality meta-scores indicate whether a cell has one significantly low quality metric or significantly low quality metrics overall. SinQC assumes that the cells of the main population have good data quality and thus uses them to estimate reasonable data quality meta-score cutoffs by allowing only a limited fraction of main population cells to fail (which sets the false positive rate). After estimating the meta-score cutoffs, SinQC identifies cells as technical artifacts if they are gene expression outliers and also fail either of the two meta-score cutoffs.
Python (version >= 2.4.3), R (version >= 2.13.0), and the R package ‘ROCR’ must be installed.
- Download SinQC (Linux or MacOS)
- tar -xzf
- Add the SinQC directory to the $PATH environment variable (optional); otherwise, type the absolute path to the SinQC directory when you run the program.
python SINQC.py <Parameters>
-SEQ: The folder containing the raw (single-end) reads. The default is ‘./’. SinQC reads all ‘*.fastq’ or ‘*.fa’ files within that folder.
-t: The raw reads type: ‘FASTQ’ or ‘FASTA’. The default is ‘FASTQ’.
-RSEM: The RSEM output folder. The default is ‘./’. SinQC reads all ‘*.genes.results’ files from that folder to obtain the expected counts and TPM values for each sample. The ‘*.genes.results’ files must be generated by RSEM (version >= 1.2.1).
-o: The output folder.
-TPMCutoff: The TPM cutoff. SinQC is designed not only to detect technical artifacts, but also to report general quality-related information. This parameter sets the minimum TPM used when counting how many genes are detected in each sample. The default is 1.
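As an illustration of the detected-gene count described above, here is a minimal sketch that counts the rows of an RSEM ‘*.genes.results’ table whose TPM reaches the cutoff. It is not taken from the SinQC source: the function name `count_detected_genes` is hypothetical, the table contents are made up, and treating a gene as detected when TPM >= cutoff (rather than strictly greater) is an assumption.

```python
import csv
import io


def count_detected_genes(handle, tpm_cutoff=1.0):
    """Count rows of a tab-separated RSEM genes.results table with TPM >= tpm_cutoff.

    Assumption: 'detected' means TPM >= cutoff; SinQC itself may use a
    different comparison.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return sum(1 for row in reader if float(row["TPM"]) >= tpm_cutoff)


# Made-up three-gene table using the RSEM genes.results column layout.
example = (
    "gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n"
    "geneA\ttxA\t1000\t900.00\t50.00\t12.50\t10.10\n"
    "geneB\ttxB\t2000\t1900.00\t3.00\t0.40\t0.30\n"
    "geneC\ttxC\t1500\t1400.00\t8.00\t1.00\t0.80\n"
)

detected = count_detected_genes(io.StringIO(example), tpm_cutoff=1.0)
print("genes detected:", detected)
```

For a real sample, the same function could be given an open file handle for one ‘*.genes.results’ file instead of the in-memory string.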
-PValueCutoff–Distinct–Spearman: To define gene expression outliers (GEOs), SinQC calculates the Spearman rank correlations of a given cell to each of the other cells (‘one-to-others’), as well as the pairwise correlations among the remaining cells after removing that cell. A one-sided Wilcoxon signed-rank test assesses whether the ‘one-to-others’ correlations are significantly lower than the overall ‘pairwise’ correlations. This parameter is the p-value cutoff for defining GEOs. The default is 0.001.
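The two correlation lists described above can be sketched in plain Python as follows. This is an illustrative sketch, not SinQC's own code: the function names and the toy expression values are hypothetical, and the sketch stops before the Wilcoxon test because the way SinQC sets up that comparison is not described here.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def correlation_lists(expr, cell):
    """expr maps cell name -> expression values in a shared gene order."""
    others = [c for c in expr if c != cell]
    one_to_others = [spearman(expr[cell], expr[c]) for c in others]
    pairwise = [spearman(expr[a], expr[b]) for a, b in combinations(others, 2)]
    return one_to_others, pairwise


# Toy data: three similar cells and one cell with a reversed profile.
expr = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 9],
    "c": [1, 3, 5, 7],
    "odd": [4, 3, 2, 1],
}
one_to_others, pairwise = correlation_lists(expr, "odd")
print(one_to_others, pairwise)
```

A cell whose ‘one-to-others’ values sit well below the ‘pairwise’ values, as for the toy cell `odd`, is the kind of cell the p-value cutoff is meant to flag.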
-PValueCutoff–Distinct–Pearson: The same as ‘-PValueCutoff–Distinct–Spearman’, but using Pearson instead of Spearman correlations. The default is 0.001.
-CorTag: After ‘-PValueCutoff–Distinct–Spearman’ and ‘-PValueCutoff–Distinct–Pearson’ are set, this parameter tells SinQC how to combine them to define GEOs. The options are ‘AND’ or ‘OR’. If this parameter is set to ‘AND’, SinQC defines GEOs as cells for which both p-values are significant. If this parameter is set to ‘OR’, SinQC defines GEOs as cells for which either p-value is significant. The default is ‘AND’.
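The ‘AND’/‘OR’ rule above can be written as a small decision function. This is a hedged sketch rather than SinQC's implementation: the function name `is_geo` is hypothetical, and treating a p-value as significant when it is strictly below the cutoff is an assumption.

```python
def is_geo(p_spearman, p_pearson,
           spearman_cutoff=0.001, pearson_cutoff=0.001, cor_tag="AND"):
    """Combine the two correlation-test p-values into a GEO call.

    Assumption: 'significant' means p < cutoff (strict inequality).
    """
    spearman_significant = p_spearman < spearman_cutoff
    pearson_significant = p_pearson < pearson_cutoff
    if cor_tag == "AND":
        return spearman_significant and pearson_significant
    if cor_tag == "OR":
        return spearman_significant or pearson_significant
    raise ValueError("cor_tag must be 'AND' or 'OR'")


# A cell significant only under Spearman is a GEO under 'OR' but not 'AND'.
print(is_geo(0.0001, 0.2, cor_tag="AND"))
print(is_geo(0.0001, 0.2, cor_tag="OR"))
```

Under this reading, ‘AND’ is the stricter setting and labels fewer cells as GEOs than ‘OR’.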
-Max_FPR: The maximum false positive rate allowed. SinQC estimates the quality ‘bottom lines’ by requiring that at least a ‘1 - Max_FPR’ fraction of main population cells (MPCs) pass both the MQS and WCQS cutoffs. It then applies these cutoffs to GEOs to identify technical artifacts. The default is 0.05.
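To make the cutoff idea concrete, the sketch below picks a cutoff from the MPC scores so that at most a `max_fpr` fraction of MPCs fall below it, then flags GEOs that fail either cutoff. It is a simplification under stated assumptions, not SinQC's procedure: it treats each meta-score separately rather than requiring MPCs to pass both cutoffs jointly, it assumes that higher scores mean better quality, and the function names and score values are hypothetical.

```python
import math


def estimate_cutoff(mpc_scores, max_fpr=0.05):
    """Return a cutoff that at most a max_fpr fraction of MPC scores fall below.

    Assumptions: higher score = better quality, and a cell fails when its
    score is strictly below the cutoff.
    """
    ordered = sorted(mpc_scores)
    allowed_failures = math.floor(max_fpr * len(ordered))
    return ordered[allowed_failures]


def technical_artifacts(geo_scores, mqs_cutoff, wcqs_cutoff):
    """geo_scores maps GEO cell name -> (MQS, WCQS); return cells failing either."""
    return [
        cell
        for cell, (mqs, wcqs) in geo_scores.items()
        if mqs < mqs_cutoff or wcqs < wcqs_cutoff
    ]


# Made-up scores for 20 main population cells and three GEOs.
mpc_mqs = list(range(1, 21))
mpc_wcqs = list(range(1, 21))
mqs_cutoff = estimate_cutoff(mpc_mqs, max_fpr=0.05)
wcqs_cutoff = estimate_cutoff(mpc_wcqs, max_fpr=0.05)

geos = {"g1": (5, 1), "g2": (5, 5), "g3": (1, 5)}
print(technical_artifacts(geos, mqs_cutoff, wcqs_cutoff))
```

Raising `max_fpr` moves the cutoffs up, so more GEOs are called technical artifacts at the cost of more main population cells failing the same cutoffs.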
python SINQC.py -SEQ ./Example_Datasets/ -t FASTQ -RSEM ./Example_Datasets/ -o ./SinQC_Out/ -TPMCutoff 1 -PValueCutoff–Distinct–Spearman 0.001 -PValueCutoff–Distinct–Pearson 0.001 -CorTag AND -Max_FPR 0.05
The command can be simplified to the following, with the other parameters left at their default settings:
python SINQC.py -SEQ ./Example_Datasets/ -t FASTQ -RSEM ./Example_Datasets/ -o ./SinQC_Out/
Example dataset and output
Peng Jiang, Ph.D
Computational Biologist, Morgridge Institute for Research,
University of Wisconsin – Madison
330 N. Orchard St., Madison, WI
Tel: 608-316-4479 (Office)
Jiang, P., Thomson, J. A., Stewart, R. Quality Control of Single-cell RNA-Seq by SinQC. Bioinformatics (2016) (In Press)