This pipeline will import data from single-cell preprocessing algorithms (e.g. CellRanger), generate various quality control metrics (e.g. doublet scores), and output results in standard data containers (e.g. SingleCellExperiment). Both the original droplet matrix and the filtered cell matrix will be processed. This pipeline is focused on single cell data generated from microfluidic devices (e.g. 10X).
If you are using docker image to run the pipeline, please skip this section and refer sections Running the pipeline and Docker and Singularity Images. Docker image has all dependencies installed properly so that you can use it directly on any machine.
This pipeline depends on Python package described below. If these packages is not installed or not at the proper version, please install them by running the following code:
--user h5py==2.9.0 pip3 install --user Scrublet pip3 install --user virtualenv pip3 install --user scanpypip3 install
For each tool, the pipeline expect the data in a certain format within a specific directory structure. For example, data generated with CellRanger V3 is expected to be under a sample folder (specified with the “-s” flag) and a base folder (specified with the “-b” flag). Some tools prepend the sample name onto the output files instead of making a separate subdirectory (e.g. BUStools and SEQC). The combination of –base_path (“-b”) and –sample (“-s”) should specify the location of the data files: For a comprehensive description of expected input directory structure of import functions, refer documentation here
The required arguments are as follows:
-b, --basePath (required). Base path for the output from the preprocessing algorithm -P, --preproc (required). Algorithm used for preprocessing. One of One of 'CellRangerV2', 'CellRangerV3', 'BUStools', 'STARSolo', 'SEQC', 'Optimus', 'DropEst', 'SceRDS', 'CountMatrix' and 'AnnData'. -s, --sample (required). Name of the sample. This will be prepended to the cell barcodes. -o, --directory (required). Output directory. A new subdirectory will be created with the name "sample". R, Python, and FlatFile directories will be created under the "sample" directory containing the data containers with QC metrics. Default ".". More information about output directory structure is explained in "Understanding outputs" section below. -F, --outputFormat (required). The output format of this QC pipeline. Currently, it supports R (RDS), FlatFile, Python (AnnData) and HTAN (manifest files that meets HTAN requirement). -S, --splitSample (required). Save a SingleCellExperiment object for each sample. Default is TRUE. If FALSE, the data of all samples will be combined into one SingleCellExperiment object and this object will be outputed.
The optional arguments are as follows. Their usage depend on type of data and user-defined behaviour.
-g, --gmt (optional). GMT file containing gene sets for quality control. -t, --delim (optional, required when -g is specified). Delimiter used in GMT file. Default "\t". -G, --genome (optional). The name of genome reference. This is only required for CellRangerV2 data. -y, --yaml (optional). YAML file used to specify parameters of QC functions called by singleCellTK QC pipeline. Please check "Specify parameters for QC algorithms" section for details. -c, --cellData (optional). The full path of the RDS file or Matrix file of the cell matrix. This would be use only when --preproc is SceRDS or CountMatrix. -r, --rawData (optional). The full path of the RDS file or Matrix file of the droplet matrix. This would be provided only when --preproc is SceRDS or CountMatrix. -C, --cellPath (optional). The directory contains matrix.mtx.gz, features.tsv.gz and barcodes.tsv.gz files originally generated by 10x CellrangerV2 or CellrangerV3 (files in the filtered_feature_bc_matrix directory). This argument only works when --preproc is CellRangerV2 or CellRangerV3. Default is NULL. If 'base_path' is NULL, 'cellPath' or 'rawPath' should be specified. -R, --rawPath (optional). The directory contains matrix.mtx.gz, features.tsv.gz and barcodes.tsv.gz files originally generated by 10x CellrangerV2 or CellrangerV3 (files in the raw_feature_bc_matrix directory). This argument only works when --preproc is CellRangerV2 or CellRangerV3. Default is NULL. If 'base_path' is NULL, 'cellPath' or 'rawPath' should be specified. -d, --dataType. Type of data as input. Default is Both, which means taking both droplet and cell matrix as input. If set as 'Droplet', it will only processes droplet data. If set as 'Cell', it will only processes cell data. -D, --detectCells. Detect cells from droplet matrix. Default is FALSE. This argument is only eavluated when -d is 'Droplet'. If set as TRUE, cells will be detected and cell matrixed will be subset from the droplet matrix. Also, quality control will be performed on the detected cell matrix. -m, --cellDetectMethod. Methods to detect cells from droplet matrix. Default is 'EmptyDrops'. This argument is only eavluated when -D is 'TRUE'. Other options could be 'Knee' or 'Inflection'. More information is provided in the section below. -n, --numCores. Number of cores used to run the pipeline. By default is 1. Parallel computing is enabled if -n is greater than 1. -T, --parallelType. Type of parallel computing used for parallel computing. Parallel computing used in this pipeline depends on "BiocParallel" package. Default is 'MulticoreParam'. It can be 'MulticoreParam' or 'SnowParam'. This argument will be evaluated only when numCores > 1. -i, --studyDesign. The txt file containing the desrciption of the study design. Default is NULL. This would be shown at the begining the html report of cell and droplet QC. -L, --subTitle. The subtitle used in the cell and droplet QC HTML report. Default is None. The subtitle can contain information of the sample, like sample name, etc. If -S is set as TRUE, the length of subsitle should be the same as the length of samples. If -S is set as FALSE, the length of subtitle should be one or NULL.
To run the pipeline script, users will need to download the ‘SCTK_runQC.R’ here.
This pipeline enables different ways to import CellrangerV2/CellrangerV3 data for flexibility.
Rscript SCTK_runQC.R \-b /base/path \ -P CellRangerV3 \ -s SampleName \ -o Output_Directory \ -S TRUE \ -F R,Python,FlatFile,HTAN
For CellRangerV2, the reference used by cellranger needs to be specified by -G/–genome:
Rscript SCTK_runQC.R \-b /base/path \ -P CellRangerV2 \ -s SampleName \ -o Output_Directory \ -S TRUE \ -G hg19 \ -F R,Python,FlatFile,HTAN
As shown in the Import data into SCTK section, -b specify the base path and usually it’s the folder where you ran 10x cellranger count. -s specify the sample name, which has to be same as the name of the sample folder under the base folder. The folder layout would look like the following:
├── BasePath └── SampleName ├── outs | ├── filtered_feature_bc_matrix | | ├── barcodes.tsv.gz | | ├── features.tsv.gz | | └── matrix.mtx.gz | ├── raw_feature_bc_matrix | | ├── barcodes.tsv.gz | | ├── features.tsv.gz | | └── matrix.mtx.gz ...
Rscript SCTK_runQC.R \-P CellRangerV2 \ -C /path/to/cell/matrix \ -R /path/to/droplet/matrix \ -s SampleName \ -o Output_Directory \ -S TRUE \ -F R,Python,FlatFile,HTAN
In this case, you must skip -b arguments and you can also skip -G argument for CellRangerV2 data.
Rscript SCTK_runQC.R \-P SceRDS \ -s Samplename \ -o Output_Directory \ -S TRUE \ -F R,Python,FlatFile,HTAN \ -r /path/to/rds/file/droplet.RDS \ -c /path/to/rds/file/cell.RDS
Rscript SCTK_runQC.R \-P CountMatrix \ -s Samplename \ -o Output_Directory \ -S TRUE \ -r /path/to/matrix/file/droplet.txt \ -c path/to/matrix/file/cell.txt
If your data is preprocessed by other algorithms, you can load the data as shown in the figure above. Basically, the templated is shown below:
Rscript SCTK_runQC.R \-b /base/path \ -P Preprocessing_Algorithm \ -s SampleName \ -o Output_Directory \ -S TRUE \ -F R,Python,FlatFile,HTAN
User can choose to run QC pipeline on only one of the droplet count or cell count matrix, instead of running on both. In this case, the pipeline will only take single input and perform quality control on it. One of the example is shown below:
Rscript SCTK_runQC.R \-b /base/path \ -P Preprocessing_Algorithm \ -s SampleName \ -o Output_Directory \ -S TRUE \ -F R,Python,FlatFile,HTAN \ -d Droplet \ -D TRUE \ -m EmptyDrops
-d argument is used to specify which count matrix to used in the QC pipeline. Default is “Both”, which means take both droplet and cell count data as input.
If -d argument is set as “Droplet”, the QC pipeline will only take droplet count matrix as input and perform quality control. You can choose whether to detect cells from the droplet matrix by setting -D as TRUE. If yes, cell count matrix will be detected and the pipeline will also perform quality control on this matrix and output the result. You could further define the method used to detect cells from droplet matrix by setting -m argument. -m could be one of “EmptyDrops”, “Knee” or “Inflection”. “EmptyDrops” will keep cells that pass the “runEmptyDrops” function test. “Knee” and “Inflection” will keep cells that pass the knee or inflection point returned from “runBarcodeRankDrops” function.
If -d argument is set as “Cell”, the QC pipeline will only take cell count matrix as input and perform quality control. A figure showing the analysis steps and outputs of different inputs is shown below:
User can specify parameters for QC algorithms in this pipeline with a yaml file (supplied with -y/–yamlFile argument). The current supported QC algorithms including doublet dection (bcds, cxds, cxds_bcds_hybrid, doubletFinder, doubletCells and scrublet), decontamination (decontX), emptyDrop detection (emptyDrops) and barcodeRankDrops (barcodeRanks). A summary of each function is shown below:
An example of QC parameters yaml file is shown below:
--- Params: ### should not be omitted bcds: ntop: 600 cxds: ntop: 600 cxds_bcds_hybrid: nTop: 600 decontX: maxIter: 600 emptyDrops: lower: 50 niters: 5000 testAmbient: True barcodeRanks: lower: 50
The format of yaml file can be found here. The parameters should be consistent with the parameters of each QC function in singleCellTK package. Parameters that are not defined in this yaml file will use the default value. Please refer reference for detailed information about arguments in each QC function.
Quantifying the level of gene sets can be useful quality control. For example, the percentage of counts from mitochondrial genes can be an indicator or cell stress or death.
Users can pass a GMT file (with -g/–gmt argument) to the pipeline with one row for each gene set. The first column should be the name of the gene set (e.g. mito). The second column for each gene set in the GMT file (i.e. the description) should contain the location of where to look for the matching IDs in the data. If set to ‘rownames’, then the gene set IDs will be matched with the row IDs of the data matrix. If a character string or an integer index is supplied, then gene set IDs will be matched to IDs the that column of feature table.
SCTK QC pipeline enables parallel computing to speed up the analysis. Parallel computing is enabled by setting -n/–numCores > 1. The -n/–numCores is used to set the number of cores used for the pipeline.
The backend of parallel computing is supported by “BiocParallel” package. Therefore, users can select different types of parallel evaluation by setting -T/–parallelType argument. Default is “MulticoreParam”. Currently, “MulticoreParam” and “SnowParam” is supported for -T argument. However, “MulticoreParam” is not supported by Windows system. Winows user can choose “SnowParam” as the backend of parallel computing.
The output directory is created under the path specified by -o/–directory argument. Each sample is stoed in the subdirectory (named by -s/–sample argument) within this output direcotry. Within each sample directory, the each output format will be separated into subdirectories. The output file hierarchy is shown below:
(root; output directory) ├── level3Meta.csv ├── level4Meta.csv ├── sample1_dropletQC.html ├── sample1_cellQC.html └── sample1 ├── R | ├── sample1_Droplets.rds | └── sample1_Cells.rds ├── Python | ├── Droplets | | └── sample1.h5ad | └── Cells | └── sample1.h5ad ├── FlatFile | ├── Droplets | | ├── assays | | | └── sample1_counts.mtx.gz | | ├── metadata | | | └── sample1_metadata.rds | | ├── sample1_colData.txt.gz | | └── sample1_rowData.txt.gz | └── Cells | ├── assays | | └── sample1_counts.mtx.gz | ├── metadata | | └── sample1_metadata.rds | ├── reducedDims | | ├──sample1_decontX_UMAP.txt.gz | | ├──sample1_scrublet_TSNE.txt.gz | | └──sample1_scrublet_UMAP.txt.gz | ├── sample1_colData.txt.gz | └── sample1_rowData.txt.gz └── sample1_QCparameters.yaml
singleCellTK is available to use with both Docker and Singularity. This container be used as a portal to the singleCellTK command line interface. The work was done in R. The singleCellTK docker image is available from Docker Hub.
The Docker image can be obtained by running:
docker pull campbio/sctk_qc
Docker image will not access files in your host file system by default. To get access to the files on your machine, you can properly set up a mount volume. Noted that the transcriptome data and GMT file needed to be accessible to the container via mounted volume. In the below example, mount volumn is enabled for accessing input and output directory using argument -v. The transcriptome and GMT files inside /path/to/data is now available in /SCTK_docker folder inside the docker. To learn more about mounted volumes, please check out this post.
The usage of each argument is the same as running command line analysis. Here is an example code to perform quanlity control of CellRangerV3 data singleCellTK docker:
docker run --rm -v /path/to/data:/SCTK_docker \ -it campbio/sctk_qc:1.7.5 \ -b /SCTK_docker/cellranger \ -P CellRangerV3 \ -s pbmc_100x100 \ -o /SCTK_docker/result/tenx_v3_pbmc \ -g /SCTK_docker/mitochondrial_human_symbol.gmt \ -S TRUE \ -F R,Python,Flatfile,HTAN
The Singulatiry image can easily be built using Docker Hub as a source:
singularity pull docker://campbio/sctk_qc:1.7.6
The usage of singleCellTK Singularity image is very similar to that of Docker. In Singularity 3.0+, the mount volume is automatically overlaid.
It’s recommended to re-set the home directory when you run singularity. Singularity will mount $HOME path on your machine by default, which might contain your personal R/Python library folder. If we don’t re-set the home to mount, singularity will try to use R/Python libraries which are not built within the singularity image and cause some conflicts. You can point to some “sanitized home”, which is different from $HOME path on your machine, using argument -H/–home. Besides, you can use argument –bind/-B to specify your own mount volume, which is the path that contains the dataset and will be used to store the output of QC pipeline. The example is shown as below:
singularity run --home=/PathToSanitizedHome \ --bind /PathToData:/data sctk_qc_1.7.6.sif \ -P CellRangerV3 \ -s gencodev34_pbmc_1k_v3 \ -b /data/gencodev34_pbmc_1k_v3 -o /data/result/gencodev34_pbmc_1k_v3 \ -S TRUE \ -F R,Python,FlatFile,HTAN \ -n 15 \ -T MulticoreParam
Also, you might want to specify cpu architecture when run the Singularity image on BU SCC using ‘#$ -l cpu_arch=broadwell|haswell|skylake|cascadelake’ command. Because the python packages are compiled by SIMD instructions that are only compatible on these cpu architectures. If you are runnning singularity on other cluster, please contact IT helps about how to specify cpu architecture when you run the singularity image. One of the example is shown below:
#!/bin/bash #$ -cwd #$ -j y #$ -P camplab #$ -pe omp 16 #$ -l cpu_arch=broadwell|haswell|skylake|cascadelake singularity run --home=/PathToSanitizedHome \ --bind /PathToData:/data sctk_qc_1.7.6.sif \ -P CellRangerV3 \ -s gencodev34_pbmc_1k_v3 \ -b /data/gencodev34_pbmc_1k_v3 -o /data/result/gencodev34_pbmc_1k_v3 \ -S TRUE \ -F R,Python,FlatFile,HTAN \ -n 15 \ -T MulticoreParam