## Introduction

This section comes together with the previous section Differential Expression. The basic strategy singleCellTK (SCTK) uses to find biomarkers is to iteratively identify the significantly up-regulated features of each group of cells against all the other cells. This means, the function we have (findMarkerDiffExp()) is a wrapper of functions that do differential expression (DE) analysis, which would be called in a loop. For the detail of the functions we have for DE analysis, please refer to runDEAnalysis() and the documentation linked to above.

The methods we have for marker detection are listed below.

Method Citation
wilcox Aaron Lun and Jonathan Griffiths, 2016.
MAST Greg Finak and et al., 2015
Limma Gordon Smyth and et al., 2004
DESeq2 Michael Love and et al., 2014
ANOVA Jeffrey T. Leek and et al., 2020

## Workflow

### Parameters

As per SCTK’s strategy, the primary input is limited to an SingleCellExperiment (SCE) object, which has been through preprocessing steps with cluster labels annotated. While the iteration is a fixed pattern, the parameters needed are rather simple:

sce <- findMarkerDiffExp(inSCE = sce,
useAssay = "logcounts",
method = "MAST",
cluster = "cluster",
covariates = NULL,
log2fcThreshold = 0.25,
fdrThreshold = 0.05)

Here all arguments execpt the input SCE object (inSCE) are set by default. method can only be chosen from the table above. cluster and covariates should be given a single string which is present in names(colData(inSCE)). cluster is required for grouping cells, while covariates is optional for DE detection. log2fcThreshold and fdrThreshold are numeric and has to be set in plausible range. log2fcThrshold has to be positive and fdrThreshold has to be greater than zero and less than one.

The returned SCE object will contain the updated information of the markers identified in its metadata slot.

### Example

#### Preprocessing

To demonstrate a simple and clear example, here we use the “PBMC-3k” dataset from “10X” which can be easily imported with SCTK functions. The preprocessing only includes necessary steps to get cluster labels (i.e. QC and filtering are excluded).

library(singleCellTK)
pbmc3k <- importExampleData("pbmc3k")
pbmc3k <- scaterlogNormCounts(pbmc3k, "logcounts")
# Go through the Seurat curated workflow to get basic clusters
pbmc3k <- seuratNormalizeData(inSCE = pbmc3k, useAssay = "counts")
pbmc3k <- seuratFindHVG(inSCE = pbmc3k, useAssay = "seuratNormData")
pbmc3k <- seuratScaleData(inSCE = pbmc3k, useAssay = "seuratNormData")
pbmc3k <- seuratPCA(inSCE = pbmc3k, useAssay = "seuratScaledData")
pbmc3k <- seuratRunUMAP(pbmc3k)
pbmc3k <- seuratFindClusters(inSCE = pbmc3k, useAssay = "seuratScaledData")
# Optional visualization
plotSCEDimReduceColData(inSCE = pbmc3k,
colorBy = "Seurat_louvain_Resolution0.8",
conditionClass = "factor",
reducedDimName = "seuratUMAP")

#### Get Markers

Then we call findMarkerDiffExp() on the clustered data, with the cluster annotation just attached, which by default named with "Seurat_louvain_Resolution0.8".

pbmc3k <- findMarkerDiffExp(inSCE = pbmc3k,
method = "Limma", # Runs fastest
useAssay = "logcounts",
cluster = "Seurat_louvain_Resolution0.8")

After this, the marker genes, together with their statistics, of each cluster will be concatenated in one table. This table can be extracted with:

markerTable <- metadata(pbmc3k)\$findMarker
head(markerTable)
Gene Log2_FC Pvalue FDR Seurat_louvain_Resolution0.8 clusterExprPerc ControlExprPerc clusterAveExpr
74 NOSIP 0.5152459 0 0 0 0.6709091 0.3641860 1.0032117
121 IL7R 0.4715791 0 0 0 0.6418182 0.3404651 0.9947208
126 CD7 0.4285577 0 0 0 0.6200000 0.3018605 0.8756739
410 IL7R 0.7250848 0 0 1 0.7448560 0.3265583 1.2137737
7 CD2 0.5424127 0 0 1 0.6502058 0.2439024 0.8673990
601 LCK 0.3528016 0 0 1 0.6090535 0.3044264 0.7560978

Similarly to the Differential Expression section, we also provide a automated and organized heatmap plotting for the markers:

plotMarkerDiffExp(pbmc3k)

Note that when plotting the heatmap, the genes that are identified as up-regulated in multiple clusters will be considered only for the one cluster with the highest fold-change, while all of them are still kept in the table in metadata.