SCDC adopts an ENSEMBLE method to integrate deconvolution results across methods and datasets, giving reference data that are more close to the bulk RNA-seq data higher weights, implicitly addressing the batch-effect confounding when multiple scRNA-seq reference sets are available.

Our paper is published at Briefings In Bioinformatics.

Installation

if (!require("devtools")) {
  install.packages("devtools")
}
devtools::install_github("meichendong/SCDC")

Data Input

SCDC takes ExpressionSet objects with raw read counts as input. ExpressionSet object is based on the package Biobase, which mainly stores data matrix (genes by samples), featureData (information about genes), and phenoData (information about samples, like the clustering results, subjects, conditions). We also enable a quick construction of the ExpressionSet object by the function getESET(). Here we show a toy example:

library(SCDC)
#> Warning: replacing previous import 'vctrs::data_frame' by 'tibble::data_frame'
#> when loading 'dplyr'
counts <- matrix(sample(0:500, 25), nrow = 5)
colnames(counts) <- paste("cell",1:5, sep = "")
rownames(counts) <- paste("gene",1:5, sep = "")
counts
#>       cell1 cell2 cell3 cell4 cell5
#> gene1   179   138   305   385    11
#> gene2   428    96   156    73     6
#> gene3   195    55    16   169   210
#> gene4   417   381   199   488   311
#> gene5   213   422   463   310    78
fdata <- rownames(counts)
pdata <- cbind(cellname = colnames(counts), subjects = paste("patient", rep(1,5)))
eset <- getESET(counts, fdata = fdata, pdata = pdata)
class(eset)
#> [1] "ExpressionSet"
#> attr(,"package")
#> [1] "Biobase"

As an illustration of the SCDC framework, we analyze the pancreatic islet data as presented in our paper. Three public scRNA-seq datasets from Baron et al. (2016), Segerstolpe et al. (2016) and Xin et al. (2016) are used; a bulk RNA-seq dataset from Fadista et al. (2014) is used in the real data analysis. We also applied our method to a three-cell-line mixture data from Perou lab, where the cell lines MDA-MB-468, MCF-7, and normal fibroblast cells are cultured by the ratio 6:3:1. Both scRNA-seq and bulk RNA-seq data are available.

The processed data is available for download at the SCDC data page.

SCDC Pre-process of scRNA-seq Data

For each single-cell dataset with raw counts, we can first explore the demographic information by the visualization function DemoPlot. Here, single cells from each subject are summarized by total counts, counts per cell type, and the overall composition. This gives us a general understanding of a dataset. Take scRNA-seq data of healthy subjects from Segerstolpe et al. as an example, and we derive the following plot:

ct1 <- c("mediumorchid1","mediumpurple1","lightskyblue","seagreen1","yellow","tan1","azure3")
seger <- readRDS("segerstolpe.rds")
DemoPlot(seger, cluster = "cluster", sample = "sample", select.ct = c("alpha","beta","delta","gamma","ductal","acinar"), Palette = ct1)
scRNA-seq from Segerstolpe et al.

To make SCDC robust to single-cell clustering, a quality control procedure is performed as a first step to remove cells with questionable cell-type assignments, as well as cells with low library preparation and sequencing quality. Specifically, each single cell is treated as a "bulk" sample and its cell-type composition can be derived by a first-pass run of SCDC. The threshold of keeping a single cell from an assigned cluster is chosen as 0.7 in this illustration example. This user-defined threshold should be selected carefully, to achieve the effect of quality control of single cells without obstructing the performance on the downstream analysis. An unproper threshold could potentially lead to filtering out a large proportion of cells from rare cell types. To be conservative, this can be chosen from (0.5,0.7).

seger.qc <- SCDC_qc(seger, ct.varname = "cluster", sample = "sample", scsetname = "Segerstolpe",
                  ct.sub = c("alpha","beta","delta","gamma","ductal","acinar"), qcthreshold = 0.7) 
seger.qc$heatfig
scRNA-seq from Segerstolpe et al. Clustering QC

Similarly, we can perform clustering QC for single cells from only one subject using function SCDC_qc_ONE().

SCDC on Simulated Data

SCDC allows users to construct pseudo bulk samples in two ways:

  • generateBulk_allcells(): Sum up gene-wise raw read counts per subject, and the true cell-type proportion is consistent with the single cell clustering result.

  • generateBulk_norep(): Randomly sample single cells from each of the cell types of interest without replacement, and then sum up raw read counts per subject. The cell-type proportions are different from single cell clustering result.

Here, we use the ExpressionSet objects after the clustering-QC of each scRNA-seq dataset. Read in the processed objects downloaded from our webpage:

# ExpressionSet objects from the three different scRNA-seq resources
# setwd("your/working/directory")
qc.seger <- readRDS("qc_segerstolpe.rds")
qc.baron <- readRDS("qc_baron.rds")
qc.xin <- readRDS("qc_xin.rds")

Then create the pseudo bulk samples as follows. ct.varname specifies the variable name of your single cell clustering result variable. sample specifies the variable indicating subjects or individuals information. ct.sub specifies the subset of cell types you want to use to construct pseudo bulk samples.

## (1) using all single cells
pseudo.seger <- generateBulk_allcells(qc.seger$sc.eset.qc, ct.varname = "cluster", sample = "sample", ct.sub = c("alpha","beta","delta","gamma"))

## (2) using random proportions of single cells
pseudo.seger.rand <- generateBulk_norep(qc.seger$sc.eset.qc, ct.varname = "cluster", sample = "sample", ct.sub = c("alpha","beta","delta","gamma"), nbulk = 3)

## we know the true relative proportions for the pseudo bulk samples:
round(pseudo.seger$truep,3)
round(pseudo.seger.rand$true_p,3)

The constructed pseudo bulk object includes the true cell-type proportion matrix pseudo.seger$truep and the ExpressionSet object pseudo.seger$pseudo_eset.

Note that if the 'library size' factors do not reflect the real 'cell size' factors relationship according to prior biological knowledge, you can specify the ct.cell.size values in the SCDC_prop() or SCDC_basis() or SCDC_ENSEMBLE() functions. To perform deconvolution using one reference scRNA-seq dataset, use the SCDC_prop() function:

bulkseger.scseger <- SCDC_prop(bulk.eset = pseudo.seger$pseudo_eset, sc.eset = qc.seger$sc.eset.qc, ct.varname = "cluster", sample = "sample", ct.sub = c("alpha","beta","delta","gamma"), truep = pseudo.seger$truep)
bulkseger.scbaron <- SCDC_prop(bulk.eset = pseudo.seger$pseudo_eset, sc.eset = qc.baron$sc.eset.qc, ct.varname = "cluster", sample = "sample", ct.sub = c("alpha","beta","delta","gamma"), truep = pseudo.seger$truep)
bulkseger.scxin <- SCDC_prop(bulk.eset = pseudo.seger$pseudo_eset, sc.eset = qc.xin$sc.eset.qc, ct.varname = "cluster", sample = "sample", ct.sub = c("alpha","beta","delta","gamma"), truep = pseudo.seger$truep)

bulkseger.scseger$peval$evals.table
#>         RMSD     mAD      R
#> SCDC 0.03352 0.02553 0.9912
bulkseger.scbaron$peval$evals.table
#>         RMSD     mAD     R
#> SCDC 0.08337 0.06499 0.954
bulkseger.scxin$peval$evals.table
#>         RMSD     mAD      R
#> SCDC 0.19058 0.15449 0.9331

We can observe the discrepancy of performance using different datasets. In this example, when using single cells from the reference dataset Segerstolpe et al., we derive the highest Pearson correlation, and the lowest RMSD and mAD.

Next, we will explore the ENSEMBLE method integrating all the three scRNA-seq reference datasets using the pseudo bulk samples constructed above. Here we offered three methods to derive the ENSEMBLE weights: 'Grid search', 'LAD', and 'NNLS'. The default setting is using all these three methods. The search length is a parameter representing the search step size, which is related to the grid search method.

#input the list of single cell ExpressionSet objects
pancreas.sc <- list(baron = qc.baron$sc.eset.qc,
                    seger = qc.seger$sc.eset.qc,
                    xin = qc.xin$sc.eset.qc)
# This might take several minutes, depending on the search.length set by user.
PS_ensemble <- SCDC_ENSEMBLE(bulk.eset = pseudo.seger$pseudo_eset, sc.eset.list = pancreas.sc, ct.varname = "cluster",
                             sample = "sample", truep = pseudo.seger$truep[complete.cases(pseudo.seger$truep),], ct.sub =  c("alpha","beta","delta","gamma"), search.length = 0.01)
PS_ensemble$w_table
#               w1   w2   w3   RMSD_Y    mAD_Y Pearson_Y Spearman_Y RMSD_prop mAD_prop R_prop
#inverse SSE  0.06 0.66 0.28 7.146419 4.415475 0.7545354  0.8478534   0.06692  0.05541 0.9839
#LAD          0.00 1.00 0.00 6.706079 4.225556 0.8183429  0.8610900   0.03376  0.02511 0.9907
#Pearson_prop 0.02 0.93 0.05 6.720294 4.237310 0.8155518  0.8608889   0.03627  0.02779 0.9913
#mAD_Y        0.00 1.00 0.00 6.705799 4.225710 0.8185645  0.8611222   0.03376  0.02509 0.9907
#Spearman_Y   0.00 0.97 0.03 6.703476 4.229452 0.8192302  0.8613582   0.03483  0.02661 0.9912

Interpretation of the PS_ensemble$w_table: columns w1,w2,w3 are the weight assigned to each of the reference dataset result. The columns 4-11 are the metrics we used to derived the optimal weights for the references. For example, 'RMSD_Y' stands for the RMSD value based on observed and predicted expression 'Y' values while using the set of weights suggested on the left. For each row, these are the different optimization methods/criteria we used to derive the corresponding weights. For example, the first row suggests (0.06, 0.66, 0.28) for results derived from reference 1 to 3. This set of weights are selected based on calculating the inverse SSE value. Likewise, LAD stands for minimizing the least absolute deviation values based on 'Y'. Pearson_prop only works when you have the true sample composition input (that is, in the SCDC_ENSEMBLE() function, the truep option is not NULL.). This aims to find weights such that the final ensemble proportions have the highest Pearson correlation with the truth. mAD_Y is the grid search result, where we calculated the result from each weights combination, and chose the one minimized the mAD(Yobs, Ypredict). Spearman_Y is the grid search result, where we calculated the result from each weights combination, and chose the one maximized the Spearman corr(Yobs, Ypredict). If you have many reference datasets, say more than 4, then it's not recommended to use the 'grid search method' since it would be time consuming. You can turn off the grid search by setting grid.search = F.

In this case, SCDC ENSEMBLE weights suggest that the weight for results from reference dataset Segerstolpe et al. to be one. Hence, the final estimated cell-type proportions can be found at PS_ensemble$prop.only$seger or PS_ensemble$prop.list$seger$prop.est.mvw.

SCDC on Real Bulk RNA-seq Data

Here, we move on to analyze the real pancreatic islet bulk RNA-seq dataset. Fadista et al. (2014) provided 89 bulk samples, of which 77 samples have the HbA1c level information(an important biomarker for the Type II diabetes). Hence, we focus on the deconvolution for the 77 bulk samples (51 healthy and 26 diabetic). To allow the basis matrix to reflect the potentially different gene expression patterns between the cases and controls, we performed the ENSEMBLE weight selection procedures for the samples from the two classes (normal and T2D) separately.

fadista_77 <- readRDS("fadista_77.rds")
# setting the search.length as 0.01 might take several minutes to finish the ENSEMBLE procedure.
fadista.healthy.ens <- SCDC_ENSEMBLE(bulk.eset = fadista_77[,fadista_77$hba1c_class2 == "Normal"], sc.eset.list = list(baronh = qc.baron$sc.eset.qc, segerh = qc.seger$sc.eset.qc), ct.varname = "cluster",
                               sample = "sample", truep = NULL, ct.sub =  c("alpha","beta","delta","gamma","acinar","ductal"), search.length = 0.01, grid.search = T)  

fadista.t2d.ens <- SCDC_ENSEMBLE(bulk.eset = fadista_77[,fadista_77$hba1c_class2 == "T2D"], sc.eset.list = list(baronh = qc.baron$sc.eset.qc, segerh = qc.seger$sc.eset.qc), ct.varname = "cluster",
                               sample = "sample", truep = NULL, ct.sub =  c("alpha","beta","delta","gamma","acinar","ductal"), search.length = 0.01, grid.search = T)  
#> fadista.healthy.ens$w_table
#               w1   w2   RMSD_Y    mAD_Y Pearson_Y Spearman_Y
#inverse SSE  0.37 0.63 12.94834 5.844663 0.3419521  0.6552073
#LAD          0.17 0.83 12.62988 5.834102 0.3451222  0.6489382 
#mAD_Y        0.24 0.76 12.68422 5.828969 0.3456796  0.6524225 
#Spearman_Y   0.40 0.60 13.02000 5.850878 0.3406539  0.6552603
#> fadista.t2d.ens$w_table
#               w1   w2   RMSD_Y    mAD_Y Pearson_Y Spearman_Y
#inverse SSE  0.41 0.59 11.60516 5.492857 0.3950583  0.6889226 
#LAD          0.33 0.67 11.50850 5.494326 0.3962562  0.6870404 
#mAD_Y        0.38 0.62 11.56228 5.492216 0.3957707  0.6884008 
#Spearman_Y   0.48 0.52 11.75522 5.500820 0.3918644  0.6894475

To compare the ENSEMBLE results to the deconvolution results using one reference dataset only, we visualize the estimated proportions as following:

library(reshape2)
library(ggplot2)
getPropBox <- function(ens_h, ens_d, metric = NULL, ref, input_wt = NULL){
  if (!is.null(input_wt)){
    prop.h = as.data.frame(wt_prop(input_wt, ens_h$prop.only))
    prop.d = as.data.frame(wt_prop(input_wt, ens_d$prop.only))
  } else {
    prop.h = as.data.frame(wt_prop(ens_h$w_table[metric,1:2], ens_h$prop.only))
    prop.d = as.data.frame(wt_prop(ens_d$w_table[metric,1:2], ens_d$prop.only))
  }
  prop2 <- rbind(prop.h, prop.d)
  prop2$condition <- c(rep("Normal", nrow(prop.h)), rep("T2D", nrow(prop.d)))
  
  dtmelt <- melt(prop2, id.vars = "condition")
  dtmelt$ref <- as.factor(ref)
  return(dtmelt)
}

fdt.ens.spearman <- getPropBox(ens_h = fadista.healthy.ens, ens_d = fadista.t2d.ens, metric = 1, ref = "ENSEMBLE+SpearmanR")
fdt.seger <- getPropBox(ens_h = fadista.healthy.ens, ens_d = fadista.t2d.ens,  ref = "Segerstolpe", input_wt = c(0,1))
fdt.baron <- getPropBox(ens_h = fadista.healthy.ens, ens_d = fadista.t2d.ens,  ref = "Baron", input_wt = c(1,0))
dtall_fadista <- rbind(fdt.ens.spearman, fdt.seger, fdt.baron)
dtall_fadista$refcond <- paste(dtall_fadista$ref, dtall_fadista$condition)
colfunc <- colorRampPalette(c("red", "white"))
colfunc2 <- colorRampPalette(c("blue", "white"))
pfa2 <- ggplot(dtall_fadista[dtall_fadista$refcond %in% c("ENSEMBLE+SpearmanR Normal", "Segerstolpe Normal","Baron Normal",
                                                          "ENSEMBLE+SpearmanR T2D","Segerstolpe T2D","Baron T2D"),], 
               aes(x=variable, y=value, color = factor(refcond, levels = c("ENSEMBLE+SpearmanR Normal", "Segerstolpe Normal","Baron Normal",
                                                                           "ENSEMBLE+SpearmanR T2D","Segerstolpe T2D","Baron T2D")))) +
  geom_boxplot(outlier.size=-1)+
  geom_jitter(aes(x=variable,
                  color = factor(refcond,
                                 levels = c("ENSEMBLE+SpearmanR Normal", "Segerstolpe Normal","Baron Normal",
                                            "ENSEMBLE+SpearmanR T2D","Segerstolpe T2D","Baron T2D"))),
              position = position_dodge(0.75), alpha = 0.5,cex = 0.25)+
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size=10),
        axis.text.y = element_text(size = 10),
        text = element_text(size = 10),
        plot.title = element_text(size=10, face = "bold"),
        plot.margin=unit(c(1,1,-5,0), "mm"),
        legend.position="top",legend.title = element_blank(),
        legend.text = element_text(size=8),
        legend.box.spacing = unit(0, "mm"))+
  scale_color_manual(values=c(colfunc(10)[seq(1,9,3)],colfunc2(10)[seq(1,9,3)])) +
  ylab("") + xlab("")
pfa2
Cell-type proportion estimation for 77 bulk samples from Fadista et al.

In our paper, we sought to replicate previous findings on the negative correlation between the hemoglobin A1c (HbA1c, an important biomarker for type 2 diabetes) levels and the beta cell functions ( Kanat et al., 2011, Hou et al., 2016). Hence, we constructed a linear model using the estimated cell-type proportions as the response variable and the other covariates (age, gender, BMI, and HbA1c) as predictors. Here we compare deconvolution results from three sources: 1) scRNA-seq from Segerstolpe et al. only; 2) scRNA-seq from Baron et al. only; 3) ENSEMBLE deconvolution results using scRNA-seq from Segerstolpe et al. and Baron et al. separately.

## we use the scRNA-seq datasets without clustering-QC as input here, aiming to be comparable to the results as MuSiC shown in their paper.
baron <- readRDS("baron.rds")
seger <- readRDS("segerstolpe.rds")
## SCDC deconvolution using one ref dataset:
fadista.seger <- SCDC_prop(fadista_77, seger, ct.varname = "cluster", sample = "sample",
                          ct.sub = c("alpha","beta","delta","gamma","acinar","ductal"))

fadista.baron <- SCDC_prop(fadista_77, baron, ct.varname = "cluster", sample = "sample",
                                ct.sub = c("alpha","beta","delta","gamma","acinar","ductal"))
## ENSEMBLE proportions
getPropbycond <- function(ens_h, ens_d, metric){
  prop.h = as.data.frame(wt_prop(ens_h$w_table[metric,1:2], ens_h$prop.only))
  prop.d = as.data.frame(wt_prop(ens_d$w_table[metric,1:2], ens_d$prop.only))
  prop2 <- rbind(prop.h, prop.d)
  prop2$condition <- c(rep("Normal", nrow(prop.h)), rep("T2D", nrow(prop.d)))
  return(prop2)
}

## combine demographic information
ens.spearman.prop <- getPropbycond(ens_h = fadista.healthy.ens, ens_d = fadista.t2d.ens, metric = 8)

Fit linear models: beta cell proportion ~ HbA1c + age + BMI + sex.

## get demographic information of the bulk samples from Fadista et al.
fadista_demo <- fadista_77@phenoData@data[,c("age","gender","hba1c","bmi","hba1c_class2")]
## the following getLMpValue() function is to perform the linear regression and wrap the linear model results
getLMpValue <- function(prop_est, method_){
  prop_est_all <- cbind(prop_est, fadista_demo[rownames(prop_est),])
  getlmtable <- function(ct){
    prop_est_all$ct <- prop_est_all[,ct]
    lm.ens1 <- lm(ct ~ hba1c + age + bmi + gender, data = prop_est_all)
    s1 <- summary(lm.ens1)
    sdt <- cbind(round(s1$coefficients,4), celltype = rep(ct,5))
    return(sdt)
  }
  dtlm <- NULL
  for (ct in intersect(c("alpha","beta","delta","gamma","acinar","ductal"),
                       colnames(prop_est))){
    dtlm0 <- getlmtable(ct)
    dtlm <- rbind(dtlm, dtlm0)
  }
  dat_text_fad <- data.frame(label = paste("p-value =",dtlm[seq(from = 2 , to = 27, by=5),c(4)]),
                             variable = dtlm[seq(from = 2 , to = 27, by=5),c(5)],
                             hba1c = 6, value = 0.5,
                             condition = "T2D",
                             method = method_)
  demo_pval <- data.frame(label = paste("p-value =",dtlm[,c(4)]),
                          variable = dtlm[,c(5)],
                          value = 0.5,
                          covar =rownames(dtlm),
                          condition = "T2D",
                          method = method_)
  dtmelt <- melt(prop_est_all, id.vars = c("hba1c_class2","age","gender","bmi","hba1c"), measure.vars = c("alpha","beta","delta","gamma","acinar","ductal"))
  dtmelt$method <- method_
  return(list(meltdata = dtmelt, pvaldata = dat_text_fad, demo_pval = demo_pval))
}
res.SCDC.seger <- getLMpValue(prop_est = fadista.seger$prop.est.mvw, method_ = "SCDC+Segerstolpe")
res.SCDC.baron <- getLMpValue(prop_est = fadista.baron$prop.est.mvw, method_ = "SCDC+Baron")
res.SCDC.ensemble <- getLMpValue(prop_est = ens.spearman.prop, method_ = "SCDC+ENSEMBLE")
## pool the results together for visualization:
alldata <- rbind(res.SCDC.seger$meltdata, res.SCDC.ensemble$meltdata,
                 res.SCDC.baron$meltdata)
allpvalue <- rbind(res.SCDC.seger$pvaldata, res.SCDC.ensemble$pvaldata,
                   res.SCDC.baron$pvaldata)
demopval <- rbind(res.SCDC.seger$demo_pval, res.SCDC.ensemble$demo_pval,
                  res.SCDC.baron$demo_pval)
alldata$method <- as.factor(alldata$method)

Comparing ENSEMBLE proportion estimation results with the one-reference deconvolution results. Here we visualize the beta-cell proportions with the HbA1c levels, and present the p-values for HbA1c coefficient from each model.

fadista.LMplot <- ggplot(alldata[alldata$variable %in% c("beta"),], aes(x=hba1c, y= value)) + geom_point(aes(color = hba1c_class2)) +
  geom_smooth(method='lm', se = FALSE, color = "black", lwd = 0.25) +
  theme(legend.position = "top", legend.title = element_blank(), # axis.title.x = element_blank(),
        legend.box.spacing = unit(0, "mm"),
        text = element_text(size=9),
        axis.text.x = element_text(size=9),
        axis.text.y = element_text(size=9))+#, axis.title.y = element_blank()) +
  facet_grid(cols = vars(factor(method, levels = c("SCDC+Baron", "SCDC+Segerstolpe", "SCDC+ENSEMBLE"))),scales = "free") +
  geom_text(data = allpvalue[allpvalue$variable %in% c("beta"),],
            mapping = aes(x=5.5, y=0.85,label = label),
            hjust   = 0.2, vjust   = 0, size = 2.5) +
  xlab("HbA1c") + ylab("Proportions") + ylim(c(0,1))
fadista.LMplot

We see that SCDC ENSEMBLE method derives the p-value of 0.0019 <0.05, indicating a statistically significant association between the beta-cell proportions and the HbA1c levels.

Three Cell-line Mixture Data Analysis

In the real world, some identified cell types are hard to be distinguished in certain environment. Employing a hierachical deconvolution structure (tree-guided deconvolution structure) may work better than a straightforward deconvolution. Here, using the three-cell-line mixture sample, we show how deconvolution is performed by SCDC when only one subject is available. Also, by looking at the hierarchical structure of cell lines, we see two of the cell lines are mroe closely related. Hence, we can show how a tree-guided deconvolution step is applied in this simple case.

qc.3cl <- readRDS("MIX3cl_scESET.rds")
sc3cl.basis <- SCDC_basis_ONE(qc.3cl$sc.eset.qc, ct.varname = "md_cluster", sample ="orig.ident")
df.d <- dist(t(log(sc3cl.basis$basis.mvw + 1e-10)), method = "euclidean")
hc1 <- hclust(df.d, method = "complete")
plot(hc1, cex = 0.8, hang = -1, main = "log(basis matrix)")

The tree-structured deconvolution procedure will allow users to specify a 'meta cluster' by their own professional judgment. Without the requirement for marker genes input, we could employ the function SCDC_prop_ONE_subcl_marker() to deconvolve the bulk samples.

bulk3cl <- readRDS("MIX3cl_bulkESET.rds")
sc3cl2 <- qc.3cl$sc.eset.qc
sc3cl2$md_cluster2 <- sc3cl2$md_cluster
sc3cl2$md_cluster2[!sc3cl2$md_cluster2 %in% "NormalFibroblast"] <- "Tumor"
# deconv_3cl <- SCDC_prop_ONE(bulk.eset = bulk3cl, sc.eset = sc3cl2, ct.varname = "md_cluster", sample = "orig.ident", ct.sub = unique(sc3cl$md_cluster), weight.basis = T)
SCDC_3cl_tree <- SCDC_prop_ONE_subcl_marker(bulk.eset = bulk3cl, sc.eset = sc3cl2, ct.varname = "md_cluster",
                                             sample = "orig.ident", ct.sub = unique(sc3cl2$md_cluster),
                                             ct.fl.sub = unique(sc3cl2$md_cluster2) , weight.basis = T,
                                             fl.varname = "md_cluster2", select.marker = T, LFC.lim = 5)
#> WNNLS for Second level clusters MCF7 MDA-MB468 Converged at iteration  3
SCDC_3cl_tree$prop.est
#>        MCF7 MDA-MB468 NormalFibroblast
#> 1 0.2561348 0.6367564        0.1071088

The choice of limit for LogFoldChange LFC.lim=5 could be changed by users. Empirically, a good threshold value would cut the number of genes to 1000~2000, which we believe is an adequate number to detect the inter-cell-type variations.

Mouse Mammary Gland Data Analysis

Using the single cell clustering package Seurat, we classified the mouse mammary gland single cells into five cell types of interest: endothelial, fibroblast, luminal, basal and immune cells. In our paper, we perform deconvolution for mouse mammary gland tissue bulk RNA-seq data of two forms: fresh-frozen and 10X. The detailed data process structure is shown in the figure 5 of our paper.

For illustration purpose, we only show how deconvolution of the 10X bulk samples is performed as follows. The processed fresh-frozen bulk RNA-seq data is also available at our data download page.

First we show the deconvolution using function SCDC_prop_subcl_marker() utilizing each of the single cell reference dataset separately.

## read in ExpressionSet object from Perou lab, and Tabula Muris
bulk10x <- readRDS("peroubulk10x_fvb34.rds")
qc.perou<- readRDS("qc_perou.rds")
qc.tmouse<- readRDS("qc_tmouse.rds")

## you could try the SCDC_prop() function for deconvolution without tree-guided structure first, 
## but based on biological knowledge and the single cell information we have now, the proportions estimation result
## might not reflect the proportions that are close to truth.

## here, tree-guided two-level deconvolution is performed (takes around 2 min):
perou <- qc.perou$sc.eset.qc
tmouse <- qc.tmouse$sc.eset.qc

perou$metacluster2[perou$md_cluster %in% c( "immune")] <- "immune"
perou$metacluster2[perou$md_cluster %in% c("basal","luminal","fibroblast","endothelial")] <- "BaLuFibEndo"
perou.subcl <- SCDC_prop_subcl_marker(bulk.eset = bulk10x, sc.eset = perou, ct.varname = "md_cluster", fl.varname = "metacluster2", sample = "subj", ct.sub = c("endothelial","fibroblast","immune","luminal","basal"), ct.fl.sub = unique(perou$metacluster2), select.marker = T, LFC.lim = 5)

tmouse$metacluster2[tmouse$md_cluster %in% c("immune")] <- "immune"
tmouse$metacluster2[tmouse$md_cluster %in% c("endothelial", "fibroblast","luminal","basal")] <- "BaLuFibEndo"
tmouse.subcl <- SCDC_prop_subcl_marker(bulk.eset = bulk10x, sc.eset = tmouse, ct.varname = "md_cluster", fl.varname = "metacluster2", sample = "subj", ct.sub = c("endothelial","fibroblast","immune","luminal","basal"), ct.fl.sub = unique(tmouse$metacluster2), select.marker = T, LFC.lim = 5)

Then we proceed with the ENSEMBLE step. Note that you can also input the deconvolution results from other methods like MuSiC, CIBERSORTx, Bisque, etc. The input elements we require here consist of at least two parts: the basis/signature/feature/profile matrix (genes by cell types), and the estimated cell-type proportions matrix (cell types by samples). We provide the function CreateSCDCpropObj to create object as such input format.

## 
ens_subcl_perou10x <- SCDC_ENSEMBLE(bulk.eset = bulk10x, prop.input = list(tmouse.subcl = tmouse.subcl, perou.subcl = perou.subcl), ct.sub = c("endothelial","fibroblast","immune","luminal","basal"), search.length = 0.01, grid.search = T)
ens_subcl_perou10x$w_table

The following code visualizes the SCDC estimation results. getPearson is a function calculating the Pearson correlation between two (estimated) datasets for the selected subjects (subj) and cell types (ct). ens_res is the function to summarize the ENSEMBLE results. wt is a vector of estimated weights which should sum up to 1. proplist is a list of estimated proportion matrix corresponding to the reference datasets in the same order with wt. subject is a vector which specifies the name of subjects of interest. getENSplot is the function to generate the final ENSEMBLE plot. enstable is a matrix which usually is the selected your_ENSEMBLE_result$w_table. You can specify your own designed weights for the references also. each row is a set of weights and each column correspond to a reference dataset. combo is a list of estimated proportion matrices derieved from SCDC or other deconvolution methods. The order of the lists should be consistent with the column order in the weight table (enstable). subject specifies the subjects of interest in your bulk samples. subcat is optional, and it can help you select a subset of ENSEMBLE method options. To specify subcat, it should be a subset of rownames of your enstable.

getPearson <- function(dt1, dt2, ct, subj){
  cor(c(dt1[subj,ct]), c(dt2[subj,ct]))
}
ens_res <- function(wt, proplist, subject = c("FVB3","FVB4")){
  ens <- wt_prop(wt, proplist = proplist) 
  p.ens <- getPearson(perou_pseudo_all$truep, ens ,
                      ct = c("endothelial","fibroblast","luminal","basal","immune"), subj = subject)
  return(list(prop = ens, p = p.ens))
}
combo.10x <- lapply(ens_subcl_perou10x$prop.list, function(x){
  x$prop.est[,c("endothelial","fibroblast","luminal","basal","immune")] 
})

library(ggplot2)
library(reshape2)
ct1 <- c("mediumorchid1","mediumpurple1","lightskyblue","seagreen1","yellow","tan1","azure3")
perou_pseudo_all <- generateBulk_allcells(perou, ct.varname = 'md_cluster', sample = 'subj', ct.sub = c("endothelial","fibroblast","immune","luminal","basal"))

getENSplot <- function(enstable, combo, subject, subcat = NULL){
  if (!is.null(subcat)){
    enstable <- enstable[subcat,]
  }
  wtres.list <- list()
  for (i in 1:nrow(enstable)){
    wtres.list[[i]] <- ens_res(enstable[i,1:2], combo)$prop
  }
  names(wtres.list) <- rownames(enstable)
  pcor.list <- NULL
  for (i in 1:nrow(enstable)){
    pcor.list[[i]] <- ens_res(enstable[i,1:2], combo, subject = subject)$p
  }
  names(pcor.list) <- rownames(enstable)
  est.all <- rbind(perou_pseudo_all$truep[,c("endothelial","fibroblast","luminal","basal","immune")],
                   do.call(rbind, combo), 
                   do.call(rbind, wtres.list))
  est.all.3 <- est.all[rownames(est.all) == subject,]
  rownames(est.all.3) <- as.factor(c("Seurat", "Tabula Muris", "Perou", 
                                     paste("ENSEMBLE:",names(pcor.list))))
  
  p_t <- getPearson(perou_pseudo_all$truep, combo[[1]],
                    ct = c("endothelial","fibroblast","luminal","basal","immune"), subj = c(subject))
  p_p <- getPearson(perou_pseudo_all$truep, combo[[2]],
                    ct = c("endothelial","fibroblast","luminal","basal","immune"), subj = c(subject))
  dat_text <- data.frame(label = c("",paste("R=", round(c(p_t, p_p, pcor.list),2), sep = "")),
                         Var1 = as.factor(c("Seurat", "Tabula Muris", "Perou", 
                                            paste("ENSEMBLE:",names(pcor.list)))),
                         value =rep(0.9, length(pcor.list)+3),
                         Var2 = rep("endothelial",length(pcor.list)+3))
  meltdt <- melt(est.all.3)
  P <- ggplot(meltdt, aes(x=Var1, y=value, fill = factor(Var2, levels = c("endothelial","fibroblast","luminal","basal","immune")))) +
    geom_bar(stat = "identity") + xlab("") + ylab("Percentage") +
    theme(axis.text.x = element_text(angle = 30, hjust = 1, size=10),
          axis.text.y = element_text(size = 10),
          text = element_text(size = 10),
          plot.title = element_text(size=10, face = "bold"),
          plot.margin=unit(c(1,1,-5,0), "mm"),
          legend.position="top",legend.title = element_blank(),
          legend.text = element_text(size=8),
          legend.box.spacing = unit(0, "mm"),
          panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          panel.background = element_blank(), axis.line = element_line(colour = "black"))+
    scale_fill_manual(values=ct1) +
    guides(fill = guide_legend(nrow = 2))+
    # facet_grid(cols = vars(ref)) +
    geom_text(data = dat_text,
              mapping = aes(x=Var1, y=value,label = label),
              hjust   = 0.5, vjust   = 0)
  # png(file.path(vpath,  "mammarygland.png"), res = 80, height = 300, width = 300)
  print(P)
  # dev.off()
}

getENSplot(enstable = ens_subcl_perou10x$w_table, combo=combo.10x, subject = "FVB3", subcat = c("Spearman_Y","LAD"))

Visualization of the proportion estimation results:

Simple Deconvolution

For simple deconvolution aims, SCDC provides the deconv_simple() function to allow users to specify a pre-processed bulk sample matrix "count.filter.norm", and a pre-calculated basis matrix "basis.norm" for deconvolution. Thus, by iterative weighted-NNLS, the cell-type proportion would be estimated. One example will be:

res = deconv_simple(count.filter.norm, basis.norm)
res$prop.est.mvw

Thanks for using SCDC! Please feel free to reach out (meichen@live.unc.edu) if you have questions!