vignettes/articles/Attaching_SC.Rmd
Attaching_SC.Rmd
In single-cell RNA sequencing workflows, dimensional reduction is typically performed by first identifying highly variable features. These features are then used directly for UMAP/tSNE projection or as inputs for principal component analysis. The same approach is commonly applied to clustering as well.
However, in immune-focused datasets, VDJ genes from TCR and BCR are often among the most variable genes. This variability arises naturally due to clonal expansion and diversity within lymphocytes. As a result, UMAP projections and clustering outcomes may be influenced by clonal information rather than broader transcriptional differences across cell types.
To mitigate this issue, a common strategy is to exclude VDJ genes from the set of highly variable features before proceeding with clustering and dimensional reduction. We introduce a set of functions that facilitate this process by removing VDJ-related genes from either a Seurat Object or a vector of gene names (useful for SCE-based workflows):
quietVDJgenes()
– Removes both TCR and BCR VDJ
genes.quietTCRgenes()
– Removes only TCR VDJ genes.quietBCRgenes()
– Removes only BCR VDJ genes, but
retains BCR VDJ pseudogenes in the variable features.# Check the first 10 variable features before removal
VariableFeatures(scRep_example)[1:10]
## [1] "TRBV7-2" "HSPA1B" "HSPA1A" "TRBV4-1" "CCL4" "TRBV5-1"
## [7] "TRBV10-3" "TRBV3-1" "TRBV6-6" "CD79A"
# Remove TCR VDJ genes
scRep_example <- quietTCRgenes(scRep_example)
# Check the first 10 variable features after removal
VariableFeatures(scRep_example)[1:10]
## [1] "HSPA1B" "HSPA1A" "CCL4" "CD79A" "HLA-DRA" "CCL20" "TUBA1B"
## [8] "HSPA6" "TNF" "MS4A1"
By applying these functions, you can ensure that clustering and dimensional reduction are driven by broader transcriptomic differences rather than clonal expansion artifacts.
The data in the scRepertoire package is derived from a study of acute
respiratory stress disorder in the context of bacterial and COVID-19
infections. The internal single cell data (scRep_example()
)
built in to scRepertoire is randomly sampled 500 cells from the fully
integrated Seurat object to minimize the package size. However, for the
purpose of the vignette we will use the full single-cell object with
30,000 cells. We will use both Seurat and Single-Cell Experiment (SCE)
with scater to perform further visualizations in tandem.
scRep_example <- readRDS("scRep_example_full.rds")
#Making a Single-Cell Experiment object
sce <- Seurat::as.SingleCellExperiment(scRep_example)
After processing the contig data into clones via
combineBCR()
or combineTCR()
, we can add the
clonal information to the single-cell object using
combineExpression()
.
Importantly, the major requirement for the attachment is matching contig cell barcodes and barcodes in the row names of the meta data of the Seurat or Single-Cell Experiment object. If these do not match, the attachment will fail. Based on ease, we suggest making changes to the single-cell object barcodes.
Part of combineExpression()
is calculating the clonal
frequency and proportion, placing each clone into groups called
cloneSize. The default cloneSize
argument uses the following bins: c(Rare = 1e-4, Small = 0.001, Medium =
0.01, Large = 0.1, Hyperexpanded = 1), which can be modified to include
more/less bins or different names.
Clonal frequency and proportion is dependent on the repertoires being
compared, which we can modify the calculation using the
group.by parameter, such as grouping by the
Patient variable from above. If group.by is
not set, combineExpression()
will calculate clonal
frequency, proportion, and cloneSize as a function of
individual sequencing runs. In addition, cloneSize can
use the frequency of clones when proportion =
FALSE.
We can look at the default cloneSize groupings using the Single-Cell
Experiment object we just created above with using
group.by set to the sample variable used in
combineTCR()
:
sce <- combineExpression(combined.TCR,
sce,
cloneCall="gene",
group.by = "sample",
proportion = TRUE)
#Define color palette
colorblind_vector <- hcl.colors(n=7, palette = "inferno", fixup = TRUE)
plotUMAP(sce, colour_by = "cloneSize") +
scale_color_manual(values=rev(colorblind_vector[c(1,3,5,7)]))
Alternatively, if we want cloneSize to be based on
the frequency of the clone, we can set proportion =
FALSE and we will need to change the cloneSize bins to
integers. If we have not inspected our clone data, setting the upper
limit of the clonal frequency might be difficult -
combineExpression()
will automatically adjust the upper
limit to fit the distribution of the frequencies. To demonstrate this,
check out the Seurat object below:
scRep_example <- combineExpression(combined.TCR,
scRep_example,
cloneCall="gene",
group.by = "sample",
proportion = FALSE,
cloneSize=c(Single=1, Small=5, Medium=20, Large=100, Hyperexpanded=500))
Seurat::DimPlot(scRep_example, group.by = "cloneSize") +
scale_color_manual(values=rev(colorblind_vector[c(1,3,4,5,7)]))
If we have TCR/BCR enrichment or want to add info for gamma-delta and
alpha-beta T cells, we can make a single list and use
combineExpression()
.
Major note if there are duplicate barcodes (if a
cell has both Ig and TCR), the immune receptor information will not be
added. It might be worth checking cluster identities and removing
incongruent barcodes in the products of combineTCR()
and
combineBCR()
.
As an anecdote, the testing data we used to improve this function had 5-6% of barcode overlap.
#This is an example of the process, which will not be evaluated during knit
TCR <- combineTCR(...)
BCR <- combineBCR(...)
list.receptors <- c(TCR, BCR)
seurat <- combineExpression(list.receptors,
seurat,
cloneCall="gene",
proportion = TRUE)