vignettes/articles/Clonal_Cluster.Rmd
Clonal_Cluster.Rmd
The clonalCluster()
function provides a powerful method
to group clonotypes based on sequence similarity. It calculates the edit
distance between CDR3 sequences and uses this information to build a
network, identifying closely related clusters of T or B cell receptors.
This functionality allows for a more nuanced definition of a “clone”
that extends beyond identical sequence matches.
The clustering process follows these key steps:
sequence = "nt"
) or amino acid
(sequence = "aa"
) CDR3 sequences for a specified
chain.use.v = TRUE
).cluster.method = "components"
).igraph
object
for network analysis, or export a sparse adjacency matrix.threshold
Parameter
The behavior of the threshold parameter is critical for controlling cluster granularity:
0.85
), it
represents the normalized edit distance (Levenshtein distance / mean
sequence length). A higher value corresponds to a stricter similarity
requirement. This is useful for comparing sequences of varying
lengths.Key Parameter(s) for clonalCluster()
sequence
: Specifies whether to cluster based on
aa
(amino acid) or nt
(nucleotide)
sequences.threshold
: The similarity threshold for clustering.
Values < 1
are normalized similarity, while values
>= 1
are raw edit distance.group.by
: A column header in the metadata or lists to
group the analysis by (e.g., “sample”, “treatment”). If
NULL
, clusters are calculated across all sequences.use.V
: If TRUE
, sequences must share the
same V gene to be clustered together.use.J
: If TRUE
, sequences must share the
same J gene to be clustered together.cluster.method
: The clustering algorithm to use.
Defaults to components
, which finds connected
subgraphs.cluster.prefix
: A character prefix to add to the
cluster names (e.g., “cluster.”).exportGraph
: If TRUE
, returns an igraph
object of the sequence network.exportAdjMatrix
: If TRUE
, returns a sparse
adjacency matrix (dgCMatrix) of the network.To run clustering on the first two samples for the TRA chain, using amino acid sequences with a normalized similarity threshold of 0.85:
# Run clustering on the first two samples for the TRA chain
sub_combined <- clonalCluster(combined.TCR[c(1,2)],
chain = "TRA",
sequence = "aa",
threshold = 0.85)
# View the new cluster column
head(sub_combined[[1]][, c("barcode", "TCR1", "TRA.Cluster")])
## barcode TCR1 TRA.Cluster
## 1 P17B_AAACCTGAGTACGACG-1 TRAV25.TRAJ20.TRAC <NA>
## 2 P17B_AAACCTGCAACACGCC-1 TRAV38-2/DV8.TRAJ52.TRAC <NA>
## 3 P17B_AAACCTGCAGGCGATA-1 TRAV12-1.TRAJ9.TRAC cluster.1
## 4 P17B_AAACCTGCATGAGCGA-1 TRAV12-1.TRAJ9.TRAC cluster.1
## 5 P17B_AAACGGGAGAGCCCAA-1 TRAV20.TRAJ8.TRAC cluster.2
## 6 P17B_AAACGGGAGCGTTTAC-1 TRAV12-1.TRAJ9.TRAC cluster.1
You can calculate clusters based on specific metadata variables
within a single-cell object by using the group.by
parameter. This is useful for analyzing clusters on a per-sample or
per-patient basis without subsetting the data first.
First, add patient and type information to the
scRep_example
Seurat object:
#Adding patient information
scRep_example$Patient <- substr(scRep_example$orig.ident, 1,3)
#Adding type information
scRep_example$Type <- substr(scRep_example$orig.ident, 4,4)
Now, run clustering on the scRep_example
Seurat object,
grouping calculations by “Patient”:
# Run clustering, but group calculations by "Patient"
scRep_example <- clonalCluster(scRep_example,
chain = "TRA",
sequence = "aa",
threshold = 0.85,
group.by = "Patient")
#Define color palette
num_clusters <- length(unique(na.omit(scRep_example$TRA.Cluster)))
cluster_colors <- hcl.colors(n = num_clusters, palette = "inferno")
DimPlot(scRep_example, group.by = "TRA.Cluster") +
scale_color_manual(values = cluster_colors, na.value = "grey") +
NoLegend()
Instead of modifying the input object, clonalCluster()
can export the underlying network structure for advanced analysis. Set
exportGraph = TRUE
to get an igraph object consisting of
the networks of barcodes by the indicated clustering scheme.
set.seed(42)
#Clustering Patient 19 samples
igraph.object <- clonalCluster(combined.TCR[c(5,6)],
chain = "TRB",
sequence = "aa",
group.by = "sample",
threshold = 0.85,
exportGraph = TRUE)
# Setting color scheme
col_legend <- factor(igraph::V(igraph.object)$group)
col_samples <- hcl.colors(2,"inferno")[as.numeric(col_legend)]
color.legend <- factor(unique(igraph::V(igraph.object)$group))
# Sampling 1000 Barcodes
sample.vertices <- V(igraph.object)[sample(length(igraph.object), 1000)]
subgraph.object <- induced_subgraph(igraph.object, vids = sample.vertices)
V(subgraph.object)$degrees <- igraph::degree(subgraph.object)
edge_alpha_color <- adjustcolor("gray", alpha.f = 0.3)
#Plotting
plot(subgraph.object,
layout = layout_nicely(subgraph.object),
vertex.label = NA,
vertex.size = sqrt(igraph::V(subgraph.object)$degrees),
vertex.color = col_samples[sample.vertices],
vertex.frame.color = "white",
edge.color = edge_alpha_color,
edge.arrow.size = 0.05,
edge.curved = 0.05,
margin = -0.1)
legend("topleft",
legend = levels(color.legend),
pch = 16,
col = unique(col_samples),
bty = "n")
For computational applications, you can export a sparse adjacency
matrix using exportAdjMatrix = TRUE
. This matrix represents
the connections between all barcodes in the input data, with the edit
distance that meet the threshold in places of connection.
# Generate the sparse matrix
adj.matrix <- clonalCluster(combined.TCR[c(1,2)],
chain = "TRB",
exportAdjMatrix = TRUE)
# View the dimensions and a snippet of the matrix
dim(adj.matrix)
## [1] 5698 5698
print(adj.matrix[1:10, 1:10])
## 10 x 10 sparse Matrix of class "dgCMatrix"
##
## P17B_AAACCTGAGTACGACG-1 . . . . . . . . . .
## P17B_AAACCTGCAACACGCC-1 . . . . . . . . . .
## P17B_AAACCTGCAGGCGATA-1 . . . 1e-06 . 1e-06 1e-06 1e-06 . 1e-06
## P17B_AAACCTGCATGAGCGA-1 . . 1e-06 . . 1e-06 1e-06 1e-06 . 1e-06
## P17B_AAACGGGAGAGCCCAA-1 . . . . . . . . . .
## P17B_AAACGGGAGCGTTTAC-1 . . 1e-06 1e-06 . . 1e-06 1e-06 . 1e-06
## P17B_AAACGGGAGGGCACTA-1 . . 1e-06 1e-06 . 1e-06 . 1e-06 . 1e-06
## P17B_AAACGGGAGTGGTCCC-1 . . 1e-06 1e-06 . 1e-06 1e-06 . . 1e-06
## P17B_AAACGGGCAGTTAACC-1 . . . . . . . . . .
## P17B_AAACGGGGTCGCCATG-1 . . 1e-06 1e-06 . 1e-06 1e-06 1e-06 . .
You can analyze the combined network of both TRA/TRB or IGH/IGL
chains by setting chain = "both"
. This will create a single
cluster column named Multi.Cluster
.
# Cluster using both TRB and TRA chains simultaneously
clustered_both <- clonalCluster(combined.TCR[c(1,2)],
chain = "both")
# View the new "Multi.Cluster" column
head(clustered_both[[1]][, c("barcode", "TCR1", "TCR2", "Multi.Cluster")])
## barcode TCR1 TCR2
## 1 P17B_AAACCTGAGTACGACG-1 TRAV25.TRAJ20.TRAC TRBV5-1.None.TRBJ2-7.TRBC2
## 2 P17B_AAACCTGCAACACGCC-1 TRAV38-2/DV8.TRAJ52.TRAC TRBV10-3.None.TRBJ2-2.TRBC2
## 3 P17B_AAACCTGCAGGCGATA-1 TRAV12-1.TRAJ9.TRAC TRBV9.None.TRBJ2-2.TRBC2
## 4 P17B_AAACCTGCATGAGCGA-1 TRAV12-1.TRAJ9.TRAC TRBV9.None.TRBJ2-2.TRBC2
## 5 P17B_AAACGGGAGAGCCCAA-1 TRAV20.TRAJ8.TRAC <NA>
## 6 P17B_AAACGGGAGCGTTTAC-1 TRAV12-1.TRAJ9.TRAC TRBV9.None.TRBJ2-2.TRBC2
## Multi.Cluster
## 1 <NA>
## 2 <NA>
## 3 cluster.1
## 4 cluster.1
## 5 cluster.2
## 6 cluster.1
While the default cluster.method = "components"
is
robust, you can use other algorithms from the igraph package, such as
walktrap
or louvain
, to potentially uncover
different community structures.
# Cluster using the walktrap algorithm
graph_walktrap <- clonalCluster(combined.TCR[c(1,2)],
cluster.method = "walktrap",
exportGraph = TRUE)
# Compare the number of clusters found
length(unique(V(graph_walktrap)$cluster))
## [1] 424
Overall, clonalCluster()
is a versatile function for
defining and analyzing clonal relationships based on sequence
similarity. It allows researchers to move beyond exact sequence matches,
providing a more comprehensive understanding of clonal families. The
ability to customize parameters like threshold
,
chain
selection, and group.by
ensures
adaptability to diverse research questions. Furthermore, the option to
export igraph
objects or sparse adjacency matrices provides
advanced users with the tools for in-depth network analysis.