This function clusters TCRs or BCRs based on the edit distance of their CDR3
sequences. It can operate on either nucleotide (nt) or amino acid (aa)
sequences and can optionally enforce that clones share the same V and/or J
genes. The output can be the input object with an added metadata column for
cluster IDs, a sparse adjacency matrix, or an igraph graph object
representing the cluster network.
clonalCluster(
  input.data,
  chain = "TRB",
  sequence = "aa",
  threshold = 0.85,
  group.by = NULL,
  cluster.method = "components",
  cluster.prefix = "cluster.",
  use.V = TRUE,
  use.J = FALSE,
  exportAdjMatrix = FALSE,
  exportGraph = FALSE
)
The product of combineTCR(), combineBCR() or combineExpression().
The TCR/BCR chain to use. Use "both" to include both chains
(e.g., TRA/TRB). Accepted values: "TRA", "TRB", "TRG", "TRD", "IGH",
"IGL" (for both light chains), "both".
Whether to cluster on amino acid ("aa") or nucleotide ("nt") sequences.
The similarity threshold. If < 1, treated as normalized similarity (higher is stricter). If >= 1, treated as raw edit distance (lower is stricter).
A column header in the metadata or lists to group the analysis
by (e.g., "sample", "treatment"). If NULL, clusters will be calculated across
all sequences.
The clustering algorithm to use. Defaults to "components",
which finds connected subgraphs.
A character prefix to add to the cluster names (e.g., "cluster.").
If TRUE, sequences must share the same V gene to be
clustered together.
If TRUE, sequences must share the same J gene to be
clustered together.
If TRUE, the function returns a sparse
adjacency matrix (dgCMatrix) of the network.
If TRUE, the function returns an igraph object of connected
sequences instead of the amended input.data with a new
cluster-based variable.
Depending on the export parameters, one of the following:
An amended input.data object with a new metadata column containing cluster IDs (default).
An igraph object if exportGraph = TRUE.
A sparse dgCMatrix object if exportAdjMatrix = TRUE.
The clustering process is as follows:
The function retrieves the relevant chain data from the input object.
It calculates the edit distance between all sequences within each group
(or across the entire dataset if group.by is NULL).
An edge list is constructed, connecting sequences that meet the similarity
threshold. The threshold parameter behaves differently based on its value:
  threshold < 1 (e.g., 0.85): Interpreted as a normalized,
  edit-distance-based sequence similarity. A higher value means greater
  similarity is required. This is the default behavior.
  threshold >= 1 (e.g., 2): Interpreted as a maximum raw edit
  distance. A lower value means greater similarity is required.
An igraph graph is built from the edge list.
A clustering algorithm is run on the graph. The default
cluster.method = "components" simply identifies the connected
components (i.e., each cluster is a group of sequences connected by
edges). Other methods from igraph can be used.
The resulting cluster information is formatted and returned in the specified format.
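The steps above can be illustrated with a small, self-contained sketch. It is not the package's implementation: the toy sequences, the V gene calls, and the helper is_linked() are hypothetical, and the normalization used for threshold < 1 (one minus the edit distance divided by the length of the longer sequence) is an assumption made for illustration only. It uses utils::adist() from base R for edit distances and the igraph package for the graph and its connected components.

```r
# Illustrative sketch only; not the scRepertoire implementation.
# Toy CDR3 amino acid sequences with hypothetical V gene calls
seqs <- c("CASSLGQGAETQYF", "CASSLGQGAEKQYF", "CASSPGTGELFF",
          "CASSPGTGEQFF", "CAVRDSNYQLIW")
v.genes <- c("TRBV5-1", "TRBV5-1", "TRBV7-9", "TRBV7-9", "TRAV3")

# Pairwise Levenshtein (edit) distances from base R
dist.mat <- utils::adist(seqs)

# Hypothetical helper: decide which pairs are linked under the two
# threshold modes described above
is_linked <- function(dist.mat, seqs, threshold) {
  if (threshold < 1) {
    # ASSUMED normalization: 1 - distance / length of the longer sequence
    max.len <- outer(nchar(seqs), nchar(seqs), pmax)
    (1 - dist.mat / max.len) >= threshold
  } else {
    # Raw mode: at most `threshold` edits apart
    dist.mat <= threshold
  }
}

linked <- is_linked(dist.mat, seqs, threshold = 0.85)

# Require a shared V gene (analogous in spirit to use.V = TRUE)
linked <- linked & outer(v.genes, v.genes, "==")

# Edge list from the upper triangle, so each pair is listed once
idx <- which(linked & upper.tri(linked), arr.ind = TRUE)
edges <- data.frame(from = seqs[idx[, "row"]], to = seqs[idx[, "col"]])

# Build the graph, keeping unconnected sequences as isolated vertices
g <- igraph::graph_from_data_frame(edges, directed = FALSE,
                                   vertices = data.frame(name = seqs))

# Connected components define the clusters
membership <- igraph::components(g)$membership
clusters <- paste0("cluster.", membership)
names(clusters) <- names(membership)
print(clusters)
```

In this toy data the first two sequences differ by a single residue, as do the third and fourth, so each pair forms one cluster and the last sequence remains a singleton. Calling is_linked() with threshold = 2 switches to the raw edit distance mode.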
# Getting the combined contigs
combined <- combineTCR(contig_list,
                       samples = c("P17B", "P17L", "P18B", "P18L",
                                   "P19B", "P19L", "P20B", "P20L"))

# Add cluster information to the list
sub_combined <- clonalCluster(combined[c(1,2)],
                              chain = "TRA",
                              sequence = "aa",
                              threshold = 0.85)

# Export the graph object instead
graph_obj <- clonalCluster(combined[c(1,2)],
                           chain = "TRA",
                           exportGraph = TRUE)
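As a further sketch, the calls below exercise the raw edit distance mode and the adjacency matrix export using only the arguments documented above. It assumes the scRepertoire package is installed and that contig_list is its bundled example data, as in the examples above; the variable names strict_combined and adj_mat are arbitrary.

```r
library(scRepertoire)

# Rebuild the combined contigs so this block runs on its own
combined <- combineTCR(contig_list,
                       samples = c("P17B", "P17L", "P18B", "P18L",
                                   "P19B", "P19L", "P20B", "P20L"))

# Raw edit distance mode: link TRB sequences at most 2 edits apart,
# requiring shared V and J genes
strict_combined <- clonalCluster(combined[c(1,2)],
                                 chain = "TRB",
                                 sequence = "aa",
                                 threshold = 2,
                                 use.V = TRUE,
                                 use.J = TRUE)

# Export the sparse adjacency matrix instead
adj_mat <- clonalCluster(combined[c(1,2)],
                         chain = "TRB",
                         exportAdjMatrix = TRUE)
```

Because threshold = 2 is at least 1, it is read as a maximum raw edit distance rather than a normalized similarity.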