Cluster clones by sequence similarity — clonalCluster • scRepertoire

This function clusters TCRs or BCRs based on the edit distance of their CDR3 sequences. It can operate on either nucleotide (nt) or amino acid (aa) sequences and can optionally enforce that clones share the same V and/or J genes. The output can be the input object with an added metadata column for cluster IDs, a sparse adjacency matrix, or an igraph graph object representing the cluster network.

clonalCluster(
  input.data,
  chain = "TRB",
  sequence = "aa",
  threshold = 0.85,
  group.by = NULL,
  cluster.method = "components",
  cluster.prefix = "cluster.",
  use.V = TRUE,
  use.J = FALSE,
  exportAdjMatrix = FALSE,
  exportGraph = FALSE
)

Arguments

input.data: The product of combineTCR(), combineBCR() or combineExpression().
chain: The TCR/BCR chain to use. Use "both" to include both chains (e.g., TRA/TRB). Accepted values: TRA, TRB, TRG, TRD, IGH, IGL (for both light chains), both.
sequence: Clustering based on either aa or nt sequences.
threshold: The similarity threshold. If < 1, treated as normalized similarity (higher is stricter). If >= 1, treated as raw edit distance (lower is stricter).
group.by: A column header in the metadata or list by which to group the analysis (e.g., "sample", "treatment"). If NULL, clusters are calculated across all sequences.
cluster.method: The clustering algorithm to use. Defaults to "components", which finds connected subgraphs.
cluster.prefix: A character prefix to add to the cluster names (e.g., "cluster.").
use.V: If TRUE, sequences must share the same V gene to be clustered together.
use.J: If TRUE, sequences must share the same J gene to be clustered together.
exportAdjMatrix: If TRUE, the function returns a sparse adjacency matrix (dgCMatrix) of the network.
exportGraph: If TRUE, the function returns an igraph object of connected sequences. If FALSE, it returns the amended input.data with a new cluster-based variable.
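
To make the two threshold modes concrete, here is a minimal base-R sketch. The helper is_linked() is hypothetical and not part of scRepertoire, and normalizing by the longer sequence's length is an assumption; the package's exact normalization may differ.

```r
# Hypothetical helper (not part of scRepertoire): would two sequences be
# linked under the given threshold?
# ASSUMPTION: similarity is normalized by the length of the longer sequence.
is_linked <- function(a, b, threshold) {
  d <- drop(utils::adist(a, b))   # raw Levenshtein edit distance
  if (threshold >= 1) {
    # Raw mode: allow at most `threshold` edits (lower is stricter)
    d <= threshold
  } else {
    # Normalized mode: require at least `threshold` similarity (higher is stricter)
    sim <- 1 - d / max(nchar(a), nchar(b))
    sim >= threshold
  }
}

s1 <- "CASSLGQGYEQYF"
s2 <- "CASSLGQGYEQFF"    # one substitution relative to s1
s3 <- "CASSPDRGNTEAFF"   # many edits away from s1

is_linked(s1, s2, threshold = 0.85)  # normalized mode
is_linked(s1, s2, threshold = 1)     # raw mode, at most one edit
is_linked(s1, s3, threshold = 0.85)  # too dissimilar to link
```

Under this assumed normalization, a 13-residue CDR3 with one substitution has a similarity of about 0.92, so it passes the default threshold of 0.85.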

Value

Depending on the export parameters, one of the following:

An amended input.data object with a new metadata column containing cluster IDs (default).
An igraph object if exportGraph = TRUE.
A sparse dgCMatrix object if exportAdjMatrix = TRUE.
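
As a rough illustration of what a sparse adjacency matrix of a cluster network looks like, the sketch below builds one from a toy edge list using the Matrix package. The sequences and edges are made up, and storing a 1 for each edge is an assumption; the matrix returned by clonalCluster() may hold different values.

```r
library(Matrix)

# Made-up sequences: 1-2 and 2-3 are neighbors, sequence 4 has none.
seqs <- c("CASSLGQGYEQYF", "CASSLGQGYEQFF", "CASSLGQGYEHFF", "CASSPDRGNTEAFF")
from <- c(1, 2)
to   <- c(2, 3)

# Store each undirected edge in both directions so the matrix is symmetric.
# ASSUMPTION: an entry of 1 marks an edge.
adj <- sparseMatrix(
  i = c(from, to),
  j = c(to, from),
  x = 1,
  dims = c(length(seqs), length(seqs)),
  dimnames = list(seqs, seqs)
)

class(adj)     # the sparse column-compressed class
rowSums(adj)   # number of neighbors per sequence
```

Only the nonzero entries are stored, which keeps the object small when most sequence pairs are not connected.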

Details

The clustering process is as follows:

1. The function retrieves the relevant chain data from the input object.
2. It calculates the edit distance between all sequences within each group (or across the entire dataset if group.by is NULL).
3. An edge list is constructed, connecting sequences that meet the similarity threshold. The threshold parameter behaves differently based on its value:
   - threshold < 1 (e.g., 0.85): Interpreted as a normalized sequence similarity. A higher value means greater similarity is required. This is the default behavior.
   - threshold >= 1 (e.g., 2): Interpreted as a maximum raw edit distance. A lower value means greater similarity is required.
4. An igraph graph is built from the edge list.
5. A clustering algorithm is run on the graph. The default, cluster.method = "components", identifies the connected components (i.e., each cluster is a group of sequences connected by edges). Other methods from igraph can be used.
6. The resulting cluster information is formatted and returned in the specified format.
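
The process above can be sketched on toy data in base R. This is an illustration under stated assumptions, not the package's implementation: the column names are invented, the connected-component search is a simple label propagation standing in for igraph, and labeling unclustered sequences as NA is an assumption.

```r
# Toy data; column names are illustrative, not the package's internals.
clones <- data.frame(
  cdr3 = c("CASSLGQGYEQYF", "CASSLGQGYEQFF", "CASSLGQGYEQFF",
           "CASSPDRGNTEAFF", "CASSPDRGNTEAFY"),
  v    = c("TRBV5-1", "TRBV5-1", "TRBV7-2", "TRBV7-2", "TRBV7-2"),
  stringsAsFactors = FALSE
)
max_dist <- 1  # raw edit distance mode (threshold >= 1)

# Pairwise edit distances, then an edge wherever the distance is within
# the threshold and (as with use.V = TRUE) the V genes match.
d      <- utils::adist(clones$cdr3)
same_v <- outer(clones$v, clones$v, "==")
adj    <- d <= max_dist & same_v
edges  <- which(adj & upper.tri(adj), arr.ind = TRUE)

# Connected components by label propagation: repeatedly give both ends of
# every edge the smaller of their two labels until nothing changes.
label <- seq_len(nrow(clones))
repeat {
  changed <- FALSE
  for (k in seq_len(nrow(edges))) {
    i <- edges[k, 1]
    j <- edges[k, 2]
    m <- min(label[i], label[j])
    if (label[i] != m || label[j] != m) {
      label[c(i, j)] <- m
      changed <- TRUE
    }
  }
  if (!changed) break
}

# Format cluster IDs with a prefix.
# ASSUMPTION: sequences with no neighbors are left as NA.
size <- as.vector(table(label)[as.character(label)])
id   <- match(label, unique(label[size > 1]))
clones$cluster <- ifelse(is.na(id), NA_character_, paste0("cluster.", id))
clones
```

Rows 2 and 3 carry identical CDR3 sequences but different V genes, so the V-gene constraint keeps them apart even though their edit distance is zero.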

Examples

# Load the package (provides the example contig_list data)
library(scRepertoire)

# Getting the combined contigs
combined <- combineTCR(contig_list,
                       samples = c("P17B", "P17L", "P18B", "P18L",
                                   "P19B", "P19L", "P20B", "P20L"))

# Add cluster information to the list
sub_combined <- clonalCluster(combined[c(1,2)],
                              chain = "TRA",
                              sequence = "aa",
                              threshold = 0.85)

# Export the graph object instead
graph_obj <- clonalCluster(combined[c(1,2)],
                           chain = "TRA",
                           exportGraph = TRUE)