Getting Started

immApex is meant to serve as an API for supervised and unsupervised learning tasks based on immune receptor sequencing. These functions extract or generate amino acid or nucleotide sequences and prepare them for downstream machine learning. immApex is the underlying structure for the BCR models in Ibex and TCR models in Trex. It should be noted that the tools here are created for immune receptor sequences; they will work more generally for nucleotide or amino acid sequences. The package itself supports AIRR, Adaptive, and 10x formats and interacts with the scRepertoire R package.

More information is available at the immApex GitHub Repo.

Acquiring and Preparing Repertoire Data

getIMGT

Depending on the sequencing technology and the version, we might want to expand the length of our sequence embedding approach. The first step in the process is pulling the reference sequences from the ImMunoGeneTics (IMGT) system using getIMGT(). More information for IMGT can be found at imgt.org. Data from IMGT is under a CC BY-NC-ND 4.0 license. Please be aware that attribution is required for usage and should not be used to create commercial or derivative work.

Parameters for getIMGT()

  • species One or two word designation of species. Currently supporting: “human”, “mouse”, “rat”, “rabbit”, “rhesus monkey”, “sheep”, “pig”, “platypus”, “alpaca”, “dog”, “chicken”, and “ferret”
  • chain Sequence chain to access
  • frame Designation for “all”, “inframe” or “inframe+gap”
  • region Sequence gene loci to access
  • sequence.type Type of sequence - “aa” for amino acid or “nt” for nucleotide

Here, we will use the getIMGT() function to get the amino acid sequences for the TRBV region to get all the sequences by V gene allele.

TRBV_aa <- getIMGT(species = "human",
                   chain = "TRB",
                   frame = "inframe",
                   region = "v",
                   sequence.type = "aa") 

TRBV_aa[[1]][1]
## $`TRBV1*01`
## [1] "TRBVHPVREGIONAADTGITQTPKYLVTAMGSKRTMKREHLGHDSMYWYRQKAKKSLEFMFYYNCKEFIENKTVPNHFTPECPDSSRLYLHVVALQQEDSAAYLCTSSQ"

formatGenes

Immune receptor nomenclature can be highly variable across sequencing platforms. When preparing data for models, we can use formatGenes() to universalize the gene formats into IMGT nomenclature.

Parameters for formatGenes()

  • input.data Data frame of sequencing data or scRepertoire outputs
  • region Sequence gene loci to access - ‘v’, ‘d’, ‘j’, ‘c’ or a combination using c(‘v’, ‘d’, ‘j’)
  • technology The sequencing technology employed - ‘TenX’, “Adaptive’, or ‘AIRR’
  • species One or two word designation of species. Currently supporting: “human”, “mouse”, “rat”, “rabbit”, “rhesus monkey”, “sheep”, “pig”, “platypus”, “alpaca”, “dog”, “chicken”, and “ferret”
  • simplify.format If applicable, remove the allelic designation (TRUE) or retain all information (FALSE)

Here, we will use the built-in example from Adaptive Biotechnologies and reformat and simplify the v region. formatGenes() will add 2 columns to the end of the data frame per region selected - 1) v_IMGT will be the formatted gene calls and 2) v_IMGT.check is a binary for if the formatted region appears in the IMGT database. In the example below, “TRBV2-1” is not recognized as a designation within IMGT.

data("immapex_example.data")
Adaptive_example <- formatGenes(immapex_example.data[["Adaptive"]],
                                region = "v",
                                technology = "Adaptive", 
                                simplify.format = TRUE) 

head(Adaptive_example[,c("aminoAcid","vGeneName", "v_IMGT", "v_IMGT.check")])
##              aminoAcid  vGeneName  v_IMGT v_IMGT.check
## 4490  CASSQDGPSGIETQYF TCRBV04-02 TRBV4-2            1
## 18266    CASSEGSNQPQHF TCRBV02-01 TRBV2-1            0
## 22061   CSASAGDMVTEAFF       <NA>    <NA>            0
## 22174  CASSQDPGETDTQYF       <NA>    <NA>            0
## 19117     CATSAWTGELFF       <NA>    <NA>            0
## 2659     CATSVPGQETQYF       <NA>    <NA>            0

inferCDR

We can now use inferCDR() to add additional sequence elements to our example data using the outputs of formatGenes() and getIMGT(). Here, we will use the function to isolate the complementarity-determining regions (CDR) 1 and 2. If the gene nomenclature does not match the IMGT the result will be NA for the given sequences. Likewise, if the IMGT nomenclature has been simplified, the first allelic match will be used for sequence extraction.

Parameters for inferCDR

  • input.data Data frame of sequencing data or output from formatGenes().
  • reference IMGT sequences from getIMGT()
  • technology The sequencing technology employed - ‘TenX’, “Adaptive’, or ‘AIRR’,
  • sequence.type Type of sequence - “aa” for amino acid or “nt” for nucleotide
  • sequences The specific regions of the CDR loop to get from the data.
Adaptive_example <- inferCDR(Adaptive_example,
                             chain = "TRB", 
                             reference = TRBV_aa,
                             technology = "Adaptive", 
                             sequence.type = "aa",
                             sequences = c("CDR1", "CDR2"))

Adaptive_example[200:210,c("CDR1_IMGT", "CDR2_IMGT")]
##          CDR1_IMGT  CDR2_IMGT
## 27450         <NA>       <NA>
## 14624         <NA>       <NA>
## 10452         <NA>       <NA>
## 31207 IIEKRQSVAFWC QGPKLLIQFQ
## 23216 IKTRGQQVTLSC QGLQFLFEYF
## 3356  VMGMTNKKSLKC KPPELMFVYS
## 25030         <NA>       <NA>
## 1889  VMGMTNKKSLKC KPLELMFVYN
## 31150 VTEKGKDVELRC QGLEFLIYFQ
## 20759         <NA>       <NA>
## 13865 VMGMTNKKSLKC KPPELMFVYS

Generating and Augmenting Sequence Sets

generateSequences

Generating synthetic sequences is a quick way to start testing the model code. generateSequences() can also generate realistic noise for generative adversarial networks.

Parameters for generateSequences()

  • prefix.motif Add a defined sequence to the start of the generated sequences.
  • suffix.motif Add a defined sequence to the end of the generated sequences
  • number.of.sequences Number of sequences to generate
  • min.length Minimum length of the final sequence (will be adjusted if incongruent with prefix.motif/suffix.motif)
  • max.length Maximum length of the final sequence
  • sequence.dictionary The letters to use in sequence generation (default are all amino acids)
sequences <- generateSequences(prefix.motif = "CAS",
                               suffix.motif = "YF",
                               number.of.sequences = 1000,
                               min.length = 8,
                               max.length = 16)
sequences <- unique(sequences)
head(sequences)
## [1] "CASHMCYF"         "CASKDINEQTYF"     "CASVIVYF"         "CASKVNWNPKPNVQYF"
## [5] "CASIQLAINYF"      "CASMCISYF"

If we want to generate nucleotide sequences instead of amino acids, we must to change the sequence.dictionary.

nucleotide.sequences <- generateSequences(number.of.sequences = 1000,
                                          min.length = 8,
                                          max.length = 16, 
                                          sequence.dictionary = c("A", "C", "T", "G"))
head(nucleotide.sequences)
## [1] "TCGTGAAC"         "ATGCGAAC"         "CCGTGCCACGGGTCAA" "AAGAACGGATAGATT" 
## [5] "CTAGCCTGGCG"      "ACGCCGACAT"

mutateSequences

A common approach is to mutate sequences randomly or at specific intervals. This can be particularly helpful if we have fewer sequences or want to test a model for accuracy given new, altered sequences. mutateSequences() allows us to tune the type of mutation, where along the sequences to introduce the mutation and the overall number of mutations.

Parameters for mutateSequences()

  • input.sequences The amino acid or nucleotide sequences to use
  • number.of.sequences The number of mutated sequences to return per input.sequence
  • mutation.rate The rate of mutations introduced into sequences
  • position.start The starting position to mutate along the sequence. Default NULL will start the random mutations at position 1.
  • position.end The ending position to mutate along the sequence. Default NULL will end the random mutations at the last position.
  • sequence.dictionary The letters to use in sequence mutation (default are all amino acids)
mutated.sequences <- mutateSequences(sequences, 
                                     number.of.sequences = 1,
                                     position.start = 3,                                  
                                     position.end = 8)
head(sequences)
## [1] "CASHMCYF"         "CASKDINEQTYF"     "CASVIVYF"         "CASKVNWNPKPNVQYF"
## [5] "CASIQLAINYF"      "CASMCISYF"
head(mutated.sequences)
##           CASHMCYF       CASKDINEQTYF           CASVIVYF   CASKVNWNPKPNVQYF 
##         "CASNMCYF"     "CASKDFNEQTYF"         "CASVIVYT" "CASSVNWNPKPNVQYF" 
##        CASIQLAINYF          CASMCISYF 
##      "CASIQMAINYF"        "CASACISYF"

Feature Engineering from Repertoires

Beyond encoding single sequences, immApex provides functions to calculate summary statistics and features across a collection of sequences, such as a clonal repertoire.

calculateFrequency

This function calculates the relative frequency of each residue at each position across a set of sequences.

Parameters for calculateFrequency()

  • input.sequences: A character vector of sequences.
  • max.length: The length to align sequences to by padding/trimming.
  • tidy: If TRUE, returns a long-format data frame suitable for ggplot2.
freq.matrix <- calculateFrequency(sequences, 
                                  max.length = 20)
head(freq.matrix[, 1:10])
##   Pos.1 Pos.2 Pos.3      Pos.4      Pos.5      Pos.6      Pos.7      Pos.8
## A     0     1     0 0.05210421 0.04308617 0.05210421 0.03807615 0.03106212
## R     0     0     0 0.04208417 0.06412826 0.06012024 0.03707415 0.03306613
## N     0     0     0 0.03907816 0.04208417 0.05911824 0.04809619 0.03306613
## D     0     0     0 0.04809619 0.05210421 0.05611222 0.04208417 0.03306613
## C     1     0     0 0.05010020 0.05410822 0.04609218 0.04008016 0.03807615
## Q     0     0     0 0.04108216 0.05811623 0.05811623 0.04308617 0.02905812
##        Pos.9     Pos.10
## A 0.03306613 0.02605210
## R 0.03006012 0.02805611
## N 0.03206413 0.02404810
## D 0.03807615 0.02204409
## C 0.02905812 0.02805611
## Q 0.02304609 0.01903808

calculateEntropy

This function measures the diversity (or entropy) at each position in a set of aligned sequences. It can use several common diversity metrics.

Parameters for calculateEntropy()

  • method: The diversity metric to use: “shannon”, “inv.simpson”, “gini.simpson”, “norm.entropy”, “pielou”, or Hill numbers (“hill0”, “hill1”, “hill2”).
shannon.entropy <- calculateEntropy(sequences, 
                                    method = "shannon")
head(shannon.entropy)
##     Pos1     Pos2     Pos3     Pos4     Pos5     Pos6 
## 0.000000 0.000000 0.000000 2.986591 2.984538 2.985312

calculateProperty

This function computes position-wise summary statistics for amino acid properties. For each position, it calculates a metric (like the mean) of a specific physicochemical property for all residues found at that position.

Parameters for calculateProperty()

  • property.set: The amino acid property scale to use (e.g., “atchleyFactors”).
  • summary.fun: The summary statistic to apply (“mean”, “median”, “sum” etc.).
# Calculate the mean of Atchley factors at each position
atchley.profile <- calculateProperty(sequences, 
                                     property.set = "atchleyFactors", 
                                     summary.fun = "mean")
head(atchley.profile[, 1:6])
##      position
## scale  Pos.1  Pos.2  Pos.3       Pos.4       Pos.5       Pos.6
##   AF1 -1.343 -0.591 -0.228 -0.04847395  0.06555010  0.03563226
##   AF2  0.465 -1.302  1.399 -0.04388978 -0.05410321 -0.04387976
##   AF3 -0.862 -0.733 -4.760 -0.02414930  0.06490982 -0.09043687
##   AF4 -1.020  1.570  0.670  0.02724349 -0.02970441 -0.02113026
##   AF5 -0.255 -0.146 -2.647 -0.02846994  0.06192184 -0.03162625

calculateGeneUsage

This function quantifies the usage of gene loci (e.g., V and J genes) within a repertoire.

Parameters for calculateGeneUsage() * loci: A character vector of one or two column names corresponding to the gene loci. * summary: The output format: “proportion” (default), “count”, or “percent”.

# Using the built-in AIRR data
data_airr <- immapex_example.data[["AIRR"]]

# Calculate paired V-J gene usage as percentages
vj_usage <- calculateGeneUsage(data_airr, 
                               loci = c("v_call", "j_call"),
                               summary = "percent")
vj_usage[1:5, 1:5]
##             y
## x             TRAJ10*01  TRAJ11*01  TRAJ12*01  TRAJ13*01  TRAJ13*02
##   TRAV1-1*01 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
##   TRAV1-1*02 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
##   TRAV1-2*01 0.00000000 0.01666667 0.00000000 0.00000000 0.00000000
##   TRAV1-2*03 0.01666667 0.00000000 0.03333333 0.00000000 0.01666667
##   TRAV10*01  0.08333333 0.00000000 0.00000000 0.00000000 0.00000000

calculateMotif

This function rapidly finds and counts contiguous (or gapped) amino acid or nucleotide motifs of specified lengths across a set of sequences.

Parameters for calculateMotif()

  • motif.lengths: An integer vector of motif sizes to search for.
  • min.depth: The minimum number of times a motif must appear to be included in the output.
  • discontinuous: If TRUE, also finds motifs with a single gap (e.g., “C.S”).
motif.counts <- calculateMotif(sequences, 
                               motif.lengths = 3, 
                               min.depth = 5)
head(motif.counts)
##   motif frequency
## 1   SYF        52
## 2   LQY         5
## 3   SAT         6
## 4   STH         5
## 5   SQI         5
## 6   HNY         6

probabilityMatrix

This function calculates a positional probability matrix (PPM) for a group of sequences. This can be used to represent the sequence profile of an antigen-specific repertoire or to generate a sequence logo.

ppm.matrix <- probabilityMatrix(sequences)
head(ppm.matrix)
##   Pos.1 Pos.2 Pos.3      Pos.4      Pos.5      Pos.6      Pos.7      Pos.8
## A     0     1     0 0.05210421 0.04308617 0.05210421 0.03807615 0.03106212
## R     0     0     0 0.04208417 0.06412826 0.06012024 0.03707415 0.03306613
## N     0     0     0 0.03907816 0.04208417 0.05911824 0.04809619 0.03306613
## D     0     0     0 0.04809619 0.05210421 0.05611222 0.04208417 0.03306613
## C     1     0     0 0.05010020 0.05410822 0.04609218 0.04008016 0.03807615
## Q     0     0     0 0.04108216 0.05811623 0.05811623 0.04308617 0.02905812
##        Pos.9     Pos.10    Pos.11     Pos.12     Pos.13     Pos.14 Pos.15
## A 0.03780069 0.03439153 0.0312500 0.02592593 0.02347418 0.01242236      0
## R 0.03436426 0.03703704 0.0312500 0.03333333 0.01643192 0.01552795      0
## N 0.03665521 0.03174603 0.0359375 0.03148148 0.03521127 0.00931677      0
## D 0.04352806 0.02910053 0.0375000 0.04259259 0.01877934 0.00621118      0
## C 0.03321879 0.03703704 0.0328125 0.02407407 0.03755869 0.02795031      0
## Q 0.02634593 0.02513228 0.0203125 0.03888889 0.02816901 0.01242236      0
##   Pos.16
## A      0
## R      0
## N      0
## D      0
## C      0
## Q      0

The PPM can also be converted to a positional weight matrix (PWM) using a log-likelihood ratio.

set.seed(42)
# Generate a sample background frequency
back.freq <- sample(1:1000, 20)
names(back.freq) <- amino.acids
back.freq <- back.freq/sum(back.freq)

pwm.matrix <- probabilityMatrix(sequences,
                                max.length = 20,
                                convert.PWM = TRUE,
                                background.frequencies = back.freq)
head(pwm.matrix)
##       Pos.1     Pos.2     Pos.3       Pos.4      Pos.5      Pos.6      Pos.7
## A -6.153093  3.811248 -6.153093 -0.42517268 -0.6936615 -0.4251727 -0.8676909
## R -6.982686 -6.982686 -6.982686 -1.55642111 -0.9603181 -1.0519485 -1.7347584
## N -5.347666 -5.347666 -5.347666 -0.02573756  0.0785991  0.5592249  0.2670442
## D -4.278624 -4.278624 -4.278624  1.33608583  1.4492964  1.5542660  1.1476407
## C  6.733651 -3.230690 -3.230690  2.44173581  2.5506702  2.3238993  2.1268625
## Q -4.854126 -4.854126 -4.854126  0.53819124  1.0285169  1.0285169  0.6053054
##         Pos.8      Pos.9     Pos.10     Pos.11     Pos.12      Pos.13
## A -1.15309313 -0.8766248 -1.0065966 -1.1355761 -1.3839637 -1.50303956
## R -1.89522302 -1.8394841 -1.7330959 -1.9651688 -1.8725195 -2.79206392
## N -0.26020282 -0.1142661 -0.3122005 -0.1375035 -0.3155018 -0.15704371
## D  0.80883883  1.1957837  0.6365469  0.9904318  1.1685773  0.08192294
## C  2.05471268  1.8652065  2.0189005  1.8539417  1.4389042  2.04739525
## Q  0.05276441 -0.0801582 -0.1405891 -0.4215716  0.4675443  0.03693548
##       Pos.14    Pos.15     Pos.16     Pos.17     Pos.18     Pos.19     Pos.20
## A -2.2574957 -4.080435 -3.1615713 -0.4834994 -0.4834994 -0.4834994 -0.4834994
## R -2.8240540 -4.910027 -3.9911640 -1.3130921 -1.3130921 -1.3130921 -1.3130921
## N -1.7739963 -3.275007 -2.3561438  0.3219281  0.3219281  0.3219281  0.3219281
## D -1.1199922 -2.205965 -1.2871022  1.3909697  1.3909697  1.3909697  1.3909697
## C  1.6649079 -1.158031 -0.2391677  2.4389042  2.4389042  2.4389042  2.4389042
## Q -0.9585288 -2.781468 -1.8626043  0.8154676  0.8154676  0.8154676  0.8154676

adjacencyMatrix and buildNetwork

These functions help analyze relationships between residues or whole sequences.

adjacencyMatrix() summarizes transitions between adjacent residues, creating a matrix of co-occurrence counts.

adj.matrix <- adjacencyMatrix(sequences, 
                              normalize = FALSE)
adj.matrix[1:10, 1:10]
##      A  R  N  D    C  Q  E  G  H  I
## A   20 28 38 26 1023 28 24 28 32 24
## R   28 22 41 33   27 34 33 27 30 34
## N   38 41 16 14   21 31 23 31 33 29
## D   26 33 14 38   35 18 30 24 32 35
## C 1023 27 21 35   24 28 30 30 27 34
## Q   28 34 31 18   28 20 28 27 20 31
## E   24 33 23 30   30 28 28 28 23 32
## G   28 27 31 24   30 27 28 36 25 35
## H   32 30 33 32   27 20 23 25 28 28
## I   24 34 29 35   34 31 32 35 28 42

buildNetwork() constructs a similarity network where nodes are sequences and edges connect sequences with an edit distance below a given threshold.

# Building an edge list
g1 <- buildNetwork(data.frame(sequences = sequences),
                   seq_col = "sequences",
                   threshold = 2)

# Forming network
graph <- graph_from_edgelist(as.matrix(g1[,1:2]))
E(graph)$weight <- g1[,3]

# Remove isolated nodes for clearer visualization
graph <- delete_vertices(graph, which(degree(graph) == 0))

# Convert to tidygraph for use with ggraph
g_tidy <- as_tbl_graph(graph)

# Plot the network
ggraph(g_tidy, layout = "fr") + 
  geom_edge_link(aes(width = weight), color = "black", alpha = 0.5) + 
  geom_node_point(aes(size = degree(g_tidy, mode = "all")),
                  fill = "steelblue", color= "black", shape = 21) +
  theme_void() + 
  scale_edge_width(range = c(0.1, 0.5)) + 
  labs(size = "Degree") +
  theme(legend.position = "right")

Encoding Sequences for Model Input

sequenceEncoder

The primary tool for converting biological sequences into a numerical format suitable for machine learning is the sequenceEncoder() function. It is a versatile wrapper that can generate three different types of numerical representations, controlled by the mode argument. Understanding this function is key to using the package effectively.

Key Arguments:

  • input.sequences: The character vector of your amino acid or nucleotide sequences.
  • mode: The most important argument. A string specifying the encoding type you need: "onehot", "property", or "geometric".
  • verbose: A logical flag (TRUE or FALSE) to control whether progress messages are printed.

onehotEncoder

One-hot encoding is a common method that transforms each amino acid at each position into a binary vector. This is useful when the specific identity and position of each residue are important. This functionality is accessed by setting mode = "onehot" in sequenceEncoder() or by using the onehotEncoder() alias.

Mode-Specific Arguments:

  • max.length: Additional length to pad, NULL will pad sequences to the max length of input.sequences
  • padding.symbol: The single character used to pad sequences shorter than max.length.
  • sequence.dictionary: The letters to use in encoding (default are all amino acids)
# Generate one-hot encoding using the main function
enc_onehot <- sequenceEncoder(sequences, mode = "onehot", verbose = FALSE)

# You can achieve the same result with the alias
enc_onehot_alias <- onehotEncoder(sequences, verbose = FALSE)

# View the first few columns of the flattened matrix output
head(enc_onehot$flattened[, 1:10])
##      A_1 R_1 N_1 D_1 C_1 Q_1 E_1 G_1 H_1 I_1
## [1,]   0   0   0   0   1   0   0   0   0   0
## [2,]   0   0   0   0   1   0   0   0   0   0
## [3,]   0   0   0   0   1   0   0   0   0   0
## [4,]   0   0   0   0   1   0   0   0   0   0
## [5,]   0   0   0   0   1   0   0   0   0   0
## [6,]   0   0   0   0   1   0   0   0   0   0

propertyEncoder

Instead of a binary vector, this method represents each amino acid using a set of continuous numerical values based on physicochemical properties (e.g., kideraFactors, FASGAI, etc). This can capture the biochemical similarity between amino acids. Access this method by setting mode = "property" or using the propertyEncoder() alias.

Mode-Specific Arguments:

  • property.set: A character string specifying which set of pre-defined amino acid properties to use.
  • property.matrix: A custom numeric matrix of property values you provide yourself.

Available property.set options include:

# Generate property-based encoding using FASGAI factors
enc_prop <- sequenceEncoder(sequences,
                            mode = "property",
                            property.set = "FASGAI",
                            verbose = FALSE)

# The propertyEncoder() alias is a convenient shortcut
enc_prop_alias <- propertyEncoder(sequences, property.set = "FASGAI", verbose = FALSE)

# View the first few columns of the flattened property matrix
head(enc_prop$flattened[, 1:6])
##       F1_1  F2_1   F3_1  F4_1   F5_1  F6_1
## [1,] 0.997 0.021 -1.419 -2.08 -0.799 0.502
## [2,] 0.997 0.021 -1.419 -2.08 -0.799 0.502
## [3,] 0.997 0.021 -1.419 -2.08 -0.799 0.502
## [4,] 0.997 0.021 -1.419 -2.08 -0.799 0.502
## [5,] 0.997 0.021 -1.419 -2.08 -0.799 0.502
## [6,] 0.997 0.021 -1.419 -2.08 -0.799 0.502

geometricEncoder

This approach creates a single, fixed-length vector (an “embedding”) for an entire sequence. It works by averaging the vectors for each amino acid from a substitution matrix (like BLOSUM62) and then applying a geometric rotation. This is useful for tasks where a summary of the entire sequence is needed, similar to approaches like GIANA. Use this with mode = "geometric" or the geometricEncoder() alias.

Parameters for geometricEncoder()

  • method: Select the following substitution matrices: “BLOSUM45”, “BLOSUM50”, “BLOSUM62”, “BLOSUM80”, “BLOSUM100”, “PAM30”, “PAM40”, “PAM70”, “PAM120”, or “PAM250”
  • theta: The angle in which to create the rotation matrix
# Generate a geometric embedding for each sequence
enc_geo <- sequenceEncoder(sequences,
                           mode = "geometric",
                           method = "BLOSUM62",
                           theta = pi / 3,
                           verbose = FALSE)

# The alias provides a direct path to this functionality
enc_geo_alias <- geometricEncoder(sequences, 
                                  method = "BLOSUM62", 
                                  theta = pi / 3, 
                                  verbose = FALSE)

# The output is a single summary matrix where each row is a sequence
head(enc_geo$summary)
##        [,1]        [,2]       [,3]        [,4]       [,5]       [,6]      [,7]
## 1 -1.640544 -0.65849365 -2.7610572  0.28229128 -0.6282849 -1.6617786 -2.840304
## 2 -1.229861 -0.03648518 -0.7135148  0.06917725 -0.8720085  0.8436963 -1.754380
## 3 -2.056810 -1.18750000 -3.2900635  0.69855716 -1.5612976 -1.0457532 -3.119310
## 4 -1.380529  0.14114564 -1.3470349 -0.41686706 -1.2371393  1.0177881 -1.930168
## 5 -1.462587 -0.73945224 -2.3684144  0.10221415 -1.1720277 -0.1518066 -2.628962
## 6 -1.732051 -1.00000000 -2.8133898  0.42848961 -0.7915951 -1.7400282 -2.883831
##          [,8]       [,9]       [,10]     [,11]      [,12]       [,13]
## 1  0.66955081 -0.9910254 -0.28349365 -1.953044 -0.1172278 -0.12500000
## 2 -0.62799153 -1.6852027  0.08552329 -1.255181  1.0073714 -1.51036297
## 3  0.65280398 -0.3337341  1.82804446 -1.794551 -0.8917468  0.04575318
## 4 -0.03185118 -1.9072913  0.05352540 -1.191386  1.1885413 -1.76778811
## 5  0.37167793 -0.8392773  1.09003464 -1.474767 -0.5365385 -0.53001155
## 6  0.77272030 -1.0664529  1.18048396 -1.761823 -0.5039887 -0.09622504
##          [,14]      [,15]     [,16]     [,17]      [,18]      [,19]       [,20]
## 1  0.216506351 -1.5122595 1.8693103 -1.582532  0.2410254 -0.6785254 -0.82475953
## 2  0.282692070 -0.9166667 1.5877132 -2.043055 -0.7946582 -1.3853630  0.06618572
## 3  0.170753175 -1.8370191 1.6818103 -1.828044 -0.3337341  0.7120191  0.26674682
## 4  0.061898816 -0.7935095 0.6243988 -1.859294 -0.2796075 -1.4346552  0.10989564
## 5  0.008916019 -1.4512820 1.7864214 -2.095687 -0.3701633 -0.4180069  0.17855469
## 6 -0.055555556 -1.1666667 2.0207259 -1.680484 -0.2004275 -0.4960113 -0.02977213

tokenizeSequences

Another approach to transforming a sequence into numerical values is tokenizing it into numbers. This is a common approach for recurrent neural networks where one letter corresponds to a single integer. In addition, we can add start and stop tokens to our original sequences to differentiate between the beginning and end of the sequences.

Parameters for tokenizeSequences()

  • add.startstop Add start and stop tokens to the sequence
  • start.token The character to use for the start token
  • stop.token The character to use for the stop token
  • max.length Additional length to pad, NULL will pad sequences to the max length of input.sequences
  • convert.to.matrix Return a matrix (TRUE) or a vector (FALSE)
token.matrix <- tokenizeSequences(input.sequences =  c(sequences, mutated.sequences), 
                                  add.startstop = TRUE,
                                  start.token = "!",
                                  stop.token = "^", 
                                  convert.to.matrix = TRUE)
head(token.matrix[,1:18])
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,]    1    6    2   17   10   14    6   20   15    22    23    23    23    23
## [2,]    1    6    2   17   13    5   11    4    8     7    18    20    15    22
## [3,]    1    6    2   17   21   11   21   20   15    22    23    23    23    23
## [4,]    1    6    2   17   13   21    4   19    4    16    13    16     4    21
## [5,]    1    6    2   17   11    7   12    2   11     4    20    15    22    23
## [6,]    1    6    2   17   14    6   11   17   20    15    22    23    23    23
##      [,15] [,16] [,17] [,18]
## [1,]    23    23    23    23
## [2,]    23    23    23    23
## [3,]    23    23    23    23
## [4,]     7    20    15    22
## [5,]    23    23    23    23
## [6,]    23    23    23    23

sequenceDecoder

We have a function called sequenceDecoder() that extracts sequences from one-hot or property-encoded matrices or arrays. This function can be applied to any generative approach to sequence generation.

Parameters for sequenceDecoder()

  • sequence.matrix The encoded sequences to decode in an array or matrix
  • mode The encoding mode used for decoding: "onehot" or "property"
  • property.set For mode = "property", a character vector of property names (e.g., "Atchley") that were used for the original encoding.
  • call.threshold The relative strictness of sequence calling with higher values being more stringent
property.matrix <- propertyEncoder(input.sequences =  c(sequences, mutated.sequences), 
                                   property.set = "FASGAI")

property.sequences <- sequenceDecoder(property.matrix[[1]],
                                      mode = "property",
                                      property.set  = "FASGAI",
                                      call.threshold = 1)
head(sequences)
## [1] "CASHMCYF"         "CASKDINEQTYF"     "CASVIVYF"         "CASKVNWNPKPNVQYF"
## [5] "CASIQLAINYF"      "CASMCISYF"
head(property.sequences)
##                                                                             
##         "CASHMCYF"     "CASKDINEQTYF"         "CASVIVYF" "CASKVNWNPKPNVQYF" 
##                                       
##      "CASIQLAINYF"        "CASMCISYF"

A similar approach can be applied when using matrices or arrays derived from one-hot encoding:

sequence.matrix <- onehotEncoder(input.sequences =  c(sequences, mutated.sequences))

OHE.sequences <- sequenceDecoder(sequence.matrix,
                                 mode= "onehot")

head(OHE.sequences)
## [1] "CASHMCYF"         "CASKDINEQTYF"     "CASVIVYF"         "CASKVNWNPKPNVQYF"
## [5] "CASIQLAINYF"      "CASMCISYF"

Training a Model

Example 1: Classifying Sequences with Random Forest

A common task is to classify sequences into groups, such as predicting whether an immune receptor binds to a specific antigen. Here, we’ll train a Random Forest model to distinguish between two classes of sequences based on their physicochemical properties

Step 1: Simulate Data and Engineer Features

First, we’ll simulate two distinct classes of sequences using generateSequences(). We’ll then use propertyEncoder() with the “atchleyFactors” to convert each sequence into a flattened numerical vector. Each element in the vector represents a specific physicochemical property at a specific position, transforming our variable-length sequences into a fixed-size feature matrix suitable for machine learning.

# Step 1a: Generate two distinct classes of sequences
class1.sequences <- generateSequences(prefix.motif = "CAS",
                                      min.length = 3,
                                      number.of.sequences = 500)

class2.sequences <- generateSequences(prefix.motif = "CSG",
                                      min.length = 3,
                                      number.of.sequences = 500)

# Combine sequences and create labels
all.sequences <- c(class1.sequences, class2.sequences)
labels <- as.factor(c(rep("Class1", 500), rep("Class2", 500)))

# Step 1b: Use propertyEncoder to create a feature matrix from Atchley factors
feature.matrix <- propertyEncoder(all.sequences, 
                                  property.set = "atchleyFactors",
                                  verbose = FALSE)

# Combine the flattened feature matrix and labels into the final data frame
training.data <- data.frame(feature.matrix$flattened, labels)

Step 2: Train and Evaluate the Model

Now we can train the Random Forest classifier. This model is robust, requires minimal tuning, and can provide insights into which features (in this case, which property at which position) are most important for distinguishing between the classes.

suppressMessages(library(randomForest))

# Train the Random Forest model
set.seed(42) # for reproducibility
rf.model <- randomForest(labels ~ ., 
                         data = training.data, 
                         ntree = 100, 
                         importance = TRUE) # Set importance=TRUE to calculate scores

# Print the confusion matrix to see model performance
print(rf.model)
## 
## Call:
##  randomForest(formula = labels ~ ., data = training.data, ntree = 100,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0%
## Confusion matrix:
##        Class1 Class2 class.error
## Class1    500      0           0
## Class2      0    500           0

A key advantage of Random Forest is the ability to easily extract feature importance. We can create a plot to see which positional Atchley factors were most influential in the model’s predictions.

# Extract feature importance data from the model
importance.data <- as.data.frame(importance(rf.model))
importance.data$Feature <- rownames(importance.data)

# Get the top 15 most important features
top_features <- importance.data[order(importance.data$MeanDecreaseGini, decreasing = TRUE), ][1:15,]

# Plot using ggplot2
ggplot(top_features, aes(x = MeanDecreaseGini, y = reorder(Feature, MeanDecreaseGini))) +
  geom_col(aes(fill = MeanDecreaseGini)) +
  scale_fill_viridis_c() +
  labs(title = "Top 15 Most Important Features",
       x = "Mean Decrease in Gini Impurity",
       y = "Feature (Property and Position)") +
  theme_classic() +
  theme(legend.position = "none")

The plot shows the features that the model found most useful for classification. The feature names correspond to specific Atchley factors at specific positions within the sequence - here residues 2 and 3 where the difference was encoded - demonstrating how propertyEncoder() allows the model to learn from the underlying biochemistry.

Example 2: Unsupervised Clustering with PCA and Geometric Encoding

Sometimes we don’t have labels, but we want to see if our sequences form natural clusters. We can use dimensionality reduction techniques like Principal Component Analysis (PCA) to visualize the relationships between sequences. The geometricEncoder() function is perfect for this, as it creates a rich, fixed-length numerical embedding for each sequence.

Step 1: Simulate and Encode a Mixed Population of Sequences

We’ll generate three distinct families of sequences. Then, we will use geometricEncoder() to transform them into a 20-dimensional numerical matrix.

# Generate three distinct groups of sequences
groupA <- generateSequences(prefix.motif = "CA", 
                            number.of.sequences = 100, 
                            min.length = 2)
groupB <- generateSequences(prefix.motif = "QR", 
                            number.of.sequences = 100, 
                            min.length = 2)
groupC <- generateSequences(prefix.motif = "YY", 
                            number.of.sequences = 100, 
                            min.length = 2)

# Combine them and create a vector of original group IDs for later visualization
mixed.sequences <- c(groupA, groupB, groupC)
original.groups <- as.factor(rep(c("Group A", "Group B", "Group C"), each = 100))

# Use geometricEncoder to create embeddings
geometric.embeddings <- geometricEncoder(mixed.sequences, 
                                         method = "BLOSUM62", 
                                         verbose = FALSE)

Step 2: Perform PCA and Visualize Clusters

With our geometric.embeddings matrix, we can now perform PCA and plot the results. If the geometric encoding captures the underlying differences in our sequence families, we should see distinct clusters in the plot of the first two principal components.

# Perform PCA on the embedding matrix
pca.result <- prcomp(geometric.embeddings$summary, center = TRUE, scale. = TRUE)

# Prepare data for plotting with ggplot2
pca.data <- data.frame(PC1 = pca.result$x[,1],
                       PC2 = pca.result$x[,2],
                       Group = original.groups)

# Plot PC1 vs PC2 using ggplot2 and viridis
ggplot(pca.data, aes(x = PC1, y = PC2, color = Group)) +
  geom_point(size = 3, alpha = 0.8) +
  scale_color_viridis(discrete = TRUE) +
  labs(title = "PCA of Geometric Sequence Embeddings",
       x = "Principal Component 1",
       y = "Principal Component 2") +
  theme_classic()

The resulting plot clearly shows three distinct clusters, demonstrating that the geometricEncoder() successfully captured the structural differences between our sequence families, allowing for their separation in an unsupervised manner.

Example 3: Identifying Sequence Communities with Network Analysis

Beyond analyzing individual sequences, we can explore the relationships between sequences by building a similarity network. A powerful machine learning application for these networks is community detection, an unsupervised clustering method that finds groups of densely connected nodes. In immunology, this can be used to identify clonal families or groups of sequences with shared features.

Step 1: Simulate Data with Inherent Structure

First, we will simulate a dataset containing three distinct “families” of sequences. Our goal is to see if the network analysis can blindly recover these groups without being given the labels.

set.seed(42)
# Generate three distinct groups of sequences to simulate clonal families
family1 <- unique(generateSequences(prefix.motif = "CASS", 
                             suffix.motif = "YF", 
                             min.length = 6,
                             number.of.sequences = 60))
family2 <- unique(generateSequences(prefix.motif = "CARS", 
                             suffix.motif = "GF",
                             min.length = 6,
                             number.of.sequences = 60))
family3 <- unique(generateSequences(prefix.motif = "CSVA", 
                             suffix.motif = "HF", 
                             min.length = 6,
                             number.of.sequences = 60))

# Combine into a single data frame
all_sequences_df <- data.frame(
  sequence = c(family1, family2, family3),
  original_family = c(rep("Family 1", 42), rep("Family 2", 40), rep("Family 3", 46))
)

Step 2: Build the Sequence Network

Next, we use buildNetwork() to create an edge list based on sequence similarity (edit distance). We then convert this list into a formal graph object using the igraph package, which is the standard for network analysis in R.

# Build the edge list: connect sequences with an edit distance of 3 or less
edge_list <- buildNetwork(all_sequences_df,
                          seq_col = "sequence",
                          threshold = 3)

# Replace numerical edge list with the sequences
edge_list_sequences <- as.matrix(data.frame(
  from = all_sequences_df$sequence[as.numeric(edge_list$from)],
  to = all_sequences_df$sequence[as.numeric(edge_list$to)],
  dist = edge_list$dist
))

# Create a graph object from the edge list and node data
# This graph now contains all our sequences as nodes
sequence_graph <- graph_from_data_frame(d = edge_list_sequences, 
                                        vertices = all_sequences_df, 
                                        directed = FALSE)
E(sequence_graph)$weight <- edge_list_sequences[,3]

Step 3: Perform Community Detection

Now we apply a machine learning algorithm to find communities. The igraph package offers many algorithms; we will use “Walktrap,” a method that finds communities through a series of short random walks. The idea is that walks are more likely to get “trapped” within densely connected parts of the network (i.e., communities).

# Perform community detection using the Walktrap algorithm
communities <- cluster_walktrap(sequence_graph)

# Add the community membership as an attribute to the graph nodes
V(sequence_graph)$community <- communities$membership

Step 4: Visualize and Interpret the Communities

Finally, we use ggraph to visualize the network, coloring each node by the community it was assigned to by the algorithm. If the analysis was successful, the colors should align with the original sequence families we simulated.

# Convert to tidygraph for use with ggraph
g_tidy <- as_tbl_graph(sequence_graph)

# Plot the network, coloring nodes by their detected community
ggraph(g_tidy, layout = "fr") + 
  geom_edge_link(aes(alpha = weight), show.legend = FALSE) +
  geom_node_point(aes(color = as.factor(community)), size = 4) +
  scale_color_viridis(discrete = TRUE, option = "D") +
  labs(title = "Sequence Network with Detected Communities",
       color = "Detected\nCommunity") +
  theme_void()

The plot clearly visualizes the network structure, and the distinct color groups demonstrate that the community detection algorithm successfully identified and separated the three original sequence families without any prior information. This unsupervised approach is a powerful tool for exploring the hidden structure within a complex repertoire.