vignettes/articles/Combining_Contigs.Rmd
There are varying definitions of clones (also called clonotypes) in the literature. For the purposes of scRepertoire, we use the term clone and define it as the cells with shared/trackable complementarity-determining region 3 (CDR3) sequences. Within this definition, one might use the amino acid (aa) sequences of one or both chains to define a clone. Alternatively, we could use the nucleotide (nt) sequences or the V(D)JC genes (genes). The gene-based definition is more permissive, as multiple amino acid or nucleotide sequences can result from the same gene combination. Another option is to combine the V(D)JC genes with the nucleotide sequence (strict). scRepertoire supports all of these definitions and lets users examine both chains together or each chain individually.
The first step in getting clones is to use the single-cell barcodes
to organize cells into paired sequences. This is accomplished using
combineTCR() and combineBCR().
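The main arguments of combineTCR() are input.data (for example, the
output of loadContigs()), samples and ID (sample labels; the samples
value is prefixed to each barcode, as in P17B_AAACCTGAGTACGACG-1 in the
output of the example call), removeNA, removeMulti, and filterMulti.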
The output of combineTCR()
will be a list of contig data
frames that will be reduced to the reads associated with a single cell
barcode. It will also combine the multiple reads into clone calls by
either the nucleotide sequence (CTnt), amino acid
sequence (CTaa), the VDJC gene sequence
(CTgene), or the combination of the nucleotide and gene
sequence (CTstrict).
combined.TCR <- combineTCR(contig_list,
                           samples = c("P17B", "P17L", "P18B", "P18L",
                                       "P19B", "P19L", "P20B", "P20L"),
                           removeNA = FALSE,
                           removeMulti = FALSE,
                           filterMulti = FALSE)
head(combined.TCR[[1]])
## barcode sample TCR1 cdr3_aa1
## 1 P17B_AAACCTGAGTACGACG-1 P17B TRAV25.TRAJ20.TRAC CGCSNDYKLSF
## 3 P17B_AAACCTGCAACACGCC-1 P17B TRAV38-2/DV8.TRAJ52.TRAC CAYRSAQAGGTSYGKLTF
## 5 P17B_AAACCTGCAGGCGATA-1 P17B TRAV12-1.TRAJ9.TRAC CVVSDNTGGFKTIF
## 7 P17B_AAACCTGCATGAGCGA-1 P17B TRAV12-1.TRAJ9.TRAC CVVSDNTGGFKTIF
## 9 P17B_AAACGGGAGAGCCCAA-1 P17B TRAV20.TRAJ8.TRAC CAVRGEGFQKLVF
## 10 P17B_AAACGGGAGCGTTTAC-1 P17B TRAV12-1.TRAJ9.TRAC CVVSDNTGGFKTIF
## cdr3_nt1
## 1 TGTGGGTGTTCTAACGACTACAAGCTCAGCTTT
## 3 TGTGCTTATAGGAGCGCGCAGGCTGGTGGTACTAGCTATGGAAAGCTGACATTT
## 5 TGTGTGGTCTCCGATAATACTGGAGGCTTCAAAACTATCTTT
## 7 TGTGTGGTCTCCGATAATACTGGAGGCTTCAAAACTATCTTT
## 9 TGTGCTGTGCGAGGAGAAGGCTTTCAGAAACTTGTATTT
## 10 TGTGTGGTCTCCGATAATACTGGAGGCTTCAAAACTATCTTT
## TCR2 cdr3_aa2
## 1 TRBV5-1.None.TRBJ2-7.TRBC2 CASSLTDRTYEQYF
## 3 TRBV10-3.None.TRBJ2-2.TRBC2 CAISEQGKGELFF
## 5 TRBV9.None.TRBJ2-2.TRBC2 CASSVRRERANTGELFF
## 7 TRBV9.None.TRBJ2-2.TRBC2 CASSVRRERANTGELFF
## 9 <NA> <NA>
## 10 TRBV9.None.TRBJ2-2.TRBC2 CASSVRRERANTGELFF
## cdr3_nt2
## 1 TGCGCCAGCAGCTTGACCGACAGGACCTACGAGCAGTACTTC
## 3 TGTGCCATCAGTGAACAGGGGAAAGGGGAGCTGTTTTTT
## 5 TGTGCCAGCAGCGTAAGGAGGGAAAGGGCGAACACCGGGGAGCTGTTTTTT
## 7 TGTGCCAGCAGCGTAAGGAGGGAAAGGGCGAACACCGGGGAGCTGTTTTTT
## 9 <NA>
## 10 TGTGCCAGCAGCGTAAGGAGGGAAAGGGCGAACACCGGGGAGCTGTTTTTT
## CTgene
## 1 TRAV25.TRAJ20.TRAC_TRBV5-1.None.TRBJ2-7.TRBC2
## 3 TRAV38-2/DV8.TRAJ52.TRAC_TRBV10-3.None.TRBJ2-2.TRBC2
## 5 TRAV12-1.TRAJ9.TRAC_TRBV9.None.TRBJ2-2.TRBC2
## 7 TRAV12-1.TRAJ9.TRAC_TRBV9.None.TRBJ2-2.TRBC2
## 9 TRAV20.TRAJ8.TRAC_NA
## 10 TRAV12-1.TRAJ9.TRAC_TRBV9.None.TRBJ2-2.TRBC2
## CTnt
## 1 TGTGGGTGTTCTAACGACTACAAGCTCAGCTTT_TGCGCCAGCAGCTTGACCGACAGGACCTACGAGCAGTACTTC
## 3 TGTGCTTATAGGAGCGCGCAGGCTGGTGGTACTAGCTATGGAAAGCTGACATTT_TGTGCCATCAGTGAACAGGGGAAAGGGGAGCTGTTTTTT
## 5 TGTGTGGTCTCCGATAATACTGGAGGCTTCAAAACTATCTTT_TGTGCCAGCAGCGTAAGGAGGGAAAGGGCGAACACCGGGGAGCTGTTTTTT
## 7 TGTGTGGTCTCCGATAATACTGGAGGCTTCAAAACTATCTTT_TGTGCCAGCAGCGTAAGGAGGGAAAGGGCGAACACCGGGGAGCTGTTTTTT
## 9 TGTGCTGTGCGAGGAGAAGGCTTTCAGAAACTTGTATTT_NA
## 10 TGTGTGGTCTCCGATAATACTGGAGGCTTCAAAACTATCTTT_TGTGCCAGCAGCGTAAGGAGGGAAAGGGCGAACACCGGGGAGCTGTTTTTT
## CTaa
## 1 CGCSNDYKLSF_CASSLTDRTYEQYF
## 3 CAYRSAQAGGTSYGKLTF_CAISEQGKGELFF
## 5 CVVSDNTGGFKTIF_CASSVRRERANTGELFF
## 7 CVVSDNTGGFKTIF_CASSVRRERANTGELFF
## 9 CAVRGEGFQKLVF_NA
## 10 CVVSDNTGGFKTIF_CASSVRRERANTGELFF
## CTstrict
## 1 TRAV25.TRAJ20.TRAC;TGTGGGTGTTCTAACGACTACAAGCTCAGCTTT_TRBV5-1.None.TRBJ2-7.TRBC2;TGCGCCAGCAGCTTGACCGACAGGACCTACGAGCAGTACTTC
## 3 TRAV38-2/DV8.TRAJ52.TRAC;TGTGCTTATAGGAGCGCGCAGGCTGGTGGTACTAGCTATGGAAAGCTGACATTT_TRBV10-3.None.TRBJ2-2.TRBC2;TGTGCCATCAGTGAACAGGGGAAAGGGGAGCTGTTTTTT
## 5 TRAV12-1.TRAJ9.TRAC;TGTGTGGTCTCCGATAATACTGGAGGCTTCAAAACTATCTTT_TRBV9.None.TRBJ2-2.TRBC2;TGTGCCAGCAGCGTAAGGAGGGAAAGGGCGAACACCGGGGAGCTGTTTTTT
## 7 TRAV12-1.TRAJ9.TRAC;TGTGTGGTCTCCGATAATACTGGAGGCTTCAAAACTATCTTT_TRBV9.None.TRBJ2-2.TRBC2;TGTGCCAGCAGCGTAAGGAGGGAAAGGGCGAACACCGGGGAGCTGTTTTTT
## 9 TRAV20.TRAJ8.TRAC;TGTGCTGTGCGAGGAGAAGGCTTTCAGAAACTTGTATTT_NA;NA
## 10 TRAV12-1.TRAJ9.TRAC;TGTGTGGTCTCCGATAATACTGGAGGCTTCAAAACTATCTTT_TRBV9.None.TRBJ2-2.TRBC2;TGTGCCAGCAGCGTAAGGAGGGAAAGGGCGAACACCGGGGAGCTGTTTTTT
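To make the four clone-call columns in the output above concrete, the following is a minimal base R sketch of how they relate to the per-chain columns. It is an illustration only and not scRepertoire's internal code: the toy data frame `cells` is hypothetical, with values copied from rows 1 and 9 of the output, and the separators (`_` between chains, `;` between genes and nucleotide sequence) are read off the printed values.

```r
# Toy per-chain annotations for two cells (values copied from the output above).
# The second cell has no beta chain, so its TCR2 columns are NA.
cells <- data.frame(
  TCR1     = c("TRAV25.TRAJ20.TRAC", "TRAV20.TRAJ8.TRAC"),
  cdr3_aa1 = c("CGCSNDYKLSF", "CAVRGEGFQKLVF"),
  cdr3_nt1 = c("TGTGGGTGTTCTAACGACTACAAGCTCAGCTTT",
               "TGTGCTGTGCGAGGAGAAGGCTTTCAGAAACTTGTATTT"),
  TCR2     = c("TRBV5-1.None.TRBJ2-7.TRBC2", NA),
  cdr3_aa2 = c("CASSLTDRTYEQYF", NA),
  cdr3_nt2 = c("TGCGCCAGCAGCTTGACCGACAGGACCTACGAGCAGTACTTC", NA),
  stringsAsFactors = FALSE
)

# paste() renders a missing value as the string "NA", which reproduces
# the "_NA" suffixes seen for cells with a single chain.
cells$CTgene   <- paste(cells$TCR1, cells$TCR2, sep = "_")
cells$CTnt     <- paste(cells$cdr3_nt1, cells$cdr3_nt2, sep = "_")
cells$CTaa     <- paste(cells$cdr3_aa1, cells$cdr3_aa2, sep = "_")

# The strict call joins genes and nucleotide sequence per chain with ";",
# then joins the two chains with "_".
cells$CTstrict <- paste(
  paste(cells$TCR1, cells$cdr3_nt1, sep = ";"),
  paste(cells$TCR2, cells$cdr3_nt2, sep = ";"),
  sep = "_"
)

cells[, c("CTgene", "CTaa")]
```

Two cells share a clone under a given definition when the corresponding column matches, which is why CTgene is the most permissive call and CTstrict the most restrictive for TCR data.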
combineBCR() is analogous to combineTCR() with two major changes:
1) each barcode can have a maximum of 2 sequences; if more exist, the
2 with the highest reads are selected; 2) the strict definition
of a clone is based on the normalized Levenshtein edit distance of CDR3
nucleotide sequences and V-gene usage. For more information on this
approach, please see the respective citation. This
definition allows for the grouping of BCRs derived from the same
progenitor that have diverged through somatic hypermutation
and affinity maturation.
threshold
The level of normalized sequence similarity (see the formula below)
required to group sequences together. Default is 0.85.
\[ \text{threshold}(s, t) = 1-\frac{\text{Levenshtein}(s, t)}{\frac{\text{length}(s) + \text{length}(t)}{2}} \]
call.related.clones
Calculate the normalized edit distance (TRUE) or skip
the calculation (FALSE). Skipping the edit distance
calculation may save time, especially in the context of large data sets,
but is not recommended.
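The normalized similarity in the formula above can be sketched in base R with `utils::adist()`, which returns the Levenshtein distance under its default unit costs. The helper name `normalized_similarity` and the short test sequences are hypothetical and chosen for illustration; this is not the code combineBCR() runs internally.

```r
# Normalized similarity: 1 - Levenshtein(s, t) / mean(length(s), length(t)).
# Hypothetical helper for illustration; scRepertoire's internals may differ.
normalized_similarity <- function(s, t) {
  lev <- as.vector(utils::adist(s, t))      # edit distance between s and t
  mean_len <- (nchar(s) + nchar(t)) / 2     # average sequence length
  1 - lev / mean_len
}

threshold <- 0.85

# One substitution across 12 nucleotides: similarity is 1 - 1/12
sim_close <- normalized_similarity("TGTGCCAGCAGC", "TGTGCCAGCAGT")

# Three deletions, lengths 12 and 9: similarity is 1 - 3/10.5
sim_far <- normalized_similarity("TGTGCCAGCAGC", "TGTGCCAGC")

sim_close >= threshold  # TRUE: similar enough to be grouped at 0.85
sim_far >= threshold    # FALSE: too divergent at 0.85
```

Dividing by the mean length keeps the score comparable across CDR3 sequences of different lengths, so a single mutation counts for less in a long sequence than in a short one.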
BCR.contigs <- read.csv("https://www.borch.dev/uploads/contigs/b_contigs.csv")
combined.BCR <- combineBCR(BCR.contigs,
samples = "P1",
threshold = 0.85)
head(combined.BCR[[1]])
## barcode sample IGH
## 1 P1_CGAACATTCCCTTGTG-1 P1 <NA>
## 2 P1_CCTTCGAAGACGCTTT-1 P1 <NA>
## 3 P1_CAGCTAACAGCTTAAC-1 P1 <NA>
## 4 P1_CCCAATCAGAGGGCTT-1 P1 IGHV1-69-2.IGHD5-12.IGHJ6.IGHG1
## 5 P1_TACAGTGGTAAGGGCT-1 P1 IGHV1-18.IGHD3-10.IGHJ5.IGHA1
## 6 P1_GACAGAGGTTGGTAAA-1 P1 IGHV1-18.IGHD3-10.IGHJ5.IGHA1
## cdr3_aa1
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 CASGGLVAIRHYYYYGMDVW
## 5 CARVTYHHGSGSLVGGWFDPW
## 6 CARVTYHHGSGSLVGGWFDPW
## cdr3_nt1
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 TGTGCGAGCGGGGGATTAGTGGCTATTAGGCACTACTACTACTACGGTATGGACGTCTGG
## 5 TGTGCGAGAGTGACATATCACCATGGTTCGGGGAGCCTTGTCGGGGGCTGGTTCGACCCCTGG
## 6 TGTGCGAGAGTGACATATCACCATGGTTCGGGGAGCCTTGTCGGGGGCTGGTTCGACCCCTGG
## IGLC cdr3_aa2 cdr3_nt2
## 1 IGLV2-14.IGLJ6.IGLC2 CNSKGAGGTAVF TGCAACTCAAAAGGAGCCGGAGGCACTGCGGTTTTC
## 2 IGLV2-14.IGLJ6.IGLC2 CNSKGAGGTAVF TGCAACTCAAAAGGAGCCGGAGGCACTGCGGTTTTC
## 3 IGLV2-14.IGLJ2.IGLC2 CNSYTTSGTLWVF TGCAACTCATACACAACCAGCGGCACTCTCTGGGTATTC
## 4 IGLV2-14.IGLJ2.IGLC2 CNSYTSSSTLVF TGCAACTCATATACAAGCAGCAGCACTCTGGTCTTC
## 5 IGLV2-14.IGLJ1.IGLC1 CNSYTSFGTSVF TGCAACTCATATACAAGCTTCGGCACCTCGGTCTTC
## 6 IGLV2-14.IGLJ1.IGLC1 CNSYTSFGTSVF TGCAACTCATATACAAGCTTCGGCACCTCGGTCTTC
## CTgene
## 1 NA_IGLV2-14.IGLJ6.IGLC2
## 2 NA_IGLV2-14.IGLJ6.IGLC2
## 3 NA_IGLV2-14.IGLJ2.IGLC2
## 4 IGHV1-69-2.IGHD5-12.IGHJ6.IGHG1_IGLV2-14.IGLJ2.IGLC2
## 5 IGHV1-18.IGHD3-10.IGHJ5.IGHA1_IGLV2-14.IGLJ1.IGLC1
## 6 IGHV1-18.IGHD3-10.IGHJ5.IGHA1_IGLV2-14.IGLJ1.IGLC1
## CTnt
## 1 NA_TGCAACTCAAAAGGAGCCGGAGGCACTGCGGTTTTC
## 2 NA_TGCAACTCAAAAGGAGCCGGAGGCACTGCGGTTTTC
## 3 NA_TGCAACTCATACACAACCAGCGGCACTCTCTGGGTATTC
## 4 TGTGCGAGCGGGGGATTAGTGGCTATTAGGCACTACTACTACTACGGTATGGACGTCTGG_TGCAACTCATATACAAGCAGCAGCACTCTGGTCTTC
## 5 TGTGCGAGAGTGACATATCACCATGGTTCGGGGAGCCTTGTCGGGGGCTGGTTCGACCCCTGG_TGCAACTCATATACAAGCTTCGGCACCTCGGTCTTC
## 6 TGTGCGAGAGTGACATATCACCATGGTTCGGGGAGCCTTGTCGGGGGCTGGTTCGACCCCTGG_TGCAACTCATATACAAGCTTCGGCACCTCGGTCTTC
## CTaa
## 1 NA_CNSKGAGGTAVF
## 2 NA_CNSKGAGGTAVF
## 3 NA_CNSYTTSGTLWVF
## 4 CASGGLVAIRHYYYYGMDVW_CNSYTSSSTLVF
## 5 CARVTYHHGSGSLVGGWFDPW_CNSYTSFGTSVF
## 6 CARVTYHHGSGSLVGGWFDPW_CNSYTSFGTSVF
## CTstrict
## 1 NA.NA_IGLC.549.IGLV2-14
## 2 NA.NA_IGLC.549.IGLV2-14
## 3 NA.NA_IGLC.535.IGLV2-14
## 4 IGH.481.IGHV1-69-2_IGLC:Cluster.56.IGLV2-14
## 5 IGH.781.IGHV1-18_IGLC:Cluster.56.IGLV2-14
## 6 IGH.781.IGHV1-18_IGLC:Cluster.56.IGLV2-14
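As noted in the description of combineBCR(), when a barcode carries more than 2 sequences, the 2 with the highest reads are kept. A minimal base R sketch of that selection rule follows. The toy table `contigs` and its column names (`barcode`, `chain`, `reads`) are assumptions made for illustration, and the code is not the package's own implementation.

```r
# Hypothetical contig table: barcode "A" has three contigs, barcode "B" has one
contigs <- data.frame(
  barcode = c("A", "A", "A", "B"),
  chain   = c("IGH", "IGL", "IGK", "IGH"),
  reads   = c(120, 300, 45, 80),
  stringsAsFactors = FALSE
)

# Sort by barcode, then by descending read count within each barcode
ordered <- contigs[order(contigs$barcode, -contigs$reads), ]

# Rank contigs within each barcode (1 = most reads) and keep ranks 1 and 2
rank_in_barcode <- ave(ordered$reads, ordered$barcode, FUN = seq_along)
top2 <- ordered[rank_in_barcode <= 2, ]

top2
```

In this toy input, the IGK contig of barcode "A" is dropped because it has the fewest reads of the three, while barcode "B" keeps its single contig.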