Loading and Processing Contig Data

What data to load into scRepertoire?

scRepertoire primarily functions using the filtered_contig_annotations.csv output generated by 10x Genomics Cell Ranger. This file is typically found in the ./outs/ directory of your VDJ alignment folder.

Demonstrating Manual Data Loading (10x Genomics)

To prepare data for scRepertoire from 10x Genomics outputs, you would:

  • Load the filtered_contig_annotations.csv file for each of your samples.
  • Combine these loaded data frames into a single list in your R environment.

S1 <- read.csv(".../Sample1/outs/filtered_contig_annotations.csv")
S2 <- read.csv(".../Sample2/outs/filtered_contig_annotations.csv")
S3 <- read.csv(".../Sample3/outs/filtered_contig_annotations.csv")
S4 <- read.csv(".../Sample4/outs/filtered_contig_annotations.csv")

contig_list <- list(S1, S2, S3, S4)
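
With many samples, the same two steps can be written as a loop. The following is a minimal sketch, assuming one sub-folder per sample with an outs/filtered_contig_annotations.csv file inside it. The root folder, the sample names, and the tiny placeholder files are hypothetical and are created in a temporary directory only so that the snippet runs on its own.

```r
# Hypothetical layout: <root>/<sample>/outs/filtered_contig_annotations.csv
root <- file.path(tempdir(), "MyExperiment")
samples <- c("Sample1", "Sample2", "Sample3", "Sample4")

# Create tiny placeholder files so the sketch is runnable; real data
# would already be on disk.
for (s in samples) {
  out.dir <- file.path(root, s, "outs")
  dir.create(out.dir, recursive = TRUE, showWarnings = FALSE)
  write.csv(data.frame(barcode = "AAACCTGAGTACGACG-1", chain = "TRA"),
            file.path(out.dir, "filtered_contig_annotations.csv"),
            row.names = FALSE)
}

# Read each sample's contig file and keep the results in a named list
files <- file.path(root, samples, "outs", "filtered_contig_annotations.csv")
contig_list <- lapply(files, read.csv)
names(contig_list) <- samples
```

Naming the list elements is optional, but it makes it easier to keep track of which data frame belongs to which sample.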

Other alignment workflows

Beyond the default 10x Genomics Cell Ranger pipeline outputs, scRepertoire supports various other single-cell immune receptor sequencing formats through the loadContigs() function.

Supported Formats and Expected File Names

  • 10X: “filtered_contig_annotations.csv”
  • AIRR: “airr_rearrangement.tsv”
  • BD: “Contigs_AIRR.tsv”
  • Dandelion: “all_contig_dandelion.tsv”
  • Immcantation: “_data.tsv” (or similar)
  • JSON: “.json”
  • MiXCR: “clones.tsv”
  • ParseBio: “barcode_report.tsv”
  • TRUST4: “barcode_report.tsv”
  • WAT3R: “barcode_results.csv”

Key Parameter(s) for loadContigs()

  • input: A directory path containing your contig files (the function will recursively search) or a list/data frame of pre-loaded contig data.
  • format: A string specifying the data format (e.g., “10X”, “TRUST4”, “WAT3R”). If set to “auto”, the function will attempt to detect the format automatically from the file names or the data structure.

You can provide loadContigs() with a directory where your sequencing experiments are located, and it will recursively load and process the contig data based on the file names:

# Directory example
contig.output <- "~/Documents/MyExperiment"
contig.list <- loadContigs(input = contig.output, 
                           format = "TRUST4")

Alternatively, loadContigs() can be given a list of pre-loaded data frames and process the contig data based on the specified format:

# List of data frames example
S1 <- read.csv("~/Documents/MyExperiment/Sample1/outs/barcode_results.csv")
S2 <- read.csv("~/Documents/MyExperiment/Sample2/outs/barcode_results.csv")
S3 <- read.csv("~/Documents/MyExperiment/Sample3/outs/barcode_results.csv")
S4 <- read.csv("~/Documents/MyExperiment/Sample4/outs/barcode_results.csv")

contig.list <- list(S1, S2, S3, S4)
contig.list <- loadContigs(input = contig.list, 
                           format = "WAT3R")
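
When a directory is passed as input, it can help to check beforehand which files a recursive search would encounter. The sketch below uses base R's list.files() for that purpose. It is only an illustration and not how loadContigs() is implemented; the folder and the empty placeholder files are hypothetical and are created in a temporary directory so that the snippet runs on its own.

```r
# Hypothetical experiment folder with two TRUST4-style reports
root <- file.path(tempdir(), "MyTrust4Experiment")
for (s in c("Sample1", "Sample2")) {
  dir.create(file.path(root, s), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(root, s, "barcode_report.tsv"))
}

# Recursively list files whose name ends with the expected file name
found <- list.files(root,
                    pattern = "barcode_report\\.tsv$",
                    recursive = TRUE,
                    full.names = TRUE)

# Sample folders in which a report was found
basename(dirname(found))
```

If the returned vector is empty or contains unexpected files, the file names in the directory probably do not match the expected name for the chosen format.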

Multiplexed Experiment

To create a contig list from a multiplexed experiment, first generate a single-cell object (either Seurat or SingleCellExperiment), load the filtered contig file, and then call createHTOContigList(). The function returns a list split by the group.by variable(s).

Important Considerations for createHTOContigList()

  • This function depends on the barcodes matching exactly between the single-cell object and the contigs. If a prefix or a different suffix has been added to the barcodes, no contigs will be recovered.
  • It is currently recommended to perform this step before integration workflows, as integration commonly alters the barcodes.
  • There is a multi.run argument that can be used on an integrated object. However, it assumes that the barcodes were modified by the Seurat pipeline (which automatically appends _# to each barcode) and that your contig list is in the same order.
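
Because a barcode mismatch silently yields no contigs, it is worth checking the overlap before calling the function. The following is a minimal sketch using made-up barcode vectors and an assumed "S1_" prefix; with real data, the cell barcodes would typically come from colnames() of the single-cell object and the contig barcodes from the barcode column of the contig file.

```r
# Hypothetical barcodes: the single-cell object carries a sample prefix,
# the contig file does not.
sc.barcodes     <- c("S1_AAACCTGAGTACGACG-1", "S1_AAACCTGCAACACGCC-1")
contig.barcodes <- c("AAACCTGAGTACGACG-1", "AAACCTGCAACACGCC-1",
                     "AAACCTGCAGGCGATA-1")

# Fraction of contig barcodes found in the single-cell object
overlap.raw <- mean(contig.barcodes %in% sc.barcodes)

# Strip the (assumed) "S1_" prefix and check again
sc.stripped   <- sub("^S1_", "", sc.barcodes)
overlap.fixed <- mean(contig.barcodes %in% sc.stripped)

c(raw = overlap.raw, fixed = overlap.fixed)
```

An overlap of zero usually points to a prefix or suffix difference rather than to missing cells. An overlap below one can be expected, since not every cell with a contig is necessarily present in the single-cell object.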

To create a contig list separated by HTO (Hash Tag Oligo) IDs from a single-cell object:

contigs <- read.csv(".../outs/filtered_contig_annotations.csv")

contig.list <- createHTOContigList(contigs, 
                                   Seurat.Obj, 
                                   group.by = "HTO_maxID")
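
Conceptually, the result resembles a contig table split by a metadata column that is looked up by barcode. The base R sketch below illustrates that idea with mock data. It is not the implementation of createHTOContigList(), and the barcodes, metadata values, and object names are hypothetical.

```r
# Mock contig table and mock per-cell metadata (hypothetical values)
contigs <- data.frame(
  barcode = c("AAACCTGAGTACGACG-1", "AAACCTGAGTACGACG-1",
              "AAACCTGCAACACGCC-1", "AAACCTGCAGGCGATA-1"),
  chain   = c("TRA", "TRB", "TRA", "TRB")
)
meta <- data.frame(
  HTO_maxID = c("HTO-1", "HTO-2"),
  row.names = c("AAACCTGAGTACGACG-1", "AAACCTGCAACACGCC-1")
)

# Look up each contig's group by barcode; unmatched barcodes become NA
group <- meta$HTO_maxID[match(contigs$barcode, rownames(meta))]

# split() drops rows whose group is NA, so unmatched contigs are lost
contig.groups <- split(contigs, group)
names(contig.groups)
```

In this mock example, the contig whose barcode is absent from the metadata does not appear in any group, which mirrors why matching barcodes matter.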

Example Data in scRepertoire

scRepertoire includes a built-in example dataset to demonstrate the functionality of the R package. This dataset consists of T cells derived from four patients with acute respiratory distress, each with paired peripheral blood (B) and bronchoalveolar lavage (L) samples, giving 8 distinct runs for T cell receptor (TCR) enrichment. More information on the dataset can be found in the corresponding manuscript.

The built-in example data is derived from the 10x Cell Ranger pipeline, so it is ready to go for downstream processing and analysis.

To load and preview the example data built into scRepertoire:

data("contig_list") #the data built into scRepertoire

head(contig_list[[1]])
##              barcode is_cell                   contig_id high_confidence length
## 1 AAACCTGAGTACGACG-1    True AAACCTGAGTACGACG-1_contig_1            True    500
## 2 AAACCTGAGTACGACG-1    True AAACCTGAGTACGACG-1_contig_2            True    478
## 4 AAACCTGCAACACGCC-1    True AAACCTGCAACACGCC-1_contig_1            True    506
## 5 AAACCTGCAACACGCC-1    True AAACCTGCAACACGCC-1_contig_2            True    470
## 6 AAACCTGCAGGCGATA-1    True AAACCTGCAGGCGATA-1_contig_1            True    558
## 7 AAACCTGCAGGCGATA-1    True AAACCTGCAGGCGATA-1_contig_2            True    505
##   chain       v_gene d_gene  j_gene c_gene full_length productive
## 1   TRA       TRAV25   None  TRAJ20   TRAC        True       True
## 2   TRB      TRBV5-1   None TRBJ2-7  TRBC2        True       True
## 4   TRA TRAV38-2/DV8   None  TRAJ52   TRAC        True       True
## 5   TRB     TRBV10-3   None TRBJ2-2  TRBC2        True       True
## 6   TRA     TRAV12-1   None   TRAJ9   TRAC        True       True
## 7   TRB        TRBV9   None TRBJ2-2  TRBC2        True       True
##                 cdr3                                                cdr3_nt
## 1        CGCSNDYKLSF                      TGTGGGTGTTCTAACGACTACAAGCTCAGCTTT
## 2     CASSLTDRTYEQYF             TGCGCCAGCAGCTTGACCGACAGGACCTACGAGCAGTACTTC
## 4 CAYRSAQAGGTSYGKLTF TGTGCTTATAGGAGCGCGCAGGCTGGTGGTACTAGCTATGGAAAGCTGACATTT
## 5      CAISEQGKGELFF                TGTGCCATCAGTGAACAGGGGAAAGGGGAGCTGTTTTTT
## 6     CVVSDNTGGFKTIF             TGTGTGGTCTCCGATAATACTGGAGGCTTCAAAACTATCTTT
## 7  CASSVRRERANTGELFF    TGTGCCAGCAGCGTAAGGAGGGAAAGGGCGAACACCGGGGAGCTGTTTTTT
##   reads umis raw_clonotype_id         raw_consensus_id
## 1  8344    4     clonotype123 clonotype123_consensus_2
## 2 65390   38     clonotype123 clonotype123_consensus_1
## 4 18372    8     clonotype124 clonotype124_consensus_1
## 5 34054    9     clonotype124 clonotype124_consensus_2
## 6  5018    2       clonotype1   clonotype1_consensus_2
## 7 25110   11       clonotype1   clonotype1_consensus_1
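
A few quick summaries help confirm that a contig table looks as expected. The sketch below uses a small mock data frame that copies the barcode, chain, and umis values from the rows printed above, so that it runs without the package installed; with the real data, contig_list[[1]] would take the place of the mock object.

```r
# Mock data frame mirroring a few columns of the printed output
contigs <- data.frame(
  barcode = rep(c("AAACCTGAGTACGACG-1", "AAACCTGCAACACGCC-1",
                  "AAACCTGCAGGCGATA-1"), each = 2),
  chain   = rep(c("TRA", "TRB"), times = 3),
  umis    = c(4, 38, 8, 9, 2, 11)
)

n.cells   <- length(unique(contigs$barcode))  # number of distinct barcodes
per.chain <- table(contigs$chain)             # contigs per chain
per.cell  <- table(contigs$barcode)           # contigs per barcode
```

In these six rows, each barcode carries one TRA and one TRB contig.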