Data analysis

A next-generation understanding of immune responses

Repertoire sequencing experiments usually generate thousands to millions of sequencing reads per sample. Powerful, specialized bioinformatics pipelines are needed to interpret large-scale repertoire data accurately and to extract biologically and clinically relevant insights. The analysis of immune repertoire data can be divided into two steps: clone identification and quantification, followed by downstream analysis.


From sequencing reads to clone counts

The correct identification and quantification of lymphocyte clones is a fundamental step in the analysis of immune repertoires. Depending on the library preparation, the pipeline generally entails the following steps:

  • Preprocessing of the sequencing (FASTQ) files, which includes read quality control and, if applicable, read pairing.

  • Alignment of the sequences against a reference database to determine the variable (V), joining (J) and sometimes diversity (D) gene segments.

  • Grouping of the sequences into clonotypes. Note that the definition of a clonotype differs between B cells and T cells. Subsequently, some form of error correction is applied to the clonotype groups to correct for sequencing and PCR errors, and the abundance of each clonotype is determined. The annotated sequences are now ready for downstream analysis.
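The grouping, error-correction and quantification steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any particular pipeline: reads are assumed to be already aligned and represented as hypothetical `(v_gene, j_gene, cdr3_nucleotides)` tuples, a clonotype is taken to be an exact match on all three fields (a simplified, T-cell-style definition), and the error-correction rule (absorb a rare clonotype into a neighbor one mismatch away that is at least `min_ratio` times more abundant) is a simple stand-in for the more elaborate models used in practice.

```python
from collections import Counter


def count_clonotypes(annotated_reads):
    """Count reads per clonotype; a clonotype is an exact (V, J, CDR3) match."""
    return Counter(annotated_reads)


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct_errors(counts, min_ratio=10):
    """Absorb rare clonotypes into abundant neighbors one mismatch away.

    Assumption: a clonotype that differs by a single CDR3 position from a
    clonotype with the same V and J genes, and that is at least `min_ratio`
    times rarer, is treated as a sequencing or PCR error of that neighbor.
    """
    corrected = Counter(counts)
    # Visit the rarest clonotypes first so they are merged into abundant ones.
    for child, _ in sorted(counts.items(), key=lambda kv: kv[1]):
        n_child = corrected[child]
        parents = [
            (n, parent)
            for parent, n in corrected.items()
            if parent != child
            and parent[0] == child[0]
            and parent[1] == child[1]
            and len(parent[2]) == len(child[2])
            and hamming(parent[2], child[2]) == 1
            and n >= min_ratio * n_child
        ]
        if parents:
            _, best_parent = max(parents)  # most abundant candidate
            corrected[best_parent] += corrected.pop(child)
    return corrected


if __name__ == "__main__":
    reads = (
        [("TRBV1", "TRBJ1", "TGTGCCAGCAGC")] * 20
        + [("TRBV1", "TRBJ1", "TGTGCCAGCAGT")]      # likely an error of the clone above
        + [("TRBV2", "TRBJ1", "TGTGCCAGCAGA")] * 5  # different V gene, kept separate
    )
    for clonotype, n in correct_errors(count_clonotypes(reads)).most_common():
        print(clonotype, n)
```

The threshold `min_ratio` is an arbitrary illustrative value. For B cells, where somatic hypermutation produces genuine single-nucleotide variants, a rule like this would need to be considerably more conservative.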

Downstream analysis

From immune repertoire data to biological insight

Immune repertoire data extends down to nucleotide resolution, yet is also well suited to studying the system-level dynamics of the immune repertoire as a whole. To derive biologically relevant information, the methods used to interrogate different aspects of the immune repertoire have to be tailored to the nature of the data.

Explore the different methods to characterize various features of the immune repertoire:

Clone tracking

Cancer is clonal: in lymphoid malignancies, a single T or B cell can be the progenitor of the entire cancer. This concept is used for diagnostic purposes, where a large group of B or T cells sharing a common sequence rearrangement supports the diagnosis of leukemia or lymphoma. After treatment, that same sequence is used for ultra-sensitive detection of residual leukemic clones.
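As a rough illustration of tracking a clone across time points, the sketch below computes the fraction of reads carrying the diagnostic sequence in a follow-up sample. The function name `clone_frequency`, the input layout (a mapping from clonotype sequence to read count) and the example counts are all assumptions made for this example; a real residual-disease assay would also have to account for sequencing depth, input cell numbers and background error rates.

```python
def clone_frequency(clone_counts, target_sequence):
    """Fraction of reads in a sample that carry `target_sequence`.

    `clone_counts` maps clonotype sequence -> read count (assumed layout).
    """
    total = sum(clone_counts.values())
    if total == 0:
        return 0.0
    return clone_counts.get(target_sequence, 0) / total


if __name__ == "__main__":
    # Hypothetical read counts, for illustration only.
    diagnosis = {"CASSLGQETQYF": 900, "CASSPGTGYEQYF": 100}
    follow_up = {"CASSLGQETQYF": 2, "CASSQDRGNTIYF": 99_998}

    # The dominant clone at diagnosis is used as the marker sequence.
    index_clone = max(diagnosis, key=diagnosis.get)
    print(index_clone, clone_frequency(follow_up, index_clone))
```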


The vast diversity of immune repertoires makes it highly unlikely to find the same sequence in two individuals, or even in separate samples from the same organism. This limits the direct comparison of sequences and has led to the adoption of sequence-independent measures.

CDR3 length distribution

CDR3 length distribution comparisons are a popular tool in immunodiagnostics because they are sequence independent and therefore less sensitive to mutations and sequencing errors. Shifted or skewed CDR3 length distributions have been associated with clinical outcome in various autoimmune diseases and cancers.
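A CDR3 length distribution is straightforward to compute once clonotype counts are available. The sketch below assumes a mapping from CDR3 amino acid sequence to read count, and compares two samples with the total variation distance; that distance is just one simple choice made for this illustration, not a measure prescribed by the text.

```python
from collections import Counter


def cdr3_length_distribution(clone_counts):
    """Return {CDR3 length: fraction of reads} for one sample."""
    totals = Counter()
    for cdr3, count in clone_counts.items():
        totals[len(cdr3)] += count
    n_reads = sum(totals.values())
    if n_reads == 0:
        return {}
    return {length: c / n_reads for length, c in sorted(totals.items())}


def total_variation_distance(p, q):
    """Half the summed absolute difference between two length distributions.

    0 means identical distributions, 1 means no overlap at all.
    """
    lengths = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in lengths)


if __name__ == "__main__":
    # Hypothetical samples, for illustration only.
    sample_a = {"CASSL": 3, "CASSLG": 1}
    sample_b = {"CASSLG": 4}
    dist_a = cdr3_length_distribution(sample_a)
    dist_b = cdr3_length_distribution(sample_b)
    print(dist_a, dist_b, total_variation_distance(dist_a, dist_b))
```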

Gene usage

During V(D)J recombination, variable (V), joining (J) and sometimes diversity (D) gene segments are rearranged in a nearly random fashion.

Statistical analysis of gene usage across different immunological states has shown that particular V and J genes are preferentially used in specific immune responses, which can assist immunodiagnostics and patient stratification.
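Before any statistical comparison, gene usage has to be summarized per sample. The following sketch computes V gene usage fractions from a list of hypothetical `(v_gene, j_gene, read_count)` clonotype records. Whether each clonotype counts once or is weighted by its abundance is an analysis choice, exposed here through the assumed `weight_by_abundance` flag.

```python
from collections import Counter


def v_gene_usage(clonotypes, weight_by_abundance=False):
    """Return {V gene: usage fraction} for one sample.

    `clonotypes` is an iterable of (v_gene, j_gene, read_count) tuples.
    By default each clonotype counts once; with `weight_by_abundance=True`
    each clonotype contributes its read count instead.
    """
    usage = Counter()
    for v_gene, _j_gene, count in clonotypes:
        usage[v_gene] += count if weight_by_abundance else 1
    total = sum(usage.values())
    if total == 0:
        return {}
    return {gene: n / total for gene, n in usage.items()}


if __name__ == "__main__":
    # Hypothetical clonotype table, for illustration only.
    sample = [
        ("TRBV5-1", "TRBJ2-7", 10),
        ("TRBV5-1", "TRBJ1-1", 1),
        ("TRBV20-1", "TRBJ2-7", 1),
    ]
    print(v_gene_usage(sample))
    print(v_gene_usage(sample, weight_by_abundance=True))
```

The same function applies to J genes by swapping the field that is counted. Per-sample fractions like these would then feed into a group-level statistical test.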

CDR3 amino acid composition

Because the CDR3 largely determines the specificity and binding strength for an antigen, the information captured in this region is crucial for accurately characterizing immune responses. Statistical analysis of the CDR3 amino acid composition can yield valuable information about an ongoing immune response. For example, an enrichment in the CDR3 region for amino acids with certain physicochemical properties (e.g. hydrophobicity, charge or polarity) can indicate functional selection in response to antigenic exposure.
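Two of the physicochemical properties mentioned above can be summarized per CDR3 as follows. This sketch uses the Kyte-Doolittle hydropathy scale for hydrophobicity and a crude net-charge estimate (+1 for K and R, -1 for D and E, histidine ignored). Both choices are assumptions made for illustration; other scales and charge models are equally valid.

```python
# Kyte-Doolittle hydropathy index per amino acid (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydropathy(cdr3):
    """Average Kyte-Doolittle hydropathy over all residues of a CDR3."""
    if not cdr3:
        raise ValueError("empty CDR3 sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in cdr3) / len(cdr3)


def net_charge(cdr3):
    """Crude net charge: +1 per K/R, -1 per D/E (histidine ignored)."""
    positive = sum(cdr3.count(aa) for aa in "KR")
    negative = sum(cdr3.count(aa) for aa in "DE")
    return positive - negative


if __name__ == "__main__":
    for cdr3 in ["CASSLGQETQYF", "CASRDKGNTIYF"]:
        print(cdr3, round(mean_hydropathy(cdr3), 3), net_charge(cdr3))
```

Comparing the distribution of such per-sequence scores between two groups of repertoires is one way to look for the enrichment described above.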

B-cell lineage

Upon activation, B cells can mutate heavily and, as a result, form groups of clonally related cells. In the search for therapeutic antibodies, it is of great interest to investigate the clonal ‘space’ surrounding the antibody of interest, as this increases the chance of identifying antibodies with optimal therapeutic properties. Moreover, it allows researchers to extrapolate external data (e.g. wet-lab measurements such as affinity) from known therapeutics to B cells in the same lineage. Lastly, by investigating clonally related sequences, we can trace B cell lineages back to an unmutated common ancestor, which can inform novel treatment strategies for rapidly evolving viral pathogens such as influenza and HIV.

Public clones

Public clones are TCR or BCR sequences that are shared between different individuals and can indicate a common response to the same antigen. T cell clones identified as “public” in a population known to have been exposed to the same pathogen have the potential to be used to diagnose pathogenic exposure in the general population.
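Finding candidate public clones amounts to counting in how many individuals each sequence occurs. The sketch below assumes a mapping from a hypothetical individual identifier to that individual's CDR3 sequences, and uses exact sequence identity as the sharing criterion; the `min_individuals` threshold is an illustrative parameter.

```python
from collections import defaultdict


def public_clones(repertoires, min_individuals=2):
    """Return {sequence: set of individuals} for sequences shared widely enough.

    `repertoires` maps an individual identifier to an iterable of CDR3
    sequences. A sequence is reported if it is observed in at least
    `min_individuals` different individuals.
    """
    seen_in = defaultdict(set)
    for individual, sequences in repertoires.items():
        for sequence in sequences:
            seen_in[sequence].add(individual)
    return {
        sequence: individuals
        for sequence, individuals in seen_in.items()
        if len(individuals) >= min_individuals
    }


if __name__ == "__main__":
    # Hypothetical donors and sequences, for illustration only.
    cohort = {
        "donor_1": ["CASSLGQETQYF", "CASSQDRGNTIYF"],
        "donor_2": ["CASSLGQETQYF"],
        "donor_3": ["CASSLGQETQYF", "CASSQDRGNTIYF", "CASSVGTGELFF"],
    }
    for sequence, donors in public_clones(cohort).items():
        print(sequence, sorted(donors))
```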

T-cell clustering

T cell clustering is based on the idea that receptors that are more alike in terms of CDR3 sequence and structure have a higher probability of sharing epitope specificity. Using structural TCR data, conserved motifs and CDR3 sequence similarity, we can learn to predict, from sequence alone, whether two receptors are likely to share specificity for the same epitope. Rather than tracking the expansion and contraction of individual clones, cluster-based analyses can track changes in the immune repertoire on a functional level, facilitating cross-patient analysis.
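The sequence-similarity part of this idea can be illustrated with a deliberately simple stand-in: link two CDR3 sequences when they have the same length and differ in at most `max_dist` positions, then report the connected components as clusters. This is only a sketch under that assumption; it uses neither structural data nor learned motifs, and the all-against-all comparison is only practical for small inputs.

```python
def within_hamming(a, b, max_dist=1):
    """True if a and b have equal length and at most `max_dist` mismatches."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= max_dist


def cluster_cdr3(sequences, max_dist=1):
    """Single-linkage clustering of CDR3 sequences via union-find.

    Returns a sorted list of clusters, each a sorted list of sequences.
    """
    seqs = sorted(set(sequences))
    parent = {s: s for s in seqs}

    def find(s):
        # Follow parent links to the cluster representative (path halving).
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Quadratic all-against-all comparison: fine for a sketch, slow at scale.
    for i, a in enumerate(seqs):
        for b in seqs[i + 1:]:
            if within_hamming(a, b, max_dist):
                parent[find(a)] = find(b)

    clusters = {}
    for s in seqs:
        clusters.setdefault(find(s), []).append(s)
    return sorted(clusters.values())


if __name__ == "__main__":
    # Hypothetical sequences, for illustration only.
    example = ["CASSLGQ", "CASSLGE", "CASSVGE", "CATTTTT", "CASSL"]
    for cluster in cluster_cdr3(example):
        print(cluster)
```

Because linkage is transitive, two sequences can end up in the same cluster without being direct neighbors of each other, as `CASSLGQ` and `CASSVGE` do in the example.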

Generation probabilities

The generation probability of a receptor is the probability that a specific V(D)J rearrangement is generated by chance. The number of possible receptor rearrangements is tremendous, but large-scale analysis of receptor sequence data has shown that the process is biased towards the production of a limited pool of conformations. By learning the statistical properties of this process, one can predict the probability of a particular receptor sequence being generated. This information can be used to support the hypothesis that an observed clonal expansion is a response to antigenic exposure rather than a consequence of random gene rearrangement.
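In formula form, the generation probability of a sequence is commonly written as a sum over all recombination scenarios that produce it. The notation below is introduced here for illustration: $\sigma$ is a receptor sequence, $E$ is a recombination scenario (a choice of gene segments, deletions and insertions), and $P_{\mathrm{rec}}$ is the learned probability of a scenario. The factorization in the second line is one simple assumed form for a chain without a D segment, not the model of any specific tool.

```latex
\begin{align}
  P_{\mathrm{gen}}(\sigma)
    &= \sum_{E \,:\, E \to \sigma} P_{\mathrm{rec}}(E), \\
  P_{\mathrm{rec}}(E)
    &= P(V)\, P(J)\,
       P(\mathrm{del}_V \mid V)\, P(\mathrm{del}_J \mid J)\,
       P(\mathrm{ins}_{VJ}).
\end{align}
```

The sum matters because different scenarios can yield the same sequence, so a sequence reachable through many likely scenarios has a high generation probability.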