Data analysis

A next-generation understanding of immune responses

Repertoire sequencing experiments usually generate thousands to millions of sequencing reads per sample. Powerful, specialized bioinformatics pipelines are needed to interpret large-scale repertoire data accurately and to extract biologically and clinically relevant insights. The analysis of immune repertoire data can be divided into two steps: clone identification and quantification, followed by downstream analysis.


From sequencing reads to clone counts

The correct identification and quantification of lymphocyte clones is a fundamental step in the analysis of immune repertoires. Depending on the library preparation, the pipeline generally entails the following steps:

  • Preprocessing of the sequencing (FASTQ) files, which includes read quality control and, if applicable, read pairing.

  • Alignment of the sequences against a reference database to determine the variable (V), joining (J) and sometimes diversity (D) gene segments.

  • Grouping of the sequences into clonotypes. Note that the definition of a clonotype differs between B cells and T cells. Subsequently, some form of error correction is applied to the clonotype groups to correct for sequencing and PCR errors, and the abundance of each clonotype is determined. The annotated sequences are now ready for downstream analysis.
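The grouping and quantification step can be sketched as follows. This is a minimal illustration, not the behavior of any particular pipeline: it assumes alignment has already been done, that each read is a record with the hypothetical fields `v_gene`, `j_gene` and `cdr3_nt`, and that a clonotype is the exact combination of those three values (a common simplification for T cells). Error correction is deliberately left out.

```python
from collections import Counter


def count_clonotypes(annotated_reads):
    """Group annotated reads into clonotypes and quantify each one.

    Assumption: a clonotype is the exact (V gene, J gene, CDR3 nucleotide
    sequence) triple. No sequencing or PCR error correction is applied.
    """
    counts = Counter(
        (read["v_gene"], read["j_gene"], read["cdr3_nt"])
        for read in annotated_reads
    )
    total = sum(counts.values())
    # Most abundant clonotypes first.
    return {
        clonotype: {"count": n, "frequency": n / total}
        for clonotype, n in counts.most_common()
    }


# Hypothetical, already-aligned reads (illustrative values only).
reads = [
    {"v_gene": "TRBV7-9", "j_gene": "TRBJ2-1", "cdr3_nt": "TGTGCCAGCAGC"},
    {"v_gene": "TRBV7-9", "j_gene": "TRBJ2-1", "cdr3_nt": "TGTGCCAGCAGC"},
    {"v_gene": "TRBV20-1", "j_gene": "TRBJ1-1", "cdr3_nt": "TGCAGTGCTAGA"},
]
clones = count_clonotypes(reads)
```

The resulting table of clonotypes with counts and frequencies is the input assumed by the downstream analyses.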

Downstream analysis

From immune repertoire data to biological insight

Immune repertoire data offers nucleotide-level resolution but is also well suited to studying the dynamics of the immune repertoire as a whole. To derive biologically relevant information, the methods used to interrogate different aspects of the repertoire have to be tailored to the nature of the data.

Explore the different methods to characterize various features of the immune repertoire:

Clone tracking

Sampling longitudinal timepoints allows researchers to track unique receptor rearrangements over time. This approach can be employed for diagnostic purposes, e.g. to track the expansion or contraction of leukemic clones in minimal residual disease (MRD) detection, or to discover rearrangements involved in the response to vaccination or other immunotherapeutic interventions.
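A minimal sketch of clone tracking, assuming each timepoint has already been reduced to a mapping from clonotype to read count. The function name, the timepoint labels and the CDR3 sequences are illustrative assumptions.

```python
def track_clone(clone, samples):
    """Return the frequency of one clonotype at each timepoint.

    samples: dict mapping a timepoint label to {clonotype: read count}.
    A clonotype that is absent from a timepoint gets frequency 0.
    """
    trajectory = {}
    for timepoint, counts in samples.items():
        total = sum(counts.values())
        trajectory[timepoint] = counts.get(clone, 0) / total if total else 0.0
    return trajectory


# Hypothetical counts at two timepoints (illustrative values only).
samples = {
    "diagnosis": {"CASSLGQAYEQYF": 900, "CASSPDRGNTEAFF": 100},
    "follow_up": {"CASSLGQAYEQYF": 5, "CASSPDRGNTEAFF": 995},
}
trajectory = track_clone("CASSLGQAYEQYF", samples)
```

In this toy example the tracked clone contracts from 90% of reads at diagnosis to 0.5% at follow-up. Comparing frequencies rather than raw counts keeps samples of different sequencing depth comparable.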


Repertoire diversity

Repertoire diversity is a key measure of immunological complexity. Various diversity metrics, such as Shannon entropy, clonality, repertoire richness and evenness, can be used to convert the complexity of a sample into a numeric value. This value can be associated with clinical parameters, such as treatment progression, to generate biomarkers of clinical response.
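As a sketch of how these metrics relate to each other, the example below computes richness (number of distinct clones), Shannon entropy H = -Σ p_i ln p_i, Pielou's evenness H / ln(richness) and clonality as 1 - evenness, which is one commonly used definition. The function name and the handling of single-clone samples are assumptions made for illustration.

```python
import math


def diversity_metrics(counts):
    """Compute basic diversity metrics from a list of clone counts."""
    total = sum(counts)
    freqs = [c / total for c in counts if c > 0]
    richness = len(freqs)
    shannon = -sum(p * math.log(p) for p in freqs)
    # Evenness is undefined for a single clone; by convention here such a
    # sample is treated as maximally clonal (evenness 0, clonality 1).
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    return {
        "richness": richness,
        "shannon": shannon,
        "evenness": evenness,
        "clonality": 1.0 - evenness,
    }


even = diversity_metrics([10, 10, 10, 10])   # all clones equally abundant
skewed = diversity_metrics([97, 1, 1, 1])    # one dominant clone
```

Both samples have the same richness, but the skewed sample has a much higher clonality, which is why richness alone is rarely sufficient to describe a repertoire.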

CDR3 length distribution

Selection and expansion of lymphocyte clones lead to changes in the composition of immune receptor repertoires. Comparing CDR3 lengths reveals shifts in the length distribution and can indicate antigen-specific expansion, which has been associated with clinical outcome in various autoimmune diseases.
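A minimal sketch of a CDR3 length distribution, assuming a mapping from CDR3 amino acid sequence to clone count. Weighting each length by clone abundance, as done here, is a design choice; counting each unique clonotype once is an equally valid alternative that is less sensitive to single expanded clones.

```python
from collections import Counter


def cdr3_length_distribution(clone_counts):
    """Return {CDR3 length: fraction of reads}, weighted by clone count."""
    lengths = Counter()
    for cdr3, n in clone_counts.items():
        lengths[len(cdr3)] += n
    total = sum(lengths.values())
    return {length: n / total for length, n in sorted(lengths.items())}


# Hypothetical clone counts (illustrative values only).
clone_counts = {"CASSLGQAYEQYF": 6, "CASSPDRGNTEAFF": 3, "CASSQETQYF": 1}
distribution = cdr3_length_distribution(clone_counts)
```

Distributions computed this way for two samples can then be compared length by length to look for shifts.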

V/J gene usage

Gene segments play a major role in defining antigen recognition. Comparison of segment usage can uncover preferential use of particular genes and indicate skewing of the repertoire due to an antigen-specific response.
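Segment usage can be sketched as a simple frequency table. The example assumes a list of clonotype records with the hypothetical fields `v_gene` and `j_gene`, and counts each clonotype once regardless of its abundance.

```python
from collections import Counter


def gene_usage(clonotypes, segment="v_gene"):
    """Return {gene: fraction of clonotypes using it} for one segment type."""
    usage = Counter(clonotype[segment] for clonotype in clonotypes)
    total = sum(usage.values())
    return {gene: n / total for gene, n in usage.items()}


# Hypothetical annotated clonotypes (illustrative values only).
clonotypes = [
    {"v_gene": "TRBV7-9", "j_gene": "TRBJ2-1"},
    {"v_gene": "TRBV7-9", "j_gene": "TRBJ2-7"},
    {"v_gene": "TRBV7-9", "j_gene": "TRBJ2-1"},
    {"v_gene": "TRBV20-1", "j_gene": "TRBJ1-1"},
]
v_usage = gene_usage(clonotypes, "v_gene")
j_usage = gene_usage(clonotypes, "j_gene")
```

Comparing such tables between samples or cohorts is one way to spot preferential use of a particular gene.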

CDR3 amino acid composition

The information captured in the CDR3 region is crucial to antigen recognition. Analysis of the amino acid composition of the CDR3 region can yield valuable information about ongoing immune responses. An enrichment of amino acids with certain physicochemical properties (e.g. hydrophobicity, charge or polarity) can indicate functional selection in response to a defined antigenic stimulus.
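The sketch below illustrates two simple composition features: the pooled amino acid frequencies of a set of CDR3 sequences and a crude net charge per sequence. The charge model (+1 for K and R, -1 for D and E, with histidine and the termini ignored) is a simplifying assumption chosen for illustration, not a full physicochemical calculation.

```python
from collections import Counter

POSITIVE = set("KR")  # lysine, arginine
NEGATIVE = set("DE")  # aspartate, glutamate


def aa_composition(cdr3_sequences):
    """Return {amino acid: fraction} pooled over all given CDR3 sequences."""
    counts = Counter("".join(cdr3_sequences))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def net_charge(cdr3):
    """Crude net charge: +1 per K/R, -1 per D/E."""
    positive = sum(aa in POSITIVE for aa in cdr3)
    negative = sum(aa in NEGATIVE for aa in cdr3)
    return positive - negative


composition = aa_composition(["CASS", "CASS"])
charge = net_charge("CASSPDRGNTEAFF")  # one R, one D, one E
```

Per-sequence features like these can be averaged over a repertoire and compared between conditions to look for enrichment.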

B-cell lineage analysis

After activation, B cells expand and undergo somatic hypermutation, which gives rise to groups of clonally related cells with diverging sequences. Reconstruction of the clonal lineage tree allows researchers to investigate the clonal space surrounding well-characterized antibodies (e.g. antibodies characterized by wet-lab measurements such as affinity). The uncharacterized antibodies in this space can provide a source of alternative antibodies with better chemical or therapeutic properties that retain specificity for the target antigen.

Public clones

Public clones are TCR or BCR sequences that are shared between different individuals and can indicate a common response to the same antigen. When identified in a population known to have been exposed to the same pathogen, these receptor sequences have the potential to be used as a diagnostic tool for pathogenic exposure. Moreover, they can help guide development of novel therapeutics.
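A minimal sketch of public clone detection, assuming each individual's repertoire is available as a collection of CDR3 amino acid sequences and that exact sequence identity defines sharing. The function name and the `min_individuals` threshold are illustrative assumptions.

```python
from collections import Counter


def public_clones(repertoires, min_individuals=2):
    """Return {sequence: number of individuals carrying it} for shared clones.

    repertoires: dict mapping an individual's ID to an iterable of sequences.
    """
    sharing = Counter()
    for sequences in repertoires.values():
        # Use a set so each individual counts at most once per sequence.
        sharing.update(set(sequences))
    return {seq: n for seq, n in sharing.items() if n >= min_individuals}


# Hypothetical repertoires (illustrative values only).
repertoires = {
    "donor_1": ["CASSLGQAYEQYF", "CASSPDRGNTEAFF"],
    "donor_2": ["CASSLGQAYEQYF", "CASSQETQYF", "CASSLGQAYEQYF"],
    "donor_3": ["CASSLGQAYEQYF", "CASSQETQYF"],
}
shared = public_clones(repertoires)
```

Raising `min_individuals` restricts the result to clones shared more widely across the cohort.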

T cell receptor clustering

Receptors with similar sequences have a higher probability of sharing epitope specificity. Structural TCR data, conserved motifs and CDR3 sequence similarity can be used to predict whether two receptors are likely to recognize the same epitope. Rather than tracking the expansion and contraction of individual receptor sequences, cluster-based analyses track changes in groups of related sequences, which provides a more robust readout.
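One simple way to group similar receptors is sketched below: CDR3 sequences of equal length that differ at no more than one position are linked, and linked sequences are merged into clusters (single linkage). Hamming distance is used here only as a basic proxy for similarity; it is an assumption of this sketch, not the method of any specific clustering tool.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_cdr3(sequences, max_distance=1):
    """Single-linkage clustering of CDR3 sequences using union-find."""
    seqs = sorted(set(sequences))
    parent = {s: s for s in seqs}

    def find(s):
        # Follow parent links to the cluster representative.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(seqs, 2):
        if len(a) == len(b) and hamming(a, b) <= max_distance:
            parent[find(a)] = find(b)

    clusters = {}
    for s in seqs:
        clusters.setdefault(find(s), []).append(s)
    return sorted(clusters.values())


# Hypothetical CDR3 sequences (illustrative values only).
clusters = cluster_cdr3(
    ["CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASSLGEGYEQYF", "CASSPDRGNTEAFF"]
)
```

Because linkage is transitive, the first and third sequences end up in the same cluster even though they differ at two positions. The pairwise comparison is quadratic in the number of sequences, so large repertoires would need a more efficient approach.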

Generation probabilities

The generation probability of a receptor is the probability that a specific V(D)J rearrangement is generated. While the number of possible receptor rearrangements is vast, large-scale analyses of receptor sequences have shown that there is a bias towards a smaller pool of rearrangements. The generation probability can be used in conjunction with clone publicity to understand whether the observation of shared clones is likely due to a high chance of recombination or instead indicates a common antigen-specific response.
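The reasoning can be illustrated with a back-of-the-envelope calculation. The sketch assumes that a generation probability `pgen` has already been obtained from a dedicated recombination model, and that each repertoire consists of `repertoire_size` independent rearrangements with no selection or clonal expansion. Under those simplifying assumptions, the chance that a rearrangement is present in one repertoire is 1 - (1 - pgen)^repertoire_size. The function name and all numbers are hypothetical.

```python
import math


def expected_sharing(pgen, repertoire_size, n_individuals):
    """Expected number of individuals carrying a rearrangement by chance.

    Requires 0 <= pgen < 1. Assumes independent rearrangements and
    ignores selection and clonal expansion.
    """
    # 1 - (1 - pgen) ** repertoire_size, computed stably for tiny pgen.
    p_present = -math.expm1(repertoire_size * math.log1p(-pgen))
    return n_individuals * p_present


# Hypothetical values: two clones, each observed in 8 of 10 individuals.
likely_by_chance = expected_sharing(1e-6, 1_000_000, 10)
unlikely_by_chance = expected_sharing(1e-12, 1_000_000, 10)
```

With these numbers, the first clone is expected in roughly 6 of 10 individuals by recombination alone, so observing it in 8 is unremarkable. The second is expected in far fewer than one individual, so the same level of sharing would point towards a common antigen-specific response.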