5.1. Overview of the ImmuneBuilder Method
Since recognition of pathogenic peptides depends on T-cells and their receptors, it is of interest to model the protein structure of these receptors. Structure models, such as those produced by AlphaFold [30], are expected to yield insight into peptide immunogenicity and emergent pathogen evolution.
While AlphaFold uses deep learning to predict protein structure, it is not specifically adapted to the molecules of immunity. Therefore, Abanades and others developed ImmuneBuilder [59], a set of deep learning models specific to the hypervariable molecules of adaptive immunity. It includes a software component known as TCRBuilder2 [59], which implements a model to predict the protein structure of a T-cell receptor. Further, the authors showed that their method is over 100 times faster than the AlphaFold approach. This computational efficiency makes the software practical to run on a computer workstation. Since TCRBuilder2 is specific to prediction of a restricted set of protein structures, it has no dependency on a prior consisting of multiple sequence alignments. This is in contrast to AlphaFold-Multimer [60], which depends on this prior since it is designed as a general model of protein structure.
The model weights of TCRBuilder2 are publicly available [59], and the model is based on a curated set of 704 T-cell receptor variable domains [61]. Of these, a sample of 50 records was used in a validation step and, therefore, excluded from the training of the model [59].
The RMSD metric [62], as described in an above section, is a measure for comparing the quality of predictions of protein structure, such as those generated by TCRBuilder2 or AlphaFold-Multimer. In this case, the survey of methods showed similar levels of predictiveness for the structure of T-cell receptors [59]. For example, in the CDR (hypervariable complementarity-determining) regions of the TCR alpha and beta chains, the mean RMSD values, as reported in angstroms, are typically less than 2.0, whereas the values are nearer to 2.0 in the CDR3 region. Lower RMSD values indicate a closer correspondence between the predicted structure and the expected structure, and a value of 0 shows that the compared pair of structures is identical. CDR3 is an example of a hypervariable region, which is distinct from the other regions of the T-cell receptor. Therefore, it is expected that the hypervariable regions require an increased sample size to yield higher model predictiveness.
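The RMSD calculation described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the cited studies: it assumes the two structures are already superposed and that the atoms are listed in corresponding order. The function name and the coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z) points.

    Assumes the structures are already superposed and that point i of
    coords_a corresponds to point i of coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical C-alpha coordinates in angstroms (illustrative values only).
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
predicted = [(0.0, 0.0, 0.0), (3.8, 1.0, 0.0), (7.6, 0.0, 2.0)]

print(rmsd(reference, reference))  # identical structures give 0.0
print(round(rmsd(reference, predicted), 3))
```

Because every atom pair contributes equally to the sum, a few badly placed residues can dominate the value, which is one reason the region-by-region reporting above is informative.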
Interpretation of the RMSD values depends on knowledge of other factors at the molecular level, such as the number of amino acid residues compared. The TM-score metric [31] has fewer assumptions to meet and is helpful in validating the values generated by the RMSD metric. However, it is not uncommon to interpret a mean RMSD value of less than two angstroms as suggestive of structural similarity between two protein molecules. To further interpret the results of the TCRBuilder2 study, and to disentangle the parameters of the model of protein structure, the authors examined measurements of error in the reconstruction of the six angles between the alpha and beta chains (ABangles) [63], the four torsion angles of the side chain (potentiality for peptide binding) [64], and solvent accessibility of amino acid residues [59]. Overall, their region-by-region analysis supports a robust interpretation of model performance against that of competing methods.
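To illustrate why the TM-score is less sensitive to protein length than RMSD, the sketch below evaluates the published TM-score formula for one fixed superposition. It is a simplified illustration under stated assumptions: the TM-score software searches for the superposition that maximizes this value, whereas the code here takes the per-residue distances as given. The function names are hypothetical.

```python
def d0_for_length(target_length):
    """Length-dependent distance scale (angstroms) of the TM-score formula."""
    if target_length <= 15:
        raise ValueError("the d0 formula needs a target longer than 15 residues")
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    if d0 <= 0:
        raise ValueError("target is too short for a positive d0")
    return d0


def tm_score(distances, target_length):
    """TM-score of one fixed superposition.

    distances: distances in angstroms between aligned residue pairs.
    target_length: number of residues in the reference (target) structure.
    """
    d0 = d0_for_length(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# A perfect match over a 98-residue target scores 1.0.
print(tm_score([0.0] * 98, 98))
```

Each residue pair contributes a value between 0 and 1, so distant pairs are down-weighted rather than squared as in RMSD, and the normalization by target length keeps scores comparable across proteins of different size.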
5.3. Verification of the TCRBuilder2 Model
As described below, TCRBuilder2 can generate a 3D protein structure from input consisting of the protein sequences that correspond to the two TCR polypeptide chains, such as the pairing of alpha and beta chains. In the following example, the input is a protein complex from the RCSB database [68]: PDB record 5d2l (rcsb.org/structure/5d2l), which includes an empirically determined protein structure, a potential benchmark for measuring the quality of the protein prediction by TCRBuilder2. The empirical data for 5d2l are exportable as a PDB formatted file. This data file appears to correspond to an empirical analysis of a quaternary crystal structure of a protein. To further examine this empirical analysis, the PDB record was referenced to find any literature associated with the record. One associated article is entitled "Structural Basis for Clonal Diversity of the Public T Cell Response to a Dominant Human Cytomegalovirus Epitope" [69]. It has the following relevant details:
"The corresponding r.m.s.d. for the four C7·NLV·HLA-A2 complexes ranged from 0.50 to 0.83 Å. Based on these close similarities, the following descriptions of TCR-pMHC interactions apply to all complex molecules in the asymmetric unit of the C25·NLV·HLA-A2 or C7·NLV·HLA-A2 crystal."
"Three complex molecules in the asymmetric unit were located first; the fourth was found according to non-crystallographic symmetry."
To validate the above statements, a method is described below to confirm that the record 5d2l is composed of four empirically derived samples that nonetheless correspond to the same protein complex. First, the empirical data (in this case, PDB formatted) is processed to collate the alpha and beta chains of the TCR, along with their amino acid sequences. Second, comparisons of the data samples by chain type show that they are identical or nearly identical at the amino acid residue level. Therefore, the data is expected to contain four separate models that are based on four empirical samples of the crystallization of a single protein complex. Third, there is a REMARK section in the file with descriptions of 4 BIOMOLECULE(S).
The PDB formatted file of record 5d2l is annotated with information on the individual polypeptide chains. For this case, the relevant data fields are identified by the prefix COMPND:
COMPND 12 MOL_ID: 3
COMPND 13 MOLECULE: C7 TCR ALPHA CHAIN
COMPND 14 CHAIN: I, K, O, E
COMPND 16 MOL_ID: 4
COMPND 17 MOLECULE: C7 TCR BETA CHAIN
COMPND 18 CHAIN: J, L, P, F
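The COMPND fields above can also be read programmatically. The following is a minimal sketch, assuming the "KEY: value" layout shown in the listing; the function name is hypothetical, and the parser ignores values that wrap onto continuation lines.

```python
import re


def chains_by_molecule(pdb_lines):
    """Map each MOLECULE name in the COMPND records to its list of chain IDs."""
    mapping = {}
    molecule = None
    for line in pdb_lines:
        if not line.startswith("COMPND"):
            continue
        # Drop the record name and the optional continuation number.
        text = re.sub(r"^COMPND\s+\d*\s*", "", line).strip().rstrip(";")
        key, _, value = text.partition(":")
        key, value = key.strip(), value.strip()
        if key == "MOL_ID":
            molecule = None  # a new molecule block starts here
        elif key == "MOLECULE":
            molecule = value
        elif key == "CHAIN" and molecule is not None:
            mapping[molecule] = [c.strip() for c in value.split(",") if c.strip()]
    return mapping


compnd = [
    "COMPND 12 MOL_ID: 3",
    "COMPND 13 MOLECULE: C7 TCR ALPHA CHAIN",
    "COMPND 14 CHAIN: I, K, O, E",
    "COMPND 16 MOL_ID: 4",
    "COMPND 17 MOLECULE: C7 TCR BETA CHAIN",
    "COMPND 18 CHAIN: J, L, P, F",
]
print(chains_by_molecule(compnd))
```

For the six lines listed above, the result maps the alpha chain to I, K, O, and E and the beta chain to J, L, P, and F.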
The alpha and beta chains of the TCR are annotated with letter assignments that correspond to the four empirical models. The alpha chain is represented in the data file as I, K, O, and E; likewise, the beta chain is represented as J, L, P, and F. The file also has fields that list the amino acid sequence of each of these polypeptide chains. This data can be extracted by searching for the data fields that have a prefix of SEQRES and a letter that signifies the polypeptide chain of interest. For viewing these amino acid sequences and their residue similarity, sequence alignment software, such as ClustalW [70], is an appropriate tool. An amino acid sequence alignment of the TCR beta chain is shown in Figure 6 (confirming the identity of the four beta chains in the 5d2l data record).
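As a quick check before running a full alignment, the SEQRES extraction can be sketched as follows. The code assumes the fixed-width PDB layout (chain identifier in column 12, residue names from column 20) and compares sequences position by position, so it only applies to ungapped sequences of equal length. The function names are hypothetical, and the short example records are invented for illustration; they are not taken from 5d2l.

```python
def seqres_by_chain(pdb_lines):
    """Collect the three-letter residue names of each chain from SEQRES records."""
    sequences = {}
    for line in pdb_lines:
        if line.startswith("SEQRES"):
            chain_id = line[11]              # column 12 of the fixed-width record
            residues = line[19:70].split()   # residue names occupy columns 20-70
            sequences.setdefault(chain_id, []).extend(residues)
    return sequences


def identity(seq_a, seq_b):
    """Fraction of identical positions in two equal-length, ungapped sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


# Invented records for illustration only (not the 5d2l sequences).
example = [
    "SEQRES   1 J    4  ASN ALA GLY VAL",
    "SEQRES   1 L    4  ASN ALA GLY VAL",
    "SEQRES   1 P    4  ASN ALA GLY THR",
]
chains = seqres_by_chain(example)
print(identity(chains["J"], chains["L"]))
print(identity(chains["J"], chains["P"]))
```

An identity of 1.0 between chains of the same type is consistent with the chains being copies of one polypeptide; anything lower calls for inspection with an alignment tool.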
In this record, the corresponding pairs of TCR alpha and beta chains are shown by viewing the data fields with a prefix BIOMOLECULE, revealing that polypeptide chains I and J are of the same sample and, therefore, correspond to the alpha and beta chains of one TCR. The PDB formatted file of the record can then be used to create PDB formatted files specific to each of the chains I and J. The predictions by TCRBuilder2 are presented in PDB format, so comparisons are possible between them and the 3D protein structure stored in the PDB database. Next, the PDB data fields that are prefixed with ATOM for the TCR alpha chain were aligned (by visual inspection) and trimmed so that the amino acid residues are comparable and orthologous between the PDB record and the structure generated by TCRBuilder2. This procedure was repeated for the beta chain. The goal was a comparison between data that represent aligned amino acid residues, where each residue is orthologous between the compared structures. Last, the index value of the amino acid residues (data fields for residues have prefix ATOM) was reset as described in an above section on use of the TM-score metric.
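The chain extraction and index reset described above can be sketched as follows. This is a minimal illustration and not the procedure used by the author, who aligned and trimmed the records by visual inspection; the trimming step is not automated here. It assumes the fixed-width ATOM layout (chain identifier in column 22, residue number in columns 23-26). All function names are hypothetical, and atom_line is only a helper for building example records.

```python
def extract_chain(pdb_lines, chain_id):
    """Keep only the ATOM records of one chain (chain ID is column 22)."""
    return [ln for ln in pdb_lines
            if ln.startswith("ATOM") and len(ln) > 26 and ln[21] == chain_id]


def renumber_residues(atom_lines, start=1):
    """Reset residue numbers (columns 23-26) to run consecutively from `start`."""
    renumbered = []
    previous_key = None
    index = start - 1
    for ln in atom_lines:
        key = ln[22:27]  # original residue number plus insertion code
        if key != previous_key:
            index += 1
            previous_key = key
        # Write the new number and blank the insertion code in column 27.
        renumbered.append(f"{ln[:22]}{index:>4} {ln[27:]}")
    return renumbered


def atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Illustration-only helper that formats a minimal fixed-width ATOM record."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


# Invented records for illustration only.
records = [
    atom_line(1, "N", "GLY", "I", 3, 1.0, 2.0, 3.0),
    atom_line(2, "CA", "GLY", "I", 3, 1.5, 2.5, 3.5),
    atom_line(3, "N", "ALA", "I", 4, 4.0, 5.0, 6.0),
    atom_line(4, "N", "SER", "J", 2, 7.0, 8.0, 9.0),
]
chain_i = renumber_residues(extract_chain(records, "I"))
for line in chain_i:
    print(line)
```

Resetting the index in this way matters because a structure comparison that matches residues by number will pair the wrong residues if the two files use different numbering.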
Next, the TM-score software was used to compute the TM-score values [31]. The command lines below are an example of this procedure. The "seq" parameter may be appended to align the sequence data via the software, but this practice is not foolproof and, if used, the sequences should be verified by inspection of the output file. Instead, it may be preferable to construct the alignments beforehand.
tmscore 5d2l_prediction_A.pdb 5d2l_ChainA-I.pdb > 5d2l_A_RMSD.out
tmscore 5d2l_prediction_B.pdb 5d2l_ChainB-J.pdb > 5d2l_B_RMSD.out
The output of the TCR alpha chain comparison (the first command above) reports a TM-score value of 0.9539 across 98 residues. For the beta chain, the TM-score value is 0.9426 across 110 residues. These values indicate that TCRBuilder2 is constructing models of protein structure that closely resemble the empirical models in the PDB record.
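When many comparisons are run, the values can be collected from the saved report files rather than read by eye. The sketch below is an assumption about the report layout: it looks for a line of the form "TM-score = value" and a line of the form "Number of residues in common= value", which should be checked against the actual output of the installed version. The function name is hypothetical, and the sample text is a reconstruction built from the values quoted above, not real program output.

```python
import re


def parse_tmscore_report(text):
    """Pull the TM-score and common-residue count out of a TM-score text report.

    The line layouts matched here are an assumption about the report format.
    Returns None for any value that is not found.
    """
    tm_match = re.search(r"TM-score\s*=\s*([0-9.]+)", text)
    n_match = re.search(r"Number of residues in common\s*=\s*(\d+)", text)
    return {
        "tm_score": float(tm_match.group(1)) if tm_match else None,
        "common_residues": int(n_match.group(1)) if n_match else None,
    }


# Reconstructed sample (not real program output) using the values in the text.
sample = """Number of residues in common=   98
TM-score    = 0.9539
"""
print(parse_tmscore_report(sample))
```

In practice the text would be read from a saved file such as 5d2l_A_RMSD.out, and a None value signals that the report did not match the assumed layout.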
Furthermore, TM-score has an option to compute the superposition data for viewing protein structure similarity (the "seq" parameter may be appended, if needed):
tmscore 5d2l_prediction_A.pdb 5d2l_ChainA-I.pdb -o 5d2l_A_SUP
tmscore 5d2l_prediction_B.pdb 5d2l_ChainB-J.pdb -o 5d2l_B_SUP
RasMol can display a tube diagram of the two superimposed protein structures (Figure 7 and Figure 8) [67]. The figure legends have further details.
./RasMol/raswin.exe -script 5d2l_A_SUP
./RasMol/raswin.exe -script 5d2l_B_SUP
5.4. Comments on TCR Modeling by Deep Learning
Even though ImmuneBuilder is competitive with the more generalized model of AlphaFold-Multimer, it is of interest to expand specialized deep learning models into ones that are applicable to a greater set of problems. For example, generalization is preferred to achieve a broader parameterization of 3D protein structure and to capture the rarer patterns of atomic arrangements. However, at this time, the specificity of the ImmuneBuilder approach is reasonable since it shows very good performance during inference and generation of 3D protein structure. Another goal of interest is expanding the collection of TCR data so that a model has the potential to sample data more broadly. Collectively, these goals, and others, contribute to an increased performance of these deep learning approaches and converge on the possibility of meaningful analysis of the shift in protein structure of a T cell receptor that is associated with changes in peptide binding, and, therefore, an eventual insight into the proximate mechanisms of adaptive immunity at the host population level.
The field of deep learning is expanding on techniques with applicability to TCR modeling, such as the use of a large language model to "perform evolutionary optimization over reward code" and the use of "the resulting rewards ... to acquire complex skills via reinforcement learning" [71]. This example was applied in the field of robotics, but the methodology is applicable to other tasks and to automation of trial-and-error experimentation, such as in an informatics pipeline for identification of immunogenic peptides and their putative association with an MHC receptor, and subsequently its downstream effects on immune surveillance by the T cell receptor repertoire. This kind of automation in deep learning is akin to an outer loop that has control over an inner loop which codes for the world of possible predictions, so the overall system is recursive in its operation and has the potential for planning and application to the practice of experimentation. This kind of perspective and approach is seen in deep learning in the intelligent processing of input and in verification methods for assessing model output. Related approaches that apply to immunogenetics are expanded upon below (Section 7).