A protocol for information-driven antibody-antigen modelling with the HADDOCK2.4 webserver
Francesco Ambrosetti, Zuzana Jandova, Alexandre M.J.J. Bonvin
A protocol for information-driven antibody-antigen modelling with the HADDOCK2.4 webserver.
Francesco Ambrosetti a , Zuzana Jandova a and Alexandre M.J.J. Bonvin a, * a Computational Structural Biology Group, Bijvoet Centre for Biomolecular Research, Faculty of Science - Chemistry, Utrecht University, Padualaan 8, 3584 CH Utrecht, the Netherlands. *Phone: +31.30.2533859, Email: [email protected] i. Summary
In the recent years, therapeutic use of antibodies has seen a huge growth, due to their inherent proprieties and technological advances in the methods used to study and characterize them. Effective design and engineering of antibodies for therapeutic purposes are heavily dependent on knowledge of the structural principles that regulate antibody-antigen interactions. Several experimental techniques such as X-ray crystallography, cryo-electron microscopy, NMR or mutagenesis analysis can be applied, but these are usually expensive and time consuming. Therefore computational approaches like molecular docking may offer a valuable alternative for the characterisation of antibody-antigen complexes. Here we describe a protocol for the prediction of the 3D structure of antibody-antigen complexes using the integrative modelling platform HADDOCK. The protocol consists of: 1) The identification of the antibody residues belonging to the hyper variable loops which are known to be crucial for the binding and can be used to guide the docking; 2) The detailed steps to perform docking with the HADDOCK 2.4 webserver following different strategies depending on the availability of information about epitope residues. ii.
Keywords
Antibody; HADDOCK; information-driven docking Introduction
Antibodies are key players of the immune response against external treats. They consist of two couples of identical polypeptide chains namely light (L) and heavy (H) chain kept together by disulphide bonds [1]. For both heavy and light chains it is possible to identify a variable domain, one constant domain for the light chain, or three or more for the heavy chain. The variable domains of the light (VL) and heavy chain (VH) are involved in the antigen recognition [2]. More specifically, the VL and VH regions contain the so-called Complementarity Determining Regions (CDRs), 3 for the VL and 3 for the VH, which display the highest level of variability and, in most of the cases, directly participate to the interaction with the antigen [3, 4]. Each CDR harbours one loop named hyper variable loop (HV) (six loops in total) which are crucial for the recognition of the cognate antigen and therefore offer a reasonable proxy of the antibody binding interface (paratope). Over the recent years, the development and application of antibodies for therapeutic purposes has been experiencing a significative boost [5] due to their high specificity and affinity towards specific targets (antigens) and to the advancements in the methodologies applied to produce and characterize them. Knowledge of the relationship between antibody and antigen residues and the way they interact is of paramount importance in order to elaborate effective strategies for their design and engineering [6, 7]. Structural information about their specific interactions can be obtained by using different experimental techniques such as X-ray crystallography, cryo-electron microscopy and Nuclear Magnetic Resonance (NMR). These can provide high resolution information but can also be time consuming, expensive, requiring high amounts of purified samples, and as such not always applicable. Valuable alternatives to the traditional experimental approaches are offered by computational methods which are faster and cheaper. One of them is molecular docking that aims at predicting the structure of a complex starting rom the free structures of its components [8]. The process involves two stages: The first one ( sampling ) consists of generating (tens of) thousands of different possible conformations, which are subsequently scored and ranked during the second stage (scoring ) in order to identify the models that have a higher probability of representing the native structure. Both sampling and scoring can benefit from the use of available information about e.g. binding interfaces, contacts or other types of information, to either bias the generation of models and/or select the models which are consistent with the provided information [9]. We have recently published an exhaustive comparison of the ability of four well established docking software namely, ClusPro [10], HADDOCK [11], LightDock [12] and ZDOCK [13], in predicting antibody-antigen complexes [14]. All of them can make use of information about the binding interface following different strategies. In this chapter we illustrate how the information about the antibody HV loops and the antigen binding interface (epitope) can be used in HADDOCK, which we demonstrated to be the most accurate for this specific task [14]. The described protocol includes the following steps: Preparation of the structures, selection of residues to use in order to drive the docking, setup of the docking run using the HADDOCK 2.4 web portal and finally analysis of the results. . Overview
This section describes the different steps to follow to perform the docking of an antibody-antigen complex with HADDOCK. HADDOCK is an information-driven docking algorithm which can make use of different types of information about the binding interface of two molecules in order to drive the docking [15, 16]. This information can be derived from various experimental sources such as NMR, hydrogen/deuterium exchange, mutagenesis analysis [17, 18] but it can also be extracted from bioinformatic predictions about putative binding sites [19] or co-evolving residues [20]. The information is encoded in the form of distance restraints (ambiguous or unambiguous) which are used both for sampling and scoring to guide the docking process. In the case of antibodies, as aforementioned, the residues belonging to the HV loops usually are directly involved in the interaction with the antigen and therefore represent a valuable information in order to investigate the antibody-antigen interaction. Nevertheless, on average an antibody uses only around 20-33% of the CDRs residues to engage the antigen [3] therefore depending on the task more precise information can be required. For this reason several computational paratope prediction methods have been developed over the years [21–24]. As for the antigen side, the identification of the epitope is far more challenging and despite the efforts of the community [25–32] none of the current epitope prediction methods show a reasonable accuracy. Their application for molecular docking is therefore largely limited. However, often some experimental evidence pointing to the epitope region on the antigen is available, which can be used to drive the docking with HADDOCK. .1.
HADDOCK docking protocol
The docking in HADDOCK follows three different stages namely: -
Rigid-body energy minimization (it0) -
Semi-flexible refinement by simulated annealing in torsional angle space (it1) -
Final refinement in Cartesian space (itw), with or without explicit solvent. In the first stages (it0), structures are considered as rigid, separated in spaces and randomly rotated to avoid any bias derived from the starting orientation. Then a rigid-body minimization, driven by the provided data usually transformed into ambiguous distance restraints, is performed allowing the structures to rotate and translate. Despite being rigid, this stage can accommodate ensembles of structures, effectively allowing the docking from various conformations. The second stage (it1) consists of simulated annealing protocol based on short molecular dynamic simulations in torsional angle space (bond lengths and angles are fixed) during which the orientation of side chains first, and then both backbone and side chains of the residues at the interface (defined using a 5Å cut-off) are optimized. Finally, the third stage, performed in Cartesian space, involves either a final energy minimization (default in HADDOCK2.4) or a short molecular dynamics simulation in explicit solvent (default in HADDOCK2.2) in order to further optimize the complex. In all of the three stages available information about the interaction can be included in the form of Ambiguous Interaction Restraints or AIRs (or specific distance restraints, e.g. derived from cross-linking mass spectrometry). These are included as an additional restraining term in the energy function that is minimized, effectively biasing the sampling to account for the information at hand. After each stage the generated models are scored and ranked according to the stage-specific HADDOCK score (HS) and only the top ranked models (number defined by the user, with a default of 200) move to the next docking step. The HADDOCK score consists of a linear combination of different terms namely: Intermolecular van der Waals (E vdw ) and electrostatic E elec ) energies, an empirical desolvation potential (E desolv ) [33], the ambiguous interaction restraints energy (E
AIR ) and the buried surface area (BSA). According to the docking stage different weights are associated with each term: - HS it0 = 1.0 E elec + 0.01 E vdw + 1.0 E desolv + 0.1 E AIR - 0.01 BSA - HS it1 = 1.0 E elec + 1.0 E vdw + 1.0 E desolv + 0.1 E AIR - 0.01 BSA - HS ref = 0.2 E elec + 1.0 E vdw + 1.0 E desolv + 0.1 E AIR
Clustering
Once all models have been generated HADDOCK performs a cluster analysis of the final models. By default, this is done based on the fraction of common contacts (FCC) using 0.6 as cutoff [34], but also the interface-ligand root mean square deviation (i-l-RMSD) can be used depending on the user’s selection. Method
In order to use the HADDOCK2.4 web-server registration is required. This can be done at: https://wenmr.science.uu.nl/auth/register/. To have access to all the functionalities required for this protocol it is necessary to request at least the
Expert level for HADDOCK (this can be done in your registration page). In the following paragraphs we will illustrate this protocol step by step, using as test case the 4G6M antibody-antigen complex present in the dataset we used in [14]. This complex describes the interaction between the humanized antibody
Gevokizumab (4G6K) and its cognate antigen, interleukin-1beta (4I1B) (both structures and the code used in this protocol are provided in the GitHub repository: https://github.com/haddocking/HADDOCK-antibody-antigen). We will describe two protocols corresponding to the first two scenarios described the our previous work [14], either hen no information is available for the antigen binding site, or when some vague information about the epitope is provided to guide the docking.
Installation
HMM 3.3 (http://hmmer.org/) [36] 3.
Biopython (https://biopython.org/) 4.
Biopandas (http://rasbt.github.io/biopandas/) [37] 5.
PDB-tools (https://github.com/haddocking/pdb-tools) [38] 6.
ANARCI (http://opig.stats.ox.ac.uk/webapps/newsabdab/sabpred/anarci/) [35] All the instructions are provided in the README.md file of the GitHub repository. 1.
Open a Linux window terminal and download the repository by typing: > git clone https://github.com/haddocking/HADDOCK-antibody-antigen Different strategies can be used to install all the dependencies: a. (Recommended) If you have anaconda installed following the installation instructions provided in the README.md file type: > cd HADDOCK-antibody-antigen > conda env create > conda activate Ab-HADDOCK > cd anarci-1.3 > python2.7 setup.py install conda deactivate > cd .. b. If you do not have anaconda you will need to install all the dependencies separately following the instructions on the README.md file and on the specific websites: i.
Download and install HMMER 3.3: http://hmmer.org/ iii.
Install the required python packages: > cd HADDOCK-antibody-antigen > pip install -r requirements.txt iv.
Install ANARCI: > cd anarci-1.3 > python2.7 setup.py install > cd ../..
Identification of the Hyper Variable loops
Starting from the antibody structure downloaded from the Protein Data Bank (PDB) [39] we will first identify the residues belonging to the HV loops in order to use them to drive the docking. The steps are the following: 1. (Optional) Only if you have used anaconda to install dependencies you need to activate the corresponding environment by typing: > conda activate Ab-HADDOCK Using a terminal window run ANARCI to renumber the antibody (4G6K) according to the Chothia numbering scheme: > python2.7 ImmunoPDB.py -i 4G6K.pdb -o 4G6K_ch.pdb --scheme c –fvonly –rename --splitscfv he argument -i , -o , and --scheme are related to the input file, output file and to the numbering scheme to be used, respectively. --fvonly tells the script to only output the variable domain of the antibody discarding the constant one known to be not involved in the interaction with the antigen. The option -- rename is used to allow the script to rename the antibody chain IDs in H and L for the heavy and light chain respectively. Finally --splitscfv is provided in order to tell the script to split the variable domain of single chain antibodies. 3. Format the antibody structure in order to match the HADDOCK format requirements (see
Note 1 ) and extract the new residue number of the HV loop residues: > python ab_haddock_format.py 4G6K_ch.pdb 4G6K-HADDOCK.pdb A > active.txt
The first argument is the input PDB file that must be formatted according to the Chothia numbering scheme (see
Note 2 ), the second one is the name of the output formatted PDB file and finally the third one is the chain ID to be used in the HADDOCK formatted version of the PDB file. This scripts also outputs a comma separated list of residues associated to the amino acids of the antibody HV loops (see
Figure 1 ). In the above command this list is written the active.txt file. 4.
Add TER and END statements to the HADDOCK-ready PDB file ( ) > pdb_tidy 4G6K-HADDOCK.pdb > oo; mv oo 4G6K-HADDOCK.pdb Figure 1 : Structure of the variable domain of an antibody (PDB code: 4G6K). In red are highlighted the hyper variable loops defined according to the Chothia numbering scheme.
Preparation of the input files
The PDB files used as input in HADDOCK need to be carefully checked in order to fit the HADDOCK format requirements, e.g. removing double occupancies and insertions (relevant in the case of antibodies). For removing double occupancies the script pdb_selaltloc.py provided in the PDB-tools repository (https://github.com/haddocking/pdb-tools) [38] can be used, while insertions have been automatically removed using the ab_ haddock_format.py script as shown in the step 3 of the previous paragraph. Alternatively, to deal with the insertions, it is possible to use the script pdb_delinsertion.py of the PDB-tools repository. .4.
Antibody-antigen docking
For the docking we will use the new HADDOCK2.4 web portal: https://wenmr.science.uu.nl/haddock2.4/. To follow this protocol you need to be registered and have at least
Expert level access. 1.
From an Internet browser go to: https://wenmr.science.uu.nl/haddock2.4/ and click on “
Submit a new job” . 2.
Provide a job name for the docking run. Here we will use: In the
Molecule 1 section for the entries “
Where is the structure provided? ” and "
Which chain of the structure must be used? " leave the default options: “
I am submitting it ” and “
All ”, respectively. Then click on “
Choose file ” and upload the HADDOCK-formatted PDB file of the antibody created at the stage 3 of paragraph 3.2 ( ). Leave the remaining options to their default values (see
Figure 2 ). Figure 2 : Representation of the HADDOCK input page. In the
Molecule 2 section as before leave “
I am submitting it ” and “
All ” for the entries “
Where is the structure provided? ” and "
Which chain of the structure must be used? ", respectively and by clicking on “
Choose file ” upload the antigen PDB file provided in the repository ( ). Leave the remaining options to their default values. 5.
Click “
Next ” at the bottom of the page (see
Note 3 ). .
In the “
Molecule 1 - parameters ” section copy and paste the list of the HV loop residues from the active.txt file created in the step 3 of the previous paragraph and deactivate the option: “
Automatically define passive residues around the active residues ”. Leave the other parameters to their default values. 7.
In the “
Molecule 2 - parameters ” different strategies need to be followed according to the availability or not of information about the epitope residues: a.
When no information about the epitope is provided deactivate the option: “
Automatically define passive residues around the active residues ” and enable the option: “
Automatically define surface residues as passive ”. In the entry “
If you specified that surface residues will be defined automatically as passive, selection will use the following RSA (relative solvent accessibility) cutoff ” leave 0.40 as cutoff. Leave the other options to their default parameters. b.
When some information about the epitope is available deactivate the option: “
Automatically define passive residues around the active residues ” and in the field “
Passive residues (surrounding surface residues) ” copy and paste the list of comma separated epitope residues. Just as an example here we will use the residues: which represent all of the antigen residues within 9Å from the antibody in the complex reference structure (4G6M), filtered by their relative solvent accessibility (≥0.40) upon the removal of the antibody. 8.
Click “
Next ” at the bottom of the page. 9.
In the “
Sampling parameters ” section, depending on whether you are using information about the epitope or not, it is necessary to tune the sampling parameters: a.
If you are not using any information about the epitope change increase the sampling by changing “
Number of structures for rigid body docking ”, “
Number f structures for semi-flexible refinement ” and “
Number of structures for water refinement ”, to 10000, 400 and 400 respectively (see
Note 4 ). b.
If you are providing a loose definition of the epitope leave the entries: “
Number of structures for rigid body docking ”, “
Number of structures for semi-flexible refinement ” and “
Number of structures for water refinement ”, to 1000, 200 and 200 respectively (the default values). Leave the other options unchanged. 10.
Click “
Submit ” at the bottom of the page. 11.
A page reporting a small summary of your run and the progress bar of the docking job will be loaded. This page refreshes every 30 seconds (see
Note 5 ). 12.
Depending on the load on the server, the docking process might take between one and several hours. A link with the results will be sent by e-mail. From this page it is possible to have an overview of the docking run, inspect models, download all cluster representatives or download the full archive of the run, including all models from all stages as a gzipped tar file (typically a few 100MBs to a few GBs in size, depending on the system size). 13.
The result page reports the number of clusters and for the top 10 clusters also the related statistics (HADDOCK score, Size, RMSD, Energies, BSA and Z-score). While the name of the clusters is defined by their size (cluster 1 is the largest, followed by cluster 2 etc..) the top 10 clusters are selected and sorted according to the average HADDOCK score of the best 4 models of each cluster, from the lowest HADDOCK score to the highest. The various models can be directly visualized online by clicking on the eye icon, or downloaded for further analysis. 14.
The bottom of the result page, the “
Model Analysis ” section, presents interactive plots. These allow the user to inspect the distributions of HADDOCK score and associated nergetic terms as a function of the fraction of common contacts (FCC) and interface RMSD (i-RMSD). FCC and i-RMSD are calculated with respect to the overall top scoring model. The points are color-coded by cluster. Clusters can be turned on and off. Finally, in the “
Cluster analysis ” section the distribution of the energetic terms of the HADDOCK score are shown for each cluster (see
Figure 3 ). Figure 3 : Illustration of the result page of HADDOCK2.4. A ) The result page reports different statistics for the top 10 HADDOCK clusters. Those consist of the HADDOCK score, various energetic and geometric terms and the positional interface ligand RMSD (i-l-RMSD) of the top 4 clustered models with respect to the model with the overall lowest HADDOCK score. Their average and standard deviation are calculated from the top 4 ranked models of each cluster. The z-score expresses the number of standard deviations that a cluster score is from the average of all clusters scores. B ) The result page allows to directly visualize the models in an interactive window. C ) View of two of the interactive plots provided in the results page: HADDOCK score versus fraction of common contacts (FCC) (left) and interface RMSD (right) with respect to the top ranked model. n this particular example, it is possible to compare the docking results with the reference structure (4G6M) for example by downloading all the cluster representative models. Those models can be directly downloaded from the result page or can be found in the directory containing all the results. In the CAPRI (Critical PRediction of Interactions) [40] experiment, one of the parameters used is the Ligand RMSD (l-RMSD) which is calculated by superimposing the structures onto the backbone atoms of the receptor (the antibody in this case) and calculating the RMSD on the backbone residues of the ligand (the antigen) (see Figure 4 > alter all, segi='' > align cluster5_1 and chain A, 4G6M-matched and chain A, cycles=0 > rms_cur cluster5_1 and chain B, 4G6M-matched
As example here we used the best structure of the best cluster but the action can be repeated for all the HADDOCK-models (see
Note 6 ). Figure 4 : Comparison of the top ranked model of the top ranked HADDOCK cluster with the reference structure (4G6M). The structures are superimposed on the antibody (ligand RMSD = 8.0Å and interface RMSD = 2.7Å) (see
Note 6 ). The antibody and the antigen of the docking model are represented in blue and green, respectively. The reference structure is reported in grey. The docking was driven by using the hypervariable loops (HV) of the antibody and the epitope residues as defined in the text (paragraph 3.4 step 7b).
Acknowledgments
This work is supported by the European Union Horizon 2020 BioExcel (grant otes The PDB file of the antibody has to be formatted to match the HADDOCK format requirements. Here the antibody is treated as a single chain molecule with no insertions and no overlapping numbering. To validate the PDB format it is also possible to use the pdb_validate.py of the PDB-tools directory (https://github.com/haddocking/pdb-tools). 2.
The PDB file used to extract the hyper variable loops must be numbered according to the Chothia numbering scheme (different schemes would lead to the wrong identification of the HV residues) and the heavy and light chains must have appropriate IDs: H and L, respectively. 3.
After the submission of the two molecule they are validated by running through Molprobity [41]. This also allows for the automatic definition of the histidine protonation state. 4.
In general, the less information is provided the larger the sampling should be. In this specific case the increase of the sampling is necessary in order to enable HADDOCK to sample the entire surface of the antigen. 5.
The user will be notified by email upon successful submission and after completion of the run. It is therefore not required to leave the web page open all the time. 6.
Given the intrinsic chaotic nature of the computations performed by HADDOCK, which distributes computations on a worldwide grid of computers, differences in scores and accordingly cluster rankings might be expected unless the runs are performed on exactly the same hardware, operating system and using the same executable. The overall results in term of models should however be consistent. eferenceseferences