Virtual screening of GPCRs: an in silico chemogenomics approach
Laurent Jacob, Brice Hoffmann, Véronique Stoven, Jean-Philippe Vert
arXiv [q-bio.QM], Jan

Laurent Jacob ∗
Institut Curie, Paris, F-75248 France
INSERM, U900, Paris, F-75248 France
Ecole des Mines de Paris, F-77300 France
laurent.jacob@ensmp.fr

Brice Hoffmann
Institut Curie, Paris, F-75248 France
INSERM, U900, Paris, F-75248 France
Ecole des Mines de Paris, F-77300 France
brice.hoffmann@ensmp.fr

Véronique Stoven
Institut Curie, Paris, F-75248 France
INSERM, U900, Paris, F-75248 France
Ecole des Mines de Paris, F-77300 France
veronique.stoven@ensmp.fr

Jean-Philippe Vert
Institut Curie, Paris, F-75248 France
INSERM, U900, Paris, F-75248 France
Ecole des Mines de Paris, F-77300 France
jean-philippe.vert@ensmp.fr

November 7, 2018

Abstract

The G-protein coupled receptor (GPCR) superfamily is currently the largest class of therapeutic targets. In silico prediction of interactions between GPCRs and small molecules is therefore a crucial step in the drug discovery process, which remains a daunting task due to the difficulty of characterizing the 3D structure of most GPCRs, and to the limited number of known ligands for some members of the superfamily. Chemogenomics, which attempts to characterize interactions between all members of a target class and all small molecules simultaneously, has recently been proposed as an interesting alternative to traditional docking or ligand-based virtual screening strategies.

We propose new methods for in silico chemogenomics and validate them on the virtual screening of GPCRs. The methods represent an extension of a recently proposed machine learning strategy, based on support vector machines (SVM), which provides a flexible framework to incorporate various information sources on the biological space of targets and on the chemical space of small molecules. We investigate the use of 2D and 3D descriptors for small molecules, and test a variety of descriptors for GPCRs.
We show for instance that incorporating information about the known hierarchical classification of the target family and about key residues in their inferred binding pockets significantly improves the prediction accuracy of our model. In particular we are able to predict ligands of orphan GPCRs with an estimated accuracy of . .

∗ The first two authors contributed equally to this work.

1 Introduction

The G-protein coupled receptor (GPCR) superfamily comprises an estimated 600-1,000 members and is the largest known class of molecular targets with proven therapeutic value. GPCRs are ubiquitous in our body, being involved in the regulation of every major mammalian physiological system (Bockaert and Pin, 1999), and play a role in a wide range of disorders including allergies, cardiovascular dysfunction, depression, obesity, cancer, pain, diabetes, and a variety of central nervous system disorders (Deshpande and Penn, 2006; Hill, 2006; Catapano and Manji, 2007). They are integral membrane proteins sharing a common global topology that consists of seven transmembrane alpha helices, an intracellular C-terminal, an extracellular N-terminal, three intracellular loops and three extracellular loops. There are four main classes of GPCRs (A, B, C and D) depending on their sequence similarity (Horn et al., 2003).
Their location on the cell surface makes them readily accessible to drugs, and GPCRs are the targets for the majority of best-selling drugs, representing about of all prescription pharmaceuticals on the market (Fredholm et al., 2007). Besides, the human genome contains several hundred unique GPCRs which have yet to be assigned a clear cellular function, suggesting that they are likely to remain an important target class for new drugs in the future (Lin and Civelli, 2004).

Predicting interactions in silico between small molecules and GPCRs is not only of particular interest for the drug industry, but also a useful step for the elucidation of many biological processes. First, it may help to decipher the function of so-called orphan GPCRs, for which no natural ligand is known. Second, once a particular GPCR is selected as a target, it may help in the selection of promising molecule candidates to be screened in vitro against the target for lead identification.

In silico virtual screening of GPCRs with classical approaches is however a daunting task for at least two reasons. First, the 3D structures are currently known for only two GPCRs (bovine rhodopsin and human β2-adrenergic receptor). Indeed, GPCRs, like other membrane proteins, are notoriously difficult to crystallize. As a result, docking strategies for screening small molecules against GPCRs are often limited by the difficulty of correctly modeling the 3D structure of the target.
To circumvent the lack of experimental structures, various studies have used 3D structural models of GPCRs built by homology modeling using bovine rhodopsin as a template structure. Docking a library of molecules into these modeled structures allowed the recovery of known ligands (Evers and Klabunde, 2005), and even the identification of new ligands (Cavasotto et al., 2003). However, docking methods still suffer from docking and scoring inaccuracies, and homology models are not always reliable enough to be employed in target-based virtual screening. Methods have been proposed to enhance the quality of the models by global optimization and flexible docking (Cavasotto et al., 2003), or by using different sets of receptor models. Nevertheless, these methods are expected to show limited performance for GPCRs sharing low sequence similarity with rhodopsin, especially in the case of receptors belonging to classes B, C and D. Alternatively, ligand-based strategies, also known as quantitative structure-activity relationship (QSAR), attempt to predict new ligands from previously known ligands, often using statistical or machine learning approaches. Ligand-based approaches are interesting because they do not require the knowledge of the target 3D structure and can benefit from the discovery of new ligands. However, their accuracy is fundamentally limited by the number of known ligands, and degrades when few ligands are known. Although these methods were successfully used to retrieve strong GPCR binders (Rolland et al., 2005), they are efficient for lead optimization within a previously identified molecular scaffold, but are not appropriate to identify new families of ligands for a target.
At the extreme, they cannot be pursued for the screening of orphan GPCRs.

Instead of focusing on each individual target independently from other proteins, a recent trend in the pharmaceutical industry, often referred to as chemogenomics, is to screen molecules against several targets of the same family simultaneously (Kubinyi et al., 2004; Jaroch and Weinmann, 2006). This systematic screening of interactions between the chemical space of small molecules and the biological space of protein targets can be thought of as an attempt to fill a large 2D interaction matrix, where rows correspond to targets, columns to small molecules, and the $(i, j)$-th entry of the matrix indicates whether the $j$-th molecule can bind the $i$-th target. While in general the matrix may contain some description of the strength of the interaction, such as the association constant of the complex, we will focus in this paper on a simplified description that only differentiates binding from non-binding molecules, which results in a binary matrix of target-molecule pairs. This matrix is already sparsely filled with our current knowledge of protein-ligand interactions, and chemogenomics attempts to fill the holes. While classical docking or ligand-based virtual screening strategies focus on each single row independently from the others in this matrix, i.e., treat each target independently from the others, the chemogenomics approach is motivated by the observation that similar molecules can bind similar proteins, and that information about a known interaction between a ligand and a GPCR could therefore be a useful hint to predict interactions between similar molecules and similar GPCRs. This can be of particular interest when, for example, a particular target has few or no known ligands, but similar proteins have many: in that case it is tempting to use the information about the known ligands of similar proteins for a ligand-based virtual screening of the target of interest.
In this context, we can formally define in silico chemogenomics as the problem of predicting interactions between a molecule and a target (i.e., a hole in the matrix) from the knowledge of all other known interactions or non-interactions (i.e., the known entries of the matrix).

Recent reviews (Kubinyi et al., 2004; Jaroch and Weinmann, 2006; Klabunde, 2007; Rognan, 2007) describe several strategies for in silico chemogenomics. A first class of approaches, called ligand-based chemogenomics by Rognan (2007), pools together targets at the level of families (such as GPCR) or subfamilies (such as purinergic GPCR) and learns a model for ligands at the level of the family (Balakin et al., 2002; Klabunde, 2006). Other approaches, termed target-based chemogenomic approaches by Rognan (2007), cluster receptors based on ligand binding site similarity and again pool together known ligands for each cluster to infer shared ligands (Frimurer et al., 2005). Finally, a third strategy, termed target-ligand approach by Rognan (2007), attempts to predict ligands for a given target by leveraging binding information for other targets in a single step, that is, without first attempting to define a particular set of similar receptors. This strategy was pioneered by Bock and Gough (2005) to predict ligands of orphan GPCRs. They merged descriptors of ligands and targets to describe putative ligand-receptor complexes, and used SVM to discriminate real complexes from ligand-receptor pairs that do not form complexes. Erhan et al. (2006) followed a similar idea with different descriptors, and showed in particular that the SVM formulation makes it possible to generalize the use of vectors of descriptors to the use of positive definite kernels to describe the chemical and the biological space in a computationally efficient framework. Erhan et al.
(2006) were not able to show, however, significant benefits with respect to the individual approach that learns a separate classifier for each GPCR (except in the case of orphan GPCRs, for which their approach performed better than the baseline random classifier). Recently, in the context of predicting interactions between peptides and different alleles of MHC-I molecules, Jacob and Vert (2008) followed a similar approach and highlighted the importance of choosing adequate descriptors for small molecules and targets. They obtained state-of-the-art prediction accuracy for most MHC-I alleles, in particular for those with few known binding peptides.

In this paper we go one step further in this direction and present an in silico chemogenomics approach specifically tailored for the screening of GPCRs, although the method could in principle be adapted to other classes of therapeutic targets. We follow the idea of Bock and Gough (2005) and the algorithmic trick of Erhan et al. (2006), which allows us to systematically test a variety of descriptors for both the molecules and the GPCRs. We test two families of 2D and 3D descriptors to describe molecules, including a new 3D kernel, and six ways to describe GPCRs, including a description of their relative positions in current hierarchical classifications of the superfamily, and information about key residues likely to be in contact with the ligand. We test the approach on the data of the GLIDA database (Okuno et al., 2006), which contains reported interactions between human GPCRs and small molecules, and observe that the choice of the descriptors has a significant impact on the accuracy of the models. In particular, the best results are reached when using the description of GPCRs within the hierarchical classification of the superfamily, combined with a set of 2D descriptors of small molecules.
This allows us to obtain dramatic improvements of the prediction accuracy with respect to the individual learning setting. In an experiment where we simulate the prediction of ligands for orphan GPCRs, we obtain accuracies of . , significantly above the baseline accuracy of a random predictor.

2 Method

In this section, we first review the methods proposed by Bock and Gough (2005) and Erhan et al. (2006) for in silico chemogenomics with SVM, before presenting the particular descriptors we propose to use for molecules and GPCRs within this framework.

2.1 In silico chemogenomics with machine learning

We consider the problem of predicting interactions between GPCRs and small molecules. For this purpose we assume that a list of target/small molecule pairs $\{(t_1, m_1), \ldots, (t_n, m_n)\}$, known to interact or not, is given. Such information is often available as a result of systematic screening campaigns in the pharmaceutical industry, or in dedicated databases. Our goal is then to create a model to predict, for any new candidate pair $(t, m)$, whether the small molecule $m$ is likely to bind the GPCR $t$.

A general method to create the predictive model is to follow these four steps:

1. Choose $n_{\mathrm{tar}}$ descriptors to represent each GPCR target $t$ in the biological space by an $n_{\mathrm{tar}}$-dimensional vector $\Phi_{\mathrm{tar}}(t) = (\Phi^{1}_{\mathrm{tar}}(t), \ldots, \Phi^{n_{\mathrm{tar}}}_{\mathrm{tar}}(t))$;
2. In parallel, choose $n_{\mathrm{mol}}$ descriptors to represent each molecule $m$ in the chemical space by an $n_{\mathrm{mol}}$-dimensional vector $\Phi_{\mathrm{mol}}(m) = (\Phi^{1}_{\mathrm{mol}}(m), \ldots, \Phi^{n_{\mathrm{mol}}}_{\mathrm{mol}}(m))$;
3. Derive a vector representation of a candidate target/molecule complex $\Phi_{\mathrm{pair}}(t, m)$ from the representations of the target $\Phi_{\mathrm{tar}}(t)$ and of the molecule $\Phi_{\mathrm{mol}}(m)$;
4. Use a statistical or machine learning method to train a classifier able to discriminate between binding and non-binding pairs, using the training set of binding and non-binding pairs $\{\Phi_{\mathrm{pair}}(t_1, m_1), \ldots, \Phi_{\mathrm{pair}}(t_n, m_n)\}$.

While the first two steps (selection of descriptors) may be specific to each particular chemogenomics problem, the last two steps define the particular strategy used for in silico chemogenomics. For example, Bock and Gough (2001, 2005) proposed to concatenate the vectors $\Phi_{\mathrm{tar}}(t)$ and $\Phi_{\mathrm{mol}}(m)$ to obtain an $(n_{\mathrm{tar}} + n_{\mathrm{mol}})$-dimensional vector representation of the ligand-target complex $\Phi_{\mathrm{pair}}(t, m)$, and to use an SVM as a machine learning engine. Erhan et al. (2006) followed a slightly different strategy for the third step, by forming descriptors for the pair $(t, m)$ as products of small molecule and target descriptors. More precisely, given a molecule $m$ described by a vector $\Phi_{\mathrm{mol}}(m)$ and a GPCR $t$ described by a vector $\Phi_{\mathrm{tar}}(t)$, the pair $(t, m)$ is represented by the tensor product:

$$\Phi_{\mathrm{pair}}(t, m) = \Phi_{\mathrm{tar}}(t) \otimes \Phi_{\mathrm{mol}}(m), \tag{1}$$

that is, an $(n_{\mathrm{tar}} \times n_{\mathrm{mol}})$-dimensional vector whose entries are products of the form $\Phi^{i}_{\mathrm{tar}}(t) \times \Phi^{j}_{\mathrm{mol}}(m)$, for $1 \le i \le n_{\mathrm{tar}}$ and $1 \le j \le n_{\mathrm{mol}}$. An SVM is then used as an inference engine, to estimate a linear function $f(t, m)$ in the vector space of target/molecule pairs that takes positive values for interacting pairs and negative values for non-interacting ones.

The main motivation for using the tensor product (1) is that it provides a systematic way to encode correlations between small molecule and target features. For example, in the case of binary descriptors, the product of two features is 1 if both the molecule and the target descriptors are 1, and zero otherwise, which amounts to encoding the simultaneous presence of particular features of the molecule and of the target that may be important for the formation of a complex. A potential issue with this approach, however, is that the size of the vector representation $n_{\mathrm{tar}} \times n_{\mathrm{mol}}$ for a pair may be prohibitively large for practical computation and manipulation.
For example, using a vector of molecular descriptors of size for molecules, and representing a protein by the vector of counts of all 2-mers of amino acids in its sequence ($d_t = 20 \times 20 = 400$) results in more than 400k dimensions for the representation of a pair. As pointed out by Erhan et al. (2006), this computational obstacle can however be overcome when an SVM is used to train the linear classifier, thanks to a trick often referred to as the kernel trick. Indeed, an SVM does not necessarily need the explicit computation of the vectors representing the complexes in the training set to train a model. What it needs, instead, is the inner products between these vectors, and a classical property of tensor products is that the inner product between two tensor products $\Phi_{\mathrm{pair}}(t, m)$ and $\Phi_{\mathrm{pair}}(t', m')$ is the product of the inner product between $\Phi_{\mathrm{tar}}(t)$ and $\Phi_{\mathrm{tar}}(t')$, on the one hand, and the inner product between $\Phi_{\mathrm{mol}}(m)$ and $\Phi_{\mathrm{mol}}(m')$, on the other hand. More formally, this property can be written as follows:

$$\left(\Phi_{\mathrm{tar}}(t) \otimes \Phi_{\mathrm{mol}}(m)\right)^{\top} \left(\Phi_{\mathrm{tar}}(t') \otimes \Phi_{\mathrm{mol}}(m')\right) = \Phi_{\mathrm{tar}}(t)^{\top} \Phi_{\mathrm{tar}}(t') \times \Phi_{\mathrm{mol}}(m)^{\top} \Phi_{\mathrm{mol}}(m'), \tag{2}$$

where $u^{\top} v = u_1 v_1 + \ldots + u_d v_d$ denotes the inner product between two $d$-dimensional vectors $u$ and $v$. In other words, the SVM does not need to compute the $(n_{\mathrm{tar}} \times n_{\mathrm{mol}})$-dimensional vectors that describe each pair; it only computes the respective inner products in the target and ligand spaces, before taking the product of both numbers.

This flexibility to manipulate molecule and target descriptors separately can moreover be combined with other tricks that sometimes make it possible to compute the inner products in the target and ligand spaces efficiently.
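Identity (2) can be checked numerically on toy vectors. The following is a minimal sketch and not the authors' implementation: the helper names `inner` and `tensor_product` and all descriptor values are hypothetical, chosen only for illustration.

```python
from itertools import product


def inner(u, v):
    """Plain inner product between two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def tensor_product(phi_tar, phi_mol):
    """Explicit tensor product: every product phi_tar[i] * phi_mol[j], flattened."""
    return [a * b for a, b in product(phi_tar, phi_mol)]


# Toy descriptors (hypothetical values) for two targets and two molecules.
phi_t, phi_t2 = [1.0, 0.0, 2.0], [0.5, 1.0, 1.0]
phi_m, phi_m2 = [1.0, 3.0], [2.0, -1.0]

# Left-hand side of (2): inner product in the (n_tar * n_mol)-dimensional pair space.
lhs = inner(tensor_product(phi_t, phi_m), tensor_product(phi_t2, phi_m2))

# Right-hand side of (2): product of the two low-dimensional inner products.
rhs = inner(phi_t, phi_t2) * inner(phi_m, phi_m2)

print(lhs, rhs)
```

Both quantities agree, but the right-hand side never builds the pair vector, which is what keeps the approach tractable when the product of the two descriptor dimensions is large.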
Many such inner products, also called kernels, have been developed recently both in computational biology (Schölkopf et al., 2004) and chemistry (Kashima et al., 2003; Gärtner et al., 2003; Mahé et al., 2005), and can be easily combined within the chemogenomics framework as follows: if two kernels for molecules and targets are given as:

$$K_{\mathrm{mol}}(m, m') = \Phi_{\mathrm{mol}}(m)^{\top} \Phi_{\mathrm{mol}}(m'), \qquad K_{\mathrm{tar}}(t, t') = \Phi_{\mathrm{tar}}(t)^{\top} \Phi_{\mathrm{tar}}(t'), \tag{3}$$

then we obtain the inner product between tensor products, i.e., the kernel between pairs, by:

$$K\left((t, m), (t', m')\right) = K_{\mathrm{tar}}(t, t') \times K_{\mathrm{mol}}(m, m'). \tag{4}$$

In summary, as soon as two vectors of descriptors or kernels $K_{\mathrm{mol}}$ and $K_{\mathrm{tar}}$ are chosen, we can solve the in silico chemogenomics problem with an SVM using the product kernel (4) between pairs. The particular descriptors or kernels used should ideally encode properties related to the ability of similar molecules to bind similar targets, and of similar targets to bind similar ligands.

In the next two subsections, we present different possible choices of descriptors (or kernels) for small molecules and GPCRs, respectively.

2.2 Descriptors for small molecules

The problem of explicitly representing and storing small molecules as finite-dimensional vectors has a long history in chemoinformatics, and a multitude of molecular descriptors have been proposed (Todeschini and Consonni, 2002). These descriptors include in particular physicochemical properties of the molecules, such as solubility or logP, descriptors derived from the 2D structure of the molecule, such as fragment counts or structural fingerprints, or descriptors extracted from the 3D structure (Gasteiger and Engel, 2003).
Each classical fingerprint vector and vector representation of molecules defines an explicit "chemical space" in which each molecule is represented by a finite-dimensional vector, and these vector representations can obviously be used as such to define kernels between molecules (Azencott et al., 2007). Alternatively, some authors have recently proposed kernels that generalize some of these sets of descriptors and correspond to inner products between large- or even infinite-dimensional vectors of descriptors. These descriptors encode, for example, the counts of an infinite number of walks on the graph describing the 2D structure of the molecules (Kashima et al., 2004; Gärtner et al., 2003; Mahé et al., 2005), or various features extracted from the 3D structures (Mahé et al., 2006; Azencott et al., 2007).

In this study we select two existing kernels, encoding respectively 2D and 3D structural information of the small molecules, and propose a new 3D kernel:

• The 2D Tanimoto kernel. Our first set of descriptors is meant to characterize the 2D structure of the molecules. For a small molecule $m$, we define the vector $\Phi_{\mathrm{mol}}(m)$ as the binary vector whose bits indicate the presence or absence of all linear graphs of length $u$ or less as subgraphs of the 2D structure of $m$. We chose $u = 8$ in our experiment, i.e., we characterize the molecules by the occurrences of linear subgraphs of length 8 or less, a value previously observed to give good results in several virtual screening tasks (Mahé et al., 2005). Moreover, instead of directly taking the inner product between vectors as in (3), we use the Tanimoto kernel:

$$K_{\mathrm{mol}}(m, m') = \frac{\Phi_{\mathrm{mol}}(m)^{\top} \Phi_{\mathrm{mol}}(m')}{\Phi_{\mathrm{mol}}(m)^{\top} \Phi_{\mathrm{mol}}(m) + \Phi_{\mathrm{mol}}(m')^{\top} \Phi_{\mathrm{mol}}(m') - \Phi_{\mathrm{mol}}(m)^{\top} \Phi_{\mathrm{mol}}(m')}, \tag{5}$$

which was proven to be a valid inner product by Ralaivola et al. (2005), and which gives very competitive results on a variety of QSAR or toxicity prediction experiments.

• The 3D pharmacophore kernel. While 2D structures are known to be very competitive in ligand-based virtual screening (Azencott et al., 2007), we reasoned that some specific 3D conformations of a few atoms or functional groups may be responsible for the interaction with the target. Thus, we decided to test descriptors representing the presence of potential 3-point pharmacophores. For this, we used the 3D pharmacophore kernel proposed by Mahé et al. (2006), which generalizes 3D pharmacophore fingerprint descriptors. This approach implies the choice of a 3D conformer for each molecule. In the absence of sufficient data on bound ligands in GPCR structures, we chose to build a 3D version of the ligand base in which molecules are represented in an estimated minimum energy conformation. For each of the retained ligands, conformers were generated with the Omega program (OpenEye Scientific Software) using standard parameters, except for a RMSD clustering of the conformers, instead of the . default value. A 3D ligand base was generated by keeping the conformer of lowest energy for each ligand. Partial charges were calculated for all atoms using the molcharge program (OpenEye Scientific Software) with standard parameters. This ligand base was then used to calculate a 3D pharmacophore kernel for molecules (Mahé et al., 2006).

We used the freely and publicly available ChemCPP software (available at http://chemcpp.sourceforge.net) to compute the 2D and 3D pharmacophore kernels.
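For binary fingerprints, the Tanimoto kernel (5) reduces to the ratio between the number of features shared by two molecules and the number of features present in at least one of them. The sketch below illustrates this on fingerprints stored as sets of active features. It is only an illustration under stated assumptions: the function name `tanimoto_kernel`, the molecule names and the feature strings are hypothetical, the value returned for two empty fingerprints is a convention chosen here, and the study itself relied on ChemCPP rather than on this code.

```python
def tanimoto_kernel(fp_a, fp_b):
    """Tanimoto kernel (5) for binary fingerprints given as sets of active features."""
    common = len(fp_a & fp_b)                 # Phi(a)^T Phi(b) for binary vectors
    union = len(fp_a) + len(fp_b) - common    # denominator of (5)
    # Assumption: two empty fingerprints are given similarity 0 to avoid 0/0.
    return common / union if union else 0.0


# Hypothetical fingerprints: each string stands for one linear-path feature.
fingerprints = {
    "mol_a": {"C", "O", "C-C", "C-O", "C-C-O"},
    "mol_b": {"C", "N", "C-C", "C-N", "C-C-N"},
    "mol_c": {"C", "O", "C-O"},
}

names = sorted(fingerprints)
# Gram matrix of the molecule kernel, as would be passed to a kernel method.
gram = [
    [tanimoto_kernel(fingerprints[a], fingerprints[b]) for b in names]
    for a in names
]

for name, row in zip(names, gram):
    print(name, [round(value, 3) for value in row])
```

Storing only the active bits keeps the computation proportional to the number of features actually present, which matters when the number of possible linear subgraphs is large.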
2.3 Descriptors for GPCRs

SVM and kernel methods are also widely used in bioinformatics (Schölkopf et al., 2004), and a variety of approaches have been proposed to design kernels between proteins, ranging from kernels based on the amino-acid sequence of a protein (Jaakkola et al., 2000; Leslie et al., 2002; Tsuda et al., 2002; Leslie et al., 2004; Vert et al., 2004; Kuang et al., 2005; Cuturi and Vert, 2005) to kernels based on the 3D structures of proteins (Dobson and Doig, 2005; Borgwardt et al., 2005; Qiu et al., 2007) or on the pattern of occurrences of proteins in multiple sequenced genomes (Vert, 2002). These kernels have been used in conjunction with SVM or other kernel methods for various tasks related to structural or functional classification of proteins. While any of these kernels can theoretically be used as a GPCR kernel in (4), we investigate in this paper a restricted list of specific kernels described below, aimed at illustrating the flexibility of our framework and at testing various hypotheses.

• The Dirac kernel between two targets $t, t'$ is:

$$K_{\mathrm{Dirac}}(t, t') = \begin{cases} 1 & \text{if } t = t', \\ 0 & \text{otherwise.} \end{cases} \tag{6}$$

This basic kernel simply represents different targets as orthonormal vectors. From (4) we see that orthogonality between two proteins $t$ and $t'$ implies orthogonality between all pairs $(t, m)$ and $(t', m')$ for any two small molecules $m$ and $m'$. This means that a linear classifier for pairs $(t, m)$ with this kernel decomposes as a set of independent linear classifiers for interactions between molecules and each target protein, which are trained without sharing any information on known ligands between different targets. In other words, using the Dirac kernel for proteins amounts to performing classical learning independently for each target, which is our baseline approach.

• The multitask kernel between two targets $t, t'$ is defined as:

$$K_{\mathrm{multitask}}(t, t') = 1 + K_{\mathrm{Dirac}}(t, t').$$
This kernel, originally proposed in the context of multitask learning (Evgeniou et al., 2005), removes the orthogonality of different proteins to allow sharing of information. As explained in Evgeniou et al. (2005), plugging $K_{\mathrm{multitask}}$ in (4) amounts to decomposing the linear function used to predict interactions as a sum of a linear function common to all GPCRs and of a linear function specific to each GPCR:

$$f(t, m) = w^{\top} \Phi_{\mathrm{pair}}(t, m) = w_{\mathrm{general}}^{\top} \Phi_{\mathrm{mol}}(m) + w_{t}^{\top} \Phi_{\mathrm{mol}}(m).$$

A consequence is that only data related to the target $t$ are used to estimate the specific vector $w_t$, while all data are used to estimate the common vector $w_{\mathrm{general}}$. In our framework this classifier is therefore the combination of a target-specific part accounting for target-specific properties of the ligands and a global part accounting for general properties of the ligands across the targets. The latter term makes it possible to share information during the learning process, while the former ensures that specificities of the ligands for each target are not lost.

• The hierarchy kernel. Alternatively we could propose a new kernel aimed at encoding the similarity of proteins with respect to the ligands they bind. In the GLIDA database indeed, GPCRs are grouped into classes based on sequence homology and functional similarity: the rhodopsin family (class A), the secretin family (class B), the metabotropic family (class C) and some smaller classes containing other GPCRs. The GLIDA database further subdivides each class of targets by type of ligands, for example amine or peptide receptors or more specific families of ligands.
This also defines a natural hierarchy that can be used to compare GPCRs. The hierarchy kernel between two GPCRs was therefore defined as the number of common ancestors in the corresponding hierarchy plus one, that is,

$$K_{\mathrm{hierarchy}}(t, t') = \langle \Phi_h(t), \Phi_h(t') \rangle,$$

where $\Phi_h(t)$ contains as many features as there are nodes in the hierarchy, each being set to 1 if the corresponding node is part of $t$'s hierarchy and 0 otherwise, plus one feature constantly set to one that accounts for the "plus one" term of the kernel.

• The binding pocket kernel. Because the protein-ligand recognition process occurs in 3D space in a pocket involving a limited number of residues, we tried to describe the GPCR space using a representation of this pocket. The difficulty resides in the fact that although the GPCR sequences are known, the residues forming this pocket and its precise geometry are a priori unknown. However, the two available X-ray structures, together with mutagenesis data, showed that the binding pockets are situated in a similar region for all GPCRs (Kratochwil et al., 2005). In order to identify residues potentially involved in the binding pocket of GPCRs of unknown structure studied in this work, we proceeded in several steps.

(a) The two known structures (PDB entries 1U19 and 2RH1) were superimposed using the STAMP algorithm (Russell and Barton, 1992). In the superimposed structures, the retinal and 3-(isopropylamino)propan-2-ol ligands are very close, which is in agreement with global conservation of binding pockets, as shown in Figure 1.
(b) The structural alignment of bovine rhodopsin and of human β2-adrenergic receptor was used to generate a sequence alignment of these two proteins.
(c) For both structures, in order to identify residues potentially involved in stabilizing interactions with the ligand (residues of the pocket), we selected residues that presented at least one atom situated at less than from at least one atom of the ligand.
Figure 2 shows that these two pockets clearly overlap, as expected.
(d) Residues of the two pockets (as defined in (c)) were labeled in this structural sequence alignment. These residues were found to form small sequence clusters that were in correspondence in this alignment. These clusters were situated mainly in the apical region of transmembrane segments and included a few extracellular residues.
(e) All studied GPCR sequences, including bovine rhodopsin and human β2-adrenergic receptor, were aligned using CLUSTALW (Chenna et al., 2003) with Blosum matrices (Henikoff and Henikoff, 1992). For each protein, residues in correspondence with a residue of the binding pocket (as defined above) of either bovine rhodopsin or human β2-adrenergic receptor were retained. This led to a different number of residues per protein, because of sequence variability. For example, in extracellular regions, some residues from bovine rhodopsin or human β2-adrenergic receptor had a corresponding residue in some sequences but not in others. In order to provide a homogeneous description of all GPCRs, in the list of residues initially retained for each protein, only residues situated at positions conserved in almost all GPCRs were kept.
(f) Each protein was then represented by a vector whose elements corresponded to a potential conserved pocket. This description, although appearing as a linear vector filled with amino acid residues, implicitly codes for 3D information on the receptor pocket, as illustrated in Figure 2.

These vectors were then used to build a kernel that allows comparison of binding pockets. The classical way to represent motifs of constant length as fixed-length vectors is to encode the letter at each position by a -dimensional binary vector indicating which amino acid is present, resulting in a -dimensional vector representation.
In terms of kernel, the inner product between two binding pocket motifs in this representation is simply the number of letters they have in common at the same positions:

$$K_{\mathrm{pb}}(x, x') = \sum_{i=1}^{l} \delta(x[i], x'[i]),$$

where $l$ is the length of the binding pocket motifs ( in our case), $x[i]$ is the $i$-th residue in $x$ and $\delta(x[i], x'[i])$ is 1 if $x[i] = x'[i]$, 0 otherwise. This is the baseline pocket binding kernel. Alternatively, using a polynomial kernel of degree $p$ over the baseline kernel is equivalent, in terms of feature space, to encoding $p$-order interactions between amino acids at different positions. In order to assess the relevance of such non-linear extensions we tested this polynomial pocket binding kernel,

$$K_{\mathrm{ppb}}(x, x') = \left(K_{\mathrm{pb}}(x, x') + 1\right)^{p}.$$

We only used a degree $p = 2$, although a more careful choice of this parameter could further improve the performance.

• The binding pocket hierarchy kernel. Because of the link between binding pockets and ligand recognition, we also defined a new hierarchy based on the sequence alignment of the binding pocket amino acid vectors without gaps. To do this, we used a PAM matrix with high gap insertion and extension penalties to compare each pair of GPCR vectors. The obtained scores were used in UPGMA (Unweighted Pair Group Method with Arithmetic mean) to determine a hierarchy based on binding pocket similarity. We obtained a tree comparable to phylogenetic trees, which happens to share many substructures with the GLIDA hierarchy.

Figure 1: Representation of the binding pocket of β2-adrenergic receptor (in red) and bovine rhodopsin (in black) viewed from the extracellular surface. In the center of the pocket, 3-(isopropylamino)propan-2-ol and cis-retinal have been represented to show the size and the position of the pocket around each ligand.
Figure drawn with VMD (Humphrey et al., 1996).

Figure 2: 3-(isopropylamino)propan-2-ol and the protein environment of β-adrenergic receptor as viewed from the extracellular surface. Amino acid side chains are represented for […] of the […] residues (in cyan, blue and red) of the binding pocket motif. Transmembrane helix […] and 3-(isopropylamino)propan-2-ol are colored in black and red respectively. Figure drawn with VMD (Humphrey et al., 1996).

3 Data

We used the GLIDA GPCR-ligand database (Okuno et al., 2006), which includes […] known ligands for […] GPCRs from human, rat and mouse. The ligand base contains highly diverse molecules, from ions and very small molecules up to peptides. In order to eliminate unwanted molecules such as inorganic compounds and molecules with unsuitable molecular weights, we filtered the GLIDA ligand base using the filter program (OpenEye Scientific Software) with standard parameters. The most important filtering feature here was to keep molecules with molecular weights ranging from […] Da to […] Da. Overall, the GLIDA ligand base was filtered in order to retain molecules that had the physico-chemical characteristics of drugs. This filter retained […] molecules. Because the GLIDA ligand base contains a few duplicates, we eliminated these redundancies, which led to […] different molecules, available as 2D description files and giving […] interactions with the human GPCRs. Elimination of the duplicates present in the GLIDA base was important here because they could have led to overfitting in the learning step. For each positive interaction given by this restricted set, we generated a negative interaction involving the same receptor and one of the ligands that was in the database and was not indicated as one of its ligands. This probably generated some false negative points in our benchmark, and it would be interesting to use experimentally tested negative interactions. We loaded the sequences of all GPCRs that are able to bind any of these ligands, which led to 80 sequences, all corresponding to human GPCRs. In the GLIDA database, GPCRs are classified in a hierarchy (as mentioned above), which was also loaded for use in the hierarchy kernel.

4 Results

We ran two different sets of experiments on this dataset in order to illustrate two important points. In a first set of experiments, for each GPCR, we split the available data, i.e., the row of the interaction matrix corresponding to this GPCR, into five folds. The classifier was trained with four folds and the whole data from the other GPCRs, i.e., all other rows of the interaction matrix. The prediction accuracy for the GPCR under study was then tested on the remaining fold. The goal of these first experiments was to evaluate whether using data from other GPCRs improved the prediction accuracy for a given GPCR. In a second set of experiments, for each GPCR we trained a classifier on the whole data from the other GPCRs, and tested on the data of the considered GPCR. The goal was to assess how efficient our chemogenomics approach would be at predicting the ligands of orphan GPCRs. In both experiments, the C parameter of the SVM was selected by internal cross-validation on the training set among $[…]^{i}$, $i \in \{-[…], -[…], \ldots, […], […]\}$.

For the first experiment, since learning an SVM with only one training point does not really make sense and can lead to "anti-learning" performance below […], we set all results $r$ involving the Dirac GPCR kernel on GPCRs with only […] known ligand to $\max(r, […])$. This is to avoid any artefactual penalization of the Dirac approach and to make sure we measure the actual improvement brought by sharing information across GPCRs.

K_tar \ K_lig               2D Tanimoto    3D pharmacophore
Dirac                       […] ± […]      […] ± […]
multitask                   […] ± […]      […] ± […]
hierarchy                   […] ± […]      […] ± […]
binding pocket              […] ± […]      […] ± […]
poly binding pocket         […] ± […]      […] ± […]
binding pocket hierarchy    […] ± […]      […] ± […]

Table 1: Prediction accuracy for the first experiment with various ligand and target kernels.

Figure 3: GPCR kernel Gram matrices ($K_{tar}$) for the GLIDA GPCR data with multitask, hierarchy, binding pocket and binding pocket hierarchy kernels.

Table 1 shows the results of the first experiments with all the ligand and GPCR kernel combinations. For all the ligand kernels, one observes an improvement between the individual approach (Dirac GPCR kernel, […]) and the baseline multitask approach (multitask GPCR kernel, […]). The latter kernel merely models the fact that each GPCR is uniformly similar to all other GPCRs, and twice as similar to itself. It does not use any prior information on the GPCRs, and yet, using it improves the global performance with respect to individual learning. Using more informative GPCR kernels further improves, sometimes considerably, the prediction accuracy. In particular, the hierarchy kernels add more than […] of precision with respect to the naive multitask approach. All the other informative GPCR kernels also improve the performance. The polynomial binding pocket kernel and the binding pocket hierarchy kernel are almost as efficient as the hierarchy kernel, which is an interesting result. Indeed, one could fear that using the hierarchy kernel, for the construction of which some knowledge of the ligands may have been used, could have introduced bias in the results. Such bias is certainly absent in the binding pocket kernel. The fact that the same performance can be reached with kernels based on the mere sequence of GPCRs' pockets is therefore an important result. Figure 3 shows four of the GPCR kernels. The baseline multitask kernel is shown as a comparison. Interestingly, many of the subgroups defined in the hierarchy can be found in the binding pocket kernel, that is, they are retrieved from the simple information of the binding pocket sequence. This phenomenon is even more visible for the binding pocket hierarchy kernel, which is based on the hierarchy built from the binding pocket alignment scores.

Figure 4: Improvement (as a performance ratio) of the hierarchy GPCR kernel against the Dirac GPCR kernel as a function of the number of training samples available. Restricted to [2 – […]] samples for the sake of readability.

The 3D kernel for the ligands, on the other hand, did not perform as well as the 2D kernel.
This can be explained either by the fact that the pharmacophore kernel is not suited to this problem, or by the fact that choosing the conformer of the ligand is not a trivial task. This point is discussed below.

Figure 4 illustrates how the improvement brought by the chemogenomics approach varies with the number of available training points. As one could have expected, the strongest improvement is observed for the GPCRs with few (fewer than […]) training points (i.e., fewer than […] known ligands, since for each known ligand an artificial non-ligand was generated). When more training points become available, the improvement is less important, and sharing the information across the GPCRs can even degrade the performance. This is an important point, first because, as shown in Figure 5, many GPCRs have few known ligands (in particular, […] of them have only two training points), and second because it shows that when enough training points are available, individual learning will probably perform as well as or better than our chemogenomics approach.

Figure 5: Distribution of the number of training points for a GPCR. Restricted to [2 – […]] samples for the sake of readability.

Our second experiment intends to assess how our chemogenomics approach can perform when predicting ligands for orphan GPCRs, i.e., with no training data available for the GPCR of interest. Table 2 shows that in this setting, individual learning amounts to random prediction. The naive multitask approach does not improve the performance much, but informative kernels such as the hierarchy and binding pocket kernels achieve […] and […] of precision respectively, that is, almost […] better than the random approach one would get when no data is available. Here again, the fact that the binding pocket kernel, which only uses the sequence of the receptor pocket, performs as well as the hierarchy-based kernel is encouraging. It suggests that given a receptor for which nothing is known except its sequence, it is possible to make reasonable ligand predictions.

K_tar \ K_lig               2D Tanimoto    3D pharmacophore
Dirac                       […] ± […]      […] ± […]
multitask                   […] ± […]      […] ± […]
hierarchy                   […] ± […]      […] ± […]
binding pocket              […] ± […]      […] ± […]
poly binding pocket         […] ± […]      […] ± […]
binding pocket hierarchy    […] ± […]      […] ± […]

Table 2: Prediction accuracy for the second experiment with various ligand and target kernels.

5 Discussion

We showed how sharing information across the GPCRs by considering a chemogenomics space of the GPCR-ligand interaction pairs could improve the prediction performance. In addition, we showed that using such a representation, it was possible to make reasonable predictions even when no ligand was known for a given GPCR, that is, to predict ligands for orphan GPCRs. Our approach is simply to apply well-known machine learning methods in the constructed chemogenomics space. We used a systematic way to build such a space by combining a given representation of the ligands with a given representation of the GPCRs into a binding-prediction-oriented representation of the GPCR-ligand pair. This makes it possible to use any ligand or GPCR descriptor or kernel existing in the chemoinformatics or bioinformatics literature, or new ones containing other prior information, as we tried to propose in this paper. Our experiments showed that the choice of the descriptors was crucial for the prediction, and more sophisticated features for either the ligands or the GPCRs could probably further improve the performance.

In all experiments, 3D pharmacophore kernels did not reach the performance of 2D kernels for the ligands.
This is apparently in contradiction with the idea that protein-ligand interaction is a process occurring in 3D space, and that the introduction of 3D information should increase the performance. Different explanations can be proposed. The choice of the low energy conformer was guided by the following idea. Because only two ligand conformations bound to GPCR receptors are available, it was not possible to derive any general information that could be used to choose a potential bioactive 3D conformer for each molecule of the ligand base. In this context, the only reasonable assumption was that, while interaction with the receptor will certainly perturb the conformational energy surface of a flexible ligand, high affinity would be observed for ligands that bind in a conformation that is not exceptionally different from a local free state energy minimum (Boström, 2001). Although there exists a large number of methods for exploring the conformation space of a molecule, we used the Omega program, which performs a rapid systematic conformer search, because it has been shown to perform well at retrieving bioactive conformations (Boström et al., 2003). However, the set of parameters used to run Omega in this study (chosen because of calculation time limitations) may not have allowed a local energy minimum to be reached: generating a larger number of conformers, with a smaller RMSD clustering value, may have helped to find better energy minima, and this could be further evaluated. Moreover, some studies report that the bioactive conformation of a molecule can differ from the minimum energy conformation, and that significant strain energies can indeed be found for molecules in complex with proteins (Perola and Charifson, 2004). We cannot rule out the possibility that this is the case for GPCR ligands. In the future, resolution of additional 3D structures in this family will help to clarify this point.
One possible improvement of the method could be to use homology models for the GPCRs, dock the ligand base in the modeled binding pockets, and build a 3D ligand base using, for each molecule, the conformer associated with the best docking solution. In other families of proteins, enzymes for example, where many structures are available and can be used to define bioactive conformers, the 3D pharmacophore kernel is expected to improve performance, as observed in a previous pure ligand-based study involving ligands in a given series, for which the bioactive conformation can be inferred from a known 3D structure (Mahé et al., 2006).

Various lines of evidence suggest that, within a common global architecture, a generic binding pocket mainly involving transmembrane regions hosts agonists, antagonists and allosteric modulators. In order to identify this pocket automatically, other studies report the use of sequence alignment and the prediction of transmembrane helices. Kratochwil et al. (2005) detected hypervariable positions in transmembrane helices for identification of residues forming the binding pocket. The underlying idea was that conserved residues were probably important for structure stabilization, while variable positions were involved in ligand binding, in order to accommodate the wide spectrum of molecules that are GPCR substrates. Using this method, they proposed potential binding pockets for GPCRs, and found that the corresponding residues were frequently in the GRAP mutant database for GPCRs (Kristiansen et al., 1996). Interestingly, these authors pointed out that residues corresponding to these hypervariable positions were found within a distance of […] from retinal in the rhodopsin X-ray structure.
Therefore, although we used a different method to automatically extract binding pocket residues in the GPCR families, our results are in good agreement with this study.

Interesting developments of this method could include application to quantitative prediction of the binding affinities, which would be straightforward using regression algorithms in the same chemogenomics space. Another possibility is application to other important drug target families, like enzymes or ion channels (Hopkins and Groom, 2002), for which most of the descriptors used for the GPCRs in this paper could directly be used, and other, more specific ones could be designed. From a methodological point of view, many recent developments in multitask learning (Vert et al., 2006; Argyriou et al., 2007; Bonilla et al., 2008) could be applied to generalize this chemogenomics approach using, for example, other regularization methods.

References

Argyriou, A., Evgeniou, T., and Pontil, M. (2007). Multi-task feature learning. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Adv. Neural. Inform. Process Syst. 19, pages 41–48, Cambridge, MA. MIT Press.
Azencott, C.-A., Ksikes, A., Swamidass, S. J., Chen, J. H., Ralaivola, L., and Baldi, P. (2007). One- to four-dimensional kernels for virtual screening and the prediction of physical, chemical, and biological properties. J Chem Inf Model, 47(3), 965–974.
Balakin, K. V., Tkachenko, S. E., Lang, S. A., Okun, I., Ivashchenko, A. A., and Savchuk, N. P. (2002). Property-based design of gpcr-targeted library. J Chem Inf Comput Sci, 42(6), 1332–1342.
Bock, J. R. and Gough, D. A. (2001). Predicting protein-protein interactions from primary structure. Bioinformatics, 17(5), 455–460.
Bock, J. R. and Gough, D. A. (2005). Virtual screen for ligands of orphan g protein-coupled receptors. J Chem Inf Model, 45(5), 1402–1414.
Bockaert, J. and Pin, J. P. (1999). Molecular tinkering of g protein-coupled receptors: an evolutionary success. EMBO J, 18(7), 1723–1729.
Bonilla, E., Chai, K. M., and Williams, C. (2008). Multi-task gaussian process prediction. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA.
Borgwardt, K., Ong, C., Schönauer, S., Vishwanathan, S., Smola, A., and Kriegel, H.-P. (2005). Protein function prediction via graph kernels. Bioinformatics, 21(Suppl. 1), i47–i56.
Boström, J. (2001). Reproducing the conformations of protein-bound ligands: a critical evaluation of several popular conformational searching tools. J Comput Aided Mol Des, 15(12), 1137–1152.
Boström, J., Greenwood, J. R., and Gottfries, J. (2003). Assessing the performance of omega with respect to retrieving bioactive conformations. J Mol Graph Model, 21(5), 449–462.
Catapano, L. A. and Manji, H. K. (2007). G protein-coupled receptors in major psychiatric disorders. Biochim Biophys Acta, 1768(4), 976–993.
Cavasotto, C. N., Orry, A. J. W., and Abagyan, R. A. (2003). Structure-based identification of binding sites, native ligands and potential inhibitors for g-protein coupled receptors. Proteins, 51(3), 423–433.
Chenna, R., Sugawara, H., Koike, T., Lopez, R., Gibson, T. J., Higgins, D. G., and Thompson, J. D. (2003). Multiple sequence alignment with the clustal series of programs. Nucleic Acids Res, 31(13), 3497–3500.
Cuturi, M. and Vert, J.-P. (2005). The context-tree kernel for strings. Neural Network., 18(4), 1111–1123.
Deshpande, D. A. and Penn, R. B. (2006). Targeting g protein-coupled receptor signaling in asthma. Cell Signal, 18(12), 2105–2120.
Dobson, P. and Doig, A. (2005). Predicting enzyme class from protein structure without alignments. J. Mol. Biol., 345(1), 187–199.
Erhan, D., L'heureux, P.-J., Yue, S. Y., and Bengio, Y. (2006). Collaborative filtering on a family of biological targets. J Chem Inf Model, 46(2), 626–635.
Evers, A. and Klabunde, T. (2005). Structure-based drug discovery using gpcr homology modeling: successful virtual screening for antagonists of the alpha1a adrenergic receptor. J Med Chem, 48(4), 1088–1097.
Evgeniou, T., Micchelli, C., and Pontil, M. (2005). Learning multiple tasks with kernel methods. J. Mach. Learn. Res., 6, 615–637.
Fredholm, B. B., Hökfelt, T., and Milligan, G. (2007). G-protein-coupled receptors: an update. Acta Physiol (Oxf), 190(1), 3–7.
Frimurer, T. M., Ulven, T., Elling, C. E., Gerlach, L.-O., Kostenis, E., and Högberg, T. (2005). A physicogenetic method to assign ligand-binding relationships between 7tm receptors. Bioorg. Med. Chem. Lett., 15(16), 3707–3712.
Gärtner, T., Flach, P., and Wrobel, S. (2003). On graph kernels: hardness results and efficient alternatives. In B. Schölkopf and M. Warmuth, editors, Proceedings of the Sixteenth Annual Conference on Computational Learning Theory and the Seventh Annual Workshop on Kernel Machines, volume 2777 of Lecture Notes in Computer Science, pages 129–143, Heidelberg.
Gasteiger, J. and Engel, T., editors (2003). Chemoinformatics: a Textbook. Wiley.
Henikoff, S. and Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A, 89(22), 10915–10919.
Hill, S. J. (2006). G-protein-coupled receptors: past, present and future. Br J Pharmacol, 147 Suppl 1, S27–S37.
Hopkins, A. L. and Groom, C. R. (2002). The druggable genome. Nat Rev Drug Discov, 1(9), 727–730.
Horn, F., Bettler, E., Oliveira, L., Campagne, F., Cohen, F. E., and Vriend, G. (2003). GPCRDB information system for G protein-coupled receptors. Nucl. Acids Res., 31(1), 294–297.
Humphrey, W., Dalke, A., and Schulten, K. (1996). Vmd: visual molecular dynamics. J Mol Graph, 14(1), 33–8, 27–8.
Jaakkola, T., Diekhans, M., and Haussler, D. (2000). A Discriminative Framework for Detecting Remote Protein Homologies. J. Comput. Biol., 7(1,2), 95–114.
Jacob, L. and Vert, J.-P. (2008). Efficient peptide-MHC-I binding prediction for alleles with few known binders. Bioinformatics. To appear.
Jaroch, S. E. and Weinmann, H., editors (2006). Chemical Genomics: Small Molecule Probes to Study Cellular Function. Ernst Schering Research Foundation Workshop. Springer.
Kashima, H., Tsuda, K., and Inokuchi, A. (2003). Marginalized Kernels between Labeled Graphs. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 321–328. AAAI Press.
Kashima, H., Tsuda, K., and Inokuchi, A. (2004). Kernels for graphs. In B. Schölkopf, K. Tsuda, and J. Vert, editors, Kernel Methods in Computational Biology, pages 155–170. MIT Press.
Klabunde, T. (2006). Chemogenomics approaches to ligand design. In Ligand Design for G Protein-coupled Receptors, chapter 7, pages 115–135. Wiley-VCH.
Klabunde, T. (2007). Chemogenomic approaches to drug discovery: similar receptors bind similar ligands. Br J Pharmacol, 152, 5–7.
Kratochwil, N. A., Malherbe, P., Lindemann, L., Ebeling, M., Hoener, M. C., Mühlemann, A., Porter, R. H. P., Stahl, M., and Gerber, P. R. (2005). An automated system for the analysis of g protein-coupled receptor transmembrane binding pockets: alignment, receptor-based pharmacophores, and their application. J Chem Inf Model, 45(5), 1324–1336.
Kristiansen, K., Dahl, S. G., and Edvardsen, O. (1996). A database of mutants and effects of site-directed mutagenesis experiments on g protein-coupled receptors. Proteins, 26(1), 81–94.
Kuang, R., Ie, E., Wang, K., Wang, K., Siddiqi, M., Freund, Y., and Leslie, C. (2005). Profile-based string kernels for remote homology detection and motif extraction. J Bioinform Comput Biol, 3(3), 527–550.
Kubinyi, H., Müller, G., Mannhold, R., and Folkers, G., editors (2004). Chemogenomics in Drug Discovery: A Medicinal Chemistry Perspective. Methods and Principles in Medicinal Chemistry. Wiley-VCH.
Leslie, C., Eskin, E., and Noble, W. (2002). The spectrum kernel: a string kernel for SVM protein classification. In R. B. Altman, A. K. Dunker, L. Hunter, K. Lauderdale, and T. E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing 2002, pages 564–575. World Scientific.
Leslie, C. S., Eskin, E., Cohen, A., Weston, J., and Noble, W. S. (2004). Mismatch string kernels for discriminative protein classification. Bioinformatics, 20(4), 467–476.
Lin, S. H. S. and Civelli, O. (2004). Orphan g protein-coupled receptors: targets for new therapeutic interventions. Ann Med, 36(3), 204–214.
Mahé, P., Ueda, N., Akutsu, T., Perret, J.-L., and Vert, J.-P. (2005). Graph kernels for molecular structure-activity relationship analysis with support vector machines. J Chem Inf Model, 45(4), 939–51.
Mahé, P., Ralaivola, L., Stoven, V., and Vert, J.-P. (2006). The pharmacophore kernel for virtual screening with support vector machines. J Chem Inf Model, 46(5), 2003–2014.
Okuno, Y., Yang, J., Taneishi, K., Yabuuchi, H., and Tsujimoto, G. (2006). Glida: Gpcr-ligand database for chemical genomic drug discovery. Nucleic Acids Res, 34(Database issue), D673–D677.
Perola, E. and Charifson, P. S. (2004). Conformational analysis of drug-like molecules bound to proteins: an extensive study of ligand reorganization upon binding. J Med Chem, 47(10), 2499–2510.
Qiu, J., Hue, M., Ben-Hur, A., Vert, J.-P., and Noble, W. S. (2007). A structural alignment kernel for protein structures. Bioinformatics, 23(9), 1090–1098.
Ralaivola, L., Swamidass, S. J., Saigo, H., and Baldi, P. (2005). Graph kernels for chemical informatics. Neural Netw., 18(8), 1093–1110.
Rognan, D. (2007). Chemogenomic approaches to rational drug design. Br J Pharmacol, 152, 38–52.
Rolland, C., Gozalbes, R., Nicolaï, E., Paugam, M.-F., Coussy, L., Barbosa, F., Horvath, D., and Revah, F. (2005). G-protein-coupled receptor affinity prediction based on the use of a profiling dataset: Qsar design, synthesis, and experimental validation. J Med Chem, 48(21), 6563–6574.
Russell, R. B. and Barton, G. J. (1992). Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins, 14(2), 309–323.
Schölkopf, B., Tsuda, K., and Vert, J.-P. (2004). Kernel Methods in Computational Biology. MIT Press.
Todeschini, R. and Consonni, V. (2002). Handbook of Molecular Descriptors. Wiley-VCH.
Tsuda, K., Kin, T., and Asai, K. (2002). Marginalized Kernels for Biological Sequences. Bioinformatics, 18, S268–S275.
Vert, J.-P. (2002). A tree kernel to analyze phylogenetic profiles. Bioinformatics, 18, S276–S284.
Vert, J.-P., Saigo, H., and Akutsu, T. (2004). Local alignment kernels for biological sequences. In B. Schölkopf, K. Tsuda, and J. Vert, editors, Kernel Methods in Computational Biology, pages 131–154. MIT Press.
Vert, J.-P., Bach, F., and Evgeniou, T. (2006). Low-rank matrix factorization with attributes.

Positions on β-adrenergic receptor       […] 109 110 113 114 115 116 117 118 121 175 183 195 199 200
β-adrenergic receptor                    M   W   T   D   V   L   C   V   T   I   R   N   T   Y   A
5-hydroxytryptamine 5A receptor          V   W   I   D   V   L   C   C   T   I   I   E   S   Y   A
Adenosine A2b receptor                   V   L   A   V   L   V   L   T   Q   I   I   K   K   M   V
Gamma-aminobutyric acid type B receptor  E   D   E   E   A   V   E   G   H   T   L   G   S   F   D
Relaxin 3 receptor 2                     L   V   L   T   V   L   N   V   Y   I   V   G   L   Y   Q

Positions on β-adrenergic receptor       203 204 207 208 212 282 286 […] 290 293 […] 311 312 313 315 316
β-adrenergic receptor                    S   S   S   F   L   F   W   F   F   N   Y   L   N   W   G   Y
5-hydroxytryptamine 5A receptor          S   T   A   F   L   F   W   F   F   E   K   F   L   W   G   Y
Adenosine A2b receptor                   N   F   C   V   L   F   W   V   H   N   M   A   I   L   S   H
Gamma-aminobutyric acid type B receptor  G   S   A   W   E   F   L   Y   H   R   L   T   V   G   L   V
Relaxin 3 receptor 2                     R   V   A   F   L   F   W   N   H   T   F   T   T   C   A   H

Table 3: Residues of 5-hydroxytryptamine 5A receptor, Adenosine A2b receptor, Gamma-aminobutyric acid type B receptor and Relaxin 3 receptor 2 (shown as examples) aligned with β-adrenergic receptor binding site amino acids. The binding pocket motif of β-adrenergic receptor has been used as reference to determine residues involved in the formation of the binding site of the other GPCRs. Bold columns correspond to the residues shown on Figure 2.
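
The baseline and polynomial pocket binding kernels defined in the text can be illustrated on the example motifs of Table 3. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the names `POCKETS`, `pocket_kernel`, `poly_pocket_kernel` and `gram_matrix` are introduced here for illustration only, and each motif is transcribed from Table 3 by concatenating the residues of its two blocks.

```python
# Binding pocket motifs transcribed from Table 3 (first block + second block).
# All names in this sketch are illustrative and do not come from the paper.
POCKETS = {
    "beta-adrenergic receptor": "MWTDVLCVTIRNTYA" + "SSSFLFWFFNYLNWGY",
    "5-hydroxytryptamine 5A receptor": "VWIDVLCCTIIESYA" + "STAFLFWFFEKFLWGY",
    "Adenosine A2b receptor": "VLAVLVLTQIIKKMV" + "NFCVLFWVHNMAILSH",
    "Gamma-aminobutyric acid type B receptor": "EDEEAVEGHTLGSFD" + "GSAWEFLYHRLTVGLV",
    "Relaxin 3 receptor 2": "LVLTVLNVYIVGLYQ" + "RVAFLFWNHTFTTCAH",
}


def pocket_kernel(x, y):
    """Baseline kernel K_pb: number of positions carrying the same residue."""
    if len(x) != len(y):
        raise ValueError("binding pocket motifs must have the same length")
    return sum(a == b for a, b in zip(x, y))


def poly_pocket_kernel(x, y, p=2):
    """Polynomial kernel K_ppb = (K_pb + 1)^p; the text uses p = 2."""
    return (pocket_kernel(x, y) + 1) ** p


def gram_matrix(motifs, kernel):
    """Pairwise kernel values between all motifs, as a list of rows."""
    return [[kernel(a, b) for b in motifs] for a in motifs]


if __name__ == "__main__":
    names = list(POCKETS)
    motifs = [POCKETS[name] for name in names]
    for name, row in zip(names, gram_matrix(motifs, pocket_kernel)):
        print(f"{name:40s}", row)
```

On these transcribed motifs, the β-adrenergic and 5-hydroxytryptamine 5A receptors carry the same residue at 19 of the 31 positions, so the baseline kernel equals 19 and the polynomial kernel with p = 2 equals (19 + 1)^2 = 400. A Gram matrix computed this way plays the role of the target kernel in the experiments described above.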