An Inductive Logic Programming Approach to Validate Hexose Binding Biochemical Knowledge

Abstract

Hexoses are simple sugars that play a key role in many cellular pathways, and in the regulation of development and disease mechanisms. Current protein-sugar computational models are based, at least partially, on prior biochemical findings and knowledge. They incorporate different parts of these findings in predictive black-box models. We investigate the empirical support for biochemical findings by comparing Inductive Logic Programming (ILP) induced rules to actual biochemical results. We mine the Protein Data Bank for a representative data set of hexose binding sites, non-hexose binding sites and surface grooves. We build an ILP model of hexose-binding sites and evaluate our results against several baseline machine learning classifiers. Our method achieves an accuracy similar to that of other black-box classifiers while providing insight into the discriminating process. In addition, it confirms wet-lab findings and reveals a previously unreported Trp-Glu amino acids dependency.


arXiv [q-bio.OT], Oct

Houssam Nassif, Hassan Al-Ali, Sawsan Khuri, Walid Keirouz, and David Page

Department of Computer Sciences, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, USA
Department of Biochemistry and Molecular Biology, Center for Computational Science, University of Miami, The Dr. John T. Macdonald Foundation Department of Human Genetics, University of Miami Miller School of Medicine, Florida, USA
Department of Computer Science, American University of Beirut, Lebanon


Key words: ILP, Aleph, rule generation, hexose, protein-carbohydrate interaction, binding site, substrate recognition

Inductive Logic Programming (ILP) has been shown to perform well in predicting various substrate-protein bindings (e.g., [9, 26]). In this paper we apply ILP to a different and well-studied binding task.

Hexoses are 6-carbon simple sugar molecules that play a key role in different biochemical pathways, including cellular energy release, signaling, carbohydrate synthesis, and the regulation of gene expression [24]. Hexose binding proteins belong to diverse functional families that lack significant sequence or, often, structural similarity [16]. Despite this fact, these proteins show high specificity to their hexose ligands. The few amino acids (also called residues) present at the binding site play a large role in determining the binding site's distinctive topology and biochemical properties and hence the ligand type and the protein's functionality.

Wet-lab experiments discover hexose-protein properties. Computational hexose classifiers incorporate different parts of these findings in black-box models as the base of prediction. No work to date has taken the opposite approach: given hexose binding sites data, what biochemical rules can we extract with no prior biochemical knowledge, and what is the performance of the resulting classifier based solely on the extracted rules?

This work presents an ILP classifier that extracts rules from the data without prior biochemical knowledge. It classifies binding sites based on the extracted biochemical rules, clearly specifying the rules used to discriminate each instance. Rule learning is especially appealing because of its easy-to-understand format. A set of if-then rules describing a certain concept is highly expressive and readable [18]. We evaluate our results against several baseline machine learning classifiers. This inductive data-driven approach validates the biochemical findings and allows a better understanding of the black-box classifiers' output.

Although no previous work tackled data-driven rule generation or validation, many researchers studied hexose binding.

From the biochemical perspective, Rao et al. [21] fully characterized the architecture of sugar binding in the Lectin protein family and identified conserved loop structures as essential for sugar recognition. Later, Quiocho and Vyas [20] presented a review of the biochemical characteristics of carbohydrate binding sites and identified the planar polar residues (Asn, Asp, Gln, Glu, Arg) as the most frequently involved residues in hydrogen bonding. They also found that the aromatic residues Trp, Tyr, and Phe, as well as His, stack against the apolar surface of the sugar pyranose ring. Quiocho and Vyas also pinpointed the role of metal ions in determining substrate specificity and affinity. Ordered water molecules bound to protein surfaces are also involved in protein-ligand interaction [15].

Taroni et al. [29] analyzed the characteristic properties of sugar binding sites and described a residue propensity parameter that best discriminates sugar binding sites from other protein-surface patches. They also note that simple sugars typically have a hydrophilic side group which establishes hydrogen bonds and a hydrophobic core that is able to stack against aromatic residues. Sugar binding sites are thus neither strictly hydrophobic nor strictly hydrophilic, due to the dual nature of sugar docking. In fact, as García-Hernández et al. [11] showed, some polar groups in the protein-carbohydrate complex behave hydrophobically.

Some of this biochemical information has been used in computational work with the objective of accurately predicting sugar binding sites in proteins. Taroni et al. [29] devised a probability formula by combining individual attribute scores. Shionyu-Mitsuyama et al. [23] used atom type densities within binding sites to develop an algorithm for predicting carbohydrate binding. Chakrabarti et al. [5] modeled one glucose binding site and one galactose binding site by optimizing their binding affinity under geometric and folding free energy constraints. Other researchers formulated a signature for characterizing galactose binding sites based on geometric constraints, pyranose ring proximity and hydrogen bonding atoms [27, 28]. They implemented a 3D structure searching algorithm, COTRAN, to identify galactose binding sites.

More recently, researchers used machine learning algorithms to model hexose binding sites. Malik and Ahmad [17] used a Neural Network to predict general carbohydrate as well as specific galactose binding sites. Nassif et al. [19] used Support Vector Machines to model and predict glucose binding sites in a wide range of proteins.

The Protein Data Bank (PDB) [2] is the largest repository of experimentally determined and hypothetical three-dimensional structures of biological macromolecules. We mine it for proteins crystallized with the most common hexoses: galactose, glucose and mannose [10]. We ignore theoretical structures and files older than PDB format 2.1. We eliminate redundant structures using PISCES [30] with a 30% overall sequence identity cut-off. We use Swiss-PDBViewer [14] to detect and discard sites that are glycosylated or within close proximity to other ligands. We check the literature to ensure that no hexose-binding site also binds non-hexoses. The final outcome is a non-redundant positive data set of 80 protein-hexose binding sites (Table 1).

We also extract an equal number of negative examples. The negative set is composed of non-hexose binding sites and of non-binding surface grooves. We choose 22 binding sites that bind hexose-like ligands: hexose or fructose derivatives, 6-carbon molecules, and molecules similar in shape to hexoses (Table 2). We also select 27 other-ligand binding sites, whose ligands are bigger or smaller than hexoses (Table 2). Finally, we specify 31 non-binding sites: protein surface grooves that look like binding sites but are not known to bind any ligand (Table 3).

We use 10-fold cross-validation to train, test and validate our approach. We divide the data set into 10 stratified folds, thus preserving the proportions of the original set labels and sub-groups.
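The stratified split described above can be sketched as follows. This is a minimal illustration and not the authors' procedure: the function name `stratified_folds`, the round-robin assignment and the random seed are assumptions, and only the sub-group sizes (80, 22, 27, 31) come from the text.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split example indices into k folds, spreading each sub-group evenly.

    Hypothetical helper: round-robin assignment with a running offset keeps
    both the fold sizes and the per-fold sub-group counts balanced.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_label):
        members = by_label[label]
        rng.shuffle(members)
        for i, idx in enumerate(members):
            folds[(offset + i) % k].append(idx)
        offset += len(members)
    return folds

# Sub-group sizes as stated in the data set description.
labels = (["hexose"] * 80 + ["hexose_like"] * 22
          + ["other_ligand"] * 27 + ["non_binding"] * 31)
folds = stratified_folds(labels)
```

With these sizes every fold holds 16 sites, 8 of them hexose-binding, so each test fold mirrors the 50/50 class balance of the full set.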

Table 1. Inventory of the hexose-binding positive data set

Hexose     PDB ID  Ligand     PDB ID  Ligand     PDB ID  Ligand
Glucose    1BDG    GLC-501    1ISY    GLC-1471   1SZ2    BGC-1001
           1EX1    GLC-617    1J0Y    GLC-1601   1SZ2    BGC-2001
           1GJW    GLC-701    1JG9    GLC-2000   1U2S    GLC-1
           1GWW    GLC-1371   1K1W    GLC-653    1UA4    GLC-1457
           1H5U    GLC-998    1KME    GLC-501    1V2B    AGC-1203
           1HIZ    GLC-1381   1MMU    GLC-1      1WOQ    GLC-290
           1HIZ    GLC-1382   1NF5    GLC-125    1Z8D    GLC-901
           1HKC    GLC-915    1NSZ    GLC-1400   2BQP    GLC-337
           1HSJ    GLC-671    1PWB    GLC-405    2BVW    GLC-602
           1HSJ    GLC-672    1Q33    GLC-400    2BVW    GLC-603
           1I8A    GLC-189    1RYD    GLC-601    2F2E    AGC-401
           1ISY    GLC-1461   1S5M    AGC-1001
Galactose  1AXZ    GLA-401    1MUQ    GAL-301    1R47    GAL-1101
           1DIW    GAL-1400   1NS0    GAL-1400   1S5D    GAL-704
           1DJR    GAL-1104   1NS2    GAL-1400   1S5E    GAL-751
           1DZQ    GAL-502    1NS8    GAL-1400   1S5F    GAL-104
           1EUU    GAL-2      1NSM    GAL-1400   1SO0    GAL-500
           1ISZ    GAL-461    1NSU    GAL-1400   1TLG    GAL-1
           1ISZ    GAL-471    1NSX    GAL-1400   1UAS    GAL-1501
           1JZ7    GAL-2001   1OKO    GLB-901    1UGW    GAL-200
           1KWK    GAL-701    1OQL    GAL-265    1XC6    GAL-9011
           1L7K    GAL-500    1OQL    GAL-267    1ZHJ    GAL-1
           1LTI    GAL-104    1PIE    GAL-1      2GAL    GAL-998
Mannose    1BQP    MAN-402    1KZB    MAN-1501   1OUR    MAN-301
           1KLF    MAN-1500   1KZC    MAN-1001   1QMO    MAN-302
           1KX1    MAN-20     1KZE    MAN-1001   1U4J    MAN-1008
           1KZA    MAN-1001   1OP3    MAN-503    1U4J    MAN-1009

In this work, we first extract multiple chemical and spatial features from the binding site. We then apply ILP to generate rules and classify our data set.

We view the binding site as a sphere centered at the ligand. We compute the center of the hexose-binding site as the centroid of the coordinates of the hexose pyranose ring's six atoms. For negative sites, we use the center of the cavity or the ligand's central point. The farthest pyranose-ring atom from the ring's centroid is located 2 .

Table 2. Inventory of the non-hexose-binding negative data set

PDB ID  Cavity Center        Ligand     PDB ID  Cavity Center        Ligand
Hexose-like ligands
1A8U    4320, 4323           BEZ-1      1AI7    6074, 6077           IPH-1
1AWB    4175, 4178           IPD-2      1DBN    pyranose ring        GAL-102
1EOB    3532, 3536           DHB-999    1F9G    5792, 5785, 5786     ASC-950
1G0H    4045, 4048           IPD-292    1JU4    4356, 4359           BEZ-1
1LBX    3941, 3944           IPD-295    1LBY    3944, 3939, 3941     F6P-295
1LIU    15441, 15436, 15438  FBP-580    1MOR    pyranose ring        G6P-609
1NCW    3406, 3409           BEZ-601    1P5D    pyranose ring        G1P-658
1T10    4366, 4361, 4363     F6P-1001   1U0F    pyranose ring        G6P-900
1UKB    2144, 2147           BEZ-1300   1X9I    pyranose ring        G6Q-600
1Y9G    4124, 4116, 4117     FRU-801    2B0C    pyranose ring        G1P-496
2B32    3941, 3944           IPH-401    4PBG    pyranose ring        BGP-469
Other ligands
11AS    5132                 ASN-1      11GS    1672, 1675           MES-3
1A0J    6985                 BEN-246    1A42    2054, 2055           BZO-555
1A50    4939, 4940           FIP-270    1A53    2016, 2017           IGP-300
1AA1    4472, 4474           3PG-477    1AJN    6074, 6079           AAN-1
1AJS    3276, 3281           PLA-415    1AL8    2652                 FMN-360
1B8A    7224                 ATP-500    1BO5    7811                 GOL-601
1BOB    2566                 ACO-400    1D09    7246                 PAL-1311
1EQY    3831                 ATP-380    1IOL    2674, 2675           EST-400
1JTV    2136, 2137           TES-500    1KF6    16674, 16675         OAA-702
1RTK    3787, 3784           GBS-300    1TJ4    1947                 SUC-1
1TVO    2857                 FRZ-1001   1UK6    2142                 PPI-1300
1W8N    4573, 4585           DAN-1649   1ZYU    1284, 1286           SKM-401
2D7S    3787                 GLU-1008   2GAM    11955                NGA-502
3PCB    3421, 3424           3HB-550

Table 3. Inventory of the non-binding surface groove negative data set

PDB ID  Cavity Center  PDB ID  Cavity Center  PDB ID  Cavity Center
1A04    1424, 2671     1A0I    1689, 799      1A22    2927
1AA7    579            1AF7    631, 1492      1AM2    1277
1ARO    154, 1663      1ATG    1751           1C3G    630, 888
1C3P    1089, 1576     1DXJ    867, 1498      1EVT    2149, 2229
1FI2    1493           1KLM    4373, 4113     1KWP    1212
1QZ7    3592, 2509     1YQZ    4458, 4269     1YVB    1546, 1814
1ZT9    1056, 1188     2A1K    2758, 3345     2AUP    2246
2BG9    14076, 8076    2C9Q    777            2CL3    123, 948
2DN2    749, 1006      2F1K    316, 642       2G50    26265, 31672
2G69    248, 378       2GRK    369, 380       2GSE    337, 10618
2GSH    6260

binding groove [15, 19, 20]. We discard hydrogen atoms since most PDB entries lack them. We do not extract residues.

For every extracted atom we record its PDB-coordinates, its charge, hydrogen bonding, and hydrophobicity properties, and its atomic element and name. Every PDB file has orthogonal coordinates and all atom positions are recorded accordingly. We compute atomic properties as done by Nassif et al. [19]. The partial charge measure per atom is positive, neutral, or negative; atoms can form hydrogen bonds or not; hydrophobicity measures are considered as hydrophobic, hydroneutral, or hydrophilic. Finally, every PDB-atom has an atomic element and a specific name. For example, the residue histidine (His) has a particular Nitrogen atom named ND1. This atom's element is Nitrogen, and its name is ND1. Since ND1 atoms only occur in His residues, recording atomic names leaks information about their residues.
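As an illustration of the representation described in this section, the sketch below computes a ring centroid and keeps the atoms that fall inside a sphere around it. The `Atom` record, the function names, the sample coordinates and the `radius` argument are all hypothetical; in particular, the sphere radius is passed in as a parameter rather than assumed.

```python
import math
from typing import NamedTuple

class Atom(NamedTuple):
    """One extracted PDB atom with the properties listed in the text."""
    x: float
    y: float
    z: float
    charge: str   # "positive", "neutral" or "negative"
    hbond: bool   # can the atom form hydrogen bonds?
    hydro: str    # "hydrophobic", "hydroneutral" or "hydrophilic"
    elem: str     # atomic element, e.g. "N"
    name: str     # PDB atom name, e.g. "ND1"

def centroid(points):
    """Arithmetic mean of a list of (x, y, z) coordinates."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def atoms_within(atoms, center, radius):
    """Keep the atoms whose Euclidean distance to center is at most radius."""
    return [a for a in atoms if math.dist((a.x, a.y, a.z), center) <= radius]

# Made-up coordinates for the six atoms of a pyranose ring.
ring = [(1 + 1.4 * math.cos(math.radians(60 * i)),
         2 + 1.4 * math.sin(math.radians(60 * i)),
         3.0) for i in range(6)]
site_center = centroid(ring)
```

For negative sites the same `atoms_within` call would apply, with `site_center` replaced by the cavity center or the ligand's central point.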

We use the ILP engine Aleph [25] to learn first-order rules. We run Aleph within Yap Prolog [22]. To speed the search, we use Aleph's heuristic search. We estimate the classifier's performance using 10-fold cross-validation.

We limit Aleph's running time by restricting the clause length to a maximum of 8 literals, with only one in the head. We set the Aleph parameter explore to true, so that it will return all optimal-scoring clauses, rather than a single one, in case of a tie. The consequent of any rule is bind(+site), where site is predicted to be a hexose binding site. No literal can contain terms pertaining to different binding sites. As a result, site is the same in all literals in a clause.

The literal describing the binding site center is:

    point(+site, -id, -X, -Y, -Z)    (1)

where site is the binding site and id is the binding center's unique identifier. X, Y, and Z specify the PDB-Cartesian coordinates of the binding site's centroid.

Literals describing individual PDB-atoms are of the form:

    point(+site, -id, -X, -Y, -Z, -charge, -hbond, -hydro, -elem, -name)    (2)

where site is the binding site and id is the individual atom's unique identifier. X, Y, and Z specify the PDB-Cartesian coordinates of the atom. charge is the partial charge, hbond the hydrogen-bonding, and hydro the hydrophobicity. Lastly, elem and name refer to the atomic element and its name (see last paragraph of previous section).

Clause bodies can also use distance literals:

    dist(+site, +id, +id, distance, error).    (3)

The dist predicate, depending on usage, either computes or checks the distance between two points. site is the binding site and the ids are two unique point identifiers (two PDB-atoms or one PDB-atom and one center). distance is their Euclidean distance apart and error the tolerated distance error, resulting in a matching interval of distance ± error. We set error to 0 . ND1”, or “an atom's charge is not positive”. Syntactically we do this by relating PDB-atoms' variables to constants using “equal” and “not equal” literals:

    equal(+setting, setting),    (4)
    not_equal(+feature, feature).    (5)

feature stands for the atomic features charge, hbond and hydro. In addition to these atomic features, setting includes elem and name.

Aleph keeps learning rules until it has covered all the training positive set, and then it labels a test example as positive if any of the rules cover that example. This has been noted in previous publications to produce a tendency toward giving more false positives [6, 7]. To limit our false positives count, we restrict coverage to a maximum of 5 training-set negatives. Since our approach seeks to validate biological knowledge, we aim for high precision rules. Restricting negative rule coverage also biases generated rules towards high precision.

The Aleph testing set error averaged to 32.5% with a standard deviation of 10 . . , . pos cover − neg cover ≤ 2. Even though Aleph was only looking at atoms, valuable information regarding amino acids can be inferred. For example, ND1 atoms are only present within the amino acid His, and a rule requiring the presence of ND1 is actually requiring His. We present the rules' biochemical translation while replacing specific atoms by the amino acids they imply. The queried site is considered hexose binding if any of these rules apply:

1. It contains a Trp residue and a Glu with an OE1 Oxygen atom that is 8.53 Å away from an Oxygen atom with a negative partial charge (Glu, Asp amino acids, Sulfate, Phosphate, residue C-terminus Oxygen). [Pos cover = 22, Neg cover = 4]
2. It contains a Trp, Phe or Tyr residue, an Asp and an Asn. Asp and Asn's OD1 Oxygen atoms are 5.24 Å apart. [Pos cover = 21, Neg cover = 3]
3. It contains a Val or Ile residue, an Asp and an Asn. Asp and Asn's OD1 Oxygen atoms are 3.41 Å apart. [Pos cover = 15, Neg cover = 0]
4. It contains a hydrophilic non-hydrogen bonding Nitrogen atom (Pro, Arg) at a distance of 7.95 Å from a His's ND1 Nitrogen atom, and 9.60 Å from a Val or Ile's CG1 Carbon atom. [Pos cover = 10, Neg cover = 0]
5. It has a hydrophobic CD2 Carbon atom (Leu, Phe, Tyr, Trp, His), a Pro, and two hydrophilic OE1 Oxygen atoms (Glu, Gln) 11.89 Å apart. [Pos cover = 11, Neg cover = 2]
6. It contains an Asp residue B, two identical atoms Q and X, and a hydrophilic hydrogen-bonding atom K. Atoms K, Q and X have the same charge. B's OD1 Oxygen atom shares the same Y-coordinate with K and the same Z-coordinate with Q. Atoms X and K are 8.29 Å apart. [Pos cover = 8, Neg cover = 0]
7. It contains a Ser residue, and two NE2 Nitrogen atoms (Gln, His) 3.88 Å apart. [Pos cover = 8, Neg cover = 2]
8. It contains an Asn residue and a Phe, Tyr or His residue, whose CE1 Carbon atom is 7.07 Å away from a Calcium ion. [Pos cover = 5, Neg cover = 0]
9. It contains a Lys or Arg, a Phe, Tyr or Arg, a Trp, and a Sulfate or a Phosphate ion. [Pos cover = 3, Neg cover = 0]

Most of these rules closely reproduce current biochemical knowledge. One in particular is novel. We will discuss rule relevance in Section 7.2.
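To make the semantics of the dist literal and of the rule disjunction concrete, the sketch below checks the biochemical translation of rule 3 against a toy site. It is a Python illustration, not the Prolog that Aleph induces: the dictionary layout, the `res` field (Aleph itself only sees atoms), the function names and the `error` value passed in are all assumptions.

```python
import math
from itertools import product

def dist_matches(p, q, distance, error):
    """Check mode of dist: is |pq| inside the interval distance +/- error?"""
    return abs(math.dist(p, q) - distance) <= error

def rule3(site_atoms, error):
    """Rule 3: a Val or Ile residue, plus Asp and Asn OD1 oxygens 3.41 A apart."""
    has_val_or_ile = any(a["res"] in ("VAL", "ILE") for a in site_atoms)
    asp_od1 = [a for a in site_atoms if a["res"] == "ASP" and a["name"] == "OD1"]
    asn_od1 = [a for a in site_atoms if a["res"] == "ASN" and a["name"] == "OD1"]
    close_pair = any(dist_matches(a["xyz"], b["xyz"], 3.41, error)
                     for a, b in product(asp_od1, asn_od1))
    return has_val_or_ile and close_pair

def bind(site_atoms, rules, error):
    """A site is predicted hexose-binding if any rule covers it."""
    return any(rule(site_atoms, error) for rule in rules)

# Toy site with made-up coordinates.
site = [
    {"res": "VAL", "name": "CG1", "xyz": (6.0, 1.0, 0.0)},
    {"res": "ASP", "name": "OD1", "xyz": (0.0, 0.0, 0.0)},
    {"res": "ASN", "name": "OD1", "xyz": (3.41, 0.0, 0.0)},
]
predicted = bind(site, [rule3], error=0.1)
```

The other eight rules would follow the same pattern, each as one function in the `rules` list, which mirrors how a test example is labeled positive as soon as a single learned clause covers it.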

We evaluate our performance by comparing Aleph to several baseline machine learning classifiers.

Unlike Aleph, the implemented baseline algorithms require a constant-length feature vector input. We change our binding-site representation accordingly. We subdivide the binding-site sphere into concentric shells as suggested by Bagley and Altman [1]. Nassif et al. [19] subdivided the sphere into 8 layers centered at the binding-site centroid. The first layer had a width of 3 Å and the subsequent 7 layers were 1 Å each. Their results show that the layers covering the first 5 Å, the subsequent 3 Å and the last 2 Å share several attributes. We thereby subdivide our binding-site sphere into 3 concentric layers, with layer width of

p of each residue category. We categorize the residue features into “low”, “normal” and “high”. A residue category feature is mapped to “normal” if its percentage is within 2 × √p of the expected value p. It is mapped to “low” if it falls below, and to “high” if it exceeds the cut-off. Table 4 accounts for the different residue categories, their expected percentages, and the cut-off values that serve as mapping boundaries. Given a binding site, our algorithm computes the percentage of amino acids of each group present in the sphere, and records its nominal value. We ignore the concentric layers, since a single residue can span several layers.

Table 4. Residue grouping scheme, expected percentage, and mapping boundaries

Residue Category  Amino Acids                         Expected Percentage  Lower Bound  Upper Bound
Aromatic          His, Phe, Trp, Tyr
                  Ala, Ile, Leu, Met, Val
                  Asn, Cys, Gln, Gly, Pro, Ser, Thr
                  Asp, Glu
                  Arg, Lys

The final feature vector is a concatenation of the atomic and residue features. It contains the total number of atoms and the atomic property fractions for each layer, in addition to the residue features. It totals 27 continuous and 5 nominal features.
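The mapping of a residue category percentage to a nominal value can be sketched as below. The function name is hypothetical, the boundaries are treated as inclusive (an assumption, since the text only says "within"), and the expected percentage of 25 used in the example is a made-up value rather than one taken from Table 4.

```python
import math

def residue_level(percentage, expected):
    """Map a residue category percentage to "low", "normal" or "high".

    "normal" means within 2 * sqrt(expected) of the expected percentage.
    """
    margin = 2 * math.sqrt(expected)
    if percentage < expected - margin:
        return "low"
    if percentage > expected + margin:
        return "high"
    return "normal"

# Hypothetical category with an expected percentage of 25:
# margin = 2 * sqrt(25) = 10, so the "normal" band is [15, 35].
levels = [residue_level(p, 25) for p in (10, 15, 25, 35, 40)]
```

Applying this to each of the five residue categories yields the 5 nominal features of the final vector.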

This section details our implementation and parametrization of the baseline algorithms. Refer to Mitchell [18] and Duda et al. [8] for a complete description of the algorithms.

k-Nearest Neighbor

The scale of the data has a direct impact on k-Nearest Neighbor's (kNN) classification accuracy. A feature with a high data mean and small variance will a priori influence classification more than one with a small mean and high variance, regardless of their discrimination power [8]. In order to put equal initial weight on the different features, the data is standardized by scaling and centering.

Our implementation handles nominal values by mapping them to ordinal numbers. It uses the Euclidean distance as a distance function. It chooses the best k via a leave-one-out tuning method. Whenever two or more k's yield the same performance, it adopts the larger one. If two or more examples are equally distant from the query, and all may be the kth nearest neighbor, our implementation randomly chooses. On the other hand, if a decision output tie arises, the query is randomly classified.

We also implement feature backward-selection (BS kNN) using the steepest-ascent hill-climbing method. For a given feature set, it removes one feature at a time and performs kNN. It adopts the trial leading to the smaller error. It repeats this cycle until removing any additional feature increases the error rate. This implementation is biased towards shorter feature sets, going by Occam's razor principle.
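The kNN choices described above (standardization, random tie-breaking, leave-one-out selection of k with preference for the larger k) can be sketched in plain Python as follows. This is an illustrative reconstruction, not the authors' code: the function names, the use of the population standard deviation, and the toy data are assumptions.

```python
import math
import random
from collections import Counter

def standardize(rows):
    """Center each feature at 0 and scale it to unit standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # "or 1.0" guards against constant features (zero deviation).
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def knn_predict(train_x, train_y, query, k, rng):
    """Majority vote among the k nearest neighbors, ties broken at random."""
    order = list(range(len(train_x)))
    rng.shuffle(order)  # the stable sort below keeps this order for equal distances
    order.sort(key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    top = max(votes.values())
    return rng.choice(sorted(lab for lab, n in votes.items() if n == top))

def best_k(xs, ys, candidates, seed=0):
    """Leave-one-out tuning of k; equal error counts favor the larger k."""
    rng = random.Random(seed)
    best, best_err = None, None
    for k in sorted(candidates):
        errors = sum(
            knn_predict(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], xs[i], k, rng) != ys[i]
            for i in range(len(xs)))
        if best_err is None or errors <= best_err:  # "<=" keeps the larger k on ties
            best, best_err = k, errors
    return best

# Toy data: two well-separated clusters of three points each.
xs = standardize([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
ys = [0, 0, 0, 1, 1, 1]
chosen_k = best_k(xs, ys, [1, 3, 5])
```

On this toy set both k = 1 and k = 3 make no leave-one-out errors, so the larger value is retained, while k = 5 misclassifies every point.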

Naive Bayes

Our Naive Bayes (NB) implementation uses a Laplacian smoothing function. It assumes that continuous features, for each output class, follow the Gaussian distribution. Let X be a continuous feature to classify and Y the class. To compute P(X|Y), it first calculates the normal z-score of X given Y using the Y-training set's mean µ_Y and standard deviation s_Y: z_Y = (x − µ_Y)/s_Y. It then converts the z-score into a [0, 1] number by integrating the portions of the normal curve that lie outside ±z. We use this number to approximate P(X|Y). This method returns 1 if X = µ, and decreases as X steps away from µ:

\[ P(X \mid Y) = \int_{|z_Y|}^{\infty} \mathrm{normalCurve} + \int_{-\infty}^{-|z_Y|} \mathrm{normalCurve} \qquad (6) \]

Decision Trees

Our Decision Tree implementation uses information gain as a measure for the effectiveness of a feature in classifying the training data. We incorporate continuous features by dynamically defining new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals. We prune the resulting tree using a tuning set. We report the results of both pruned (Pr DT) and unpruned decision trees (DT).
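For reference, the two-tailed area of Equation (6) in the Naive Bayes description has a closed form through the complementary error function, since the area of the standard normal curve outside ±|z| equals erfc(|z|/√2). The sketch below uses that identity; the function name and the sample values are illustrative assumptions.

```python
import math

def two_tailed_likelihood(x, mean, std):
    """Approximate P(X|Y) as the standard-normal area lying outside +/- |z|."""
    z = (x - mean) / std
    return math.erfc(abs(z) / math.sqrt(2))

at_mean = two_tailed_likelihood(5.0, 5.0, 2.0)         # z = 0
near_tail = two_tailed_likelihood(5.0 + 1.96 * 2.0, 5.0, 2.0)  # z = 1.96
```

The value is 1 when x equals the class mean and shrinks toward 0 as x moves away from it, which matches the behavior stated for Equation (6).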

Perceptron

Our perceptron (Per) implementation uses linear units and performs a stochastic gradient descent. It is therefore similar to a logistic regression. It automatically adjusts the learning rate, treats the threshold as another weight, and uses a tuning set for early stopping to prevent overfitting. We limit our runs to a maximum of 1000 epochs.

Sequential Covering

Sequential Covering (SC) is a propositional rules learner that returns a set of disjunctive rules covering a subset of the positive examples. Our implementation uses a greedy approach. It starts from the empty set and greedily adds the best attribute that improves rule performance. It discretizes continuous attributes using the same method as Decision Trees. It sets the rule coverage threshold to 4 positive examples and no negative examples. The best attribute to add is the one maximizing:

\[ |\mathrm{entropy}(parent) - \mathrm{entropy}(child)| \times \mathrm{numberOfPositives}(child) \qquad (7) \]

We apply the same 10-fold cross-validation to Aleph and all the baseline classifiers. Table 5 tabulates the error percentage per testing fold, the mean, standard deviation and the 95% level confidence interval for each classifier.

Table 5. 10-fold cross-validation error (%) per testing fold, with the mean, standard deviation and 95% confidence interval, for kNN, BS kNN, NB, DT, Pr DT, Per, SC and Aleph

Our SC implementation learns a propositional rule that covers at least 4 positive and no negative examples. It then removes all positive examples covered by the learned rule. It repeats the process using the remaining positive examples. Running SC over the whole data set generates the following rules, sorted by coverage. Together they cover 63 positives out of 80. A site is hexose-binding if any of these rules apply:

1. If layer 1 negatively charged atoms density > . and layer 2 positively charged atoms density < . and layer 3 negatively charged atoms density > .
2. If layer 1 non hydrogen-bonding atoms density < . and layer 1 hydrophobic atoms density > . and layer 3 hydrophilic atoms density > .
3. If layer 1 negatively charged atoms density > . and layer 1 hydroneutral atoms density < . and layer 1 non hydrogen-bonding atoms density > . and layer 3 atoms number < .
4. If layer 1 negatively charged atoms density > . and layer 2 atoms number > . and layer 1 negatively charged atoms density < .

Despite its average performance, the main advantage of ILP is the insight it provides into the underlying discrimination process.
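The attribute score that Sequential Covering maximizes, Equation (7), can be illustrated with the short sketch below. It assumes binary class entropy in bits and represents a rule's coverage as a (positives, negatives) pair; both choices, like the function names, are assumptions made for illustration.

```python
import math

def entropy(pos, neg):
    """Binary class entropy (in bits) of a set with pos and neg examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def attribute_score(parent, child):
    """Equation (7): |entropy(parent) - entropy(child)| * positives(child).

    parent and child are (positives, negatives) covered before and after
    the candidate attribute test is added to the rule.
    """
    return abs(entropy(*parent) - entropy(*child)) * child[0]

# A candidate test that narrows 8 positives and 8 negatives to 4 positives only.
score = attribute_score((8, 8), (4, 0))
```

Weighting the entropy change by the number of positives still covered favors attribute tests that purify the rule without shrinking its positive coverage too much.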

Aleph's error rate of 32.5% has a p -value < . . . , . , . . 09% error while Nassif et al. [19] glucose-binding site classifier reports an error of 8 .

Contrary to black-box classifiers, ILP provides a number of interesting insights. It infers most of the established biochemical information about residues and relations just from the PDB-atom names and properties. We hereby interpret Aleph's rules detailed in Section 5.

Table 6. Error rates achieved by general and specific sugar binding site classifiers. Not meant as a direct comparison since the data sets are different.

Program                          Error (%)  Method and Data set
General sugar binding sites classifiers
ILP hexose predictor             32.50      10-fold cross-validation, 80 hexose and 80 non-hexose or non-binding sites
Shionyu-Mitsuyama et al. [23]    31.00      Test set, 61 polysaccharide binding sites
Taroni et al. [29]               35.00      Test set, 40 carbohydrate binding sites
Malik and Ahmad [17]             39.00      Leave-one-out, 40 carbohydrate and 116 non-carbohydrate binding sites
Specific sugar binding sites classifiers
COTRAN [27]                      5.09       Overall performance over 6 folds, totaling 106 galactose and 660 non-galactose binding sites
Nassif et al. [19]               8.11       Leave-one-out, 29 glucose and 35 non-glucose or non-binding sites

Rules 1, 2, 5, 8 and 9 rely on the aromatic residues Trp, Tyr and Phe. This highlights the docking interaction between the hexose and the aromatic residues [17, 27, 28]. The aromatic residues stack against the apolar sugar pyranose ring, which stabilizes the bound hexose. His is mentioned in many of the rules, alongside other aromatics (5, 8) or on its own (4, 7). Histidine provides a docking mechanism similar to that of Trp, Tyr and Phe [20].

All nine rules require the presence of a planar polar residue (Asn, Asp, Gln, Glu, Arg). These residues have been identified as the most frequently involved in the hydrogen-bonding of hexoses [20]. The hydrogen bond is probably the most relevant interaction in protein binding in general.

Rules 1, 2, 3, 5 and 6 call for acidic residues with a negative partial charge (Asp, Glu), or for atoms with a negative partial charge. The relatively high negative density observed may be explained by the dense hydrogen-bond network formed by the hexose hydroxyl groups.

Some rules require hydrophobic atoms and residues, while others require hydrophilic ones. Rule 5 requires both and reflects the dual nature of sugar docking, composed of a polar-hydrophilic aspect establishing hydrogen bonds and a hydrophobic aspect responsible for the pyranose ring stacking [29].

A high residue-sugar propensity value reflects a high tendency of that residue to be involved in sugar binding. The residues having high propensity values are the aromatic residues, including histidine, and the planar polar residues [29]. This fact is reflected by the recurrence of high propensity residues in all rules.

Rules 8 and 9 require, and rule 1 is satisfied by, the presence of different ions (Calcium, Sulfate, Phosphate), confirming the relevance of ions in hexose binding [20].

Rule 6 specifies a triangular conformation of three atoms within the binding site. This highlights the relevance of the binding site's spatial features. On the other hand, we note the absence of the site's centroid literal from the resulting Aleph rules. The center is merely a geometric parameter and does not have any functional role. In fact, the binding site center feature was not used in most computational classifying approaches. Taroni et al. [29] and Malik and Ahmad [17] ignore it, Shionyu-Mitsuyama et al. [23] use the pyranose ring C Phe/Tyr and Asn/Asp in the Lectin protein family. This dependency is reflected in rules 2 and 8. Similarly, rule 1 suggests a dependency between Trp and Glu, a link not previously identified in literature. This novel relationship merits further investigation and highlights the rule-discovery potential of ILP.

In addition to providing a basis for comparison with Aleph, the baseline algo-rithms shed additional light on our data set and hexose binding site properties.Naive Bayes and Perceptron return the highest mean error rates, 35 . k NN algorithm, with the lowest meanerror rate (25 . k NN’s good performance further highlights both thecorrelation between our features, and the data’s non-linearity.Like Aleph, Sequential Covering’s rules provide insight into the discriminat-ing process. Unlike Aleph’s ﬁrst-order logic rules, propositional rules are lessexpressive and reﬂect a smaller number of biochemical properties. We herebyinterpret SC’s rules detailed in Section 6.3.Although SC uses an explicit representation of residues, it completely ignoresthem in the retained rules. Only atomic biochemical features inﬂuence the pre-diction. This may be due to the fact that it is the binding-site’s atoms, rather itle Suppressed Due to Excessive Length 15 than overall residues, that bind to and stabilize the docked hexose. These atomsmay not be mapped to conserved speciﬁc residues.Another general ﬁnding is that most rule antecedents are layer 1 features.This reﬂects the importance of the atoms lining the binding-site, which establishdirect contact with the docking hexose. Layers 2 and 3 are farther away and hencehave weaker interaction forces.Only four amino acid atoms have a partial negative charge in our represen-tation, in addition to the infrequent Sulfate and Phosphate Oxygens [19]. Theﬁrst rule, covering most of the positive examples, clearly suggests a binding-sitewith a high density of negatively charged atoms. The ﬁrst and third antecedentsexplicitly specify layers with a negatively charged atomic density above somethresholds. The second one implicitly states so by opting for a non-positivelycharged layer. The relative high negative density observed may be explained bythe dense hydrogen-bond network formed by the hexose hydroxyl groups [20].The fourth rule is similar. 
It imposes a bound on the first layer's negative charge, between 0 . .

In this work, we present the first attempt to model and predict hexose binding sites using ILP. We investigate the empirical support for biochemical findings by comparing Aleph-induced rules to actual biochemical results. Our ILP system achieves an accuracy similar to that of other general protein-sugar binding-site black-box classifiers, while offering insight into the discriminating process. With no prior biochemical knowledge, Aleph was able to induce most of the known hexose-protein interaction biochemical rules, with a performance that is not significantly different from that of several baseline algorithms. In addition, ILP finds a previously unreported dependency between Trp and Glu, a novel relationship that merits further investigation.

Acknowledgments.

This work was partially supported by US National Institutes of Health (NIH) grant R01CA127379-01. We would like to thank Jose Santos for his inquiries that led us to improve our Aleph representation.

References

1. Bagley, S.C., Altman, R.B.: Characterizing the microenvironment surrounding protein sites. Protein Science 4(4), 622–635 (1995)
2. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E.: The protein data bank. Nucleic Acids Research 28(1), 235–242 (2000)
3. Betts, M.J., Russell, R.B.: Amino acid properties and consequences of substitutions. In: Barnes, M.R., Gray, I.C. (eds.) Bioinformatics for Geneticists, pp. 289–316. John Wiley & Sons, West Sussex, UK (2003)
4. Bobadilla, L., Nino, F., Narasimhan, G.: Predicting and characterizing metal-binding sites using Support Vector Machines. In: Proceedings of the International Conference on Bioinformatics and Applications. pp. 307–318. Fort Lauderdale, FL (2004)
5. Chakrabarti, R., Klibanov, A.M., Friesner, R.A.: Computational prediction of native protein ligand-binding and enzyme active site sequences. Proceedings of the National Academy of Sciences of the United States of America 102(29), 10153–10158 (2005)
6. Davis, J., Burnside, E.S., de Castro Dutra, I., Page, D., Ramakrishnan, R., Santos Costa, V., Shavlik, J.: View Learning for Statistical Relational Learning: With an application to mammography. In: Proceedings of the 19th International Joint Conference on Artificial Intelligence. pp. 677–683. Edinburgh, Scotland (2005)
7. Davis, J., Burnside, E.S., de Castro Dutra, I., Page, D., Santos Costa, V.: An integrated approach to learning Bayesian Networks of rules. In: Proceedings of the 16th European Conference on Machine Learning. pp. 84–95. Porto, Portugal (2005)
8. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley-Interscience, New York, second edn. (2001)
9. Finn, P., Muggleton, S., Page, D., Srinivasan, A.: Pharmacophore discovery using the Inductive Logic Programming system PROGOL. Machine Learning 30(2-3), 241–270 (1998)
10. Fox, M.A., Whitesell, J.K.: Organic Chemistry. Jones & Bartlett Publishers, Boston, MA, 3rd edn. (2004)
11. García-Hernández, E., Zubillaga, R.A., Chavelas-Adame, E.A., Vázquez-Contreras, E., Rojo-Domínguez, A., Costas, M.: Structural energetics of protein-carbohydrate interactions: Insights derived from the study of lysozyme binding to its natural saccharide inhibitors. Protein Science 12(1), 135–142 (2003)
12. Gilis, D., Massar, S., Cerf, N.J., Rooman, M.: Optimality of the genetic code with respect to protein stability and amino-acid frequencies. Genome Biology 2(11), research0049 (2001)
13. Gold, N.D., Jackson, R.M.: Fold independent structural comparisons of protein-ligand binding sites for exploring functional relationships. Journal of Molecular Biology 355(5), 1112–1124 (2006)
14. Guex, N., Peitsch, M.C.: SWISS-MODEL and the Swiss-PdbViewer: An environment for comparative protein modeling. Electrophoresis 18(15), 2714–2723 (1997)
15. Kadirvelraj, R., Foley, B.L., Dyekjær, J.D., Woods, R.J.: Involvement of water in carbohydrate-protein binding: Concanavalin A revisited. Journal of the American Chemical Society 130(50), 16933–16942 (2008)
16. Khuri, S., Bakker, F.T., Dunwell, J.M.: Phylogeny, function and evolution of the cupins, a structurally conserved, functionally diverse superfamily of proteins. Molecular Biology and Evolution 18(4), 593–605 (2001)
17. Malik, A., Ahmad, S.: Sequence and structural features of carbohydrate binding in proteins and assessment of predictability using a Neural Network. BMC Structural Biology 7, 1 (2007)
18. Mitchell, T.M.: Machine Learning. McGraw-Hill International Editions, Singapore (1997)
19. Nassif, H., Al-Ali, H., Khuri, S., Keirouz, W.: Prediction of protein-glucose binding sites using Support Vector Machines. Proteins: Structure, Function, and Bioinformatics 77(1), 121–132 (2009)
20. Quiocho, F.A., Vyas, N.K.: Atomic interactions between proteins/enzymes and carbohydrates. In: Hecht, S.M. (ed.) Bioorganic Chemistry: Carbohydrates, chap. 11, pp. 441–457. Oxford University Press, New York (1999)
21. Rao, V.S.R., Lam, K., Qasba, P.K.: Architecture of the sugar binding sites in carbohydrate binding proteins—a computer modeling study. International Journal of Biological Macromolecules 23(4), 295–307 (1998)
22. Santos Costa, V.: The life of a logic programming system. In: de la Banda, M.G., Pontelli, E. (eds.) Proceedings of the 24th International Conference on Logic Programming. pp. 1–6. Udine, Italy (2008)
23. Shionyu-Mitsuyama, C., Shirai, T., Ishida, H., Yamane, T.: An empirical approach for structure-based prediction of carbohydrate-binding sites on proteins. Protein Engineering 16(7), 467–478 (2003)
24. Solomon, E., Berg, L., Martin, D.W.: Biology. Brooks Cole, Belmont, CA, 8th edn. (2007)
25. Srinivasan, A.: The Aleph Manual, 4th edn. (2007),