A clustering-based biased Monte Carlo approach to protein titration curve prediction
Arun V. Sathanur
Pacific Northwest National Laboratory, Seattle, WA ([email protected])
Nathan A. Baker
Pacific Northwest National Laboratory, Richland, WA ([email protected])
Abstract—In this work, we developed an efficient approach to compute ensemble averages in systems with pairwise-additive energetic interactions between the entities. Methods involving full enumeration of the configuration space result in exponential complexity. Sampling methods such as Markov chain Monte Carlo (MCMC) algorithms have been proposed to tackle the exponential complexity of these problems; however, in certain scenarios where significant energetic coupling exists between the entities, the efficiency of such algorithms can be diminished. We used a strategy to improve the efficiency of MCMC by taking advantage of the cluster structure in the interaction energy matrix to bias the sampling. We pursued two different schemes for the biased MCMC runs and show that they are valid MCMC schemes. We used both synthesized and real-world systems to show the improved performance of our biased MCMC methods when compared to the regular MCMC method. In particular, we applied these algorithms to the problem of estimating protonation ensemble averages and titration curves of residues in a protein.
Index Terms—Markov Chain Monte Carlo (MCMC), Ensemble Averages, Energy Minimization, Discrete Optimization, Protein Titration
I. INTRODUCTION
Machine learning (ML) is playing an increasingly large role in helping shape our understanding of complex biological phenomena through a number of related disciplines such as biophysics, biochemistry, molecular dynamics, and bioinformatics, to name a few [1]–[3]. One recurring problem in many of these disciplines is the assignment of states to pairwise-interacting entities so as to minimize the total energy of the system and to compute the ensemble averages of properties that govern the behavior of the system. The brute-force complexity of such problems is k^N, where k is the number of possible states per entity and N is the total number of entities. The interactions of these N entities can be described by a connected graph, and the discrete optimization moves cannot be performed in isolation. One such problem involves titration state assignment to the amino acid residues in a protein.

The protein titration problem involves prediction of the protonation ensemble averages for each residue as a function of pH; residue pKa values can be derived from the resulting function [4]. When residues are found in a protein, their pKa values can vary significantly from isolated "ideal" residues due to the influence of the interacting protein residues. The energy function G(P) in Equation 1 is represented as the sum of the intrinsic energies of the titratable residues and the pairwise interaction terms between residues residing on a rigid backbone. The interaction energy terms depend on the titration states, where we assume that each titratable residue has one of two states: protonated or deprotonated.

    G(P) = Σ_{i=1}^{N} γ_i ln(10) kT (pH − pK_a,i^intr) + Σ_{i=1}^{N} Σ_{j=1, j<i}^{N} γ_i γ_j U_int(P_i, P_j)    (1)

Here N is the total number of titratable residues, P_i denotes whether residue i is protonated or deprotonated, and U_int(P_i, P_j) denotes the interaction energy between residues i and j based on the states P_i and P_j.
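As a concrete illustration, Equation 1 can be evaluated directly for a given configuration. The sketch below is our own: it assumes energies in units of kT (so the temperature prefactor reduces to ln(10)) and a symmetric interaction matrix with zero diagonal.

```python
import numpy as np

def total_energy(gamma, pH, pKa_int, U_int):
    """Evaluate Eq. (1) for one configuration, in units of kT (assumed).

    gamma   : length-N array of 0/1 charge indicators
    pKa_int : intrinsic pKa value per residue
    U_int   : symmetric N x N interaction matrix with zero diagonal
    """
    intrinsic = np.sum(gamma * np.log(10) * (pH - pKa_int))
    # gamma @ U_int @ gamma counts each i != j pair twice, hence the 0.5
    pairwise = 0.5 * gamma @ U_int @ gamma
    return intrinsic + pairwise
```

When the pH equals every intrinsic pKa, the intrinsic terms vanish and only the pairwise interaction contribution remains.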
P denotes the N-dimensional vector that describes the system configuration; the i-th element of this vector denotes the state of the i-th residue. The value pK_a,i^intr is the intrinsic pKa of residue i, and γ_i indicates whether the residue is charged (γ_i = 1) or uncharged (γ_i = 0).

For a property that is a function f of the configuration vector P, the quantity of interest is the ensemble average

    f̄ = Σ_m f(P_m) exp(−βG(P_m)) / Σ_m exp(−βG(P_m)),    (2)

where the sum runs over the configurations P_m and β is a constant given by the reciprocal of the product of the Boltzmann constant and the temperature on an absolute scale. The protein titration curve prediction is the value of γ̄_i as a function of pH. This work is concerned with an efficient and accurate estimation of the titration curves for all titratable residues in a protein where the interaction energy matrix U_int has a good cluster structure.

This work includes the following contributions:
• We propose a biased MCMC scheme to efficiently estimate ensemble averages by leveraging the cluster structure of the interaction energy matrix.
• Through theoretical analyses, we justify our sampling schemes and show why the proposed schemes represent valid MCMC runs.
• Using experiments on synthetic and real-world systems, we show improvement over MCMC techniques that do not exploit the cluster structure.

Section II describes the related work in estimating ensemble averages for the protein titration problem. Section III describes the standard MCMC procedure and introduces our biased MCMC approach. Section IV provides details on the algorithms and demonstrates that these are valid MCMC methods. Section V describes experiments on synthetic protein-residue networks and also explores the effect of clustering quality on the performance of the algorithm. Section VI then details the experiments on real-world datasets. Section VII concludes the paper with pointers on future work.

II. RELATED WORK
Tanford and Roxby [5] developed a mean-field approach to compute titration curves, approximating interactions by an effective average quantity to reduce computational cost. The mean-field approach works well when the interaction energy terms are small compared to the intrinsic terms; it therefore performs poorly in the presence of strong energetic interactions between titration sites. Bashford and Karplus developed an improved reduced-site approximation [6] by eliminating those residues that are almost always protonated or deprotonated at a given pH, thereby improving the computational performance. Related approaches use cluster structures within a protein-residue system and other approximations to the direct calculation of the partition function [7], [8]. Alternatively, methods such as PROPKA simplify several steps, such as the computation of the difference in solvation free energy, by using simplified energy calculations and heuristic methods for pKa estimates [9]–[11].

MCMC approaches to titration state prediction naturally incorporate interactions between the residues through the total energy calculation. Slow MCMC convergence is driven by several factors, particularly the presence of energetically coupled clusters of titratable residues whose states change together [12]. Some authors [12], [13] have addressed this convergence problem by allowing pairs and triplets of sites to simultaneously change their states. Our approach builds on this idea and extends it to arbitrary clusters as expressed by the data.

Finally, the authors of a recent work [4] use a computer-vision-inspired graph-cut optimization algorithm which finds lowest-energy titration states in a computationally efficient manner.
The authors use the optimum state to compute a mean-field titration curve, similar to the methods used by Tanford and Roxby [5], and subject to the limitations described above. This approach requires a submodularity property of the energy interactions, which may not be satisfied by all the residues. Residues with energy interactions that are not submodular must have states assigned using brute-force or MCMC approaches.

III. ENSEMBLE AVERAGE CALCULATIONS AND THE MCMC ALGORITHM
The protein titration problem described above closely resembles the popular Ising model that has been extensively studied [14]. Ensemble averages of system properties can be calculated by a traditional MCMC algorithm where the system starts with a random configuration and, at each move, one residue is selected at random. The energy difference is calculated by flipping the protonation state of the selected residue. The move is accepted with a probability equal to min(1, e^{−βΔG}), where ΔG is the energy difference for the state change. Thus, configurations leading to lower energy are always accepted, while those resulting in an increase in energy are accepted with a probability less than 1. The Metropolis algorithm is first run for a number of burn-in steps, allowing the system to equilibrate; following that, sampling is performed to compute ensemble averages.

Previous authors [12] have noted that the traditional MCMC scheme of changing states one residue at a time may, in certain cases, never sample states that require multiple transitions. Therefore, they introduced moves that allowed changing two states at a time. We generalized this idea to exploit the cluster structure present in protein residue interaction networks [15]. While our focus was protein residue networks, the approach is general enough to be applied to any Ising-like problem where the interaction energy matrix has cluster structure. The basic idea is to decompose the set of titratable residues into a set of clusters based on the interaction energies. Residues within each of the clusters have a stronger coupling with each other than with residues belonging to different clusters.
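A minimal sketch of this traditional single-flip Metropolis sampler, written against an arbitrary energy callable; the function name, arguments, and defaults are our own choices:

```python
import numpy as np

def metropolis_mcmc(energy, N, n_burn, n_sample, beta=1.0, seed=0):
    """Single-site-flip Metropolis: estimate per-residue protonation fractions."""
    rng = np.random.default_rng(seed)
    P = rng.integers(0, 2, size=N)            # random initial configuration
    E = energy(P)
    counts = np.zeros(N)
    for step in range(n_burn + n_sample):
        i = rng.integers(N)                   # pick one residue at random
        P[i] ^= 1                             # flip its protonation state
        E_new = energy(P)
        dG = E_new - E
        if dG <= 0 or rng.random() < np.exp(-beta * dG):
            E = E_new                         # accept with prob min(1, e^{-beta dG})
        else:
            P[i] ^= 1                         # reject: undo the flip
        if step >= n_burn:
            counts += P                       # accumulate after burn-in
    return counts / n_sample
```

For a non-interacting energy that strongly favors protonation, the estimated fractions approach the corresponding Boltzmann probabilities.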
Allowing correlated state changes in residues that form strongly coupled clusters can avoid visiting energetically infeasible states and hence can accelerate the convergence of the MCMC runs.

Using the matrix of the pairwise interaction energies U_int, we first decomposed the set of N titratable residues into l (energetically strong) clusters C_1, C_2, ..., C_l with n_1, n_2, ..., n_l members each. We require that n_i ≥ 2 for 1 ≤ i ≤ l, i.e., there are at least two residues per cluster. The remainder of the residues, which are not strongly coupled with each other, form a separate weak cluster C_{l+1} with n_{l+1} = N − Σ_{i=1}^{l} n_i. We used brute-force enumeration on each of these l strong clusters, calculating the k_r lowest-energy configurations for each cluster C_r and storing them. We describe in Section IV how C_{l+1} is treated differently from the energetically strong clusters C_i, 1 ≤ i ≤ l.

IV. DETAILS OF THE BIASED MCMC METHOD
Our approach initially breaks the joint probability mass function (PMF) of the configuration space into l + 1 blocks of correlated distributions with no correlations between the blocks. This is similar to the clique decomposition of joint distributions in undirected graphical models [16], with the cliques replaced by non-overlapping clusters. This decomposition allows us to cast configuration sampling in terms of independent sub-sequences corresponding to the residues that constitute each of the clusters. Since the individual cluster sizes are not too big (owing to the fact that strong interactions are confined to a small number of nearest neighbors in 3D space [17]), the total number of configurations under consideration in each energetically strong cluster C_r can be restricted to the most probable k_r (or the lowest k_r in terms of energy). A brute-force strategy based on an energy cut-off helps enumerate the sub-sequences of the configuration vectors corresponding to each of the clusters. We sampled the p^{n_{l+1}} configurations of the weakly-coupled cluster C_{l+1} uniformly at random, because the number n_{l+1} could potentially be too large for brute-force enumeration. The ratio of the configuration space sizes explored by the biased MCMC approach (|Ω_b|) against the regular MCMC approach (|Ω_r|) is given by

    |Ω_b| / |Ω_r| = (Π_{r=1}^{l} k_r) p^{N − Σ_{r=1}^{l} n_r} / p^N = Π_{r=1}^{l} (k_r / p^{n_r}).    (3)

A valid state in the reduced configuration space consists of adjacent subsequences corresponding to the clusters C_1 through C_{l+1}. For cluster C_r, where r goes from 1 through l, the subsequence will correspond to one of the top-k_r configurations considered for that cluster. For the last cluster C_{l+1}, the corresponding subsequence will be one of the possible p^{n_{l+1}} configurations sampled randomly as in traditional MCMC approaches.
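The brute-force enumeration of low-energy cluster sub-configurations described above can be sketched as follows. Scoring a sub-configuration by the full energy with every residue outside the cluster fixed at zero is a simplifying choice for illustration, not the paper's prescription:

```python
import itertools
import numpy as np

def lowest_k_states(energy, N, members, k_r):
    """Enumerate all 2^{n_r} sub-configurations of a strong cluster and keep the
    k_r lowest-energy ones. Each sub-configuration is scored by the full energy
    with every residue outside the cluster fixed at zero (a simplifying choice)."""
    def score(sub):
        P = np.zeros(N, dtype=int)
        P[members] = sub
        return energy(P)
    subs = [np.array(b) for b in itertools.product((0, 1), repeat=len(members))]
    subs.sort(key=score)
    return subs[:k_r]
```

The stored lists play the role of the sets from which the biased moves later draw replacement subsequences.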
Although we sample the subsequences from the individual clusters in an independent fashion, we account for the correlations across clusters by computing energies from Equation 1 that include all the cross-cluster interactions needed for accurate energy computation.

A. Satisfying ergodicity and detailed balance conditions
For an MCMC algorithm to sample the configurations from the underlying joint distribution, two conditions need to be satisfied:
• Ergodicity: it should be possible to go from any configuration to any other configuration in a finite number of steps, and
• Detailed balance: the equilibrium rate of transitions into any configuration P_μ must be equal to the rate of transitions out of P_μ.

Let π(P_μ) denote the steady-state (with respect to the Markov chain) probability of the configuration P_μ; π(P_μ) should be the Boltzmann probability under equilibrium conditions. Let T(P_μ, P_ν) denote the proposed transition probability from P_μ to P_ν for the simulated Markov chain as part of the MCMC procedure, and let A(P_μ, P_ν) denote the acceptance probability of this proposed transition. Then the following equation needs to be satisfied for all pairs of states P_μ and P_ν to ensure detailed balance:

    π(P_μ) T(P_μ, P_ν) A(P_μ, P_ν) = π(P_ν) T(P_ν, P_μ) A(P_ν, P_μ).    (4)

If both of the conditions in Equation 5 are satisfied, then the steady-state simulated Markov chain samples the Boltzmann probability:

    T(P_μ, P_ν) = T(P_ν, P_μ),    (5a)
    A(P_μ, P_ν) / A(P_ν, P_μ) = e^{−β(G(P_ν) − G(P_μ))}.    (5b)

There are multiple ways to design the transitions in our approach; we considered two schemes. Approach 1 first chooses a cluster uniformly at random and then chooses a sub-sequence configuration within it, again uniformly at random. Approach 2 chooses sub-sequence configurations uniformly at random in all the clusters.

B. Approach 1
In this approach, the transition consists of two steps:
• First, choose a cluster C_r at random.
• If r ∈ [1, l], perturb the sub-sequence corresponding to C_r by choosing a configuration uniformly at random from the (k_r − 1) configurations excluding the current sub-sequence. If r = l + 1, choose a residue at random in the cluster C_{l+1} and flip its state.

Consider a valid initial state P_μ for a biased MCMC run. By using one perturbation as described above, we end up in an overall system configuration denoted by P_ν. The transition probability for this can easily be seen to be

    T(P_μ, P_ν) = T(P_ν, P_μ) = 1 / ((l + 1)(k_r − 1))   if r ∈ [1, l],
                              = 1 / ((l + 1) n_{l+1})    if r = l + 1.    (6)

This demonstrates the symmetric nature of the transition proposal probability for the chain. The Markov chain is ergodic because there is a finite probability of choosing any sub-sequence configuration in clusters 1 through l and a finite probability of reaching any sub-sequence configuration in cluster l + 1 in a finite number of steps. Note that we exclude the current sub-sequence so that the probability of the self-transition (P_μ → P_μ) is 0. Thus Approach 1 represents a valid MCMC procedure.

C. Approach 2
In the second approach, an overall transition is obtained by applying transitions to all the l + 1 subsequences corresponding to the identified clusters:
• For each cluster C_r where r ∈ [1, l], perturb the sub-sequence corresponding to C_r by choosing a configuration uniformly at random from the k_r configurations.
• For the cluster C_{l+1}, choose a residue at random in the cluster and flip its state.

Using similar arguments as in the case of Approach 1, we can show that Approach 2 also represents a valid MCMC procedure.

D. A note on the clustering methods
A weighted interaction graph is defined by the absolute value of the interaction energy matrix; this graph is partitioned into mutually exclusive clusters. There are two major challenges in this step. First, the number of cut edges (the entries that straddle neighboring clusters) needs to be minimized, since minimizing the number of cut edges allows for better sampling accuracy. Second, the size of each partition should be balanced as much as possible in order to achieve computational efficiency. To address both these issues, we used the hierarchical graph partitioning algorithm METIS [18]. The user must provide the number of partitions (l) as a hyper-parameter; the weak cluster C_{l+1} is not used in this clustering strategy. Other clustering strategies, such as the connected-components method [15], can use the weak cluster C_{l+1}.

E. The full biased MCMC algorithm
This section outlines the full algorithm to execute the biased MCMC approaches. The version of the algorithm we present uses Approach 1 for the perturbation of the states within the MCMC procedure and the METIS algorithm to determine the set of clusters. We used an adaptive strategy to determine the values of k_r per cluster based on a cut-off value, such as retaining the energy states up to k_B T above the energy of the lowest state in that cluster (Algorithm 1).
Algorithm 1: Biased MCMC
Input:
    pHVals: list of pH values
    pK_a^intr, U_int: self and interaction energies
    l: number of clusters desired (optional)
    k_r: number of lowest-energy states in C_r (optional)
    N_1, N_2: burn-in / sampling steps
Output:
    f̄_i for i = 1 to N: average protonation fractions

Construct a weighted, undirected graph G with residues as nodes
Set the weight W_ij between nodes i and j to |U_int(i, j)|
Use a clustering algorithm to decompose G into a set of l clusters
for pH in pHVals do
    S_r ⇐ lowest k_r energy configurations in C_r via enumeration
    P ⇐ [P_1, P_2, ..., P_r, ..., P_l], where P_r is chosen uniformly at random from S_r
    for i = 1 to N do f̄_i(pH) = 0
    for step = 1 to N_1 + N_2 do
        Choose cluster C_r uniformly from the set of l clusters
        Sample P'_r uniformly at random from S_r − {P_r}
        P' ⇐ [P_1, P_2, ..., P'_r, ..., P_l]  (only the state for C_r is altered)
        a = min(1, exp(−(G(P') − G(P)) / k_B T))
        P = P' with probability a
        if step ≥ N_1 then
            for i = 1 to N do
                if P(i) = 1 then f̄_i(pH) = f̄_i(pH) + 1
    for i = 1 to N do f̄_i(pH) = f̄_i(pH) / N_2

V. EXPERIMENTS ON SYNTHETIC SYSTEMS
In order to assess the biased MCMC algorithm, we first conducted experiments on synthesized but realistic systems of titratable residues. Synthetic systems allow us to vary system properties in a controlled manner. Each system is described by a vector of randomized self energies and a sparse matrix of interaction energies, such that the interaction energy matrix can have a known density (sparsity), a predefined strong and weak cluster structure, and background noise. The first system, with 16 residues, had two well-defined clusters of 8 residues each. The second system, also of 16 residues, had two well-defined clusters of 6 residues each and 4 residues which did not form any meaningful cluster. Visualizations of the interaction energy matrices are shown in the insets of Figures 1 and 2, respectively.

We then compared the average error in the predicted protonation fractions (at a single pH value) over all the titratable residues between regular MCMC and the proposed biased MCMC algorithm using both of the perturbation approaches (A1 and A2 in the figures). We set fixed values for k_r: 16 and 12 for the two systems, respectively. The small sizes of these systems allowed for the exact computation of the partition function and protonation fraction via full enumeration of all possible states. The full enumeration method is used as the ground truth to compare the accuracy of the newly developed biased MCMC method with that of the regular MCMC method. We computed the average protonation error for a varied number of MCMC steps for both methods. In all cases, the burn-in phase of MCMC was fixed at 20% of the total number of steps. Finally, averaging was performed over multiple MCMC runs (30 in our experiments) with different initialization settings.

Fig. 1. Error behavior comparison between regular MCMC and the proposed biased MCMC approach for System 1. The interaction energy matrix is shown in the inset.

Fig. 2. Error behavior comparison between regular MCMC and the proposed biased MCMC approach for System 2. The interaction energy matrix is shown in the inset.
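A generator for synthetic systems of this kind might look like the following; the coupling magnitudes, pKa range, and noise level are illustrative assumptions, not the values used for Systems 1 and 2:

```python
import numpy as np

def synthetic_system(n_residues, clusters, strong=4.0, noise=0.2, seed=0):
    """Random intrinsic pKa values plus a symmetric interaction matrix with
    planted strong clusters over weak background noise."""
    rng = np.random.default_rng(seed)
    pKa = rng.uniform(3.0, 11.0, size=n_residues)       # randomized self energies
    U = noise * rng.standard_normal((n_residues, n_residues))
    for members in clusters:                            # plant strong intra-cluster coupling
        for i in members:
            for j in members:
                if i != j:
                    U[i, j] = strong * rng.uniform(0.5, 1.0)
    U = 0.5 * (U + U.T)                                 # enforce symmetry
    np.fill_diagonal(U, 0.0)
    return pKa, U
```

The planted blocks dominate the off-block noise, so the cluster structure is recoverable by a graph partitioner applied to |U|.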
For both systems and both approaches A1 and A2, we observed that the biased MCMC method provides a significant reduction in the error over the regular MCMC method, as seen in Figures 1 and 2, with consistently better performance from approach A1. When compared across the two systems, we observe that the performance of the biased MCMC method slightly worsened for System 2 (with 2 strong clusters and a weak cluster) as compared to System 1 (which has only 2 strong clusters). This is potentially attributable to the presence of the weak cluster, which does not allow for the selection of the most probable k_r states. Finally, the sizes of the configuration spaces explored in biased MCMC for the two systems are approximately 0.4% and 3.5% of the total configuration space sizes. Thus, clustering effectively narrowed down the regions of the configuration space that contribute most to the partition function.

Since finding an optimal clustering is an NP-hard problem [19], and most clustering algorithms are polynomial-time heuristics, it is often difficult to obtain quality guarantees on the cluster assignments. Thus, to assess the impact of clustering, we perturbed the well-defined cluster structure of System 1 by interchanging the cluster memberships of n_P randomly chosen residues between the two engineered strong clusters. Table I summarizes how the accuracy of the biased MCMC method decreased as the strong clustering structure was progressively diminished. We used the same number of steps for both MCMC methods in this experiment.

MCMC Type | n_P = 0 | n_P = 1 | n_P = 2 | n_P = 4
Biased    | 1.3%    | 2.9%    | 7.0%    | 12.5%
Regular   | 5.4%    | 5.4%    | 5.4%    | 5.4%

TABLE I. The effect of sub-optimal cluster structure on the average protonation fraction error with the biased MCMC approach.

VI. EXPERIMENTS ON REAL-WORLD PROTEIN SYSTEMS
We assessed the efficacy of the proposed biased MCMC methodology via experiments on two real-world protein systems. The two proteins that we investigated were 2GUS and 4IL7, with 14 and 12 titratable residues, respectively. As before, these were small enough to be validated against the ground-truth answers from a complete enumeration routine. Finally, for the real-world systems, instead of single protonation fractions, we computed the entire titration curves over a range of pH values in uniform steps [4]. We first defined an appropriate error metric for the comparison. Let p denote the pH value and let f_r(p), f_b(p), and f_t(p) denote the curves that represent the protonation fraction of a given residue as a function of the pH p, computed via regular MCMC, biased MCMC, and full enumeration methods, respectively. The metric ε_xt(i) is the area corresponding to the absolute difference between the titration curve for a given residue i computed from a given MCMC scheme (x) and the titration curve for residue i from full enumeration, which serves as the ground truth (t):

    ε_xt(i) = (1 / n_R) Σ_{k=1}^{n_R} ∫_{p_L}^{p_H} |f_{x,k}^i(p) − f_t^i(p)| dp.    (7)

The bounds of the integral are defined by the pH values p_L and p_H that correspond to protonation fractions of 0.05 and 0.95 on the exact curve. The error metric was computed using Simpson's rule for numerical integration and averaged over the number of independent runs n_R (indexed by k). We also defined an average metric ε̄_xt over the number of titratable residues N to facilitate comparison between methods using a single metric: ε̄_xt = (1/N) Σ_{i=1}^{N} ε_xt(i).

Figures 3 and 4 show the comparison of the error metrics ε_rt and ε_bt for each of the titratable residues of the proteins 2GUS and 4IL7, respectively.
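A sketch of Equation 7 in code; for simplicity this version uses the trapezoidal rule rather than the Simpson's rule used in the paper, and assumes the exact curve decreases monotonically with pH:

```python
import numpy as np

def _trapz(y, x):
    """Composite trapezoidal rule."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def titration_error(f_x_runs, f_t, pH, lo=0.05, hi=0.95):
    """Eq. (7): area between estimated and exact titration curves, averaged over runs.

    f_x_runs : array of shape (n_R, n_pH), per-run MCMC protonation fractions
    f_t      : exact (full-enumeration) curve on the same pH grid
    The limits p_L and p_H are the pH values where f_t crosses `hi` and `lo`.
    """
    # f_t decreases with pH, so reverse both arrays for np.interp
    p_L = float(np.interp(hi, f_t[::-1], pH[::-1]))
    p_H = float(np.interp(lo, f_t[::-1], pH[::-1]))
    mask = (pH >= p_L) & (pH <= p_H)
    errs = [_trapz(np.abs(f - f_t)[mask], pH[mask]) for f in np.atleast_2d(f_x_runs)]
    return float(np.mean(errs))
```

A constant vertical offset between the curves yields an error close to the offset times the width of the integration window, which is a quick sanity check.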
In each of n_R = 100 runs, both the regular MCMC and the biased MCMC approaches used 10,000 steps, of which the initial steps corresponded to the burn-in phase. The biased MCMC method used Approach 1 for the transitions and 2 clusters determined by METIS, with the k_r values determined by the adaptive strategy described earlier.

Fig. 3. Error performance of the biased MCMC vs that of the regular MCMC for various titratable residues that make up the protein 2GUS. The MCMC runs used 10,000 steps and the error metrics were averaged over 100 runs.

Fig. 4. Error performance of the biased MCMC vs that of the regular MCMC for various titratable residues that make up the protein 4IL7. The MCMC runs used 10,000 steps and the error metrics were averaged over 100 runs.
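As noted in the introduction, residue pKa values can be read off the estimated titration curves. A small helper of our own, assuming a monotonically decreasing curve, picks the pH where the protonation fraction crosses one-half:

```python
import numpy as np

def pka_from_curve(pH, frac):
    """Estimate pKa as the pH at which the (decreasing) protonation fraction
    crosses 0.5, using linear interpolation on the reversed arrays."""
    return float(np.interp(0.5, frac[::-1], pH[::-1]))
```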
Figures 3 and 4 also show that the biased MCMC approach was more accurate for nearly all the residues. Furthermore, a small number of residues showed significant improvement in accuracy over the regular MCMC method. Table II shows the percentage improvement in the error metric ε̄_bt for the biased MCMC over the error metric for the regular MCMC, namely ε̄_rt. R is the interval of protonation fractions from which the limits of the integral (in terms of pH values) were computed. The relative performance gain of the biased MCMC method was slightly better for the larger interval.

While these results showed a noticeable improvement in the error performance of the biased MCMC method over the regular MCMC method, the improvement was less significant than for the synthetic systems with stronger cluster structure. Understanding this difference is part of our ongoing work.

Protein | R = [0.05, 0.95] | R = [0.01, 0.99]
2GUS    | -18.7%           | -19.5%
4IL7    | -13.4%           | -15.9%

TABLE II. Relative improvement in the error metrics for biased MCMC over regular MCMC for two real-world proteins.

VII. CONCLUSIONS
In this work, we presented a novel biased MCMC algorithm for computing ensemble averages in systems characterized by pairwise interactions, and specifically protein residue networks. The algorithm exploits the cluster structure of the interaction energy matrix and allows the protonation states of residues to change together. We showed that our new schemes represent valid MCMC runs that satisfy the ergodicity and detailed balance requirements. We implemented our biased MCMC algorithms and applied them to a number of small protein residue network systems, both synthetic and real-world, and showed that our new schemes provide improved accuracy in the protonation fraction and hence the titration curve estimates. We further showed the sensitivity of the methods to the quality of the clustering assignments. Our future work will focus on better clustering strategies to further improve the performance on real-world systems, optimizing the hyper-parameters, and applying the method to larger real-world protein residue networks.

VIII. ACKNOWLEDGMENTS
This work was funded by NIH grant R01 GM069702.

REFERENCES

[1] A. Ramanathan, C. S. Chennubhotla, P. K. Agarwal, and C. B. Stanley, "Large-scale machine learning approaches for molecular biophysics," Biophysical Journal, vol. 108, no. 2, p. 370a, 2015.
[2] V. Botu and R. Ramprasad, "Adaptive machine learning framework to accelerate ab initio molecular dynamics," International Journal of Quantum Chemistry, vol. 115, no. 16, pp. 1074–1083, 2015.
[3] P. Baldi and S. Brunak, Bioinformatics: The Machine Learning Approach. MIT Press, 2001.
[4] E. Purvine, K. Monson, E. Jurrus, K. Starr, and N. A. Baker, "Energy minimization of discrete protein titration state models using graph theory," Journal of Physical Chemistry, vol. 120, pp. 8354–8360, 2016.
[5] C. Tanford and R. Roxby, "Interpretation of protein titration curves. Application to lysozyme," Biochemistry, vol. 11, no. 11, pp. 2192–2198, 1972.
[6] D. Bashford and M. Karplus, "Multiple-site titration curves of proteins: An analysis of exact and approximate methods for their calculation," The Journal of Physical Chemistry, vol. 95, no. 23, pp. 9556–9561, 1991.
[7] M. K. Gilson, "Multiple-site titration and molecular modeling: Two rapid methods for computing energies and forces for ionizable groups in proteins," Proteins: Structure, Function, and Bioinformatics, vol. 15, no. 3, pp. 266–282, 1993.
[8] J. Myers, G. Grothaus, S. Narayanan, and A. Onufriev, "A simple clustering algorithm can be accurate enough for use in calculations of pKs in macromolecules," Proteins: Structure, Function, and Bioinformatics, vol. 63, no. 4, pp. 928–938, 2006.
[9] C. R. Sondergaard, M. H. Olsson, M. Rostkowski, and J. H. Jensen, "Improved treatment of ligands and coupling effects in empirical calculation and rationalization of pKa values," Journal of Chemical Theory and Computation, vol. 7, no. 7, pp. 2284–2295, 2011.
[10] M. H. M. Olsson, C. R. Sondergaard, M. Rostkowski, and J. H. Jensen, "PROPKA3: Consistent treatment of internal and surface residues in empirical pKa predictions," Journal of Chemical Theory and Computation, vol. 7, no. 2, pp. 525–537, 2011.
[11] M. H. Olsson, "Protein electrostatics and pKa blind predictions; contribution from empirical predictions of internal ionizable residues," Proteins: Structure, Function, and Bioinformatics, vol. 79, no. 12, pp. 3333–3345, 2011.
[12] P. Beroza, D. Fredkin, M. Okamura, and G. Feher, "Protonation of interacting residues in a protein by a Monte Carlo method: Application to lysozyme and the photosynthetic reaction center of Rhodobacter sphaeroides," Proceedings of the National Academy of Sciences, vol. 88, no. 13, pp. 5804–5808, 1991.
[13] B. Rabenstein, G. M. Ullmann, and E.-W. Knapp, "Calculation of protonation patterns in proteins with structural relaxation and molecular ensembles: Application to the photosynthetic reaction center," European Biophysics Journal, vol. 27, no. 6, pp. 626–637, 1998.
[14] D. Chandler, Introduction to Modern Statistical Mechanics, 1987.
[15] M. Vijayabaskar and S. Vishveshwara, "Interaction energy based protein structure networks," Biophysical Journal, vol. 99, no. 11, pp. 3704–3715, 2010.
[16] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[17] E. Estrada, "Universality in protein residue networks," Biophysical Journal, vol. 98, no. 5, pp. 890–900, 2010.
[18] G. Karypis and V. Kumar, "A fast and high quality multilevel scheme for partitioning irregular graphs," SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 359–392, 1998.