Helix modelling through the Mardia-Holmes model framework and an extension of the Mardia-Holmes model
aa r X i v : . [ s t a t . O T ] O c t Helix modelling through the Mardia-Holmes modelframework and an extension of the Mardia-Holmesmodel
Mai F Alfahad , John T Kent , and Kanti V Mardia Department of Statistics, University of Leeds, Leeds LS2 9JT, UK Department of Statistics, University of Oxford OX1 3LB, [email protected], [email protected], [email protected]
Abstract
For noisy two-dimensional data, which are approximately uniformly distributednear the circumference of an ellipse, Mardia & Holmes (1980) developed a modelto fit the ellipse. In this paper we adapt their methodology to the analysis of helixdata in three dimensions. If the helix axis is known, then the Mardia-Holmes modelfor the circular case can be fitted after projecting the helix data onto the planenormal to the helix axis. If the axis is unknown, an iterative algorithm has beendeveloped to estimate the axis. The methodology is illustrated using simulatedprotein α -helices. We also give a multivariate version of the Mardia-Holmes modelwhich will be applicable for fitting an ellipsoid and in particular a cylinder. Keywords and phrases.
Fitted ellipse, Fitted circle, Principal component analysis,Helix axis, Maximum likelihood, Least squares A mathematical helix is a curve in three dimensional space, of the form f ( t ) = r cos tr sin tct (1)(e.g., O’Neill, 1997, p. 16), augmented by an arbitrary rotation and shift in R , as the“time” t ranges through the real line. This helix is called “right-handed” since whenlooked at from above, x and x move in in a counter-clockwise direction around a circleas t increases, i.e. as the axis position x gets closer to to observer.A statistical helix is obtained from (1) by adding noise at equally spaced time points t i = iβ to give data x i = f ( t i ) + ǫ i , i = 1 , . . . , n, (2)1here ǫ i are small noise terms, typically modelled by independent isotropic normal dis-tributions, ǫ i ∼ N ( , σ I ) . The structural parameters of the helix data are • the radius r > • the pitch 2 πc (the amount of vertical movement after one rotation around the helix);and • the turn angle β .An important application of helix models is to secondary protein structure, where acommon structure is the right-handed α -helix, (see e.g. Campbell & Farrell, 2009). Aprotein α -helix can treated as a data set of “landmarks” lying near a helix by focusing onspecific atoms such as C α atoms.An important task when presented with helix data is to estimate the axis. Variousstatistical methods have been proposed in the literature. Mardia et al. (2018) used maxi-mum likelihood estimates under various assumptions about the parameters. In particularif β is known, it is also possible to use a modified least squares algorithm to compute theMLE; this particular algorithm was called OptLS by Alfahad et al. (2018) and this namewill be used in this paper. There are many compositional methods to calculate the axis;see for examples, ˚Aqvist (1986) and Rotfit by Christopher et al. (1996).In this paper, we develop a new method, by adapting the Mardia & Holmes (1980)(M-H) model for data in the plane. The paper is laid out as follows. In Section 2, theMH model for data in the plane is reviewed. Then in Section 3 the MH model is adaptedto estimate the helix axis for three-dimensional data using a projection into the plane.Section 4 illustrates the use of the model on some simulated data. Mardia & Holmes (1980) (M-H) model was originally designed to analyze megalithic data,in particular stones clustered uniformly around an ellipse, or as a special case, a circle.The M-H model has several parameters: • a concentration parameter κ > • a location parameter in the plane a = ( a , a ) T representing the centre of the ellipse;and • a 2 × y − a ) T Σ − ( y − a ) = 1 . (3)2he M-H model treats n data points in the plane x , . . . , x n as independent observa-tions from the density f ( y ) = C ( κ ) | Σ | − / exp {− κ [( y − a ) T Σ − ( y − a ) − } , (4)where C ( κ ) = ( κ/ π ) / / { π Φ( κ / ) } is the normalization constant. This model has itsmode on the circumference of the ellipse, that is, the values of y satisfying (3).If Σ = ρ I , ρ >
0, and I is the 2 × ρ .The circular version of the M-H model is the appropriate version for helix data. Ifdata { x i } follow the helix model (1)-(2), then the projected data y i = (cid:20) y i y i (cid:21) = (cid:20) x i x i (cid:21) will approximately follow the M-H model.The adjective “approximately” is needed for two reasons. (i) The angular parts ofthe helix data (i.e. θ i = atan2( x i , x i ), i = 1 , . . . , n ) are not i.i.d.; the distribution of θ i depends on t i . However, provided β/ π is not a simple fraction, we expect the θ i to bewell spread around the circle.(ii) The radial part of the helix distribution (i.e. the distribution of ( x i + x i ) / ) from(2) is not quite the same as the radial part of the M-H distribution (i.e. the distributionof ( y i + y i ) / from (4). However, the two radial distributions will be very similar underhigh concentration, i.e. if κ is large, by matching the parameters κ = 1 /σ .Estimation in the M-H model can be done by maximum likelihood. However, since theMLEs do not exist in closed form, an iterative algorithm is needed. The simplest procedureis to choose plausible initial estimates and then to use a black box optimization algorithm(e.g. the function nlm in R) to carry out the maximization.Here is a set of choices of initial estimates for the circular case, given data { y i } : • Estimate a by ˆ a init , the vector mean of the data. • Estimate ρ by ˆ ρ init , the average distance between the data and ˆ a init , i.e. n − P | y i − ˆ a init | . • Estimate κ by the reciprocal of the sample variance of the radial part of the centereddata, 1 / ˆ κ init = var {| y i − ˆ a init |} . In the nlm procedure it is convenient to use unconstrained parameters. Hence we workwith η = log( κ ) and τ = log( ρ ). The output of this procedure is a set of estimates and avalue MLL, say, for the maximized log likelihood. If an arbitrary rotation and shift are added to the mathematical helix model (1), then itis convenient to write the model in the form x i = r (cos t i ) u + r (sin t i ) v + ct i w + b + ε i , (5)3here u , v , w are three-dimensional orthonormal vectors. In particular, w is the helixaxis. Further, R = [ u v w ] is a 3 × b represents the shiftterm. The rotated errors ε i = R ǫ i still follow independent isotropic normal distributions, ε i ∼ N ( , σ I ).If R is a possible estimate of the rotation matrix, then the first two components of therotated data y i = (cid:20) y i y i (cid:21) = (cid:20) ( R T x i ) ( R T x i ) (cid:21) will approximately follow the M-H model. Since the fit of the M-H model is invariantunder rotations in the plane, the maximized M-H log likelihood for the { y i } depends onlyon the third column of R , i.e. w , and not on the relative orientation of the first twocolumns. Hence write M LL ( w ) for the maximized M-H log likelihood, depending on thechoice of helix axis w .To estimate w we maximize M LL ( w ) over w . As in the last section it is necessaryto use an interative numerical method such as the R routine nlm , starting from an initialestimate of w .A suitable initial estimate w init can be found using e.g. modified principal componentanalysis or OptLS (Alfahad et al., 2018) or Rotfit (Christopher et al., 1996).A unit vector is a constrained vector in three dimensions. For optimization purposes, itis helpful to represent it using unconstrained two-dimensional coordinates. For example,once an initial estimate has been selected, we can rotate the data so that the initialestimate points to the north pole, [0 0 1] and represent deviations about the north poleusing stereographic coordinates p = ( p , p ) T , say, where w can be written in terms of p as follows: w = 2 p p + p ,w = 2 p p + p ,w = − p + p p + p . Then define a function f ( p , p ) = M LL ( w ) (6)and maximize this function numerically starting at p = .Note that the M-H procedure involves a nested use of numerical optimization. At theinner level, numerical optimization is used to maximize the M-H log likelihood, assumingthe helix axis w is given, yielding a maximized log likelihood M LL ( w ). At the outerlevel, we maximize (6) over p , i.e. over the choice of w .We now apply the method to estimate the axis for two real helices (Helices 7 and 8from Mardia et al (2018)). For Helix 7 the estimated axis is (0 . , − . , . T andfor Helix 8 the estimated axis is (0 . , . , − . T . Their estimates from OptLSare respectively (0 . , − . , . T and (0 . , . , − . T . The cosine and theirangle for the two cases are 0.9999315, θ = 0.01170463; 0.9995821, θ = 0.0289111 These in-dicate that for the two cases, these estimates are very similar. The next section, examinestheir mean square error through a simulation study.4 Simulation
In this section, we illustrate our M-H procedure for estimating the helix axis on 100simulated helices that mimic a protein α -helix for different choices of sample size n andparameter values r , c , and σ (with β = 2 π/ . α -helix see Mardia (2013), Branden & Tooze (1999) and Creighton(1993). We have 100 estimates of the helix axis ˆ w M-H ,i by the M-H procedure and ˆ w Opt ,i by the OptLS method. To calculate the accuracy of these estimates, we define the meansquare errors (MSEs) in terms of the means of the inner products,MSE M-H = 1 − ˆ¯ w T M-H w , MSE
Opt = 1 − ˆ¯ w T Opt w , where the inner products are sample means,ˆ¯ w M-H = 1100 X i =1 ˆ w M-H ,i , ˆ¯ w Opt = 1100 X i =1 ˆ w Opt ,i , and w = (0 , , T is the axis pointing to north pole. Let θ be the angle between theestimated axis ˆ¯ w M-H (or ˆ¯ w Opt ) and w . If this angle vanishes, θ = 0, then the estimatedaxis is a perfect fit, then the inner product is 1 (Deville et al., 2008), so that the MSE isequal to zero.We illustrate the algorithm with an example of one simulated dataset that mimics along protein α helix, where n = 30 , r = 2 . , c = 5 . / (2 π ) , β = 2 π/ . , σ = 0 . w = (0 , , T . We also estimate the axis of this dataset by OptLS. Theestimated helix axis by M-H procedure is ˆ w M − H = ( − . × − , . × − , . T and the estimated helix axis by OptLS is ˆ w Opt = ( − . × − , . × − , . T .The MSE by M-H procedure is 2 . × − and the MSE by OptLS is 1 . × − . Thisresult shows that OptLS is more accurate than M-H algorithm as the MSE is smaller bya factor of two.Table 1 shows the MSE for six different simulated datasets. These data sets havebeen constructed with a variety of parameter choices. Set 2 mimics a long protein α helix( n = 30) and set 3 mimics short protein α helix ( n = 12). The remaining data sets aremodified versions of these sets. In particular, the error variance has been decreased forset 1 from σ = 0 .
05 to σ = 0 .
001 and increased for set 4 from σ = 0 .
05 to σ = 0 . r = 2 . r = 7); in addition for set 5, the pitchparameter c has been reduced (from c = 5 . / (2 π ) = 0 .
859 to c = 0 . / (2 π ) = . β = 2 π/ . . π ) 5 . π ) 5 . π ) 5 . π ) 0 . π ) 5 . π ) σ . × − . × − . × − . × − . × − . × − OptLS 1 . × − . × − . × − . × − . × − . × − Mardia and Holmes’s bivariate model can be extended to any dimension on replacingthe ellipse by an ellipsoid. Namely, let X be a random vector in d dimension then theirdistribution has the probability density function (p.d.f.) f ( x ; µ, Σ , κ ) = C ( κ ) | Σ | − / exp {− κ (( x − µ ) T Σ − ( x − µ ) − } (7)where x is dimension d the concentration parameter κ > µ takes any value in R d andΣ is a positive definite matrix. It turns out that the normalizing constant is given by aparabolic cylindrical function and is given below. For the right cylinder, the two smalleigenvalues are equal a priori. The axis is the z-axis if we take the largest eigenvalue asthe third one; that is a thin and long ellipsoid so may be relevant ”tangentially” thoughthe multivariate Mardia-Holmes model will be useful for any inference related to fittingan ellipsoid.We will use the general summary of elliptic family by Azzilini(2014,pp.168-169) ofwhich this is a member but not studied. Let us write r = ( x − µ ) T Σ − ( x − µ ) . Then the pdf can be rewritten as f ( x ; µ, Σ , κ ) = C ( κ ) | Σ | − / p ( r ) (8)where C ( κ ) = Γ( d/ / (2 π d/ b ( κ )) , with b ( κ ) = R ∞ r d − p ( r ) dr and p ( r ) = exp {− κ ( r − } .We need now to evaluate b ( κ ) given by b ( κ ) = Z ∞ r d − { exp − κ ( r − } dr.
6n fact, it can be expressed in terms of the parabolic cylindrical function defined by (see,Abramowitz and Stegun, 1965, Chapter 9, p.688). U ( a, z ) = 1Γ( a + ) exp (cid:18) − z (cid:19) Z ∞ s a − exp (cid:18) − s − zs (cid:19) ds. We have b ( κ ) = Γ( a + ) exp( − κ )2 κ a +14 U ( a, −√ κ ) , (9)where a = ( d − /
2. Note that if d = 2 n then a = n − and if d = 2 n + 1 then a = n .For d = 3, we have a = 1.We now summaries a few properties The mode of the distribution is given by( x − µ ) T Σ − ( x − µ ) = 1 . Further, E ( X ) = µ, E ( X − µ )( X − µ ) T = α Σ , (10)where α is a function of κ . The most general equation of ellipsoid is given by( x − µ ) T Σ − ( x − µ ) = 1so 1 in the LHS does NOT need any adjustments. It can be noted that the model worksto fit a general ellipsoid when there is a very high probability of concentration around( x − µ ) T Σ − ( x − µ ) = 1where Σ is positive definite. References
Abramowitz, M. & Stegun, I. A. (1972).
Handbook of mathematical functions . NationalBureau of Standards, Washington, 10th printing with corrections edition.Alfahad, M., Kent, J. T., & Mardia, K. V. (2018). Statistical methods for analysis ofhelices.
Sankhya A . In press.˚Aqvist, J. (1986). A simple way to calculate the axis of an α -helix. Computers & Chem-istry , 10(2), 97–99.Azzalini, A. & Capitanio, A. (2014).
The skew-normal and related families . CambridgeUniversity Press, Cambridge.Branden, C. & Tooze, J. (1999).
Introduction to Protein Structure . Garland Pub.Campbell, M. K. & Farrell, S. O. (2009).
Biochemistry . Brooks/Cole, sixth, internationaledition. 7hristopher, J. A., Swanson, R., & Baldwin, T. O. (1996). Algorithms for finding theaxis of a helix: fast rotational and parametric least-squares methods.
Computers andChemistry , 20(3), 339–345.Creighton, T. (1993).
Proteins: Structures and Molecular Properties . Freeman, secondedition.Deville, J., Rey, J., & Chabbert, M. (2008). Comprehensive analysis of the helix-x-helixmotif in soluble proteins.
Proteins: Structure, Function, and Bioinformatics , 72(1),115–135.Mardia, K. V. (2013). Statistical approaches to three key challenges in proteins structuralbioinformatics.
Journal of the Royal Statistical Society: Series C (Applied Statistics) ,62(3), 487–514.Mardia, K. V. & Holmes, D. (1980). A statistical analysis of megalithic data under ellipticpattern.
Journal of the Royal Statistical Society , 143(3), 293–302.Mardia, K. V., Sriram, K., & Deane, C. M. (2018). A statistical model for helices withapplications.
Biometrics , 74, 845–854.O’Neill, B. (1997).