GLN -- a method to reveal unique properties of lasso type topology in proteins

Abstract

Geometry and topology are the main factors that determine the functional properties of proteins. In this work, we show how to use the Gauss linking integral (GLN) in the form of a matrix diagram, for a pair of a loop and a tail, to study both the geometry and topology of proteins with closed loops, e.g. lassos. We show that the GLN method is a significantly faster technique to detect entanglement in lasso proteins than other methods. Based on the GLN technique, we conduct a comprehensive analysis of all proteins deposited in the PDB and compare it to the statistical properties of polymers. We found that there are significantly more lassos with negative crossings than with positive ones in proteins, and that the average value of maxGLN (maximal GLN between the loop and pieces of the tail) depends logarithmically on the length of the tail, similarly as in polymers. Next, we show how high and low GLN values correlate with the internal flexibility of proteins, and how the GLN in the form of a matrix diagram can be used to study folding and unfolding routes. Finally, we discuss how the GLN method can be applied to study entanglement between two structures, neither of which is a closed loop. Since this approach is much faster than other linking invariants, the next step will be the evaluation of lassos in much longer molecules such as RNA or loops in a single chromosome.


Wanda Niemyska, Kenneth C. Millett and Joanna I. Sulkowska ∗
Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland
Centre of New Technologies, University of Warsaw, Banacha 2c, 02-097 Warsaw, Poland
Department of Mathematics, University of California Santa Barbara, CA 93106, USA
∗ Corresponding author
arXiv preprint, category q-bio.BM


Introduction

The protein backbone describes a collection of space curves, a type of spatial structure that mathematicians have been analysing and comparing for a long time. One well-known measure of how two such curves interact with one another is the Gauss linking integral, which is related to Ampère's law of magnetostatics and has important applications in modern physics. For two oriented closed curves the Gauss linking integral is always an integer, called the linking number, which counts the number of times one curve winds around the other. The linking number of two unlinked curves is 0, while the Hopf link is the simplest link with linking number equal to +1 or -1, depending upon the relative orientation of the curves [1], see Supplementary Information Fig. 1.

Protein chains are open curves, which is often challenging for mathematicians and induces high computational complexity in algorithms involving randomness and statistics [2, 3], as in the case of identifying knots [4], slipknots [5, 6] and links in proteins [7]. Against such a backdrop, the fact that the Gauss linking integral may be defined generally for open curves and calculated precisely for polygonal chains makes this measure particularly attractive.

The first biological applications of the Gauss linking integral are found in studies of DNA structure [8]. In 2002, Røgen and Fain applied this measure to compare and effectively classify protein structures [9]. More recently, the Gauss integral has been used for identifying linking in domain-swapped protein dimers [10].

In this paper we show that the Gauss linking integral, which we denote by GLN, captures unique properties of lasso proteins (Fig. 1), another type of non-trivial topology identified recently in proteins containing a disulfide or other type of bridge [11, 12].
Complex lasso topology is found in at least 18% of all proteins with disulfide bridges in a non-redundant subset of the PDB, and thus represents the largest group of proteins with non-trivial topology. Lassos occur in structures with disulfide (or other) bridges creating a loop and a pair of termini. When at least one terminus of a protein backbone is entangled with the covalent loop (closed by such a bridge), a topologically complex structure is formed. The topology is identified by spanning a specific surface (i.e. a minimal surface) on the covalent loop (Fig. 1) and identifying the crossings of the tails and the surface [11]. Currently several classes of lasso structures in proteins are known. In addition to the trivial lasso L0, the principal structures are the single lasso L1, the double lasso L2, and the triple lasso L3, depending upon whether the loop is pierced once, twice or three times, respectively, by the same tail, which goes through the loop and turns back several times. A structure with more than one piercing from the same direction is called a lasso supercoiling LS (one tail pierces the loop, then winds around the protein chain comprising the loop and pierces it again). Another case identified in proteins is the two-sided lasso LL (when a loop is pierced by both tails). It is important to note that, from a mathematical point of view, all classes of lassos are topologically equivalent to the trivial lasso L0 because the free ends are not prevented from unwinding. Even if we connected the free ends without disturbing the windings, all classes except the lasso supercoil LS would still be topologically equivalent to the trivial lasso. From a biological point of view, however, they are still very interesting complex structures. For example, a correlation between the type of lasso topology and the specific function of a protein has been identified [11]. All proteins that form any type of lasso are collected in the LassoProt database [12].

[Figure 1, right panel plot: GLN matrix; x-axis: Residue ID (beginning of a segment); y-axis: Residue ID (end of a segment); min GLN = -0.6, max GLN = 0.8]

Figure 1: Left panel: An example of a lasso configuration of L2 type, with a disulfide bridge (in orange) closing a covalent loop, and a minimal surface (in gray) which spans the loop and is pierced twice by the tail. Middle panel: A cartoon representation of a hydrolase protein (PDB code 5uiw, chain B), with a disulfide bridge between amino acids 10 and 34. It is of L2 type, with the minimal surface (in gray) and tails coloured according to the GLN values between their segments and the whole loop. Right panel: The topological fingerprint of a lasso based on the GLN matrix for the same protein. Each cell of the matrix corresponds to the GLN value between the disulfide loop and a specific subchain of the tail (here the C terminus, the longer one), where the id of the first residue is on the x-axis and the id of the last residue is on the y-axis; thus the bottom left corner corresponds to the whole tail. The C-tail in the middle panel is colored according to the diagonal of the matrix.

Proteins with lassos are found in all domains of life and possess diverse functions [11, 12]. Lasso topology can influence thermodynamic properties and biological activity of proteins [13, 14].
Cysteine bridges provide stability to protein structures and a non-trivial topology can enhance this influence [7, 15]. However, it is also known that non-trivial topology hinders the folding pathway [16], leading to possible misfolding [17]. How evolution solves this delicate balance is one of the open questions. There are many others at the interface of biology and mathematics. What is the role of the lasso? Is there a correlation between the lasso type and the biological function? How do these proteins fold in oxidative conditions? The latter question, however, does not concern the lasso peptides, which are a class of ribosomally synthesized and posttranslationally modified natural products found in bacteria. These peptides nevertheless have a diverse set of pharmacologically relevant activities, including inhibition of bacterial growth, receptor antagonism, and enzyme inhibition [18]. Thus, can lasso topology be useful in bioengineering or in pharmacological applications to design proteins with a desired fold, stability or other features? In polymer chemistry, lassos (known as tadpoles) are used to design materials with desired properties [19, 20, 21]. Since lassos are defined using open curves, they are also inspiring mathematicians to construct topological tools capable of classifying them [22, 23]. However, up to now, the question of whether a loop and a tail can be entangled in a protein while the minimal surface spanned on the loop is not pierced has not been asked. How might this entanglement influence protein biophysical properties? The Gauss linking integral approach could reveal more information about lasso proteins than the previous geometric method.

The aim of this research is to better understand the entanglement of lasso proteins and its influence on their thermodynamic properties. To do so we first introduce a new technique based on the Gauss linking integral and then apply it to assess the topological complexity of proteins with disulfide bridges.
We show that GLN provides new information about the entanglement of the loop and tails, related to geometric features of the minimal disc piercings but, in addition, identifies entangled proteins with different complex lasso topology. We introduce the GLN fingerprint to display the local winding of a protein backbone and as another method to quantify entanglement in proteins with non-trivial linking topology. Finally, we use GLN as a descriptor to study the free energy landscape of proteins and show the influence of non-trivial topology on protein stability and folding pathway.

Results

Our new approach relies on the definition of the Gauss linking integral. Let us first consider a protein chain with a disulfide bond connecting two amino acids that, in this way, creates an unknotted covalent loop. The complementary parts of the chain are the tails. When at least one tail pierces a minimal surface spanned on the loop, the entire structure is called a complex lasso (Fig. 1). In this study, we compute the Gauss linking integral, which we denote by GLN, quantifying the linking between each tail and the closed loop. The GLN is an algebraic measure of how many times (and in which direction) the tail winds around the loop, with cancellation. For example, a value of GLN close to 1 means that the tail winds around the loop more or less once, in total. In the simplest cases, the tail passes once through the surface spanned on the loop (in a positive direction, following the natural orientation of the protein from the N terminus to the C terminus). Such a structure resembles the single lasso called L1. If the direction is reversed, the linking number is close to −1. Note that, in complex cases, the tail can pass around the loop twice in a positive direction and once in a negative direction for an algebraic total of about 1. Moreover, by definition, the linking number of two unlinked curves is 0, although one cannot infer with certainty that curves with linking number 0 can be separated. This is demonstrated by the "Whitehead" link, in which the algebraic linking of the two closed loops is zero but they are geometrically entangled, and one chain intersects a minimal surface spanned on the other chain at least twice in opposite and therefore cancelling directions. We will present conditions to identify and classify proteins with cysteine bridges.
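The integer behaviour described above (0 for unlinked closed curves, ±1 for the Hopf link) can be checked numerically. The block below is a minimal sketch, not the authors' implementation: it assumes the standard closed-form solid-angle expression for a pair of straight segments, and the function names (`segment_pair_gln`, `gln`, `closed`) and the test rings are hypothetical illustrations.

```python
import numpy as np


def _unit(v):
    """Normalise a 3-vector; degenerate (zero-length) vectors map to zero."""
    n = np.linalg.norm(v)
    return v / n if n > 1e-12 else np.zeros(3)


def segment_pair_gln(r1, r2, r3, r4):
    """Signed Gauss-integral contribution of segment r1->r2 against r3->r4."""
    r13, r14, r23, r24 = r3 - r1, r4 - r1, r3 - r2, r4 - r2
    n1 = _unit(np.cross(r13, r14))
    n2 = _unit(np.cross(r14, r24))
    n3 = _unit(np.cross(r24, r23))
    n4 = _unit(np.cross(r23, r13))
    # Area of the spherical quadrilateral swept by the connecting directions.
    area = sum(
        np.arcsin(np.clip(np.dot(a, b), -1.0, 1.0))
        for a, b in ((n1, n2), (n2, n3), (n3, n4), (n4, n1))
    )
    # Orientation of the crossing decides the sign of the contribution.
    sign = np.sign(np.dot(np.cross(r4 - r3, r2 - r1), r13))
    return sign * area / (4.0 * np.pi)


def gln(chain_a, chain_b):
    """Discrete Gauss linking sum between two polylines (N x 3 arrays)."""
    a = np.asarray(chain_a, dtype=float)
    b = np.asarray(chain_b, dtype=float)
    return sum(
        segment_pair_gln(a[i], a[i + 1], b[j], b[j + 1])
        for i in range(len(a) - 1)
        for j in range(len(b) - 1)
    )


def closed(points):
    """Close a polygon by repeating its first vertex at the end."""
    pts = np.asarray(points, dtype=float)
    return np.vstack([pts, pts[:1]])


# Two polygonal circles forming a Hopf link, and an unlinked pair.
t = np.linspace(0.0, 2.0 * np.pi, 24, endpoint=False)
s = np.linspace(0.0, 2.0 * np.pi, 20, endpoint=False) + 0.1
ring_a = closed(np.c_[np.cos(t), np.sin(t), np.zeros_like(t)])
ring_b = closed(np.c_[1.0 + np.cos(s), np.zeros_like(s), np.sin(s)])
ring_far = ring_b + np.array([5.0, 0.0, 0.0])

lk_hopf = gln(ring_a, ring_b)   # linked rings: magnitude 1
lk_far = gln(ring_a, ring_far)  # separated rings: 0
```

Reversing the orientation of one ring (for example `ring_b[::-1]`) flips the sign of the result, which mirrors the dependence on relative orientation noted in the text. For an open tail the same sum is no longer an integer.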

GLN deﬁnition from protein perspective

The mathematical definition of the linking number between two closed curves in 3 dimensions is given by the Gauss double integral. In the case of proteins, the molecular chains become collections of points, i.e., positions of Cα atoms, and the integrals may be replaced by sums of exact quantities determined by pairs of segments connecting the points as determined by the molecular chain [24]. We must relax the expectation of having an integer indicator of linking as we perform the double Gauss integral over open chains. See the section Materials and Methods for the details. We propose the analysis of four main values for each pair consisting of a loop and a tail:
1) whGLN: the GLN value of a loop and a whole tail,
2) minGLN, and
3) maxGLN: respectively, the minimum and maximum values of GLN between a loop and any fragment of a tail, and
4) max|GLN| = max{maxGLN, −minGLN}.
Additionally, for each triple of a loop and two tails, we consider max2|GLN|, the larger of the max|GLN| values for both tails. We determine the positive direction of winding according to the natural direction of a protein chain, oriented from the N-terminus to the C-terminus. A high maxGLN or low minGLN indicates that the corresponding part of a tail significantly winds around a loop in a "positive" or "negative" direction, respectively. Usually the minimal surface spanned on the loop is pierced by this part of the tail.

We analyzed the entire set of all 5,106 non-redundant proteins in the Protein Data Bank with at least one disulfide bridge (13,320 covalent loops in total) from the LassoProt database [12]. See the Materials and Methods section for details about the dataset.

Application of GLN to this dataset reveals a Gaussian distribution with a long tail, as shown in Fig. 2. In the majority of cases, the GLN is near 0.2, indicating proteins in which the minimal surface spanned on the loop is probably not pierced.
However, the long tail shows that, in a high fraction of chains with cysteine bridges, at least one tail significantly winds around the loop. For example, in 21% of chains, we have at least one loop with max|GLN| > […]. […]4% of loops, we have max|GLN| > […]6. The value 0.[…] max|GLN| > […] max|GLN| ≤ […].

The GLN fingerprint as a method to classify lasso structures
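The four loop-tail values defined above (whGLN, minGLN, maxGLN, max|GLN|) and the matrix diagram introduced in this section can both be obtained from the contribution of each tail segment to the Gauss sum with the whole loop. The block below is a hedged sketch under that assumption: the use of prefix sums to fill the matrix is an illustrative design choice that relies only on the additivity of the discrete Gauss sum, and all function names and the toy geometry (a polygonal ring threaded by a straight tail) are hypothetical.

```python
import numpy as np


def _unit(v):
    n = np.linalg.norm(v)
    return v / n if n > 1e-12 else np.zeros(3)


def segment_pair_gln(r1, r2, r3, r4):
    """Signed solid-angle contribution of segment r1->r2 against r3->r4."""
    r13, r14, r23, r24 = r3 - r1, r4 - r1, r3 - r2, r4 - r2
    normals = [
        _unit(np.cross(r13, r14)),
        _unit(np.cross(r14, r24)),
        _unit(np.cross(r24, r23)),
        _unit(np.cross(r23, r13)),
    ]
    area = sum(
        np.arcsin(np.clip(np.dot(normals[k], normals[(k + 1) % 4]), -1.0, 1.0))
        for k in range(4)
    )
    sign = np.sign(np.dot(np.cross(r4 - r3, r2 - r1), r13))
    return sign * area / (4.0 * np.pi)


def tail_contributions(loop_closed, tail):
    """GLN of each tail segment with the whole (closed) loop."""
    loop = np.asarray(loop_closed, dtype=float)
    tail = np.asarray(tail, dtype=float)
    return np.array([
        sum(
            segment_pair_gln(loop[i], loop[i + 1], tail[j], tail[j + 1])
            for i in range(len(loop) - 1)
        )
        for j in range(len(tail) - 1)
    ])


def gln_matrix(contrib):
    """M[i, j] (i < j) = GLN between the loop and tail points i..j."""
    prefix = np.concatenate([[0.0], np.cumsum(contrib)])
    return prefix[None, :] - prefix[:, None]


def summarize(matrix):
    """whGLN, maxGLN, minGLN and max|GLN| over all tail fragments."""
    upper = matrix[np.triu_indices(matrix.shape[0], k=1)]
    max_gln, min_gln = upper.max(), upper.min()
    return {
        "whGLN": matrix[0, -1],
        "maxGLN": max_gln,
        "minGLN": min_gln,
        "max|GLN|": max(max_gln, -min_gln),
    }


# Toy geometry: a 40-gon "loop" pierced once by a long straight "tail".
angles = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
ring = np.c_[np.cos(angles), np.sin(angles), np.zeros_like(angles)]
loop = np.vstack([ring, ring[:1]])  # closing segment plays the bridge
z = np.linspace(-50.0, 50.0, 101)
tail = np.c_[np.full_like(z, 0.1), np.full_like(z, 0.05), z]

contrib = tail_contributions(loop, tail)
matrix = gln_matrix(contrib)
summary = summarize(matrix)
```

In this toy case the whole tail winds through the ring once, so `summary["whGLN"]` has magnitude close to 1. The entry `matrix[0, -1]` plays the role of the corner of the fingerprint that corresponds to the whole tail, and following `matrix[0, :]` corresponds to keeping the beginning of the segment fixed while its end moves along the tail.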

To identify the correlation between topology and geometry of proteins, we adopt the idea of the topological fingerprint used to exhibit the internal knots in proteins called slipknots [6, 25]. Here, we present the linking complexity in the form of a matrix diagram, for a pair of a loop and a tail, that shows the GLN between the loop and the entire tail and each of its subchains.

The analysis of our dataset reveals that covalent loops in proteins can be classified into a few distinct motifs, represented by particular patterns within the matrix diagrams. Four characteristic motifs are shown in Fig. 3. Each point of the matrix corresponds to a specific subchain of the tail, where the id of the first residue is on the x-axis and the id of the last residue is on the y-axis. As a consequence, the bottom left corner corresponds to the whole tail. The color intensity indicates the value of the GLN between the disulfide loop and the specific subchain of the tail. A red color indicates winding in the negative direction and a blue color winding in the positive direction.

[Figure 2 plot: x-axis: max2|GLN| (0 to 1.4); y-axis: Fraction]

Figure 2: The histogram of max|GLN| values for all closed loops (created by a disulfide bridge) in the set of 5106 non-redundant proteins. The dotted curve shows the fraction of loops having max|GLN| greater than the value on the x-axis. Almost 10% of the loops have max|GLN| greater than 0.[…].

• gL0: no clear colorful patches in the matrix, indicating that the tail does not wind around the loop.
• gL1: there is one colorful patch in the matrix (e.g. in the bottom left corner), indicating that the tail winds around the loop once. The color indicates the direction.
• gL2: there are two patches in different colors in the matrix (e.g. one on the left edge and a second one on the bottom edge). This indicates that the tail winds around the loop in one direction and then in the opposite direction. (This spatial arrangement can be observed by following the left edge of the matrix in a descending direction: the beginning of the analyzed segment remains the same, namely the beginning of the tail, while the end of the analyzed segment moves towards the end of the tail. When we approach the patch, a color begins to appear, meaning the tail begins to wind around the loop. Below the colorful patch we again see white, indicating that the tail winds around the loop but in the opposite direction, thereby cancelling the initial winding contribution. Thus the windings "cancel" themselves and the corner of the matrix is again almost white (see Fig. 1).)
• gL3: there are four colorful patches in the matrix, e.g. one in the middle in a different color than the three other patches; this indicates that the tail winds around the loop in one direction, then turns and winds around the loop in the opposite direction, and finally turns back one more time.
• gL_n, for any natural n: there is a specific number of colorful patches in the matrix, dependent on n (namely ⌊(n+1)/2⌋ · ⌊(n+2)/2⌋); this indicates that the tail winds around the loop n times, each time in the opposite direction to the previous one.
• gLS: there is usually one big patch in one color which at some point becomes very intense (claret or navy in the case of negative and positive windings, respectively); this means that the tail winds around the loop in one direction (making a full circle) and then winds around it one more time in the same direction.
• gLL: both matrices for the two tails have at least one colorful patch; this indicates that both tails wind around the loop.

Similar GLN matrices indicate the same topological motifs even though the chains may have a different structure. Examples of the same GLN matrices for proteins with very low sequence similarity are shown in Supplementary Information (Fig. 4 and Fig. 5). The motifs gL_n, gLS and gLL usually correspond to the lasso types L_n, LS and LL, respectively.
The GLN matrices reveal much more detail about the geometry of the chains with lassos. By analysing the location, size and color of a collection of patches one may deduce which parts of the tail wind around the loop and how fast and tightly they wind. For the most part, intense patches correspond to the tail piercing the minimal surface spanned on the loop. This is not always the case, since the tail may make an almost full circle around the loop but not pierce the minimal surface spanned on the loop (see Table 1). Such complex configurations had not been identified by methods that studied intersections with the minimal surface spanned on the loop [11].

Classification of lasso protein structures and entangled but unpierced loops
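The threshold rules set out in this section can be sketched as a small decision function. This is an illustration only: the labels "gL2+" and "gL3+" for the even and odd multi-winding classes and the parameter names `t_l`, `t_l_plus`, `t_ls` are naming assumptions, and the threshold values are left as arguments rather than fixed, since they are choices to be tuned against the minimal-surface classification.

```python
def classify_loop_tail(max_gln, min_gln, wh_gln, t_l, t_l_plus, t_ls):
    """Assign a GLN motif to one loop-tail pair from its summary values."""
    max_abs = max(max_gln, -min_gln)
    if max_abs <= t_l:
        return "gL0"   # no significant winding
    if max_abs > t_ls:
        return "gLS"   # very strong winding in one direction
    # From here on max_abs lies in (t_l, t_ls].
    positive = max_gln > t_l
    negative = -min_gln > t_l
    if positive != negative:
        return "gL1"   # significant winding in exactly one direction
    # Both directions are significant: the whole-tail value tells whether
    # the windings cancel (even count) or leave a net winding (odd count).
    return "gL2+" if abs(wh_gln) <= t_l_plus else "gL3+"


def classify_triple(n_tail_type, c_tail_type):
    """Combine the motifs of both tails of one loop."""
    if n_tail_type == "gL0":
        return c_tail_type
    if c_tail_type == "gL0":
        return n_tail_type
    return "gLL"


# Placeholder thresholds for illustration; only their ordering matters here.
T_L, T_L_PLUS, T_LS = 0.6, 0.6, 1.5

examples = {
    "weak": classify_loop_tail(0.2, -0.1, 0.1, T_L, T_L_PLUS, T_LS),
    "single": classify_loop_tail(0.9, -0.1, 0.8, T_L, T_L_PLUS, T_LS),
    "cancelling": classify_loop_tail(0.9, -0.8, 0.1, T_L, T_L_PLUS, T_LS),
    "net": classify_loop_tail(0.9, -0.8, 0.9, T_L, T_L_PLUS, T_LS),
    "supercoil": classify_loop_tail(1.8, -0.1, 1.7, T_L, T_L_PLUS, T_LS),
}
```

Keeping the thresholds as arguments makes it straightforward to scan candidate values and count how many loops receive the same label as under the minimal-surface method.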

In this section we describe some methods to classify proteins with lassos based on the Gauss linking integral. We propose a precise classification of loop-tail pairs having distinct linking motifs presented by the GLN fingerprints (Fig. 3). This is based on three positive real numbers t_L, t_L+, t_LS (for instance t_L, t_L+ ≈ […], t_LS ≈ […]):
• gL0 - if max|GLN| ≤ t_L,
• gLS - if max|GLN| > t_LS.
In all of the next three cases we demand that max|GLN| ∈ (t_L, t_LS], and:
• gL1 - if exactly one of the values maxGLN and −minGLN is greater than t_L,
• gL2+ - if both values maxGLN and −minGLN are greater than t_L and |whGLN| ≤ t_L+,
• gL3+ - if both values maxGLN and −minGLN are greater than t_L and |whGLN| > t_L+.
One can consider the whole triple consisting of a loop and two tails: if one of the tails is classified as gL0, then we say that the triple is of the type of the second tail; if both tails are classified differently from gL0, we say that the triple is of the type gLL.

Figure 3: Topological fingerprints – GLN matrices. Left, the fingerprints gL1 (top) and gL3 (bottom), respectively, for proteins with one and three piercings of the minimal surface spanned on the loop, based on proteins with pdb codes 1i1j and 2ehg. Right, the fingerprints gL2 (top) and gLS (bottom), respectively, for proteins with two piercings of the minimal surface spanned on the loop in the opposite direction and in the same direction (supercoiling), based on proteins with pdb codes 2ehg and 1zd0. Arrows begin in the places on the matrices where the color is rapidly changing, implying that the tail is in the critical phase of winding around the loop and the GLN is quickly increasing or decreasing. At their other end, on the diagrams, they indicate the neighborhoods of the possible corresponding piercings. The colors of the arrows indicate the directions of the windings.

[Figure 4 plots: "Minimal surface classification" (left) and "GLN classification" (right); axes: maxGLN and minGLN]

Figure 4: Classification of proteins with a closed covalent loop based on the minimal surface technique (left) and the GLN technique (right). As much as 98% of structures are classified in an analogous way by both techniques (corresponding points are colored in the same way on both plots). However, on the right plot, the types are divided more regularly since the corresponding classification is based only on the GLN values. To differentiate between the types gL2+ and gL3+ on the plots (green and red dots, respectively) one needs the third coordinate, the whGLN value.

Let L2+ denote the sum of types L_{2n} for any natural n ≥ 1 (see [12]). Let L3+ denote the sum of types L_{2n+1} for any natural n ≥ 1. We found that it is possible to choose particular values of t_L, t_L+, t_LS (i.e. t_L = 0.[…], t_L+ = 0.[…], t_LS = 1.55) such that as much as 98% of loops are classified in an analogous way by both the minimal surface technique and the GLN, as shown in Fig. 4 (see Supplementary Information Fig. 5 for a detailed comparison). Most of the remaining 2% of loops are structures with intriguing properties that were not recognized before [11]. We split them into three groups.

The first group consists of proteins in which the minimal surfaces spanned on the loops are not pierced but the tails strongly wind around the loop, or the surfaces spanned on the loops are twisted and wind around the tails. When the loop is twisted it appears that there is not enough space to thread the tail through the loop, although it is composed of more than 100 amino acids. There are only 15 such proteins among the set of non-redundant chains of a length lower than 500 amino acids (see Table 1), with max|GLN| > […]69 and no piercings. One can ask how this type of entanglement influences the free energy landscape of the protein in oxidizing conditions. We speculate that, in this case, some part of the configurational space is excluded from protein backbone exploration during folding. Unwanted threading will have to backtrack, thereby slowing down folding or even leading to misfolding.

The second group contains proteins with high |GLN| values and closed loops that are pierced by the tails but, in the minimal surface technique, these piercings are interpreted as being too shallow and are reduced, i.e. they are not taken into account. (Generally, this is a reasonable approach since, for instance, helices that cross surfaces usually cross them at least three times over a short distance. We wish to interpret this as simply one meaningful crossing. However, it is not easy to distinguish shallow crossings from relevant ones (see Supplementary Information Fig. 6), and the parallel analysis of GLN matrices may be very helpful in recognizing which reductions are justified or spatially reasonable.)

The third group consists of structures with a low max|GLN| value but with tails piercing the minimal surface spanned on the loops. There are only 9 such loops (0.01% of the analyzed data set), see Supplementary Information, Table 1. (These structures have max|GLN| ≤ […] max|GLN| < […].

Unique biophysical features of lasso proteins

An analysis of the statistics concerning GLN reveals interesting features from the biological point of view. First of all, windings in the negative direction occur significantly more often than those in the positive direction. For example, among the loops of gL1 type over 63% have a negative GLN value (see Fig. 5, panel B). However, a detailed analysis of basic physico-chemical properties (the type of amino acids, the type of disulfide bridge [26]) does […]

Table 1: "Entangled" proteins without piercing through a covalent loop closed by a disulfide bridge. Based on loops from non-redundant chains of a length lower than 500 amino acids, which are not pierced, but have max|GLN| > […].
Protein (chain) | Loop range | Tail | Max|GLN|
[…]

[…] whGLN values reveals a noticeable depression around the value −0.5. […] maxGLN depends logarithmically on the length of a tail, up to a length of around 40 amino acids. Next, maxGLN saturates and remains stable around the value 0.25 (0.55 for polymers) (see Fig. 6).

[Figure 5 plots: histograms of max GLN, min GLN and wh GLN for proteins; wh GLN for random polymers]

Figure 5: Distribution of maxGLN, minGLN and whGLN values based on the 13,320 loops closed by disulfide bridges. Panels A, B, C indicate that there are more negative GLN values than positive ones in proteins. A) Histogram of all maxGLN and minGLN values that are greater than 0.[…] −[…]15, and 53% of them are negative. B) Histogram of all maxGLN or minGLN values (only the greater value, in the sense of absolute value, from each pair is taken into account here) from the loops of gL1 type; over 63% of them are negative. C) Histogram of all whGLN values in the analyzed dataset, revealing the local minimum around the value −0.5. D) Histogram of whGLN for random polymers.

[Figure 6 plots: average max|GLN| versus N tail length, C tail length and tail length (0 to 160); legends: Data, Fitted function (log); Data - Lasso Proteins, Random Polymers]

Figure 6: Left and middle panels: average max|GLN| values for different lengths of N- and C-tails, respectively; first they grow logarithmically, then they become more or less constant, equal to about 0.25. Right panel: comparison of average max|GLN| values for different lengths of tails in proteins (N- and C-tails counted together) and for random polymers. The plot reveals a similar pattern but with much higher GLN values in polymers, stabilizing around 0.55.

Finally, the analysis of B-factors (the temperature factor) shows that, in chains with short loops, amino acids for which |GLN| between the loop and the tail's fragment from its beginning to the amino acid is the highest have higher B-factors than average. Moreover, amino acids for which |GLN| between the loop and the unit segment corresponding to the amino acid is the highest (often those segments pierce the minimal surface spanned on the loop) have significantly lower B-factors, lower even than the amino acids creating cysteine bridges. For all loops the tendency is similar, although somewhat weaker (see Table 2). This suggests that the parts of tails piercing the surfaces spanning the loops are more stable, while the parts of tails between bridges and crossings fluctuate more. This is in agreement with available experimental data for lasso-type polypeptides [27].

Table 2: Correlation between GLN values (of unit segments of tails and whole loop) and B-factors for corresponding amino acids in lasso proteins. Second column: proteins with loops consisting of less than 50 amino acids are taken into account. Third column: all loops.
Average B-factor for amino acids – | Short loops | All loops
all | […] | […]
[…]

[…] |GLN| values and, inversely, high |GLN| values correlate with low B-factors. This again suggests that pieces of the tail winding around the loop are more stable than the other segments of the tail.

Applications of the GLN fingerprint

Understanding the mechanism by which proteins fold to their native structure is a central problem in protein science [28]. In the case of a majority of proteins, native contacts are sufficient to drive the folding of the protein [29, 30, 31] since their free energy landscape is minimally frustrated [32]. The fraction of native contacts, called Q, was shown to be a good reaction coordinate to study the folding mechanism for a majority of proteins [28]. However, in the case of proteins with non-trivial topology (e.g. the smallest knotted protein MJ0366 [33]), Q merely represents the progress of folding [34].

[Figure 7 plots: two GLN matrices; x-axis: Residue ID (beginning of a segment); y-axis: Residue ID (end of a segment); annotations: low |GLN|, high B-factor]

Figure 7: Correlation between GLN values and B-factors shown in the GLN matrices for proteins (left: pdb id 4ors, with the loop closed by amino acids 89-186; right: pdb id 2ehg, with the loop 58-145; matrices are for N-terminals), both of gL[…] type. On the right edge of the matrix, B-factors are in black and |GLN| values between unit segments and the whole loop are in green. Note that when a local |GLN| is high it usually means that the tail is just winding around the loop, which results in color changes on the left edge of the matrix. When the local |GLN| is low, the tail is often far from the loop, not winding around it as significantly at that location.

Next, we show that the GLN values and the GLN fingerprint can reveal information about the topology that is hidden from Q, based on unfolding pathways simulated with a structure-based model [35]. In fact, in the case of the ribonuclease U2 protein with the gL3 motif (the loop is pierced three times), GLN values reveal an ensemble of transition states composed of at least two unfolding pathways: via the slipknot topology [16, 36] or via direct unthreading (see Fig. 8). Moreover, superposition of the fingerprints over time shows how the protein backbone travels through the available conformation space. The same technique can be applied to reveal the untying of even more complex topologies such as the supercoiling motif gLS (one tail winding around the loop and piercing it two or more times from the same side). The unfolding pathway for a protein with gLS is shown in Supplementary Information Fig. 7.

[Figure 8 plots: GLN matrix of the native protein (x-axis: Residue ID (beginning of a segment); y-axis: Residue ID (end of a segment)) and two time-resolved panels labelled Pathway I and Pathway II (x-axis: Time)]

Figure 8: Example of two topologically different unfolding routes identified with the GLN method for the ribonuclease U2 (pdb ID 3agn) with the gL3 motif (the closed loop is pierced three times). Left panel: the GLN matrix at the native conformation. Middle panel: visualization of unfolding via unthreading of the internal loop toward the gL1 motif, followed by a single unthreading to trivial topology. Each column of this matrix corresponds to a single time frame in the simulation and represents the left edge of the GLN matrix for this frame. Right panel: untying to the gL[…] geometry, followed by untying via the slipknot motif to gL[…].

The application of the GLN is not limited to studying lasso proteins or proteins with links [7]. Since the GLN measures mutual entanglement, its fingerprint is different for "the same" protein with two topologies, unknotted and knotted (see Supplementary Information Fig. 8) [37]. Furthermore, the pattern of the GLN fingerprint can be used to identify the types of secondary structure of the protein, which are usually visible via a contact map. Note that the shape of the contact map depends on the cutoff distance used to determine physical contacts, while GLN does not depend on additional parameters. Moreover, the sign of GLN (blue or red color on the matrix) indicates the "direction of contact", i.e. from it one can deduce on which side the fragments of the protein chain that are in contact pass each other (for more details see Supplementary Information Fig. 8, Fig. 9). Thus, the GLN fingerprint of a native conformation can be used as a reference value for a reaction coordinate in studying the folding pathways of proteins.

Discussion and conclusions

We have shown that the GLN method is a significantly faster technique to detect entanglement in proteins with closed loops in comparison with the methods which rely on minimal surfaces spanning the covalent loops [11]. The method also reveals much more information about the geometry of chains with lassos, which may lead to new biological and chemical discoveries. However, the algorithm based on the surfaces has the advantage of giving precise information about the exact residues that cross the spanning surface, which may lead to important insight from the biological point of view. We believe both approaches can complement each other and, together, help focus study on important features of the protein.

The GLN fingerprint can also be used to compare proteins, e.g. during the CASP or CAPRI competitions. Indeed, it can be pushed further, so that the GLN fingerprint provides a powerful tool to improve the already very successful deep learning algorithms used to predict tertiary and quaternary structure of proteins via image recognition [38].

The present method can be applied to any structure in which a loop and a tail can be defined. Apart from the cysteine bridge loops investigated here, a loop can be formed, among others, by a salt bridge, by a hydrogen bond, or by ions. An example of the last case is the human transport protein (PDB code 1n84), with the loop closed by the Tyr95-Fe339-Asp63 interaction, whose spanning surface is pierced by the C-terminal tail (Thr250) [39], thus forming a lasso of gL type.

Moreover, one can apply the GLN approach to study entanglement between two structures, neither of which is a closed loop. Recently a new algorithm, GISA, was proposed to study local entanglement in protein chains and other biopolymers [40]. The algorithm computes Gauss integrals between many pairs of quite short fragments of a chain and finds rare invariant values. It can also be helpful in the search for knots, links and highly entangled configurations not previously described. Furthermore, since this approach is much faster than other linking invariants, it will provide a very useful technique to study loops in a single chromosome as well as chromosome entanglement in the cell [41, 42]. Current methods allow one to describe single chromosomes with high resolution (thousands of beads). This number is already an order of magnitude bigger than the typical length of a protein.

Materials and Methods

Gaussian linking number.

A definition of the linking number between two closed curves $\gamma_1$ and $\gamma_2$ in 3 dimensions is given by the Gauss double integral,

$$\mathrm{GLN} \equiv \frac{1}{4\pi} \oint_{\gamma_1} \oint_{\gamma_2} \frac{\vec{r}^{(1)} - \vec{r}^{(2)}}{|\vec{r}^{(1)} - \vec{r}^{(2)}|^{3}} \cdot \left( d\vec{r}^{(1)} \times d\vec{r}^{(2)} \right), \tag{1}$$

where $\vec{r}^{(1)}$ and $\vec{r}^{(2)}$ are positions on the two curves. Gauss proved that, for closed oriented curves, this integral is always an integer, is an invariant up to isotopy, and measures how many times one curve winds around the other. In the protein case, chains become collections of points, i.e., positions of C$\alpha$ atoms $\{\vec{r}^{(k)}_1, \vec{r}^{(k)}_2, \dots, \vec{r}^{(k)}_{N_k}\}$, for chains of length $N_k$, $k = 1, 2$. The integrals may be replaced by sums over segments $d\vec{R}^{(k)}_i = \vec{r}^{(k)}_{i+1} - \vec{r}^{(k)}_i$, for which we use the midpoint approximation $\vec{R}^{(k)}_i = (\vec{r}^{(k)}_{i+1} + \vec{r}^{(k)}_i)/2$. We can replace the requirement of having oriented closed loops by oriented open arcs, giving a real value as a measure of linking rather than an integer. We can then perform the double Gauss discrete integral over the open chains,

$$\mathrm{GLN} \equiv \frac{1}{4\pi} \sum_{i=1}^{N_1-1} \sum_{j=1}^{N_2-1} \frac{\vec{R}^{(1)}_i - \vec{R}^{(2)}_j}{|\vec{R}^{(1)}_i - \vec{R}^{(2)}_j|^{3}} \cdot \left( d\vec{R}^{(1)}_i \times d\vec{R}^{(2)}_j \right). \tag{2}$$

Note that one can simply employ the Banchoff method on the open chain to explicitly calculate this integral [24]. We define

$$G(i,j) := \frac{\vec{R}^{(1)}_i - \vec{R}^{(2)}_j}{|\vec{R}^{(1)}_i - \vec{R}^{(2)}_j|^{3}} \cdot \left( d\vec{R}^{(1)}_i \times d\vec{R}^{(2)}_j \right), \tag{3}$$

$i \in \{1, \dots, N_1 - 1\}$, $j \in \{1, \dots, N_2 - 1\}$, and consider a pair of a tail of length $N_1$ and a loop of length $N_2$. We calculate and then analyze four main values for each pair of a loop and a tail:

• whGLN: value of the Gauss double integral between a loop and the whole tail,

$$\mathrm{whGLN} = \frac{1}{4\pi} \sum_{i=1}^{N_1-1} \sum_{j=1}^{N_2-1} G(i,j); \tag{4}$$

• minGLN (maxGLN): minimum (maximum) value of the Gauss double integral between a loop and any fragment of a tail, $\mathrm{minGLN} = \min_{k,l \in \{1, \dots, N_1 - 1\},\ k\,\dots}$
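The discrete sums in Eqs. (2)-(4) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation. The function names `gauss_matrix`, `wh_gln` and `min_max_gln` are hypothetical, chain 1 is taken to be the tail and chain 2 the loop, and a "fragment of a tail" is assumed to mean any contiguous run of at least one tail segment. The toy input, a long straight tail passing through a closed circular polygon, is chosen because the continuous Gauss integral for an infinite line through a circle equals 1 in magnitude.

```python
import numpy as np

def gauss_matrix(chain1, chain2):
    """Return G(i, j) / (4*pi) for all segment pairs of two open chains.

    chain1, chain2: point arrays of shape (N1, 3) and (N2, 3).
    """
    c1 = np.asarray(chain1, dtype=float)
    c2 = np.asarray(chain2, dtype=float)
    d1, d2 = np.diff(c1, axis=0), np.diff(c2, axis=0)   # segment vectors dR
    m1 = 0.5 * (c1[1:] + c1[:-1])                       # segment midpoints R
    m2 = 0.5 * (c2[1:] + c2[:-1])
    diff = m1[:, None, :] - m2[None, :, :]              # R_i^(1) - R_j^(2)
    cross = np.cross(d1[:, None, :], d2[None, :, :])    # dR_i^(1) x dR_j^(2)
    dist3 = np.linalg.norm(diff, axis=-1) ** 3
    return np.einsum("ijk,ijk->ij", diff, cross) / dist3 / (4.0 * np.pi)

def wh_gln(tail, loop):
    """Eq. (4): Gauss sum between the loop and the whole tail."""
    return gauss_matrix(tail, loop).sum()

def min_max_gln(tail, loop):
    """Min and max Gauss sum between the loop and contiguous tail fragments.

    Assumption: a fragment is a run of consecutive tail segments (length >= 1).
    """
    per_segment = gauss_matrix(tail, loop).sum(axis=1)  # one value per tail segment
    prefix = np.concatenate(([0.0], np.cumsum(per_segment)))
    frag = prefix[None, :] - prefix[:, None]            # frag[k, l] = segments k..l-1
    k, l = np.triu_indices(len(prefix), k=1)            # keep only k < l
    values = frag[k, l]
    return values.min(), values.max()

# Toy example: unit circle in the xy-plane (closed by repeating the first point)
# pierced once by a straight tail running along the z-axis.
t = np.linspace(0.0, 2.0 * np.pi, 201)
loop = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)])
z = np.linspace(-50.0, 50.0, 1001)
tail = np.column_stack([np.zeros_like(z), np.zeros_like(z), z])

print("whGLN:", wh_gln(tail, loop))          # expected to be close to +1
print("minGLN, maxGLN:", min_max_gln(tail, loop))
```

Reversing the orientation of either chain flips the sign of every term, which is why the sign of the GLN carries directional information. The prefix-sum step evaluates all contiguous fragments at quadratic cost in the number of tail segments, after a single evaluation of the segment-pair matrix.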

We use the set of 5,106 non-redundant proteins with at least one bridge from the LassoProt database [12] (March 2016). By non-redundant we mean that sequence similarity is lower than 35%; the set includes X-ray, NMR and CEM structures as well as proteins with unresolved parts. We chose only one chain from each protein and identified 13,320 covalent loops in total. This dataset includes 1,276 chains with unresolved parts, which were reconstructed with Gaprepair [43] based on Modeller [44]. For details see the Supplementary Information file.

The minimal surface method and molecular visualization.

The surface is approximated by a discrete triangulation as described in [11, 12]. To distinguish structures with the same number of piercings but in which the minimal surface spanned on the loop is pierced in different ways, an orientation of the surface spanned on the disulfide loop was introduced. Two piercings may occur if the tail pierces the loop in one direction and then in the opposite direction (the L structure), or pierces it twice in the same direction, winding around the loop (the LS structure). Additionally, the Pylasso [45] and PyLink [46] plugins for PyMOL were used to facilitate analysis and produce molecular graphics.

Molecular dynamics simulation.

The kinetic data were obtained from simulations of a coarse-grained model, run using the Gromacs package with SMOG software [35] and parameters from [47].

Random lassos sampling.

Phantom lassos (polymers deprived of any interactions and volume) were created by connecting phantom loops and phantom tails. Phantom loops were created as equilateral polygons using the dedicated algorithm [48], tested earlier in [49].
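As a rough illustration of the tail part of this construction, the sketch below generates a phantom tail as an equilateral random walk attached to a given point. This is an assumption: the text does not specify how the tails were sampled, and the dedicated loop algorithm of [48] is not reproduced here. The function name `phantom_tail` and its parameters are hypothetical.

```python
import numpy as np

def phantom_tail(start, n_steps, rng):
    """Equilateral random walk: n_steps unit-length steps from `start`.

    Assumption: each step direction is drawn uniformly on the unit sphere,
    with no excluded volume and no interactions.
    """
    start = np.asarray(start, dtype=float)
    steps = rng.normal(size=(n_steps, 3))                  # isotropic Gaussian vectors
    steps /= np.linalg.norm(steps, axis=1, keepdims=True)  # normalise to unit length
    return np.vstack([start, start + np.cumsum(steps, axis=0)])

rng = np.random.default_rng(0)
# Attach a 100-step tail to a hypothetical loop-closing point at the origin.
tail = phantom_tail([0.0, 0.0, 0.0], 100, rng)
print(tail.shape)  # (101, 3)
```

A phantom lasso would then be obtained by using the last vertex of a sampled loop as `start`.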

Acknowledgments

The authors would like to thank Szymon Niewieczerzal and Bartosz Gren for help with running simulations, and Eleni Panagiotou and Pawel Dabrowski-Tumanski for useful discussions. This work was financed from the budget of the Polish Ministry for Science and Higher Education Grant [

Author Contribution

J.I.S., K.C.M. and W.N. designed the work; W.N. and J.I.S. performed the work and wrote the paper.

Additional information

Supplementary Information is attached.

Competing financial interests:

The authors declare no competing financial interests.