Networks of motifs from sequences of symbols

Abstract

We introduce a method to convert an ensemble of sequences of symbols into a weighted directed network whose nodes are motifs, while the directed links and their weights are defined from statistically significant co-occurrences of two motifs in the same sequence. The analysis of communities of networks of motifs is shown to be able to correlate sequences with functions in the human proteome database, to detect hot topics from online social dialogs, and to characterize trajectories of dynamical systems, and it may find further applications in processing large amounts of data in various fields.


arXiv [q-bio.MN]

Roberta Sinatra,∗ Daniele Condorelli, and Vito Latora

Dipartimento di Fisica ed Astronomia, Università di Catania, and INFN, Via S. Sofia 64, 95123 Catania, Italy
Laboratorio sui Sistemi Complessi, Scuola Superiore di Catania, Via San Nullo 5/i, 95123 Catania, Italy
Dipartimento di Scienze Chimiche, Sezione di Biochimica e Biologia Molecolare, Università di Catania, Viale A. Doria 6, 95125 Catania, Italy

(Dated: May 30, 2018)

PACS numbers: 89.75.Fb, 89.75.Hc, 87.18.Cf

There are many examples in biology, in linguistics and in the theory of dynamical systems where information resides in, and has to be extracted from, corpora of raw data consisting of sequences of symbols. For instance, a written text in English or in another language is a collection of sentences, each sentence being a sequence of letters from a given alphabet. Not all sequences of letters are possible, since the sentences are built on a lexicon of a certain number of words. In addition, different words are used together in a structured and conventional way [S1, S3]. Similarly, in biology, DNA nucleotide or amino acid sequence data can be seen as corpora of strings [3, 5, 6, S4]. For example, it is well known that proteomes are far from being a random assembly of peptides, since clustering of amino acids [7] and strong correlations among proteomic segments [8] have been clearly demonstrated. These results give meaning to the metaphor of protein sequences regarded as texts written in a still unknown language [3, 9]. Sequences of symbols can also be found in time series generated by dynamical systems. In fact, a trajectory in phase space can be transformed into a sequence of symbols by the so-called “symbolic dynamics” approach [10]. The basic idea is to partition phase space into a finite number of regions, each of which is labelled with a different symbol. In this way, each initial condition gives rise to a sequence of symbols representing the initial cell, the cell occupied at the first iterate, the cell occupied at the second iterate, and so forth.

In all the examples mentioned above, the main challenge is to decipher the message contained in the corpora of data sequences, and to infer the underlying rules that govern their production. In order to do this, one needs: i) to detect the fundamental units carrying information, as words do in language, and ii) to study their combination syntax in the ensemble of sequences.
In fact, information in its general meaning is located not only at the level of strings, but also in their correlation patterns [11, 12]. In this Letter, we introduce a method to transform a generic corpus of strings, such as written texts, protein sequence data, sheet music, or a collection of dance movement sequences [13], into a network representing the significant and fundamental units of the original message together with their relationships. The method relies on a statistical procedure to detect patterns carrying relevant information, and works as follows. We first construct a dictionary of the recurrent strings of k letters, called k-motifs. Recurrent strings play, in this more general context, the same role as words in written or spoken languages. We then construct a k-motif network, a graph in which each node is one entry of the dictionary, and a directed arc between two nodes is drawn when the ordered co-occurrence of the two motifs is statistically significant in the dataset analyzed. We will show how the analysis of topological properties of networks of k-motifs, such as the detection of community structures [14, 15], allows one to extract important information encoded in the original data. In particular, we will consider the application of the method to datasets in three different domains, namely, biological sequences of proteins, messages from online social networks, and sequences of symbols generated by the trajectories of a dynamical system.

Let us consider an ensemble $\mathcal{S}$ of $S$ sequences of symbols. Each sequence $s$ ($s = 1, 2, \ldots, S$) is a string of letters from an alphabet $\mathcal{A}$ of $A$ letters, $\mathcal{A} \equiv \{\sigma_1, \sigma_2, \ldots, \sigma_A\}$. In general, the strings can have different lengths. We indicate by $l_s$ the length of sequence $s$, and by $L = \sum_{s=1}^{S} l_s$ the total length of the ensemble. An example is provided by proteomes. A proteome is the collection of the $S$ proteins of a species.
Each protein is a sequence of variable length $l_s$, made of symbols from an alphabet $\mathcal{A}$ with $A = 20$ letters, $\mathcal{A} \equiv \{\sigma_1, \sigma_2, \ldots, \sigma_{20}\}$, where each $\sigma$ labels one of the amino acids a protein can be made of.

We define as a k-string a segment of $k$ contiguous letters $x_1 x_2 \ldots x_k$, where $x_i \in \mathcal{A}\ \forall i$. The number of all possible k-strings is $A^k$, while from the ensemble of sequences $\mathcal{S}$ we can select only $L - S \cdot (k-1)$ overlapping k-strings, so that some of the possible k-strings do not occur, some of them occur once, and others more than once, either in the same or in different sequences of symbols. We define

$$p_{\rm obs}(x_1 x_2 \cdots x_k) = \frac{c(x_1 x_2 \cdots x_k)}{\sum_{(x_1, x_2, \cdots, x_k) \in \mathcal{A}^k} c(x_1 x_2 \cdots x_k)} \qquad (1)$$

as the observed probability of a string $x_1 x_2 \ldots x_k$. This probability is obtained by counting the total number of times, $c(x_1 x_2 \ldots x_k)$, the string actually occurs in the sequences of the ensemble. To assess the statistical significance of the string, the probability in Eq. (1) has to be compared with the expected probability $p_{\rm exp}(x_1 x_2 \cdots x_k)$ of the string occurrence. The latter can be evaluated under different assumptions. In fact, the joint probability $p(x_1 x_2 \cdots x_k)$ can be written as

$$p(x_1 x_2 \cdots x_k) = p(x_1 x_2 \cdots x_{k-1})\, p(x_k | x_1 x_2 \cdots x_{k-1}),$$

and different approximations for the conditional probabilities $p(x_k | x_1 x_2 \cdots x_{k-1})$ lead to different values of the expected probability $p_{\rm exp}(x_1 x_2 \cdots x_k)$. Namely, if we assume that the occurrence of a letter does not depend on any of the previous letters, i.e. $p(x_k | x_1 x_2 \cdots x_{k-1}) = p(x_k)$, the expected probability is simply given by the product of the relative frequencies of the string's component letters: $p_{\rm exp}(x_1 x_2 \cdots x_k) = p_{\rm obs}(x_1) \cdots p_{\rm obs}(x_k)$ [16, S7]. By using instead a first-order Markov approximation, i.e. $p(x_k | x_1 x_2 \cdots x_{k-1}) = p(x_k | x_{k-1})$, the expected probability can be expressed in the form

$$p_{\rm exp}(x_1 x_2 \cdots x_k) = p_{\rm obs}(x_1)\, p_{\rm obs}(x_2 | x_1) \cdots p_{\rm obs}(x_k | x_{k-1}),$$

where $p_{\rm obs}(x_j | x_i)$ is extracted from the counts as $p_{\rm obs}(x_j | x_i) = c(x_i x_j) / \sum_{x_j} c(x_i x_j) = p_{\rm obs}(x_i x_j) / p_{\rm obs}(x_i)$. This latter assumption is based on the fact that there is a minimal amount of memory in the sequence: a symbol of the sequence is correlated with the previous one only.

Here, we go beyond the approximation of Markov chains of order 1, by retaining as much memory as possible [S4]. We assume:

$$p_{\rm exp}(x_1 x_2 \cdots x_k) = p_{\rm obs}(x_1 x_2 \cdots x_{k-1}) \cdot p_{\rm obs}(x_k | x_2 \cdots x_{k-1}) \qquad (2)$$

where the conditional probabilities can be evaluated from counts as

$$p_{\rm obs}(x_k | x_2 \cdots x_{k-1}) = \frac{c(x_2 x_3 \cdots x_k)}{\sum_{x_k} c(x_2 x_3 \cdots x_k)} \qquad (3)$$

or can be expressed in terms of the observed probability for shorter sequences as

$$p_{\rm obs}(x_k | x_2 \cdots x_{k-1}) = \frac{p_{\rm obs}(x_2 \cdots x_k)}{p_{\rm obs}(x_2 \cdots x_{k-1})} \qquad (4)$$

By using the latter expression, we can finally write the expected probabilities in a more compact form:

$$\begin{aligned}
p_{\rm exp}(x_1) &= p_{\rm obs}(x_1) \\
p_{\rm exp}(x_1 x_2) &= p_{\rm obs}(x_1 x_2) \\
p_{\rm exp}(x_1 x_2 x_3) &= \frac{p_{\rm obs}(x_1 x_2)\, p_{\rm obs}(x_2 x_3)}{p_{\rm obs}(x_2)} \\
&\;\;\vdots \\
p_{\rm exp}(x_1 x_2 \cdots x_k) &= \frac{p_{\rm obs}(x_1 \cdots x_{k-1}) \cdot p_{\rm obs}(x_2 \cdots x_k)}{p_{\rm obs}(x_2 \cdots x_{k-1})}
\end{aligned} \qquad (5)$$

This way, the expected probability of a given k-string is evaluated based on observations for strings of up to $(k-1)$ symbols. Therefore, by predicting the probability of appearance with a high-order Markov model, our method highlights the true k-body correlations, subtracting from them the effects due to $(k-1)$- and lower-order correlations. Based on observed and expected probabilities, a test of statistical significance, for instance a Z-score, is then performed for each k-string.
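As an illustration of Eqs. (1) and (5), the following minimal Python sketch counts overlapping k-strings and evaluates the expected probability for k ≥ 3. The function names (`kstring_counts`, `observed_probabilities`, `expected_probability`), the toy ensemble, and the choice to return 0 when a shorter string is never observed are assumptions made for this example; they are not taken from the original method description.

```python
from collections import Counter


def kstring_counts(sequences, k):
    """Count every overlapping k-string in every sequence of the ensemble."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def observed_probabilities(counts):
    """Normalise counts by the total number of k-strings, as in Eq. (1)."""
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


def expected_probability(word, p_km1, p_km2):
    """Eq. (5) for k >= 3: p(x1..x_{k-1}) * p(x2..x_k) / p(x2..x_{k-1})."""
    if len(word) < 3:
        raise ValueError("this sketch only covers k >= 3")
    denom = p_km2.get(word[1:-1], 0.0)
    if denom == 0.0:
        return 0.0  # assumption: undefined ratios are reported as 0
    return p_km1.get(word[:-1], 0.0) * p_km1.get(word[1:], 0.0) / denom


# Toy ensemble (hypothetical data): S = 2 sequences, total length L = 7.
sequences = ["ABAB", "ABA"]
p1 = observed_probabilities(kstring_counts(sequences, 1))
p2 = observed_probabilities(kstring_counts(sequences, 2))
p3 = observed_probabilities(kstring_counts(sequences, 3))
ratio = p3["ABA"] / expected_probability("ABA", p2, p1)
```

The quantity `ratio` is the observed-to-expected ratio of the string, which is the kind of number a significance test would then be applied to.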
We define k-motifs, or recurrent k-strings, as the statistically relevant strings whose observed and expected numbers of occurrences pass the statistical test adopted, and we indicate as $Z_k$ the dictionary composed of all the selected k-motifs [18].

Once we have constructed a lexicon of fundamental units, the next goal is to represent in a graph the way they are combined together. Recurrent k-strings can be distributed differently along the sequences: they can appear in a single sequence or in more than one sequence, alone or in clusters. To extract the non-trivial patterns of correlated appearance of k-motifs, we need to evaluate the probability of the random co-occurrence of two motifs when these are uncorrelated. We first estimate the expected probability that motif $X$ is followed by motif $Y$ within a generic sequence of the ensemble $\mathcal{S}$, then we sum over all the sequences of $\mathcal{S}$. We denote as $p(X)$ and $p(Y)$ the probabilities of finding the two motifs in $\mathcal{S}$. In sequence $s$, motif $X$ can occupy positions ranging from the first to the $(l_s - k)$th site, where $l_s$ is the length of $s$ and $k$ is the length of the motif. We have assumed that the two motifs cannot overlap. For each fixed position $i$ of $X$ on $s$, with $i = 1, \ldots, (l_s - k)$, there are $(l_s - k + 1 - i)$ possibilities for $Y$ to appear in the sequence. Hence, the number of expected co-occurrences of $X$ and $Y$ within $s$ is given by $\sum_{i=1}^{l_s - k} (l_s - k + 1 - i)\, p(X)\, p(Y)$. In order to obtain the expected number of co-occurrences, we have to sum over all the sequences in the ensemble $\mathcal{S}$. We finally get:

$$N_{\rm exp}(Y|X) = p(X)\, p(Y) \sum_{s=1}^{S} \sum_{i=1}^{l_s - k} (l_s - k + 1 - i) = \frac{1}{2}\, p(X)\, p(Y) \sum_{s=1}^{S} (l_s - k + 1)(l_s - k + 2) \qquad (6)$$

For each value of $k$, we are now able to construct the k-motif network of the ensemble $\mathcal{S}$, i.e. a directed network whose nodes are the motifs in the dictionary $Z_k$, and in which an arc points from node $X$ to node $Y$ if the number of times $Y$ follows $X$ in the ensemble of sequences is statistically significant. Furthermore, a weight can be associated with the arc from $X$ to $Y$, based on the extent to which the co-occurrence of the two motifs deviates from expectation.

This approach is able to represent the correlation patterns encrypted in the ensemble of sequences in a single object, the k-motif network. Graph theory then allows one to extract information from the structural properties of the network, and to retrieve the main message encoded in the original sequences.

Node labels of Fig. 1 (3-motifs, grouped by community number):
1: AIC, DRY, FMY, MAY, MYF, PMY
2: CAC, CDC, CEC, CFC, CHC, CIC, CKC, CLC, CQC, CRC, CSC, CTC, CVC, CVN, CWA, CWC, FQC, FRC, HYW, MCI, NCS, NCT, QCQ, YMC, YRC
3: CGK, CKE, CNE, ECG, ECK, EKP, FKC, GEK, GKA, HKC, HMR, HQR, HTG, IHT, KAF, KCE, KCK, KCN, KGF, KHK, KPF, KPY, KSF, NEC, PYE, PYK, QHQ, QRI, RIH, RMH, RTH, RVH, TGE, THT, WEW, WYK, YEC, YIC, YKC, YQC, YRD, YTC, YVC
4: CVW, MGM, WFW, WLW, YPN, YSC
5: DAD, NDN, WVW
6: FMC, FTC, NRI, WYH, YWC, YWT
7: FQN, KIW
8: GFG, GSG, GYG, HAH
9: GIP, PGP, PKG, PRG
10: GWG, WPW
11: HGH, HPH
12: ICK, WGW, WQW
13: KFK, KLK, KMK, KVK, SMP
14: RDR, RER
15: RSR, SRS

FIG. 1. The 3-motif network of the human proteome. Nodes belonging to the same community are labeled by the same number and share the same colour. Most of the communities can be associated with a functional domain, as described in table I in [22].
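To make the counting behind the expected number of co-occurrences concrete, here is a small Python sketch. It enumerates the ordered, non-overlapping placements of two k-strings directly, using 0-based start positions, rather than relying on a closed-form sum. The function names and this enumeration convention are assumptions of the example, and the sketch is not claimed to reproduce the printed formula term by term.

```python
def count_ordered_cooccurrences(sequences, x, y):
    """Count pairs where x starts at i and y starts at j >= i + len(x)."""
    total = 0
    for seq in sequences:
        x_pos = [i for i in range(len(seq) - len(x) + 1) if seq.startswith(x, i)]
        y_pos = [j for j in range(len(seq) - len(y) + 1) if seq.startswith(y, j)]
        total += sum(1 for i in x_pos for j in y_pos if j >= i + len(x))
    return total


def expected_cooccurrences(lengths, k, p_x, p_y):
    """p(X) p(Y) times the number of ordered, non-overlapping placements
    of two k-strings, summed over all sequence lengths (assumed counting)."""
    placements = 0
    for length in lengths:
        for i in range(length - 2 * k + 1):       # 0-based start of X
            placements += length - 2 * k + 1 - i  # admissible starts of Y
    return p_x * p_y * placements


# Hypothetical toy data: X = "ABA" is followed by Y = "CDC" once.
toy = ["ABAXXCDC", "CDCABA"]
n_obs = count_ordered_cooccurrences(toy, "ABA", "CDC")
n_exp = expected_cooccurrences([len(s) for s in toy], 3, 0.1, 0.2)
```

Comparing `n_obs` with `n_exp` is the step that decides whether an arc from X to Y is drawn; the probabilities 0.1 and 0.2 are placeholders rather than measured values.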

In particular, it is interesting to study the components of the k-motif network or, if the graph is connected, its community structures, i.e. those groups of nodes tightly connected among themselves and weakly linked to the rest of the graph [15].

In the following, we will consider the application of the method to three different datasets, belonging to three contexts as diverse as biology, social dialogs and dynamical systems. We will show how the community analysis of the related k-motif networks makes it possible to extract functional domains in proteomes, social cascades and hot topics in Twitter, and the increase of chaoticity in deterministic maps.

In the biological context, many methods based on strings deviating from expectancy in a genome [S4, S5] or in a proteome [S6] have already been used to make functional deductions. Although they provide insight into many biological mechanisms [S7], this approach turns out to be insufficient for a complete and exhaustive interpretation of the genomic and proteomic message. A fundamental key to its comprehension is in fact hidden in the correlations among recurrent patterns of strings, which are perfectly represented at a global scale in terms of k-motif networks. Various features of these correlations translate into structural properties of k-motif networks. In Fig. 1 we illustrate, as an example, the 3-motif graph derived from the ensemble of human proteins (see [22] for details about the dataset). We have detected 15 different communities in the graph, labeled in the figure with different colours and numbers. By searching biological databases, we can show that linked pairs of motifs belonging to the same community all co-occur in the same kind of protein domains, and that 9 of these 15 communities can each be associated with a single domain (see table I in [22]).
These results are outstanding compared to the current methods to extract functional protein domains, all based on multi-alignment of sequences, and cannot be obtained if one uses a lower-order Markov model, meaning that it is fundamental to take into account both short- and long-range correlations (for more details on the k-motif networks in proteomes, see [22]).

Important information from k-motif networks can also be retrieved from datasets of social dialogs and microblogging websites, like Twitter. Although in these cases a dictionary is in principle known a priori, not all terms used in the Internet language are listed in the dictionary [24]: abbreviations, “leet language” words, and names of websites or of public figures are just some examples. Moreover, some expressions or combinations of terms appear more frequently in some periods or contexts due to the interest in some hot topics. We have found that communities of k-motif networks derived from microblogging sequences in Twitter during the UK Election in April 2010 are able to detect exactly those hot topics which generate information cascades [S15], as shown in Fig. 1 and Table II of [22]. In Table I we report the links with the highest significance together with the tweet associated with their community. Each tweet was the origin of a cascade and can be associated with a specific topic or event discussed during the election campaign (see [22] for details).

TABLE I. The first ten most significant links between motifs, belonging to 7 different communities in the Twitter dataset [22]. Each community corresponds to a specific tweet or expression that generated a topic cascade.

motif 1 | motif 2 | p_obs/p_exp | Expression or Tweet | Topic
9cle | gg27 | 955.3 | GUARDIAN ICM POLL Cameron 35% Brown 29% Clegg 27% | poll results from various websites, journals, tv channels, etc
5bro | wn29 | 894.8 | GUARDIAN ICM POLL Cameron 35% Brown 29% Clegg 27% | poll results from various websites, journals, tv channels, etc
son4 | 4cle | 924.3 | Brown wins on 44%, Clegg is second on 42%, Cameron 13% None of them 1% | poll results from various websites, journals, tv channels, etc
nesd | ayni | 826.1 | hey Dave, Gordon and Nick : how about a 4th debate on Channel 4 this wednesday night without the rules?! | Proposal for a 4th debate among leaders, made by a journalist on his twitter page
jami | ncoh | 842.0 | Benjamin Cohen | Journalist of Channel 4 News
minc | ohen | 764.9 | Benjamin Cohen | Journalist of Channel 4 News
isob | eymu | 831.4 | |

Finally, k-motif networks carry important information on sequences of symbols generated from trajectories of dynamical systems by the so-called “symbolic dynamics” approach [10]. One is able, for instance, to distinguish ensembles of sequences generated by deterministic maps from those generated by stochastic processes, by looking at the number of components and communities in the k-motif network. In fact, the method, when applied to sequences generated by deterministic equations that are increasingly non-linear, still finds short motifs, while the same does not occur for ensembles of random sequences. Furthermore, we have found that the higher the non-linearity in a conservative deterministic dynamical system, the more disconnected the corresponding k-motif network. In Fig. 2, we show an example of this behaviour for a well-known two-dimensional area-preserving deterministic map, the standard map [S22]. Each point in Fig. 2 represents the number of components in the k-motif network obtained from an ensemble of trajectories produced for a specific value of the non-linearity parameter a. We observe that the number of components increases with a, and this behavior is similar to that of the positive Lyapunov exponent of the map, shown in the inset (see also [22]).

FIG. 2. Standard map: number of components in the k-motif networks (main figure), and the Lyapunov exponent λ (inset), as a function of the non-linearity parameter a.

Summing up, in this Letter we have introduced a general method to construct networks out of any symbolic sequential data. The method is based on two different steps: first it extracts in a “natural” way motifs, i.e. those recurrent short strings which play the same role words do in language; then it represents correlations of motifs within sequences as a network. Important information from the original data is embedded in such a network and can be easily retrieved, as shown with different applications (a biological system, a social dialog and a dynamical system). With respect to previous linguistic methods, our approach does not need a priori knowledge of a given dictionary, and also allows one to compare different ensembles, corresponding, for example, to different values of control parameters in dynamical systems. All this makes the method very general and opens up a wide range of applications, from the study of written text to the analysis of sheet music or sequences of dance movements.
Moreover, the method does not use parameters on the position of motifs in order to correlate them, since co-occurrences are computed within sequences, which represent natural interruptions of a corpus of data (proteins in a proteome, posts in a blog, different initial conditions in a symbolic dynamics, etc.).

We thank A. Giansanti and V. Rosato for stimulating discussions on the biological applications of the method, S. Scellato for providing us with the Twitter dataset, and M. De Domenico for his interesting comments on applications to dynamical systems. This work was partially supported by the Italian To61 INFN project.

∗ Corresponding author: [email protected]

[1] R. Ferrer i Cancho, R. V. Solé and R. Köhler, Phys. Rev. E, 051915(R) (2004); R. Ferrer i Cancho and R. V. Solé, Proc. R. Soc. Lond. B, 2261 (2001).
[2] A. Motter, A. P. S. De Moura, Y. C. Lai, and P. Dasgupta, Phys. Rev. E, 065102 (2002); E. G. Altmann, J. B. Pierrehumbert and A. E. Motter, PLoS One: e7678 (2009).
[3] D. B. Searls, Nature, 211 (2002).
[4] V. Brendel, J. S. Beckmann, and E. N. Trifonov, Journal of Biomolecular Structure & Dynamics, 011 (1986).
[5] C.-K. Peng, S. V. Buldyrev, A. L. Goldberger, S. Havlin, F. Sciortino, M. Simons, H. E. Stanley, Nature, 168 (1992).
[6] N. Scafetta, V. Latora, and P. Grigolini, Phys. Rev. E, 031906 (2002).
[7] V. Rosato, N. Pucello, and G. Giuliano, Trends in Genetics, 278 (2002).
[8] H. J. Bussemaker, H. Li and E. D. Siggia, Proc. Natl. Acad. Sci. USA, 10096 (2000).
[9] Z. Solan, D. Horn, E. Ruppin and S. Edelman, Proc. Natl. Acad. Sci. USA, 11629 (2005).
[10] C. Beck and F. Schlögl, Thermodynamics of chaotic systems (Cambridge University Press, Cambridge, 1993).
[11] L. Lacasa, B. Luque, F. Ballesteros, J. Luque, and J. C. Nuño, Proc. Natl. Acad. Sci. USA, 4972 (2008).
[12] J. Zivkovic, M. Mitrovic and B. Tadic, Studies in Computational Intelligence, Complex Networks, Eds. S. Fortunato et al., 23-34, Springer (2009).
[13] E. Bradley, D. Capps, J. Luftig, and J. Stuart, Open AI Journal, 1-19 (2010).
[14] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang, Phys. Rep., 175 (2006).
[15] S. Fortunato, Phys. Rep., 75 (2010).
[16] L. Ferraro, A. Giansanti, G. Giuliano, and V. Rosato, arXiv:q-bio/0410011v2.
[17] A. Giansanti, M. Bocchieri, V. Rosato, and S. Musumeci, Parasitol. Res., 639 (2007).
[18] The term motif is chosen in analogy with the concept of network motifs, i.e. recurrent patterns of nodes and links in a graph. U. Alon, Nature Reviews Genetics, 450 (2007).
[19] M. Caselle, F. Di Cunto, and P. Provero, BMC Bioinformatics, 7 (2002); D. Corà, F. Di Cunto, P. Provero, L. Silengo, and M. Caselle, BMC Bioinformatics, 57 (2004).
[20] P. Nicodème, T. Doerks, and M. Vingron, Bioinformatics: S161, Suppl. 2 (2002).
[21] A. J. Enright, S. Van Dongen, and C. A. Ouzounis, Nucleic Acids Research: 1575 (2002).
[22] See supplementary material at the end of the paper for details and other results about the application of the method in the three datasets.
[23] K. Lerman and R. Ghosh, in Proc. of ICWSM (2010).
[24] M. Mitrovic and B. Tadic, Eur. Phys. J. B: 293 (2010).
[25] B. V. Chirikov, Phys. Rep.: 263 (1979).

Supplementary material to “Networks of motifs from sequences of symbols”

The properties of k-motif networks can reveal important characteristics of the message encrypted in the original data, just as the analysis of topological quantities (clustering coefficient, average path length and degree distributions) has helped to understand various linguistic features in networks of words co-occurring in sentences [S1, S2], and also to model how language has evolved in networks of conceptually related words [S3]. We present here details and further results on the application of the method described in the main article to three different datasets: proteomic sequences, short text messages acquired from Twitter, the well-known social network and microblogging platform, and ensembles of sequences derived from dynamical trajectories of the standard map by means of a symbolic dynamics approach.

BIOLOGICAL SEQUENCES

Methods to study the over- or under-representation of particular motifs in a complete genome [S4, S5] or in a proteome [S6] have already been proposed, and the results have been used to make functional deductions. Although the information contained in strings deviating from expectancy is useful for the analysis of many biological mechanisms [S7], it turns out to be insufficient for a complete and exhaustive interpretation of the genomic and proteomic message. A fundamental key to its comprehension is in fact hidden in the correlations among recurrent patterns of strings. The spatial structure of proteins provides an example: when a protein folds, segments that are distant along the sequence come close to each other in space. This can happen because two (or more) segments need to physically interact in order to perform the biological function of the protein. Such a mechanism translates into a statistical correlation between short motifs of amino acids, which is well captured by an analysis in terms of k-motif networks.

Human proteome

In our application, we have considered the ensemble of sequences of the human proteome [S8]. It consists of 34180 amino acid sequences of variable size, with an average length of 481 letters. For this dataset, we have computed the probabilities $p_{\rm obs}$ and $p_{\rm exp}$ for each of the $20^3 = 8000$ possible strings of three amino acids, and we have selected as 3-motifs the strings satisfying $\frac{p_{\rm obs}}{p_{\rm exp}} > \left\langle \frac{p_{\rm obs}}{p_{\rm exp}} \right\rangle + 2\sigma$, hence creating the dictionary $Z_3$ [S9]. The entries of the dictionary are the nodes of the 3-motif network. The node $X$ is then linked to $Y$ with a directed arc if the number of times that motif $Y$ follows motif $X$ within the same protein is statistically significant, according to the relation $\frac{p_{\rm obs}(Y|X)}{p_{\rm exp}(Y|X)} > \left\langle \frac{p_{\rm obs}(Y|X)}{p_{\rm exp}(Y|X)} \right\rangle + 2\sigma$. The statistical significance $\frac{p_{\rm obs}(Y|X)}{p_{\rm exp}(Y|X)}$ is also the weight of the arc. In this way we obtain the 3-motif graph of 199 nodes and 1302 directed links, shown in Fig. 1 of the main article. The graph has 86 isolated nodes (not displayed in the figure), while the remaining 113 nodes are organized into 10 weak components. The largest component of the graph contains 5 clusters, detected by means of the MCL algorithm [S10]. Therefore, 15 different communities are present in the graph. In Table II we report, for each community, the number of nodes and its total internal weight, defined as the sum of the weights of links between nodes of the community, normalized by the sum of the weights of links incident on nodes of the community. By submitting a query to the Prosite database [S11] we have obtained, for each pair of connected motifs belonging to the same community, the list of all proteins, classified by domain, where the two motifs co-occur. The results show that linked pairs of motifs belonging to the same community all co-occur in the same kind of domains.
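The selection rule used for both motifs and arcs (keep the entries whose ratio exceeds the mean by more than two standard deviations) can be sketched in Python as follows. The function name `select_significant`, the toy ratios, and the use of the population standard deviation are assumptions of this example; the text does not state which estimator of σ was used.

```python
from statistics import mean, pstdev


def select_significant(ratios, n_sigma=2.0):
    """Keep the keys whose ratio exceeds mean + n_sigma * std of all ratios."""
    values = list(ratios.values())
    threshold = mean(values) + n_sigma * pstdev(values)  # assumed estimator
    return {key: r for key, r in ratios.items() if r > threshold}


# Hypothetical observed-to-expected ratios: nine ordinary strings, one outlier.
ratios = {f"m{i}": 1.0 for i in range(9)}
ratios["hot"] = 10.0
motifs = select_significant(ratios)
```

With these toy values the mean is 1.9 and the standard deviation is 2.7, so only the entry with ratio 10.0 lies above the threshold of 7.3.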
In addition, one can associate 9 of these 15 communities with just one protein domain, since the majority of co-occurrences emerge in proteins matching a well-defined function. In Table II we report, when possible, the association with a single protein domain, together with the ratio between the number of times the pair of motifs with the highest weight occurred in that specific domain and the total number of co-occurrences in the database.

Analogous results were also found for the 4-motif graph [S12], while it is not possible to derive the same kind of information by using lower-order Markov models to construct dictionaries. For example, the 3-motif network constructed with a dictionary based on a lower-order approximation rather than on a 2-body Markov chain exhibits a community structure with just four communities, none of which could be identified with a functional protein domain.

TABLE II. List of communities in the 3-motif network of the human proteome. Community labels as in Fig. 1 of the main text, number of nodes, total internal weight, associated domain, and the domain specificity are reported.

FIG. 3. Components of the motif network of the Twitter dataset. Each component and its associated topic are described in Table III.

SOCIAL NETWORKS AND MICROBLOGGING

By means of k-motif networks, information can also be retrieved from datasets of social dialogs and microblogging websites. Although in these cases a dictionary is in principle known a priori, not all terms used in the Internet language are listed in a dictionary: abbreviations, puns, leet language words [S13], and names of websites or of public figures are just some examples. Moreover, some expressions or combinations of terms appear more frequently in some periods or contexts due to the interest in some hot topics. In addition, the method of k-motif networks turns out to be very useful in all those contexts where it is necessary to process and compact information from large amounts of symbolic data. This is the case of the Internet, where the amount of text data provided by blogs, dialogs in social networks, forums, etc. keeps growing.

In the following, we provide details on how networks of motifs are able to deduce information about hot topics and cascades [S15, S16] in a dataset extracted from Twitter, a well-known platform for social networking and microblogging.

Twitter

Twitter [S14] is a social networking and microblogging service which allows users to send short messages known as tweets. Tweets are composed only of text, with a strict limit of 140 characters: they are displayed on the author's profile page and delivered to the author's subscribers, who are also known as “followers”. The dataset we have analyzed is a collection of 28143 tweets, crawled on two days, the 23rd and 24th of April 2010, and selected through the Twitter Streaming API [S17] if they contained a specific keyword string. The choice of this keyword, which in Twitter is also called a hashtag, was aimed at selecting all the tweets concerning the electoral campaign in the UK, where the general election of the members of the House of Commons would take place two weeks later. We have analyzed the dataset after removing all blank spaces between words and all symbols that were not numbers or letters (punctuation, symbols like $, @, *, etc.), and without distinguishing between lower- and upper-case letters. From these sequences, two dictionaries of motifs of different lengths have been extracted, selecting respectively the 10% and 1% most significant strings of each length. As described in the main text, we have constructed networks whose nodes represent the entries of a dictionary, and an arc is drawn from the node representing string $X$ to the node standing for string $Y$ if $p_{\rm obs}(Y|X)/p_{\rm exp}(Y|X)$ is greater than a certain threshold. In Fig. 3, we show the motif network obtained when the threshold is set equal to 400 (isolated nodes not reported). Such a high threshold is chosen to have a small network that can be easily visualized and studied. More information can be obtained by setting the threshold to lower values or by analyzing networks made up of motifs of different length k.
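The preprocessing described above (dropping blanks and non-alphanumeric symbols, ignoring case) can be sketched as follows. The function name `normalize_tweet` and the restriction to ASCII letters and digits are assumptions made for illustration; the original pipeline is not specified at this level of detail.

```python
import re


def normalize_tweet(text):
    """Lowercase the tweet and keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


# A fragment of one of the tweets reported in the tables of this paper.
sequence = normalize_tweet("Brown 29% Clegg 27%")
```

After this step the fragment becomes a single string without separators, so that 4-letter strings such as "9cle" and "gg27" can span the boundary between a number and a name.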
By searching for the connected motifs in the original dataset, it is possible to associate each component with a particular tweet which generated a cascade, or with a specific expression related to a specific hot topic discussed by users of the microblogging platform. For all components of Fig. 3, we report in Table III the associated tweet or expression and its meaning. For example, components 1 and 4 can be associated with two exit polls disclosed on those days by two different journals, and component 6 with the name “Gillian Duffy”, a 65-year-old pensioner involved in a political scandal with British PM Gordon Brown during the election tour (Brown's remarks describing her as a “bigoted woman” were accidentally recorded and broadcast).

SYMBOLIC DYNAMICS

Symbolic dynamics is a general method to transform trajectories of dynamical systems into sequences of symbols. The distinctive feature of symbolic dynamics is that time is measured in discrete intervals, so that at each time interval the system is in a particular state. Each state is associated with a symbol, and the evolution of the system is then described by a sequence of symbols. The method turns out to be very useful in all those cases where system states and time are inherently discrete. If the time scale of the system or its states are not discrete, one has to set up a coarse-grained description of the system. Different initial conditions usually generate different trajectories in phase space, which map onto different sequences of symbols. A large number of initial conditions produces an ensemble of sequences whose analysis can be addressed with the method based on networks of motifs, as described in the main article.

In the following, we will describe the application of the method to the standard map, and we will show how the related networks of motifs change shape according to its chaotic behavior.
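A minimal Python sketch of this symbolization step is given below. It iterates the Chirikov standard map written in its commonly used textbook form, p' = p + a sin x and x' = x + p' (both mod 2π), which is an assumption of this sketch and may differ in variable labelling from the form used in the text. The 5 × 5 grid matches the 25 equal squares adopted for the standard map, while the function names and the letters used as symbols are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi


def standard_map_step(x, p, a):
    """One iteration of the standard map in its textbook form (assumed)."""
    p_next = (p + a * math.sin(x)) % TWO_PI
    x_next = (x + p_next) % TWO_PI
    return x_next, p_next


def symbolize(x, p, n_cells=5, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXY"):
    """Return the symbol of the grid cell that contains the point (x, p)."""
    col = min(int(x / TWO_PI * n_cells), n_cells - 1)
    row = min(int(p / TWO_PI * n_cells), n_cells - 1)
    return alphabet[row * n_cells + col]


def symbolic_trajectory(x0, p0, a, steps):
    """Symbol sequence of the trajectory starting at (x0, p0)."""
    x, p = x0, p0
    symbols = [symbolize(x, p)]
    for _ in range(steps):
        x, p = standard_map_step(x, p, a)
        symbols.append(symbolize(x, p))
    return "".join(symbols)
```

Calling `symbolic_trajectory` for many initial conditions at a fixed value of `a` yields an ensemble of sequences of the kind the motif analysis takes as input.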

Standard Map

The standard map, also known as the Chirikov map, is a two-dimensional area-preserving chaotic map. It maps a square with side 2π onto itself [S22]. It is described by the equations:

$$
\begin{cases}
p_{t+1} = (p_t + a \sin x_t) \bmod 2\pi \\
x_{t+1} = (x_t + p_{t+1}) \bmod 2\pi
\end{cases}
\tag{7}
$$

where t represents the time iteration and a is a parameter assuming real values. The map is increasingly chaotic as a increases (see the inset of Fig. 2 in the main article for a plot of the Lyapunov exponent as a function of the parameter a). For a = 0, the map is linear and only periodic and quasiperiodic orbits are allowed. When the evolution of trajectories is plotted in the phase space (the xp plane), periodic orbits appear as closed curves, and quasiperiodic orbits as necklaces of closed curves whose centers lie on another, larger closed curve. Which type of orbit is observed depends on the map's initial conditions. When the nonlinearity of the map increases, for appropriate initial conditions it is possible to observe chaotic dynamics.

In order to obtain sequences from the standard map (7) by means of the symbolic dynamics approach [S23], one needs to coarse-grain the phase space, defining a discrete and finite number of possible states the trajectory can occupy. In this way it is possible to associate a symbol with each of the possible states and to derive a sequence from the trajectory originating from a given initial condition. We have coarse-grained the phase space into 25 (5 × 5) squares of equal size and we have derived, for different values of the parameter a, sequences of symbols. In other words, this means following, for a given number of time steps, the trajectories originating from different initial conditions.

The idea is that closed or quasiperiodic orbits correspond to correlations between motifs and therefore result in links of the graph of motifs. When the map becomes more and more chaotic, closed orbits disappear and, correspondingly, the networks break into many components. In the extreme limit of a highly chaotic map (large a), the networks of motifs are completely disconnected, with all nodes isolated. Nevertheless, this scenario is different from the one generated by stochastic sequences, since in that case motifs would not be detected, while they still are in the chaotic map, although only for small values of k. This result is well depicted in Fig. 3 of the main article, where the number of components of the motif graphs is plotted as a function of the value a of the map generating the ensemble. This curve is shown to have the same behavior as the Lyapunov exponent, as reported in the inset of the same figure.

TABLE III. In relation to Fig. 3, we report the number of nodes and links, the tweet or the expression containing the motifs, and the topic associated with each of the 13 communities.

| Comm. | Nodes | Links | Expression or Tweet | Topic |
|---|---|---|---|---|
| 1 | 25 | 33 | Brown wins on 44%, Clegg is second on 42%, Cameron 13% None of them 1% | Poll results from various websites, journals, TV channels, etc. |
| 2 | 12 | 14 | Benjamin Cohen | Journalist of Channel 4 News [S18] |
| 3 | 10 | 11 | hey Dave, Gordon and Nick : how about a 4th debate on Channel 4 this wednesday night without the rules?! | Proposal for a 4th debate among leaders, made by a journalist on his Twitter page |
| 4 | 9 | 13 | GUARDIAN ICM POLL Cameron 35% Brown 29% Clegg 27% | Poll results from various websites, journals, TV channels, etc. |
| 5 | 6 | 5 | Very funny screengrab from the LeadersDebate | About a funny picture of the leaders debate on BBC [S19] |
| 6 | 3 | 2 | Gillian Duffy | Woman branded a 'bigot' by Gordon Brown in general election campaign [S20] |
| 7 | 6 | 5 | Cameron: I believe that if you've inherited hard all your life you should pass it on to your children | Electoral campaign from David Cameron |
| 8 | 6 | 3 | | Strategy in which a voter misrepresents his or her sincere preferences in order to gain a more favorable outcome [S21] |

∗ Corresponding author: [email protected]

[S1] R. Ferrer i Cancho and R.V. Solé, Proc. R. Soc. Lond. B, 2261 (2001).
[S2] S.M.G. Caldeira, T.C. Petit Lobão, R.F.S. Andrade, A. Neme, and J.G.V. Miranda, Eur. Phys. J. B, 523 (2006).
[S3] A. Motter, A.P.S. De Moura, Y.C. Lai, and P. Dasgupta, Phys. Rev. E, 065102 (2002).
[S4] V. Brendel, J.S. Beckmann, and E.N. Trifonov, Journal of Biomolecular Structure & Dynamics, 011 (1986).
[S5] M. Caselle, F. Di Cunto, and P. Provero, BMC Bioinformatics, 7 (2002); D. Corà, F. Di Cunto, P. Provero, L. Silengo, and M. Caselle, BMC Bioinformatics, 57 (2004).
[S6] P. Nicodème, T. Doerks, and M. Vingron, Bioinformatics: S161, Suppl. 2 (2002).
[S7] A. Giansanti, M. Bocchieri, V. Rosato, and S. Musumeci, Parasitol. Res., 639 (2007).
[S8] Data downloaded from the Consensus Coding Sequence database
⟨p(x)⟩, we denote the average of p(x) over all the possible configurations of x and with σ the standard deviation of the distribution.
[S10] A. J. Enright, S. Van Dongen, and C. A. Ouzounis, Nucleic Acids Research
Analysis of proteomes by means of k-motif networks
:263 (1979).
[S23] V.M. Alekseev and M.V. Yakobson, Phys. Rep. 75