Modular networks of word correlations on Twitter

Abstract

Complex networks are important tools for analyzing the information flow in many aspects of nature and human society. Using data from the microblogging service Twitter, we study networks of correlations in the appearance of words from three different categories: international brands, nouns and major US cities. We create networks where the strength of links is determined by a similarity measure based on the rate of coappearance of words. In comparison with the null model, where words are assumed to be uncorrelated, the heavy-tailed distribution of pair correlations is shown to be a consequence of modules of words representing similar entities.


arXiv [physics.soc-ph], Oct

Joachim Mathiesen, Pernille Yde and Mogens H. Jensen

Niels Bohr Institute, University of Copenhagen, Blegdamsvej 17, DK-2100 Copenhagen, Denmark


Introduction

Networks are elegant representations of interactions between individuals in large communities and organizations [1–3]. These networks are constantly changing according to demands, fashions and flow of ideas [4–6]. The worldwide popularity of social media such as Twitter [5–7] has made them a considerable component in research on social networks [8, 9]. Twitter is a microblogging service that allows registered users to post short text-based announcements, limited to 140 characters in length and known as “tweets”, to an online stream. The frequency with which users interact on a global scale on Twitter allows for a high-resolution real-time analysis of movements in society.

From automatic queries to Twitter, we have estimated tweet rates of words from a given set $M$ containing selected words from one of three different categories: international brand names, nouns and major US city names. The rate is measured by the number of new tweets posted per hour. For each query submitted at time $t$ about a specific word $a \in M$, Twitter returns a finite set of $n_a(t)$ tweets $\{T_1, \ldots, T_{n_a(t)}\}$. In addition to the text string $s$ containing the word $a$, each tweet contains the username of the author of the tweet, the time $t_i$ the tweet was posted and further details that we have not used. A tweet $T_i$ is therefore a list of information $T_i = (s, t_i, \ldots)$. The maximum number of tweets returned from each query is $n_a = 1500$.

The time signal of tweets mentioning a specific word $a$, $\eta_a(t)$, can be written in the form

$$\eta_a(t) = \sum_i \delta(t - t_i). \tag{1}$$

From the number of tweets and the timestamps we compute an averaged tweet rate of a word $a$,

$$\gamma_a(t) = \frac{1}{\tau} \int_t^{t+\tau} \eta_a(t')\,\mathrm{d}t' = \frac{n_a(t)}{\tau}. \tag{2}$$

Similarly, we define a rate for two words $a$ and $b$ appearing in the same tweet, $\gamma_{ab}(t) = n_{ab}(t)/\tau$.

Tweets containing words from the aforementioned categories were recorded over a period of four months, from November 2010 to February 2011, and a period of two months, from January 2012 to February 2012.
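The averaged rate of Eq. (2) amounts to counting the timestamps that fall in a window of length $\tau$ and dividing by $\tau$. The following is a minimal sketch of that counting step only. The function name `tweet_rate_per_hour`, the representation of timestamps as hours since an arbitrary origin, and the sample values are all assumptions made for illustration; they are not taken from the data collection described above.

```python
from typing import Sequence


def tweet_rate_per_hour(timestamps_h: Sequence[float], t: float, tau: float) -> float:
    """Averaged rate of Eq. (2): tweets posted in [t, t + tau) divided by tau.

    timestamps_h holds the posting times t_i in hours (assumed representation).
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    # Integrating the sum of delta functions over the window just counts tweets.
    n = sum(1 for t_i in timestamps_h if t <= t_i < t + tau)
    return n / tau


# Hypothetical posting times (hours) of tweets mentioning one word.
stamps = [0.1, 0.4, 0.5, 0.9, 1.2, 1.7, 2.5]
rate = tweet_rate_per_hour(stamps, t=0.0, tau=2.0)
print(rate)  # 3.0, i.e. six tweets in a two-hour window
```

The same function gives the pair rate $\gamma_{ab}$ when it is fed only the timestamps of tweets that contain both words.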
In general, the rate at which new tweets appear containing words from each of the categories is too high to count the total number of tweets. Our analysis is therefore based on estimated tweet rates computed from Eq. (2) using $n_a = 100$ and $n_a = 1500$. When averaging over many queries, we did not see a significant difference in our results when using different values of $n_a$.

We analyse the correlation between individual words within the said categories. For that purpose, we define a measure of similarity in terms of the rate of co-appearance of pairs of words. The measure is then used to construct networks where links represent the degree of similarity. All three categories of words are shown to have a pronounced modularity. The way that we consider correlation networks can be seen as an alternative to existing studies on semantic networks (see e.g. [13]).

Results

We define a similarity measure between two words $a$ and $b$ in terms of the hourly average rate $\gamma_{ab}$ by which new tweets appear containing both $a$ and $b$. For example, by considering queries to Twitter containing the terms "Google" and "Microsoft", we get $\gamma_{\mathrm{Google}} \approx$ …, $\gamma_{\mathrm{Microsoft}} \approx$ … and $\gamma_{\mathrm{Google,Microsoft}} \approx 700$ tweets per hour (January 2011). A normalized symmetric measure of similarity is naturally defined by

$$\omega_{ab} = \frac{\gamma_{a \cap b}}{\gamma_{a \cup b}} = \frac{\gamma_{ab}}{\gamma_a + \gamma_b - \gamma_{ab}}. \tag{3}$$

Alternatively, one could use information theory to compute the similarity from the joint probability of observing two words in the same tweet [10]. This approach would have been useful had we had access to the normalized probabilities of observing $a$ and $b$. Here, because of limitations on the permissible sample rate of data, we only have access to a fraction of the total number of posted tweets containing the relevant words, and therefore we can only estimate the relative probabilities.

In Fig. 2 we present networks of international brand names where the link strength is given by the measure in Eq. (3). A threshold is introduced on the link strength in order to visualize primary structures, i.e. links between brands with a similarity $\omega_{ab} < 0.004$ are omitted. We observe that the network is strongly modular with groups of brands representing similar products or services. As an example, one can observe distinct groups of European car brands, Asian car brands, consulting and IT companies, and fashion brands. The modules in the network are coloured according to the community detection algorithm introduced in [11]. While most of the connections inside the modules are not surprising, the few links connecting the modules represent some less obvious relations between brands.

In Fig. 3A, a similarity network of US cities is shown. The network provides an alternative map where individual cities are only to some extent grouped according to their geographical location. The network is dominated by a central module consisting of New York, Chicago, Atlanta, Los Angeles and Boston. This is not surprising as these cities are hubs in American society. We observe a module of Californian cities that connects naturally to cities like Denver and Seattle. We also detect a module of east-coast to mid-western cities connecting to a module of southern cities. Again the modules were detected by the algorithm presented in [11]. It is natural to ask how much of the similarity between cities is influenced by the geographical distance between them. To answer this question, we have compared tweet rates with the distance between cities as well as the size of the cities. It turns out that there is a weak to moderate correlation between the size of a city and the number of tweets referring to that city. The co-appearance of two cities, however, has no clear correlation with their sizes or the distance between them. That said, when the nodes in the similarity network are arranged according to their geographical location, it is evident that cities in the same regions (states or neighbouring states) are better inter-connected and therefore often belong to the same module, see Fig. 3B.

As a final example of a similarity network, we present in Fig. 4 a network of nouns. From a list of 2000 common nouns in the English language, 200 nouns are randomly selected and the corresponding pairwise similarities are computed. Like the previous networks for brands and cities, the network of nouns also exhibits a pronounced modularity, with modules representing e.g. similar food products.

We now consider further the data underlying the link strength in the networks. As a main result, we obtain scale-free distributions,

$$p(\gamma_{ab}) \sim \gamma_{ab}^{-\alpha}, \tag{4}$$

of the pairwise tweet rates $\gamma_{ab}$ over six orders of magnitude using the brand names, nouns as well as major cities, see Fig. 5. Surprisingly, the distributions all have the same scaling exponent $\alpha = 1.\ldots \pm .02$ (s.d.). The distribution of the tweet rates of individual search terms $a$, $\gamma_a$, does not follow a clear scale-invariant distribution (see inset of Fig. 5). Moreover, the tweet rate of pairs $\gamma_{ab}$ does not follow trivially from the rate of the individual brands, that is, the rate is not proportional to the product $\gamma_a \gamma_b$, which would be the case if $a$ and $b$ were uncorrelated. In particular, we notice that if the distribution of the rates $\gamma_x$ could be approximated by a scale-invariant distribution $p(\gamma_x) \sim \gamma_x^{-\alpha}$, then the product $z = \gamma_a \gamma_b$ would follow a distribution

$$p(z) \sim z^{-\alpha} \log(z), \tag{5}$$

which follows from introducing the auxiliary variable $v = \gamma_a / \gamma_b$ and performing the integral

$$\int_{\epsilon^2/z}^{z/\epsilon^2} p(z, v)\,\mathrm{d}v = \int_{\epsilon^2/z}^{z/\epsilon^2} p\big(\gamma_a(z, v), \gamma_b(z, v)\big) \left| \frac{\partial(\gamma_a, \gamma_b)}{\partial(z, v)} \right| \mathrm{d}v, \tag{6}$$

where $\epsilon$ is a characteristic minimum tweet rate that we observe.

The logarithmic correction to the scaling does not provide a statistically significant fit to the data presented in Fig. 5, that is, a best fit has an exponent $\alpha \approx$ … $\gamma_x$ of individual search terms (see the inset of Fig. 5). A power-law distribution has also been observed for the co-occurrence of tags in social annotation systems [14], where users annotate online resources such as web pages by lists of words. The exponent of the distribution in the annotation systems ($\alpha > 2$) is larger than the one reported here and is close to that of the distribution of co-occurrence of nouns in sentences of novels considered below.
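The similarity measure of Eq. (3) and the thresholding used to draw the networks can be illustrated with a short sketch. It is not the code used for the figures: the function names `similarity` and `similarity_links`, the dictionary layout and the rates for the words "A", "B" and "C" are hypothetical. Only the formula and the 0.004 threshold come from the text.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Tuple


def similarity(rate_a: float, rate_b: float, rate_ab: float) -> float:
    """Eq. (3): omega_ab = gamma_ab / (gamma_a + gamma_b - gamma_ab)."""
    union = rate_a + rate_b - rate_ab
    return rate_ab / union if union > 0 else 0.0


def similarity_links(
    single: Dict[str, float],
    pair: Dict[FrozenSet[str], float],
    threshold: float,
) -> Dict[Tuple[str, str], float]:
    """Return the links whose similarity is at least `threshold`."""
    links = {}
    for a, b in combinations(sorted(single), 2):
        # Pairs that never co-appear are treated as having rate zero.
        gamma_ab = pair.get(frozenset((a, b)), 0.0)
        w = similarity(single[a], single[b], gamma_ab)
        if w >= threshold:  # links with omega_ab < threshold are omitted
            links[(a, b)] = w
    return links


# Made-up hourly rates for three placeholder words.
single = {"A": 1000.0, "B": 500.0, "C": 200.0}
pair = {frozenset(("A", "B")): 30.0, frozenset(("A", "C")): 1.0}
links = similarity_links(single, pair, threshold=0.004)
print(links)  # only the A-B link survives the threshold
```

The resulting weighted link list is the kind of input a community detection method such as the one in [11] would take. That step is not shown here.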

FIG. 1: Probability density functions of the number of search hits returned from Bing and Google and for the number of sentences in which two nouns co-appear in novels. In panel a) we performed pairwise queries on international brands to Bing and Google. In contrast to the result obtained from Twitter, we do not observe clear scale-free distributions. Inset: Probability density functions of search hits returned from queries on individual brands alone. Panel b) shows the number of sentences in which two nouns co-appear in the novels Huckleberry Finn (Mark Twain) and Moby-Dick (Herman Melville). The distributions are plotted on double-logarithmic scales and include the distributions of individual nouns. Dashed lines are best fits to a scale-free distribution and have exponents $\alpha = 2.\ldots \pm \ldots$ and $\alpha = 2.\ldots \pm \ldots$

Discussion

For comparison, we have performed a similar analysis using search engines such as Google and Bing. The similarity between two words was computed from Eq. (3) by inserting the number of web pages that the search engines return containing the words. That is, instead of a rate we now use a fixed number. The distributions turn out to be significantly different (see Fig. 1a) and do not show a clear scaling behavior as in the case of Twitter. This may in part be explained by the fact that the search engines return results from web pages, which are not restricted in size and cover a wide range of media.

Finally, we compare the scaling behavior of word correlations observed on Twitter with the corresponding distribution of nouns in sentences of novels by Mark Twain (Huckleberry Finn) and Herman Melville (Moby-Dick). The sentences in these novels turn out to have a typical length comparable to the 140-character limit of a tweet and do indeed lead to broad but significantly steeper distributions in the word correlations (see Fig. 1b). The novels are written by single authors and typically exhibit a more formal structure compared to the text messages. At the same time, the pair distributions of nouns for the novels are compatible with the null model where all words in the novels are randomized, meaning that the correlated structures in the novels are rather weak. The distributions of individual words were considered for the same novels in [15]. Compared to the novels, the distribution of the co-appearance of words in tweets is less broad, which might be because the active vocabulary of the average user of Twitter is less diverse than that of the authors of the two novels.

Scale invariance is often described by Zipf’s law [12], which states that the frequency of a word (for instance in a language) is inversely proportional to its rank in the frequency table. In its general formulation, Zipf’s law says that the frequency $\gamma$ of a word is a power law in the rank $r$, $\gamma \sim r^{-\alpha}$.
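Zipf's law in this general form can be illustrated by building a rank-frequency table and reading off the slope on double-logarithmic axes. The sketch below makes several assumptions: the helper names `rank_frequency` and `zipf_exponent` are hypothetical, the data are synthetic, and an ordinary least-squares fit in log-log space is used only as a simple illustration rather than a rigorous estimator of power-law exponents.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def rank_frequency(words: Iterable[str]) -> List[Tuple[int, int]]:
    """Rank-frequency table: rank 1 is the most frequent word."""
    freqs = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def zipf_exponent(table: List[Tuple[int, float]]) -> float:
    """Least-squares slope of log(frequency) against log(rank), sign flipped."""
    xs = [math.log(r) for r, _ in table]
    ys = [math.log(f) for _, f in table]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# A toy word list gives the table [(1, 3), (2, 2), (3, 1)].
print(rank_frequency("a b a c a b".split()))

# Synthetic frequencies built to follow gamma = 1000 * r**(-1) exactly.
synthetic = [(r, 1000.0 * r ** -1.0) for r in range(1, 201)]
alpha = zipf_exponent(synthetic)  # recovers an exponent close to 1
```

On real word counts the low-rank and high-rank ends usually deviate from a straight line, so the fitted value depends on the range of ranks included.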
For the corresponding probability density functions we have

$$p(\gamma)\,\mathrm{d}\gamma = p(r)\,\mathrm{d}r \;\rightarrow\; p(\gamma) = p(r) \left| \frac{\mathrm{d}r}{\mathrm{d}\gamma} \right|.$$

Since $\left| \frac{\mathrm{d}r}{\mathrm{d}\gamma} \right| \sim \gamma^{-\frac{1+\alpha}{\alpha}}$, making the natural assumption that the PDF of the rank is a constant, we obtain the PDF of the frequency as

$$p(\gamma) \sim \gamma^{-\frac{1+\alpha}{\alpha}}. \tag{7}$$

Empirically the value $\alpha \sim$ … $\alpha \sim \ldots.1$. In Fig. 1 (inset) we observed a frequency distribution $p(\gamma) \sim \gamma^{-\ldots}$ for words in two novels leading to $\alpha \sim$ … $p(\gamma) \sim \gamma^{-\ldots}$ leading to a rank exponent of the order $\alpha = 2.\ldots$

Acknowledgments

Suggestions and comments by Alex Hunziker and Pengfei Tian are gratefully acknowledged. This study was supported by the Danish National Research Foundation through the Center for Models of Life.

Author Contributions

J.M., P.Y. and M.H.J. designed the research, performed the research, analyzed the data, and wrote the paper.

Competing financial interests

The authors declare no competing financial interests.

[1] Albert R, Barabasi A.-L. (2002) Statistical mechanics of complex networks. Rev. Mod. Phys. 74: 47-97.
[2] Borgatti SP, Mehra A, Brass DJ, Labianca G. (2009) Network analysis in the Social Sciences. Science 323: 892-895.
[3] Kitsak M et al. (2010) Identification of influential spreaders in complex networks. Nature Physics 6: 888-893.
[4] Ratkiewicz J, Fortunato S, Flammini F, Vespignani A (2010) Characterizing and modeling the dynamics of online popularity. Phys. Rev. Lett. 105: 158701.
[5] Mandavilli A (2011) Peer review: Trial by Twitter. Nature 469: 286-287.
[6] Huberman BA, Romero DM, Wu F (2009) Crowdsourcing, attention and productivity. J. Inform. Sci. 35: 758-765.
[7] Kwak H, Lee C, Park H, Moon S (2010) What is Twitter, a social network or a news media? Proceedings of the 19th international conference on World Wide Web, 591-600.
[8] King G (2011) Ensuring the data-rich future of the social sciences. Science 331: 719-721.
[9] Centola D (2010) The spread of behavior in an online social network experiment. Science 329: 1194-1197.
[10] Cilibrasi RL, Vitanyi PMB (2007) The Google Similarity Distance. IEEE T. Knowl. Data. En 3, 370-383.
[11] Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proc. Natl. Acad. Sci. U.S.A. 105: 1118-1123.
[12] Zipf GK (1949) Human behavior and the principle of least effort. Addison-Wesley, Cambridge.
[13] Steyvers M, Tenenbaum JB (2005) The large-scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognit. Sci. 29: 41-78.
[14] Cattuto C, Barrat A, Baldassarro A, Schehr G, Loreto V (2009) Collective dynamics of social annotation. Proc. Natl. Acad. Sci. U.S.A. 26: 10511-10515.
[15] Bernhardsson S, Correa da Rocha LE, Minnhagen P (2010) Size-dependent word frequencies and translational invariance of books. Physica A 389: 330-341.

FIG. 2: Network of correlations between international brands computed from the corresponding tweet rates on Twitter. A link in the network represents the similarity measure computed using Eq. (3). Only links with a strength larger than 0.004 are shown. The color of the nodes represents modules of more inter-connected brands. Darker link colors mean stronger links.


FIG. 3: Network of cities with high similarity. In panel A), we show a similarity network where nodes are positioned according to the Fruchterman-Reingold algorithm. In panel B), the corresponding network is shown where nodes are arranged according to the geographical location of the cities. In both panels only links with a strength larger than 0.004 are shown. In the network, darker link colors mean stronger links.

FIG. 4: Network of nouns with high similarity. Similarity network of 200 random nouns chosen from a list of the 2000 most common nouns. We only show the largest connected component for links with a strength larger than 0.04.

FIG. 5: Probability density function of tweet rates of pairs of international brands, major cities in the USA and common English nouns. The distributions include rates of individual search terms. The violet circles correspond to brand names, the blue triangles to cities and the green squares to nouns. Note that the rates of the cities have been multiplied by 20 to allow for a direct comparison. The distributions of the rates are scale invariant over more than six orders of magnitude and have the same exponent $\alpha = 1.\ldots \pm .02$ (s.d.). The dashed line corresponds to $\alpha = 1.\ldots$ Panels A) and B); axis labels: tweet rate γ (per hour), Jaccard similarity measure, probability density function; legend: Brands, Cities, Nouns.