Using Robust PCA to estimate regional characteristics of language use from geo-tagged Twitter messages
Dániel Kondor, István Csabai, László Dobos, János Szüle, Norbert Barankai, Tamás Hanyecz, Tamás Sebők, Zsófia Kallus, Gábor Vattay
Department of Physics of Complex Systems, Eötvös Loránd University
H-1117 Budapest, Pázmány Péter sétány 1/A, Hungary
Abstract
Principal component analysis (PCA) and related techniques have been successfully employed in natural language processing. Text mining applications in the age of online social media (OSM) face new challenges due to properties specific to these use cases (e.g. spelling issues specific to texts posted by users, the presence of spammers and bots, service announcements, etc.). In this paper, we employ a Robust PCA technique to separate typical outliers and highly localized topics from the low-dimensional structure present in language use in online social networks. Our focus is on identifying geospatial features among the messages posted by the users of the Twitter microblogging service. Using a dataset which consists of over 200 million geolocated tweets collected over the course of a year, we investigate whether the information present in word usage frequencies can be used to identify regional features of language use and topics of interest. Using the Principal Component Pursuit method, we are able to identify important low-dimensional features, which constitute smoothly varying functions of the geographic location.
Introduction

With the rapidly growing corpus of digitally available textual data, there has been significant interest in unsupervised language processing techniques [2]. Principal component analysis (PCA) and the closely related latent semantic analysis (LSA) [1] have been widely and successfully applied to various classes of documents to identify relations among them, e.g. to cluster documents according to topics or to index them [3–5]. Nowadays, LSA is employed by many search engines to improve results by clustering documents into topics of interest. While the basic recipe for LSA is well known, each data source might require special treatment according to its unique characteristics.

A relatively new and potentially valuable source of digital textual data is the corpus of messages posted by users to online social network (OSN) websites. Using these new data sources, previous research has shown that it is possible to gain valuable new insights into the structure and dynamics of society, and to address questions whose study was previously limited by the scarcity of available data [9–11, 13, 14, 16]. Also, the possibility to simultaneously analyze both the network topology of connections between users and the corpus of textual data posted by them opens up new possibilities in the fields of social, network and computer sciences [12, 15, 17, 18]. A valuable aspect of some online social networks is the inclusion of spatial data. For example, the users of the Twitter microblogging service can attach their geographic coordinates to their short messages (tweets), which makes it possible to analyze the geographical variation of language use on massive amounts of readily available data [10, 14].

In this paper we evaluate principal component analysis (PCA) techniques on a dataset which consists of geo-tagged Twitter messages (tweets), and investigate the feasibility of identifying regional characteristics of language use.
Our approach is to consider a medium-sized geographical region as one “document”, and apply PCA to the corpus of documents which we produce by assigning the text of each tweet to the geographic region where it was posted. In this way, the columns of our term-document matrix correspond to geographical regions and the rows to the words included in our corpus. This method ignores the possible variation present inside the regions (e.g. multiple distinct topics in one region), and the possibility to detect topics which are globally present. This is in accordance with our goals: we wish to identify the regional variations. Detection of topics without a clear regional focus might be possible with other natural language processing methods, and local variations can be detected by “zooming in” on regions with a high number of tweets to obtain a finer-grained picture of language use.

In many real-world PCA applications, the data matrix can be considered as the weighted sum of two parts, one of which is sparse and the other low-rank. The low-rank part can be thought of as some “background” component, while the sparse part can either contain the relevant data or be composed of nontrivial outliers. Depending on the nature of the problem studied, either one or both of them might be of interest. In these cases, it would be desirable to separate these two parts and analyze them independently. Also, if the sparse part has a large magnitude, the principal components found by PCA will be dominated by the outliers, and revealing the low-rank structure present in the data can prove difficult. In the case of our corpus of Twitter messages, we expect the data matrix to be indeed separable into these two parts. We expect the low-rank component to represent the true variation in language use (e.g. usage frequencies of words will possibly be different in different geographic regions), and the sparse part to contain highly localized topics of interest (e.g. landmarks).
To deal with these issues, we employ the Principal Component Pursuit technique developed by Candès et al., which effectively separates the original data matrix into its sparse and low-rank parts [6, 7].
Data

We used the publicly available sample datastream to download geotagged tweets. Our dataset includes a total of 1.3 billion tweets between August 2012 and March 2013. Among these, 725 million are geo-tagged, i.e. have GPS coordinates associated with them. In this paper, we limit our analysis to tweets from the contiguous United States of America, a total of approximately 245 million tweets. We constructed a geographically indexed database of these tweets, which allows the efficient analysis of regional features [19]. We use the HTM scheme for geographic indexing [21]. This employs a quad-tree, where the surface of the Earth is recursively divided into spherical triangles called HTM cells or trixels. The indexing works on multiple levels; a deeper (i.e. more detailed) level can be acquired by dividing each trixel into 4 smaller triangles by connecting the midpoints of its sides. The subdivision starts from 8 spherical triangles (level 0).

We compile the list of the most frequently used words from the tweets, and compute their spatial distribution. We divide the USA into level 6 HTM cells; we have a total of 558 cells with an approximate cell area of 15500 km² each. We construct the word occurrence matrix $W_{ij}$ as the number of occurrences of the $i$-th word in the $j$-th cell. As the population density of the USA is very heterogeneous, the number of word occurrences in each cell is also heterogeneous. To improve the quality of the dataset, we only include cells which contain at least 10000 occurrences of at least 1000 individual words. We also limit the words used to those with at least 10000 occurrences in at least 300 individual cells. This way there remain 491 cells (see Fig. 1) and 6032 words, which form the $W_{ij}$ word occurrence matrix. Before applying the PCA procedure, we normalize $W_{ij}$ so that the elements are the relative frequencies of words: $X_{ij} \equiv W_{ij} / \sum_k W_{kj}$, i.e. we normalize each element by the total number of words posted in that cell.
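As a concrete sketch of the filtering and normalization step above (function and variable names are ours, not the authors' pipeline; we read the cell cut as requiring both the total-occurrence and the distinct-word thresholds):

```python
import numpy as np

def build_frequency_matrix(counts, min_cell_occ=10000, min_cell_words=1000,
                           min_word_occ=10000, min_word_cells=300):
    """Filter a raw word-occurrence matrix and normalize to relative frequencies.

    counts[i, j] is the number of occurrences of word i in HTM cell j.
    Returns X with X_ij = W_ij / sum_k W_kj, plus the row and column masks.
    """
    counts = np.asarray(counts, dtype=float)
    # Keep cells with enough total occurrences of enough distinct words.
    cell_mask = ((counts.sum(axis=0) >= min_cell_occ) &
                 ((counts > 0).sum(axis=0) >= min_cell_words))
    counts = counts[:, cell_mask]
    # Keep words that are frequent and spread over enough cells.
    word_mask = ((counts.sum(axis=1) >= min_word_occ) &
                 ((counts > 0).sum(axis=1) >= min_word_cells))
    W = counts[word_mask]
    # Relative frequencies: normalize each column (cell) by the total
    # number of retained words posted in that cell, per the formula above.
    X = W / W.sum(axis=0, keepdims=True)
    return X, word_mask, cell_mask
```

With the default thresholds, this kind of cut reduces the 558 level-6 cells to the 491 cells and 6032 words reported above.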
Due to the structure of our data matrix and the employed Robust PCA method, we chose not to subtract averages from the data; of course, this will likely result in average word frequencies dominating the first principal components.

Figure 1: Number of words from the corpus in the trixels included in the analysis. Cells with fewer than 10000 occurrences of at least 1000 individual words were excluded from our analysis; these are left blank here too. Note that the color scale is logarithmic.

Furthermore, we compile a higher-resolution data matrix from the same dataset, focusing on New York City, with a total of 6.3 million tweets. In this case, we use level 13 HTM cells, to possibly identify small-scale differences in language use. We limit our analysis to trixels with at least 6500 occurrences of at least 1000 individual words, and we only include words above a minimum number of occurrences. The trixels and words remaining after these cuts (see Fig. 2) form the New York data matrix.

Figure 2: Number of words from the corpus in the trixels included for New York City. Cells with fewer than 6500 occurrences of at least 1000 individual words were excluded. Note that the color scale is logarithmic.

Methods

We compute the singular value decomposition (SVD) of the data matrix $X$:

$$X = U \Sigma V^T \qquad (1)$$

The diagonal matrix $\Sigma$ contains the singular values, while the columns of $U$ and $V$ contain the principal components. The components reveal the sources of variation in the data, ordered by the magnitude of their contribution.

To analyze the results, we can plot each component (i.e. the columns of the $V$ matrix) on the map of the USA: we color each trixel according to the corresponding value (see Fig. 3). We also inspect the values which can be associated with the individual words in each component (i.e. the columns of the $U$ matrix).

While the classic PCA method is useful for detecting structure in the data, in some cases it might not give optimal results. One such case is when there are outliers in the data; the main principal components are then dominated by them, and identifying a possible low-rank structure becomes more challenging. Filtering the outliers out beforehand can also prove difficult, as in many cases it is not straightforward to estimate the relevance of components in the raw data. In the case of Twitter messages, this would mean identifying the sources of outliers (e.g. spammers and advertisements, weather forecasts, local tourist attractions) and excluding the messages containing them. While this would improve the results of the PCA, it might also leave out some relevant data. A more favorable approach is a method which can be applied to the raw data matrix and automatically separates the outliers (i.e. the sparse part) from the low-dimensional structure. In this case, we can inspect the sparse part separately and decide which components are relevant. This can be achieved by the Principal Component Pursuit technique [6, 7], which separates the data matrix into these two parts:

$$X = X_S + X_{LR}, \qquad (2)$$

where $X_S$ is a sparse matrix and $X_{LR}$ contains the dense but low-rank part of the data. This is achieved by minimizing the sum

$$\lambda \|X_S\|_1 + \|X_{LR}\|_\sigma, \qquad (3)$$

where for a matrix $X$ of dimensions $n_1 \times n_2$ with $n_1 \geq n_2$, $\lambda = 1/\sqrt{n_1}$, and the norms are the $\ell_1$ and nuclear norms, respectively:

$$\|X\|_1 = \sum_{ij} |X_{ij}|, \qquad \|X\|_\sigma = \sum_i \sigma_i(X), \qquad (4)$$

where $\sigma_i(X)$ denotes the $i$-th singular value of $X$. To obtain the results, we use the Matlab code developed by Lin et al., implementing the inexact augmented Lagrange multiplier method [6, 8].

We carry out the decomposition for both data matrices and analyze the results in parallel. We expect the principal components of $X_S$ to contain information about highly localized trends (i.e. features specific to only one or a few cells), and the components of $X_{LR}$ to contain the possibly smooth variations in language use.

Results

In Fig. 3, we show the first five principal components obtained by both methods, overlaid on the map of the USA. As we deliberately chose not to subtract any mean values from the data, the first component does not seem to contain any relevant structure. As the mean values in the original data matrix are the same for each cell, the outlier cells displayed here are the ones where otherwise rare words are abundant. The second component shows a clear structure in both cases, with the southern USA separated from the rest (note that the sign is inverted in the case of the low-rank part). Examining the words giving the largest contributions (i.e. the columns of the $U$ matrix), we find that this separation is mainly due to swear words, abbreviations and spellings specific to online social networks (see Table 1). In the case of the classic PCA and the sparse part identified by PCA Pursuit, words found in weather reports are also significant. These contributions mainly come from the northern USA, where in some cells a large proportion of geolocated tweets come from meteorological services. On the other hand, in the case of the low-rank part identified by the PCA Pursuit, the southern USA is mainly counteracted by words relevant to the geographic features of the northern USA, e.g. “winter” or “snow”. This means that the separation is effective: the “outlier” cells are indeed well filtered out by the method.

In the case of the higher components, the difference between the two methods becomes more drastic. The results of the classic PCA are dominated by outliers, i.e. cells which are isolated centers where the content of the tweets differs significantly from the rest. Some of these can be considered “noise”: meteorological stations, service announcements and tweets of job advertisements; on the other hand, some of them show real localized features. E.g. in the case of the fifth component, the “outlier” cell is the location of the Grand Canyon, where a significant proportion of tweets indeed mention the canyon itself (see Figs. 3m and 3n). In the case of the PCA Pursuit, these results are present in the sparse part (which is very similar to the results of the classic PCA). The low-rank part contains slowly varying vectors, which mimic the geographic and social features of the USA. In the low-rank part, the third component separates the major cities from the countryside: words like “downtown”, “sushi”, “mall” or “pub” are opposed by words like “truck”.
In the case of the fourth component, areas with a high ratio of Spanish-speaking Twitter users are separated from the Northeastern USA.

Figure 3: Results of classic PCA and PCA Pursuit, first five components. Panels (a)–(o) show, for each of components 1–5: classic PCA, the PCA Pursuit sparse part, and the PCA Pursuit low-rank part. The full color scale is shown at the top. While the results of the classic PCA approach are mostly dominated by outlier cells, PCA Pursuit is able to separate these from the relevant low-dimensional geographic structure.
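The Principal Component Pursuit decomposition used above can be computed with the inexact augmented Lagrange multiplier method of Lin et al. [6]. Below is a minimal NumPy sketch of that algorithm (our own re-implementation for illustration; the paper used the reference Matlab code [8]):

```python
import numpy as np

def robust_pca(X, lam=None, tol=1e-7, max_iter=500):
    """Principal Component Pursuit via the inexact augmented Lagrange
    multiplier method: X = L + S with L low-rank and S sparse."""
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))
    norm_X = np.linalg.norm(X, 'fro')
    # Standard initialization of the dual variable and penalty parameter.
    spec = np.linalg.norm(X, 2)              # largest singular value
    Y = X / max(spec, np.abs(X).max() / lam)
    mu, mu_bar, rho = 1.25 / spec, 1.25 / spec * 1e7, 1.5
    S = np.zeros_like(X)
    for _ in range(max_iter):
        # Low-rank update: singular value thresholding at level 1/mu.
        U, sig, Vt = np.linalg.svd(X - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # Sparse update: entrywise soft thresholding at level lam/mu.
        T = X - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
        # Dual ascent on the constraint residual X - L - S.
        Z = X - L - S
        Y += mu * Z
        mu = min(mu * rho, mu_bar)
        if np.linalg.norm(Z, 'fro') <= tol * norm_X:
            break
    return L, S
```

On a synthetic matrix built as the sum of a low-rank part and a few large sparse outliers, this recovers both parts accurately, mirroring how the sparse outlier cells are separated from the smooth geographic structure above.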
Table 1 (word lists by component):

1st component — positive scores: backstage, ayo, barcelona, noe, elle, encore, toronto, hosted, att, diablo, nova, gabriel, luxury, vine, ballet, diana, hbd, platform, visual, featuring; negative scores: night, think, right, need, shit, going, today, people, she, back, day, one, time, know, good, don, love, not, lol, for

2nd component — positive scores: lol, she, ass, nigga, ain, shit, lmao, bout, tho, gone, damn, smh, dat, bitch, wit, ima, lil, wanna, niggas, gotta; negative scores: que, underground, current, direction, humidity, new, valley, temperature, mph, great, national, winter, november, grand, wind, park, until, weather, december, for

3rd component — positive scores: que, for, grand, los, por, beach, con, para, day, time, del, las, park, not, national, una, best, others, night, great; negative scores: gone, until, bout, today, december, she, ain, ass, nigga, lmao, rain, lol, direction, current, underground, weather, mph, temperature, humidity, wind

4th component — positive scores: que, los, por, con, para, del, las, una, como, pero, lol, esta, bien, todo, mas, hoy, cuando, ser, vida, hay; negative scores: great, solutions, going, lake, right, december, state, last, bed, one, good, people, night, until, beach, not, day, tonight, today, for

5th component — positive scores: for, not, don, people, love, que, know, someone, one, night, tonight, why, wish, school, gonna, going, ever, fucking, hate, guys; negative scores: trail, valley, december, bout, desert, village, ass, ain, lmao, nigga, view, arizona, others, posted, photo, point, lol, park, national, grand

(a) Sparse part. Words giving large contributions are usually only relevant to a few cells. A significant group of these words is the one relevant to meteorological services; these arise from low-density cells where the tweets of a local meteorological service form a significant proportion of all geolocated tweets. Other relevant contributions arise from tourist locations (e.g. “grand” for the Grand Canyon).

(b) Low-rank part. Here, four components represent large-scale language usage differences throughout the USA. See the text for more discussion.
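Word lists like those in Table 1 are read off from the columns of the $U$ matrix: for each component, the words are sorted by their scores and the extremes of both signs are kept. A minimal sketch (function name is ours):

```python
import numpy as np

def top_words(U, vocab, component, k=20):
    """Return the k most positive and k most negative words of one
    principal component (a column of U; rows are indexed like vocab)."""
    order = np.argsort(U[:, component])
    positive = [vocab[i] for i in order[::-1][:k]]
    negative = [vocab[i] for i in order[:k]]
    return positive, negative
```

Applied with $k = 20$ to the $U$ matrices of the sparse and low-rank parts, this produces lists of the kind shown in Table 1.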
Table 1: Significant words identified in the results of the Principal Component Pursuit method carried out for the whole USA (see Fig. 3).
Figure 4: Results of PCA Pursuit applied to New York City, first five components, low-rank part; panels (a)–(e) show components 1–5. The results roughly partition the city into its boroughs.
Table 2: Important words identified in the low-rank part of the PCA Pursuit for New York City.
Figure 5: The tweets of New York City, projected onto the space defined by the first five principal components obtained from applying PCA Pursuit to the whole USA (low-rank part); panels (a)–(e) show components 1–5.

New York City
We have seen that we were able to identify the low-rank structure of variations in language use on the scale of the USA. To test whether such variations are present on a much smaller scale, we applied the same method to the region containing most of New York City (Fig. 2). Results obtained for this corpus are depicted in Fig. 4 (principal components of the obtained low-rank matrix). These components separate the city into its various districts. In the case of the first two components, Manhattan is mainly separated from the rest of the city; in the case of the second component, the separation is again in part due to the use of swear words and abbreviations specific to online text. The third component is now mainly about the ratio of Spanish texts, with the Bronx and Queens giving significant contributions.

We also examine whether the space of principal components obtained for New York City is similar to the components obtained for the whole USA. We project the word occurrence matrix of New York City into the space of principal components obtained for the whole USA, $V^{(proj)} = U^{(USA)T} X^{(NY)}$, and then plot the obtained vectors on the map of New York (Fig. 5). We find that most components separate the city in a meaningful way, suggesting that some of the differences obtained at large scales are relevant on much smaller scales too.

Conclusions

In this paper, we investigated the feasibility of using principal component analysis to identify regional features of user content present in geotagged Twitter messages. Using the
PCA Pursuit technique, we were able to separate the low-rank and sparse parts present in the data, and identify some of the main features of both. Examining the spatial distribution of the principal components of the low-rank part, we found that there are indeed large-scale spatial variations present among Twitter users. We also investigated the scalability of this method and found that some of the principal components found for the whole USA are also relevant if we only consider the much smaller area of New York City.

The methods presented here can be the basis for studying language use on a large scale in the future. The ability to separate the word usage matrix into a low-rank and a sparse component opens up many possibilities. The principal components of the sparse part can be used to identify regional or highly localized topics of interest, and also to identify sources of Twitter content generated by bots and services, as these can be easily spotted in otherwise low-density regions. On the other hand, the low-rank part shows relevant regional variations in language use. Exploring possible connections between language use differences revealed by the methods employed here and other demographic or social features will probably yield new insights into the dynamics of society and language. Our next goal is to combine the PCA Pursuit method with sentiment analysis methods developed previously by Mitchell et al. [20].

On the other hand, the results of PCA might lead to more direct applications. Identifying the locations of users based on the content of their messages has been previously evaluated using probabilistic approaches [10]. Now, using the space of PCA components, we might be able to estimate the location of texts (e.g. a small set of tweets from a specific user, or a connected group of users) more efficiently. Also, comparing the results obtained here with the results obtained by text mining methods focusing on other (i.e.
non-geographic) features has the potential to extend those approaches to obtain spatial results too.

Also, our further goals include better integrating the PCA Pursuit method and the related data processing steps into the database, to obtain an integrated and easy-to-use environment for analyzing large corpora of possibly geo-tagged texts.
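As a rough illustration of the location-estimation idea mentioned above (our own sketch with hypothetical names, not a method evaluated in the paper): a small set of tweets can be reduced to a word-frequency vector $x$, projected into the component space as $v = U^T x$, and matched against the component coordinates of the known cells.

```python
import numpy as np

def guess_cell(U, cell_coords, x):
    """Guess which cell a word-frequency vector x came from.

    U: (n_words, n_components) word-space principal components.
    cell_coords: (n_components, n_cells) coordinates of the known cells,
        e.g. Sigma @ V.T from the decomposition of the training matrix.
    Returns the index of the cell with the highest cosine similarity.
    """
    v = U.T @ x  # coordinates of the text in component space
    sims = (cell_coords.T @ v) / (
        np.linalg.norm(cell_coords, axis=0) * np.linalg.norm(v) + 1e-12)
    return int(np.argmax(sims))
```

Whether a few principal components carry enough signal for this to beat the probabilistic approaches of [10] is exactly the open question raised above.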
Acknowledgments
The authors thank the partial support of the European Union and the European Social Fund through project FuturICT.hu (grant no.: TAMOP-4.2.2.C-11/1/KONV-2012-0013), the OTKA 7779 and the NAP 2005/KCKHA005 grants. The EITKIC_12-1-2012-0001 project was partially supported by the Hungarian Government, managed by the National Development Agency, and financed by the Research and Technology Innovation Fund and the MAKOG Foundation.
References

[1] Landauer, T., and Dumais, S. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211–240.
[2] Zsigmondi, Zs., and Kiss, A. (2013). Implementation of Natural Language Semantic Wildcards using Prolog. SPLST '13.
[3] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41, 391.
[4] Foltz, P. W., Kintsch, W., and Landauer, T. K. (1998). The measurement of textual coherence with latent semantic analysis. Discourse Processes, 25, 285–307.
[5] Bellegarda, J. R., Butzberger, J. W., Chow, Y.-L., Coccaro, N. B., and Naik, D. (1996). A novel word clustering algorithm based on latent semantic analysis. ICASSP-96.
[6] Lin, Z., Chen, M., and Ma, Y. (2010). The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv:1009.5055.
[7] Candès, E. J., Li, X., Ma, Y., and Wright, J. (2011). Robust Principal Component Analysis. JACM, 58(3), 11.
[8] We use the Matlab code made available by Chen et al.: http://perception.csl.illinois.edu/matrix-rank/sample_code.html.
[9] Backstrom, L., Sun, E., and Marlow, C. (2010). Find me if you can: improving geographical prediction with social and spatial proximity. WWW '10: Proceedings of the 19th International Conference on World Wide Web.
[10] Cheng, Z., Caverlee, J., and Lee, K. (2010). You are where you tweet: a content-based approach to geo-locating Twitter users. Proc. CIKM '10.
[11] Bakshy, E., Rosenn, I., Marlow, C., and Adamic, L. (2012). The role of social networks in information diffusion. WWW '12: Proceedings of the 21st International Conference on World Wide Web, 519–528.
[12] Bakshy, E., and Hofman, J. (2011). Everyone's an influencer: quantifying influence on Twitter. Proceedings of the Fourth ACM International Conference on Web Search and Data Mining.
[13] Leskovec, J., Backstrom, L., and Kleinberg, J. (2009). Meme-tracking and the dynamics of the news cycle. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, 497.
[14] Eisenstein, J., O'Connor, B., Smith, N., and Xing, E. (2012). Mapping the geographical diffusion of new words. arXiv:1210.5268.
[15] Rodriguez, M. G., Leskovec, J., and Schölkopf, B. (2012). Structure and Dynamics of Information Pathways in Online Media. arXiv:1212.1464.
[16] Foster, J., Çetinoglu, Ö., Wagner, J., Le Roux, J., Hogan, S., Nivre, J., Hogan, D., and Van Genabith, J. (2011). AAAI Workshop on Analyzing Microtext.
[17] Myers, S., Zhu, C., and Leskovec, J. (2012). Information diffusion and external influence in networks. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[18] Watts, D., and Dodds, P. (2007). Influentials, networks, and public opinion formation. Journal of Consumer Research.
[19] Submitted to CogInfoCom 2013.
[20] Mitchell, L., Frank, M. R., Harris, K. D., Dodds, P. S., and Danforth, C. M. (2013). The Geography of Happiness: Connecting Twitter Sentiment and Expression, Demographics, and Objective Characteristics of Place. PLoS ONE.
[21] Szalay, A. S., Gray, J., Fekete, G., Kunszt, P. Z., Kukol, P., and Thakar, A. (2007). Indexing the Sphere with the Hierarchical Triangular Mesh. arXiv:cs/0701164.