Visualizing and Quantifying Impact and Effect in Twitter Narrative using Geometric Data Analysis
Fionn Murtagh (1), Monica Pianosi (2), Richard Bull (2)
(1) School of Computer Science & Informatics
(2) Institute of Energy & Sustainable Development
De Montfort University, Leicester LE1 9BH, UK
[email protected], [email protected], [email protected]
August 14, 2018
Abstract
We use geometric multivariate data analysis, which has been termed a methodology for both the visualization and verbalization of data. The general objectives are data mining and knowledge discovery. In the first case study, we use the narrative surrounding very highly profiled tweets, and thus a Twitter event of significance and importance. In the second case study, we use eight carefully planned Twitter campaigns relating to environmental issues. The aim of these campaigns was to increase environmental awareness and influence behaviour. Unlike current marketing, political and other communication campaigns using Twitter, we develop an innovative approach to measuring behavioural change. We also show how we can assess statistical significance of social media behaviour.
Keywords:
Twitter, Correspondence Analysis, semantics, multivariate dataanalysis, text analysis, visualization
The general aim of our work is the “visualization and verbalization of data” [5]. Furthermore, the data here is the narrative of the flow of tweets (microblogs) in the online (Web 2.0) social medium, Twitter.
The general approach to analysis of Twitter conversations taken in [17] is based on hashtags (terms preceded with the character “#”), together with exchanges between tweeters (being the explicit use of each other’s “@” names), and a graph of such exchanges was used for community analysis. The latter, in the case of [17], was primarily aimed at the pro and contra viewpoints relative to climate change. Based on such polarization of views, and greater prevalence of tweeting in the unsupportive-to-supportive direction (relative to action to counteract climate change), it was nonetheless concluded that more work was required: “Content analysis of the tweets could be a possible qualitative approach that could shed light on [...] and provide new knowledge about the content of conversational connections discovered ...”. In our work in this article, we look at the conversational connections starting from what aims at being an initiating, instigational and influencing tweet.

In [8], Twitter-based behaviour (relating to the 2009 H1N1 swine flu) was subjected to content analysis that included analysis of retweets, seeking particular words and phrases, and manual labelling for content and sentiment characterization, followed by analysis of that. Such work was carried out by querying the Twitter data. The queries were sophisticated, with many boolean connectives (“and”, “or”, etc.). Rather than a querying, matching and supervised approach such as this, as used in general for sentiment analysis, our work in this article is data-driven and unsupervised. We map out the underlying semantics of our social media data through the text. The text used provides the “sensory surface” [13] of the underlying semantics.

Social media monitoring was originally adopted by public relations and advertising agencies, who used it as a means to identify negative comments posted on the web about their clients [1]. It is defined as the activity of observing and tracking content on the social web.
Each activity on social media has an outcome, or effect, which can be measured by observing and then quantifying specific behaviours. Effects can be one of the following: retweets, mentions, favourites, follows, likes, shares, comments, sentiment.

Social media are used by companies and public relations agencies, and by local and central governments, who all seek to evaluate the use of social media channels as a communication or engagement tool. In the evaluation of the online success of museums [9], and the Social Media Metric for US Federal Agencies [10], the emphasis is not on evaluating social media efforts for marketing purposes, but on providing organisations with tools to be able to understand if their efforts in engaging citizens have been successful and, crucially, what defines success. It is specifically the emphasis on engagement and collaboration with citizens that makes these approaches different from marketing strategies, which are more focused on connecting companies with their clients.

Of direct relevance to our second case study is our previous work, as follows. We [18] used tools that are freely available on-line to analyse social media traffic. The most basic form of effectiveness thus becomes creating social media conversation. This includes attracting more and new people, engaging them in different actions, and assessing how they participate in conversations both theme-wise and among themselves. Four main measurement approaches were used: (1) growth of community, (2) engagement (e.g. retweets), (3) content indicators, and (4) conversations (e.g. number of these). For each category, different metrics were defined and compared, using, as we have noted, publicly available software tools.

We [18] concluded: “...
although useful in understanding the effectiveness of a communication campaign in its numerical terms, the proposed methodology can only be the first step of a more in-depth investigation about what people can learn during their on-line participation, and what is the perceived impact of the process on them, behaviour- or citizenship-wise. Consequently a more in-depth analysis of the characteristic of the community and a content analysis of on-line conversations is necessary...”. In this present work we are primarily focused on the content analysis of on-line (Twitter) conversations. We seek to analyse the semantics of the discourse in a data-driven way. The following is concluded by [18]: “top-down communication campaigns both predominate and are advised by those involved in ‘social marketing’ ... . However, this rarely manifests itself through measurable behaviour change ...”. Thus our approach is, in its point of departure and vantage point, bottom-up, i.e. based on the observable data.

Mediated by the latent semantic mapping of the discourse, we develop semantic distance measures between deliberative actions and the aggregate social effect. We let the data speak (a Benzécri quotation, noted in [5]) in regard to influence, impact and reach.

Twitter data presents all sorts of problems for word-level, and certainly more so for linguistic, analysis. One example from the Stephen Fry tweets that we use is the following, part of a tweet: “Too twired to teet, too mailed out to e-shag.” (The first part of this play on words and language refers to “too tired to tweet”, and the second part has even more play on words relating to “e-mail” and the informal expression for being very tired, “to be shagged”.) Another example is a mention of the city of Manchester as “Madchester” with its “Reet pleased (note stunningly accurate Mancunian accent)”.
There are many further examples of informal expressions, “gr8ly” meaning “greatly”, and some use of languages other than English (an exchange in Dutch, culinary terms in French).

One result of our work is to show how semantic properties of words extracted from Twitter are usable in practical, application-oriented analysis that is lexically-based and that has potential for revealing the underlying or latent semantics. Since our analysis takes into account all pairwise relationships, specified through shared associations, there is incorporation of the context of words and their use. From the comprehensive set of relationships between tweets, between words, and between tweets and words, we have the basis for analysis of semantics.

The methodology used is based on a latent semantic, metric space embedding followed, if desired, by induction of a hierarchical clustering, also expressed as the inducing of an ultrametric or rooted tree topology.

Two Case Studies of Twitter Narratives in This Work
In our first case study, we take impactful tweets and study their role in the Twitter narrative. In the second case study, we take Twitter data from a carefully planned campaign to influence, through Twitter, environmentally-conscious attitudes as manifested in the Twitter medium.
We apply the approach used in [2] to take the text of a narrative and divide it into lexically homogeneous subsequences or parts, and, coupled with this, to detect natural breakpoints in the narrative flow. The advantage of the geometric data analysis approach used in [2] is that the structure of the narrative, and the semantic flow, are revealed in a bottom-up manner, based on the actual textual data.

Our geometric data analysis approach uses Correspondence Analysis in order to map out “the flow of thought and the flow of language” [6, 7] in a Euclidean metric, latent semantic factor space. From that factor space, a hierarchical topology is determined, and this hierarchy expresses the semantics at a continuum of resolutions or scales.

Unlike previous work that uses geometric data analysis on textually-expressed narrative, including [2, 15, 16], here in this work we are involved with social media data where the narrative is much more diffuse and less focused. Twitter, consisting of streams of text messages called tweets that are each a maximum of 140 characters in length, is very often a dialogue with other tweeters, with names preceded by the “at” sign, @, and frequently there is reference to topics that are made linkable through being preceded by the hash symbol, #.
When, in October 2009, the actor, presenter and celebrity Stephen Fry announced his retirement from Twitter to his nearly 1 million followers, it was a newsworthy event. It was reported [19] that “Fry’s disagreement with another tweeter began when the latter said ‘I admire and adore’ Fry, but that he found his tweets ‘a bit... boring... (sorry Stephen)’. The tweeter, who said that he had been blocked from viewing Fry’s Twitter feed, later apologised and acknowledged that Fry suffers from bipolar disorder.” Having caused major impact among his followers and wider afield, Fry actually returned to Twitter nearly immediately, having had an apology from the offending tweeter, @brumplum.

The two crucial tweets of Stephen Fry’s were as follows. (In discussion below, we refer to them as, respectively, the “I retire” tweet and the “aggression” tweet.)

6:09 a.m. on 31 October 2009: @brumplum You’ve convinced me. I’m obviously not good enough. I retire from Twitter henceforward. Bye everyone.
Think I may have to give up on Twitter. Too much aggression andunkindness around. Pity. Well, it’s been fun.
In order to look at those decisive tweets in context, we took a set of 302of Fry’s tweets, spanning the critical early morning of 31 October 2009. Thesewere from 22 October 2009 to 22 November 2009.
Words are collected from the 302 tweets. Initially we have 1787 unique words, defined as follows: containing more than one consecutive letter; with punctuation and special characters deleted (hence with modification of short URLs, hashtags or Twitter names preceded by an at sign, but, for our purposes, not detracting from our interest in semantic content); and with no lemmatization nor other processing, in order to leave us with all available emotionally-laden or emotionally-indicative function words. For our analysis we do require a certain amount of sharing of words by the tweets. Otherwise there will be isolated tweets (disconnected through having no shared terms). So we select words depending on two thresholds: a minimum global frequency, and a minimum number of tweets in which the word is used. Both thresholds were set to 5 (determined as a compromise between a good overlap between tweets in terms of word presence, yet not removing an overly large number of words). This led to 143 words retained for the 302 tweets. A repercussion is that some tweets became empty of words: 293 were non-empty, out of the 302 tweet set.

For high dimensional word usage spaces, it is normal for Correspondence Analysis to have a lack of concentration of inertia in the succession of factors (cf. Appendix B), that is to say, the latent semantic factors are of relatively similar importance. Hence, we developed an analysis methodology as outlined in the following sections.
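The double thresholding described above can be sketched as follows. This is a minimal illustration of the selection rule, not the authors' code; the function name and the toy tweets are ours.

```python
from collections import Counter

def select_words(tweets, min_total=5, min_tweets=5):
    """Keep words whose total frequency is at least min_total and which
    occur in at least min_tweets distinct tweets; then drop tweets that
    become empty of retained words."""
    total = Counter(w for tweet in tweets for w in tweet)        # global frequency
    per_tweet = Counter(w for tweet in tweets for w in set(tweet))  # tweet count
    vocab = {w for w in total
             if total[w] >= min_total and per_tweet[w] >= min_tweets}
    reduced = [[w for w in tweet if w in vocab] for tweet in tweets]
    nonempty = [tweet for tweet in reduced if tweet]             # drop empty tweets
    return vocab, nonempty

# toy example with thresholds of 2: "twitter" and "fun" pass both
# thresholds, "henceforward" and "bye" do not, and one tweet becomes empty
tweets = [["twitter", "fun"], ["twitter", "bye"], ["henceforward"],
          ["twitter", "fun"], ["fun"]]
vocab, kept = select_words(tweets, min_total=2, min_tweets=2)
```

With the study's thresholds of 5 applied to the 302 tweets, this kind of filtering is what reduces the 1787-word vocabulary to 143 words and leaves 293 non-empty tweets.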
First we pursued the following analysis approach. Taking the two crucial tweets noted in section 2.1, there were 33 words, as follows.

“to”, “and”, “it”, “on”, “you”, “me”, “not”, “have”, “up”, “too”, “from”, “good”, “well”, “think”, “ve”, “been”, “may”, “much”, “twitter”, “fun”, “brumplum”, “enough”, “everyone”, “give”, “obviously”, “aggression”, “around”, “bye”, “convinced”, “henceforward”, “pity”, “retire”, “unkindness”

Figure 1: Factors 1 and 2, the best two-dimensional or planar projection of the data cloud of 302 tweets, where 225 tweets were retained as non-empty. Simultaneously we have the planar projection of the 33-word cloud. The dots are at the locations of the tweets (identifiers are not shown, to avoid overcrowding). Just two tweets, the crucial two, have the “retire” and “aggression” labels (and not just a dot). Factor 1 accounts for 6.40% of the inertia.

(Note “ve”, from “have”, due to the removal of an apostrophe.) Then we seek out all other tweets that use at least one of these words. That resulted in 225 out of the total of 302 tweets being retained.

Figure 1 positions our two critical tweets in a best planar projection of the tweets and associated words. In Figure 1, the contribution to the inertia of factor 1 by the “aggression” tweet is the greatest among all tweets, and the contribution to the inertia of factor 2 by the “I retire” tweet is the greatest among all tweets. While useful for finding dominant themes (expressed by the words used in the tweets), and perhaps also for the trajectory of these themes, we can use the full dimensionality of the latent semantic representation of this Twitter data by clustering the tweets, based on their (Euclidean metric) factor projections. We use a chronologically (or sequence) constrained complete link agglomerative hierarchical clustering. See [11, 15, 2] for this hierarchical clustering approach.
Figure 2: Hierarchical clustering, using the complete link agglomerative criterion (good for compact clusters) on the full dimensionality, Euclidean factor coordinates. Just 33 words are used. The annotated tweet (with a relatively long branching path before it is agglomerated, to the left side of the text) is the one immediately following the two crucial ones that we are focused on, i.e. the “I retire” tweet and the “aggression” tweet.

Figure 3: A close-up from Figure 2. Our two critical tweets are the 166th and 167th ones here, the “I retire” and “aggression” tweets. (Cf. section 2.1.)

Figure 2 displays this hierarchical clustering of the Twitter narrative. Figure 3 is a close-up view of part of the dendrogram. Our crucial tweets are located at the end of a fairly compact clustering structure. This points to how our two crucial tweets can be considered as bringing a sub-narrative to a conclusion. Our interest is therefore raised in finding sub-narratives in the Twitter flow. These sub-narratives are sought here as chronologically successive tweets, i.e. a segment in the chronological flow of tweets.
To investigate our two critical tweets, including their immediate or other precursors, and the repercussions or subsequent evolution of the Twitter narrative, we will now determine sub-narratives in the overall Twitter narrative. This we do through segmentation of the flow of tweets. So a sub-narrative is defined as a segment of this flow of tweets. That is, the sub-narrative consists of groups (or clusters) of successive tweets that are semantically homogeneous. Semantic homogeneity is defined through a statistical significance approach.
We return now to the full, original word set. On the full set of tweets and the words used in these tweets, a threshold of 5 tweets was required for each word, and the total number of occurrences of each word needed to be at least 5. This lowered our word set, initially 1787, to 143. Then we removed stopwords, and partial words, in a list that we made: “the”, “to”, “and”, “of”, “in”, “it”, “is”, “for”, “that”, “on”, “at”, “be”, “this”, “what”, “an”, “if”, “ve”, “don”, “ly”, “th”, “tr”, “ll”. That led to 121 words retained. There remained 280 non-empty tweets (from the initial set of 302 tweets). Our two critical tweets (the “I retire” and the “aggression” ones) were among the retained tweet set.

Following Correspondence Analysis of the 280 tweets crossed by 121 words, an agglomerative hierarchical clustering was applied on the full-dimensionality factor space coordinates. The chronological sequence of tweets was hierarchically clustered. With the set of 280 tweets, crossed by the 121 word set, Figure 4 shows the chronological hierarchical clustering. Our two critical tweets are in their chronological sequence in the 280-tweet sequence (at the 211th and 212th tweet positions in this sequence).

A note follows now on why we did not use hashtag words (themes referred to), or at-sign prefaced words (other tweeters, by Twitter name). The hashtag was not used all that often, and the few hashtags used tended to appear together. The total number of at-sign names was 86. This was insufficient to base our entire analysis on Twitter names, even if hashtag themes were added.

Figure 4: Hierarchical clustering, using the complete link agglomerative criterion (providing compact clusters) on the full dimensionality, Euclidean factor coordinates. The tweets are characterized by presence of any of the 121 word set used.
The 280 tweets, in chronological sequence, are associated with the terminal nodes (arranged horizontally at the bottom of the dendrogram or hierarchical tree). We look for an understanding of semantic content, and the evolution of this, leading up to our two crucial tweets, and the further evolution of the tweet flow.

To exploit the visualization of the Twitter narrative that is expressed in Figure 4, we will summarize this visualization by determining a segmentation of the flow of tweets. That is equivalently expressed as determining a partition of tweets from the dendrogram. Furthermore, as described in the next subsection, we look for internal nodes of the dendrogram that are statistically significant (using the approach that will now be described).
In line with [2], we made these agglomerations subject to a permutation test, to authorize or not each agglomeration, i.e. to decide whether the agglomeration is deemed significant. In the description that now follows for determining significant segments of tweets, we follow [2] very closely. Statistical significance here means that the agglomerands validly form a single segment.

All the distances between pairs of objects of the two adjacent groups that are candidates for agglomeration are computed. These distances are divided into two groups: the 50% of the distances with the highest values are coded 1, and the 50% with the lowest values are coded 0. The count of high distances is denoted h. The count of high distances between permuted groups is also computed. The number of permutations producing a result equal to or over h, divided by the number of permutations performed, gives an estimate of the probability p of observing the data under the null hypothesis (the objects in the two groups are drawn from the same statistical population and, consequently, it is only an artefact of the agglomerative clustering algorithm that they temporarily form two groups). Probability p is compared with a pre-established significance level α. If p > α, the null hypothesis is accepted and the fusion of the two groups is carried out. If p ≤ α, the null hypothesis is rejected and fusion of the groups is prevented. Changing the value of α changes the resolution of the partition obtained, which is what results when the sequence of agglomerations is not allowed to go to its culmination point (of just one cluster containing all entities being clustered).

An α significance level of 0.15 was set (giving an intermediate number of segments: not too large, as if α were set to a greater value, nor a small number of segments, as if the significance level were more demanding, i.e. smaller in value).
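The permutation test just described can be sketched as follows. This is our illustrative reading of the procedure (the function name and the toy data are ours): distances over the union of the two groups are median-split into high/low, h counts the high distances linking the two groups, and the group labels are then repeatedly shuffled.

```python
import itertools
import random
import numpy as np

def merge_p_value(A, B, n_perm=5000, seed=0):
    """Permutation test for whether adjacent groups A and B (rows are
    factor-space coordinates) may be fused into one segment. Returns the
    estimated probability p; fusion is carried out if p > alpha."""
    pts = np.vstack([A, B])
    n = len(pts)
    pairs = list(itertools.combinations(range(n), 2))
    dist = np.array([np.linalg.norm(pts[i] - pts[j]) for i, j in pairs])
    high = dist >= np.median(dist)          # top 50% of distances coded 1

    def high_between(labels):
        # count of high distances linking the two (possibly permuted) groups
        return sum(1 for k, (i, j) in enumerate(pairs)
                   if high[k] and labels[i] != labels[j])

    labels = [0] * len(A) + [1] * len(B)
    h = high_between(labels)                # observed statistic
    rng = random.Random(seed)
    at_least_h = 0
    for _ in range(n_perm):
        rng.shuffle(labels)                 # preserves group sizes
        if high_between(labels) >= h:
            at_least_h += 1
    return at_least_h / n_perm

# two well-separated groups: p should be small, so fusion is prevented
g = np.random.default_rng(1)
A = g.normal(0.0, 0.1, size=(8, 2))
B = g.normal(5.0, 0.1, size=(8, 2))
p = merge_p_value(A, B, n_perm=500)
```

With α = 0.15 as in the text, a p-value above 0.15 would allow the two groups to fuse into one segment.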
Assessment of significance used 5000 permutations (found to be very stable relative to numbers of permutations from a few hundred upwards). The number of segments found was 40.

A factor space mapping of these 40 segments was determined, in their 121-word space. Four of these segments (the 6th, 18th, 36th, 39th) had just one tweet. Since they would therefore quite possibly perturb the Correspondence Analysis, in being exceptional in this way, we took these particular tweets as supplementary tweets. This means that the Correspondence Analysis factor space (i.e. the latent semantic space endowed with the Euclidean metric) was determined using the active set of 40 less these four tweets, and then the four supplementary tweets were projected into the factor space.

Figure 5: The centres of gravity of the 40 segment groups of the Twitter flow are projected in the principal factor plane (36 active clusters, 40 in all, in the factors 1, 2 plane; factor 1 accounts for 8.28% of the inertia). (See text for details related to 36 of these segments being used for this analysis, and then 4 being projected into the factor space as supplementary.)

The mapping is shown in Figure 5. It is noticeable that segment group 30, which contains our critical tweets towards its end, is very close to the origin, which is the average tweet here. The average tweet can be taken as the most innocuous. Therefore the factor plane of factors 1 and 2 is not useful for saying anything further about segment group 30, beyond the fact that it is fully unremarkable.

The contributions of segment group 30 to factors 1, 2, 3, 4, 5 are, respectively, 0.04, 1.71, 9.94, 1.62, 0.27. We will look at factors 2 and 3 because they are determined far more (than the other factors here) by segment group 30.

Figure 6 displays the words that are of greater contribution to the mean inertia of factors 2 and 3. Figure 7 displays the important tweet segments in the factors 2, 3 plane, i.e. the tweet segments with contributions to the inertias of these factors that are greater than the mean. The chronological trajectory linking these important tweet segments is also shown.
Figure 6: Important words, with greater than average contribution to the inertia of the cloud of all words in the plane of factors 2, 3 (factor 2 accounts for 7.33% of the inertia). The words displayed are: “you”, “frys”, “pdc”, “as”, “have”, “up”, “day”, “out”, “very”, “im”, “from”, “good”, “into”, “kind”, “new”, “or”, “off”, “who”, “morning”, “twitpic”, “am”, “best”, “london”, “which”, “how”, “la”, “may”, “some”, “cambridge”, “come”, “did”, “doesn”, “excellent”, “fellow”, “last”, “live”, “mark”, “night”, “party”, “then”.

Figure 7: The plane of factors 2, 3 with the important tweet segments. These tweet segments are important due to greater than mean contribution to the inertia of the cloud of tweet segments. The trajectories connecting the tweet segments in their chronological order are also shown.

Early tweet segments are positive on factor 3. Then there is a phase (with important tweet segments 21, 22, 25) that is fairly neutral on factor 3, but ranges, first negatively on factor 2, and then positively. Then comes a phase (through tweet segment 27) of strong factor 3 positivity. Recall that positive and negative orientations of factor axes are relative only and contain no judgemental character whatsoever. With tweet segment 28, there is a move, reinforced by tweet segment 30 (containing our crucial two tweets), back towards the other extremity of factor 3. Further tweet segments then play out their roles on the negative factor 3 half-axis.

In summary, we find the following description of the segment groups. Positive factor 3: segment group 27, appearance in LA (Los Angeles); segment group 28, relating to appearance on the morning news and entertainment television show “Good Day, LA”; segment group 20, recording of the British science fiction television series “Dr. Who”. Negative factor 3: segment group 34, Cambridge (England) and (London Street) Norwich; segment group 37, London (England). Segment group 32 concerns computer-related purchases and issues, and a London event; segment group 33 relates to the Royal Geographical Society and other events.

So our tweet segment of interest, segment group 30, is between tweets that are mainly dealing with LA and the London area. In segment group 30, there is the altercation with Twitter user @brumplum, and also a mention of having arrived in LA. We note therefore these geographic linkages in the Twitter vicinity of the crucial tweets relating to “aggression” and “I retire”.
Furthermore, we note the transition back to the London area, where events that Stephen Fry was involved in were based.
This completes our analysis of the Stephen Fry case study. We have described initially how we did not find anything remarkable in the narrative flow relating to our two crucial tweets. Then we pursued analysis of sub-narratives, determined by segments of the narrative flow. We found a number of special characteristics both of, and closely related to, the two crucial tweets.
Our next case study relates to the furthering and encouragement of environmental citizenship, i.e. engagement and responsibility in regard to environmental issues. The background to this work encompasses the following aspects: (1) the testing of social media with the aim of designing interventions; (2) application to environmental communication initiatives; and (3) measuring impact in terms of public engagement theory. The latter aspect is in the sense, due to the renowned social and political theorist Jürgen Habermas, of public engagement centred on communicative theory. By implication, therefore, this points to discourse as a possible route to social learning and environmental citizenship. For us, here, discourse is Twitter-based.

In [18] we deal with the practical challenge of how on-line activity can actually be measured. The Twitter campaigns set up and used in [18] – which we use in this work – are considered as discourse-based media and as such they have links with public engagement centred on Habermas’s communicative theory. In order to define and then measure terms like “influence”, “impact” and “reach”, we sought, in [18], to evaluate whether this is simply the number of friends, followers, re-tweets or “likes” in a social media (Twitter, Facebook) setting, and whether such social media actions could be considered, in appropriate circumstances, as an act of citizenship or public engagement.
For us, Impact will be the semantic distance between the initiating action and the net aggregate outcome. This can be statistically tested. It can be visualized. Facets and indeed components of such impact can be further visualized and evaluated.

Essential enabling aspects are: (1) the data structure input, comprising characterization of relevant actions and characterization of the initiating actions; for all relevant actions, and the initiating actions, we have their context mode (called “campaign” here), which allows both intra and inter analyses; (2) mapping of this characterization data (presence/absence, frequency of occurrence, mode category) into a semantic space that is both qualitatively (through visualization) and quantitatively analyzed. This semantic space is a Euclidean, factor space.

For visualization we use 2-dimensional projection, but for quantitative analysis we use the full factor space dimensionality, hence with no loss of information.
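With full-dimensionality factor coordinates to hand, the impact measure can be sketched as follows. This is an illustration of the distance computation only; the toy coordinates are ours, and in the actual analysis the inputs are Correspondence Analysis factor projections.

```python
import numpy as np

def impact(initiating, campaign):
    """Impact as semantic distance: the Euclidean distance, in the full
    factor space, between an initiating tweet's projection and the centre
    of gravity (net aggregate outcome) of the campaign's discourse tweets."""
    centroid = np.asarray(campaign, dtype=float).mean(axis=0)
    return float(np.linalg.norm(np.asarray(initiating, dtype=float) - centroid))

# toy factor coordinates: initiating tweet at the origin,
# two campaign tweets whose centre of gravity is (2, 0)
d = impact([0.0, 0.0], [[1.0, 0.0], [3.0, 0.0]])
```

Because the distance is taken over all factors rather than a two-dimensional projection, no information is lost, and the resulting distances can feed directly into statistical testing.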
The eight campaigns in late 2012 were as follows, with the dates during which each campaign was carried out, and the theme of the campaign.

1. 1 October to 7 October: Climate change: the big picture and the global consequences.
2. 8 October to 14 October: Climate change: the local consequences.
3. 15 October to 22 October: Light and electricity.
4. 23 October to 28 October: Heating systems.
5. 29 October to 4 November: Sustainable Food choices.
6. 5 November to 11 November: Sustainable Travel choices.
7. 12 November to 18 November: Sustainable Water use.
8. 19 November to 25 November: Sustainable Waste.

Seq. no.  Tweet      Init. – yes/no   Campaign 1, 2, ..., 8
1         Tweet 1    1                1
2         Tweet 2    0                1
...       ...        ...              ...
985       Tweet 985  0                8

Table 1: Transformed Twitter data used. Column 1 is the sequence number of the tweet. Column 2 is the tweet. Column 3 has the value 1 if the tweet was an initiating one for a new campaign, and otherwise 0. Column 4 has a value from 1 to 8, indicating the campaign.

Table 1 depicts the initial data set derived from the Twitter data spanning the eight campaigns. There are 985 tweets here. Campaigns were as follows in the succession of tweets: 1 to 63; 64 to 133; 134 to 301; 302 to 409; 410 to 555; 556 to 730; 731 to 843; and 844 to 985. The initiating tweets for the eight campaigns are: 3, 65, 134, 303 and 304 (which were combined – the two taken together as one), 410, 557, 736 and 846. These initiating tweets are listed in full in Appendix A.

In the first stage of the processing, a set of 3056 terms was derived from all tweets. These terms were essentially the full word set obtained from all tweets. See below, in the following subsection, for an exact specification. Each tweet was cross-tabulated with those terms that were present in it. (Storage-wise, each tweet had 1 = presence, 0 = absence values for each of the 3056 terms. In some cases there were frequencies of 2 or 3.) In a second stage of the processing, the term set was reduced to 339 sufficiently often used terms.
Some tweets thereby became empty, so the number of usable, non-empty tweets dropped from 985 to 968 non-initiating tweets plus the 8 initiating tweets. (We have already noted that seven of the eight campaigns had one initiating tweet. Campaign 4 had two successive initiating tweets. We joined these two tweets together into a single initiating tweet for campaign 4.)

For the Correspondence Analysis, the latent semantic mapping method used, the input data set is depicted in Table 2. For the analysis, we distinguish between principal rows (tweets that are not initiating ones) and supplementary rows (tweets that are initiating ones); and principal columns (terms used by the tweets) and supplementary columns (categorization in regard to the campaign). See Table 2. The analysis that embeds rows and columns in a factor space is carried out on the principal rows and columns, i.e. the regular discourse (non-initiating) tweets crossed by the terms that characterize them. Into that factor space, the supplementary rows and columns are projected, i.e. respectively the initiating tweets, and the campaign categories.

              Terms                  Cats.
Tweets        Tweets × Terms         Tweets × Categories
Init. Tweets  Initiators × Terms     Initiators × Categories

Table 2: Upper left, Tweets × Terms: very sparse, most values 0, indicating absence of the term in the tweet. Some values 1 (and a few 2 or even 3), indicating presence of the term in the tweet. Upper right, Tweets × Categories: 1 in the relevant campaign column associated with the tweet, otherwise 0. Lower left, Initiators × Terms: as for Tweets × Terms. Lower right, Initiators × Categories, i.e. Campaigns: each row has one campaign = 1 and otherwise 0.

The data to be analyzed then was as follows.

• Principal rows: the set of 968 retained tweets, not including the initiating tweets.
• Supplementary rows: the set of 8 initiating tweets.
• Principal columns: the set of 339 terms retained.
• Supplementary columns: the set of 8 “indicators” for the 8 campaigns.
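The distinction between principal and supplementary rows can be sketched in code as follows. This is a minimal illustration of standard Correspondence Analysis with supplementary row projection (our own implementation, not the authors' code): the factors are computed from the principal rows only, and each supplementary row profile is then projected via the transition formula onto the column standard coordinates.

```python
import numpy as np

def ca_with_supplementary_rows(N, N_sup):
    """Correspondence Analysis of contingency table N (principal rows and
    columns); supplementary rows N_sup are projected into the resulting
    factor space afterwards, so they do not influence the factors."""
    P = N / N.sum()                                   # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)               # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sv > 1e-10                                 # drop the trivial null direction
    F = ((U * sv) / np.sqrt(r)[:, None])[:, keep]     # principal row coordinates
    col_std = (Vt.T / np.sqrt(c)[:, None])[:, keep]   # column standard coordinates
    profiles = N_sup / N_sup.sum(axis=1, keepdims=True)
    F_sup = profiles @ col_std                        # transition formula
    return F, F_sup

# small contingency table; projecting a principal row as if it were
# supplementary reproduces its principal coordinates
N = np.array([[4., 1., 0.], [2., 3., 1.], [0., 2., 5.], [1., 1., 3.]])
F, F_sup = ca_with_supplementary_rows(N, N[:1])
```

In the study's setting, N would be the 968 × 339 tweets-by-terms table and N_sup the 8 initiating tweets; the campaign category columns would be projected analogously as supplementary columns.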
In this and the next subsection, we explain how we select the term set used to characterize each tweet in the overall Twitter discourse. Only alphabetic characters are retained; thus "@" and other non-alphabetic characters are removed from terms.

Tweets × Terms Matrix
The tweets × terms cross-tabulation is set up, with frequency of occurrence values. The greatest frequency of occurrence value is 3; typically the frequency of occurrence is 1. The cross-tabulation matrix is very sparse, with most values equal to 0.

In order to facilitate, and even to make possible, the comparison of all tweets in the Twitter discourse, we require each term's total number of presences over all tweets to be at least 5, and also that the term be present in at least 5 tweets. Exceptionally rare terms would hinder our analysis. Our thresholds of 5 were such that rarely used terms were pinpointed and removed, but not at the cost of removing too many terms.

The 968 retained (non-initiating) tweets, and the 8 initiating tweets, are crossed by 339 terms.

Factors, in decreasing order of importance, provide latent semantic components. Analysis is carried out on the principal rows and columns. Then the supplementary rows and columns are projected into the analysis. The principal rows are the discourse, non-initiating tweets. The principal columns are the set of terms used in this discourse. The supplementary rows are the initiating tweets. The supplementary columns are the campaign indicators.

Each term is at the centre of gravity of "its" tweets. Each tweet is at the centre of gravity of "its" terms. The factor space is a semantic space in that it takes account of all interrelationships: between all tweets, between all terms, and between all tweets and all terms.

Typically we visualize this semantic, factor representation of the data by taking two factors at a time.
Planar projections lend themselves to such display. In the analysis discussion to follow, we tidy up these displays in order to highlight useful and/or important outcomes.
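The term filtering described above (thresholds of 5 in the study) can be sketched as follows on a toy presence matrix; a threshold of 2 is used here purely so the small example has an effect.

```python
import numpy as np

# Toy 0/1 presence matrix: 6 tweets x 4 terms. The study keeps a term only if
# its total occurrences are >= 5 AND it appears in >= 5 tweets; we use 2 here.
X = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 0, 0],
              [1, 1, 0, 0]])
min_count = 2
keep = (X.sum(axis=0) >= min_count) & ((X > 0).sum(axis=0) >= min_count)
X_filtered = X[:, keep]

# Tweets left with no retained terms become empty and are dropped,
# as happened to some tweets in the study.
nonempty = X_filtered.sum(axis=1) > 0
X_filtered = X_filtered[nonempty]
```

With mostly 0/1 data the two thresholds nearly coincide; they differ only for terms repeated within a tweet.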
Our first analysis shows the principal factor plane of the 8 tweets that initiated the campaigns, where we projected the supplementary rows (cf. Table 2) to obtain their semantic locations; and the net aggregate campaigns, given by the centres of gravity of the 8 campaigns, where we projected the supplementary columns (cf. Table 2) to obtain their semantic locations. The actual definition of the factors was from the principal rows (all tweets save the initiating ones) and the principal columns (the word set used in the Twitter discourse).

Even if the principal factor plane accounts for relatively little of the information in our data, it nonetheless is the mathematically best planar representation, hence summary, of our data. In this factor 1, factor 2 plane, Figure 8 shows the instigating tweets ("tic1", etc.) and the net overall effects ("C1", etc.).

We see that campaigns 3, 5 and 8 have initiating tweets that are fairly close to the net overall campaign in these cases. By looking at all tweets and all terms, it is seen that the campaign initiating tweets, and the overall campaign means, are close to the origin, i.e. the global average. That just means that they, respectively initiating tweets and means, are relatively unexceptional, and express aggregates. The very low rates of inertia explained by the factors are fairly standard for such analysis of very sparse cross-tabulations, although this does point to the fact that we are seeing in Figure 8 just a projection of our data.

Therefore, while the tweets initiating campaigns 3, 5 and 8 are the closest to their respective campaign means, this is based on the best fitting planar, two-dimensional representation. It is based on the best factor plane, defined by factors 1 and 2. But the entire semantic space is of dimensionality 338. (This is explained as follows. The principal row set is 968 tweets. The principal column set is 339 terms. The dimensionality of the factor space is at most, and here equal to, min(339 − 1, 968 − 1) = 338.)
[Figure 8 plot: horizontal axis "Dim 1 (0.95%)"; points C1 to C8 and tic1 to tic8.]

Figure 8: The campaign initiating tweets are labelled "tic1" to "tic8". The centres of gravity of the campaigns, i.e. the net aggregates of the campaigns, are labelled "C1" to "C8". In each case, the tweet initiating the campaign is linked with an arrow to the net aggregate of the campaign. The percentage inertia explained by the factors, "Dim 1" being factor 1, and "Dim 2" being factor 2, is noted.

[Figure 9 plot: "Distances between initiating tweet and campaign mean", for the two dimensionalities.]

Figure 9: For the 8 campaigns, shown are the Euclidean distances between the campaign initiating tweets and the respective centres of gravity of the campaigns, or net overall campaigns. The lower curve is for the principal factor plane, hence the Euclidean distances between "tic1" and "C1", etc., as shown in Figure 8. The upper curve is for the full semantic, factor space dimensionality.

Statistical Significance of Impact
We are still considering Figure 9.

Consider the campaign 7 case, with the distance between the tweet initiating campaign 7 and the mean campaign 7 outcome, in the full, 338-dimensional factor (semantic) space, equal to 3.670904.

Compare that to all pairwise distances of non-initiating tweets. (They are quite normally/Gaussian distributed, with a small number of large distances.) Relative to the mean and standard deviation of these pairwise distances, the campaign 7 distance has standard score z = −2.16. Thus the campaign 7 impact is significant at the 1.5% level (i.e. z = −2.16, in the two-sided case, has 98.5% of the Gaussian greater than it in value).

In the case of campaigns 1, 4, 5 and 6, we find them less than 90% of all pairwise distances. In the case of campaigns 3 and 8, we find them less than 80% of all pairwise distances. That leaves only campaign 2 as the least good fit, relative to initiating tweet and outcome.
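This assessment can be sketched numerically as follows. The coordinates below are randomly generated stand-ins for full factor-space positions, purely to illustrate the procedure of scoring one initiator-to-centroid distance against the distribution of all pairwise inter-tweet distances.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(42)
pts = rng.normal(size=(60, 10))              # stand-in full-space coordinates
centroid = pts.mean(axis=0)                  # campaign mean (a "C7" stand-in)
d_init = np.linalg.norm(pts[0] - centroid)   # initiating tweet -> campaign mean

# Reference distribution: all pairwise distances between the other tweets.
pair_d = np.array([np.linalg.norm(pts[i] - pts[j])
                   for i, j in combinations(range(1, len(pts)), 2)])

z = (d_init - pair_d.mean()) / pair_d.std()  # standard score of the distance
frac_below = (pair_d < d_init).mean()        # fraction of pairs it exceeds
```

A strongly negative `z` (as for campaign 7 in the study) means the initiating tweet sits unusually close to its campaign's semantic mean, relative to typical inter-tweet separations.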
Having found campaign 7 to be the best, in the full semantic dimensionality context, and hence with no loss whatsoever of information contained in our original data, from the point of view of proximity of cause and intended effect, we now look in somewhat more detail at this campaign.

Campaign 7 relates to Sustainable Water use, cf. Appendix A. Including the initiating tweet, there are 112 tweets (that have not become empty of terms in our term-filtering preprocessing) in campaign 7, and there are 176 terms that appear at least once in this set of tweets. We now use Correspondence Analysis on just this campaign 7 data.

We show the factors 1, 2 plane with the tweets, noting where the initiating tweet is located in this projection, see Figure 10; and then we show the most important terms, see Figure 11. In the latter, note the locations of the tweeter names @TheActualMattyC, @TheEAUC and @BeverleyLad.

The story narrated by the principal plane view of campaign 7 is very largely a three-way interplay of the tweeter personalities @TheActualMattyC, @TheEAUC and @BeverleyLad. Note how they are reduced in our preprocessing (cf. Figure 11) to, respectively, "theactualmattyc", "theeauc" and "beverleylad". Respectively, these are associated with: positive F1, positive F2; negative F1, positive F2; and relatively neutral F1, negative F2 (where F1 and F2 are the factor 1 and factor 2 coordinates). Regarding the last of these tweeter individuals, the term "love" appears in a tweet indicating "we'd love a cycling Leicester", and the word "thanks" appears quite a few times. Our semantic analysis has provided the words shown in Figure 11 as the most semantically loaded, in the factor 1, factor 2 planar projection.

[Figure 10 plot: "Campaign 7: Factors 1, 2, with 10 most contributing tweets". Initiating tweet: black "o"; otherwise sequence numbers of tweets in this campaign. Axes: Dim 1 (2.54%), Dim 2 (2.49%).]

Figure 10: Principal factor plane for campaign 7. Just the tweet set for this campaign is used, including the initiating tweet. Terms are used that appear at least once in this set of tweets. The input data used is 112 tweets crossed by 176 terms. The 10 most contributing tweets are labelled here, and the initiating tweet is also displayed.

[Figure 11 plot: "Campaign 7: Factors 1, 2, with 15 highest coordinate terms", showing the terms advice, beverleylad, environmental, forget, forward, green, isn, looking, love, park, stuff, thank, theactualmattyc, theeauc, travel. Axes: Dim 1 (2.54%), Dim 2 (2.49%).]

Figure 11: The same data is used as in Figure 10. The 15 terms with highest coordinate values are labelled here.

In summary, Figures 10 and 11 are a particular illustration of what campaign 7 entails. These two figures relate to one and the same analysis, and are presented as two figures in order not to have too much overcrowding of projections. The information content in this planar projection is just over 5% (i.e. 2.54% + 2.49%) of the total information of the campaign 7 Twitter data. Information is quantified by the inertia explained by these factors. While this is the most important planar projection, just one twentieth of the data's information is quite weak. Nonetheless, Figures 10 and 11 do provide us with a visualization of a particular narrative underlying campaign 7.

It may be noted that in our earlier work relating to the impact of a causal communicative action (the initiating tweet) relative to the evolution of the discourse (the tweets), we used the full information space, i.e. the full dimensionality of the semantic, factor space, in order to draw conclusions.
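The labelling of the "most contributing" tweets, as in Figure 10, rests on each row's contribution f_i F_s(i)² / λ_s to a factor's inertia (cf. Appendix B). A sketch with made-up masses and coordinates, to show only the ranking mechanics:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20
f = rng.random(n); f /= f.sum()          # toy row masses f_i
F = rng.normal(size=(n, 2))              # toy principal coordinates, factors 1, 2

lam = (f[:, None] * F**2).sum(axis=0)    # lambda_s implied by these coordinates
contrib = f[:, None] * F**2 / lam        # contributions; each column sums to 1

top10_factor1 = np.argsort(contrib[:, 0])[::-1][:10]   # rows to label in a plot
```

High-contribution rows are the ones that determine the factor's orientation, which is why they are the most useful to label in the planar display.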
In the Stephen Fry Twitter case, we saw how we could visualize the critical tweets as a culmination of some relatively homogeneous preceding tweets, with the following two tweets being semantically very different. Hence these following tweets manifest the shock effect. This visualization was in Figures 2 and 3.

Through segmentation of the overall Twitter narrative, we developed sub-narratives. These can be determined in such a way as to be statistically significant. We discussed the overall flow of the narrative in a way that was between the extremes (of course, to be checked for in the context of the given data) of being innocuous versus being exceptional. We took being innocuous as having a close-to-average semantic profile, in terms of projection in the factorial, latent semantic space.

In the environmental citizenship experimental case study, we developed a new approach to assessing impact, based on the process of discourse. A causal element is used, and this is compared to the overall aggregate of a selected part (since that will be meaningful) of the course of the discourse.

We studied this comparatively, using 8 different "campaigns". We traced out the semantic path from initiating tweet to the mean tweet of the associated campaign. We noted the differences between campaigns. We did this using the most salient, i.e. the most important, two-dimensional latent semantic (factor) subspace, in order to illustrate our approach; and also in the full dimensionality space, using all information and avoiding any approximation.

We noted differences: e.g. campaign 3 overall was closest to its initiating tweet in the two-factor projection; but with all information in use, campaign 7 was the most effective campaign of all, in the sense of the initiating tweet being closest to the overall semantic mean of that campaign.

We have also developed a statistical test of significance of impact. Planar projections in our semantic, factor space allow visualization of outcomes.
We looked in detail at campaign 7, pointing to the most influential tweets, and the most revealing terms associated with the underlying (latent semantic) components. In some cases, this indicated who (the tweeter, "@") or what themes (hashtag, "#") were involved.
References

[1] Barker M, Barker DI, Bormann NF, Neher KE: Social Media Marketing: A Strategic Approach. Andover, UK: Cengage Learning; 2012.

[2] Bécue-Bertaut M, Kostov B, Morin A, Naro G: Rhetorical strategy in forensic speeches: multidimensional statistics-based methodology. Journal of Classification, 2014, 31: 85–106.

[3] Benzécri J-P: L'Analyse des Données, Tome I Taxinomie, Tome II Correspondances, 2nd ed. Paris: Dunod; 1979.

[4] Benzécri J-P: Correspondence Analysis Handbook. Basel: Dekker; 1994.

[5] Blasius J, Greenacre M (Eds): Visualization and Verbalization of Data. Boca Raton, FL: Chapman & Hall/CRC Press; 2014.

[6] Chafe WL: The flow of thought and the flow of language. In Givón T (Ed), Syntax and Semantics: Discourse and Syntax (Vol. 12). New York: Academic Press, 1979, pp. 159–181.

[7] Chafe W: Discourse, Consciousness, and Time: The Flow and Displacement of Conscious Experience in Speaking and Writing. Chicago: University of Chicago Press; 1994.

[8] Chew C, Eysenbach G: Pandemics in the age of Twitter: content analysis of tweets during the 2009 H1N1 outbreak. PLoS ONE.

[11] Legendre P, Legendre L: Numerical Ecology, 3rd ed. Amsterdam: Elsevier; 2012.

[12] Le Roux B, Rouanet H: Geometric Data Analysis: From Correspondence Analysis to Structured Data Analysis. Dordrecht: Kluwer Academic; 2004.

[13] McKee R: Story: Substance, Structure, Style, and the Principles of Screenwriting. York: Methuen; 1999.

[14] Murtagh F: Correspondence Analysis and Data Coding with R and Java. Boca Raton, FL: Chapman & Hall/CRC; 2005.

[15] Murtagh F, Ganz A, McKie S: The structure of narrative: the case of film scripts. Pattern Recognition, 2009, 42: 302–312.

[16] Murtagh F, Ganz A, Reddington J: New methods of analysis of narrative and semantics in support of interactivity. Entertainment Computing, 2011, 2: 115–121.

[17] Pearce W, Holmberg K, Hellsten I, Nerlich B: Climate change on Twitter: topics, communities and conversations about the 2013 IPCC Working Group 1 report. PLoS ONE, 2014, 9(4), e94785, 11 pp.

[18] Pianosi M, Bull R, Rieser M: Impact, influence and reach: lessons in measuring the impact of social media. 36 pp., preprint, 2013.

[19] Quinn B: Stephen Fry fans beg actor not to give up on Twitter. The Observer.

[20] Séguéla J, Saporta G: Presentation at CARME, Correspondence Analysis and Related Methods Conference, Rennes, France, 2011. http://carme2011.agrocampus-ouest.fr/slides/Seguela Saporta.pdf

Appendix A: Our 8 Campaign Initiating Tweets
The following are these tweets, in full. For campaign 4, the two initiating tweets were merged together. DMU stands for De Montfort University.

Campaign 1: Introducing
Appendix B: Correspondence Analysis
Correspondence Analysis provides access to the semantics of the information expressed by the data. It does this by viewing each observation (here, a tweet) or row vector as the average of all attributes (here, terms) that are related to it; and by viewing each attribute or column vector as the average of all observations that are related to it.

This semantic mapping analysis is as follows:

1. The starting point is a matrix that cross-tabulates the dependencies, e.g. frequencies of joint occurrence, of an observations-crossed-by-attributes matrix.
2. By endowing the cross-tabulation matrix with the χ² metric on both the observation set (rows) and the attribute set (columns), we can map observations and attributes into the same space, endowed with the Euclidean metric.
3. Interpretation is through (i) projections of observations and attributes onto factors; (ii) contributions by observations and attributes to the inertia of the factors; and (iii) correlations of observations and attributes with the factors. The factors are ordered by decreasing importance.

Correspondence Analysis is not unlike Principal Components Analysis in its underlying geometrical bases. While Principal Components Analysis is particularly suitable for quantitative data, Correspondence Analysis is appropriate for the following types of input data: frequencies, contingency tables, probabilities, categorical data, and mixed qualitative/categorical data. The factors are defined by a new orthogonal coordinate system endowed with the Euclidean distance. The factors are determined from the eigenvectors of a positive semi-definite matrix (hence with non-negative eigenvalues). This matrix, which is diagonalized (i.e. subjected to singular value decomposition), encapsulates the requirement for the new coordinates to successively best fit the given data.

The "standardizing" inherent in Correspondence Analysis (a consequence of the χ² distance) treats rows and columns in a symmetric manner.
One byproduct is that the row and column projections in the new space may both be plotted on the same output graphic presentations (the principal factor plane given by the factor 1, factor 2 coordinates; and other pairs of factors).

From Frequencies of Occurrence to Clouds of Profiles, each Profile with an Associated Mass
From the initial frequencies data matrix, a set of probability data, f_{ij}, is defined by dividing each value by the grand total of all elements in the matrix. In Correspondence Analysis, each row (or column) point is considered to have an associated weight. The weight of the i-th row point is given by f_i = \sum_j f_{ij}, and the weight of the j-th column point is given by f_j = \sum_i f_{ij}. We consider the row points to have coordinates f_{ij}/f_i, thus allowing points of the same profile to be identical (i.e. superimposed). The quantity f_{ij}/f_i, because it is what we analyze, is viewed as the conditional (empirical) probability of column j given row i; and symmetrically for f_{ij}/f_j, the conditional (empirical) probability of row i given column j.

The following weighted Euclidean distance, the χ² distance, is then used between row points:

d^2(i, k) = \sum_j (1/f_j) ( f_{ij}/f_i − f_{kj}/f_k )^2     (1)

and an analogous distance is used between column points.

The mean row point is given by the weighted average of all row points:

\sum_i f_i ( f_{ij}/f_i ) = f_j     (2)

for j = 1, 2, ..., m. Similarly the mean column profile has i-th coordinate f_i.

Input: Cloud of Points Endowed with the Chi-Squared Metric
The cloud of points consists of couples: a (multidimensional) profile coordinate and a (scalar) mass. The cloud of row points, N_I, is the set of all 1 ≤ i ≤ n couples ( { f_{ij}/f_i | j = 1, 2, ..., p }, f_i ). The cloud of column points, N_J, is the set of all 1 ≤ j ≤ p couples ( { f_{ij}/f_j | i = 1, 2, ..., n }, f_j ). The vectors are real-valued, so { f_{ij}/f_i | j = 1, 2, ..., p } ∈ R^p and { f_{ij}/f_j | i = 1, 2, ..., n } ∈ R^n.

The overall inertia about the origin of cloud N_I is:

M^2(N_I) = \sum_i f_i \sum_j (1/f_j) ( f_{ij}/f_i − f_j )^2 = \sum_{i,j} ( f_{ij} − f_i f_j )^2 / ( f_i f_j )

Note how this uses the χ² distance, defined above, and how the inertia is formally similar to the χ² statistic of independence of observed f_{ij} values relative to the model, f_i f_j, that is the product of the marginal probabilities.

Similarly we have the overall inertia about the origin of cloud N_J:

M^2(N_J) = \sum_j f_j \sum_i (1/f_i) ( f_{ij}/f_j − f_i )^2 = \sum_{i,j} ( f_{ij} − f_i f_j )^2 / ( f_i f_j )

We have that the inertia of the row cloud, N_I, is identical to the inertia of the column cloud, N_J. Decomposing the moment of inertia of the cloud N_I, or of N_J since both analyses are inherently and integrally related, furnishes the principal axes of inertia, defined from a singular value decomposition.

Output: Cloud of Points Endowed with the Euclidean Metric in Factor Space
The χ² distance between rows i and k, d(i, k), has been defined in equation (1). In the factor space this pairwise distance is identical, i.e. it is invariant; only the coordinate system and the metric change. For factors indexed by s, and for total dimensionality S, we have S ≤ min{ n − 1, p − 1 } (there are n rows and p columns; the subtraction of 1 is because the factor space is centred, and hence there is a linear dependency which reduces the inherent dimensionality by 1). With the projection of row i on the s-th factor, F_s, given by F_s(i), we have:

d^2(i, k) = \sum_{s=1}^{S} ( F_s(i) − F_s(k) )^2     (3)

In Correspondence Analysis the factors are ordered by decreasing moments of inertia. The factors are closely related, mathematically, in the decomposition of the overall cloud inertias, M^2(N_I) and M^2(N_J). The eigenvalues associated with the factors, identically in the space of rows or observations indexed by i = 1, 2, ..., n, and in the space of attributes indexed by j = 1, 2, ..., p, are given by the eigenvalues associated with the decomposition of the inertia. The decomposition of the inertia is a principal axis decomposition, which is arrived at through a singular value decomposition.

In addition to projections on the factorial axes, for point i, F_s(i), and for point j, G_s(j), we also have the following, which are important for the interpretation of results.

We have contributions: f_i F_s(i)^2 is the absolute contribution of point i to the moment of inertia λ_s associated with factor s. Contributions are what determine the factors or axes.

We have also correlations. The correlation of a point with a factor is the cosine squared of that point/vector with the factor/axis: cos² a = F_s(i)^2 / \sum_{s'=1}^{S} F_{s'}(i)^2 is the relative contribution of the factor s to point i.
The correlation is said to be the extent to which point i illustrates (or exemplifies) the factor. Relations for column points, j, and factors G_s(j), hold symmetrically.

Analysis of the Dual Spaces, Transition Formulae, and Supplementary Elements
The factors in the two spaces, of rows/observations and of columns/attributes, are inherently related as follows:

F_s(i) = λ_s^{-1/2} \sum_{j=1}^{p} ( f_{ij}/f_i ) G_s(j)   for s = 1, 2, ..., S; i = 1, 2, ..., n
G_s(j) = λ_s^{-1/2} \sum_{i=1}^{n} ( f_{ij}/f_j ) F_s(i)   for s = 1, 2, ..., S; j = 1, 2, ..., p     (4)

These are termed the transition formulas. The coordinate of element i, 1 ≤ i ≤ n, is the barycentre (centre of gravity) of the coordinates of the elements j, 1 ≤ j ≤ p, with associated masses of value given by the coordinates f_{ij}/f_i of the profile of i. This is all to within the λ_s^{-1/2} constant.

We can consider normalized factors, φ_s(i) = λ_s^{-1/2} F_s(i), and similarly ψ_s(j) = λ_s^{-1/2} G_s(j). Therefore

φ_s(i) = \sum_{j=1}^{p} ( f_{ij}/f_i ) ψ_s(j)   for s = 1, 2, ..., S; i = 1, 2, ..., n
ψ_s(j) = \sum_{i=1}^{n} ( f_{ij}/f_j ) φ_s(i)   for s = 1, 2, ..., S; j = 1, 2, ..., p     (5)

This implies that we can pass easily from one space to the other, allowing us to simultaneously view and interpret observations and attributes.

Qualitatively different elements (i.e. row or column profiles), or ancillary characterization or descriptive elements, may be placed as supplementary elements. This means that they are given zero mass in the analysis, and their projections are determined using the transition formulas. This amounts to carrying out a Correspondence Analysis first, without these elements, and then projecting them into the factor space following the determination of all properties of this space.

The transition formulas allow supplementary rows or columns to be projected into either space. If ξ_j is the j-th element of a supplementary row, with mass ξ = \sum_j ξ_j, then a factor loading, for factor s, is simply obtained subsequent to the Correspondence Analysis:

ψ_s = λ_s^{-1/2} \sum_j ( ξ_j/ξ ) φ_s(j)

A similar formula holds for supplementary columns.
Such supplementary elements are therefore "passive" and are incorporated into the Correspondence Analysis results subsequent to the eigen-analysis being carried out.
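A minimal numerical sketch of supplementary projection, on a small hypothetical frequency table; the scaling by λ_s^{-1/2} (here, division by the singular value) follows the transition formula of equation (4).

```python
import numpy as np

# Correspondence Analysis of a toy frequency table, then projection of a
# zero-mass supplementary row via the transition formula.
N = np.array([[20., 10.,  5.],
              [10., 15., 10.],
              [ 5., 10., 20.]])
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)               # masses f_i, f_j
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
G = (Vt.T / np.sqrt(c)[:, None]) * sv             # column principal coordinates

# Supplementary row: its profile xi_j / xi, then
# F_s = lambda_s^(-1/2) * sum_j (xi_j / xi) * G_s(j), with lambda_s = sv_s^2.
xi = np.array([8., 4., 2.])
prof = xi / xi.sum()
safe_sv = np.where(sv > 1e-10, sv, 1.0)           # guard the trivial factor
F_supp = np.where(sv > 1e-10, (prof @ G) / safe_sv, 0.0)
```

The supplementary row gets coordinates without having influenced the factors at all, which is exactly the "passive" behaviour described above.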
In Summary
Correspondence Analysis is thus the inertial decomposition of the dual clouds of weighted points. It is a latent semantic decomposition, where in place of the term frequency–inverse document frequency (TF-IDF) weighting scheme there is the use of (i) profiles and masses, (ii) with the χ² distance.
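Pulling the appendix together, here is a compact sketch of the full computation (probabilities, standardized residuals, singular value decomposition, principal coordinates, factor inertias), on an arbitrary toy table. Function and variable names are illustrative.

```python
import numpy as np

def correspondence_analysis(N):
    """Basic Correspondence Analysis of a frequency table N (sketch)."""
    P = N / N.sum()                                  # probabilities f_ij
    r, c = P.sum(axis=1), P.sum(axis=0)              # masses f_i, f_j
    # Standardized residuals (f_ij - f_i f_j) / sqrt(f_i f_j): their squared
    # sum is the total inertia, formally similar to the chi-squared statistic.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    F = (U / np.sqrt(r)[:, None]) * sv               # row principal coordinates
    G = (Vt.T / np.sqrt(c)[:, None]) * sv            # column principal coords
    return F, G, sv**2                               # sv**2: factor inertias

N = np.array([[20., 10.,  5., 8.],
              [10., 15., 10., 4.],
              [ 5., 10., 20., 6.]])
F, G, eig = correspondence_analysis(N)
```

The factors come out in decreasing order of inertia; the row cloud is centred, so the mass-weighted mean of `F` is the origin; and pairwise χ² distances between row profiles equal Euclidean distances between rows of `F`, as in equation (3).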