Inferring the Origin Locations of Tweets with Quantitative Confidence
Reid Priedhorsky,* Aron Culotta,† Sara Y. Del Valle*
*Los Alamos National Laboratory, Los Alamos, NM ({reidpr,sdelvall}@lanl.gov)
†Illinois Institute of Technology, Chicago, IL ([email protected])
ABSTRACT
Social Internet content plays an increasingly critical role in many domains, including public health, disaster management, and politics. However, its utility is limited by missing geographic information; for example, fewer than 1.6% of Twitter messages (tweets) contain a geotag. We propose a scalable, content-based approach to estimate the location of tweets using a novel yet simple variant of gaussian mixture models. Further, because real-world applications depend on quantified uncertainty for such estimates, we propose novel metrics of accuracy, precision, and calibration, and we evaluate our approach accordingly. Experiments on 13 million global, comprehensively multi-lingual tweets show that our approach yields reliable, well-calibrated results competitive with previous computationally intensive methods. We also show that a relatively small amount of training data is required for good estimates (roughly 30,000 tweets) and that models are quite time-invariant (effective on tweets many weeks newer than the training set). Finally, we show that toponyms and languages with small geographic footprints provide the most useful location signals.
1. INTRODUCTION
Applications in public health [9], politics [29], disaster management [21], and other domains are increasingly turning to social Internet data to inform policy and intervention strategies. However, the value of these data is limited because the geographic origin of content is frequently unknown. Thus, there is growing interest in the task of location inference: given an item, estimate its geographic true origin.

We propose an inference method based on gaussian mixture models (GMMs) [22]. Our models are trained on geotagged tweets, i.e., messages with user profile and geographic true origin points.
For each unique n-gram, we fit a two-dimensional GMM to model its geographic distribution. To infer the origin of a new tweet, we combine previously trained GMMs for the n-grams it contains, using weights inferred from data; Figure 1 shows an example estimate. This approach is simple, scalable, and competitive with more complex approaches.

Our implementation is open source: http://github.com/reidpr/quac

Preprint version 3. LA-UR 13-23557. Please cite the published version in the ACM Digital Library (http://dx.doi.org//).
[Figure 1 shows a map (scale bar 0–800 km) with the estimate's heat map and the tweet's true origin.]

text: Americans are optimistic about the economy & like what Obama is doing. What is he doing? Campaigning and playing golf? Ignorance is bliss
language: en
location: Los Angeles, CA
time zone: pacifictimeuscanada

Figure 1. A tweet originating near Los Angeles, CA. We show the true origin (a blue star) and a heat map illustrating the density function that makes up our method's estimate. This estimate, whose accuracy was at the 80th percentile, was driven by two main factors. The unigram ca from the location field, visible as the large density oval along the California coast, contributed about 12% of the estimate, while angeles ca, the much denser region around Los Angeles, contributed 87%. The contribution of four other n-grams (angeles, los angeles, obama, and los) was negligible.

Location estimates using any method contain uncertainty, and it is important for downstream applications to quantify this uncertainty. While previous work considers only point estimates, we argue that a more useful form consists of a density estimate (of a probability distribution) covering the entire globe, and that estimates should be assessed on three independent dimensions of accuracy, precision, and calibration. We propose new metrics for doing so.

To validate our approach, we performed experiments on twelve months of tweets from across the globe, in the context of answering four research questions:

RQ1.
Improved approach. How can the origin locations of social Internet messages be estimated accurately, precisely, and with quantitative uncertainty? Our novel, simple, and scalable GMM-based approach produces well-calibrated estimates with a global mean accuracy error of roughly 1,800 km and precision of 900,000 square kilometers (or better); this is competitive with more complex approaches on the metrics available in prior work.

RQ2. Training size. How many training data are required? We find that approximately 30,000 tweets (i.e., roughly 0.01% of total daily Twitter activity) are sufficient for high-quality models, and that performance can be further improved with more training data at a cost of increased time and memory. We also find that models are improved by including rare n-grams, even those occurring just 3 times.

RQ3. Time dependence. What is the effect of a temporal gap between training and testing data? We find that our models are nearly independent of time, performing just 6% worse with a gap of 4 months (vs. no gap).

RQ4. Location signal sources. Which types of content provide the most valuable location signals? Our results suggest that the user location string and time zone fields provide the strongest signals; tweet text and user language are weaker but important to offer an estimate for all test tweets; and user description has essentially no location value. Our results also suggest that mentioning toponyms (i.e., names of places), especially at the city scale, provides a strong signal, as does using languages with a small geographic footprint.

The remainder of our paper is organized as follows. We first survey related work, then propose desirable properties of a location inference method and metrics which measure those properties. We then describe our experimental framework and detail our mixture model approach. Finally, we discuss our experimental results and their implications. Appendices with implementation details follow the body of the paper.
2. RELATED WORK
Over the past few years, the problem of inferring the origin locations of social Internet content has become an increasingly active research area. Below, we summarize the four primary lines of work and contrast them with this paper.
Perhaps the simplest approach to location inference is geocoding: looking up the user profile's free-text location field in a gazetteer (list of toponyms) and, if a match is found, inferring that the message originated from the matching place. Researchers have used commercial geocoding services such as Yahoo! Geocoder [32], U.S. Geological Survey data [26], and Wikipedia [16] to do this. This technique can be extended to the message text itself by first using a geoparser (a named-entity recognizer that extracts toponyms) [13].

Schulz et al. [30] recently reported accurate results using a scheme which combines multiple geocoding sources, including Internet queries. Crucial to its performance was the discovery that an additional 26% of tweets can be matched to precise coordinates using text parsing and by following links to location-based services (Foursquare, Flickr, etc.), an approach that can be incorporated into competing methods as well. Another 8% of tweets (likely the most difficult ones, as they contain the most subtle location evidence) could not be estimated and are not included in accuracy results.

In addition to one or more accurate, comprehensive gazetteers, these approaches require careful text cleaning before geocoding is attempted, as grossly erroneous false matches are common [16], and they tend to favor precision over recall (because only toponyms are used as evidence). Finally, under one view, our approach essentially infers a probabilistic gazetteer that weights toponyms (and pseudo-toponyms) according to the location information they actually carry.

A second line of work builds a statistical mapping of text to discrete pre-defined regions such as cities and countries (i.e., treating "origin location" as membership in one of these classes rather than a geographic point); thus, any token can be used to inform location inference.

We categorize this work by the type of classifier and by place granularity. For example, Cheng et al.
apply a variant of naïve Bayes to classify messages by city [6], Hecht et al. use a similar classifier at the state and country level [16], and Kinsella et al. use language models to classify messages by neighborhood, city, state, zip code, and country [19]. Mahmud et al. classify users by city with higher accuracy than Cheng et al. by combining a hierarchical classifier with many heuristics and gazetteers [20]. Other work instead classifies messages into arbitrary regions of fixed [25, 34] or dynamic size [28]. All of these require aggressively smoothing estimates for regions with few observations [6].

Recently, Chang et al. [5] classified tweet text by city using GMMs. While more related to the present paper because of the underlying statistical technique, this work is still fundamentally a classification approach, and it does not attempt the probabilistic evaluation that we advocate. Additionally, the algorithm resorts to heuristic feature selection to handle noisy n-grams; instead, we offer two learning algorithms to set n-gram weights which are both theoretically grounded and empirically crucial for accuracy.

Fundamentally, these approaches can only classify messages into regions specified before training; in contrast, our GMM approach can be used both for direct location inference as well as classification, even if regions are post-specified.

A third line of work endows traditional topic models [2] with location awareness [33]. Eisenstein et al. developed a cascading topic model that produces region-specific topics and used these topics to infer the locations of Twitter users [10]; follow-on work uses sparse additive models to combine region-specific, user-specific, and non-informative topics more efficiently [11, 17].

Topic modeling does not require explicit pre-specified regions. However, regions are inferred as a preprocessing step: Eisenstein et al. with a Dirichlet process mixture [10] and Hong et al. with K-means clustering [17].
The latter also suggests that more regions increase inference accuracy.

While these approaches result in accurate models, the bulk of modeling and computational complexity arises from the need to produce geographically coherent topics. Also, while topic models can be parallelized with considerable effort, doing so often requires approximations, and their global state limits the potential speedup. In contrast, our approach, which focuses solely on geolocation, is simpler and more scalable.

Finally, the efforts cited restrict messages to either the United States or the English language, and they report simply the mean and median distance between the true and predicted location, omitting any precision or uncertainty assessment. While these limitations are not fundamental to topic modeling, the novel evaluation and analysis we provide offer new insights into the strengths and weaknesses of this family of algorithms.

A fourth line of work suggests that using social link information (e.g., followers or friends) can aid in location inference [4, 8]. We view these approaches as complementary to our own; accordingly, we do not explore them more deeply at present.
We offer the following principal distinctions compared to prior work: (a) location estimates are multi-modal probability distributions, rather than points or regions, and are rigorously evaluated as such; (b) because we deal with geographic coordinates directly, there is no need to pre-specify regions of interest; (c) no gazetteers or other supplementary data are required; and (d) we evaluate on a dataset that is more comprehensive temporally (one year of data), geographically (global), and linguistically (all languages except Chinese, Thai, Lao, Cambodian, and Burmese).
3. EXPERIMENT DESIGN
In this section, we present three properties of a good location estimate, metrics and experiments to measure them, and new algorithms motivated by them.
An estimate of the origin location of a message should be able to answer two closely related but different questions:

Q1. What is the true origin of the message? That is, at which geographic point was the person who created the message located when he or she did so?

Q2. Was the true origin within a specified geographical region? For example, did a given message originate from Washington State?

It is inescapable that all estimates are uncertain. We argue that they should be quantitatively treated as such and offer probabilistic answers to these questions. That is, we argue that a location estimate should be a geographic density estimate: a function which estimates the probability of every point on the globe being the true origin. Considered through this lens, a high-quality estimate has the following properties:

• It is accurate: the density of the estimate is skewed strongly towards the true origin (i.e., the estimate rates points near the true origin as more probable than points far from it).
Figure 2. True origins of tweets having the unigram washington in the location field of the user's profile.
Then, Q1 can be answered effectively because the most dense regions of the distribution are near the true origin, and Q2 can be answered effectively because if the true origin is within the specified region, then much of the distribution's density will be as well.

• It is precise: the most dense regions of the estimate are compact. Then, Q1 can be answered effectively because fewer candidate locations are offered, and Q2 can be answered effectively because the distribution's density is focused within few distinct regions.

• It is well calibrated: the probabilities it claims are close to the true probabilities. Then, both questions can be answered effectively regardless of the estimate's accuracy and precision, because its uncertainty is quantified. For example, the two estimates "the true origin is within New York City with 90% confidence" and "the true origin is within North America with 90% confidence" are both useful even though the latter is much less accurate and precise.

Our goal, then, is to discover an estimator which produces estimates that optimize the above properties.

We now map these properties to operationalizable metrics. This section presents our metrics and their intuitive reasoning; rigorous mathematical implementations are in the appendices.
Our core metric to evaluate the accuracy of an estimate is comprehensive accuracy error (CAE): the expected distance between the true origin and a point randomly selected from the estimate's density function; in other words, the mean distance between the true origin and every point on the globe, weighted by the estimate's density value.
The goal here is to offer a notion of the distance from the true origin to the density estimate as a whole. (A similar metric, called expected distance error, has been proposed by Cho et al. for a different task of user tracking [7].)

This contrasts with a common prior metric that we refer to as simple accuracy error (SAE): the distance from the best single-point estimate to the true origin. Figure 2 illustrates this contrast. The tight clusters around both Washington, D.C. and Washington State show that washington is inherently bimodal; that is, no single point at either cluster or anywhere in between is a good estimated location. More generally, SAE is a poor match for the continuous, multi-modal density estimates that we argue are more useful for downstream analysis, because good single-point distillations are often unavailable. However, we report both metrics in order to make comparisons with prior work.

The units of CAE (and SAE) are kilometers. For a given estimator (i.e., a specific algorithm which produces location estimates), we report mean comprehensive accuracy error (MCAE), which is simply the mean of each estimate's CAE. CAE ≥ 0, and an ideal estimator has MCAE = 0.

In order to evaluate precision, we extend the notion of one-dimensional prediction intervals [3, 12] to two dimensions. An estimate's prediction region is the minimal, perhaps non-contiguous geographic region which contains the true origin with some specified probability (the region's coverage). Accordingly, the metric we propose for precision is simply the area of this region: prediction region area (PRA), parameterized by the coverage; e.g., PRA at coverage 0.5 is the area of the minimal region which contains the true origin with 50% probability. Units are square kilometers. For a given estimator, we report mean prediction region area (MPRA), i.e., the mean of each estimate's PRA. PRA ≥ 0; an ideal estimator has MPRA = 0.

Calibration is tested by measuring the difference between an estimate's claimed probability that a particular point is the true origin and its actual probability. We accomplish this by building upon prediction regions. That is, given a set of estimates, we compute a prediction region at a given coverage for each estimate and measure the fraction of true origins that fall within the regions. The result should be close to the specified coverage. For example, for prediction regions at coverage 0.5, the fraction of true origins that actually fall within the prediction region should be close to 0.5.

We refer to this fraction as observed coverage (OC) at a given expected coverage; for example, OC at 0.5 is the observed coverage for an expected coverage of 0.5. (This measure is common in the statistical literature for one-dimensional problems [3].) Calibration can vary among different expected coverage levels (because fitted density distributions may not exactly match actual true origin densities), so multiple coverage levels should be reported; in this paper, we report OC at two coverage levels.

Note that OC is defined at the estimator level, not for single messages. OC is unitless, and 0 ≤ OC ≤ 1. An ideal estimator has observed coverage equal to expected coverage; an overconfident estimator has observed coverage less than expected, and an underconfident one greater.
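For intuition, CAE can be approximated by discretizing an estimate into density-weighted sample points and averaging great-circle distances to the true origin. The sketch below is our simplification (the paper's exact implementation is in its appendices); it uses the haversine formula for great-circle distance:

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points, in degrees."""
    r = 6371.0  # mean Earth radius, km
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def cae_km(true_origin, points, weights):
    """Approximate CAE: the mean distance from the true origin to each
    sample point of a discretized estimate, weighted by density."""
    total = sum(weights)
    return sum(w * haversine_km(true_origin, p)
               for p, w in zip(points, weights)) / total
```

SAE, by contrast, would simply be `haversine_km` applied to the single best point of the estimate.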
In this section, we explain the basic structure of our experiments: data source, preprocessing and tokenization, and test procedures.
We used the Twitter Streaming API to collect an approximately continuous 1% sample of all global tweets from January 25, 2012 to January 23, 2013. Between 0.8% and 1.6% of these, depending on timeframe, contained a geotag (i.e., specific geographic coordinates marking the true origin of the tweet, derived from GPS or other automated means), yielding a total of approximately 13 million geotagged tweets.
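Filtering the stream down to geotagged tweets can be sketched as follows. The field names follow the classic Twitter Streaming API JSON, in which a geotag appears as a GeoJSON Point in the `coordinates` field (longitude first); treat this as an illustrative sketch rather than the paper's actual pipeline:

```python
import json

def iter_geotagged(lines):
    """Yield (lon, lat, user_id) for each tweet carrying an exact geotag.

    Expects one tweet per line in classic Twitter Streaming API JSON;
    tweets without a GeoJSON Point in `coordinates` are skipped.
    """
    for line in lines:
        tweet = json.loads(line)
        point = tweet.get("coordinates")
        if point and point.get("type") == "Point":
            lon, lat = point["coordinates"]
            yield lon, lat, tweet["user"]["id"]
```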
We tokenized the message text (tx), user description (ds), and user location (lo) fields, which are free-text, into unigrams and bigrams by splitting on Unicode character category and script boundaries and then further subdividing n-grams appearing to be Japanese using the TinySegmenter algorithm [15]. This covers all languages except a few that have low usage on Twitter: Thai, Lao, Cambodian, and Burmese (which do not separate words with a delimiter) as well as Chinese (which is difficult to distinguish from Japanese). For example, the string "Can't wait for 私 の" becomes the set of n-grams can, t, wait, for, 私, の, can t, t wait, wait for, for 私, and 私 の. (Details of our algorithm are presented in the appendices.)

For the language (ln) and time zone (tz) fields, which are selected from a set of options, we form n-grams by simply removing whitespace and punctuation and converting to lowercase. For example, "Eastern Time (US & Canada)" becomes simply easterntimeuscanada.

Each experiment is implemented using a Python script on tweets selected with a regular schedule. For example, we might train a model on all tweets from May 1 and test on a random sample of tweets from May 2, then train on May 7 and test on May 8, etc. This schedule has four parameters:

• Training duration.
The length of time from which to select training tweets. We used all selected tweets for training, except that only the first tweet from a given user is retained, to avoid over-weighting frequent tweeters.

• Test duration.
The length of time from which to select test tweets. In all experiments, we tested on a random sample of 2,000 tweets selected from one day. We excluded users with a tweet in the training set from testing, in order to avoid tainting the test set.

• Gap.
The length of time between the end of training data and the beginning of test data.

• Stride.
The length of time from the beginning of one training set to the beginning of the next. This was fixed at 6 days unless otherwise noted.

For example, an experiment with a training size of one day, no gap, and a stride of 6 days would schedule 61 tests across our 12 months of data and yield results which were the mean of the 58 tests with sufficient data (i.e., 3 tests were not attempted due to missing data). The advantage of this approach is that test data always chronologically follow training data, minimizing temporal biases and better reflecting real-world use.

We built families of related experiments (as described below) and report results on these families.

(As in prior work [10, 17, 28], we ignore the sampling bias introduced by considering only geotagged tweets. A preliminary analysis suggests this bias is limited: in a random sample of 11,694,033 geotagged and 17,175,563 non-geotagged tweets from 2012, we find a correlation of 0.85 between the unigram frequency vectors for each set; when retweets are removed, the correlation is 0.93. Also, more complex tokenization methods yielded no notable effect.)
4. OUR APPROACH: GEOGRAPHIC GMMS
Here, we present our location inference approach. We first motivate and summarize it, then detail the specific algorithms we tested. (Mathematical implementations are in the appendices.)
Examining the geographic distribution of n-grams can suggest appropriate inference models. For example, recall Figure 2 above; the two clusters, along with scattered locations elsewhere, suggest that a multi-modal distribution consisting of two-dimensional gaussians may be a reasonable fit. Based on this intuition and coupled with the desiderata above, we propose an estimator using one of the mature density estimation techniques: gaussian mixture models (GMMs). These models are precisely the weighted sum of multiple gaussian (normal) distributions and have natural probabilistic interpretations. Further, they have previously been applied to human mobility patterns [7, 14].

Our algorithm is summarized as follows:

1. For each n-gram that appears more than a threshold number of times in the training data, fit a GMM to the true origin points of the tweets in the training set that contain that n-gram. This n-gram / GMM mapping forms the trained location model.

2. To locate a test tweet, collect the GMMs from the location model which correspond to n-grams in the test tweet. The weighted sum of these GMMs (itself a GMM) is the geographic density function which forms the estimate of the test tweet's location.

It is clear that some n-grams will carry more location information than others. For example, n-gram density for the word the should have high variance and be dispersed across all English-speaking regions; on the other hand, density for washington should be concentrated in places named after that president.
That is, n-grams with much location information should be assigned high weight, and those with little information low weight (but not zero, so that messages with only low-information n-grams will have a quantifiably poor estimate rather than none at all). Accordingly, we propose three methods to set the GMM weights.
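The two steps above can be sketched as follows. For brevity, this sketch fits a single gaussian per n-gram (the paper fits full mixtures) and treats latitude/longitude as planar coordinates; both are our simplifications:

```python
import numpy as np

def fit_gaussians(train):
    """Fit one 2-D gaussian per n-gram to the true origins of the training
    tweets containing it. `train` maps n-gram -> list of (lat, lon)."""
    model = {}
    for ngram, pts in train.items():
        pts = np.asarray(pts, dtype=float)
        mean = pts.mean(axis=0)
        cov = np.cov(pts.T) + 1e-4 * np.eye(2)  # regularize degenerate fits
        model[ngram] = (mean, cov)
    return model

def density(model, weights, ngrams, point):
    """Evaluate the (unnormalized) weighted-sum density of a tweet's
    n-grams at `point`; unseen n-grams contribute nothing."""
    p = np.asarray(point, dtype=float)
    total = 0.0
    for g in ngrams:
        if g not in model:
            continue
        mean, cov = model[g]
        d = p - mean
        norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
        total += weights.get(g, 1.0) * norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)
    return total
```

In use, the point (or region) where this combined density is largest serves as the location estimate.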
One approach is to simply assign higher weight to GMMs which have a crisper signal or fit the data better. We tested 15 quality properties which measure this in different ways. (Indeed, Eisenstein et al. attribute the poor performance of several of their baselines to this tendency of uninformative words to dilute the predictive power of informative words [10].)
We tried weighting each GMM by the inverse of (1) the number of fitted points, (2) the spatial variance of these points, and (3) the number of components in the mixture. We also tried metrics based on the covariance matrices of the gaussian components: the inverse of (4) the sum of all elements and (5) the sum of the products of the elements in each matrix. Finally, we tried normalizing by both the number of fitted points (properties 6–9) and the number of components (10–13). Of these, property 5, which we call GMM-Qpr-Covar-Sum-Prod, performed the best, so we carry it forward for discussion.

Additionally, we tried two metrics designed specifically to test goodness of fit: (14) the Akaike information criterion [1] and (15) the Bayesian information criterion [31], transformed into weights by subtracting from the maximum observed value. Of this pair, property 14, which we call GMM-Qpr-AIC, performed best, so we carry it forward.
Another approach is to weight each n-gram by its error on the training set. Specifically, for each n-gram in the learned model, we compute the error of its GMM (CAE or SAE) against each of the points to which it was fitted. We then raise this error to a power (in order to increase the dominance of relatively good n-grams over relatively poor ones) and use the inverse of this value as the n-gram's weight (i.e., larger errors yield smaller weights).

We refer to these algorithms as (for example) GMM-Err-SAE4, which uses the SAE error metric and an exponent of 4. We tried exponent values from 0.5 to 10 as well as both CAE and SAE; because the latter was faster and gave comparable results, we report only SAE.
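This error-based weighting can be sketched in a few lines; the floor on the mean error is our addition, to keep near-zero training errors from producing unbounded weights:

```python
def sae_weight(errors_km, exponent=4.0, floor_km=1.0):
    """Weight an n-gram by the inverse of its mean training error raised
    to a power, so relatively accurate n-grams dominate the mixture.
    The floor (our addition) bounds weights for near-zero errors."""
    mean_err = max(sum(errors_km) / len(errors_km), floor_km)
    return 1.0 / mean_err ** exponent
```

For example, with the default exponent of 4, halving an n-gram's mean error multiplies its weight by 16.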
The above approaches are advantaged by varying degrees of speed and simplicity. However, it seems plausibly better to learn optimized weights from the data themselves. Our basic approach is to assign each n-gram a set of features with their own weights, let each n-gram's weight be a linear combination of the feature weights, and use gradient descent to find feature weights such that the total error across all n-grams is minimized (i.e., total geolocation accuracy is maximized).

For optimization, we tried three types of n-gram features:

1. The quality properties noted above (Attr).

2. Identity features; that is, the first n-gram had Feature 1 and no others, the second n-gram had Feature 2 and no others, and so on (ID).

3. Both types of features (Both).

Finally, we further classify these algorithms by whether we fit a mixture for each n-gram (GMM) or a single gaussian (Gaussian). For example, GMM-Opt-ID uses GMMs and weights optimized using ID features only.
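The optimization can be sketched as follows. The exponential parameterization (to keep n-gram weights positive), the squared-error objective on weighted centroids, and the finite-difference gradient are all our assumptions for this toy version; the paper specifies only that each n-gram's weight is a combination of learned feature weights:

```python
import numpy as np

def learn_feature_weights(feats, centers, tweets, steps=60, lr=0.1):
    """Gradient descent on feature weights theta, where each n-gram's
    weight is exp(theta . features) and a tweet's estimate is the
    weighted centroid of its n-grams' gaussian means.

    feats:   n-gram -> feature vector
    centers: n-gram -> fitted gaussian mean (lat, lon)
    tweets:  list of (n-gram list, true origin (lat, lon))
    """
    theta = np.zeros(len(next(iter(feats.values()))))

    def total_error(th):
        err = 0.0
        for ngrams, true_pt in tweets:
            w = np.array([np.exp(th @ np.asarray(feats[g])) for g in ngrams])
            pts = np.array([centers[g] for g in ngrams])
            est = w @ pts / w.sum()
            err += float(np.sum((est - np.asarray(true_pt)) ** 2))
        return err

    eps = 1e-4
    for _ in range(steps):
        grad = np.zeros_like(theta)
        for i in range(theta.size):  # finite-difference gradient
            d = np.zeros_like(theta)
            d[i] = eps
            grad[i] = (total_error(theta + d) - total_error(theta - d)) / (2 * eps)
        theta -= lr * grad
    return theta
```

With identity (ID) features, each n-gram's feature vector is a one-hot indicator, so this reduces to learning one weight per n-gram directly.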
As two final baselines, we considered GMM-All-Tweets, which fits a single GMM to all tweets in the training set and returns that GMM for all locate operations, and GMM-One, which weights all n-gram mixtures equally.
5. RESULTS
We present in this section our experimental results and discussion, framed in the context of our four research questions. (In addition to the experiments described in detail above, we tried several variants that had limited useful impact. These results are summarized in the appendices.)
Here we evaluate the performance of our algorithms, first by comparing them against each other and then against prior work (where the comparison is less detailed, due to the metrics available).
We tested each of our algorithms with one day of training data and no gap, all fields except user description, and minimum n-gram instances set to 3 (detailed reasoning for these choices is given below in further experiments). With a stride of 6 days, this yielded 58 tests on each algorithm, with 3 tests not attempted due to gaps in the data. Table 1 summarizes our results, making clear the importance of choosing n-gram weights well.

Considering accuracy (MCAE), GMM-Err-SAE10 is 10% better than the best optimization-based algorithm (GMM-Opt-ID) and 26% better than the best property-based algorithm (GMM-Qpr-Covar-Sum-Prod); the baselines GMM-One and GMM-All-Tweets performed poorly. These results suggest that a weighting scheme directly related to performance, rather than the simpler quality properties, is important; even including quality properties in optimization (-Opt-Attr and -Opt-Both) yields poor results. Another highlight is the poor performance of Gaussian-Opt-ID vs. GMM-Opt-ID. Recall that the former uses a single gaussian for each n-gram; as such, it cannot fit the multi-modal nature of these data well.

Turning to precision (MPRA), the advantage of GMM-Err-SAE10 is further highlighted; it is 50% better than GMM-Opt-ID and 38% better than GMM-Qpr-Covar-Sum-Prod (note that the relative order of these two algorithms has reversed).

However, calibration complicates the picture. While GMM-Err-SAE10 is somewhat overconfident at coverage level 0.5 (OC = 0.453 instead of the desired 0.5), GMM-Err-SAE4 is calibrated very well at this level, though at the higher coverage level it achieves 0.775 instead of 0.724. GMM-Opt-ID has still better calibration at this level, while other algorithms are poorly calibrated at one level (Gaussian-Opt-ID) or both levels (GMM-All-Tweets). In short, our calibration results imply that algorithms should be evaluated at multiple coverage levels, and in particular that gaussians may not be quite the right distribution to fit.

These performance results, which are notably inconsistent between the three metrics, highlight the value of carefully considering and tuning all three of accuracy, precision, and calibration. For the remainder of this paper, we will focus on
[Figure 3 plots CAE (km) against rank for Gaussian-Opt-ID, GMM-Qpr-Covar-Sum-Prod, GMM-Opt-ID, and GMM-Err-SAE4.]

Figure 3. Accuracy of each estimate using selected algorithms, in descending order of CAE.
GMM-Err-SAE4, with its simplicity, superior calibration, time efficiency, and second-best accuracy and precision.

A plausible hypothesis is that the more complex CAE metric is not needed, and that algorithm accuracy can be sufficiently well judged with the simpler and faster SAE. However, Gaussian-Opt-ID offers evidence that this is not the case: while it is only 4% worse than GMM-Err-SAE4 on MSAE, the relative difference is nearly 6 times greater in MCAE. Several other algorithms are more consistent between the two metrics, so SAE may be appropriate in some cases, but caution should be used, particularly when comparing different types of algorithms.

Figure 3 plots the CAE of each estimate from four key algorithms. These curves are classic long-tail distributions (as are similar ones for PRA, omitted for brevity); that is, a relatively small number of difficult tweets comprise the bulk of the error. Accordingly, summarizing our results by median instead of mean may be of some value: for example, the median CAE of GMM-Err-SAE4 is 778 km, and its median PRA is 83,000 km² (roughly the size of Kansas or Austria). However, we have elected to focus on reporting means in order to not conceal poor performance on difficult tweets.

It is plausible that different algorithms may perform poorly on different types of test tweets, though we have not explored this; the implication is that selecting different strategies based on properties of the tweet being located may be of value.

Table 2 compares
GMM-Opt-ID and GMM-Err-SAE to five competing approaches using data from Eisenstein et al. [10], using mean and median SAE (as these were the only metrics reported).

These data and our own have important differences. First, they are limited to tweets from the United States; thus, we expect lower error here than in our data, which contain tweets from across the globe. Second, these data were created for user location inference, not message location (that is, they are designed for methods which assume users tend to stay near the same location, whereas our model makes no such assumption and thus may be more appropriate when locating messages from unknown users). To adapt them to our message-based algorithms, we concatenate all tweets from each user, treating them as a single message, as in [17]. Finally, the Eisenstein data contain only unigrams from the text field (as we will show, including information from other fields can notably improve results); for comparison, we do the same. This yields 7,580 training and 1,895 test messages (i.e., roughly 380,000 tweets versus 13 million in our data set).

[Table 1 appears here with columns Algorithm, MCAE, MSAE, MPRA, OC (two coverage levels), and RT; its first row begins "GMM-Err-SAE10 1735 ± 81, 1510 ± 76, 824 ...", but the remaining cell values are not legible in this copy.]

Table 1. Performance of key algorithms; we report the mean and standard deviation of each metric across each experiment's tests. MCAE and MSAE are in kilometers, MPRA is in thousands of km², and OC_β is unitless. RT is the mean run time, in minutes, of one train-test cycle using 8 threads on 6100-series Opteron processors running at 1.9 GHz.

Algorithm               SAE mean   SAE median   OC     n-grams
Hong et al. [17]                   373
Eisenstein et al. [11]  845        501
GMM-Opt-ID              870        534          0.50   19
Roller et al. [28]      897        432
Eisenstein et al. [10]  900        494
GMM-Err-SAE6            946        588          0.50   153
GMM-Err-SAE16           954        493          0.36   37
Wing et al. [34]        967        479
GMM-Err-SAE4            985        684          0.55   182

Table 2. Our algorithms compared with previous work, using the dataset from Eisenstein et al. [10]. The n-grams column reports the mean number of n-grams used to locate each test tweet.

Judged by mean SAE,
GMM-Opt-ID surpasses all other approaches except Eisenstein et al. [11]. Interestingly, the algorithm ranking varies depending on whether mean or median SAE is used; e.g.,
GMM-Err-SAE16 has lower median SAE than [11] but a higher mean SAE. This trade-off between mean and median SAE also appears in other work; for example, Eisenstein et al. report the best mean SAE but a much higher median SAE [11], and Hong et al. report the best median SAE but do not report mean at all [17].

Examining the results for GMM-Err-SAE sheds light on this discrepancy. We see that as the exponent increases from 4 to 16, the median SAE decreases from 684 km to 493 km. However, calibration suffers rather dramatically: GMM-Err-SAE16 has a quite overconfident OC = 0.36. This is explained in part by its use of fewer n-grams per message (182 for an exponent of 4 versus 37 for exponent 16).

Moreover, to our knowledge, no prior work reports either precision or calibration metrics, making a complete comparison impossible. For example, the better mean SAE of Eisenstein et al. [11] may coincide with worse precision or calibration. These metrics are not unique to our GMM method, and we argue that they are critical to understanding techniques in this space, as the trade-off above demonstrates. Finally, we speculate that the simplicity and scalability of our approach may outweigh a modest decrease in accuracy. Specifically, in contrast to topic modeling approaches, our learning phase can be trivially parallelized by n-gram.

We evaluated the accuracy of GMM-Err-SAE4 on different training durations, holding fixed no gap, all fields except user description, and minimum instances of 3. We used a stride of 13 days for performance reasons. Figure 4 shows our results. The knee of the curve is 1 day of training (i.e., about 30,000 tweets), with error rapidly plateauing and training time increasing as more data are added; accordingly, we use 1 training day in our other experiments.

Figure 4. Accuracy of GMM-Err-SAE4 with different amounts of training data, along with the mean time to train and test one model. Each day contains roughly 32,000 training tweets. (The 16-day test was run in a nonstandard configuration and its timing is therefore omitted.)

Figure 5. Accuracy and run time of GMM-Err-SAE4 vs. inclusion thresholds for the number of times an n-gram appears in training data.
We also evaluated accuracy when varying minimum instances (the frequency threshold for retaining n-grams), with training days fixed at 1; Figure 5 shows the results. Notably, including n-grams which appear only 3 times in the training set improves accuracy at modest time cost (and thus we use this value in our other experiments). This might be explained in part by the well-known long-tail distribution of word frequencies; that is, while the informativeness of each individual n-gram may be low, the fact that low-frequency words occur in so many tweets can impact overall accuracy. This finding supports Wing & Baldridge's suggestion [34] that Eisenstein et al. [10] pruned too aggressively by setting this threshold to 40.
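The frequency-threshold pruning described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the helper name and toy messages are hypothetical, while the threshold of 3 is the value the experiments settle on.

```python
from collections import Counter

def prune_ngrams(messages, min_instances=3):
    # Count occurrences of each n-gram across all training messages,
    # then keep only n-grams appearing at least min_instances times.
    counts = Counter(ng for msg in messages for ng in msg)
    return {ng for ng, c in counts.items() if c >= min_instances}

# Toy data: "ca" appears 4 times, "des moines" twice, "hello" once.
msgs = [["des moines", "ca"], ["ca", "hello"], ["ca"], ["des moines", "ca"]]
kept = prune_ngrams(msgs, min_instances=3)
```

With the threshold at 3, only "ca" survives; lowering it to 1 would retain all three n-grams, trading run time for the accuracy gain discussed above.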
We evaluated the accuracy of GMM-Err-SAE4 on different temporal gaps between training and testing, holding fixed a training duration of 1 day and minimum n-gram instances of 3. Figure 6 summarizes our results. Location inference is surprisingly time-invariant: while error rises linearly with gap duration, it does so slowly; there is only about 6% additional error with a four-month gap. We speculate that this is simply because location-informative n-grams which are time-dependent (e.g., those related to a traveling music festival) are relatively rare. (We also observed deteriorating calibration beyond 1 day of training; this may explain some of the accuracy improvement and should be explored.)

Figure 6. Accuracy of GMM-Err-SAE4 with increasing delay between training and testing.

We wanted to understand which types of content provide useful location information under our algorithm. For example, Figure 1 on the first page illustrates a successful estimate by
Field              Alone              Improvement
                   MCAE    success    MCAE    success
user location      2125    65.8%      1255    1.7%
user time zone     2945    76.1%       910    3.0%
tweet text         3855    95.7%       610    7.3%
user description   4482    79.7%       221    3.3%
user language      6143   100.0%      -103    8.5%

Table 3. Value of each field. Alone shows the accuracy and success rate of estimation using that field alone, while Improvement shows the mean improvement when adding a field to each combination of the other fields (in both cases, positive indicates improvement). For example, adding user location to some combination of the other four fields will, on average, decrease MCAE by 1,255 km and increase the success rate by 1.7 percentage points.
GMM-Err-SAE4. Recall that this was based almost entirely on the n-grams angeles ca and ca, both from the location field. Table 6 in the appendices provides a further snapshot of the algorithm's output. These hint that, consistent with other methods (e.g., [16]), toponyms provide the most important signals; below, we explore this hypothesis in more detail.

One framing of this research question is structural. To measure this, we evaluated GMM-Err-SAE4 on each combination of the five tweet fields, holding fixed training duration at 1 day, gap at zero, and minimum instances at 3. This requires an additional metric: success rate is the fraction of test tweets for which the model can estimate a location (i.e., at least one n-gram in the test tweet is present in the trained model).

Table 3 summarizes our results, while Table 4 enumerates each combination. User location and time zone are the most accurate fields, with tweet text and language important for success rate. For example, comparing the first and third rows of Table 4, we see that adding the text and language fields to a model that considers only the location and time zone fields improves MCAE only slightly (39 km) but improves success rate considerably (by 12.3 percentage points, to 100.0%). We speculate that while tweet text is a noisier source of evidence than time zone (due to the greater diversity of locations associated with each n-gram), our algorithm is able to combine these sources to increase both accuracy and success rate.

Table 4. Accuracy of including different fields (columns: Rank, Fields, MCAE, success). We list each combination of fields, ordered by increasing MCAE.

It is also interesting to compare the variant considering only the location field (row 8 of Table 4) with previous work that heuristically matches strings from the location field to gazetteers. Hecht et al. found that 66% of user profiles contain some type of geographic information in their location field [16], which is comparable to the 67% success rate of our model using only the location field. Surprisingly, user description adds no value at all; we speculate that it tends to be redundant with user location.

We also approached this question by content analysis. To do so, from an arbitrarily chosen test of the 58 successful
GMM-Err-SAE4 tests, we selected a "good" set of the 400 (or 20%) lowest-CAE tweets, and a "bad" set of the 400 highest-CAE tweets. We further randomly subdivided these sets into 100 training tweets (yielding 162 good n-grams and 457 bad ones) and 300 testing tweets (364 good n-grams and 1,306 bad ones, of which we randomly selected 364).

Two raters independently created categories by examining n-grams from the location and tweet text fields in the training sets. These categories were merged by discussion into a unified hierarchy. The same raters then independently categorized n-grams from the two fields into this hierarchy, using Wikipedia to confirm potential toponyms and Google Translate for non-English n-grams. Disagreements were again resolved by discussion.
Our results are presented in Table 5. Indeed, toponyms offer the strongest signal; fully 83% of the n-gram weight in well-located tweets is due to toponyms, including 49% from city names. In contrast, n-grams used for poorly-located tweets tended to be non-toponyms (57%). Notably, languages with geographically compact user bases, such as Dutch, also provided strong signals even for non-toponyms.

These results and those in the previous section offer a key insight into gazetteer-based approaches [13, 16, 26, 30, 32], which favor accuracy over success rate by considering only toponyms. However, our experiments show that both accuracy and success rate are improved by adding non-toponyms, the latter to nearly 100%; for example, compare rows 1 and 8 of Table 4. Further, Table 5 shows that 17% of the location signal in well-located tweets is not from toponyms.
6. IMPLICATIONS
We propose new judgement criteria for location estimates and specific metrics to compute them. We also propose a simple, scalable method for location inference that is competitive with more complex ones, and we validate this approach using our new criteria on a dataset of tweets that is comprehensive temporally, geographically, and linguistically.

This has implications both for location inference research and for applications which depend on such inference. In particular, our metrics can help these and related inference domains better balance the trade-off between precision and recall and reason properly in the presence of uncertainty.

Our results also have implications for privacy. In particular, they suggest that social Internet users wishing to maximize their location privacy should (a) mention toponyms only at state- or country-scale, or perhaps not at all, (b) not use languages with a small geographic footprint, and, for maximal privacy, (c) mention decoy locations. However, if widely adopted, these measures will reduce the utility of Twitter and other social systems for public-good uses such as disease surveillance and response. Our recommendation is that system designers should provide guidance enabling their users to thoughtfully balance these issues.

Future directions include exploring non-gaussian and non-parametric density estimators and improved weighting algorithms (e.g., perhaps those optimizing multiple metrics), as well as ways to combine our approach with others, in order to take advantage of a broader set of location clues. We also plan to incorporate priors such as population density and to compare with human location assessments. (We did a similar analysis of the language and time zone fields, using their well-defined vocabularies instead of human judgement. However, this did not yield significant results, so we omit it for brevity.)
Category          Good   Bad    Sig.   Examples
location          0.83          ***
  city            0.49          ***
not-location      0.07   0.57   ***
  dutch word                    ***
  st…                                  new, i, pages, check my
  letter          0.01   0.04          µ, w, α, s
  slang           0.00   0.08   ***    bitch, lad, ass, cuz
  spanish word    0.00   0.07   ***    mucha, niña, los, suerte
  swedish word    0.00   0.02          rätt, jävla, på, kul
  turkish word    0.02   0.00          kar, restoran, biraz, daha
  untranslated    0.02   0.00          cewe, gading, ung, suria
technical                       **
  foursquare                    ***
other             0.03   0.04

Table 5. Content analysis of n-grams in the location and text fields. For each category, we show the fraction of total weight in all location estimates from n-grams of that category; e.g., 49% of all estimate weight in the good estimates was from n-grams with category city (weights do not add up to 100% because the time zone and language fields are not included). Weights that are significantly greater in good estimates than bad (or vice versa) are indicated with a significance code (◦, ∗, ∗∗, ∗∗∗); the null hypothesis is that the mean weight for a category in the good set is equal to the mean weight for the same category in the bad set. Categories with less than 1.5% weight in both classes are rolled up into other. We also show the top-weighted examples in each category.
7. ACKNOWLEDGMENTS
Susan M. Mniszewski, Geoffrey Fairchild, and other members of our research team provided advice and support. We thank our anonymous reviewers for key guidance and the Twitter users whose content we studied. This work is supported by NIH/NIGMS/MIDAS, grant U01-GM097658-01. Computation was completed using Darwin, a cluster operated by CCS-7 at LANL and funded by the Accelerated Strategic Computing Program; we thank Ryan Braithwaite for his technical assistance. Maps were drawn using Quantum GIS (http://qgis.org); base map geodata is from Natural Earth (http://naturalearthdata.com). LANL is operated by Los Alamos National Security, LLC for the Department of Energy under contract DE-AC52-06NA25396.
8. APPENDIX: MATHEMATICAL IMPLEMENTATIONS

8.1 Metrics
This section details the mathematical implementation of the metrics presented above. To do so, we use the following vocabulary. Let m be a message represented by a binary feature vector of n-grams (i.e., sequences of up to n adjacent tokens; we use n = 2): m = {w_1 . . . w_V}, w_j ∈ {0, 1}, where w_j = 1 if n-gram j appears in message m, and V is the total size of the vocabulary. Let y ∈ R² represent a geographic point (for example, latitude and longitude) somewhere on the surface of the Earth. We represent the true origin of a message as y*; given a new message m, our goal is to construct a geographic density estimate f(y|m), a function which estimates the probability of each point y being the true origin of m.

These implementations are valid for any density estimate f, not just gaussian mixture models. Specific types of estimates may require further detail; for GMMs, this is noted below.

CAE depends further on the geodesic distance d(y, y*) between the true origin y* and some other point y. It can be expressed as:

    CAE = E_f[d(y, y*)] = ∫_y d(y, y*) f(y|m) dy    (1)

As computing this integral is intractable in general, we approximate it using a simple Monte Carlo procedure. First, we generate a random sample of n points from the density f, S = {y_1 . . . y_n}. Using this sample, we compute CAE as follows:

    CAE ≈ (1/|S|) Σ_{y∈S} d(y, y*)    (2)

Note that in this implementation, the weighting has become implicit: points that are more likely according to f are simply more likely to appear in S. Thus, if f is a good estimate, most of the samples in S will be near the true origin.

To implement PRA, let R_{f,β} be a prediction region such that the probability of y* falling within the geographic region R is its coverage β.
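Before turning to PRA, the Monte Carlo approximation of CAE (Equation 2) can be sketched as follows. This is a minimal illustration with all names hypothetical: a toy spherical-gaussian mixture sampler stands in for sampling from f, and the haversine great-circle distance stands in for the paper's geodesic distance d(y, y*).

```python
import math, random

def haversine_km(a, b):
    # Great-circle distance in km between (lat, lon) points in degrees;
    # a spherical approximation of the geodesic distance d(y, y*).
    (la1, lo1), (la2, lo2) = a, b
    p1, p2 = math.radians(la1), math.radians(la2)
    dp, dl = math.radians(la2 - la1), math.radians(lo2 - lo1)
    h = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def sample_gmm(components, n):
    # Draw n points from a mixture; components = [(weight, (lat, lon), sigma_deg)].
    pts = []
    for _ in range(n):
        r, acc = random.random(), 0.0
        for w, (lat, lon), s in components:
            acc += w
            if r <= acc:
                pts.append((random.gauss(lat, s), random.gauss(lon, s)))
                break
    return pts

def cae(components, true_origin, n=1000):
    # Equation 2: mean distance from density samples to the true origin.
    s = sample_gmm(components, n)
    return sum(haversine_km(y, true_origin) for y in s) / len(s)

random.seed(0)
# A density concentrated near the true origin yields a small CAE;
# one centered far away yields a large CAE.
good = cae([(1.0, (35.0, -106.0), 0.1)], (35.0, -106.0))
bad = cae([(1.0, (45.0, -90.0), 0.1)], (35.0, -106.0))
```

As the text notes, no explicit weighting is needed: likely points simply appear more often in the sample.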
(The implementations of our metrics depend on being able to efficiently (a) sample a point from f and (b) evaluate the probability of any point.)

Then, PRA_β is simply the area of R:
%ile 100: "I'm at Court Avenue Restaurant and Brewing Company (CABCO) (309 Court Avenue, Des Moines) w/ … http://t.co/LW8cKUG3"; location: Urbandale, IA; TZ: central; L: en; n-grams: 0.50 tx moines, 0.50 tx des moines; CAE: 4; PRA: 3,490.

"Eyebrow threading time with @mention :)"; location: Cardiff, Wales; L: en; n-grams: 0.73 lo cardiff, ….

"@mention exhibition date announced soon @mention kkkkkk besta"; TZ: santiago; L: en; n-grams: 0.91 tx kkkkkk, 0.08 tz santiago; CAE: 1,496; PRA: 511,405.

%ile 20: "@mention eu entrei no site é em dólar, se for real eu compro uma pra vc ir de novo Pra Disney agora." (Portuguese: "I went on the site, it's in dollars; if it's in reais I'll buy you one so you can go again. To Disney now."); location: Belem-PA; TZ: brasilia; L: pt; n-grams: 0.89 tx de novo, 0.07 lo pa; CAE: 2,645; PRA: 263,576.

%ile 10: "Þegar ég get ekki sofið http://t.co/zx43NoZD" (Icelandic: "When I can't sleep"); L: en; n-grams: 0.81 tx get, 0.05 ln en, 0.02 tx t, 0.02 tx zx, 0.02 tx co, 0.02 tx t co; CAE: 5,505; PRA: 2,185,354.

%ile 0: "@mention cyber creeping ya mean! I'm in New Zealand not OZ you mad expletive haha it's deadly anyways won't b home anytime soon :P"; L: en; n-grams: 1.00 tx expletive.
Table 6. Example output of GMM-Err-SAE4 for an arbitrarily selected test. TZ is the time zone field (with -timeuscanada omitted), while L is the language code. N-grams which collectively form 95% of the estimate weight are listed. CAE is in kilometers, while PRA is in square kilometers.

    PRA_β = ∫_{R_{f,β}} dy    (3)

As above, we can use a sample of points S from f to construct an approximate version of R:

1. Sort S in descending order of likelihood f(y_i|m). Let S_β be the set containing the top |S|·β sample points.
2. Divide S_β into approximately convex clusters.
3. For each cluster of points, compute its convex hull, producing a geo-polygon.
4. The union of these hulls is approximately R_{f,β}, and the area of this set of polygons is approximately PRA_β. (Because the polygons lie on an ellipsoidal Earth, not a plane, we must compute the geodesic area rather than a planar area. This is accomplished by projecting the polygons to the Mollweide equal-area projection and computing the planar area under that projection.)

Finally, recall that OC_β for a given estimator and a set of test messages is the fraction of tests where y* was within the prediction region R_{f,β}. That is, for a set (y*_1, y*_2, . . . , y*_n) of n true message origins:

    OC_β = (1/n) Σ_{i=1}^{n} [y*_i ∈ R_{i,f,β}]    (4)

We do not explicitly test whether y* ∈ R, because doing so propagates any errors in approximating R. Instead, we count how many samples in S have likelihood less than f(y*|m); if this fraction is greater than 1 − β, then y* is (probably) in R. Specifically:

    r(y*) = (1/|S|) Σ_{y∈S} [f(y) < f(y*)]    (5)

    OC_β ≈ (1/n) Σ_{i=1}^{n} [r(y*_i) > 1 − β]    (6)

As introduced in section "Our Approach", we construct our location model by training on geographic data consisting of a set D of n (message, true origin) pairs extracted from our database of geotagged tweets; i.e., D = {(m_i, y*_i)}_{i=1}^{n}. For each n-gram w_j, we fit a gaussian mixture model g(y|w_j) based on examples in D.
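The training and estimation steps detailed below can be sketched as follows. This is a simplified illustration, not the paper's implementation: each n-gram density is fit here as a single spherical gaussian instead of with scikit-learn's r-component EM fit, and all names and data are hypothetical.

```python
from collections import defaultdict

def fit_ngram_density(points):
    # Fit one spherical gaussian (mean, shared variance) to the true
    # origins of messages containing an n-gram; a stand-in for the
    # paper's r-component EM fit.
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    var = sum((p[0] - mx) ** 2 + (p[1] - my) ** 2 for p in points) / (2 * n) + 1e-6
    return (mx, my, var)  # one component, so its mixture weight pi = 1

def train(data):
    # data: list of (ngram_list, (lat, lon)) pairs -> n-gram densities g.
    origins = defaultdict(list)
    for ngrams, y in data:
        for ng in set(ngrams):
            origins[ng].append(y)
    return {ng: fit_ngram_density(pts) for ng, pts in origins.items()}

def message_density(model, ngrams, delta=None):
    # Combine available n-gram densities into one flat mixture
    # (Equations 8-9): component weights are renormalized to sum to 1.
    comps = [(1.0 if delta is None else delta.get(ng, 1.0), model[ng])
             for ng in ngrams if ng in model]
    z = sum(w for w, _ in comps)
    return [(w / z, g) for w, g in comps]

def point_estimate(density):
    # Weighted average of component means (the SAE point estimate).
    lat = sum(w * g[0] for w, g in density)
    lon = sum(w * g[1] for w, g in density)
    return (lat, lon)

data = [(["des moines", "ca"], (41.6, -93.6)),
        (["des moines"], (41.5, -93.7)),
        (["ca"], (36.8, -119.4))]
model = train(data)
f = message_density(model, ["des moines", "unknown"])  # "unknown" is skipped
est = point_estimate(f)
```

Note how an n-gram absent from the trained model is simply ignored, matching the success-rate definition: a message with no trained n-grams yields no estimate at all.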
Then, to estimate the origin location of a new message m, we combine the mixture models for all n-grams in m into a new density f(y|m). These steps are detailed below.

We estimate g for each (sufficiently frequent) n-gram w_j in D as follows. First, we gather the set of true origins of all messages containing w_j, and then we fit a gaussian mixture model of r components to represent the density of these points:

    g(y|w_j) = Σ_{k=1}^{r} π_jk N(y|µ_jk, S_jk)    (7)

where π_j = {π_j1 . . . π_jr} is a vector of mixture weights and N is the normal density function with mean µ_jk and covariance S_jk. We refer to g(y|w_j) as an n-gram density. We fit these parameters independently for each n-gram using the expectation-maximization algorithm, as implemented in the Python package scikit-learn [27].

Choosing the number of components r is a well-studied problem. While Dirichlet process mixtures [24] are a common solution, they can scale poorly. For simplicity, we instead investigated a number of heuristic approaches from the literature [23]; in our case, r = min(m, log(n)/2) worked well, where n is the number of points to be clustered and m is a parameter. We use this heuristic with m = 20 in all experiments.

Next, to estimate the origin of a new message m, we gather the available densities g for each n-gram in m (i.e., some n-grams may appear in m but not in sufficient quantity in D). We combine these n-gram densities into a mixture of GMMs:

    f(y|m) = Σ_{w_j∈m} δ_j g(y|w_j) = Σ_{w_j∈m} δ_j Σ_{k=1}^{r} π_jk N(y|µ_jk, S_jk)    (8)

where δ = {δ_1 . . . δ_V} are the n-gram mixture weights associated with each n-gram density g. We refer to f(y|m) as a message density.

A mixture of GMMs can be implemented as a single GMM by multiplying δ_j by π_jk for all j, k and renormalizing so that the mixture weights sum to 1. Thus, Equation 8 can be rewritten:

    f(y|m) = Σ_{w_j∈m} Σ_{k=1}^{r} τ_jk N(y|µ_jk, S_jk)    (9)

where τ_jk = δ_j π_jk / Σ_{j,k} δ_j π_jk.

We can now compute all four metrics. CAE and OC_β require no additional treatment. To compute SAE, we distill f(y|m) into a single point estimate by the weighted average of its component means: ŷ = Σ_{w_j∈m} Σ_{k=1}^{r} τ_jk µ_jk. Computing PRA_β requires dividing S_β into convex clusters; we do so by assigning each point in S to its most probable gaussian in f.

The next two sections describe methods to set the n-gram mixture weights δ_j.

δ_j weights by inverse error

Mathematically, the inverse error approach introduced above can be framed as a non-iterative optimization problem. Specifically, we set δ by fitting a multinomial distribution to the observed error distribution. Let e_ij ∈ R≥0 be the error incurred by n-gram density g(y|w_j) for message m_i; in our implementation, we use SAE as e_ij for performance reasons (results with CAE are comparable). Let e_j be the average error of n-gram w_j: e_j = (1/N_j) Σ_{i=1}^{N_j} e_ij, where N_j is the number of messages containing w_j. We introduce a model parameter α, which places a non-linear (exponential) penalty on the error terms e_j.
The problem is to minimize the negative log likelihood, with constraints that ensure δ is a probability distribution:

    δ* ← argmin_δ − log Π_j δ_j^(e_j^−α)    (10)

    s.t. Σ_j δ_j = 1,  δ_j ≥ 0 ∀j    (11)

This objective can be minimized analytically. While the inequality constraints in Equation 11 will be satisfied implicitly, we express the equality constraint using a Lagrangian:

    L(δ, λ) = − log Π_j δ_j^(e_j^−α) + λ (Σ_j δ_j − 1)    (12)
            = − Σ_j e_j^−α log δ_j + λ (Σ_j δ_j − 1)    (13)

Taking the partial derivative with respect to δ_k and setting it to 0 results in:

    ∂L/∂δ_k = − e_k^−α / δ_k + λ = 0  ∀k    (14)
    ⟹ − e_k^−α + λ δ_k = 0  ∀k    (15)

Summing Equation 15 over all k gives:

    − Σ_k e_k^−α + λ Σ_k δ_k = 0    (16)

Applying the constraint Σ_k δ_k = 1 and solving for λ yields:

    λ = Σ_k e_k^−α    (17)

Plugging this into Equation 14 and solving for δ_k results in:

    δ_k = e_k^−α / Σ_k e_k^−α    (18)

This brings us full circle to the intuitive result above: the weight of an n-gram is inversely proportional to a power of its average error.

δ_j weights by optimization

This section details the data-driven optimization algorithm introduced above. We tag each n-gram density function with a feature vector. This vector contains the ID of the n-gram density function, the quality properties, or both of these.
The vectors θ and φ are passed through the logisticfunction to ensure the final weights δ are in the interval [0,1]: δ θ j = + e − (cid:80) pk = φ k ( w j ) θ k (19)The goal of this approach is to assign values to θ such thatproperties that are predictive of low-error n-grams have highweight (equivalently, so that these n-grams have large δ θ j ).This is accomplished by minimizing an error function (builtatop the same SAE-based e i j as the previous method): θ ∗ ← argmin θ | D | (cid:88) i = (cid:80) w j ∈ m i e i j δ θ j (cid:80) w j ∈ m i δ θ j (20)After optimizing θ , we assign δ ∗ = δ θ ∗ . The numerator inEquation 20 computes the sum of mixture weights for eachn-gram density weighted by its error; the denominator sumsmixture weights to ensure that the objective function is nottrivially minimized by setting δ θ j to 0 for all j . Thus, to mini-mize Equation 20, n-gram densities with large errors must beassigned small mixture weights.Before minimizing, we first augment the error function inEquation 20 with a regularization term: Φ ( D , θ ) = | D | (cid:88) i = (cid:80) w j ∈ m i e i j δ θ j (cid:80) w j ∈ m i δ θ j + λ (cid:107) θ (cid:107) (21)The extra term is an (cid:96) -regularizer to encourage small valuesof θ to reduce overfitting; we set λ = (cid:49)(cid:51) We minimize Equation 21 using gradient descent. For brevity,let n i j = (cid:80) w j ∈ m i e i j δ θ j and d i j = (cid:80) w j ∈ m i δ θ j be the numera-tor and denominator terms from Equation 21. Then, the gradi-ent of Equation 21 with respect to θ k is ∂ Φ ∂θ k = | D | (cid:88) i = (cid:88) w j ∈ m i − φ k ( w j ) δ θ j (1 − δ θ j )( e i j d i j − n i j ) d i j + λθ k (22)We set Equation 22 to 0 and solve for θ using L-BFGS asimplemented in the SciPy Python package [18]. 
(Note thatby decomposing the objective function by n-grams, we needonly compute the error metrics e i j once prior to optimization.)Once θ is set, we then find δ according to Equation 19 and usethese values to find the message density in Equation 8.
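A minimal sketch of this optimization (Equations 19 through 22) follows, with plain gradient descent standing in for the paper's L-BFGS so the example needs no dependencies; the feature vectors, toy errors, and all names are hypothetical.

```python
import math

def delta_theta(phi_j, theta):
    # Equation 19: logistic weight for one n-gram's feature vector.
    z = sum(p * t for p, t in zip(phi_j, theta))
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(messages, phi, theta, lam, lr=0.1):
    # One gradient-descent step using Equation 22.
    # messages: list of lists of (ngram_id, error_eij) pairs.
    p = len(theta)
    grad = [2 * lam * t for t in theta]  # derivative of the L2 term
    for msg in messages:
        d = [delta_theta(phi[j], theta) for j, _ in msg]
        den = sum(d)                                  # d_i in the text
        num = sum(e * dj for (_, e), dj in zip(msg, d))  # n_i in the text
        for (j, e), dj in zip(msg, d):
            coef = dj * (1 - dj) * (e * den - num) / den ** 2
            for k in range(p):
                grad[k] += phi[j][k] * coef
    return [t - lr * g for t, g in zip(theta, grad)]

# Toy setup: n-gram 0 incurs low error, n-gram 1 high error;
# the feature vector is just the n-gram identity (one-hot).
phi = {0: [1.0, 0.0], 1: [0.0, 1.0]}
msgs = [[(0, 10.0), (1, 2000.0)]] * 5
theta = [0.0, 0.0]
for _ in range(200):
    theta = gradient_step(msgs, phi, theta, lam=0.01)
w_low_error = delta_theta(phi[0], theta)
w_high_error = delta_theta(phi[1], theta)
```

As intended, minimizing the objective drives the weight of the high-error n-gram toward 0 while the low-error n-gram keeps a weight near 1.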
9. APPENDIX: TOKENIZATION ALGORITHM
(The regularization weight λ above could instead be tuned on validation data; this should be explored.)

This section details our algorithm to convert a text string into a sequence of n-grams, used to tokenize the message text, user description, and user location fields into bigrams (i.e., n = 2).
1. Split the string into candidate tokens, each consisting of a sequence of characters with the same Unicode category and script. Candidates not of the letter category are discarded, and letters are converted to lower-case. For example, the string "Can't wait for 私の" becomes five candidate tokens: can, t, wait, for, and 私の.

2. Candidates in certain scripts are discarded either because they do not separate words with a delimiter (Thai, Lao, Khmer, and Myanmar, all of which have very low usage on Twitter) or because they may not really be letters (Common, Inherited). Such scripts pose tokenization difficulties which we leave for future work.

3. Candidates in the scripts Han, Hiragana, and Katakana are assumed to be Japanese and are further subdivided using the TinySegmenter algorithm [15]. (We ignore the possibility that text in these scripts might be Chinese, because that language has very low usage on Twitter.) This step would split 私の into 私 and の.

4. Create n-grams from adjacent tokens. Thus, the final tokenization of the example for n = 2 is: can, t, wait, for, 私, の, can t, t wait, wait for, for 私, and 私の.
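Steps 1 and 4 above can be sketched as follows. This is a simplified illustration: Unicode script handling (steps 2 and 3, including TinySegmenter) is omitted, and the function name is hypothetical.

```python
import unicodedata

def tokenize(text, n=2):
    # Step 1 (simplified): split into runs of letter-category characters,
    # lower-cased; non-letters act as delimiters and are discarded.
    tokens, run = [], []
    for ch in text:
        if unicodedata.category(ch).startswith("L"):
            run.append(ch.lower())
        elif run:
            tokens.append("".join(run))
            run = []
    if run:
        tokens.append("".join(run))
    # Step 4: emit all 1- to n-grams of adjacent tokens.
    ngrams = []
    for size in range(1, n + 1):
        ngrams += [" ".join(tokens[i:i + size])
                   for i in range(len(tokens) - size + 1)]
    return ngrams

grams = tokenize("Can't wait for it")
# tokens: can, t, wait, for, it -> 5 unigrams plus 4 bigrams
```

Note how the apostrophe splits "Can't" into the two candidates can and t, as in the worked example above.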
10. APPENDIX: RESULTS OF PILOT EXPERIMENTS
This section briefly describes three directions we explored but did not pursue in detail because they seemed to be of limited potential value.

• Unifying fields. Ignoring field boundaries slightly reduced accuracy, so we maintain these boundaries (i.e., the same n-gram appearing in different fields is treated as multiple, separate n-grams).

• Head trim. We tried sorting n-grams by frequency and removing various fractions of the most frequent n-grams. In some cases, this yielded a slightly better MCAE but also slightly reduced the success rate; therefore, we retain common n-grams.

• Map projection. We tried plate carrée (i.e., WGS84 longitude and latitude used as planar X and Y coordinates), Miller, and Mollweide projections. We found no consistent difference with our error- and optimization-based algorithms, though some others displayed variation in MPRA. Because this did not affect our results, we used plate carrée for all experiments, but future work should explore exactly when and why map projection matters.
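The Mollweide-based area computation used for PRA (see the metrics appendix) can be sketched as follows. This is a minimal illustration assuming a spherical Earth of radius 6371 km rather than the ellipsoid the paper describes; all names are hypothetical.

```python
import math

R_KM = 6371.0  # mean Earth radius (spherical simplification)

def mollweide(lat, lon):
    # Project (degrees) to Mollweide x, y in km. Mollweide is equal-area,
    # so planar areas under it approximate geodesic areas.
    phi, lam = math.radians(lat), math.radians(lon)
    theta = phi
    for _ in range(50):  # Newton's method: 2*theta + sin(2*theta) = pi*sin(phi)
        f = 2 * theta + math.sin(2 * theta) - math.pi * math.sin(phi)
        fp = 2 + 2 * math.cos(2 * theta)
        if abs(fp) < 1e-12:
            break
        theta -= f / fp
    x = R_KM * (2 * math.sqrt(2) / math.pi) * lam * math.cos(theta)
    y = R_KM * math.sqrt(2) * math.sin(theta)
    return x, y

def polygon_area_km2(latlon_points):
    # Shoelace formula on the projected vertices.
    pts = [mollweide(lat, lon) for lat, lon in latlon_points]
    s = 0.0
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

# A 1-degree by 1-degree cell at the equator covers about 12,360 km^2.
cell = [(-0.5, -0.5), (-0.5, 0.5), (0.5, 0.5), (0.5, -0.5)]
area = polygon_area_km2(cell)
```

For large polygons, edges should also be densified before projection, since straight lines on the ellipsoid become curves under Mollweide.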
11. REFERENCES
1. H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974. doi:10.1109/TAC.1974.1100705.
2. D. M. Blei et al. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
3. Annals of Statistics, 30(1), 2002.
4. In Proc. Privacy, Security, Risk and Trust (PASSAT), 2011. doi:10.1109/PASSAT/SocialCom.2011.120.
5. H. Chang et al. @Phillies tweeting from Philly? Predicting Twitter user locations with spatial word usage. In Proc. Advances in Social Networks Analysis and Mining (ASONAM), 2012. doi:10.1109/ASONAM.2012.29.
6. Z. Cheng et al. You are where you tweet: A content-based approach to geo-locating Twitter users. In Proc. Information and Knowledge Management (CIKM), 2010.
7. In Proc. Knowledge Discovery and Data Mining (KDD), 2011.
8. Transactions in GIS, 15(6), 2011. doi:10.1111/j.1467-9671.2011.01297.x.
9. M. Dredze. How social media will change public health. IEEE Intelligent Systems, 27(4):81–84, 2012. doi:10.1109/MIS.2012.76.
10. J. Eisenstein et al. A latent variable model for geographic lexical variation. In Proc. Empirical Methods in Natural Language Processing (EMNLP), 2010.
11. J. Eisenstein et al. In Proc. Machine Learning (ICML), 2011.
12. S. Geisser. Predictive Inference: An Introduction. Chapman and Hall, 1993.
13. J. Gelernter and N. Mushegian. Geo-parsing messages from microtext. Transactions in GIS, 15(6):753–773, 2011. doi:10.1111/j.1467-9671.2011.01294.x.
14. M. C. González et al. Understanding individual human mobility patterns. Nature, 453(7196):779–782, 2008. doi:10.1038/nature06958.
15. M. Hagiwara. TinySegmenter in Python. http://lilyx.net/tinysegmenter-in-python/.
16. B. Hecht et al. Tweets from Justin Bieber's heart: The dynamics of the location field in user profiles. In Proc. CHI, 2011.
17. L. Hong et al. In Proc. WWW, 2012.
18. SciPy: Open source scientific tools for Python. http://scipy.org/.
19. In Proc. Workshop on Search and Mining User-Generated Content (SMUC), 2011.
20. In Proc. ICWSM, 2012.
21. In Proc. Information Systems for Crisis Response and Management (ISCRAM), 2012.
22. G. McLachlan and D. Peel. Finite Mixture Models. Wiley & Sons, 2005.
23. G. W. Milligan and M. C. Cooper. Psychometrika, 50(2):159–179, 1985. doi:10.1007/BF02294245.
24. R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2), 2000.
25. Information Retrieval, 16(1):1–33, 2012. doi:10.1007/s10791-012-9195-y.
26. S. Paradesi. Geotagging tweets using their content. In Proc. Florida Artificial Intelligence Research Society (FLAIRS), 2011.
27. F. Pedregosa et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
28. S. Roller et al. In Proc. Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2012.
29. CACM, 54(3):18–20, 2011.
30. In Proc. ICWSM, 2013.
31. G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.
32. In Proc. Data Mining Workshops (ICDMW), 2012. doi:10.1109/ICDMW.2012.128.
33. C. Wang et al. Mining geographic knowledge using location aware topic model. In Proc. Workshop on Geographical Information Retrieval (GIR), 2007.
34. B. Wing and J. Baldridge. In Proc. Association for Computational Linguistics (ACL), 2011.