A Probabilistic Framework for Location Inference from Social Media
Yujie Qian, Jie Tang, Zhilin Yang, Binxuan Huang, Wei Wei, Kathleen M. Carley
Yujie Qian
Massachusetts Institute of Technology
[email protected]

Jie Tang
Tsinghua University
[email protected]

Zhilin Yang
Carnegie Mellon University
[email protected]

Binxuan Huang
Carnegie Mellon University
[email protected]

Wei Wei
Carnegie Mellon University
[email protected]

Kathleen M. Carley
Carnegie Mellon University
[email protected]
ABSTRACT

We study the extent to which we can infer users' geographical locations from social media. Location inference from social media can benefit many applications, such as disaster management, targeted advertising, and news content tailoring. The challenges, however, lie in the limited amount of labeled data and the large scale of social networks. In this paper, we formalize the problem of inferring location from social media into a semi-supervised factor graph model (SSFGM). The model provides a probabilistic framework in which various sources of information (e.g., content and social network) can be combined together. We design a two-layer neural network to learn feature representations, and incorporate the learned latent features into SSFGM. To deal with the large-scale problem, we propose a Two-Chain Sampling (TCS) algorithm to learn SSFGM. The algorithm achieves a good trade-off between accuracy and efficiency. Experiments on Twitter and Weibo show that the proposed TCS algorithm for SSFGM substantially improves the inference accuracy over several state-of-the-art methods. More importantly, TCS achieves over 100x speedup compared with traditional propagation-based methods (e.g., loopy belief propagation).

KEYWORDS
Location Inference, Social Media, Factor Graph Model
INTRODUCTION

In social media platforms such as Twitter, Facebook, and Weibo, location is an important demographic attribute for supporting friend and message recommendation [5, 33]. For example, statistics show that the average number of friendships between users from the same time zone is about 50 times higher than the number between users three time zones apart [18]. This geographical information, however, is usually unavailable. Cheng et al. [8] show that only 26.0% of users on Twitter input their locations. Furthermore, of the locations that are user-supplied, many are ambiguous or incorrect. Twitter, Facebook, and Weibo all allow per-tweet geo-tags; however, it turns out that only 0.42% of all tweets contain a geo-tag [8].

In this work, we aim to find an effective and efficient way to automatically infer users' geographical locations from social media data. Different from previous works [5, 8, 29] that deal with this problem in a specific scenario (e.g., determining US cities only) or with specific data (e.g., Twitter), we propose a method that is general enough to apply to diverse scenarios. This brings several new challenges:

- Limited labeled data. Only a small portion of users have location information, and all the others are unlabeled. It is necessary to design a principled way to learn from both the labeled data and the large amount of unlabeled data.
- Large-scale network. Our problem has strong network correlation, but how to leverage the correlation, particularly in a large-scale network, is challenging.
- Model flexibility. The proposed model should be flexible enough to be easily generalized to other scenarios and to incorporate various information (e.g., content, structure, and deep features).
Previous work on location inference.
The location inference problem has been studied by researchers from different communities. Surveys of location inference techniques on Twitter and related data challenges can be found in [2, 15, 22]. Roughly speaking, existing literature can be divided into two categories. The first category of research focuses on studying content. For example, Cheng et al. [8] and Han et al. [14] used a probabilistic framework and illustrated how to find local words and overcome tweet sparsity. Eisenstein et al. [10] proposed the Geographic Topic Model to predict a user's geo-location from text and topics. Ryoo et al. [39] applied a similar idea to a Korean Twitter dataset. Ikawa et al. [19] used a rule-based approach to predict a user's current location based on former tweets. Wing et al. [41] and Roller et al. [38] proposed information retrieval approaches with geographic grids. The other line of research infers user locations using network structure information. For example, Backstrom et al. [5] assumed that an unknown user would be co-located with one of their friends and sought the location with the maximum probability. McGee et al. [30] integrated social tie strengths between users to improve location estimation. Jurgens [21] and Davis Jr et al. [9] used the idea of label propagation to infer user locations according to their network distances from users with known locations. These methods, however, do not consider content. Li et al. [29] proposed a unified discriminative influence model that utilizes both the content and the social network, but they focused on US users and only considered the location names mentioned in tweets. Rahimi et al. [35, 36] used a simple hybrid approach to combine predictions from content and network. Recently, Miura et al. [32] proposed a recurrent neural network model for learning content representations and integrated user network embeddings. Another study using user profiles can be found in [44]. Table 1 summarizes the most related works on location inference. None of the aforementioned methods, however, addresses all the challenges listed above.

Table 1: Summary of previous studies on geo-location inference in social media.

Features            Authors                  Scope   Language  Object
Content             Cheng et al. [8]         US      All       User
                    Han et al. [14]          World   English   User
                    Eisenstein et al. [10]   US      All       User
                    Ryoo et al. [39]         Korea   Korean    User
                    Ikawa et al. [19]        Japan   Japanese  Tweet
                    Wing et al. [41]         US      English   User
                    Roller et al. [38]       US      English   User
Network             Backstrom et al. [5]     World   All       User
                    McGee et al. [30]        World   All       User
                    Jurgens [21]             World   All       User
                    Davis Jr et al. [9]      World   All       User
Content + Network   Li et al. [29]           US      English   User
                    Miura et al. [32]        World   English   User
Profile             Zubiaga et al. [44]      World   All       Tweet
Problem formulation.
We now give a formalization to precisely define the problem we are dealing with. Without loss of generality, our input can be considered as a partially labeled network G = (V, E, X, Y^L) derived from the social media data. V denotes a set of |V| = N users, V^L ⊂ V denotes the subset of labeled users (with locations), V^U = V \ V^L denotes the subset of unlabeled users (without locations), E ⊆ V × V is the set of relationships between users, Y^L corresponds to the locations of users in V^L, and X is the feature matrix associated with the users in V, where each row corresponds to a user and each column corresponds to a feature. Given the input, the problem of inferring user locations can be defined as follows:

Problem 1. Geo-location Inference. Given a partially labeled network G = (V, E, X, Y^L), the objective is to learn a predictive function F in order to predict the locations of the unlabeled users V^U:

F : G = (V, E, X, Y^L) → Y^U    (1)

where Y^U is the set of predicted locations for the unlabeled users V^U.

It is worth noting that our formulation of user location inference is slightly different from that in the aforementioned work. The task is defined as a semi-supervised learning problem for networked data: we have a network with a limited number of labeled nodes and a large number of unlabeled nodes. Our goal is to leverage both the local attributes X and the network structure E to learn the predictive function F. Moreover, we assume that all predicted locations are among the locations occurring in the labeled set Y^L. It is also worth mentioning that a user may have multiple locations; here we focus on predicting one's primary location (e.g., home or the location in one's profile).
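For concreteness, a minimal sketch of the input structure G = (V, E, X, Y^L) might look as follows; the class and field names are illustrative assumptions, not taken from the paper's released code.

```python
import numpy as np

class PartiallyLabeledNetwork:
    """Sketch of the input G = (V, E, X, Y^L)."""
    def __init__(self, num_users, num_features):
        self.N = num_users
        self.X = np.zeros((num_users, num_features))  # feature matrix, one row per user
        self.edges = []      # E: list of (i, j) user-index pairs
        self.labels = {}     # Y^L: user index -> location id, labeled users only

    def labeled(self):
        return sorted(self.labels)  # V^L

    def unlabeled(self):
        return [i for i in range(self.N) if i not in self.labels]  # V^U = V \ V^L
```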
Our solution and contributions.

In this paper, we propose a probabilistic framework based on factor graphs to address the location inference problem. It is infeasible, however, to directly apply traditional factor graphs, due to the new challenges in our problem. Our goal is to achieve a good trade-off between accuracy and efficiency, and also to make the model scalable to large networks. Our contributions in this work can be summarized as follows:

- We present a semi-supervised factor graph model (SSFGM), which learns to infer user locations using both labeled and unlabeled data.
- By incorporating network structures and deep feature representations, SSFGM substantially improves the inference accuracy over several state-of-the-art methods.
- We propose a Two-Chain Sampling (TCS) algorithm to learn the SSFGM. TCS achieves over 100x speedup compared with the traditional loopy belief propagation method. All code and data used in this work are publicly available.

We conduct systematic experiments on different genres of datasets, including Twitter and Weibo. The results show that the proposed model significantly improves location inference accuracy. In terms of training time, the proposed TCS algorithm is very efficient, requiring less than two hours on million-scale networks.
MODEL FRAMEWORK

In this section, we propose a semi-supervised framework based on factor graphs for location inference from social media.
Basic intuitions.
For inferring user locations, we have three basic intuitions. First, a user's profile may contain implicit information about the user's location, such as the time zone and the language selected by the user. Second, the tweets posted by a user may reveal the user's location. For example, Table 2 lists the most popular "local" words in five English-speaking countries. These words include cities (Melbourne, Dublin), organizations (HealthSouth, UMass), sports (hockey, rugby), local idioms (Ctfu, wyd, lad), etc. Third, network structure can be very helpful for geo-location inference. On Twitter, for example, users can follow each other, retweet each other's tweets, and mention other users in their tweets. The principle of homophily [27], i.e., "birds of a feather flock together" [31], suggests that these "connected" users may come from the same place. This tendency has been observed among Twitter reciprocal friends [18, 26]. Moreover, we find that the homophily phenomenon also exists in the mention network. Table 3 shows the statistics for US, UK, and China Twitter users. We can see that when user A mentions (@) user B, the probability that A and B come from the same country is significantly higher than the probability that they come from different countries. Interestingly, when a US user A mentions another user B on Twitter, the chance that user B is also from the US is 95%; if user A comes from the UK, the probability drops sharply to 85%, and it drops further to 80% for users from China. We also computed state-level statistics for US users and found an 82.13% chance that users A and B come from the same state if one mentions the other in her/his tweets.
Model illustration.
Based on the above intuitions, we propose a Semi-Supervised Factor Graph Model (SSFGM) for location inference. Figure 1 shows the graphical representation of the SSFGM.

https://github.com/thomas0809/SSFGM
Figure 1: Graphical representation of the proposed Semi-Supervised Factor Graph Model (SSFGM). (Left: a partially labeled input network with observations x_i, latent variables y_i, attribute factors f(x_i, y_i), and correlation factors h(y_i, y_j). Right: how attribute features and the deep factor g(x_i, y_i) are combined.)

Table 2: Popular location indicative words in tweets posted by users from different countries.
US           UK       Canada     Australia  Ireland
HealthSouth  Leeds    Calgary    Melbourne  Dublin
UMass        Used     Toronto    Sydney     Ireland
Montefiore   Railway  Vancouver  Australia  Irish
Ctfu         xxxx     Ontario    9am        Hum
ACCIDENT     whilst   Canadian   Type       lads
Panera       listed   Canada     ℃          lad
MINOR        Xx       BC         hPa        xxx
wyd          Xxx      hockey     Centre     rugby
Kindred      tbh      Available  ESE        Xxx
hmu          xx       NB         mm         xxxx

* Top-10 words by mutual information [42], among words occurring a minimum number of times.

Table 3: Who will a Twitter user mention (@)? (User A mentions User B; each column lists the top-5 countries of User B.)
User A: US            User A: UK            User A: China
US         95.05%     UK         85.69%     China      80.37%
Indonesia   0.77%     US          5.12%     Indonesia   7.89%
UK          0.75%     Nigeria     3.03%     US          5.97%
Canada      0.61%     Indonesia   1.00%     Korea       0.96%
Mexico      0.27%     Ireland     0.49%     Japan       0.71%
The graphical model SSFGM consists of two kinds of variables: observations {x_i} and latent variables {y_i}. In our problem, each user v_i corresponds to an observation x_i and is also associated with a latent variable y_i. The observation x_i represents the user's personal attributes and tweet content, and the latent variable y_i represents the user's location. In this paper, we consider location inference as a classification problem, i.e., y_i ∈ {1, ..., C}, where a class can be the user's country, state, or city, and C is the number of possible location categories. We denote Y = {y_1, y_2, ..., y_N}; Y can be divided into a labeled set Y^L and an unlabeled set Y^U. The latent variables {y_i}, i = 1, ..., N, are correlated with each other, representing the social relationships between users. In SSFGM, such correlations can be defined as factor functions.

Now we explain the SSFGM in detail. Given a partially labeled network as input, we define two factor functions:

- Attribute factor: f(x_i, y_i) represents the relationship between the observation (features) x_i and the latent variable y_i;
- Correlation factor: h(y_i, y_j) denotes the correlation between the locations of users v_i and v_j.

The factor functions can be instantiated in different ways. In this paper, we define the attribute factor as an exponential-linear function:

f(x_i, y_i) = exp(α^⊤ Φ(x_i, y_i)),  Φ_k(x_i, y_i) = 1(y_i = k) x_i,  k ∈ {1, ..., C}    (2)

where α = (α_1, ..., α_C)^⊤ is the weighting vector, Φ = (Φ_1, ..., Φ_C)^⊤ is the vector of feature functions, and 1(y_i = k) is an indicator function equal to 1 when y_i = k and 0 otherwise.

The correlation factor is defined as

h(y_i, y_j) = exp(γ^⊤ Ω(y_i, y_j))    (3)

where γ is also a weighting vector, and Ω represents the feature functions Ω_kl(y_i, y_j) = 1(y_i = k, y_j = l) w_ij; here w_ij can be any feature associated with users v_i and v_j, such as the number of interactions. Correlation can be directed (e.g., mention) or undirected (e.g., reciprocal follow, Facebook friend). For undirected correlation, we need to guarantee γ_kl = γ_lk in the model.
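To make Eqs. 2 and 3 concrete, the following minimal NumPy sketch evaluates the two potentials in log space. The array shapes are illustrative assumptions (alpha as a C x d matrix, gamma as a C x C matrix), not the paper's released implementation.

```python
import numpy as np

def log_attribute_factor(alpha, x_i, y_i):
    # alpha has shape (C, d); since Phi_k(x_i, y_i) = 1(y_i = k) * x_i,
    # alpha^T Phi(x_i, y_i) reduces to the dot product alpha[y_i] . x_i
    return alpha[y_i] @ x_i

def log_correlation_factor(gamma, y_i, y_j, w_ij):
    # gamma has shape (C, C); Omega_kl = 1(y_i = k, y_j = l) * w_ij,
    # so gamma^T Omega reduces to gamma[y_i, y_j] * w_ij.
    # For undirected correlations, gamma should be kept symmetric.
    return gamma[y_i, y_j] * w_ij
```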
Model enhancement with deep factors.

We now introduce how to utilize deep neural networks to enhance the proposed SSFGM, which also demonstrates the flexibility of the model. We incorporate a deep factor g(x_i, y_i) in SSFGM to represent the deep (non-linear) association between x_i and y_i. The right side of Figure 1 illustrates how we combine the predefined attributes and the deep factor in SSFGM.

Specifically, our deep factor is a two-layer neural network. The input vector x is fed into a neural network with two fully-connected layers, denoted h_1(x) and h_2(x):

h_1(x) = ReLU(W_1 x + b_1)
h_2(x) = ReLU(W_2 h_1(x) + b_2)    (4)

where W_1, W_2, b_1, b_2 are parameters of the neural network, and we use ReLU(x) = max(0, x) [13] as the activation function. Similar to the definition of the attribute factor, we define

g(x_i, y_i) = exp(β^⊤ Ψ(x_i, y_i)),  Ψ_k(x_i, y_i) = 1(y_i = k) h_2(x_i),  k ∈ {1, ..., C}    (5)

where β is the weighting vector for the output of the neural network.

Thus, we define the following joint distribution over Y:

p(Y | X) = (1/Z) ∏_{v_i ∈ V} f(x_i, y_i) g(x_i, y_i) ∏_{(v_i, v_j) ∈ E} h(y_i, y_j)    (6)

where Z is the normalization factor that ensures Σ_Y p(Y | X) = 1.
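A minimal NumPy sketch of the deep factor (Eqs. 4 and 5) in log space follows; the parameter names and shapes are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def log_deep_factor(params, x_i, y_i):
    # Two fully-connected layers (Eq. 4), then the class-specific
    # weighting beta (Eq. 5): Psi_k(x_i, y_i) = 1(y_i = k) * h2(x_i).
    W1, b1, W2, b2, beta = params     # beta has shape (C, hidden2)
    h1 = relu(W1 @ x_i + b1)
    h2 = relu(W2 @ h1 + b2)
    return beta[y_i] @ h2
```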
Feature Definitions.

For the attribute factor, we define two categories of features: profile and content.
Profile features include information from the user profiles, such as time zone, user-selected language, gender, age, number of followers and followees, etc.
Content features capture the characteristics of tweet content. The easiest way to define content features is to use a bag-of-words representation, but it suffers from sparsity and high dimensionality, especially on Twitter, which hosts hundreds of languages.

In our work, we employ Mutual Information (MI) [42] to represent the content. Given a word w and a location c, the Mutual Information between them is computed as

MI(w, c) = log [ p(w, c) / (p(w) p(c)) ] ≈ log [ count(w, c) · n / (count(w) · count(c)) ]    (7)

where count(w, c) is the number of tweets that are posted at location c and contain the word w, count(w) is the number of tweets containing the word w, count(c) is the number of tweets posted at location c, and n is the total number of tweets in the training data. We pre-compute the MI between each word and each location using the training corpus, and define the content features for each user as the aggregated MI. We use two aggregation approaches, max and average, i.e.,

MI_max(v, c) = max_{w ∈ T(v)} MI(w, c)
MI_average(v, c) = (1 / |T(v)|) Σ_{w ∈ T(v)} MI(w, c)    (8)

where T(v) represents all the words from the tweets posted by user v. We then use the aggregated MIs as the input content features for our model. A minimal sketch of this computation follows.
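The sketch below computes Eq. 7 from a training corpus of (location, words) tweets and then aggregates per-user features as in Eq. 8; the input format and function names are illustrative assumptions.

```python
from collections import Counter
from math import log

def compute_mi(tweets):
    """tweets: list of (location, [words]); returns {(word, loc): MI}."""
    n = len(tweets)
    cnt_w, cnt_c, cnt_wc = Counter(), Counter(), Counter()
    for c, words in tweets:
        cnt_c[c] += 1
        for w in set(words):            # count each word once per tweet
            cnt_w[w] += 1
            cnt_wc[(w, c)] += 1
    return {(w, c): log(cnt_wc[(w, c)] * n / (cnt_w[w] * cnt_c[c]))
            for (w, c) in cnt_wc}

def mi_features(mi, user_words, c):
    """Aggregate MI over T(v) as MI_max and MI_average (Eq. 8)."""
    scores = [mi.get((w, c), 0.0) for w in user_words] or [0.0]
    return max(scores), sum(scores) / len(scores)
```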
MODEL LEARNING

We now introduce how to tackle the learning problem in SSFGM. We first present the learning objective and gradient derivation, and then propose our Two-Chain Sampling algorithm.

Learning objective and gradient derivation.
Learning a Semi-Supervised Factor Graph Model involves two parts: learning the parameters α, β, γ of the graphical model, and learning the parameters W, b of the neural network in the deep factor. In this paper, we learn the two parts jointly.

We follow maximum likelihood estimation (MLE) to learn the graphical model. For notational simplicity, we rewrite the joint probability (Eq. 6) as

p(Y | X) = (1/Z) ∏_i exp(θ^⊤ s(y_i)) = (1/Z) exp(θ^⊤ S(Y))    (9)

where θ = (α^⊤, β^⊤, γ^⊤)^⊤ are the factor graph model parameters to estimate, s(y_i) = (Φ(x_i, y_i)^⊤, Ψ(x_i, y_i)^⊤, Σ_{y_j} Ω(y_i, y_j)^⊤)^⊤, and S(Y) = Σ_i s(y_i). The input of SSFGM is partially labeled, which makes model learning very challenging. The general idea is to maximize the marginal likelihood of the labeled data. We denote by Y | Y^L a label configuration that satisfies all the known labels. Then we can define the following MLE objective function O(θ):

O(θ) = log p(Y^L | X) = log Σ_{Y | Y^L} (1/Z) exp(θ^⊤ S) = log Σ_{Y | Y^L} exp(θ^⊤ S) − log Σ_Y exp(θ^⊤ S)    (10)

The learning problem is now cast as finding the parameter configuration that maximizes the objective function, i.e.,

θ̂ = argmax_θ log p(Y^L | X)    (11)

We can use gradient descent to solve this optimization problem. First, we derive the gradient with respect to θ:

∂O(θ)/∂θ = [Σ_{Y | Y^L} exp(θ^⊤ S) · S] / [Σ_{Y | Y^L} exp(θ^⊤ S)] − [Σ_Y exp(θ^⊤ S) · S] / [Σ_Y exp(θ^⊤ S)]
         = E_{p_θ(Y | Y^L, X)}[S] − E_{p_θ(Y | X)}[S]    (12)

To learn the neural network parameters in the deep factor, we derive the gradients of the top layer of the neural network similarly to Eq. 12, and then follow the standard backpropagation algorithm to update the parameters. Similar methods have been studied in [4]; in the following we mainly discuss how to learn the graphical model.

In Eq. 12, the gradient equals the difference of two expectations under two different distributions. The first, p_θ(Y | Y^L, X), is the model distribution conditioned on the labeled data; the second, p_θ(Y | X), is the unconditional model distribution. Both are intractable and cannot be computed directly [40]. We illustrate how to deal with this challenge in the rest of this section.

Loopy Belief Propagation (LBP) [34].
A traditional approach is LBP, an algorithm for approximately estimating marginal probabilities in graphical models. It performs message passing between variable nodes and factor nodes according to the sum-product rule [25]. In each step of gradient descent, we need to perform LBP twice, to estimate p_θ(Y | X) and p_θ(Y | Y^L, X) respectively, and then calculate the gradient according to Eq. 12.

However, the LBP-based learning algorithm is computationally expensive. Its time complexity is O(I_1 I_2 (|V| C + |E| C^2)), where I_1 is the number of iterations for gradient descent, I_2 is the number of iterations for loopy belief propagation, and C is the number of location categories (usually 30-200). This algorithm is very time-consuming, and not applicable when we have millions of users and edges.

Softmax Regression (SR).
We now try to solve the learning challenge in large-scale factor graphs. It is difficult to calculate the joint probability in Eq. 9 because the normalization factor Z sums over all possible configurations of Y. However, if we consider a single variable y_i and assume all the other variables are fixed, its conditional probability can be easily calculated by a softmax function:

p(y_i | X, Y \ {y_i}) = exp(θ^⊤ s(y_i)) / Σ_{y'_i} exp(θ^⊤ s(y'_i))    (13)

Eq. 13 has the same form as softmax regression (also called multinomial logistic regression). The difference is that the neighborhood information is captured in the feature function s(y_i). Softmax regression can be trained using gradient descent, and its gradient is much easier to compute than that of factor graph models. We therefore design an approximate learning algorithm based on softmax regression:

Step 1. Conduct softmax regression to learn α and β, with the labeled data {(x_i, y_i) | y_i ∈ Y^L} only. (Here we assume p(y_i | x_i) = softmax(α^⊤ Φ(x_i, y_i) + β^⊤ Ψ(x_i, y_i)).)
Step 2. Predict the labels Y^U for unlabeled users.
Step 3. Conduct softmax regression to learn θ according to Eq. 13.
Step 4. Predict the labels Y^U for unlabeled users. If the prediction accuracy on the validation set increases, go to Step 3; otherwise, stop.

This algorithm is an efficient approximation method for learning SSFGM, but its performance can be further improved. We can use SR to initialize the model parameters for the other learning algorithms. A sketch of this loop follows.
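Below is a minimal sketch of Steps 1-4, assuming NumPy arrays and a hypothetical SoftmaxRegression-style model: fit/predict use attribute features only (Steps 1-2), while the *_with_neighbors variants stand in for training and predicting with the neighborhood-aware features s(y_i) of Eq. 13 (Steps 3-4). All helper names are illustrative.

```python
def train_sr(model, X, labeled_idx, y_labeled, val_idx, y_val):
    model.fit(X[labeled_idx], y_labeled)          # Step 1: labeled data only
    y_pred = model.predict(X)                     # Step 2: fill in Y^U
    y_pred[labeled_idx] = y_labeled
    best_acc = 0.0
    while True:
        # Step 3: learn theta per Eq. 13, neighbor labels taken from y_pred
        model.fit_with_neighbors(X, y_pred, labeled_idx, y_labeled)
        # Step 4: re-predict Y^U and check validation accuracy
        y_pred = model.predict_with_neighbors(X, y_pred)
        y_pred[labeled_idx] = y_labeled
        acc = float((y_pred[val_idx] == y_val).mean())
        if acc <= best_acc:
            break
        best_acc = acc
    return model
```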
Two-Chain Sampling (TCS).

Now we introduce the proposed TCS algorithm, a novel Markov Chain Monte Carlo (MCMC) method [3], for efficiently learning SSFGM. MCMC has proven successful in learning complex graphical models. For example, Rohanimanesh et al. proposed the SampleRank algorithm to train factor graphs [37]. However, SampleRank has some shortcomings: it actually optimizes an alternative max-margin objective instead of the original maximum likelihood objective, and it relies on an external metric (e.g., accuracy), which can be arbitrary and engineering-oriented, since multiple metrics are often available for evaluation.

We propose a new method to directly optimize the maximum likelihood objective (Eq. 10) without using additional heuristic metrics. We refer to this algorithm as Two-Chain Sampling, summarized in Algorithm 1. The key idea behind TCS is that we generate two Markov chains and, in each sampling step, use an approach similar to contrastive divergence (CD) [17] to compute the gradient.

Mathematically, the gradient we are estimating (Eq. 12) consists of two expectation terms. To obtain an unbiased estimation, we construct two Markov chains Y^(1) and Y^(2). Specifically, we sample Y^(1) from p_data = p_θ(Y | X, Y^L) and sample Y^(2) from p_model = p_θ(Y | X). Various samplers could be applied here; we choose Gibbs sampling [12] in this work. (We also tried other sampling methods such as Metropolis-Hastings sampling [16], and chose Gibbs sampling for its efficiency.)
Algorithm 1: Two-Chain Sampling (TCS)
Input: G = (V, E), X, Y^L, learning rate η
Output: learned parameters θ

  Initialize θ randomly
  Initialize Y^(1) with Y^L fixed and Y^U random; initialize Y^(2) randomly
  repeat
    Randomly split V into mini-batches {B_1, ..., B_K}
    for k = 1, 2, ..., K do
      Initialize the gradient δ ← 0
      for v_i ∈ B_k do
        Sample y_i^(1) in Y^(1) such that Y^(1) ~ p_θ(Y | X, Y^L)
        Sample y_i^(2) in Y^(2) such that Y^(2) ~ p_θ(Y | X)
        if y_i ∈ Y^L then
          δ ← δ + s(y_i | Y^(1)) − E[s(y_i) | Y^(2) \ {y_i}]
        else
          δ ← δ + E[s(y_i) | Y^(1) \ {y_i}] − E[s(y_i) | Y^(2) \ {y_i}]
      θ ← θ + η · δ
    Evaluate on the validation set
  until early stopping criteria satisfied

In each sampling step, Gibbs sampling updates a single variable y_i while the other variables are fixed. In other words, we sample y_i according to the distribution defined in Eq. 13, but use the neighbours' values from Y^(1) and Y^(2) respectively in the two chains. Note that when we update y_i of a labeled user in the chain Y^(1) (i.e., y_i ∈ Y^L), its value is never changed from its true label: since Y^(1) follows p_θ(Y | X, Y^L), all known labels must stay fixed.

It is non-trivial to calculate the gradient during the sampling process. A standard approach is to keep sampling for a number of iterations and then use the resulting distribution to approximate the expectation values. However, MCMC typically requires too many iterations to converge, which makes this impractical for training large factor graph models. Fortunately, as suggested by the contrastive divergence algorithm [17], we do not have to wait for convergence; a few sampling steps (or even a single step) are usually effective. Moreover, in the spirit of stochastic gradient descent (SGD) [6], we can sample only a small subset of variables each time instead of all of them. We therefore first randomly split the user set into fixed-size mini-batches. In each step, we sample the variables y_i in one mini-batch, compute the gradient, and update the parameters. The gradient can be approximated as Σ_i s(y_i | Y^(1)) − s(y_i | Y^(2)), where the summation is taken over the mini-batch. Empirically this is feasible, but the learning process can become unstable. To improve stability, we change the gradient computation to Σ_i E[s(y_i) | Y^(1) \ {y_i}] − E[s(y_i) | Y^(2) \ {y_i}], i.e., the expectations under the distribution in Eq. 13. Again, the first expectation is simply s(y_i | Y^(1)) if y_i is a known label; this is made explicit in Algorithm 1 with the "if-then-else" statement. In practice, it is usually necessary to downsample the unlabeled data if they significantly outnumber the labeled data.

We use early stopping to determine when to stop training. Specifically, we divide the labeled data into a training set and a validation set, and use only the training labels during learning. We evaluate the model after each epoch (a complete pass through the dataset); if the prediction accuracy on the validation set does not increase for ε epochs, we stop the algorithm and return the parameter configuration θ̂ that achieves the best validation accuracy. Here ε is a hyperparameter.
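The following NumPy sketch implements one TCS mini-batch update from Algorithm 1. It assumes two helpers that are illustrative, not from the released implementation: log_potential(i, y, Y) returns θ^⊤ s(y_i) for candidate label y given the other variables in Y (Eq. 13), and s(i, y, Y) returns the corresponding feature vector.

```python
import numpy as np

def conditional(i, Y, log_potential, C):
    # Eq. 13: softmax over the C candidate locations for user i,
    # with all other variables in Y held fixed.
    logits = np.array([log_potential(i, y, Y) for y in range(C)])
    p = np.exp(logits - logits.max())
    return p / p.sum()

def tcs_minibatch_step(batch, Y1, Y2, labels, log_potential, s, C, theta, eta):
    delta = np.zeros_like(theta)
    for i in batch:
        p1 = conditional(i, Y1, log_potential, C)
        p2 = conditional(i, Y2, log_potential, C)
        # expectation of s(y_i) under Eq. 13 in the unconditional chain Y2
        e2 = sum(p2[y] * s(i, y, Y2) for y in range(C))
        if i in labels:
            Y1[i] = labels[i]                  # chain Y1 keeps known labels fixed
            delta += s(i, Y1[i], Y1) - e2
        else:
            Y1[i] = np.random.choice(C, p=p1)  # Gibbs step in the conditioned chain
            e1 = sum(p1[y] * s(i, y, Y1) for y in range(C))
            delta += e1 - e2
        Y2[i] = np.random.choice(C, p=p2)      # Gibbs step in the unconditional chain
    return theta + eta * delta                 # ascend the log-likelihood (Eq. 12)
```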
Compared with LBP and SR, the TCS algorithm directly optimizes the MLE objective and is very time-efficient. Focusing on the semi-supervised learning setting on a partially labeled factor graph, it simultaneously maintains two Markov chains and provides an elegant way to perform gradient estimation.

Parallel learning.
To scale the proposed model up to large networks, we have developed parallel learning algorithms for SSFGM. For the SR algorithm, softmax regression can be easily parallelized: the gradient is a summation over all training instances (or over a mini-batch when using SGD), and the computation is independent across instances. For TCS, we can likewise parallelize the computation over the instances in a mini-batch. The only difference is that instead of sampling the variables one by one as in the sequential setting, we sample a mini-batch of variables simultaneously in the parallel setting. This variation is usually called the blocked Gibbs sampler [20] and does not change the original properties of Gibbs sampling.
Prediction.
SSFGM is learned in a semi-supervised way: both labeled and unlabeled instances are taken as input during training. After learning the parameters, we predict the labels of the unlabeled instances. Alternatively, we can also apply the learned SSFGM in an inductive setting, i.e., to predict future unknown instances. For prediction, the task is to find the most likely configuration Ŷ for unlabeled users based on the learned parameters θ̂:

Ŷ = argmax_{Y | Y^L} p_θ̂(Y | X, Y^L)    (14)

We again use the sampling method to obtain predictions. In principle, we could keep sampling with the estimated θ̂ and return the configuration Ŷ with the maximum likelihood. In practice, however, we simply choose the value with the highest probability in each sampling step. This only guarantees finding a local optimum, but it is usually effective enough and much faster. (Cf. § 3 for details.)
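A minimal sketch of this greedy pass, reusing the illustrative log_potential and conditional helpers assumed in the TCS sketch above:

```python
import numpy as np

def predict(unlabeled, Y, log_potential, C, num_sweeps=10):
    # Greedy approximation of Eq. 14: each step assigns y_i its
    # highest-probability value under Eq. 13, neighbors held fixed.
    for _ in range(num_sweeps):
        for i in unlabeled:
            Y[i] = int(np.argmax(conditional(i, Y, log_potential, C)))
    return Y
```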
EXPERIMENTS

We evaluate the proposed model on data from two different social media platforms: Twitter and Weibo.

Datasets.
We construct three datasets for our experiments. Table 4 shows their basic statistics.

Table 4: Statistics of the datasets.

Dataset          #Users     #Edges      #Locations
Twitter (World)  1,480,360  25,867,610  159 (a)
Twitter (USA)    329,457    3,194,305   51 (b)
Weibo            1,073,923  26,849,122  34 (c)

(a) 159 countries; (b) 50 states and Washington, D.C.; (c) 34 provinces.

- Twitter (World): We collect geo-tagged tweets posted in 2011 through the Twitter API. There are 243,000,000 tweets posted by 3,960,000 users in our collected data. After data preprocessing, we obtain a dataset consisting of 1.5 million users from 159 countries. The task on this dataset is to infer the user's country. Due to the limitations of the Twitter API, we cannot crawl following relationships; we therefore use mentions ("@") in tweets to derive the relationships.
- Twitter (USA): This dataset is constructed from the same raw data as Twitter (World), but keeps only the USA users. The task on this dataset is to infer the user's state.
- Weibo [43]: Weibo is the most popular Chinese microblog. The original dataset consists of about 1,700,000 users, with up to 1,000 of the most recent microblogs posted by each user. The task is to infer the user's province. We use reciprocal following relationships as edges in this dataset.

We preprocess the three datasets in the following ways. First, we filter out users who have fewer than 10 tweets. Then, we tokenize the tweet content into words: on Twitter, we split sentences by punctuation and spaces, and for languages that do not use spaces to separate words (such as Chinese and Japanese), we split on each character. In the Weibo data provided by [43], the content has already been tokenized into Chinese words. For each user, we combine all her/his tweets and derive the content features defined in Eq. 8. The ground truth location is defined differently for each dataset. In the two Twitter datasets, we convert the GPS tags on tweets to their country/state, and keep only the users who posted all their tweets in the same country/state in order to reduce noise in the training data. (In our data, more than 90% of users posted all their tweets in the same country in a year, and more than 80% of USA users posted all their tweets in the same state.) In Weibo, the ground truth locations are extracted from user profiles, which have been categorized into provinces. We collect the latitude and longitude coordinates of the locations (for calculating error distances) through the Google Maps Geocoding API. In all datasets, we remove the countries/states/provinces with fewer than 10 users.
Comparison methods.
We compare the following methods for location inference:

- Content [8]: Utilizes a simple probabilistic model to predict locations from tweet content only.
- Logistic Regression (LR): A baseline classification model that predicts the user location using logistic regression. We use the same feature set as our proposed model, including both content and profile features, but ignore the correlations.
- Support Vector Machine (SVM) [44]: Zubiaga et al. applied SVM to classify tweet location. We choose a linear kernel for the SVM.
- FindMe [5]: This method infers user locations with social and spatial proximity. It uses the network only and propagates label information to unlabeled users.
- Graph Convolutional Network (GCN) [24]: A state-of-the-art neural network model for graph-based semi-supervised learning. It uses the same features and correlations as our model to predict user locations.
- SSFGM: The proposed method. We compare the performance of our model trained with three different algorithms: Softmax Regression (SR), SampleRank [37], and Two-Chain Sampling (TCS). We also report results when we enhance the model with deep factors: SSFGM (TCS+Deep).

Table 5: Performance comparison of different methods in user geo-location inference on Twitter (World), Twitter (USA), and Weibo. ("Acc." means Accuracy (%), and "MED" means Mean Error Distance (km).)
Evaluation metrics.
For evaluation, we divide each dataset into three parts: 50% for training, 10% for validation, and 40% for testing. For the methods that do not require validation, the validation data is also used for training. We consider three evaluation metrics: Accuracy (the percentage of users whose locations are predicted correctly), Accuracy@3 (the percentage for which the true location is among the top 3 predictions), and Mean Error Distance (the average error distance between the predicted and the true location). All of the comparison methods can output a likelihood score for each location; we rank the locations according to the likelihood and evaluate the top 3.
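For concreteness, a minimal NumPy sketch of the three metrics follows, assuming a per-user score matrix and a coords table mapping each location id to (latitude, longitude); the haversine formula here is a standard stand-in for however the error distances were actually computed.

```python
import numpy as np

def haversine_km(p, q):
    # Great-circle distance between (lat, lon) pairs, in kilometers.
    lat1, lon1, lat2, lon2 = map(np.radians, (*p, *q))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def evaluate(scores, y_true, coords):
    # scores: (n, C) likelihood per location; y_true: (n,) true location ids
    top1 = scores.argmax(axis=1)
    top3 = np.argsort(-scores, axis=1)[:, :3]
    acc = float((top1 == y_true).mean())
    acc3 = float(np.mean([y in row for y, row in zip(y_true, top3)]))
    med = float(np.mean([haversine_km(coords[p], coords[t])
                         for p, t in zip(top1, y_true)]))
    return acc, acc3, med
```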
Implementation details.

For the Content method, we identify location indicative words using the Information Gain Ratio criterion proposed by [14]. For LR and SVM, we use the Liblinear implementation [11] with the default parameter setting. For GCN, we use a two-layer GCN model with a hidden layer size of 128 and the mini-batched training approach [7].

For the proposed method, we implement SSFGM (TCS) and SSFGM (TCS+Deep) using TensorFlow [1] with the Adam optimizer [23]. We set the hyperparameters empirically according to performance on the validation set: a learning rate of η = 0.01, a mini-batch size of 512, and an early stopping threshold of ε = 10. The deep factor is a two-layer fully-connected neural network, where the first layer has 128 hidden units and the second layer has 64 hidden units.
All experiments are performed on an x86-64 machine with 40-core 3.00GHz Intel Xeon(R) CPUs, 3 NVIDIA Titan X GPUs, and 128GB RAM.
Location inference performance.
We compare all the methods on the three datasets. Table 5 lists the performance of the comparison methods for geo-location inference.

In our experiments, the proposed SSFGM consistently outperforms all the comparison methods in terms of prediction accuracy on all datasets. On Twitter (World), LR and SVM achieve an accuracy of 94.4% in predicting the user's country; SSFGM further improves the accuracy to 95.7% by incorporating the social network. On Twitter (USA) and Weibo, predicting a user's state/province is harder. This is because for predicting a user's country, the content alone may already be very indicative, as users from different countries use different languages; for predicting state-level locations, we need to exploit additional information such as the social network. SSFGM achieves a significant improvement over methods that only utilize local attributes or only utilize the network. Notably, while using the same content and network information, SSFGM significantly outperforms the Graph Convolutional Network (GCN), a state-of-the-art method that has been successfully applied to many other tasks on graphs. SSFGM directly models the correlation between the locations of related users, while GCN only models the correlation between their features. In fact, we also tried combining GCN and SSFGM by defining the attribute factor function with a GCN (i.e., taking the feature matrix X as input instead of a single user's features x_i alone), but it still could not outperform SSFGM.

Another interesting observation is that on Twitter (USA), purely network-based methods (e.g., FindMe) perform worse than linear models (LR and SVM), but on Weibo (a Chinese microblog) they significantly outperform linear models. This suggests that network information is more important in the Weibo dataset. We suspect the reason lies in differences in user behaviour and population distribution between the USA and China.

Table 6: Performance and training time of different learning algorithms for SSFGM on a small dataset [28]. (The numbers in brackets represent the speedup against LBP.)

Method           Accuracy  Time (speedup vs. LBP)
LBP [34]         —         —
SampleRank [37]  —         —
TCS              91.23%    8.57 sec (118x)

Table 7: Training time of GCN and SSFGM.
Method            Twitter (World)  Twitter (USA)  Weibo
GCN               11 hr 11 min     48.3 min       4 hr 18 min
SSFGM (TCS)       1 hr 55 min      24.2 min       47.6 min
SSFGM (TCS+Deep)  1 hr 57 min      24.3 min       1 hr 4 min

Finally, we observe that the deep factor generally helps improve the inference accuracy of our model. Our motivation for incorporating the deep factor is to capture the non-linear, high-dimensional association between input features and output locations. Although its benefit is not very significant in our experiments, we have shown the feasibility of using neural networks in our model. Designing more advanced and effective neural network architectures is an interesting future direction.
Comparison of different learning algorithms.
We now compare the performance of four different learning algorithms for SSFGM, including the traditional Loopy Belief Propagation (LBP) algorithm [34]. LBP suffers from high computational cost and is not usable on our million-scale datasets. To compare it fairly with the other algorithms, we construct a smaller dataset from the Facebook ego-network data in SNAP [28]. In this dataset, each user has an anonymized hometown location, but content information is not available; we use Facebook friendships as edges. After data preprocessing, we obtain a relatively small dataset with 856 users and 11,789 edges, on which we compare the four learning algorithms. Table 6 shows the results, where the algorithms are implemented mainly in C++ and each uses a single CPU core. Among the algorithms, LBP achieves the highest accuracy, but takes much more time to train than the others. The other three algorithms significantly reduce the training time, through either approximation assumptions or sampling methods. SR is the most time-efficient, but its accuracy is worse than the others'. SampleRank and the proposed TCS algorithm solve the computation cost problem (over 100x speedup compared with LBP) while achieving comparable accuracy. From Table 5, we can also see that TCS usually performs better than SampleRank on large datasets.

We report the training time of TCS on the three large datasets and compare it with GCN in Table 7. Here the algorithms run on three GPUs under the TensorFlow framework. With TCS, our model takes only 0.4-2 hours of training on million-scale datasets and achieves the best prediction performance among the comparison methods. It is also much faster than GCN.
Figure 2: Feature contribution analysis. (SSFGM-profile,-content, -network means removing profile features, contentfeatures, or correlation factors, respectively.)Factor contribution analysis.
We evaluate how the different factors (content, profile, and network) contribute to location inference in the proposed model, using the two Twitter datasets. Specifically, we remove each factor from SSFGM and measure the resulting decrease in prediction accuracy: the larger the decrease, the more important the factor. Figure 2 shows the results. Different factors contribute differently on the two datasets; the content-based features appear to be the most useful for inferring location on the Twitter datasets, but all features are helpful. This analysis confirms the necessity of incorporating diverse features in the proposed model.
Training data ratio analysis.
We conduct further experiments to evaluate our method's performance when training data is limited. We vary the training data ratio in each dataset and compare the prediction accuracies of several methods; the validation and testing sets remain fixed. The results are shown in Figure 3. SSFGM performs well even with only 10% of the labeled data, and its prediction accuracy steadily increases as more labeled data are used for training. It shows a distinct advantage over LR, whose performance can hardly be improved by adding more training data.

Figure 3: Training data ratio analysis on Twitter (World) and Twitter (USA), comparing FindMe, LR, and SSFGM (accuracy vs. training ratio).
CONCLUSION

In this paper, we studied the problem of inferring user locations from social media. We proposed a general probabilistic model based on factor graphs. The model generalizes previous methods by incorporating content, network, and deep features learned from social context. It is also sufficiently flexible to support semi-supervised learning with limited labeled data. We proposed a Two-Chain Sampling (TCS) algorithm, which significantly improves the inference accuracy; the algorithm is also parallelizable and capable of handling large-scale networked data. Our experiments on three different datasets validated the effectiveness and efficiency of the model.
REFERENCES

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI'16, 2016.
[2] O. Ajao, J. Hong, and W. Liu. A survey of location inference techniques on Twitter. Journal of Information Science, 41(6):855-864, 2015.
[3] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5-43, 2003.
[4] T. Artieres et al. Neural conditional random fields. In AISTATS'10, pages 177-184, 2010.
[5] L. Backstrom, E. Sun, and C. Marlow. Find me if you can: improving geographical prediction with social and spatial proximity. In WWW'10, pages 61-70. ACM, 2010.
[6] L. Bottou. Large-scale machine learning with stochastic gradient descent. In COMPSTAT'10, pages 177-186. Springer, 2010.
[7] J. Chen, T. Ma, and C. Xiao. FastGCN: Fast learning with graph convolutional networks via importance sampling. arXiv preprint arXiv:1801.10247, 2018.
[8] Z. Cheng, J. Caverlee, and K. Lee. You are where you tweet: a content-based approach to geo-locating Twitter users. In CIKM'10, pages 759-768. ACM, 2010.
[9] C. A. Davis Jr, G. L. Pappa, D. R. R. de Oliveira, and F. de L Arcanjo. Inferring the location of Twitter messages based on user relationships. Transactions in GIS, 15(6):735-751, 2011.
[10] J. Eisenstein, B. O'Connor, N. A. Smith, and E. P. Xing. A latent variable model for geographic lexical variation. In EMNLP'10, pages 1277-1287. ACL, 2010.
[11] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.
[12] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721, 1984.
[13] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS'11, volume 15, page 275, 2011.
[14] B. Han, P. Cook, and T. Baldwin. Geolocation prediction in social media data by finding location indicative words. In COLING'12, pages 1045-1062, 2012.
[15] B. Han, A. Rahimi, L. Derczynski, and T. Baldwin. Twitter geolocation prediction shared task of the 2016 workshop on noisy user-generated text. In the 2nd Workshop on Noisy User-generated Text, pages 213-217, 2016.
[16] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97-109, 1970.
[17] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.
[18] J. Hopcroft, T. Lou, and J. Tang. Who will follow you back? reciprocal relationship prediction. In CIKM'11, pages 1137-1146. ACM, 2011.
[19] Y. Ikawa, M. Enoki, and M. Tatsubori. Location inference using microblog messages. In WWW'12, pages 687-690. ACM, 2012.
[20] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161-173, 2001.
[21] D. Jurgens. That's what friends are for: Inferring location in online social media platforms based on social relationships. In ICWSM'13, pages 273-282, 2013.
[22] D. Jurgens, T. Finethy, J. McCorriston, Y. T. Xu, and D. Ruths. Geolocation prediction in Twitter using social networks: A critical analysis and review of current practice. In ICWSM'15, pages 188-197, 2015.
[23] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[24] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[25] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498-519, 2001.
[26] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In WWW'10, pages 591-600. ACM, 2010.
[27] P. F. Lazarsfeld, R. K. Merton, et al. Friendship as a social process: A substantive and methodological analysis. Freedom and Control in Modern Society, 18(1):18-66, 1954.
[28] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
[29] R. Li, S. Wang, H. Deng, R. Wang, and K. C.-C. Chang. Towards social user profiling: unified and discriminative influence model for inferring home locations. In KDD'12, pages 1023-1031. ACM, 2012.
[30] J. McGee, J. Caverlee, and Z. Cheng. Location prediction in social media based on tie strength. In CIKM'13, pages 459-468. ACM, 2013.
[31] M. McPherson, L. Smith-Lovin, and J. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, pages 415-444, 2001.
[32] Y. Miura, M. Taniguchi, T. Taniguchi, and T. Ohkuma. Unifying text, metadata, and user network representations with a neural network for geolocation prediction. In ACL'17, volume 1, pages 1260-1272, 2017.
[33] D. Mok, B. Wellman, et al. Did distance matter before the internet?: Interpersonal contact and support in the 1970s. Social Networks, 29(3):430-461, 2007.
[34] K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In UAI'99, pages 467-475. Morgan Kaufmann Publishers Inc., 1999.
[35] A. Rahimi, T. Cohn, and T. Baldwin. Twitter user geolocation using a unified text and network prediction model. In ACL'15, volume 2, pages 630-636, 2015.
[36] A. Rahimi, D. Vu, T. Cohn, and T. Baldwin. Exploiting text and network context for geolocation of social media users. In NAACL'15, pages 1362-1367, 2015.
[37] K. Rohanimanesh, K. Bellare, A. Culotta, A. McCallum, and M. L. Wick. SampleRank: Training factor graphs with atomic gradients. In ICML'11, pages 777-784, 2011.
[38] S. Roller, M. Speriosu, S. Rallapalli, B. Wing, and J. Baldridge. Supervised text-based geolocation using language models on an adaptive grid. In EMNLP'12, pages 1500-1510. ACL, 2012.
[39] K. Ryoo and S. Moon. Inferring Twitter user locations with 10 km accuracy. In WWW'14, pages 643-648. ACM, 2014.
[40] C. Sutton and A. McCallum. An introduction to conditional random fields for relational learning, volume 2. Introduction to Statistical Relational Learning. MIT Press, 2006.
[41] B. P. Wing and J. Baldridge. Simple supervised document geolocation with geodesic grids. In ACL'11, pages 955-964. ACL, 2011.
[42] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In ICML'97, volume 97, pages 412-420, 1997.
[43] J. Zhang, B. Liu, J. Tang, T. Chen, and J. Li. Social influence locality for modeling retweeting behaviors. In IJCAI'13, volume 13, pages 2761-2767, 2013.
[44] A. Zubiaga, A. Voss, R. Procter, M. Liakata, B. Wang, and A. Tsakalidis. Towards real-time, country-level location classification of worldwide tweets. IEEE Transactions on Knowledge and Data Engineering, 29(9):2053-2066, 2017.