User Factor Adaptation for User Embedding via Multitask Learning
Xiaolei Huang, Michael J. Paul, Robin Burke, Franck Dernoncourt, Mark Dredze
Xiaolei Huang
Computer Science, University of Memphis, [email protected]
Michael J. Paul, Robin Burke
Information Science, University of Colorado Boulder, mpaul,[email protected]
Franck Dernoncourt
Adobe Research, [email protected]
Mark Dredze
Computer Science, Johns Hopkins University, [email protected]
Abstract
Language varies across users and their interested fields in social media data: words authored by a user across his or her interests may have different meanings (e.g., cool) or sentiments (e.g., fast). However, most of the existing methods to train user embeddings ignore the variations across user interests, such as product and movie categories (e.g., drama vs. action). In this study, we treat the user interest as domains and empirically examine how user language can vary across this user factor in three English social media datasets. We then propose a user embedding model that accounts for the language variability of user interests via a multitask learning framework. The model learns user language and its variations without human supervision. While existing work has mainly evaluated user embeddings on extrinsic tasks, we propose an intrinsic evaluation via clustering, and we also evaluate user embeddings on an extrinsic task, text classification. Experiments on the three English-language social media datasets show that our proposed approach can generally outperform baselines by adapting the user factor.
Language varies across user factors including user interests, demographic attributes, personalities, and latent factors from user history. Research shows that language usage diversifies according to online user groups (Volkova et al., 2013); for example, women were more likely to use the word weakness in a positive way, while men were the opposite. In social media, user interests can include topics of user reviews (e.g., home vs. health services in Yelp) and categories of reviewed items (electronic vs. kitchen products in Amazon). The ways that users express themselves depend on the current contexts of their interests (Oba et al., 2019): users may use the same words for opposite meanings and different words for the same meaning. For example, online users can use the word "fast" to criticize battery quality in the electronics domain or to praise medicine effectiveness for medical products; users can also use the word "cool" to describe a property of AC products or to express sentiment.

User embedding, which learns a fixed-length representation from the multiple reviews of each user, encodes latent user information in a unified vector space (Benton, 2018; Pan and Ding, 2019). The latent representations inferred from online content can predict user profiles (Volkova et al., 2015; Wang et al., 2018; Farnadi et al., 2018; Lynn et al., 2020) and behaviors (Zhang et al., 2015; Amir et al., 2017; Benton et al., 2017; Ding et al., 2017). User embeddings can personalize classification models and further improve model performance (Tang et al., 2015; Chen et al., 2016a; Yang and Eisenstein, 2017; Wu et al., 2018; Zeng et al., 2019; Huang et al., 2019). The representations of user language can help models better understand documents as global contexts.

However, existing user embedding methods (Amir et al., 2016; Benton et al., 2016; Xing and Paul, 2017; Pan and Ding, 2019) mainly focus on extracting features from the language itself while ignoring user interests.
Recent research has demonstrated that adapting user factors can further improve user geolocation prediction (Miura et al., 2017), demographic attribute prediction (Farnadi et al., 2018), and sentiment analysis (Yang and Eisenstein, 2017). Lynn et al. (2017) and Huang and Paul (2019) treated the language variations as a domain adaptation problem and referred to this idea as user factor adaptation.

In this study, we treat user interests as domains (e.g., restaurants vs. home services) and propose a multitask framework to model language variations and incorporate the user factor into user embeddings. We focus on three online review datasets from Amazon, IMDb, and Yelp containing diverse behaviors conditioned on user interests, which refer to the genres of reviewed items. For example, if any Yelp users have reviews on items in the home services category, then their user interests will include home services.

We start by exploring how the user factor, user interest, can cause language and classification variations in Section 3. We then propose our user embedding model that adapts the user interests using a multitask learning framework in Section 4. Prior research (Pan and Ding, 2019) generally evaluates user embeddings via downstream tasks, but user annotations are sometimes hard to obtain, and those evaluations are extrinsic rather than intrinsic. For example, the MyPersonality dataset (Kosinski et al., 2015) used in previous work (Ding et al., 2017; Farnadi et al., 2018; Pan and Ding, 2019) is no longer available, and an extrinsic task only evaluates whether user embeddings can help text classifiers. Research (Schnabel et al., 2015) suggests that intrinsic evaluation, including clustering, is preferable to extrinsic evaluation because it depends on fewer hyperparameters. We propose an intrinsic evaluation for user embeddings, which can provide a new perspective for future experiments.
We show that our user-factor-adapted user embedding can generally outperform the existing methods on both intrinsic and extrinsic tasks.

We collected English reviews of Amazon (health products), IMDb, and Yelp from publicly available sources (He and McAuley, 2016; Yelp, 2018; IMDb, 2020). For the IMDb dataset, we included English movies produced in the US from 1960 to 2019. Each review is associated with its author and the rated item, which refers to a movie in the IMDb data, a business unit in the Yelp data, and a product in the Amazon data. To keep consistency in each dataset, we retain the top 4 frequent genres of rated items and the review documents with no less than 10 tokens. We dropped non-English review documents with a language detector (Lui and Baldwin, 2012), lowercased all tokens, and tokenized the corpora using NLTK (Bird and Loper, 2004). (The top 4 rated categories of Amazon-Health, IMDb, and Yelp are [sports nutrition, sexual wellness, shaving & hair removal, vitamins & dietary supplements], [comedy, thriller, drama, action], and [restaurants, health & medical, home services, beauty & spas], respectively.)

The review datasets have different score scales. We normalize the scales and encode each review score into three discrete categories: positive (> for Yelp and Amazon, > for IMDb), negative (< for Yelp and Amazon, < for IMDb), and neutral. Table 1 shows a summary of the datasets.

To protect user privacy, we anonymize all user-related information via hashing, and our experiments only use publicly available datasets for research demonstration. Any URLs, hashtags, and capitalized English names were removed. Due to the potential sensitivity of user reviews, we only use information necessary for this study. We do not use any user profiles in our experiments, except that our evaluations use the anonymized author ID of each review entry for training user embeddings. We will not release any private user reviews associated with user identities.
Instead, we will open-source our code and provide instructions for accessing the public datasets in enough detail that our proposed method can be replicated.
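The score normalization described above can be sketched as follows. This is a minimal illustration: the exact positive/negative thresholds were lost from this copy of the text, so the `pos_min` and `neg_max` values below are hypothetical placeholders, not the paper's settings.

```python
# Sketch of mapping raw review scores to three discrete categories.
# pos_min/neg_max are illustrative placeholders, not the paper's thresholds.
def discretize(score, pos_min, neg_max):
    """Map a numeric review score to positive/neutral/negative."""
    if score > pos_min:
        return "positive"
    if score < neg_max:
        return "negative"
    return "neutral"

# Example with hypothetical thresholds on a 1-5 scale:
labels = [discretize(s, pos_min=3, neg_max=3) for s in [5, 3, 1]]
```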
Language varies across user factors such as user interests (Oba et al., 2019), demographic attributes (Huang and Paul, 2019), and social relations (Yang and Eisenstein, 2017; Gong et al., 2020). In this section, our goal is to quantitatively analyze whether user interests cause user language variations, which can reduce the effectiveness and robustness of user embeddings. We approach this with two analysis tasks: first, measuring word feature similarity based on user interests; second, examining how classifier performance depends on the grouped user interests on which a model is trained and applied.
Existing methods mainly infer user embeddings from features of text content (Pan and Ding, 2019). Therefore, word usage variations across user interests will change word distributions and further impact the stability of user embeddings. We aim to test whether there are language variations across the user interests in our datasets and how strong they are.

We consider word usage as it relates to user embeddings by estimating the overlap of top word features across the genres of rated items: the categories of reviewed products in Amazon, business units in Yelp, and movies in IMDb. To address the data sparsity caused by single-user preferences, we grouped users, and therefore their generated documents, according to the genres of the items they reviewed. We refer to these groups as genre domains. We build a unified feature vectorizer (Pedregosa et al., 2011) with TF-IDF weighted n-gram features (n ∈ { , , }), removing features that appeared in fewer than 2 documents. We rank and select the top 1000 word features for each genre domain by mutual information. We then compute the intersection percentage between every two genre domains: let F_1 be the set of top features for one genre domain and F_2 the set of top features for the other; the overlap is |F_1 ∩ F_2| / |F_1|.

Data           Users   Docs     Rated Items  Tokens  Train    Dev     Test
Amazon-Health  11,438  80,592   3,822        127     64,474   8,060   8,061
IMDb           6,089   123,184  642          187     98,548   12,319  12,320
Yelp           76,323  551,695  9,327        152     441,357  55,170  55,171

Table 1: Statistical summary of the Amazon, Yelp, and IMDb review datasets. Amazon-Health refers to health-related reviews. Tokens means the average number of tokens per document. The data split for the text classification evaluation task is on the right side.

Figure 1: Word feature overlaps between every two user groups. A value of 1 means no variation of top features between two user groups, while values less than 1 indicate more feature variation.

We show the results in Figure 1. The overlap varies significantly across genre domains. This indicates that the word usage of users, and its contexts, change across user interests and preferences. Since the training of user embeddings relies heavily on the language features of users, this suggests that it is important to consider the language variations in user interests when building user embeddings.

User embeddings are effective for understanding user behaviors in classification settings (Amir et al., 2016; Ding et al., 2018).
Research has found that combining user and document representations can benefit classification performance (Chen et al., 2016b; Li et al., 2018; Yuan et al., 2019). We explore how the language variations in user interests can affect classification models.

We conduct an analysis by training and testing classifiers that group users by the categories of reviewed items. We first group items and users according to item genres, which can be treated as different domains of user interests. For each domain, we downsampled documents, users, and items within each group to match their numbers in the smallest group, so that classification performance differences are not due to differences in data size. For each group of documents, we shuffle and split the data into training (80%) and test (20%) sets. We train logistic regression classifiers with default hyperparameters from scikit-learn (Pedregosa et al., 2011) using TF-IDF weighted uni-, bi-, and tri-gram features. We report weighted F1 scores across grouped users and show the results in Figure 2.

Figure 2: Document classification performance when training and testing on different groups of users. The datasets come from Amazon health, IMDb, and Yelp reviews. Darker red indicates better classification performance, while darker blue means worse performance.

We observe that classification performance varies across the grouped users. Higher performance variations between in- and out-group users suggest higher user variations, and vice versa. If no variations of user language existed, the performance of classifiers would be similar across the domains. The performance variations suggest that user behaviors vary across the categories of user interests. We also observe that classification models generally perform better when tested within the same user groups and worse on other user groups. This suggests a connection between user interests and language usage, from which user embeddings are derived.
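The cross-group probe described in this analysis can be sketched as follows. The two toy groups, their documents, and labels are illustrative stand-ins for the review data; the paper's setup uses TF-IDF n-gram features with a default logistic regression and weighted F1.

```python
# Sketch of the cross-domain classification probe: train on one user group,
# evaluate weighted F1 on every group (including itself).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

groups = {
    "restaurants": (["great food", "awful service", "tasty meal", "cold food"],
                    [1, 0, 1, 0]),
    "home_services": (["quick repair", "rude plumber", "fast fix", "bad work"],
                      [1, 0, 1, 0]),
}

# One unified vectorizer over all groups, mirroring the analysis above.
vec = TfidfVectorizer(ngram_range=(1, 3)).fit(
    [d for docs, _ in groups.values() for d in docs])

scores = {}
for train_name, (train_docs, train_y) in groups.items():
    clf = LogisticRegression().fit(vec.transform(train_docs), train_y)
    for test_name, (test_docs, test_y) in groups.items():
        pred = clf.predict(vec.transform(test_docs))
        scores[(train_name, test_name)] = f1_score(test_y, pred,
                                                   average="weighted")
```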
We present the architecture of our proposed model on the left of Figure 3. Existing methods to train text-based user embeddings (Pan and Ding, 2019) mainly focus on user-generated documents while ignoring user factors such as user interests. The closest work to ours trained user embeddings only by predicting whether users co-occurred with sampled words (Amir et al., 2017). We extend this line of work by adapting user interests into the modeling steps. The proposed unsupervised model trains four joint tasks based on the Skip-Gram (Mikolov et al., 2013): word and word, user and word, item and word, and user and item. Note that we do not use the categories of rated items or user interests in our training steps. We optimize the model by minimizing the following loss function:

L = L(w, w) + L(u, w) + L(p, w) + L(u, p)

where w, u, and p denote words, users, and rated items, respectively. Considering the large size of the vocabulary, user set, and rated item set, we approximate our optimization objectives with negative sampling. We can then treat each task as a classification problem and calculate loss values with binary cross-entropy. We present the details of each optimization task as follows:

Word and word is the standard way to train Word2vec (Mikolov et al., 2013) models. The prediction task is to predict whether the sampled words co-occur within the context window. The training process uses negative sampling to approximate the objective function. We choose 5 as the number of negative samples. We keep the top 20,000 frequent words and replace the rest with a special token, <unk>.

User and word predicts whether a user authored the sampled words given the contexts of the user's posts. The goal is to learn patterns of user language usage from user history. Given a document i, its author u_i, and the user's vocabulary V_{u_i} = {w_{1,i}, ..., w_{j,i}, ..., w_{n,i}}, where n is the number of frequent words authored by the user.
Our objective is to minimize the following function:

L(u, w) = − Σ_{w_j ∈ V_{u_i}; w_k ∈ V, w_k ∉ V_{u_i}} [ log θ(e(u_i) · e(w_j)) + log(1 − θ(e(u_i) · e(w_k))) ]

where w_k is a negative sample from the whole vocabulary V, e(u) and e(w) are fixed-length user and word vectors respectively, and θ is the sigmoid function that normalizes the values of the dot products. We extend the previous work (Amir et al., 2017) to integrate both local and global user language usage by sampling w_j from a combined token list of both the input document and the user's vocabulary. This helps the model learn the contextual information of each user.

Item and word follows the prediction task of user and word to classify whether the sampled words describe the selected item. This task uses review documents to train representations of rated items. We then have:

L(p, w) = − Σ_{w_j ∈ V_{p_i}; w_k ∈ V, w_k ∉ V_{p_i}} [ log θ(e(p_i) · e(w_j)) + log(1 − θ(e(p_i) · e(w_k))) ]

where V_{p_i} is the vocabulary of the rated item p_i and w_k is a negative word sample. Language can be viewed as a bridge in the interactive relation between user and item, predicting language usage for both rated items and users.

User and item learns whether a user commented on the sampled items. The prediction task aims to adapt latent user factors into the user embeddings. Given a document i, its author u_i, and the reviewed item p_i, we optimize the task by minimizing:

L(u, p) = − Σ_{p_k ∈ P, p_k ∉ P_{u_i}} [ log θ(e(p_i) · e(u_i)) + log(1 − θ(e(p_k) · e(u_i))) ]

where P is the collection of all items, P_{u_i} is the set of items the user reviewed, and p_k is a negative sample that the user did not review. The constraints between reviewed items and users help user embeddings identify language variations across domains of item genres. In turn, the relation between user and item helps infer item vectors.

Figure 3: Illustrations of the multitask user embedding framework on the left and personalized document classifiers using the trained embedding models on the right. The arrows and their colors refer to the input directions and input sources, respectively. We use the icons of people, a shopping cart, and "ABC" to represent users, reviewed items, and word inputs. The ⊕ is the concatenation operation.

For model settings, we used Adam (Kingma and Ba, 2014) for optimization with a learning rate of 1e-5. We set the number of training epochs to 5. The model initializes embedding vectors randomly and learns 300-dimensional representations for words, users, and reviewed items. We empirically use 5 as the number of negative samples. For the other parameters, we keep the defaults in Keras (Chollet and Others, 2015).
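The negative-sampling terms above share one shape: a binary cross-entropy over dot products, with label 1 for observed pairs and 0 for sampled negatives. A minimal numpy sketch of one such term, L(u, w), assuming random placeholder vectors in place of the trained embeddings:

```python
# Numpy sketch of one negative-sampling term of the multitask objective,
# L(u, w): push sigmoid(e(u)·e(w_pos)) toward 1 and sigmoid(e(u)·e(w_neg))
# toward 0. Dimensions and samples are illustrative, not the trained model.
import numpy as np

rng = np.random.default_rng(0)
dim = 300
e_user = rng.normal(size=dim)
e_pos = rng.normal(size=(5, dim))   # words the user authored
e_neg = rng.normal(size=(5, dim))   # negative word samples

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_loss(u, pos, neg):
    # Binary cross-entropy with labels 1 for positives and 0 for negatives.
    p_pos = sigmoid(pos @ u)
    p_neg = sigmoid(neg @ u)
    return -(np.log(p_pos).sum() + np.log(1.0 - p_neg).sum())

loss_uw = pair_loss(e_user, e_pos, e_neg)
```

The L(p, w) and L(u, p) terms have the same form with item vectors substituted for the user or word vectors.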
We evaluate the effectiveness of the user-factor-adapted embedding model with an intrinsic evaluation, a user clustering task, and an extrinsic evaluation, a personalized classification task. The first task measures the purity of clusters with respect to categories of user interests, and the second task uses document classification as a proxy for the quality of user embeddings. We also conduct a qualitative analysis of the user embeddings in comparison with the closest prior work (Amir et al., 2017).
The unsupervised evaluation of embedding models focuses on four main categories: relatedness, analogy, categorization, and selectional preference (Schnabel et al., 2015). We approach user embedding evaluation by categorizing users into different clusters. User communities or groups gather users by their interests and behaviors, such as engaging with the same field of topics (Benton et al., 2016; Yang and Eisenstein, 2017). In our datasets, the user-purchased Amazon products, the user-visited Yelp business units, and the user-watched IMDb movies have item categories. The categories can imply user preferences and interests, and therefore can help evaluate user clusters. In this study, our proposed multitask model learns interactive relations across language, users, and items instead of using the item categories. We compare our proposed model with five baseline models:

word2user represents users by aggregating word representations (Benton et al., 2016). We compute a user representation by averaging the embeddings of all tokens authored by the user. To obtain the word embeddings, for each dataset, we trained a word2vec model for 5 epochs using Gensim (Rehurek and Sojka, 2010) with 300-dimensional vectors.

lda2user generates user representations by applying Latent Dirichlet Allocation (LDA) (Blei et al., 2003) to user documents (Pennacchiotti and Popescu, 2011). We set the number of topics to 300 and leave the rest of the parameters at their defaults in Gensim (Rehurek and Sojka, 2010). We apply the LDA model to each user document to obtain a document vector, and then get a user vector by averaging the vectors of all the user's documents.

doc2user applies paragraph2vec (Le and Mikolov, 2014) to obtain user vectors. We implemented the User-D-DBOW model, which achieved the best performance in previous work (Ding et al., 2017). The implementation keeps the default parameter values in Gensim (Rehurek and Sojka, 2010). We aggregate each user's documents into a single document.
The User-D-DBOW model then derives a single user vector from the aggregated document.

bert2user follows a similar process to lda2user. We use the "bert-base-uncased" pre-trained BERT model for English from the transformers toolkit (Wolf et al., 2019) with default parameter and model settings. After inserting "[CLS]" and "[SEP]" at the beginning and end of each document, the BERT model encodes the document into a fixed-length (768) document vector. We then generate user embeddings by averaging all of each user's document vectors.

user2vec trains user embeddings by predicting word usage by users. We follow the existing work (Amir et al., 2017) but set the user vector dimension to 300.

We use the SpectralClustering algorithm from the scikit-learn (Pedregosa et al., 2011) toolkit to cluster users at three clustering sizes: 4, 8, and 12. We set the affinity to cosine and leave the other parameters at their defaults. To measure cluster quality, we select every two users from the clusters without repetition. We count a user pair as correct if the two users overlap in the same item genre and come from the same cluster, or if they do not overlap and come from different clusters. Otherwise, we count the pair as wrong. We thus obtain a list of predicted labels and ground truths using the item genres as a proxy. Finally, we measure clustering purity by the F1 score.

We present results in Table 2. The results show that our multitask user embedding model outperforms the other baselines by a large margin on the IMDb and Yelp datasets. The improvements suggest the user-factor-adapted model can capture semantic variations across diverse user interests. Our model and user2vec have similar scores on the Amazon-Health dataset. Compared to the other two datasets, the Amazon-Health data has more similar topics among review items.
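The pairwise purity metric described above can be sketched as follows. The genre sets and cluster assignments here are toy values, not the actual datasets: every user pair counts as correct when same-genre users share a cluster or different-genre users are split.

```python
# Sketch of the pairwise clustering-purity F1: for each unordered user pair,
# the "truth" is whether the users share an item genre and the "prediction"
# is whether they landed in the same cluster.
from itertools import combinations
from sklearn.metrics import f1_score

genres   = {"u1": {"drama"}, "u2": {"drama"}, "u3": {"action"}, "u4": {"action"}}
clusters = {"u1": 0, "u2": 0, "u3": 1, "u4": 0}

truth, pred = [], []
for a, b in combinations(genres, 2):
    truth.append(int(bool(genres[a] & genres[b])))  # share a genre?
    pred.append(int(clusters[a] == clusters[b]))    # share a cluster?

purity_f1 = f1_score(truth, pred)
```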
We train three classifiers to evaluate user embeddings on the document classification task. We split each dataset into training (80%), development (10%), and test (10%) sets, as shown in Table 1. The models oversample the minority classes during the training process. We test the classifiers when they achieve the best performance on the development set. Finally, we report precision, recall, and F1 scores using the classification report from scikit-learn (Pedregosa et al., 2011). Figure 3 illustrates personalizing classifiers by concatenating document representations with user embeddings. We compare our proposed model with classifiers using the existing user2vec (Amir et al., 2017) and with non-personalized classifiers. To ensure a fair comparison, classifiers use the same settings with and without user embeddings.
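The personalization step on the right of Figure 3 amounts to a concatenation before the final prediction layer. A minimal sketch with placeholder arrays standing in for the document representations and trained user embeddings:

```python
# Sketch of personalizing a classifier: concatenate each document's
# representation with its author's user embedding. Shapes are illustrative
# (200-d document encodings, 300-d user vectors as in the paper).
import numpy as np

rng = np.random.default_rng(0)
doc_repr = rng.normal(size=(8, 200))   # batch of 8 document encodings
user_emb = rng.normal(size=(8, 300))   # the 8 authors' user embeddings

personalized = np.concatenate([doc_repr, user_emb], axis=1)  # (8, 500)
```

The concatenated vectors then feed the classifier's prediction layer in place of the document representation alone.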
LR.
We build a logistic regression classifier using LogisticRegression from scikit-learn (Pedregosa et al., 2011). The classifier extracts uni-, bi-, and tri-gram features from the corpora, keeping the most frequent 15K features, with default parameters.
GRU.
We build a bidirectional Gated Recurrent Unit (GRU) (Cho et al., 2014) classifier. We padded documents to the average document length of each corpus. We set the output dimension of the GRU to 200 and apply a dense layer to the output. The dense layer uses ReLU (Hahnloser et al., 2000) as the activation function, applies a dropout (Srivastava et al., 2014) rate of 0.2, and outputs 200 dimensions for the final document class prediction. We train the classifier for 20 epochs.

Methods      Amazon-Health          IMDb                   Yelp
             F1@4   F1@8   F1@12    F1@4   F1@8   F1@12    F1@4   F1@8   F1@12
word2user    .929   .909   .905     .653   .725   .762     .859   .810   .797
lda2user     .920   .914   .900     .696   .726   .761     .849   .839   .832
doc2user     .873   .891   .901     .660   .725   .748     .836   .828   .826
bert2user    .871   .896   .906     .660   .714   .734     .838   .828   .830
user2vec     .868   .891   .901     .601   .600   .593     .841   .829   .832
MTL (ours)   .870   .890   .900     .801   .879   .884     .879   .843   .838

Table 2: Performance summary of different user embedding models. We report F1 scores at multiple numbers of clusters. Bold fonts indicate the best performance in each evaluation task.
BERT.
We implement a BERT-based classifier with HuggingFace's transformers toolkit (Wolf et al., 2019). The classifier loads the "bert-base-uncased" pre-trained BERT model for English, encodes each document into a fixed-length (768) vector, and feeds it to a linear layer for prediction. We fine-tune for 10 epochs with a batch size of 32 and optimize the model with AdamW at a learning rate of 9e−.

We show the performance results in Table 3. Compared to the baselines, the classifiers personalized by our proposed model generally achieve the best performance across the three datasets. This highlights that adapting user factors can help embedding models learn user variations and benefit classification performance. We also observe that the personalized classifiers generally outperform the non-personalized classifiers. This indicates that personalizing the classifiers with user history boosts classification performance in our study.

To further evaluate the effectiveness of user embedding models, we map users into a 2-D space using their embeddings and plot them in Figure 4. We group users according to user interests using the domain categories of rated items. To map the 300-dimensional user embeddings, we use the TSNE algorithm from scikit-learn (Pedregosa et al., 2011) to compress them into 2-D vectors. We set n_components to 2 and leave the other parameters at their TSNE defaults. We observe that the MTL user embedding model shows clearer clustering patterns with regard to user interests (categories of reviewed items). This indicates that the unsupervised multitask learning framework can adapt the latent user factors into the user embedding. Users may have multiple interests: in the right plot, we can also find a cluster that mixes multiple colors at the bottom right.
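The projection step can be sketched as below. The random matrix stands in for the trained 300-d user embeddings, and the perplexity value is an assumption added only so the toy sample size works; the paper leaves TSNE parameters at their defaults.

```python
# Sketch of the 2-D visualization step: compress 300-d user embeddings with
# t-SNE before coloring points by interest group. The embeddings here are
# random placeholders for the trained vectors.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
user_emb = rng.normal(size=(50, 300))    # 50 users, 300-d vectors

xy = TSNE(n_components=2, perplexity=10,
          init="random", random_state=0).fit_transform(user_emb)
```

The resulting xy coordinates are what get scattered and colored by genre in Figure 4.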
User profiling is a common task in natural language processing. Online user-generated texts show demographic variations in linguistic style, and this variability can be used to predict a user's personality and demographic attributes (Rosenthal and McKeown, 2011; Zhang et al., 2016; Hovy and Fornaciari, 2018; Wood-Doughty et al., 2020; Gjurković et al., 2020; Lynn et al., 2020). Demographic user factors influence how online users express their opinions (Volkova et al., 2013; Hovy, 2015; Wood-Doughty et al., 2017) and show promising improvements in text classification tasks (Lynn et al., 2017; Huang and Paul, 2019; Lynn et al., 2019). In this work, however, the goal of modeling user factors is to train robust user embeddings via domain adaptation, rather than demographic factor prediction or document classification as an end in itself.
Personalized classification generally improves the performance of document classifiers (Flek, 2020). The multitask learning framework has been applied to personalizing document classifiers by optimizing the classifiers on multiple document levels (Benton et al., 2017) or on general and individual levels (Wu and Huang, 2016). Social relations can bridge connections between users and generalize classification models across users (Wu and Huang, 2016; Yang and Eisenstein, 2017). For example, Wu and Huang (2016) optimize document classifiers with two tasks, sentiment classification and user social relation minimization, which allows classifiers to minimize the impact of user community variations. This work personalizes classifiers in a different way: we train user embedding models under a multitask learning framework and use the personalized classifiers to evaluate the user embedding models.

Methods   Amazon-Health              IMDb                       Yelp
          Precision  Recall  F1      Precision  Recall  F1      Precision  Recall  F1
LR        .834       .768    .793    .818       .779    .794    .856       .820    .833
LR-u      .841       .777    .801    .834       .791    .807    .860       .821    .835
LR-up     .838       .771    .796    .833       .791    .807    .863       .825    .838
GRU       .813       .844    .812    .824       .837    .823    .851       .865    .852
GRU-u     .836       .811    .821    .832       .819    .825    .868       .846    .858
GRU-up    .821       .832    .825    .846       .824    .836    .876       .864    .867
BERT      .866       .822    .840    .852       .809    .826    .866       .825    .840
BERT-u    .863       .812    .831    .858       .818    .833    .872       .843    .854
BERT-up   .873       .838    .851    .864       .831    .844    .880       .839    .854

Table 3: Performance scores of document classifiers on the review datasets. '-u' means personalized classifiers using user2vec (Amir et al., 2017), and '-up' indicates personalizing classifiers via our proposed method. Bold fonts highlight the best performance of each classifier on each dataset.

Figure 4: Visualizations of IMDb users colored by their interests in 4 movie genres (Thriller, Action, Drama, Comedy). We plot users using the embeddings from our proposed method (right) and user2vec (Amir et al., 2017) (left). The visualizations of Yelp and Amazon are omitted for reasons of space.
In this study, we have proposed user factor adaptation for building user embeddings under a multitask framework. Our analyses show how the user factor causes semantic variations in word usage and document classification, showing that the user factor is rooted in language. We have evaluated the proposed user embedding models on both intrinsic and extrinsic tasks. The user-factor-adapted model has shown its robustness to language variations in both evaluations, learning user representations and personalizing classifiers. We release our source code and instructions for data access at https://github.com/xiaoleihuang/UserEmbedding.

Our work on user factor adaptation highlights several future directions. First, our method models latent user factors inferred from user posts; a combination of user embeddings and explicit attributes (e.g., demographic factors) may improve model personalization. Second, user behaviors shift over time; a time-adapted user embedding could jointly model temporality and user attributes in online social media and could be extended to other fields, such as public health.
Acknowledgement
The authors thank the anonymous reviewers. This work was supported in part by the National Science Foundation under award number IIS-1657338. This work was also supported in part by a research gift from Adobe Research. The first author thanks the JHU CLSP cluster for computational support.
References
Silvio Amir, Glen Coppersmith, Paula Carvalho, Mario J. Silva, and Byron C. Wallace. 2017. Quantifying mental health from social media with neural user embeddings. In Proceedings of the 2nd Machine Learning for Healthcare Conference, volume 68 of Proceedings of Machine Learning Research, pages 306–321, Boston, Massachusetts. PMLR.

Silvio Amir, Byron C. Wallace, Hao Lyu, Paula Carvalho, and Mário J. Silva. 2016. Modelling context with user embeddings for sarcasm detection in social media. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 167–177, Berlin, Germany. Association for Computational Linguistics.

Adrian Benton. 2018. Learning representations of social media users. arXiv preprint arXiv:1812.00436.

Adrian Benton, Raman Arora, and Mark Dredze. 2016. Learning multiview embeddings of twitter users. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 14–19, Berlin, Germany. Association for Computational Linguistics.

Adrian Benton, Margaret Mitchell, and Dirk Hovy. 2017. Multitask learning for mental health conditions with limited social media data. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 152–162, Valencia, Spain. Association for Computational Linguistics.

Steven Bird and Edward Loper. 2004. NLTK: The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pages 214–217, Barcelona, Spain. Association for Computational Linguistics.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022.

Huimin Chen, Maosong Sun, Cunchao Tu, Yankai Lin, and Zhiyuan Liu. 2016a. Neural sentiment classification with user and product attention. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1650–1659.

Tao Chen, Ruifeng Xu, Yulan He, Yunqing Xia, and Xuan Wang. 2016b. Learning user and product distributed representations using a sequence model for sentiment analysis. IEEE Computational Intelligence Magazine, 11(3):34–44.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar. Association for Computational Linguistics.

François Chollet and others. 2015. Keras.

T. Ding, W. K. Bickel, and S. Pan. 2018. Predicting delay discounting from social media likes with unsupervised feature learning. In , pages 254–257.

Tao Ding, Warren K. Bickel, and Shimei Pan. 2017. Multi-view unsupervised user feature embedding for social media-based substance use prediction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2275–2284, Copenhagen, Denmark. Association for Computational Linguistics.

Golnoosh Farnadi, Jie Tang, Martine De Cock, and Marie-Francine Moens. 2018. User profiling through deep multimodal fusion. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM '18, pages 171–179, New York, NY, USA. Association for Computing Machinery.

Lucie Flek. 2020. Returning the N to NLP: Towards contextually personalized classification models. In Proceedings of the ACL, pages 7828–7838.

Matej Gjurković, Mladen Karan, Iva Vukojević, Mihaela Bošnjak, and Jan Šnajder. 2020. PANDORA talks: Personality and demographics on Reddit. arXiv preprint arXiv:2004.04460.

Lin Gong, Lu Lin, Weihao Song, and Hongning Wang. 2020. JNET: Learning user representations via joint network embedding and topic embedding. In
Pro-ceedings of the 13th International Conference onWeb Search and Data Mining , WSDM ’20, page205–213, New York, NY, USA. Association forComputing Machinery.Richard H R Hahnloser, Rahul Sarpeshkar, Misha AMahowald, Rodney J Douglas, and H Sebastian Se-ung. 2000. Digital selection and analogue amplifica-tion coexist in a cortex-inspired silicon circuit.
Na-ture , 405(6789):947.Ruining He and Julian McAuley. 2016. Ups and downs:Modeling the visual evolution of fashion trends withone-class collaborative filtering. In
Proceedings ofhe 25th International Conference on World WideWeb (WWW) , volume 3, pages 507–517. Interna-tional World Wide Web Conferences Steering Com-mittee.Dirk Hovy. 2015. Demographic factors improve classi-fication performance. In
Proceedings of the 53rd An-nual Meeting of the Association for ComputationalLinguistics and the 7th International Joint Confer-ence on Natural Language Processing (Volume 1:Long Papers) , pages 752–762, Beijing, China. As-sociation for Computational Linguistics.Dirk Hovy and Tommaso Fornaciari. 2018. Increasingin-class similarity by retrofitting embeddings withdemographic information. In
Proceedings of the2018 Conference on Empirical Methods in NaturalLanguage Processing , pages 671–677, Brussels, Bel-gium. Association for Computational Linguistics.Qi Huang, Chuan Zhou, Jia Wu, Mingwen Wang, andBin Wang. 2019. Deep structure learning for ru-mor detection on twitter. In , pages 1–8.IEEE.Xiaolei Huang and Michael J. Paul. 2019. Neural UserFactor Adaptation for Text Classification: Learningto Generalize Across Author Demographics. In
Pro-ceedings of the Eighth Joint Conference on Lexi-cal and Computational Semantics (* { SEM } Proceed-ings of the 3rd International Conference on Learn-ing Representations (ICLR) .Michal Kosinski, Sandra C Matz, Samuel D Gosling,Vesselin Popov, and David Stillwell. 2015. Face-book as a research tool for the social sciences:Opportunities, challenges, ethical considerations,and practical guidelines.
American Psychologist ,70(6):543.Quoc Le and Tomas Mikolov. 2014. Distributed rep-resentations of sentences and documents. In
Pro-ceedings of Machine Learning Research , volume 32,pages 1188–1196, Bejing, China. PMLR.Junjie Li, Haitong Yang, and Chengqing Zong. 2018.Document-level multi-aspect sentiment classifica-tion by jointly modeling users, aspects, and overallratings. In
Proceedings of the 27th InternationalConference on Computational Linguistics , pages925–936, Santa Fe, New Mexico, USA. Associationfor Computational Linguistics.Marco Lui and Timothy Baldwin. 2012. langid.py: Anoff-the-shelf language identification tool. In
Pro-ceedings of the ACL 2012 System Demonstrations ,pages 25–30, Jeju Island, Korea. Association forComputational Linguistics. Veronica Lynn, Niranjan Balasubramanian, and H. An-drew Schwartz. 2020. Hierarchical modeling foruser personality prediction: The role of message-level attention. In
Proceedings of the 58th AnnualMeeting of the Association for Computational Lin-guistics , pages 5306–5316, Online. Association forComputational Linguistics.Veronica Lynn, Salvatore Giorgi, Niranjan Balasubra-manian, and H. Andrew Schwartz. 2019. Tweetclassification without the tweet: An empirical ex-amination of user versus document attributes. In
Proceedings of the Third Workshop on Natural Lan-guage Processing and Computational Social Sci-ence , pages 18–28, Minneapolis, Minnesota. Asso-ciation for Computational Linguistics.Veronica Lynn, Youngseo Son, Vivek Kulkarni, Ni-ranjan Balasubramanian, and H. Andrew Schwartz.2017. Human centered NLP with user-factor adap-tation. In
Proceedings of the 2017 Conference onEmpirical Methods in Natural Language Processing ,pages 1146–1155.Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Cor-rado, and Jeffrey Dean. 2013. Distributed represen-tations of words and phrases and their composition-ality. In
Proceedings of the 26th International Con-ference on Neural Information Processing Systems -Volume 2 , NIPS’13, pages 3111–3119, USA. CurranAssociates Inc.Yasuhide Miura, Motoki Taniguchi, Tomoki Taniguchi,and Tomoko Ohkuma. 2017. Unifying text, meta-data, and user network representations with a neuralnetwork for geolocation prediction. In
Proceedingsof the 55th Annual Meeting of the Association forComputational Linguistics (Volume 1: Long Papers) ,pages 1260–1272, Vancouver, Canada.Daisuke Oba, Naoki Yoshinaga, Shoetsu Sato, SatoshiAkasaki, and Masashi Toyoda. 2019. Modeling per-sonal biases in language use by inducing personal-ized word embeddings. In
Proceedings of the 2019Conference of the North American Chapter of theAssociation for Computational Linguistics: HumanLanguage Technologies, Volume 1 (Long and ShortPapers) , pages 2102–2108, Minneapolis, Minnesota.Association for Computational Linguistics.Shimei Pan and Tao Ding. 2019. Social media-baseduser embedding: A literature review. In
Proceed-ings of the Twenty-Eighth International Joint Con-ference on Artificial Intelligence, IJCAI-19 , pages6318–6324. International Joint Conferences on Ar-tificial Intelligence Organization.Fabian Pedregosa, Gael Varoquaux, Alexandre Gram-fort, Vincent Michel, Bertrand Thirion, OlivierGrisel, Mathieu Blondel, Peter Prettenhofer, RonWeiss, Vincent Dubourg, Jake Vanderplas, Alexan-dre Passos, David Cournapeau, Matthieu Brucher,Matthieu Perrot, and ´Edouard Duchesnay. 2011.Scikit-learn: Machine learning in Python.
Journalof Machine Learning Research , 12(Oct):2825–2830.arco Pennacchiotti and Ana-Maria Popescu. 2011. Amachine learning approach to twitter user classifica-tion. In
International AAAI Conference on Web andSocial Media .Radim Rehurek and Petr Sojka. 2010. SoftwareFramework for Topic Modelling with Large Cor-pora. In
Proceedings of the LREC 2010 Workshopon New Challenges for NLP Frameworks , pages 45–50. ELRA.Sara Rosenthal and Kathleen McKeown. 2011. Ageprediction in blogs: A study of style, content, andonline behavior in pre- and post-social media genera-tions. In
Proceedings of the 49th Annual Meeting ofthe Association for Computational Linguistics: Hu-man Language Technologies , volume 1, pages 763–772.Tobias Schnabel, Igor Labutov, David Mimno, andThorsten Joachims. 2015. Evaluation methods forunsupervised word embeddings. In
Proceedingsof the 2015 Conference on Empirical Methods inNatural Language Processing , pages 298–307, Lis-bon, Portugal. Association for Computational Lin-guistics.Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky,Ilya Sutskever, and Ruslan Salakhutdinov. 2014.Dropout: a simple way to prevent neural networksfrom overfitting.
The Journal of Machine LearningResearch , 15(1):1929–1958.Duyu Tang, Bing Qin, and Ting Liu. 2015. Learning se-mantic representations of users and products for doc-ument level sentiment classification. In
Proceedingsof the 53rd Annual Meeting of the Association forComputational Linguistics and the 7th InternationalJoint Conference on Natural Language Processing(Volume 1: Long Papers) , pages 1014–1023.Svitlana Volkova, Yoram Bachrach, Michael Arm-strong, and Vijay Sharma. 2015. Inferring LatentUser Properties from Texts Published in Social Me-dia. In
AAAI Conference on Artificial Intelligence(AAAI) , Austin, TX.Svitlana Volkova, Theresa Wilson, and DavidYarowsky. 2013. Exploring demographic languagevariations to improve multilingual sentiment anal-ysis in social media. In
EMNLP 2013 - 2013Conference on Empirical Methods in NaturalLanguage Processing, Proceedings of the Confer-ence , pages 1815–1827, Seattle, Washington, USA.Association for Computational Linguistics.Jingjing Wang, Shoushan Li, Mingqi Jiang, HanqianWu, and Guodong Zhou. 2018. Cross-media userprofiling with joint textual and social user embed-ding. In
Proceedings of the 27th International Con-ference on Computational Linguistics , pages 1410–1420, Santa Fe, New Mexico, USA. Association forComputational Linguistics. Thomas Wolf, Lysandre Debut, Victor Sanh, JulienChaumond, Clement Delangue, Anthony Moi, Pier-ric Cistac, Tim Rault, R’emi Louf, Morgan Funtow-icz, and Jamie Brew. 2019. Huggingface’s trans-formers: State-of-the-art natural language process-ing.
ArXiv , abs/1910.03771.Zach Wood-Doughty, Michael Smith, David Bronia-towski, and Mark Dredze. 2017. How does Twitteruser behavior vary across demographic groups? In
Proceedings of the Second Workshop on NLP andComputational Social Science , pages 83–89.Zach Wood-Doughty, Paiheng Xu, Xiao Liu, andMark Dredze. 2020. Using noisy self-reports topredict twitter user demographics. arXiv preprintarXiv:2005.00635 .Fangzhao Wu and Yongfeng Huang. 2016. Person-alized microblog sentiment classification via multi-task learning. In
Proceedings of the Thirtieth AAAIConference on Artificial Intelligence , AAAI’16,page 3059–3065. AAAI Press.Zhen Wu, Xin-Yu Dai, Cunyan Yin, Shujian Huang,and Jiajun Chen. 2018. Improving review repre-sentations with user attention and product atten-tion for sentiment classification. arXiv preprintarXiv:1801.07861 .Linzi Xing and Michael J. Paul. 2017. Incorporatingmetadata into content-based user embeddings. In
Proceedings of the 3rd Workshop on Noisy User-generated Text , pages 45–49.Yi Yang and Jacob Eisenstein. 2017. Overcoming lan-guage variation in sentiment analysis with social at-tention.
Transactions of the Association for Compu-tational Linguistics , 5:295–307.Yelp. 2018. Yelp Dataset Challenge.Zhigang Yuan, Fangzhao Wu, Junxin Liu, Chuhan Wu,Yongfeng Huang, and Xing Xie. 2019. Neural re-view rating prediction with user and product mem-ory. In
Proceedings of the 28th ACM InternationalConference on Information and Knowledge Manage-ment , CIKM ’19, page 2341–2344, New York, NY,USA.Xingshan Zeng, Jing Li, Lu Wang, and Kam-Fai Wong.2019. Joint effects of context and user history forpredicting online conversation re-entries. In
Pro-ceedings of the ACL , pages 2809–2818.Lei Zhang, Xiaolei Huang, Tianli Liu, Ang Li, Zhenx-iang Chen, and Tingshao Zhu. 2015. Using linguis-tic features to estimate suicide probability of chinesemicroblog users. In
Human Centered Computing ,pages 549–559, Cham. Springer International Pub-lishing.Wanru Zhang, Andrew Caines, Dimitrios Alikaniotis,and Paula Buttery. 2016. Predicting author age fromWeibo microblog posts. In