Characterizing English Variation across Social Media Communities with BERT
Li Lucy and
David Bamman
University of California, Berkeley {lucy3_li, dbamman}@berkeley.edu
Abstract
Much previous work characterizing language variation across Internet social groups has focused on the types of words used by these groups. We extend this type of study by employing BERT to characterize variation in the senses of words as well, analyzing two months of English comments in 474 Reddit communities. The specificity of different sense clusters to a community, combined with the specificity of a community's unique word types, is used to identify cases where a social group's language deviates from the norm. We validate our metrics using user-created glossaries and draw on sociolinguistic theories to connect language variation with trends in community behavior. We find that communities with highly distinctive language are medium-sized, and their loyal and highly engaged users interact in dense networks.
1 Introduction

Internet language is often popularly characterized as a messy variant of "standard" language (Desta, 2014; Magalhães, 2019). However, work in sociolinguistics has demonstrated that online language is not homogeneous (Herring and Paolillo, 2006; Nguyen et al., 2016; Eisenstein, 2013). Instead, it expresses immense amounts of variation, often driven by social variables. Online language contains lexical innovations, such as orthographic variants, but also repurposes words with new meanings (Pei et al., 2019; Stewart et al., 2017). There has been much attention on which words are used across these social groups, including work examining the frequency of types (Zhang et al., 2017; Danescu-Niculescu-Mizil et al., 2013). However, there is also increasing interest in how words are used in these online communities as well, including variation in meaning (Yang and Eisenstein, 2017; Del Tredici and Fernández, 2017). For example, a word such as python in Figure 1 has different usages depending on the community in which it is used.
Figure 1: Different online communities may systematically use the same word to mean different things. Each marker on the t-SNE plot is a BERT embedding of python, case insensitive, in r/cscareerquestions (where it refers to the programming language) and r/elitedangerous (where it refers to a type of spacecraft). BERT also predicts different substitutes when python is masked out in these communities' comments.

Our work examines both lexical and semantic variation, and operationalizes the study of the latter using BERT (Devlin et al., 2019). Social media language is an especially interesting domain for studying lexical semantics because users' word use is far more dynamic and varied than is typically captured in standard sense inventories like WordNet. Online communities that sustain linguistic norms have been characterized as virtual communities of practice (Eckert and McConnell-Ginet, 1992; Del Tredici and Fernández, 2017; Nguyen and Rosé, 2011). Users may develop wiki pages, or guides, for their communities that outline specific jargon and rules. However, some communities exhibit more language variation than others. One central goal in sociolinguistics is to investigate what social factors lead to variation, and how they relate to the growth and maintenance of sociolects, registers, and styles. To enable our ability to answer these types of questions from a computational perspective, we must first develop metrics for measuring variation.

Our work quantifies how much the language of an online community deviates from the norm and identifies communities that contain unique language varieties. We define community-specific language in two ways, one based on word choice variation, and another based on meaning variation using BERT. Words used with community-specific senses match words that appear in glossaries created by users for their communities. Finally, we test several hypotheses about user-based attributes of online English varieties drawn from sociolinguistics literature, showing that communities with more distinctive language tend to be medium-sized and have more loyal and active users in dense interaction networks. We release our code, our dataset of glossaries for 57 Reddit communities, and additional information about all communities in our study at https://github.com/lucy3/ingroup_lang.

2 Related Work

The sheer number of conversations on social media platforms allows for large-scale studies that were previously impractical using traditional sociolinguistics methods such as ethnographic interviews and surveys. Earlier work on computer-mediated communication identified the presence and growth of group norms in online settings (Postmes et al., 2000), and how new and veteran community members adapt to a community's changing linguistic landscape (Nguyen and Rosé, 2011; Danescu-Niculescu-Mizil et al., 2013). Much work in computational sociolinguistics has focused on lexical variation (Nguyen et al., 2016). Online language contains an abundance of "nonstandard" words, and these dynamic trends rise and decline based on social and linguistic factors (Rotabi and Kleinberg, 2016; Altmann et al., 2011; Stewart and Eisenstein, 2018; Del Tredici and Fernández, 2018; Eisenstein et al., 2014).
Online communities' linguistic norms and differences are often defined by which words are used. For example, Zhang et al. (2017) quantify the distinctiveness of a Reddit community's identity by the average specificity of its words and utterances. They define specificity as the PMI of a word in a community relative to the entire set of communities, and find that distinctive communities are more likely to retain users. To identify community-specific language, we extend Zhang et al. (2017)'s approach to incorporate semantic variation, mirroring the sense-versus-type dichotomy of language variation put forth by previous work on slang detection (Dhuliawala et al., 2016; Pei et al., 2019).

There has been less previous work on cross-community semantic variation. Yang and Eisenstein (2017) use social networks to address sentiment variation across Twitter users, accounting for cases such as sick being positive in "this sick beat" or negative in "I feel tired and sick."
Del Tredici and Fernández (2017) adapt the model of Bamman et al. (2014) for learning dialect-aware word vectors to Reddit communities discussing programming and football. They find that sub-communities for each topic share meaning conventions, but also develop their own. A line of future work suggested by Del Tredici and Fernández (2017) is extending studies on semantic variation to a larger set of communities, which our present work aims to achieve.

The strength of BERT to capture word senses presents a new opportunity to measure semantic variation in online communities of practice. BERT embeddings have been shown to capture word meaning (Devlin et al., 2019), and different senses tend to be segregated into different regions of BERT's embedding space (Wiedemann et al., 2019). Clustering these embeddings can reveal sense variation and change, where distinct senses are often represented as cluster centroids (Hu et al., 2019; Giulianelli et al., 2020). For example, Reif et al. (2019) use a nearest-neighbor classifier for word sense disambiguation, where word embeddings are assigned to the nearest centroid representing a word sense. Using BERT-base, they achieve an F1 of 71.1 on SemCor (Miller et al., 1993), beating the state of the art at that time. Part of our work examines how well the default behavior of contextualized embeddings, as depicted in Figure 1, can be used for identifying niche meanings in the domain of Internet discussions.

As online language may contain semantic innovations, our domain necessitates word sense induction (WSI) rather than disambiguation. We evaluate approaches for measuring usage or sense variation on two common WSI benchmarks, SemEval 2013 Task 13 (Jurgens and Klapaftis, 2013) and SemEval 2010 Task 14 (Manandhar and Klapaftis, 2009), which provide evaluation metrics for unsupervised sense groupings of different occurrences of words. The current state of the art in WSI clusters representations consisting of substitutes, such as those shown in Figure 1, predicted by BERT for masked target words (Amrami and Goldberg, 2018, 2019). We also adapt this method on our Reddit dataset to detect semantic variation.
3 Data

Our data is a subset of all comments on Reddit made during May and June 2019 (Baumgartner et al., 2020). Reddit is broken up into forum-based communities called subreddits, which discuss different topics, such as parenting or gaming, or target users in different social groups, such as LGBTQ+ or women. We select the top 500 most popular subreddits based on number of comments and remove subreddits that have less than 85% English comments, using the language identification method proposed by Lui and Baldwin (2012). This process yields 474 subreddits, from which we randomly sample 80,000 comments each. The number of comments per subreddit originally ranged from over 13 million to over 80 thousand, so this sampling ensures that more popular communities do not skew comparisons of word usage across subreddits. Each sampled subreddit had around 20k unique users on average, where a user is defined as a unique username associated with comments. Some Reddit users may have multiple usernames due to the creation of "throwaway" accounts (Leavitt, 2015), but we define a single user by its account username. We lowercase the text, remove URLs, and replace usernames, numbers, and subreddit names each with their own special token type. The resulting dataset has over 1.4 billion tokens.

To understand how users in these communities define and catalog their own language, we also manually gather all available glossaries of the subreddits in our dataset. These glossaries are usually written as guides for newcomers to the community and can be found in or linked from community wiki pages. We exclude glossary links that are too general and not specific to that Reddit community, such as r/tennis's link to the Wikipedia page for tennis terms. We provide the names of these communities and the links we used in our GitHub repo. Our 57 subreddit glossaries have an average of 72.4 terms per glossary, with a wide range from a minimum of 4 terms to a maximum of 251. We removed 1,044 multi-word expressions from analysis, because counting phrases would conflate the distinction we make between examining which individual words are used (type) and how they are used (meaning). We evaluate on 2,814 single-token words from these glossaries that appear in comments within their respective subreddits based on exact string matching. Since many of these words appear in multiple subreddits' glossaries, we have 2,226 unique glossary words overall.
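To make the preprocessing steps concrete, the following is a minimal sketch; the regular expressions and the special token strings (<user>, <subreddit>, <number>) are our own illustrative choices, not the exact rules used in the paper:

```python
import re

URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"\bu/[A-Za-z0-9_-]+")
SUBREDDIT_RE = re.compile(r"\br/[A-Za-z0-9_]+")
NUMBER_RE = re.compile(r"\b\d[\d.,]*\b")

def preprocess(comment: str) -> str:
    """Lowercase a Reddit comment, remove URLs, and replace usernames,
    subreddit names, and numbers with their own special token types."""
    text = comment.lower()
    text = URL_RE.sub("", text)
    text = USER_RE.sub("<user>", text)
    text = SUBREDDIT_RE.sub("<subreddit>", text)
    text = NUMBER_RE.sub("<number>", text)
    return text

print(preprocess("Saw this on r/gardening: u/plantfan counted 25 milkweeds https://imgur.com/x"))
```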
4 Community-Specific Language

4.1 Types

Much previous work on Internet language has focused on lexical choice, examining the word types unique to a community. The subreddit r/vegan, for example, uses carnis, omnis, and omnivores to refer to people who eat meat.

For our type-based analysis, we only examine words that are within the 20% most frequent in a subreddit; even though much of a community's unique language is in its long tail, words with fewer than 10 occurrences may be noisy misspellings or too rare for us to confidently determine usage patterns. To keep our vocabularies compatible with our sense-based method described in §4.2, we calculate word frequencies using the basic (non-WordPiece) tokenizer in Hugging Face's transformers library (Wolf et al., 2020; https://huggingface.co/transformers/). Following Eisenstein et al. (2014), we define frequency for a word t in a subreddit s, f_s(t), as the number of users that used it at least once in the subreddit. We experiment with several different methods for finding distinctive and salient words in subreddits.

Our first metric is the "specificity" metric used in Zhang et al. (2017) to measure the distinctiveness of words in a community. For each word type t in subreddit s, we calculate its PMI T, which we will refer to as type PMI:

T_s(t) = log( P(t|s) / P(t) ).

P(t|s) is the probability of word t in subreddit s, or P(t|s) = f_s(t) / Σ_w f_s(w), while P(t) is the probability of the word overall, or P(t) = Σ_r f_r(t) / Σ_{w,r} f_r(w).

PMI can be normalized to have values between [−1, 1], which also reduces its tendency to overemphasize low frequency events (Bouma, 2009). Therefore, we also calculate words' NPMI T*, or type NPMI:

T*_s(t) = T_s(t) / ( −log P(t, s) ).

Here, P(t, s) = f_s(t) / Σ_{w,r} f_r(w).

subreddit | word | definition | count | type NPMI
r/justnomil | fdh | "future damn husband" | 354 | 0.397
r/justnomil | jnmom | "just no mom", an annoying mother | 113 | 0.367
r/justnomil | justnos | annoying family members | 110 | 0.366
r/justnomil | jnso | "just no significant other", an annoying romantic partner | 36 | 0.345
r/gardening | clematis | a type of flower | 150 | 0.395
r/gardening | milkweed | a flowering plant | 156 | 0.389
r/gardening | perennials | plants that live for multiple years | 139 | 0.383
r/gardening | bindweed | a type of weed | 38 | 0.369
r/ps4 | siea | Sony Interactive Entertainment America | 60 | 0.373
r/ps4 | ps5 | PlayStation 5 | 892 | 0.371
r/ps4 | tlou | The Last of Us, a video game | 193 | 0.358
r/ps4 | hzd | Horizon Zero Dawn, a video game | 208 | 0.357

Table 1: Examples of words with high type NPMI scores in three subreddits. We present values for this metric because, as we will show in Section 6, it tends to perform better. The listed count is the number of unique users using that word in that subreddit.

Table 1 shows example words with high NPMI in three subreddits. The community r/justnomil, whose name means "just no mother-in-law", discusses negative family relationships, so many of its common and distinctive words refer to relatives. Words specific to other communities tend to be topical as well. The gaming community r/ps4 (PlayStation 4) uses acronyms to denote company and game entities, and r/gardening has words for different types of plants.

We also calculate term frequency–inverse document frequency (tf-idf) as a third alternative metric (Manning et al., 2008):

TFIDF_s(t) = (1 + log f_s(t)) log( N / d(t) ),

where N is the number of subreddits (474) and d(t) is the number of subreddits word t appears in.

As another metric, we examine the use of TextRank, which is commonly used for extracting keywords from documents (Mihalcea and Tarau, 2004). TextRank applies the PageRank algorithm (Brin and Page, 1998) on a word co-occurrence graph, where the resulting scores, based on words' positions in the graph, correspond to their importance in a document. For our use case, we construct a graph of unlemmatized tokens using the same parameter and model design choices as Mihalcea and Tarau (2004). This means we run PageRank on an unweighted, undirected graph of adjectives and nouns that co-occur in the same comment, using a window size of 2, a convergence threshold of 0.0001, and a damping factor of 0.85.

Finally, we also use Jensen-Shannon divergence (JSD), which has been used to identify divergent keywords in corpora such as books and social media (Lin, 1991; Gallagher et al., 2018; Pechenick et al., 2015; Lu et al., 2020). JSD is a symmetric version of Kullback–Leibler divergence, and it is preferred because it avoids assigning infinite values to words that only appear in one corpus. For each subreddit s, we compare its word probability distribution against that of a background corpus R_s containing all other subreddits in our dataset. For each token t in s, we calculate its divergence contribution as

D_s(t) = −m_s(t) log m_s(t) + (1/2)( P(t|s) log P(t|s) + P(t|R_s) log P(t|R_s) ),

where m_s(t) = ( P(t|s) + P(t|R_s) ) / 2 (Lu et al., 2020; Pechenick et al., 2015). Divergence scores are positive, and the computed score does not indicate in which corpus, s or R_s, a word is more prominent. Therefore, we label D_s(t) as negative if t's contribution comes from R_s, or if P(t|s) < P(t|R_s).
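As a concrete sketch of two of the metrics above (our own minimal implementation, not the authors' released code), the functions below compute type NPMI from per-subreddit user counts f_s(t), and the signed JSD contribution of a single token:

```python
import math
from collections import defaultdict

def type_npmi(user_counts):
    """user_counts[s][t] = f_s(t), the number of users in subreddit s who
    used word t at least once. Returns scores[s][t] = type NPMI T*_s(t)."""
    grand_total = sum(sum(c.values()) for c in user_counts.values())  # sum_{w,r} f_r(w)
    word_totals = defaultdict(float)                                  # sum_r f_r(t)
    for counts in user_counts.values():
        for t, f in counts.items():
            word_totals[t] += f
    scores = {}
    for s, counts in user_counts.items():
        sub_total = sum(counts.values())                              # sum_w f_s(w)
        scores[s] = {}
        for t, f in counts.items():
            pmi = math.log((f / sub_total) / (word_totals[t] / grand_total))
            joint = f / grand_total                                   # P(t, s)
            scores[s][t] = pmi / -math.log(joint)
    return scores

def jsd_contribution(p_s, p_rest):
    """Signed divergence contribution D_s(t) of one token, where p_s = P(t|s)
    and p_rest = P(t|R_s); negative when the contribution comes from R_s."""
    m = (p_s + p_rest) / 2
    entropy_terms = sum(p * math.log(p) for p in (p_s, p_rest) if p > 0)
    d = -m * math.log(m) + 0.5 * entropy_terms
    return d if p_s >= p_rest else -d
```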
4.2 Senses

Some words may have low scores with our type-based metrics, yet their use should still be considered community-specific. For example, the word ow is common to many subreddits, but is used as an acronym for a video game name in r/overwatch, a clothing brand in r/sneakers, and how much a movie makes in its opening weekend in r/boxoffice. We use interpretable metrics for senses, analogous to type NPMI, that allow us to compare semantic variation across communities.

Since words on social media are dynamic and niche, making them difficult to comprehensively catalog, we frame our task as word sense induction. We investigate two types of methods: one that clusters BERT embeddings, and Amrami and Goldberg (2019)'s current state-of-the-art model that clusters representatives containing word substitutes predicted by BERT (Figure 1).

The current state-of-the-art WSI model associates each example of a target word with 15 representatives, each of which is a vector composed of 20 sampled substitutes for the masked target word (Amrami and Goldberg, 2019). This method then transforms these sparse vectors with tf-idf and clusters them using agglomerative clustering, dynamically merging less probable senses with more dominant ones. In our use of this model, each example is assigned to its most probable sense based on how its representatives are distributed across sense clusters.
One version of their model uses Hearst-style patterns such as "target (or even [MASK])", instead of simply masking out the target word. We do not use dynamic patterns in our study, because these patterns assume that target words are nouns, verbs, or adjectives, and our Reddit experiments do not filter out any words based on part of speech.

As we will show, Amrami and Goldberg (2019)'s model is resource intensive on large datasets, and so we also test a more lightweight method that has seen prior application on similar tasks. Pre-trained BERT-base has demonstrated good performance on word sense disambiguation and identification using embedding distance-based techniques (Wiedemann et al., 2019; Hu et al., 2019; Reif et al., 2019; Hadiwinoto et al., 2019). The positions of dimensionality-reduced BERT representations for python in Figure 1 suggest that they are grouped based on their community-specific meaning. Our embedding-based method discretizes these hidden layer landscapes across hundreds of communities and thousands of words. We also experimented with a BERT model after domain-adaptive pretraining on our entire Reddit dataset (Han and Eisenstein, 2019; Gururangan et al., 2020), and reached similar results in our Reddit language analyses.

This method is k-means (Lloyd, 1982; Arthur and Vassilvitskii, 2007; Pedregosa et al., 2011), which has also been employed by concurrent work to track word usage change over time (Giulianelli et al., 2020). We cluster on the concatenation of the final four layers of BERT. We also tried other ways of forming embeddings, such as summing all layers (Giulianelli et al., 2020), only taking the last layer (Hu et al., 2019), and averaging all layers, but concatenating the last four performed best. There have been many proposed methods for choosing k in k-means clustering, and we experimented with several of these, including the gap statistic (Tibshirani et al., 2001) and a variant of k-means using the Bayesian information criterion (BIC) called x-means (Pelleg and Moore, 2000). The following criterion for cluster cardinality worked best on development set data (Manning et al., 2008):

k = argmin_k [ RSS(k) + γk ],

where RSS(k) is the minimum residual sum of squares for number of clusters k and γ is a weighting factor.

We also tried applying spectral clustering on BERT embeddings as a possible alternative to k-means (Jianbo Shi and Malik, 2000; von Luxburg, 2007; Pedregosa et al., 2011). Spectral clustering turns the task of clustering embeddings into a connectivity problem, where similar points have edges between them, and the resulting graph is partitioned so that points within the same group are similar to each other, while those across different groups are dissimilar. To do this, k-means is not applied directly on BERT embeddings, but instead on a projection of the similarity graph's normalized Laplacian. We use the nearest neighbors approach for creating the similarity graph, as recommended by von Luxburg (2007), since this construction is less sensitive to parameter choices than other graphs. To determine the number of clusters k, we used the eigengap heuristic:

k = argmax_k ( λ_{k+1} − λ_k ),

where λ_k for k = 1, 2, … are the smallest eigenvalues of the similarity graph's normalized Laplacian.
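The following sketch shows one way the embedding-based pipeline could be implemented with Hugging Face's transformers and scikit-learn: embed each occurrence as the concatenation of BERT-base's final four hidden layers (averaging wordpiece vectors), then pick k by minimizing RSS(k) + γk. The function names and the k_max bound are our own assumptions, not the paper's exact code:

```python
import torch
from transformers import BertTokenizerFast, BertModel
from sklearn.cluster import KMeans

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def embed_occurrence(sentence, start, end):
    """Embed the target span sentence[start:end] by concatenating BERT's
    final four hidden layers and averaging over the target's wordpieces."""
    enc = tokenizer(sentence, return_tensors="pt", truncation=True,
                    return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        hidden = model(**enc).hidden_states           # 13 layers for BERT-base
    layers = torch.cat(hidden[-4:], dim=-1)[0]        # (seq_len, 4 * 768)
    idx = [i for i, (a, b) in enumerate(offsets)
           if a < end and b > start and b > a]         # wordpieces inside the span
    return layers[idx].mean(dim=0).numpy()

def choose_k(X, gamma=10000, k_max=10):
    """Select k minimizing RSS(k) + gamma * k; KMeans.inertia_ is the RSS."""
    return min((KMeans(n_clusters=k, init="k-means++", n_init=10).fit(X)
                for k in range(1, k_max + 1)),
               key=lambda km: km.inertia_ + gamma * km.n_clusters)
```

As described in the next section, γ = 10000 is the value the paper selects on SemEval 2010 data.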
5 Evaluating Word Sense Induction

We develop and evaluate word sense induction models using SemEval WSI tasks in a manner that is designed to parallel their later use on larger Reddit data.

In SemEval 2010 Task 14 (Manandhar and Klapaftis, 2009) and SemEval 2013 Task 13 (Jurgens and Klapaftis, 2013), models are evaluated based on how well predicted sense clusters for different occurrences of a target word align with gold sense clusters.

Amrami and Goldberg (2019)'s performance scores reported in their paper were obtained from running their model directly on test set data for the two SemEval tasks, which typically had fewer than 150 examples per word. However, these tasks were released as multi-phase tasks and provide both training and test sets (Jurgens and Klapaftis, 2013; Manandhar and Klapaftis, 2009), and our study requires methods that can scale to larger datasets. Some words in our Reddit data appear very frequently, making it too memory-intensive to cluster all of their embeddings or representatives at once (for example, the word pass appears over 96k times). It is more feasible to learn senses from a fixed number of examples, and then match remaining examples to these senses. We evaluate how well induced senses generalize to new examples using separate train and test sets.

We tune parameters for models using SemEval 2010 Task 14. In this task, the test set contains 100 target noun and verb lemmas, where each occurrence of a lemma is labeled with a single sense (Manandhar and Klapaftis, 2009). We use WSI models to first induce senses for 500 randomly sampled training examples, and then match test examples to these senses. There are a few lemmas in SemEval 2010 that occur fewer than 500 times in the training set, in which case we use all instances. We also evaluate the top-performing versions of each model on SemEval 2013 Task 13, after clustering 500 instances of each noun, verb, or adjective lemma in their training corpus, ukWaC (Jurgens and Klapaftis, 2013; Baroni et al., 2009). In SemEval 2013 Task 13, each occurrence of a word is labeled with multiple senses, but we evaluate and report past scores using their single-sense evaluation key, where each word is mapped to one sense.

For the substitution-based method, we match test examples to clusters by pairing representatives with the sense label of their nearest neighbor in the training set. We found that Amrami and Goldberg (2019)'s default model is sensitive to the number of examples clustered. The majority of target words in the test data for the two SemEval tasks on which this model was developed have fewer than 150 examples. When this same model is applied on a larger set of 500 examples, the vast majority of examples often end up in a single cluster, leading to low or zero-value V-Measure scores for many words. To mitigate this problem, we experimented with different values for the upper bound on the number of clusters c, ranging from 10 to 35 in increments of 5. This upper bound determines the distance threshold for flattening dendrograms, where allowing more clusters lowers these thresholds and breaks up large clusters. We found c = 25 produces the best SemEval 2010 results for our training set size, and use it for our Reddit experiments as well.

For the k-means embedding-based method, we match test examples to the nearest centroid representing an induced sense using cosine distance. During training, we initialize centroids using k-means++ (Arthur and Vassilvitskii, 2007). We experimented with different values of the weighting factor γ ranging from 1000 to 20000 on SemEval 2010, and choose γ = 10000 for our experiments on Reddit data. Preliminary experiments suggest that this method is less sensitive to the number of training examples, where directly clustering SemEval 2010's smaller test set led to similar results with the same parameters.
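For concreteness, the centroid-matching step can be sketched as follows (our own minimal version, not the authors' code):

```python
import numpy as np

def match_to_sense(embedding, centroids):
    """Return the index of the induced sense centroid closest to `embedding`
    by cosine distance; `centroids` is a (k, d) array of sense centroids."""
    emb = embedding / np.linalg.norm(embedding)
    cents = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return int(np.argmax(cents @ emb))  # max cosine similarity = min cosine distance
```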
For the spectral embedding-based method, we match a test example to a cluster by assigning it the label of its nearest training example. To construct the K-nearest neighbor similarity graph during training, we experimented with different K around log(n), where for n = 500, K ∼ 6 (von Luxburg, 2007; Brito et al., 1997). For K = 6, …, we found that K = 7 worked best, though performance scores on SemEval 2010 for all other K were still within one standard deviation of K = 7's average across multiple runs.

The bolded rows of Table 2 and Table 3 show performance scores of these models using our evaluation setup, compared against scores reported in previous work. These results show that for embedding-based WSI, k-means works better than spectral clustering. In addition, clustering BERT embeddings performs better than most methods, but not as well as clustering substitution-based representatives. The single-sense scores for Amrami and Goldberg (2019) are not reported in their paper; to generate these scores, we ran the default model in their code base directly on the test set using SemEval 2013's single-sense evaluation key, reporting average performance over ten runs.

Model | F Score | V Measure | Average
MFS | … | … | …
BERT embeddings, k-means, γ = 10000 | … | … | …
BERT embeddings, spectral, K = 7 | … | … | …
BERT substitutes, Amrami and Goldberg (2019), c = 25 | … | … | …

Table 2: SemEval 2010 Task 14 unsupervised evaluation results with two measures, F Score and V Measure, and their geometric mean. MFS is the most frequent sense baseline, where all instances are assigned to the most frequent sense. Standard deviations over five runs are in parentheses. Bolded models use our train and test evaluation setup.
Model | NMI | B-Cubed | Average
BERT embeddings, k-means, γ = 10000 | … | … | …
BERT embeddings, spectral, K = 7 | … | … | …
BERT substitutes, Amrami and Goldberg (2019), c = 25 | … | … | …

Table 3: SemEval 2013 Task 13 single-sense evaluation results with two measures, NMI and B-Cubed, and their geometric mean. Standard deviations over five runs are in parentheses. Bolded models use our train and test evaluation setup.
Model | Clustering per word | Matching per subreddit
BERT embeddings, γ = 10000 | … | …
Amrami and Goldberg (2019)'s BERT substitutes, c = 25 | … | …

Table 4: The models' median time clustering 500 examples of each word, and their median time matching all words in a subreddit to senses.

We apply the k-means embedding-based method and Amrami and Goldberg (2019)'s substitution-based method to Reddit, with the parameters that performed best on SemEval 2010 Task 14. We induce senses for a vocabulary of 13,240 non-lemmatized tokens, including punctuation, that occur often enough for us to gain a strong signal of semantic deviation from the norm. These are non-emoji tokens that are very common in a community (in the top 10% most frequent tokens of a subreddit), frequent enough to be clustered (appear at least 500 times overall), and also used broadly (appear in at least 350 subreddits). When clustering BERT embeddings, to gain the representation for a token split into wordpieces, we average their vectors. With each WSI method, we induce senses using 500 randomly sampled comments containing the target token. To avoid sampling repeated comments written by bots, we disregarded comments where the context window around a target word (five tokens to the left and five tokens to the right) repeats 10 or more times. Then, we match all occurrences of words in our selected vocabulary to their closest sense, as described earlier.

Though the embedding-based method has lower performance than the substitution-based one on SemEval WSI tasks, the former is an order of magnitude faster and more efficient to scale (Table 4). We used a Tesla K80 GPU for the majority of these experiments, but we used a TITAN Xp GPU for three of the 474 subreddits for the substitution-based method. During the training phase of clustering, both models learn sense clusters for each word by making a single pass over that word's set of examples; we then match every vocab word in a subreddit to its appropriate cluster. While the substitution-based method is 1.7 times slower than the embedding-based method during the training phase, it becomes 47.9 times slower during the matching phase. The particularly large difference in runtime is due to the substitution-based method's need to run BERT multiple times for each sentence (in order to individually mask each vocab word in the sentence), while the embedding-based method passes over each sentence once. We also noticed that the substitution-based method sometimes created very small clusters, which often led to very rare senses (e.g. occurring fewer than 5 times overall).

After assigning words to senses using a WSI model, we calculate the NPMI of a sense n in subreddit s, counting each sense once per user:

S_s(n) = log( P(n|s) / P(n) ) / ( −log P(n, s) ),

where P(n|s) is the probability of sense n in subreddit s, P(n, s) is the joint probability of n and s, and P(n) is the probability of sense n overall.

A word may map to more than one sense, so to determine if a word t has a community-specific sense in subreddit s, we use the NPMI of the word's most common sense in s. We refer to this value as the sense NPMI, or M_s(t). We calculate these scores using both the embedding-based method, denoted as M*_s(t), and the substitution-based method, denoted as M†_s(t).

These two sense NPMI metrics tend to score words very similarly across subreddits, with an overall Pearson's correlation of 0.921 (p < …). Words that have high NPMI with one model also tend to have high NPMI with the other (Table 5).

subreddit | word | M† | M* | T* | subreddit's sense example | other sense example
r/elitedangerous | python | 0.383 | 0.347 | 0.286 | "Get a Python, stuff it with passenger cabins..." | "I self taught some Python over the summer..."
r/fashionreps | haul | 0.374 | 0.408 | 0.358 | "Plan your first haul, don't just buy random nonsense..." | "...discipline is the long haul of getting it done..."
r/libertarian | nap | 0.370 | 0.351 | 0.185 | "The nap is just a social contract." | "Move bedtime earlier to compensate for no nap..."
r/90dayfiance | nickel | 0.436 | 0.302 | 0.312 | "Nickel really believes that Azan loves her." | "...raise burrito prices by a nickel per month..."
r/watches | dial | 0.461 | 0.463 | 0.408 | "...the dial has a really nice texturing..." | "...you didn't have to dial the area code..."

Table 5: Examples of words where both the embedding-based and substitution-based WSI models result in a high sense NPMI score in the listed subreddit. Each row includes example contexts from comments illustrating the subreddit-specific sense and a different sense pulled from a different subreddit.
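Computing S_s(n) mirrors the type NPMI computation, now over per-user sense counts; below is a minimal sketch with a hypothetical input structure:

```python
import math

def sense_npmi(sense_user_counts, s, n):
    """sense_user_counts[r][m] = number of users in subreddit r who used
    sense m at least once. Returns S_s(n) for sense n in subreddit s."""
    grand_total = sum(sum(c.values()) for c in sense_user_counts.values())
    p_n = sum(c.get(n, 0) for c in sense_user_counts.values()) / grand_total
    f_ns = sense_user_counts[s][n]          # assumes sense n occurs in s
    p_n_given_s = f_ns / sum(sense_user_counts[s].values())
    p_ns = f_ns / grand_total               # joint P(n, s)
    return math.log(p_n_given_s / p_n) / -math.log(p_ns)
```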
There are some disagreements, such as the scores for flu in r/keto, which does not refer to influenza but instead refers to symptoms associated with starting a ketogenic diet (M* = …, M† = …). Still, both metrics place r/keto's flu in the 98th percentile of scored words. Thus, for large datasets, it would be worthwhile to use the embedding-based method instead of the state-of-the-art substitution-based method to save substantial time and computing resources and yield similar results.

Some of the words with high sense NPMI in Table 5, such as haul (a set of purchased products) and dial (a watch face), have well documented meanings in WordNet or the Oxford English Dictionary that are especially relevant to the topic of the community. Others are less standard, including python to refer to a ship in a game, nap as an acronym for "non-aggression principle", and Nickel as a fan-created nickname for a character named Nicole in a reality TV show.
Some terms have low M across most subreddits, such as the period punctuation mark (average M* = −…, M† = −…).

6 Validating Metrics with Glossaries

To provide additional validation for our metrics, we examine how they score words listed in user-created subreddit glossaries (as described in §3). New members may spend 8 to 9 months acquiring a community's linguistic norms (Nguyen and Rosé, 2011), and some Reddit communities have such distinctive language that their posts can be difficult for outsiders to understand. This makes the manual annotation of linguistic norms across hundreds of communities difficult, and so for the purposes of our study, we use user-created glossaries to provide context for what our metrics find. Still, glossaries only contain words deemed by a few users to be important for their community, and the lack of labeled negative examples inhibits their use in a supervised machine learning task. Therefore, we focus on whether glossary words, on average, tend to have high scores using our methods.
PMI ( T ) 0.0938 2.7539 0.2088 5.0063 18.13NPMI ( T ∗ ) 0.4823 0.1793 0.0131 0.3035 22.30TFIDF 0.2060 0.5682 0.0237 3.0837 16.76TextRank 0.0616 6.95e-5 7.90e-5 0.0002 24.91JSD 0.2644 2.02e-5 2.44e-7 5.60e-05 29.07 sense BERT substitutes ( M † ) 0.2635 0.1165 0.0143 0.1745 28.75BERT embeddings ( M ∗ ) 0.3067 0.1304 0.0208 0.1799 30.73 Table 6: This table compares how each metric for quantifying community-specific language handles words inuser-created subreddit glossaries. The 98th percentile cutoff for all words are calculated for each metric using allscores across all subreddits. The % of glossary words is based on the fraction of glossary words with calculatedscores for each metric. fore, we focus on whether glossary words, on av-erage, tend to have high scores using our methods.Table 6 shows that glossary words have highermedian scores than non-glossary words for all lis-ted metrics (U-tests, p < . ). In addition, asubstantial percentage of glossary words are in the98th percentile of scored words for each metric.To see how highly our metrics tend to scoreglossary terms, we calculate their mean reciprocalrank (MRR), an evaluation metric often used toevaluate query responses (Voorhees, 1999):mean reciprocal rank = 1 G G (cid:88) i =1 rank i , where rank i is the rank position of the highestscored glossary term for a subreddit and G is thenumber of subreddits with glossaries. Mean recip-rocal rank ranges from 0 to 1, where 1 would meana glossary term is the highest scored word for allsubreddits.We have five different possible metrics for scor-ing community-specific word types: type PMI,type NPMI, tf-idf, TextRank, and JSD. Of these,TextRank has the lowest MRR, but still scores acompetitive percentage of glossary words in the98th percentile. This is because the TextRank al-gorithm only determines how important a word iswithin each subreddit, without any comparison toother subreddits to determine how a word’s fre-quency in a subreddit differs from the norm. TypeNPMI has the highest MRR, followed by JSD.Though JSD has more glossary words in the 98thpercentile than type NPMI, we notice that manyhigh-scoring JSD terms include words that have avery different probability in a subreddit comparedto the rest of Reddit, but are not actually distinct-ive to that subreddit. For example, in r/justnomil,words such as husband , she , and her are within Figure 2: Normalized distributions of type NPMI ( T ∗ )and sense NPMI ( M ∗ ) for words in subreddits withuser-created glossaries. The top graph involves 2,184glossary words and 431,773 non-glossary words, andthe bottom graph involves 807 glossary words and194,700 non-glossary words. Glossary words tend tohave higher scores than non-glossary words. the top 10 ranked words by JSD score. This con-trasts the words in Table 1 with high NPMI scoresthat are more unique to r/justnomil’s vocabulary.Therefore, for the remainder of this paper, we fo-cus on NPMI as our type-based metric for meas-uring lexical variation.Figure 2 shows the normalized distributionsof type NPMI and sense NPMI. Though gloss-ary words tend to have higher NPMI scores thannon-glossary words, there is still overlap betweenthe two distributions, where some glossary wordsave low scores and some non-glossary wordshave high ones. Sometimes this is because manyglossary words with low type NPMI instead havehigh sense NPMI. For example, the glossary word envy in r/competitiveoverwatch refers to an esportsteam and has low type NPMI ( T ∗ = 0 . )but sense NPMI in the 98th percentile ( M ∗ =0 . 
We have five different possible metrics for scoring community-specific word types: type PMI, type NPMI, tf-idf, TextRank, and JSD. Of these, TextRank has the lowest MRR, but still scores a competitive percentage of glossary words in the 98th percentile. This is because the TextRank algorithm only determines how important a word is within each subreddit, without any comparison to other subreddits to determine how a word's frequency in a subreddit differs from the norm. Type NPMI has the highest MRR, followed by JSD. Though JSD has more glossary words in the 98th percentile than type NPMI, we notice that many high-scoring JSD terms are words that have a very different probability in a subreddit compared to the rest of Reddit, but are not actually distinctive to that subreddit. For example, in r/justnomil, words such as husband, she, and her are within the top 10 ranked words by JSD score. This contrasts with the words in Table 1, whose high NPMI scores mark terms more unique to r/justnomil's vocabulary. Therefore, for the remainder of this paper, we focus on NPMI as our type-based metric for measuring lexical variation.

Figure 2: Normalized distributions of type NPMI (T*) and sense NPMI (M*) for words in subreddits with user-created glossaries. The top graph involves 2,184 glossary words and 431,773 non-glossary words, and the bottom graph involves 807 glossary words and 194,700 non-glossary words. Glossary words tend to have higher scores than non-glossary words.

Figure 2 shows the normalized distributions of type NPMI and sense NPMI. Though glossary words tend to have higher NPMI scores than non-glossary words, there is still overlap between the two distributions, where some glossary words have low scores and some non-glossary words have high ones. Sometimes this is because many glossary words with low type NPMI instead have high sense NPMI. For example, the glossary word envy in r/competitiveoverwatch refers to an esports team and has low type NPMI (T* = …) but sense NPMI in the 98th percentile (M* = …, M† = …). Only 21 glossary terms, such as aha, a popular type of skin exfoliant in r/skincareaddiction, are both in the 98th percentile of T* and the 98th percentiles of M* and M† scores. Thus, examining variation in the meaning of broadly used words provides a complementary metric to counting distinctive word types, and overall provides a more comprehensive understanding of community-specific language.

Other cases of overlap are due to model error. Manual inspection reveals that some glossary words that actually have unique senses have low M scores. Sometimes a WSI method splits a glossary term in a community into too many senses or fails to disambiguate different meanings. For example, the glossary word spawn in r/childfree refers to children, but the embedding-based method assigns it to the same sense used in gaming communities, where it instead refers to the creation of characters or items. As another example of a failure case, the substitution-based method splits the majority of occurrences of rep, an exercise movement, in r/bodybuilding into two large but separate senses. Though new methods using BERT have led to performance boosts, WSI is still a challenging task.

The use of glossaries in our study has several limitations. Some non-glossary terms have high scores because glossaries are not comprehensive. For example, dips (M* = …, M† = …) is not listed in r/fitness's glossary, but it regularly refers to a type of exercise. This suggests the potential of our methods for uncovering possible additions to these glossaries. The vast majority of glossaries contain community-specific words, but a few also include common Internet terms that have low values across all metrics, such as lol, imo, and fyi. In addition, only 71.12% of all single-token glossary words occurred often enough to have scores calculated for them. Some words are relevant to the topic of the community (e.g. christadelphianism in r/christianity), but are actually rarely used in discussions. We do not compute scores for rarely-occurring words, so they are excluded from our results. Despite these limitations, however, user-created glossaries are valuable resources for outsiders to understand the terminology used in niche online communities, and they offer one of the only sources of in-domain validation for these methods.

7 Analysis of Communities

In this section, we investigate how language variation relates to characteristics of users and communities in our dataset. For these analyses, we use the metrics that aligned the most with user-created glossaries (Table 6): T* for lexical variation and M* for semantic variation. We define F, or the distinctiveness of a community's language variety, as the fraction of unique words in the community's top 20% most frequent words that have T* or M* in the 98th percentile of all scores for each metric. That is, a word in a community is counted as a "community-specific word" if its T* or its M* exceeds the respective 98th-percentile cutoff. Though in the following subsections we report numerical results using these cutoffs, the U-tests for community-level attributes and F are statistically significant (p < …) for cutoffs as low as the 50th percentile.

7.1 User Behavior

Online communities differ from those in the offline world due to increased anonymity of the speakers and a lack of face-to-face interactions. However, the formation and survival of online communities still tie back to social factors. One central goal of our work is to see what behavioral characteristics a community with unique language tends to have.
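A sketch of how F and the attribute comparisons in the next subsection could be computed; the cutoff values and input arrays are assumptions:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def distinctiveness(top_words, t_star, m_star, t_cut, m_cut):
    """F: fraction of a community's top-20% words whose type NPMI (t_star)
    or sense NPMI (m_star) exceeds the 98th-percentile cutoff."""
    hits = [w for w in top_words
            if t_star.get(w, 0.0) > t_cut or m_star.get(w, 0.0) > m_cut]
    return len(hits) / len(top_words)

def compare_groups(attribute, F):
    """U-test on a z-scored attribute between equal-sized high-F and
    low-F halves of the communities."""
    z = (attribute - attribute.mean()) / attribute.std()
    order = np.argsort(F)
    low, high = z[order[:len(z) // 2]], z[order[len(z) // 2:]]
    return mannwhitneyu(high, low)
```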
We examine four user-based attributes of subreddits: community size, user activity, user loyalty, and network density. We calculate values corresponding to these attributes using the entire, unsampled dataset of users and comments. For each of these user-based attributes, we propose and test hypotheses on how they relate to how much a community's language deviates from the norm. Some of these hypotheses are pulled from established sociolinguistic theories previously developed using offline communities and interactions, and we test their conclusions in our large-scale, digital domain. We construct U-tests for each attribute after z-scoring them across subreddits, comparing subreddits separated into two equal-sized groups of high and low F.

Figure 3: Community size, user activity, user loyalty, and network density all relate to the distinctiveness of a community's language, which is the fraction of words with type NPMI or sense NPMI scores in the 98th percentile. Each point on each plot represents one Reddit community. For clarity, axis limits are slightly cropped to omit extreme outliers.

Del Tredici and Fernández (2018), when choosing communities for their study, claim that "small-to-medium sized" communities would be more likely to have lexical innovations. We define community size to be the number of unique users in a subreddit, and find that large communities tend to have less community-specific language (p < …, Figure 3). Communities need to reach a "critical mass" to sustain meaningful interactions, but very large communities such as r/askreddit and r/news may suffer from communication overload, leading to simpler and shorter replies by users and fewer opportunities for group identity to form (Jones et al., 2004). We also collected subscriber counts from the last post of each subreddit made in our dataset's timeframe, and found that communities with more subscribers have lower F (p < …), and communities with a higher ratio of subscribers to commenters also have lower F (p < …). Multiple subreddits were outliers with extremely large subscriber counts, perhaps due to past users being auto-subscribed to default communities or historical popularity spikes. Future work could look into more refined methods of estimating the number of users who browse but do not comment in communities (Sun et al., 2014).

Active communities of practice require regular interaction among their members (Holmes and Meyerhoff, 1999; Wenger, 2000). Our metric for measuring user activity is the average number of comments per user in that subreddit, and we find that communities with more community-specific language have more active users (p < …, Figure 3). However, within each community, we did not find significant or meaningful correlations between a user's number of comments in that community and the probability of them using a community-specific word.

Speakers with more local engagement tend to use more vernacular language, as it expresses local identity (Eckert, 2012; Bucholtz and Hall, 2005). Our proxy for measuring this kind of engagement is the fraction of loyal users in a community, where loyal users are those who have at least 50% of their comments in that particular subreddit. We use the definition of user loyalty introduced by Hamilton et al. (2017), filtering out users with fewer than 10 comments and counting only top-level comments. Communities with more community-specific language have more loyal users, which extends Hamilton et al. (2017)'s conclusion that loyal users value collective identity (p < …, Figure 3).
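A sketch of the loyalty definition just described (Hamilton et al., 2017), under the stated filters; the input format is our own assumption:

```python
from collections import Counter

def loyal_user_fraction(comments, subreddit):
    """comments: iterable of (username, subreddit, is_top_level) tuples.
    A user is loyal to `subreddit` if, counting only top-level comments and
    ignoring users with fewer than 10 of them, at least 50% are in it."""
    totals, in_sub = Counter(), Counter()
    for user, sub, top_level in comments:
        if not top_level:
            continue                       # count only top-level comments
        totals[user] += 1
        if sub == subreddit:
            in_sub[user] += 1
    eligible = [u for u in in_sub if totals[u] >= 10]
    loyal = [u for u in eligible if in_sub[u] / totals[u] >= 0.5]
    return len(loyal) / len(eligible) if eligible else 0.0
```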
We also found that in 93% of all communities, loyal users had a higher probability of using a word with M* in the 98th percentile than a nonloyal user (U-test, p < …), and in 90% of all communities, loyal users had a higher probability of using a word with T* in the 98th percentile (U-test, p < …). Thus, users who use Reddit mostly to interact in a single community demonstrate deeper acculturation into the language of that community.

A speech community is driven by the density of its communication, and dense networks enforce shared norms (Guy, 2011; Milroy and Milroy, 1992; Sharma and Dodsworth, 2020). Previous studies of face-to-face social networks may define edges using friend or familial ties, but Reddit interactions can occur between strangers. For network density, we calculate the density of the undirected direct-reply network of a subreddit based on comment threads: an edge exists between two users if one replies to the other. Following Hamilton et al. (2017), we only consider the top 20% of users when constructing this network. Denser communities exhibit more community-specific language (p < …, Figure 3). Previous work using ethnography and friendship naming data has shown that a speaker's position in a social network is sometimes reflected in the language they use, where individuals on the periphery adopt less of the vernacular of a social group compared to those in the core (Labov, 1973; Milroy, 1987; Sharma and Dodsworth, 2020). To see whether users' positions in Reddit direct-reply networks show a similar phenomenon, we use Cohen et al. (2014)'s method to approximate users' closeness centrality (ε = 10^−…, k = 5000). Within each community, we did not find a meaningful correlation between closeness centrality and the probability of a user using a community-specific word. This finding suggests that conversation networks on Reddit may not convey a user's degree of belonging to a community in the same manner as relationship networks in the physical world.

Figure 4: A comparison of sense and type variation across subreddits, where each marker is a subreddit. The x-axis is the fraction of words with M* in the 98th percentile, and the y-axis is the fraction of words with T* in the 98th percentile. The subreddit r/transcribersofreddit, which had an unusually high fraction of words with high T* (0.4101), was cropped out for visual clarity.

The four attributes we examine also have significant relationships with language variation when F is separated out into its two lexical and semantic components (the fraction of words with high T* and the fraction of words with high M*). In other words, the patterns in Figure 3 persist when counting only unique word types and when counting only unique meanings. This is because communities with greater lexical distinctiveness also exhibit greater semantic variation (Spearman's r_s = …, p < …, Figure 4). So, communities with strong linguistic identities express both types of variation. Further causal investigations could reveal whether the same factors, such as users' need for efficiency and expressivity, produce both unique words and unique meanings (Blank, 1999).
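A sketch of the direct-reply network density computation described above, using networkx; restricting to the top 20% of users is assumed to happen when building `top_users`:

```python
import networkx as nx

def reply_network_density(reply_pairs, top_users):
    """Density of the undirected direct-reply network: an edge connects two
    users if one replied to the other; only `top_users` are included."""
    g = nx.Graph()
    g.add_nodes_from(top_users)
    g.add_edges_from((a, b) for a, b in reply_pairs
                     if a in top_users and b in top_users and a != b)
    return nx.density(g)
```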
7.2 Community Topics

Figure 5: A bar plot showing the average F of subreddits in different topics. "DASW" stands for the "Disgusting/Angering/Scary/Weird" category. Error bars are 95% confidence intervals.

Language varieties can be based on interest or occupation (Fishman, 1972; Lewandowski, 2010), so we also examine what topics tend to be discussed by communities with distinctive language (Figure 5). We use r/ListofSubreddit's categorization of subreddits, focusing on the 474 subreddits in our study. This categorization is hierarchical, and we choose a level of granularity such that each topic contains at least five of our subreddits. Video Games, TV, Sports, Hobbies/Occupations, and Technology tend to have more community-specific language. These communities often discuss a particular subset of the overall topic, such as a specific hobby or video game, which is rich with technical terminology. For example, r/mechanicalkeyboards (F = …) is categorized under Hobbies/Occupations. Its highly community-specific words include keyboard stores (e.g. kprepublic), types of keyboards (e.g. ortholinear), and keyboard components (e.g. pudding, reds).

7.3 Regression Analysis

Finally, we run ordinary least squares regressions with attributes of Reddit communities as features and communities' F scores as the dependent variable. The first model has only user-based attributes as features, while the second includes a topic-related feature. These experiments help us untangle whether the topic discussed in a community has a greater impact on linguistic distinctiveness than the behaviors of the community's users. For the topic variable, we code the value as 1 if the community belongs to a topic identified as having high F (Technology, TV, Video Games, Hobbies/Occ., Sports, or Other), and 0 otherwise.

Dependent variable: F | (1) | (2)
intercept | 0.0318*** (0.001) | 0.0318*** (0.001)
community size | -0.0050*** (0.001) | -0.0042*** (0.001)
user activity | 0.0181*** (0.001) | 0.0179*** (0.001)
user loyalty | 0.0178*** (0.001) | 0.0162*** (0.001)
network density | -0.0091*** (0.001) | -0.0091*** (0.001)
topic | | 0.0057*** (0.001)
Observations | 474 | 474
R² | … | …

Note: * p < …, ** p < …, *** p < …

Table 7: Ordinary least squares regression results for the effect of various community attributes on the fraction of community-specific words used in each community.

Once we account for other user-based attributes, higher network density actually has a negative effect on variation (Table 7), suggesting that its earlier marginal positive effect is due to the presence of correlated features. We find that even when a community discusses a topic that tends to have high amounts of community-specific language, attributes related to user behavior still have a bigger and more significant relationship with language use, with similar coefficients for those variables between the two models. This suggests that who is involved in a community matters more than what these community members discuss.
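The two regressions can be reproduced in outline with statsmodels; the data below are synthetic stand-ins for the z-scored community attributes and F scores, not the paper's actual values:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 474                                    # number of subreddits

df = pd.DataFrame({
    "community_size": rng.standard_normal(n),
    "user_activity": rng.standard_normal(n),
    "user_loyalty": rng.standard_normal(n),
    "network_density": rng.standard_normal(n),
    "topic": rng.integers(0, 2, n),        # 1 if in a high-F topic, else 0
})
# Synthetic dependent variable standing in for the real F scores.
df["F"] = 0.03 + 0.01 * df["user_activity"] + 0.005 * rng.standard_normal(n)

user_features = ["community_size", "user_activity", "user_loyalty", "network_density"]
for cols in (user_features, user_features + ["topic"]):   # models (1) and (2)
    fit = sm.OLS(df["F"], sm.add_constant(df[cols])).fit()
    print(fit.params.round(4))
```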
8 Ethical Considerations

The Reddit posts and comments in our study are accessible by the public and were crawled by Baumgartner et al. (2020). Our project was deemed exempt from IRB review for human subjects research by the relevant administrative office at our institution. Even so, there are important ethical considerations to take when using social media data (franzke et al., 2020; Webb et al., 2017). Users on Reddit are not typically aware of research being conducted using their data, and therefore care needs to be taken to ensure that these users remain anonymous and unidentifiable. In addition, posts and comments that are deleted by users after data collection still persist in the archived dataset. Our study minimizes risks by focusing on aggregated results, and our research questions do not involve understanding sensitive information about individual users. There is debate on whether to include direct quotes of users' content in publications (Webb et al., 2017; Vitak et al., 2016). We include a few excerpts from comments in our paper to adequately illustrate our ideas, especially since the exact wording of text can influence the predictions of NLP models, but we choose examples that do not pertain to users' personal information.
9 Conclusion

We use type- and sense-based methods to detect community-specific language in Reddit communities. Our results confirm several sociolinguistic hypotheses related to the behavior of users and their use of community-specific language. Future work could develop annotated WSI datasets for online language similar to the standard SemEval benchmarks we used, since models developed directly on this domain may better fit its rich diversity of meanings.

We set a foundation for further investigations on how BERT could help define unknown words or meanings in niche communities, or how linguistic norms vary across communities discussing similar topics. Our community-level analyses could be expanded to measure linguistic similarity between communities and map the dispersion of ideas among them. It is possible that the preference of some communities for specific senses is due to words being commonly polysemous, with one meaning being particularly relevant to the topic of that community, while other preferences might be linguistic innovations created by users. More research on semantic shifts may help untangle these differences.
Acknowledgements
We are grateful for the helpful feedback of the anonymous reviewers and our action editor, Walter Daelemans. In addition, Olivia Lewke helped us collect and organize subreddits' glossaries. This work was supported by funding from the National Science Foundation (Graduate Research Fellowship DGE-1752814 and grant IIS-1813470).

References
Eduardo G. Altmann, Janet B. Pierrehumbert, andAdilson E. Motter. 2011. Niche as a determin-ant of word fate in online groups.
PLOS One ,6(5).Reinald Kim Amplayo, Seung-won Hwang, andMin Song. 2019. Autosense model forword sense induction. In
Proceedings ofthe AAAI Conference on Artificial Intelligence ,volume 33, pages 6212–6219.Asaf Amrami and Yoav Goldberg. 2018. Wordsense induction with neural biLM and symmet-ric patterns. In
Proceedings of the 2018 Con-ference on Empirical Methods in Natural Lan-guage Processing , pages 4860–4867, Brussels,Belgium. Association for Computational Lin-guistics.Asaf Amrami and Yoav Goldberg. 2019. Towardsbetter substitution-based word sense induction. arXiv preprint arXiv:1905.12598 .David Arthur and Sergei Vassilvitskii. 2007. K-means++: The advantages of careful seed-ing. In
Proceedings of the Eighteenth An-nual ACM-SIAM Symposium on Discrete Al-gorithms , SODA ’07, page 1027–1035, USA.Society for Industrial and Applied Mathemat-ics.David Bamman, Chris Dyer, and Noah A. Smith.2014. Distributed representations of geograph-ically situated language. In
Proceedings ofthe 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 2: ShortPapers) , pages 828–834, Baltimore, Maryland.Association for Computational Linguistics.Marco Baroni, Silvia Bernardini, Adriano Fer-raresi, and Eros Zanchetta. 2009. The wackywide web: a collection of very large linguistic-ally processed web-crawled corpora.
LanguageResources and Evaluation , 43(3):209–226.Osman Ba¸skaya, Enis Sert, Volkan Cirik, andDeniz Yuret. 2013. AI-KU: Using substi-tute vectors and co-occurrence modeling forword sense induction and disambiguation. In
Second Joint Conference on Lexical and Com-putational Semantics (*SEM), Volume 2: Pro-ceedings of the Seventh International Workshopon Semantic Evaluation (SemEval 2013) , pages 300–306, Atlanta, Georgia, USA. Associationfor Computational Linguistics.Jason Baumgartner, Savvas Zannettou, Brian Kee-gan, Megan Squire, and Jeremy Blackburn.2020. The Pushshift Reddit dataset. In
Pro-ceedings of the International AAAI Conferenceon Web and Social Media , volume 14, pages830–839.Andreas Blank. 1999. Why do new meanings oc-cur? A cognitive typology of the motivations forlexical semantic change.
Historical Semanticsand Cognition , pages 61–90.Gerlof Bouma. 2009. Normalized (pointwise)mutual information in collocation extraction.
Proceedings of the German Society for Compu-tational Linguistics and Language Technology(GSCL) , pages 31–40.Sergey Brin and Lawrence Page. 1998. The ana-tomy of a large-scale hypertextual web searchengine.
Computer Networks and ISDN Systems ,30(1-7):107–117.M.R. Brito, E.L. Chávez, A.J. Quiroz, and J.E.Yukich. 1997. Connectivity of the mutual k-nearest-neighbor graph in clustering and out-lier detection.
Statistics & Probability Letters ,35(1):33 – 42.Mary Bucholtz and Kira Hall. 2005. Identity andinteraction: A sociocultural linguistic approach.
Discourse Studies , 7(4-5):585–614.Baobao Chang, Wenzhe Pei, and Miaohong Chen.2014. Inducing word sense with automatic-ally learned hidden concepts. In
Proceedings ofCOLING 2014, the 25th International Confer-ence on Computational Linguistics: TechnicalPapers , pages 355–364, Dublin, Ireland. Dub-lin City University and Association for Compu-tational Linguistics.Edith Cohen, Daniel Delling, Thomas Pajor, andRenato F. Werneck. 2014. Computing clas-sic closeness centrality, at scale. In
Proceed-ings of the Second ACM Conference on On-line Social Networks , COSN ’14, page 37–50,New York, NY, USA. Association for Comput-ing Machinery.Cristian Danescu-Niculescu-Mizil, Robert West,Dan Jurafsky, Jure Leskovec, and Christopherotts. 2013. No country for old members: Userlifecycle and linguistic change in online com-munities. In
Marco Del Tredici and Raquel Fernández. 2017. Semantic variation in online communities of practice. In IWCS 2017 - 12th International Conference on Computational Semantics - Long papers.
Marco Del Tredici and Raquel Fernández. 2018. The road to success: Assessing the fate of linguistic innovations in online communities. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1591–1603, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Yohana Desta. 2014. The evolution of Internet speak. Mashable.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Shehzaad Dhuliawala, Diptesh Kanojia, and Pushpak Bhattacharyya. 2016. SlangNet: A WordNet like resource for English slang. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4329–4332, Portorož, Slovenia. European Language Resources Association (ELRA).
Penelope Eckert. 2012. Three waves of variation study: The emergence of meaning in the study of sociolinguistic variation. Annual Review of Anthropology, 41(1):87–100.
Penelope Eckert and Sally McConnell-Ginet. 1992. Think practically and look locally: Language and gender as community-based practice. Annual Review of Anthropology, 21(1):461–488.
Jacob Eisenstein. 2013. What to do about bad language on the internet. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 359–369, Atlanta, Georgia. Association for Computational Linguistics.
Jacob Eisenstein, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. 2014. Diffusion of lexical change in social media. PLOS ONE, 9(11):1–13.
Joshua A. Fishman. 1972. The sociology of language. In The Sociology of Language: An Interdisciplinary Social Science Approach to Language in Society, chapter 3, pages 1–7. Newbury House Publishers, Rowley, MA.
aline shakti franzke, Anja Bechmann, Michael Zimmer, Charles Ess, and the Association of Internet Researchers. 2020. Internet research: Ethical guidelines 3.0. https://aoir.org/reports/ethics3.pdf.
Ryan J. Gallagher, Andrew J. Reagan, Christopher M. Danforth, and Peter Sheridan Dodds. 2018. Divergent discourse between protests and counter-protests: #BlackLivesMatter and #AllLivesMatter. PLOS ONE, 13(4):1–23.
Mario Giulianelli, Marco Del Tredici, and Raquel Fernández. 2020. Analysing lexical semantic change with contextualised word representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3960–3973, Online. Association for Computational Linguistics.
Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.
Gregory R. Guy. 2011. Language, social class, and status. Cambridge Handbooks in Language and Linguistics, chapter 10. Cambridge University Press.
Christian Hadiwinoto, Hwee Tou Ng, and Wee Chung Gan. 2019. Improved word sense disambiguation using pre-trained contextualized word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5297–5306, Hong Kong, China. Association for Computational Linguistics.
William Hamilton, Justine Zhang, Cristian Danescu-Niculescu-Mizil, Dan Jurafsky, and Jure Leskovec. 2017. Loyalty in online communities. In Proceedings of the International AAAI Conference on Web and Social Media, volume 11, pages 540–543.
Xiaochuang Han and Jacob Eisenstein. 2019. Unsupervised domain adaptation of contextualized embeddings for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4237–4247, Hong Kong, China. Association for Computational Linguistics.
Susan C. Herring and John C. Paolillo. 2006. Gender and genre variation in weblogs. Journal of Sociolinguistics, 10(4):439–459.
Janet Holmes and Miriam Meyerhoff. 1999. The community of practice: Theories and methodologies in language and gender research. Language in Society, 28(2):173–183.
Renfen Hu, Shen Li, and Shichen Liang. 2019. Diachronic sense modeling with deep contextualized word embeddings: An ecological view. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3899–3908, Florence, Italy. Association for Computational Linguistics.
Jianbo Shi and J. Malik. 2000. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905.
Quentin Jones, Gilad Ravid, and Sheizaf Rafaeli. 2004. Information overload and the message dynamics of online interaction spaces: A theoretical model and empirical exploration. Information Systems Research, 15(2):194–210.
David Jurgens and Ioannis Klapaftis. 2013. SemEval-2013 task 13: Word sense induction for graded and non-graded senses. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 290–299, Atlanta, Georgia, USA. Association for Computational Linguistics.
William Labov. 1973. The linguistic consequences of being a lame. Language in Society, 2(1):81–115.
Jey Han Lau, Paul Cook, and Timothy Baldwin. 2013. unimelb: Topic modelling-based word sense induction. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 307–311, Atlanta, Georgia, USA. Association for Computational Linguistics.
Alex Leavitt. 2015. "This is a throwaway account": Temporary technical identities and perceptions of anonymity in a massive online community. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, CSCW '15, pages 317–327, New York, NY, USA. Association for Computing Machinery.
Marcin Lewandowski. 2010. Sociolects and registers – a contrastive analysis of two kinds of linguistic variation. Investigationes Linguisticae, 20:60–79.
J. Lin. 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151.
S. Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137.
Jinghui Lu, Maeve Henchion, and Brian Mac Namee. 2020. Diverging divergences: Examining variants of Jensen Shannon divergence for corpus comparison tasks. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 6740–6744, Marseille, France. European Language Resources Association.
Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, pages 25–30, Jeju Island, Korea. Association for Computational Linguistics.
Ulrike von Luxburg. 2007. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416.
Raquel Magalhães. 2019. Do you speak internet? How internet slang is changing language. Understanding with Unbabel.
Suresh Manandhar and Ioannis Klapaftis. 2009. SemEval-2010 task 14: Evaluation setting for word sense induction & disambiguation systems. In Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009), pages 117–122, Boulder, Colorado. Association for Computational Linguistics.
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411, Barcelona, Spain. Association for Computational Linguistics.
George A. Miller, Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Human Language Technology: Proceedings of a Workshop, Plainsboro, New Jersey.
L. Milroy. 1987. Language and Social Networks. Language in Society. Wiley-Blackwell, Oxford.
Lesley Milroy and James Milroy. 1992. Social network and social class: Toward an integrated sociolinguistic model. Language in Society, 21(1):1–26.
Dong Nguyen, A. Seza Doğruöz, Carolyn P. Rosé, and Franciska de Jong. 2016. Computational sociolinguistics: A survey. Computational Linguistics, 42(3):537–593.
Dong Nguyen and Carolyn P. Rosé. 2011. Language use as a reflection of socialization in online communities. In Proceedings of the Workshop on Language in Social Media (LSM 2011), pages 76–85, Portland, Oregon. Association for Computational Linguistics.
Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution. PLOS ONE, 10(10):1–24.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Zhengqi Pei, Zhewei Sun, and Yang Xu. 2019. Slang detection and identification. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 881–889, Hong Kong, China. Association for Computational Linguistics.
Dan Pelleg and Andrew W. Moore. 2000. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML '00, pages 727–734, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
Tom Postmes, Russell Spears, and Martin Lea. 2000. The formation of group norms in computer-mediated communication. Human Communication Research, 26(3):341–371.
Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viegas, Andy Coenen, Adam Pearce, and Been Kim. 2019. Visualizing and measuring the geometry of BERT. In Advances in Neural Information Processing Systems 32, pages 8594–8603.
Rahmtin Rotabi and Jon Kleinberg. 2016. The status gradient of trends in social media. In Proceedings of the International AAAI Conference on Web and Social Media, volume 10, pages 319–328.
Devyani Sharma and Robin Dodsworth. 2020. Language variation and social networks. Annual Review of Linguistics, 6(1):341–361.
Linfeng Song, Zhiguo Wang, Haitao Mi, and Daniel Gildea. 2016. Sense embedding learning for word sense induction. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, pages 85–90, Berlin, Germany. Association for Computational Linguistics.
Ian Stewart, Stevie Chancellor, Munmun De Choudhury, and Jacob Eisenstein. 2017. #anorexia, #anarexia, #anarexyia: Characterizing online community practices with orthographic variation. In 2017 IEEE International Conference on Big Data (Big Data), pages 4353–4361.
Ian Stewart and Jacob Eisenstein. 2018. Making "fetch" happen: The influence of social and linguistic context on nonstandard word growth and decline. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4360–4370, Brussels, Belgium. Association for Computational Linguistics.
Na Sun, Patrick Pei-Luen Rau, and Liang Ma. 2014. Understanding lurkers in online communities: A literature review. Computers in Human Behavior, 38:110–117.
Robert Tibshirani, Guenther Walther, and Trevor Hastie. 2001. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423.
Jessica Vitak, Katie Shilton, and Zahra Ashktorab. 2016. Beyond the Belmont principles: Ethical challenges, practices, and beliefs in the online data research community. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, pages 941–953.
Ellen M. Voorhees. 1999. The TREC-8 question answering track report. In Proceedings of the 8th Text REtrieval Conference (TREC-8).
Helena Webb, Marina Jirotka, Bernd Carsten Stahl, William Housley, Adam Edwards, Matthew Williams, Rob Procter, Omer Rana, and Pete Burnap. 2017. The ethical challenges of publishing Twitter data for research dissemination. In Proceedings of the 2017 ACM on Web Science Conference, WebSci '17, pages 339–348, New York, NY, USA. Association for Computing Machinery.
Etienne Wenger. 2000. Communities of practice and social learning systems. Organization, 7(2):225–246.
Gregor Wiedemann, Steffen Remus, Avi Chawla, and Chris Biemann. 2019. Does BERT make any sense? Interpretable word sense disambiguation with contextualized embeddings. In Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers, pages 161–170, Erlangen, Germany. German Society for Computational Linguistics & Language Technology.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
Yi Yang and Jacob Eisenstein. 2017. Overcoming language variation in sentiment analysis with social attention. Transactions of the Association for Computational Linguistics, 5:295–307.
Justine Zhang, William L. Hamilton, Cristian Danescu-Niculescu-Mizil, Dan Jurafsky, and Jure Leskovec. 2017. Community identity and user engagement in a multi-community landscape. In