

Publication


Featured research published by Umashanthi Pavalanathan.


Empirical Methods in Natural Language Processing | 2015

Confounds and Consequences in Geotagged Twitter Data

Umashanthi Pavalanathan; Jacob Eisenstein

Twitter is often used in quantitative studies that identify geographically-preferred topics, writing styles, and entities. These studies rely on either GPS coordinates attached to individual messages, or on the user-supplied location field in each profile. In this paper, we compare these data acquisition techniques and quantify the biases that they introduce; we also measure their effects on linguistic analysis and text-based geolocation. GPS-tagging and self-reported locations yield measurably different corpora, and these linguistic differences are partially attributable to differences in dataset composition by age and gender. Using a latent variable model to induce age and gender, we show how these demographic variables interact with geography to affect language use. We also show that the accuracy of text-based geolocation varies with population demographics, giving the best results for men above the age of 40.


American Speech | 2015

Audience-Modulated Variation in Online Social Media

Umashanthi Pavalanathan; Jacob Eisenstein

Stylistic variation in online social media writing is well attested: for example, geographical analysis of the social media service Twitter has replicated isoglosses for many known lexical variables from speech, while simultaneously revealing a wealth of new geographical lexical variables, including emoticons, phonetic spellings, and phrasal abbreviations. However, less is known about the social role of variation in online writing. This paper examines online writing variation in the context of audience design, focusing on affordances offered by Twitter that allow users to modulate a message's intended audience. We find that the frequency of non-standard lexical variables is inversely related to the size of the intended audience: as writers target smaller audiences, the frequency of lexical variables increases. In addition, these variables are more often used in messages that are addressed to individuals who are known to be geographically local. This phenomenon holds for geographically-differentiated lexical variables, but also for non-standard variables that are widely used throughout the United States. These findings suggest that users of social media are attuned to both the nature of their audience and the social meaning of lexical variation, and that they customize their self-presentation accordingly.

Introduction

Social media writing is often stylistically distinct from other written genres (Crystal 2006; Eisenstein 2013a), but it also displays an impressive internal stylistic diversity (Herring 2007; Androutsopoulos 2011). Many stylistic variables in social media have been shown to align with macro-level properties of the author, such as geographical location (Eisenstein et al. 2010), age (Schler et al. 2006), race (Eisenstein, Smith, and Xing 2011), and gender (Herring and Paolillo 2006).
Linguistic differences are robust enough to support unnervingly accurate predictions of these characteristics based on writing style, with algorithmic predictions in some cases outperforming human judgments (Burger et al. 2011). This focus on prediction aligns with Silverstein's (2003) concept of first-order indexicality: the direct association of linguistic variables with macro-level social categories. The huge size of social media corpora makes it easy to identify hundreds of such variables through statistical analysis (e.g., Eisenstein, Smith, and Xing 2011). But social media data has more to offer sociolinguistics than size alone: even though platforms such as Twitter are completely public, they capture language use in natural contexts with real social stakes. These platforms play host to a diverse array of interactional situations, from high school gossip to political debate, and from career networking to intense music fandom. As such, social media data offer new possibilities for understanding the social nature of language: not only who says what, but how stylistic variables are perceived by readers and writers, and how they are used to achieve communicative goals. In this paper, we focus on the relevance of audience to sociolinguistic variation. A rich theoretical literature is already dedicated to this issue, including models of accommodation (Giles, Coupland, and Coupland 1991), audience design (Bell 1984), and stancetaking (Du Bois 2007). Empirical evidence for these models has typically focused on relatively small corpora of conversational speech, with a small number of hand-chosen variables. Indeed, the applicability of audience design and related models to a large-scale corpus of online written communication may appear doubtful: is audience a relevant and quantifiable concept in social media? In public "broadcast" media such as blogs, the properties of the audience seem difficult to identify.
Conversely, in directed communication such as e-mails and SMS, the identity of the audience is clear, but acquisition of large amounts of data is impeded by obvious privacy considerations. However, ethnographic research suggests that users of Twitter have definite ideas about who their audience is, and that they customize their self-presentation accordingly (Marwick and boyd 2011). Furthermore, contemporary social media platforms such as Twitter and Facebook offer authors increasingly nuanced capabilities for manipulating the composition of their audience, enabling them to reach both within and beyond the social networks defined by explicitly-stated friendship ties (called "following" in Twitter; Kwak et al. 2010). We define these affordances in detail below. This paper examines these notions of audience in the context of a novel dataset with thousands of writers and more than 200 lexical variables. The variables are organized into two sets: the first consists of terms that distinguish major American metropolitan areas from each other, and is obtained using an automatic technique based on regularized log-odds ratio. The second set of variables consists of the most frequently-used non-standard terms among Twitter users in the United States. In both cases, we find strong evidence of style-shifting according to audience size and proximity. When communication is intended for an individual recipient, particularly a recipient from the same geographical area as the author, both geographically-specific variables and medium-based variables are used at a significantly higher rate. Conversely, when communication is intended to reach a broad audience, outside the individual's social network, both types of variables are inhibited. These findings use a matched dataset design to control for the identity of the author, showing that individual authors are less likely to use non-standard and geographically-specific variables as the intended size of the audience grows.
This provides evidence that individuals modulate their linguistic performance as they use social media affordances to control the intended audience of their messages. It also suggests that these non-standard variables – some of which appear to be endogenous to social media and recent in origin – are already viewed as socially marked, and are regulated accordingly.
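The abstract mentions identifying geographically distinctive terms with an automatic technique based on regularized log-odds ratio. As a rough illustration only (not the paper's exact procedure), a log-odds ratio regularized by a symmetric Dirichlet prior can be sketched as follows; the corpora and counts below are invented:

```python
import math
from collections import Counter

def regularized_log_odds(counts_a, counts_b, alpha=1.0):
    """Log-odds ratio of each term in corpus A vs. corpus B, regularized
    by a symmetric Dirichlet prior with concentration alpha. Positive
    scores mark A-leaning terms; negative scores mark B-leaning terms."""
    vocab = set(counts_a) | set(counts_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    v = len(vocab)
    scores = {}
    for w in vocab:
        fa, fb = counts_a.get(w, 0), counts_b.get(w, 0)
        odds_a = (fa + alpha) / (n_a + alpha * v - fa - alpha)
        odds_b = (fb + alpha) / (n_b + alpha * v - fb - alpha)
        scores[w] = math.log(odds_a) - math.log(odds_b)
    return scores

# Invented term counts for two hypothetical regional corpora
atlanta = Counter({"shawty": 12, "hella": 1, "the": 100})
bay_area = Counter({"shawty": 1, "hella": 15, "the": 110})
scores = regularized_log_odds(atlanta, bay_area)
# "shawty" gets a positive score (first-corpus-leaning); "hella" negative;
# a common word like "the" scores near zero
```

The prior keeps rare terms from dominating the ranking: with alpha > 0, a term seen once in a small corpus cannot achieve an extreme score.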


Conference on Computer Supported Cooperative Work | 2017

What (or Who) Is Public?: Privacy Settings and Social Media Content Sharing

Casey Fiesler; Michaelanne Dye; Jessica L. Feuston; Chaya Hiruncharoenvate; Clayton J. Hutto; Shannon Morrison; Parisa Khanipour Roshan; Umashanthi Pavalanathan; Amy Bruckman; Munmun De Choudhury; Eric Gilbert

When social networking sites give users granular control over their privacy settings, the result is that some content across the site is public and some is not. How might this content, or the characteristics of users who post publicly versus to a limited audience, be different? If such differences exist, research studies of public content could be introducing systematic bias. Via Mechanical Turk, we asked 1,815 Facebook users to share recent posts. Using qualitative coding and quantitative measures, we characterize and categorize the nature of the content. Using machine learning techniques, we analyze patterns of choices for privacy settings. Contrary to expectations, we find that content type is not a significant predictor of privacy setting; however, some demographics such as gender and age are predictive. Additionally, with consent of participants, we provide a dataset of nearly 9,000 public and non-public Facebook posts.
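The study's actual analysis uses qualitative coding and machine learning; as a minimal numeric sketch of the kind of demographic effect it reports (e.g., gender or age predicting privacy setting), a two-proportion z-test on invented counts looks like this:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic for the difference between two proportions,
    using the pooled standard error."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Invented counts: number of public posts out of all posts,
# for two hypothetical demographic groups
z = two_proportion_z(420, 900, 310, 900)
# |z| > 1.96 would indicate a significant difference at the 5% level
```

This is a back-of-envelope check, not the paper's method; it only illustrates how a demographic split in public-vs-limited posting rates would surface as a significant test statistic.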


Meeting of the Association for Computational Linguistics | 2017

A Multidimensional Lexicon for Interpersonal Stancetaking

Umashanthi Pavalanathan; Jim Fitzpatrick; Scott F. Kiesling; Jacob Eisenstein

The sociolinguistic construct of stancetaking describes the activities through which discourse participants create and signal relationships to their interlocutors, to the topic of discussion, and to the talk itself. Stancetaking underlies a wide range of interactional phenomena, relating to formality, politeness, affect, and subjectivity. We present a computational approach to stancetaking, in which we build a theoretically-motivated lexicon of stance markers, and then use multidimensional analysis to identify a set of underlying stance dimensions. We validate these dimensions intrinsically and extrinsically, showing that they are internally coherent, match pre-registered hypotheses, and correlate with social phenomena.
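As an illustrative toy only (the paper's actual lexicon is far larger, and its dimensions are induced by multidimensional analysis rather than fixed by hand), scoring a text against small hypothetical marker sets for two stance-related dimensions might look like this:

```python
# Tiny hypothetical marker sets; purely for illustration.
HEDGES = {"maybe", "perhaps", "possibly", "apparently", "somewhat"}
BOOSTERS = {"definitely", "absolutely", "certainly", "really", "totally"}

def stance_profile(text):
    """Normalized marker rates for two illustrative stance dimensions."""
    tokens = text.lower().split()
    n = max(len(tokens), 1)
    return {"hedging": sum(t in HEDGES for t in tokens) / n,
            "boosting": sum(t in BOOSTERS for t in tokens) / n}

profile = stance_profile("maybe this is perhaps fine")
# hedging rate 0.4 (2 of 5 tokens), boosting rate 0.0
```

In the paper's pipeline, per-document vectors of marker rates like these would then be fed into multidimensional analysis to discover latent stance dimensions, rather than interpreting individual marker sets directly.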


WISE Workshops | 2011

Levi - A Workflow Engine Using BPMN 2.0

Keheliya Gallaba; Umashanthi Pavalanathan; Ishan Jayawardena; Eranda Sooriyabandara; Vishaka Nanayakkara

Increasing benefits of business process automation and information technology (IT) based governance encourage organizations to model and manage their day-to-day business activities using business process management systems, in order to achieve increased efficiency and productivity. Many business process languages, such as Business Process Execution Language (BPEL), take a programming-oriented view of process modeling rather than a human-oriented view. The recent standardization of Business Process Model and Notation version 2.0 (BPMN 2.0) provides a way to support inter-operation of business processes at the user level, rather than at the software engine level. Wide adoption of the BPMN 2.0 standard is limited by the lack of runtimes that natively support BPMN 2.0. In this paper we discuss Levi, a cloud-ready BPMN 2.0 execution engine that executes BPMN 2.0 processes natively, built on the core concurrent runtime of the Apache-based open-source process engine ODE (Orchestration Director Engine).


Computational Linguistics | 2018

Interactional Stancetaking in Online Forums

Scott F. Kiesling; Umashanthi Pavalanathan; Jim Fitzpatrick; Xiaochuang Han; Jacob Eisenstein

Language is shaped by the relationships between the speaker/writer and the audience, the object of discussion, and the talk itself. In turn, language is used to reshape these relationships over the course of an interaction. Computational researchers have succeeded in operationalizing sentiment, formality, and politeness, but each of these constructs captures only some aspects of social and relational meaning. Theories of interactional stancetaking have been put forward as holistic accounts, but until now, these theories have been applied only through detailed qualitative analysis of (portions of) a few individual conversations. In this article, we propose a new computational operationalization of interpersonal stancetaking. We begin with annotations of three linked stance dimensions—affect, investment, and alignment—on 68 conversation threads from the online platform Reddit. Using these annotations, we investigate thread structure and linguistic properties of stancetaking in online conversations. We identify lexical features that characterize the extremes along each stancetaking dimension, and show that these stancetaking properties can be predicted with moderate accuracy from bag-of-words features, even with a relatively small labeled training set. These quantitative analyses are supplemented by extensive qualitative analysis, highlighting the compatibility of computational and qualitative methods in synthesizing evidence about the creation of interactional meaning.
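The article reports that stance dimensions can be predicted with moderate accuracy from bag-of-words features. As a generic sketch in that spirit (a multinomial naive Bayes classifier on invented examples, not the authors' model, features, or data), the idea looks like this:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Train a multinomial naive Bayes classifier on bag-of-words counts
    and return a predict(text) -> label function."""
    vocab = {w for d in docs for w in d.split()}
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter(labels)
    for d, y in zip(docs, labels):
        word_counts[y].update(d.split())

    def predict(text):
        best_label, best_lp = None, -math.inf
        for y in label_counts:
            lp = math.log(label_counts[y] / len(labels))   # log prior
            total = sum(word_counts[y].values())
            for w in text.split():
                # Laplace-smoothed log likelihood of each token
                lp += math.log((word_counts[y][w] + alpha)
                               / (total + alpha * len(vocab)))
            if lp > best_lp:
                best_label, best_lp = y, lp
        return best_label

    return predict

# Invented training examples for a hypothetical "alignment" dimension
docs = ["i totally agree great point",
        "you are wrong this is nonsense",
        "agree completely well said",
        "no that is just wrong"]
labels = ["aligned", "disaligned", "aligned", "disaligned"]
predict = train_nb(docs, labels)
```

With Laplace smoothing, even tokens unseen for a given label contribute a finite log likelihood, which is what lets such a model work with the relatively small labeled training set the article describes.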


Archive | 2017

Studying Military Community Health, Well-Being, and Discourse Through the Social Media Lens

Umashanthi Pavalanathan; Vivek V. Datla; Svitlana Volkova; Lauren Charles-Smith; Meg Pirrung; Josh Harrison; Alan R. Chappell; Courtney D. Corley

Social media can provide a resource for characterizing communities and targeted populations through activities and content shared online. For instance, studying the armed forces' use of social media may provide insights into their health and well-being. In this paper, we address three broad research questions: (1) How do military populations use social media? (2) What topics do military users discuss in social media? (3) Do military users talk about health and well-being differently than civilians? Military Twitter users were identified through keywords in the profile descriptions of users who posted geo-tagged tweets at military installations. These military tweets were compared with tweets from the remaining population. Our analysis indicates that military users talk more about military-related responsibilities and events, whereas nonmilitary users talk more about school, work, and leisure activities. Significant differences between the content generated by the two populations were identified across sentiment, health, language, and social media features.
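The abstract describes identifying military users through keywords in their profile descriptions. A hypothetical sketch of such a filter (the keyword list below is illustrative, not the study's actual list):

```python
# Illustrative keyword list only; the study's real list is not reproduced here.
MILITARY_TERMS = {"army", "navy", "air force", "marines",
                  "veteran", "soldier", "usmc"}

def looks_military(profile_bio):
    """Flag a user whose profile description mentions a military term."""
    bio = profile_bio.lower()
    return any(term in bio for term in MILITARY_TERMS)
```

In the study this profile-keyword signal is combined with geo-tagged posting at military installations, which helps filter out false positives that a substring match alone would admit.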


International World Wide Web Conferences | 2015

Identity Management and Mental Health Discourse in Social Media

Umashanthi Pavalanathan; Munmun De Choudhury


arXiv: Computation and Language | 2015

Emoticons vs. Emojis on Twitter: A Causal Inference Approach

Umashanthi Pavalanathan; Jacob Eisenstein


Proceedings of the ACM on Human-Computer Interaction | 2017

You Can't Stay Here: The Efficacy of Reddit's 2015 Ban Examined Through Hate Speech

Eshwar Chandrasekharan; Umashanthi Pavalanathan; Anirudh Srinivasan; Adam N. Glynn; Jacob Eisenstein; Eric Gilbert

Collaboration


Dive into Umashanthi Pavalanathan's collaborations.

Top Co-Authors

Jacob Eisenstein (Georgia Institute of Technology)
Courtney D. Corley (Pacific Northwest National Laboratory)
Lauren Charles-Smith (Pacific Northwest National Laboratory)
Alan R. Chappell (Pacific Northwest National Laboratory)
Eric Gilbert (Georgia Institute of Technology)
Meg Pirrung (Pacific Northwest National Laboratory)
Munmun De Choudhury (Georgia Institute of Technology)
Xiaochuang Han (Georgia Institute of Technology)