Daniel Preoţiuc-Pietro
University of Pennsylvania
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Daniel Preoţiuc-Pietro.
international joint conference on natural language processing | 2015
Daniel Preoţiuc-Pietro; Vasileios Lampos; Nikolaos Aletras
Social media content can be used as a complementary source to the traditional methods for extracting and studying collective social attributes. This study focuses on the prediction of the occupational class for a public user profile. Our analysis is conducted on a new annotated corpus of Twitter users, their respective job titles, posted textual content and platform-related attributes. We frame our task as classification using latent feature representations such as word clusters and embeddings. The employed linear and, especially, non-linear methods can predict a user’s occupational class with strong accuracy for the coarsest level of a standard occupation taxonomy which includes nine classes. Combined with a qualitative assessment, the derived results confirm the feasibility of our approach in inferring a new user attribute that can be embedded in a multitude of downstream applications.
meeting of the association for computational linguistics | 2017
Daniel Preoţiuc-Pietro; Ye Liu; Daniel Hopkins; Lyle H. Ungar
Automatic political orientation prediction from social media posts has to date proven successful only in distinguishing between publicly declared liberals and conservatives in the US. This study examines users’ political ideology using a seven-point scale which enables us to identify politically moderate and neutral users – groups which are of particular interest to political scientists and pollsters. Using a novel data set with political ideology labels self-reported through surveys, our goal is two-fold: a) to characterize the groups of politically engaged users through language use on Twitter; b) to build a fine-grained model that predicts political ideology of unseen users. Our results identify differences in both political leaning and engagement and the extent to which each group tweets using political keywords. Finally, we demonstrate how to improve ideology prediction accuracy by exploiting the relationships between the user groups.
PLOS ONE | 2015
Daniel Preoţiuc-Pietro; Svitlana Volkova; Vasileios Lampos; Nikolaos Aletras
Automatically inferring user demographics from social media posts is useful for both social science research and a range of downstream applications in marketing and politics. We present the first extensive study where user behaviour on Twitter is used to build a predictive model of income. We apply non-linear methods for regression, i.e. Gaussian Processes, achieving strong correlation between predicted and actual user income. This allows us to shed light on the factors that characterise income on Twitter and analyse their interplay with user emotions and sentiment, perceived psycho-demographics and language use expressed through the topics of their posts. Our analysis uncovers correlations between different feature categories and income, some of which reflect common belief e.g. higher perceived education and intelligence indicates higher earnings, known differences e.g. gender and age differences, however, others show novel findings e.g. higher income users express more fear and anger, whereas lower income users express more of the time emotion and opinions.
north american chapter of the association for computational linguistics | 2015
Daniel Preoţiuc-Pietro; Johannes C. Eichstaedt; Gregory Park; Maarten Sap; Laura Smith; Victoria Tobolsky; H. Andrew Schwartz; Lyle H. Ungar
Mental illnesses, such as depression and post traumatic stress disorder (PTSD), are highly underdiagnosed globally. Populations sharing similar demographics and personality traits are known to be more at risk than others. In this study, we characterise the language use of users disclosing their mental illness on Twitter. Language-derived personality and demographic estimates show surprisingly strong performance in distinguishing users that tweet a diagnosis of depression or PTSD from random controls, reaching an area under the receiveroperating characteristic curve ‐ AUC ‐ of around .8 in all our binary classification tasks. In fact, when distinguishing users disclosing depression from those disclosing PTSD, the single feature of estimated age shows nearly as strong performance (AUC = .806) as using thousands of topics (AUC = .819) or tens of thousands of n-grams (AUC = .812). We also find that differential language analyses, controlled for demographics, recover many symptoms associated with the mental illnesses in the clinical literature.
north american chapter of the association for computational linguistics | 2015
Daniel Preoţiuc-Pietro; Maarten Sap; H. Andrew Schwartz; Lyle H. Ungar
This article is a system description and report on the submission of the World Well-Being Project from the University of Pennsylvania in the ‘CLPsych 2015’ shared task. The goal of the shared task was to automatically determine Twitter users who self-reported having one of two mental illnesses: post traumatic stress disorder (PTSD) and depression. Our system employs user metadata and textual features derived from Twitter posts. To reduce the feature space and avoid data sparsity, we consider several word clustering approaches. We explore the use of linear classifiers based on different feature sets as well as a combination use a linear ensemble. This method is agnostic of illness specific features, such as lists of medicines, thus making it readily applicable in other scenarios. Our approach ranked second in all tasks on average precision and showed best results at .1 false positive rates.
meeting of the association for computational linguistics | 2016
Lucie Flekova; Jordan Carpenter; Salvatore Giorgi; Lyle H. Ungar; Daniel Preoţiuc-Pietro
User traits disclosed through written text, such as age and gender, can be used to personalize applications such as recommender systems or conversational agents. However, human perception of these traits is not perfectly aligned with reality. In this paper, we conduct a large-scale crowdsourcing experiment on guessing age and gender from tweets. We systematically analyze the quality and possible biases of these predictions. We identify the textual cues which lead to miss-assessments of traits or make annotators more or less confident in their choice. Our study demonstrates that differences between real and perceived traits are noteworthy and elucidates inaccurately used stereotypes in human perception.
knowledge discovery and data mining | 2013
Daniel Preoţiuc-Pietro; Justin Cranshaw; Tae Yano
In this work we explore the use of incidentally generated social network data for the folksonomic characterization of cities by the types of amenities located within them. Using data collected about venue categories in various cities, we examine the effect of different granularities of spatial aggregation and data normalization when representing a city as a collection of its venues. We introduce three vector-based representations of a city, where aggregations of the venue categories are done within a grid structure, within the citys municipal neighborhoods, and across the city as a whole. We apply our methods to a novel dataset consisting of Foursquare venue data from 17 cities across the United States, totaling over 1 million venues. Our preliminary investigation demonstrates that different assumptions in the urban perception could lead to qualitative, yet distinctive, variations in the induced city description and categorization.
north american chapter of the association for computational linguistics | 2016
Daniel Preoţiuc-Pietro; H. Andrew Schwartz; Gregory Park; Johannes C. Eichstaedt; Margaret L. Kern; Lyle H. Ungar; Elisabeth Shulman
Access to expressions of subjective personal posts increased with the popularity of Social Media. However, most of the work in sentiment analysis focuses on predicting only valence from text and usually targeted at a product, rather than affective states. In this paper, we introduce a new data set of 2895 Social Media posts rated by two psychologicallytrained annotators on two separate ordinal nine-point scales. These scales represent valence (or sentiment) and arousal (or intensity), which defines each post’s position on the circumplex model of affect, a well-established system for describing emotional states (Russell, 1980; Posner et al., 2005). The data set is used to train prediction models for each of the two dimensions from text which achieve high predictive accuracy – correlated at r = .65 with valence and r = .85 with arousal annotations. Our data set offers a building block to a deeper study of personal affect as expressed in social media. This can be used in applications such as mental illness detection or in automated large-scale psychological studies.
web science | 2017
Sharath Chandra Guntuku; Weisi Lin; Jordan Carpenter; Wee Keong Ng; Lyle H. Ungar; Daniel Preoţiuc-Pietro
Interacting with images through social media has become widespread due to ubiquitous Internet access and multimedia enabled devices. Through images, users generally present their daily activities, preferences or interests. This study aims to identify the way and extent to which personality differences, measured using the Big Five model, are related to online image posting and liking. In two experiments, the larger consisting of ~1.5 million Twitter images both posted and liked by ~4,000 users, we extract interpretable semantic concepts using large-scale image content analysis and analyze differences specific of each personality trait. Predictive results show that image content can predict personality traits, and that there can be significant performance gain by fusing the signal from both posted and liked images.
empirical methods in natural language processing | 2015
Lucie Flekova; Daniel Preoţiuc-Pietro; Eugen Ruppert
Contemporary sentiment analysis ap- proaches rely heavily on lexicon based methods. This is mainly due to their simplicity, although the best empirical results can be achieved by more complex techniques. We introduce a method to assess suitability of generic sentiment lexicons for a given domain, namely to identify frequent bigrams where a polar word switches polarity. Our bigrams are scored using Lexicographers Mutual Information and leveraging large automatically obtained corpora. Our score matches human perception of polarity and demonstrates improvements in classification results using our enhanced context- aware method. Our method enhances the assessment of lexicon based sentiment de- tection algorithms and can be further used to quantify ambiguous words.