Publications


Featured research published by Matt Taddy.


Journal of the American Statistical Association | 2013

Multinomial Inverse Regression for Text Analysis

Matt Taddy

Text data, including speeches, stories, and other document forms, are often connected to sentiment variables that are of interest for research in marketing, economics, and elsewhere. Such data are also very high dimensional and difficult to incorporate into statistical analyses. This article introduces a straightforward framework of sentiment-sufficient dimension reduction for text data. Multinomial inverse regression is introduced as a general tool for simplifying predictor sets that can be represented as draws from a multinomial distribution, and we show that logistic regression of phrase counts onto document annotations can be used to obtain low-dimensional document representations that are rich in sentiment information. To facilitate this modeling, a novel estimation technique is developed for multinomial logistic regression with a very high-dimensional response. In particular, independent Laplace priors with unknown variance are assigned to each regression coefficient, and we detail an efficient routine for maximization of the joint posterior over coefficients and their prior scale. This “gamma-lasso” scheme yields stable and effective estimation for general high-dimensional logistic regression, and we argue that it will be superior to current methods in many settings. Guidelines for prior specification are provided, algorithm convergence is detailed, and estimator properties are outlined from the perspective of the literature on nonconcave likelihood penalization. Related work on sentiment analysis from statistics, econometrics, and machine learning is surveyed and connected. Finally, the methods are applied in two detailed examples, and we provide out-of-sample prediction studies to illustrate their effectiveness.
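For a concrete picture of the two-stage workflow the abstract describes, here is a minimal sketch: it simulates a small corpus, approximates the inverse regression with one Poisson regression per token (a stand-in for the full multinomial fit; the gamma-lasso estimation in the paper and the textir package work differently), forms the sufficient-reduction score, and runs the forward regression. All data, sizes, and variable names below are illustrative assumptions, not the author's code.

```python
# Minimal MNIR-style sketch (illustrative only, not the textir implementation).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 200, 50                        # documents, vocabulary size (assumed)
y = rng.normal(size=n)                # sentiment annotation per document
phi_true = rng.normal(scale=0.5, size=p)
m = rng.integers(50, 200, size=n)     # document lengths
rates = np.exp(rng.normal(size=p) + np.outer(y, phi_true))
probs = rates / rates.sum(axis=1, keepdims=True)
X = np.vstack([rng.multinomial(m[i], probs[i]) for i in range(n)])  # token counts

# Inverse regression: counts for each token j regressed onto sentiment y,
# with log document length as an offset.
design = sm.add_constant(y)
phi_hat = np.empty(p)
for j in range(p):
    fit = sm.GLM(X[:, j], design, family=sm.families.Poisson(),
                 offset=np.log(m)).fit()
    phi_hat[j] = fit.params[1]        # loading of token j on sentiment

# Sufficient reduction: project token frequencies onto the loadings.
z = (X / m[:, None]) @ phi_hat

# Forward regression: the low-dimensional score z predicts sentiment.
forward = sm.OLS(y, sm.add_constant(z)).fit()
print(forward.params)
```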


Journal of Business & Economic Statistics | 2016

A Nonparametric Bayesian Analysis of Heterogenous Treatment Effects in Digital Experimentation

Matt Taddy; Matt Gardner; Liyun Chen; David Draper

Randomized controlled trials play an important role in how Internet companies predict the impact of policy decisions and product changes. In these “digital experiments,” different units (people, devices, products) respond differently to the treatment. This article presents a fast and scalable Bayesian nonparametric analysis of such heterogeneous treatment effects and their measurement in relation to observable covariates. New results and algorithms are provided for quantifying the uncertainty associated with treatment effect measurement via both linear projections and nonlinear regression trees (CART and random forests). For linear projections, our inference strategy leads to results that are mostly in agreement with those from the frequentist literature. We find that linear regression adjustment of treatment effect averages (i.e., post-stratification) can provide some variance reduction, but that this reduction will be vanishingly small in the low-signal and large-sample setting of digital experiments. For regression trees, we provide uncertainty quantification for the machine learning algorithms that are commonly applied in tree-fitting. We argue that practitioners should look to ensembles of trees (forests) rather than individual trees in their analysis. The ideas are applied to and illustrated through an example experiment involving 21 million unique users of eBay.com.
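As a rough illustration of the kind of posterior uncertainty calculation described above, the sketch below uses Bayesian-bootstrap-style resampling (independent exponential weights, equivalent to Dirichlet weights up to scaling) to draw from a posterior over a simple average treatment effect. It is not the authors' implementation; the data are simulated and deliberately low-signal.

```python
# Bayesian-bootstrap sketch of treatment-effect uncertainty (illustrative).
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
d = rng.integers(0, 2, size=n)               # treatment assignment
x = rng.normal(size=n)                       # a pre-treatment covariate
y = 0.5 * x + 0.1 * d + rng.normal(size=n)   # low-signal outcome

draws = []
for _ in range(1000):
    w = rng.exponential(size=n)              # Dirichlet weights up to normalization
    ate = (np.average(y[d == 1], weights=w[d == 1])
           - np.average(y[d == 0], weights=w[d == 0]))
    draws.append(ate)

draws = np.array(draws)
print("posterior mean ATE:", draws.mean())
print("90% interval:", np.quantile(draws, [0.05, 0.95]))
```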


Technometrics | 2013

Measuring Political Sentiment on Twitter: Factor Optimal Design for Multinomial Inverse Regression

Matt Taddy

This article presents a short case study in text analysis: the scoring of Twitter posts for positive, negative, or neutral sentiment directed toward particular U.S. politicians. The study requires selection of a subsample of representative posts for sentiment scoring, a common and costly aspect of sentiment mining. As a general contribution, our application is preceded by a proposed algorithm for maximizing sampling efficiency. In particular, we outline and illustrate greedy selection of documents to build designs that are D-optimal in a topic-factor decomposition of the original text. The strategy is applied to our motivating dataset of political posts, and we outline a new technique for predicting both generic and subject-specific document sentiment through the use of variable interactions in multinomial inverse regression. Results are presented for analysis of 2.1 million Twitter posts collected around February 2012. Computer codes and data are provided as supplementary material online.
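A minimal sketch of the greedy D-optimal selection idea follows: documents are represented by low-dimensional topic-factor scores, and at each step the document that most increases the determinant of the information matrix is added to the design. The topic scores here are simulated assumptions; the paper derives them from a topic decomposition of the actual tweets.

```python
# Greedy D-optimal subsampling in a topic-factor space (illustrative sketch).
import numpy as np

rng = np.random.default_rng(2)
n_docs, k = 5000, 5
F = rng.normal(size=(n_docs, k))          # topic-factor scores per document

def greedy_d_optimal(F, budget, ridge=1e-6):
    """Greedily add the document that most increases det(F_S' F_S)."""
    selected = []
    M = ridge * np.eye(F.shape[1])        # small ridge keeps M invertible early on
    for _ in range(budget):
        Minv = np.linalg.inv(M)
        # det(M + f f') = det(M) * (1 + f' Minv f): pick the largest leverage.
        gain = np.einsum("ij,jk,ik->i", F, Minv, F)
        gain[selected] = -np.inf          # never pick the same document twice
        j = int(np.argmax(gain))
        selected.append(j)
        M = M + np.outer(F[j], F[j])
    return selected

subsample = greedy_d_optimal(F, budget=100)
print(subsample[:10])
```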


The Annals of Applied Statistics | 2015

Distributed Multinomial Regression

Matt Taddy

This article introduces a model-based approach to distributed computing for multinomial logistic (softmax) regression. We treat the counts for each response category as arising from independent Poisson regressions, given plug-in estimates for fixed effects that are shared across categories. The work is driven by the high-dimensional-response multinomial models that are used in the analysis of large numbers of random counts. Our motivating applications are in text analysis, where documents are tokenized and the token counts are modeled as arising from a multinomial dependent upon document attributes. We estimate such models for a publicly available data set of reviews from Yelp, with text regressed onto a large set of explanatory variables (user, business, and rating information). The fitted models serve as a basis for exploring the connection between words and variables of interest, for reducing dimension into supervised factor scores, and for prediction. We argue that the approach herein provides an attractive option for social scientists and other text analysts who wish to bring familiar regression tools to bear on text data.
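The sketch below illustrates the core computational idea, under the assumption that the plug-in fixed effect for each document is simply the log of its length: every token gets its own Poisson regression on the document attributes, and because the fits are independent they parallelize trivially. It is illustrative only, not the author's distrom package; the data and attribute names are made up.

```python
# Distributed Poisson plug-in sketch for multinomial regression (illustrative).
import numpy as np
import statsmodels.api as sm
from joblib import Parallel, delayed

rng = np.random.default_rng(3)
n, p = 500, 200                           # documents, vocabulary size (assumed)
V = np.column_stack([np.ones(n),          # intercept
                     rng.integers(1, 6, size=n),   # e.g., star rating
                     rng.normal(size=n)])          # another attribute
X = rng.poisson(lam=2.0, size=(n, p))     # token counts (simulated)
m = X.sum(axis=1)                         # document lengths
mu = np.log(np.maximum(m, 1))             # plug-in fixed effect per document

def fit_token(j):
    fit = sm.GLM(X[:, j], V, family=sm.families.Poisson(), offset=mu).fit()
    return fit.params                     # loadings of token j on the attributes

# Each token is an independent regression: trivially parallel across cores
# (or machines, in the distributed setting the paper targets).
coefs = np.vstack(Parallel(n_jobs=-1)(delayed(fit_token)(j) for j in range(p)))
print(coefs.shape)                        # (vocabulary size, number of attributes)
```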


National Bureau of Economic Research | 2017

Text As Data

Matthew Gentzkow; Bryan T. Kelly; Matt Taddy

An ever-increasing share of human interaction, communication, and culture is recorded as digital text. We provide an introduction to the use of text as an input to economic research. We discuss the features that make text different from other forms of data, offer a practical overview of relevant statistical methods, and survey a variety of applications.


Journal of the American Statistical Association | 2013

Rejoinder: Efficiency and Structure in MNIR

Matt Taddy

project to analyze the Congressional Record from 1873 to the present. With latent sentiment encoded as a random variable v, we can try to measure how it changes over time, such as for a particular lawmaker or author. Specifically, we can embed v_{t,d} in a state-space model (West and Harrison 1997) and model the observed sentiment y_{t,d} as a noisy realization. We can also consider how the relationship between language and sentiment changes, placing each term's coefficient φ_w in a similar state-space model. While these innovations lead to more complex inference problems, placing the coefficients and sentiment variables in a time series could capture interesting changes in large collections of texts.

• Richer structure with multilevel models. The intercept term α models the language that is unrelated to sentiment. In this article α is a simple unigram model, with one value for each term in the vocabulary. With additional structure, we might better tease apart the sentiment words from the background language and rely less on a careful selection of the vocabulary. For example, consider modeling items and their ratings with MNIR (such as Amazon reviews). If an item receives universally positive reviews, then MNIR will attribute words that describe it, such as "DiCaprio," as positive-sentiment words. This is a deficiency, since we would not consider "DiCaprio" to be a word that indicates sentiment. (That said, if our only goal were prediction, this might be a perfectly good inference.) Per-item intercepts, shared by all the reviews of a single item, would solve this problem: they can capture that "DiCaprio" is specific to particular movies but not to general sentiment. In MNIR, this can be handled through the random-effect variables u_{ij}. In this case, they would be hierarchical random effects (Gelman and Hill 2007) conditioned on attributes of the reviewed item. In his article, Taddy does not fully exploit the u_{ij} variables, as his example problems did not require them. (Further, a model that includes more structured random-effect variables will not enjoy the speed-ups that Taddy obtains by collapsing document counts.) With more complicated collections, richer random effects could be important for untangling the text into its sentiment and nonsentiment components.

• Unsupervised sentiment analysis. With the view of MNIR as a complete hierarchical model, it would be interesting to consider unsupervised sentiment analysis. Given a collection of unlabeled texts, can we identify a dimension of "sentiment" that these texts are expressing? Initially, this might sound far-fetched. But consider the success story of ideal point models in political methodology research (Clinton, Jackman, and Rivers 2004): based only on a matrix of votes, we can recover the known political spectrum on which the lawmakers lie. In principle, though the data are significantly sparser and of higher dimension, we can do the same for language. Given a large enough collection of texts (again, consider Amazon reviews or political speeches), we should be able to distinguish the prominent patterns in descriptive prose from those that express a spectrum of sentiment.
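To make the state-space suggestion above concrete, the snippet below filters a random-walk sentiment path from noisy per-period sentiment scores with a plain local-level Kalman filter. The variances are assumed known and the data are simulated; this is only a sketch of the proposed embedding, not an analysis of the Congressional Record.

```python
# Local-level Kalman filter sketch for time-varying sentiment (illustrative).
import numpy as np

rng = np.random.default_rng(4)
T = 100
v_true = np.cumsum(rng.normal(scale=0.1, size=T))   # latent sentiment path
y = v_true + rng.normal(scale=0.5, size=T)          # noisy observed sentiment

q, r = 0.1**2, 0.5**2        # state and observation variances (assumed known)
v_hat, P = 0.0, 1.0          # filter mean and variance
filtered = []
for t in range(T):
    P = P + q                              # predict: random-walk evolution
    K = P / (P + r)                        # Kalman gain
    v_hat = v_hat + K * (y[t] - v_hat)     # update with the observation
    P = (1 - K) * P
    filtered.append(v_hat)

print(np.round(filtered[-5:], 2))
```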


arXiv: Applications | 2016

Hockey Player Performance via Regularized Logistic Regression

Robert B. Gramacy; Matt Taddy; Sen Tian

A hockey player's plus-minus measures the difference between goals scored by and against that player's team while the player was on the ice. This measures only a marginal effect, failing to account for the influence of the other players he is playing with and against. A better approach would be to jointly model the effects of all players, and any other confounding information, in order to infer a partial effect for this individual: his influence on the box score regardless of who else is on the ice. This chapter describes and illustrates a simple algorithm for recovering such partial effects. There are two main ingredients. First, we provide a logistic regression model that can predict which team has scored a given goal as a function of who was on the ice, what teams were playing, and details of the game situation (e.g., full-strength or power-play). Since the resulting model is so high dimensional that standard maximum likelihood estimation techniques fail, our second ingredient is a scheme for regularized estimation. This adds a penalty to the objective that favors parsimonious models and stabilizes estimation. Such techniques have proven useful in fields from genetics to finance over the past two decades, and have demonstrated an impressive ability to gracefully handle large and highly imbalanced data sets. The latest software packages accompanying this new methodology (which exploit parallel computing environments, sparse matrices, and other features of modern data structures) are widely available and make it straightforward for interested analysts to explore their own models of player contribution.
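The sketch below shows the shape of the model described here: each goal becomes a row of a sparse design matrix with +1/-1 indicators for the home and away players on the ice, the response records whether the home team scored, and an L1 penalty keeps the high-dimensional fit stable. Players, goals, and the use of scikit-learn's penalized logistic regression are all illustrative assumptions, not the authors' code or data.

```python
# Regularized plus-minus sketch with simulated players and goals (illustrative).
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n_players, n_goals = 300, 5000
skill = rng.normal(scale=0.3, size=n_players)    # latent partial effects

rows, cols, vals = [], [], []
y = np.empty(n_goals, dtype=int)
for g in range(n_goals):
    on_ice = rng.choice(n_players, size=10, replace=False)
    home, away = on_ice[:5], on_ice[5:]
    rows += [g] * 10
    cols += list(on_ice)
    vals += [1.0] * 5 + [-1.0] * 5               # +1 home skaters, -1 away skaters
    logit = skill[home].sum() - skill[away].sum()
    y[g] = rng.random() < 1.0 / (1.0 + np.exp(-logit))   # 1 if home team scored

X = csr_matrix((vals, (rows, cols)), shape=(n_goals, n_players))

# L1-penalized logistic regression; 'saga' handles the sparse design.
model = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000)
model.fit(X, y)
partial_effects = model.coef_.ravel()
print("top players:", np.argsort(partial_effects)[-5:])
```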


National Bureau of Economic Research | 2016

Measuring Polarization in High-Dimensional Data: Method and Application to Congressional Speech

Matthew Gentzkow; Jesse M. Shapiro; Matt Taddy


Archive | 2013

Reverse Survivorship Bias

John Cochrane; Phil English; Gene Fama; Andrea Frazzini; Bobbie Goettler; Mark Grinblatt; John Heaton; Markku Kaustia; Bryan T. Kelly; Matti Keloharju; Ralph Koijen; Annette Larson; Robin L. Lumsdaine; Toby Moskowitz; Robert Novy-Marx; Ľuboš Pástor; Antti Petäjistö; Alexi Savov; Clemens Sialm; Rob Stambaugh; Matt Taddy; Luke Taylor; Pietro Veronesi; Z. Jay Wang; Campbell R. Harvey


Journal of Computational and Graphical Statistics | 2017

One-step estimator paths for concave regularization

Matt Taddy

Collaboration


Dive into Matt Taddy's collaborations.

Top Co-Authors

David Draper
University of California

Jesse M. Shapiro
National Bureau of Economic Research

Mark Grinblatt
National Bureau of Economic Research

Mladen Kolar
Carnegie Mellon University