Rejoinder: Efficiency and Structure in MNIR
Matt Taddy, The University of Chicago Booth School of Business

I thank Prof. Blei and Prof. Grimmer for their comments; it is great to have one's work discussed by researchers who are both excellent statisticians and experts in their respective fields. The discussion can be summarized under two themes. Prof. Blei is interested in extending MNIR to model additional, often latent, structure in text. Prof. Grimmer is concerned with causation and interpretability. Both will be answered in the context of my original motivation for MNIR: the estimation efficiency derived from assumptions on x | y. We'll begin with estimator properties in a simple illustration, then turn to a discussion of latent factors and causal inference.

A related question of efficiency has been studied by Efron (1975) and Ng and Jordan (2002) in comparisons between logistic regression and 'generative' discriminant analysis. Efron's generative classifier applies Bayes rule to inverse multivariate normals x | y ∼ N(µ_y, Σ), where µ_y = E[x | y] varies with y ∈ {0, 1} but the covariance matrix is shared across populations. Given true normal covariate distributions separated by root Mahalanobis distances of 3 to 4, he finds predictions from this routine to be 1.5 to 3 times more efficient than logistic regression. This efficiency gain is smaller than that found by Ng and Jordan for a Naive Bayes algorithm (each covariate is fit as independent of the others given y), with their results loosely interpreted to imply log(n) times higher efficiency for the generative predictor. Although Naive Bayes independence is not assumed for the data itself, requirements on the amount of information about y available in each covariate have the effect of limiting conditional dependence.

Our model presents a third scenario: covariate dependence is fully specified via the negative correlation of a multinomial.
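As a point of reference, Efron's generative-versus-logistic comparison is easy to mimic in a small simulation. Everything below (the one-dimensional setup, sample sizes, and the gradient-ascent logistic fit) is my own illustrative choice, not Efron's actual design:

```python
import numpy as np

rng = np.random.default_rng(3)

def one_rep(n_train=40, delta=3.0, n_test=4000):
    """Test error of the plug-in normal discriminant vs logistic regression, one replication."""
    y = rng.integers(0, 2, n_train)
    x = rng.normal(delta * (y - 0.5), 1.0)        # x | y ~ N(+/- delta/2, 1): Mahalanobis distance delta
    yt = rng.integers(0, 2, n_test)
    xt = rng.normal(delta * (yt - 0.5), 1.0)

    # Generative route: plug estimated class means into Bayes rule (equal priors,
    # shared variance), which classifies each point to the nearer estimated mean.
    m0, m1 = x[y == 0].mean(), x[y == 1].mean()
    err_gen = np.mean((np.abs(xt - m1) < np.abs(xt - m0)) != (yt == 1))

    # Discriminative route: logistic regression fit by gradient ascent on (intercept, slope).
    Z = np.column_stack([np.ones(n_train), x])
    b = np.zeros(2)
    for _ in range(2000):
        p = 1.0 / (1.0 + np.exp(-Z @ b))
        b += 0.05 * Z.T @ (y - p) / n_train
    err_log = np.mean(((b[0] + b[1] * xt) > 0) != (yt == 1))
    return err_gen, err_log

err_gen, err_log = np.mean([one_rep() for _ in range(20)], axis=0)
print(err_gen, err_log)
```

With classes this well separated both routes land near the Bayes error of roughly 0.067; the efficiency comparison Efron quantifies concerns how quickly each approaches that rate as the training sample grows.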
Consider binary response y ∈ {0, 1} and the joint word-sentiment distribution p(x, y) = MN(x | q(y)) p(y), where q_j(y) = exp[α_j + ϕ_j y] / Σ_l exp[α_l + ϕ_l y] – that is, the collapsed model in Equation 1 of the main paper. Then the expected information for ϕ is πW, where π = E[y] and W = diag(q) − qq′ with q = q(y = 1), and standard results (e.g., van der Vaart, 1998, chap. 5) imply that in a fixed vocabulary the variance for the maximum likelihood estimator ϕ̂ scales with M = Σ_i Σ_j x_ij, the total number of words.
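This M-scaling can be checked numerically. The toy vocabulary, the smoothing constant, and the identification by a reference category in the sketch below are all illustrative assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5                                         # toy vocabulary size
alpha = rng.normal(0, 1, p)
phi = np.array([1.0, -1.0, 0.5, -0.5, 0.0])   # identified by fixing the last category at zero

def q(y):
    e = np.exp(alpha + phi * y)
    return e / e.sum()

def phi_hat(M, n=20):
    """Collapsed-MNIR MLE of phi from class-pooled counts: n documents, M total words."""
    y = rng.integers(0, 2, n)
    y[0], y[1] = 0, 1                         # guarantee both sentiment classes appear
    counts = np.zeros((2, p))
    for yi in y:
        counts[yi] += rng.multinomial(M // n, q(yi))
    lr = np.log(counts[1] + 0.5) - np.log(counts[0] + 0.5)   # smoothed log count ratios
    return lr - lr[-1]                        # re-impose the reference-category identification

def sd_first_coord(M, reps=200):
    return np.std([phi_hat(M)[0] for _ in range(reps)])

sd_small, sd_big = sd_first_coord(2_000), sd_first_coord(32_000)
print(sd_small, sd_big)
```

The number of documents is held at n = 20 throughout, so any drop in the standard deviation as M grows is driven entirely by words-per-document; a 16-fold increase in M should cut the sampling standard deviation roughly four-fold.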
Proposition 1.1. Assume the above joint model for y and x with π > 0, and write ϕ̂ for the MLE fit of ϕ in our collapsed MNIR model. The estimation error converges in distribution as

√(πM) (ϕ̂ − ϕ) ⇝ N(0, W⁻¹).

Thus variance decreases with the amount of speech rather than with the number of speakers.

Prediction requires an accompanying forward model. If the collapsed model holds true, Bayes rule implies a forward predictor and the results of Proposition 1.1 apply directly. A more realistic scenario has the collapsed model misspecified on an individual level. Consider a model of individual heterogeneity such that x ⊥⊥ y | x′ϕ, u, where ϕ can be estimated consistently as in Proposition 1.1 and u is a vector of unobserved random effects – for example, the model of Section 3.3 with x_ij ∼ Po(exp[µ_j + ϕ_j y_i + u_ij]) and y_i ⊥⊥ u_ij ∼ N(0, σ_u²). Write z = ϕ′f = ϕ′(x/m − n⁻¹ Σ_i x_i/m_i) for the projection of mean-shifted frequencies F = [f_1 ··· f_n]′, and say MNIR-OLS is the two-stage estimation of ϕ̂ in collapsed MNIR and of [α̂, β̂] given ẑ = Fϕ̂ via least squares (OLS). Consider the simple forward approximation E[y | f, u] = α + βz (e.g., if y = α̃ + β̃z + γ′u + ε and u_j = a_j + b_j z + ν_j with ν_j ⊥⊥ z, then β = β̃ + γ′b). Since E[y | f] = E[y | f, u], we have E[argmin_θ Σ_i (y_i − α − f_i′θ)²] = ϕβ, such that OLS and MNIR-OLS have the same expectation and the effect of u on z is subsumed in β.

The distinction of MNIR-OLS is its estimation precision.
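A minimal end-to-end sketch of the two-stage recipe can make it concrete. The vocabulary size, document lengths, and the light smoothing in stage 1 below are toy assumptions, not settings from the article:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, m = 20, 60, 500                         # vocabulary, documents, words per document
alpha = rng.normal(0, 1, p)
phi = rng.normal(0, 1, p); phi[-1] = 0.0      # reference-category identification

y = rng.integers(0, 2, n).astype(float)
def q(yi):
    e = np.exp(alpha + phi * yi)
    return e / e.sum()
X = np.array([rng.multinomial(m, q(yi)) for yi in y])

# Stage 1: collapsed-MNIR MLE of phi from class-pooled counts (with light smoothing).
lr = np.log(X[y == 1].sum(0) + 0.5) - np.log(X[y == 0].sum(0) + 0.5)
phi_hat = lr - lr[-1]

# Stage 2: project mean-shifted frequencies onto phi-hat, then univariate OLS of y on z-hat.
F = X / m - (X / m).mean(0)
z_hat = F @ phi_hat
beta = np.cov(z_hat, y, bias=True)[0, 1] / z_hat.var()
a = y.mean() - beta * z_hat.mean()
y_fit = a + beta * z_hat
print(np.corrcoef(y_fit, y)[0, 1])
```

Whatever the vocabulary size, the forward step is a single univariate regression, which is the source of the precision claims below.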
Proposition 1.2. Consider data from the joint word-sentiment distribution of Proposition 1.1, partitioned into documents {x_i, y_i}_{i=1}^n where 0 < Σ_i y_i < n. Assuming a finite upper bound for each |ϕ̂_j|, the MNIR-OLS predictor ŷ(x) for a new document x has

var(ŷ(x)) → σ²(1/n + z²/Σ_{i=1}^n z_i²) as M → ∞,

where z = f′ϕ is the true projection for x and σ² is the residual variance for the regression of y on z.

Proof. Note that z̄ = 0 and var(ŷ(x)) = var(α̂) + f′ var(ϕ̂ β̂_ẑ) f, where β̂_ẑ is the OLS slope on ẑ = Fϕ̂. From Proposition 1.1 and the continuous mapping theorem we have ϕ̂ →p ϕ and β̂_ẑ ⇝ β̂_z. Slutsky's lemma yields ϕ̂ β̂_ẑ ⇝ ϕ β̂_z, with variance ϕ var(β̂_z) ϕ′ = σ² ϕϕ′ / Σ_i z_i². Given that ϕ̂ β̂_ẑ is bounded on its finite domain, the Portmanteau lemma implies our convergence.

Thus, in our simple cartoon, MNIR-OLS approaches with the number of words the error rate of univariate least squares. This holds for infill asymptotics (where n is constant but speech-per-document grows) as well as when n grows with M and the right-hand side of Proposition 1.2 is decreasing. Regularized estimation, say as applied in the main article, should help efficiency in tougher setups (e.g., where the vocabulary grows with M) but will increase bias. Although we've focused on linear models, many other options are available – for example, tree methods (e.g., Breiman, 2001) work well in low dimensions for nonlinearity and variable interaction. The principles remain the same: results like Proposition 1.1 show efficiency in collapsed IR, and one hopes to be able to account for individual-level misspecification in the low-dimensional forward model.

Prof. Blei's second extension is an especially promising idea. Random effects were originally viewed as a nuisance necessary for understanding misspecification. However, a low-dimensional latent factorization of these effects would be a powerful tool for exploration and prediction.
It provides a middle ground between LDA and MNIR.

Such a model has log-odds η = α + Φy + Γu, where u = [u_1 . . . u_K]′ is a K-dimensional factor vector. Γ can then be interpreted as logit-transformed LDA topics for variation in text not explained by variables in y. Just as Φ′x is sufficient for y, the topic projection Γ′x will be sufficient for the latent factors. Therefore the model provides both a new way to think about latent structure in text and a strategy for fast computation of topic weights.

The difficulty with latent factor modeling is estimation. On the one hand, although the model is more complex, estimation variance should still decrease with M because of the multinomial assumption on x (indeed, similar arguments can explain the solid performance of LDA and sLDA regression). However, there are two big computational issues in posterior maximization with document-specific Γu_i: you can no longer collapse the likelihood, and you need to jointly solve for Γ and U = [u_1 . . . u_n]′. Since the discussants and I work on corpora many orders of magnitude larger than the examples in this article, additional latent structure is only useful if we can devise scalable algorithms for its estimation.

On the lack of collapsibility, which is also an issue for high-dimensional y, I have had success applying a MapReduce strategy (Dean and Ghemawat, 2004). A factorized likelihood is obtained by assuming counts x_ij and x_ik for j ≠ k are independent and Poisson distributed given y_i and u_i (centered on intensity exp(m_i/p) for convenience). The Map step groups counts on each column of X (i.e., for each word) and the Reduce step is a (possibly zero-inflated) Poisson log regression of each word count onto y_i and u_i. Exponential family parametrization of the Poisson allows the same sufficiency results, and the multinomial distribution for vectors of independent Poissons given their sum implies a close connection to MNIR.
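The per-word Reduce step can be sketched on a single machine. The plain (not zero-inflated) Poisson fit, the toy dimensions, and the Newton solver below are simplifying assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 8
mu = rng.normal(0.5, 0.3, p)                  # word-level intercepts
phi = rng.normal(0.0, 0.5, p)                 # word loadings on sentiment
y = rng.integers(0, 2, n).astype(float)
X = rng.poisson(np.exp(mu + np.outer(y, phi)))

def poisson_newton(x, y, iters=25):
    """Newton's method for x_i ~ Po(exp(b0 + b1 * y_i)); returns (b0, b1)."""
    Z = np.column_stack([np.ones_like(y), y])
    b = np.zeros(2)
    for _ in range(iters):
        lam = np.exp(Z @ b)
        b += np.linalg.solve(Z.T @ (Z * lam[:, None]), Z.T @ (x - lam))
    return b

# The Map step groups counts by word; each column then gets its own independent
# Poisson log regression, so this loop could run on separate machines.
phi_hat = np.array([poisson_newton(X[:, j], y)[1] for j in range(p)])
print(np.abs(phi_hat - phi).max())
```

Because the word-level regressions never communicate, the fit parallelizes trivially across the vocabulary; conditioning the independent Poissons on document totals recovers the multinomial connection noted above.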
A paper on this approach to distributed multinomial regression is under preparation.

Even with these parallel algorithms, it is difficult to solve for both U and Γ. A fixed-point solver (iterating between maximization for each conditional on the other) is usually too slow. One could impute a rough guess for U (e.g., from a PCA of document tf-idf), but this is only a stand-in solution. Recent advances in distributed optimization using ADMM (Boyd et al., 2010) may offer a way forward, iterating from a unique U_j for each jth word towards a shared U across the vocabulary, but this is just conjecture. The problem of latent factor MNIR for large corpora remains unsolved. I look forward to further discussion with Prof. Blei on this because it is something that his lab, if anybody, has a good chance of tackling.

Prof. Grimmer's comments are focused on interpretability: the translation from estimated models to scientific mechanisms. In particular, he and other social scientists are interested in questions of causation. This is among the toughest of topics in statistics, and one that is only growing in both difficulty and importance with the amount and dimension of our data.

First, we should not underestimate the importance of predictive ability in causal modeling. The goal is always good prediction, but to understand causation we want a model that predicts well when one covariate changes and all others stay constant. Some of the best causal inference schemes are explicitly predictive: matching, treatment-effects models, and propensity scores rely upon estimation of the rate at which treated individuals were assigned to that group. As an example, colleagues and I are interested in measuring attribution for digital advertisements (i.e., how an ad causes changes in consumer behavior). This is a notoriously tough problem, since the fact that a consumer sees an ad is highly correlated with the likelihood that they were already looking to buy a certain product. MNIR for a consumer's text (e.g.,
on social media) and their browser history (where website counts are treated like word counts) can be used to efficiently predict the probabilities both that they see an ad and that they buy a product, and we hope to use this to disentangle these correlated outcomes.

However, instead of using text to help control for unobserved variables, Prof. Grimmer is seeking methods to infer the mechanisms behind word choice. This is because he rightly wants to ensure that word loadings correspond to a general notion of partisanship – one that is portable between, say, newspapers and congressional speech. This is the causal problem exploded to simultaneous inference for thousands of correlated outputs. Regardless, MNIR is a natural starting point: I assume that 'sentiment' causes speech rather than the inverse. From this one can look to apply the structural models used in econometrics and biostatistics. As mentioned, the effects of other inputs are 'controlled for' by including them in the log-odds, say as η = α + ϕy + Θv, where v = [v_1 . . . v_d]′ are confounding variables. Going further, an MNIR treatment-effects estimator would regress y on v and include the fitted expectation in the equation for η. One needs to be careful here, as techniques used for efficiency in high dimensions, such as sparse regularization, can bias inference in unexpected ways. See Belloni et al. (2012) for recent work on sparse high-dimensional treatment effects estimation.

Finally, we should be aware of the limits of frameworks like MNIR (this also relates to Prof. Blei's third extension). As Prof. Grimmer says, it is difficult to know what covariates should be included in or excluded from the model. However, this will always be as much of a problem in text analysis as it has long been in social science. The 'what' that we measure is only ever defined in terms of observables and the model assumed around them (even with human coders, sentiment is dictated by the questions we ask).
The goal is to have this be as close as possible to our abstract ideal. For example, an ongoing project at Booth is investigating the history of partisanship in congressional speech. To define partisanship, we look at the average predictability of party identity given words drawn from the distribution of speech for a given party. The question of partisanship has been transformed to one of predictability, and this notion is refined by controlling for causes of word choice (e.g., geography, race) that we understand as distinct from partisanship. It is healthy to keep this inference separate from abstract meanings for sentiment or partisanship, in order to be clear on where evidence ends and speculation begins.

Thanks to Jesse Shapiro, Matt Gentzkow, and Christian Hansen for helpful discussion.
References
Belloni, A., V. Chernozhukov, and C. Hansen (2012). Inference on treatment effects after selection amongst high-dimensional controls. MIT Department of Economics Working Paper No. 12-13.

Boyd, S., N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2010). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3, 1–122.

Breiman, L. (2001). Random forests. Machine Learning 45, 5–32.

Dean, J. and S. Ghemawat (2004). MapReduce: Simplified data processing on large clusters. In Proceedings of Operating Systems Design and Implementation, pp. 137–150.

Efron, B. (1975). The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association 70, 892–898.

Ng, A. Y. and M. I. Jordan (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems (NIPS).

van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press.