Econometrics

"Big Data" and its Origins

Against the background of explosive growth in data volume, velocity, and variety, I investigate the origins of the term "Big Data". Those origins are murky and hence intriguing, involving both academia and industry, statistics and computer science, and ultimately winding back to lunch-table conversations at Silicon Graphics Inc. (SGI) in the mid-1990s. The Big Data phenomenon continues unabated, and the ongoing development of statistical machine learning tools continues to help us confront it.

Econometrics

A Basket Half Full: Sparse Portfolios

The existing approaches to sparse wealth allocations (1) are limited to a low-dimensional setup in which the number of assets is less than the sample size; (2) lack theoretical analysis of sparse wealth allocations and their impact on portfolio exposure; (3) are suboptimal due to the bias induced by an ℓ1-penalty. We address these shortcomings and develop an approach to construct sparse portfolios in high dimensions. Our contribution is twofold: from the theoretical perspective, we establish oracle bounds for sparse weight estimators and provide guidance regarding their distribution. From the empirical perspective, we examine the merit of sparse portfolios during different market scenarios. We find that, in contrast to its non-sparse counterparts, our strategy is robust to recessions and can be used as a hedging vehicle during such times.
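The shrinkage bias the abstract attributes to the ℓ1-penalty can be seen directly in the soft-thresholding operator that underlies ℓ1-penalized estimation. The sketch below is a minimal illustration, not the authors' estimator: it solves a stylized mean-variance problem, min_w ½w'Σw − μ'w + λ‖w‖₁, by proximal gradient descent (ISTA), with a hypothetical identity covariance so the solution is transparent. Each surviving weight is pulled toward zero by exactly λ.

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of lam * ||.||_1: shrinks x toward zero by lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_mean_variance(mu, sigma, lam, n_iter=500):
    """ISTA for min_w 0.5 w'Sigma w - mu'w + lam * ||w||_1."""
    step = 1.0 / np.linalg.eigvalsh(sigma)[-1]  # 1 / Lipschitz constant
    w = np.zeros_like(mu)
    for _ in range(n_iter):
        grad = sigma @ w - mu
        w = soft_threshold(w - step * grad, step * lam)
    return w

mu = np.array([1.0, 0.5, 0.1, 0.05, 0.02])  # hypothetical expected returns
sigma = np.eye(5)                           # identity covariance for clarity
w = sparse_mean_variance(mu, sigma, lam=0.2)
# With Sigma = I the solution is soft_threshold(mu, 0.2): assets with small
# expected return are zeroed out, and the surviving weights are biased
# toward zero by exactly lambda (0.8 instead of 1.0, 0.3 instead of 0.5).
print(w)
```

The exact-zero entries deliver sparsity, and the uniform shrinkage of the nonzero entries is the ℓ1-induced bias the paper's debiased construction targets.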

Econometrics

A Canonical Representation of Block Matrices with Applications to Covariance and Correlation Matrices

We obtain a canonical representation for block matrices. The representation facilitates simple computation of the determinant, the matrix inverse, and other powers of a block matrix, as well as the matrix logarithm and the matrix exponential. These results are particularly useful for block covariance and block correlation matrices, where evaluation of the Gaussian log-likelihood and estimation are greatly simplified. We illustrate this with an empirical application using a large panel of daily asset returns. Moreover, the representation paves new ways to regularize large covariance/correlation matrices and to test for block structures in matrices.
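The paper's canonical representation is not reproduced here, but the kind of simplification it enables can be illustrated with the standard Schur-complement identities for a 2×2 block matrix: det M = det(A)·det(D − CA⁻¹B), which reduces one large determinant to two smaller ones, and the bottom-right block of M⁻¹ equals the inverse of the Schur complement. A numerical check on synthetic matrices (sizes and values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
# Diagonally dominated blocks keep A and the Schur complement invertible.
A = rng.standard_normal((n, n)) + 4 * np.eye(n)
B = rng.standard_normal((n, n))
C = rng.standard_normal((n, n))
D = rng.standard_normal((n, n)) + 4 * np.eye(n)

M = np.block([[A, B], [C, D]])
schur = D - C @ np.linalg.inv(A) @ B  # Schur complement of A in M

# The 2n x 2n determinant factors into two n x n determinants.
lhs = np.linalg.det(M)
rhs = np.linalg.det(A) * np.linalg.det(schur)

# The bottom-right block of M^{-1} is the inverse of the Schur complement.
Minv = np.linalg.inv(M)
print(lhs, rhs)
```

For block covariance matrices with many equally sized blocks, identities of this kind are what make the Gaussian log-likelihood cheap to evaluate.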

Econometrics

A Class of Time-Varying Vector Moving Average Models: Nonparametric Kernel Estimation and Application

Multivariate dynamic time series models are widely encountered in practical studies, e.g., modelling the policy transmission mechanism and measuring connectedness between economic agents. To better capture the dynamics, this paper proposes a wide class of multivariate dynamic models with time-varying coefficients, which have a general time-varying vector moving average (VMA) representation, and nest, for instance, time-varying vector autoregression (VAR), time-varying vector autoregressive moving-average (VARMA), and so forth as special cases. The paper then develops a unified estimation method for the unknown quantities before an asymptotic theory for the proposed estimators is established. In the empirical study, we investigate the transmission mechanism of monetary policy using U.S. data, and uncover a fall in the volatilities of exogenous shocks. In addition, we find that (i) monetary policy shocks have less influence on inflation before and during the so-called Great Moderation, (ii) inflation is more anchored recently, and (iii) the long-run level of inflation is below, but quite close to, the Federal Reserve's target of two percent after the beginning of the Great Moderation period.
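The core estimation idea, local kernel weighting in rescaled time, can be sketched for the simplest special case the class nests: a scalar time-varying AR(1), y_t = a(t/T)·y_{t−1} + ε_t. The snippet below is a minimal local-constant kernel sketch, with a hypothetical coefficient path, kernel, and bandwidth, not the paper's estimator:

```python
import numpy as np

rng = np.random.default_rng(42)
T = 5000
a = lambda u: 0.2 + 0.5 * u  # smooth time-varying AR coefficient on [0, 1]
y = np.zeros(T)
for t in range(1, T):
    y[t] = a(t / T) * y[t - 1] + rng.standard_normal()

def local_ar1(y, u, h):
    """Local-constant kernel estimate of a(u) via weighted least squares."""
    n = len(y)
    t = np.arange(1, n)
    k = np.exp(-0.5 * ((t / n - u) / h) ** 2)  # Gaussian kernel weights
    num = np.sum(k * y[1:] * y[:-1])
    den = np.sum(k * y[:-1] ** 2)
    return num / den

a_hat = local_ar1(y, u=0.5, h=0.1)
print(a_hat)  # should be close to a(0.5) = 0.45
```

Only observations whose rescaled time t/T lies near u receive appreciable weight, which is what lets a single sample trace out the whole coefficient path.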

Econometrics

A Comparison of Methods for Treatment Assignment with an Application to Playlist Generation

This study presents a systematic comparison of methods for individual treatment assignment, a general problem that arises in many applications and has received significant attention from economists, computer scientists, and social scientists. We classify the various methods proposed in the literature into three general approaches: learning models to predict outcomes, learning models to predict causal effects, and learning models to predict optimal treatment assignments. We show analytically that optimizing for outcome or causal effect prediction is not the same as optimizing for treatment assignments, and thus we should prefer learning models that optimize for treatment assignments. We then compare and contrast the three approaches empirically in the context of choosing, for each user, the best algorithm for playlist generation in order to optimize engagement. This is the first comparison of the different treatment assignment approaches on a real-world application at scale (based on more than half a billion individual treatment assignments). Our results show (i) that applying different algorithms to different users can improve streams substantially compared to deploying the same algorithm for everyone, (ii) that personalized assignments improve substantially with larger data sets, and (iii) that learning models by optimizing for treatment assignment can increase engagement by 28% more than when optimizing for outcome or causal effect predictions.
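Finding (i), that personalized assignment can beat deploying a single algorithm for everyone, is easy to reproduce in a stylized simulation. Everything below (the covariate, the two outcome functions, the oracle policy) is a hypothetical illustration, not the paper's data or models:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.uniform(-1.0, 1.0, n)  # a user covariate

# Hypothetical expected engagement under two playlist algorithms.
y_a = 1.0 + 0.5 * x            # algorithm A works better for high-x users
y_b = 1.2 - 0.8 * x            # algorithm B works better for low-x users

best_uniform = max(y_a.mean(), y_b.mean())  # one algorithm for everyone
personalized = np.maximum(y_a, y_b).mean()  # assign per user

print(best_uniform, personalized)
```

Note that the personalized policy only needs the *sign* of y_a − y_b per user, not accurate outcome levels, which is the intuition behind preferring models optimized for assignment over models optimized for outcome prediction.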

Econometrics

A Comparison of Statistical and Machine Learning Algorithms for Predicting Rents in the San Francisco Bay Area

Urban transportation and land use models have used theory and statistical modeling methods to develop model systems that are useful in planning applications. Machine learning methods have been considered too 'black box', lacking interpretability, and their use has been limited within the land use and transportation modeling literature. We present a use case in which predictive accuracy is of primary importance, and compare the use of random forest regression to multiple regression using ordinary least squares, to predict rents per square foot in the San Francisco Bay Area using a large volume of rental listings scraped from the Craigslist website. We find that we are able to obtain useful predictions from both models using almost exclusively local accessibility variables, though the predictive accuracy of the random forest model is substantially higher.
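A minimal version of the comparison can be sketched with scikit-learn on synthetic data. The feature interpretation and data-generating process below are hypothetical (the paper uses scraped Craigslist listings), but the qualitative pattern matches the abstract: when the rent surface is nonlinear in accessibility variables, a random forest out-predicts OLS on held-out data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 4000
X = rng.uniform(-1, 1, size=(n, 2))  # e.g. transit and job accessibility
# Nonlinear "rent per square foot" surface plus noise.
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(n)

X_tr, X_te = X[:3000], X[3000:]
y_tr, y_te = y[:3000], y[3000:]

ols = LinearRegression().fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

r2_ols = ols.score(X_te, y_te)  # out-of-sample R^2
r2_rf = rf.score(X_te, y_te)
print(r2_ols, r2_rf)
```

The linear model cannot pick up the quadratic term at all (it is uncorrelated with the raw feature on a symmetric support), while the forest approximates both nonlinearities, mirroring the accuracy gap reported in the paper.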

Econometrics

A Consistent LM Type Specification Test for Semiparametric Panel Data Models

This paper develops a consistent series-based specification test for semiparametric panel data models with fixed effects. The test statistic resembles the Lagrange Multiplier (LM) test statistic in parametric models and is based on a quadratic form in the restricted model residuals. The use of series methods facilitates both estimation of the null model and computation of the test statistic. The asymptotic distribution of the test statistic is standard normal, so that appropriate critical values can easily be computed. The projection property of series estimators allows me to develop a degrees of freedom correction. This correction makes it possible to account for the estimation variance and obtain refined asymptotic results. It also substantially improves the finite sample performance of the test.
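In its simplest cross-sectional ancestor, an LM-type series test is a quadratic form in restricted residuals: regress the null-model residuals on a richer series of terms and compute n·R². The sketch below uses that nR² form with polynomial series terms on cross-sectional data; it is a hedged illustration of the mechanics, not the paper's standardized panel statistic with fixed effects and degrees-of-freedom correction:

```python
import numpy as np

def lm_series_stat(y, x, degree=3):
    """nR^2 LM-type statistic: residuals from the null model y ~ [1, x]
    are regressed on the polynomial series [1, x, ..., x^degree]."""
    n = len(y)
    X0 = np.column_stack([np.ones(n), x])               # restricted (null) model
    e = y - X0 @ np.linalg.lstsq(X0, y, rcond=None)[0]  # restricted residuals
    Z = np.column_stack([x ** d for d in range(degree + 1)])  # series terms
    e_fit = Z @ np.linalg.lstsq(Z, e, rcond=None)[0]
    r2 = np.sum(e_fit ** 2) / np.sum(e ** 2)            # e has mean zero
    return n * r2

rng = np.random.default_rng(3)
n = 2000
x = rng.uniform(-1, 1, n)
y_lin = 1.0 + 2.0 * x + 0.5 * rng.standard_normal(n)  # null model is true
y_nl = 1.0 + x ** 2 + 0.5 * rng.standard_normal(n)    # null model is false

stat_null_true = lm_series_stat(y_lin, x)
stat_null_false = lm_series_stat(y_nl, x)
print(stat_null_true, stat_null_false)
```

Under the null the statistic stays near the number of extra series terms; under misspecification it grows linearly in n, which is what makes the test consistent.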

Econometrics

A Control Function Approach to Estimate Panel Data Binary Response Model

We propose a new control function (CF) method to estimate a binary response model in a triangular system with multiple unobserved heterogeneities. The CFs are the expected values of the heterogeneity terms in the reduced form equations conditional on the histories of the endogenous and the exogenous variables. The method requires weaker restrictions compared to CF methods with similar imposed structures. If the support of the endogenous regressors is large, average partial effects are point-identified even when instruments are discrete. Bounds are provided when the support assumption is violated. An application and Monte Carlo experiments compare several alternative methods with ours.
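The control-function logic, conditioning on first-stage residuals to absorb the endogenous heterogeneity, is easiest to see in a linear sketch. The paper's setting is a binary response model with weaker restrictions; everything below, including the data-generating process, is a hypothetical linear illustration of the general idea:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
z = rng.standard_normal(n)                  # instrument
v = rng.standard_normal(n)                  # first-stage heterogeneity
u = 0.8 * v + 0.3 * rng.standard_normal(n)  # structural error, correlated with v
x = z + v                                   # endogenous regressor
y = 1.0 * x + u                             # true coefficient is 1.0

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Naive OLS is biased because cov(x, u) != 0.
b_naive = ols(np.column_stack([np.ones(n), x]), y)[1]

# Control function: first stage x ~ z, then include the residual as a control.
Z1 = np.column_stack([np.ones(n), z])
v_hat = x - Z1 @ ols(Z1, x)
b_cf = ols(np.column_stack([np.ones(n), x, v_hat]), y)[1]

print(b_naive, b_cf)
```

Conditioning on the first-stage residual v_hat makes x exogenous in the second stage, so the structural coefficient is recovered while the naive regression stays biased.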

Econometrics

A Correlated Random Coefficient Panel Model with Time-Varying Endogeneity

This paper studies a class of linear panel models with random coefficients. We do not restrict the joint distribution of the time-invariant unobserved heterogeneity and the covariates. We investigate identification of the average partial effect (APE) when fixed-effect techniques cannot be used to control for the correlation between the regressors and the time-varying disturbances. Relying on control variables, we develop a constructive two-step identification argument. The first step identifies nonparametrically the conditional expectation of the disturbances given the regressors and the control variables, and the second step uses "between-group" variations, correcting for endogeneity, to identify the APE. We propose a natural semiparametric estimator of the APE, show its √n-asymptotic normality and compute its asymptotic variance. The estimator is computationally easy to implement, and Monte Carlo simulations show favorable finite sample properties. Control variables arise in various economic and econometric models, and we provide variations of our argument to obtain identification in some applications. As an empirical illustration, we estimate the average elasticity of intertemporal substitution in a labor supply model with random coefficients.