Efficient Computation for Centered Linear Regression with Sparse Inputs
Jeffrey Wong
Experimentation Platform, Netflix, Inc.
October 30, 2019
Abstract
Regression with sparse inputs is a common theme for large scale models. Optimizing the underlying linear algebra for sparse inputs allows such models to be estimated faster. At the same time, centering the inputs has benefits in improving the interpretation and convergence of the model. However, centering the data naturally makes sparse data become dense, limiting opportunities for optimization. We propose an efficient strategy that estimates centered regression while taking advantage of sparse structure in data, improving computational performance and decreasing the memory footprint of the estimator.
Centering regression inputs is an operation done when estimating linear models, such as Ordinary Least Squares (OLS), that can improve interpretation and convergence. Given the model $y = \beta_0 + X\beta + \epsilon$ with $\epsilon \sim N(0, V)$, centering removes the average of $X$ from each feature vector in the dataset so that the centered feature represents the deviation from the average. This allows the parameter $\hat{\beta}_0$ to represent the mean of $y$ for the average observation. The coefficients $\hat{\beta}$ then represent offsets from the average observation (Aiken, West, and Reno 1991). In randomized and controlled trials, we often seek the average treatment effect, which is the average deviation between observations where treatment was applied and observations where treatment was not applied. Without centering, $\hat{\beta}_0$ represents the mean of $y$ for an arbitrary baseline group, and $\hat{\beta}$ would be the offset from that arbitrary baseline.

When the model uses features that are transformations of $X$, for example $X^2$, centering the regression makes it easier to estimate the distribution of $\hat{\beta} = (\hat{\beta}_1, \hat{\beta}_2)$. Removing the centers of the features shrinks the covariances between the features and decreases the condition number. As the covariances shrink, the covariance matrix behaves more like a diagonal matrix, which is easy to invert and solve.

Optimizing regression models for sparse inputs is an important component of performance (Wong, Lewis, and Wardrop 2019), especially when using one-hot encoded categorical variables; however, it can conflict with centered regression. Given a model matrix with arbitrary features, $M$, and a response vector, $y$, we wish to estimate OLS with centered inputs. The matrix $M$ can be stored as a dense matrix, $M_D$, or as a sparse matrix, $M_S$. If $M$ is stored as a dense matrix, centering the inputs and then estimating OLS incurs little overhead. However, if $M$ is stored as a sparse matrix, centering $M_S$ results in a dense matrix, limiting sparse optimizations that take advantage of the structure of $M_S$. In this paper we show how to expand OLS so that we take advantage of the structure of $M_S$ as much as possible. We show this expansion for three common cases, and then describe linear algebra optimizations to create an efficient solver.

Let $M_S \in \mathbb{R}^{n \times p}$ be a model matrix, $\mu^T \in \mathbb{R}^{1 \times p}$ be a row vector for the $p$ column means of $M_S$, and $1_n$ be a length-$n$ column vector of all ones. Then the centered model matrix is $\tilde{M} = (M_S - 1_n \mu^T)$. Estimating OLS with $\tilde{M}$ and $y$ yields the estimator

$$\hat{\beta}_{\text{transformed}} = (\tilde{M}^T \tilde{M})^{-1} \tilde{M}^T y.$$

This can be done by materializing $\tilde{M}$ and passing it as input to a standard OLS solver, such as StatsModels (Seabold and Perktold 2010). However, materializing $\tilde{M}$ transforms the inputs from sparse to dense. Instead, we use a strategy that splits operations into sparse optimized operations, and dense operations that are easy to compute. This section describes the linear algebra operations for three different use cases, then section 4 describes the optimizations.

To better utilize the sparsity in $M_S$ we write the first expansion as

$$\hat{\beta}_{\text{transformed}} = \left( (M_S - 1_n \mu^T)^T (M_S - 1_n \mu^T) \right)^{-1} (M_S - 1_n \mu^T)^T y$$
$$= \left( M_S^T M_S - M_S^T 1_n \mu^T - \mu 1_n^T M_S + \mu 1_n^T 1_n \mu^T \right)^{-1} \left( M_S^T y - \mu 1_n^T y \right),$$

where $M_S^T M_S$ is a sparse optimized matrix multiplication, and $- M_S^T 1_n \mu^T - \mu 1_n^T M_S + \mu 1_n^T 1_n \mu^T$ is an easy to compute dense operation, discussed in section 4.
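To make the unweighted expansion concrete, note that $M_S^T 1_n = n\mu$, so the three dense correction terms collapse to the rank-one update $-n\mu\mu^T$. The sketch below is a minimal illustration of this expansion using SciPy sparse matrices; the function name `centered_ols` and the use of SciPy are our assumptions for illustration, not the paper's Eigen-based implementation.

```python
# Minimal sketch of the unweighted centered-OLS expansion with sparse inputs.
import numpy as np
import scipy.sparse as sp

def centered_ols(M_s, y):
    """Estimate OLS on centered inputs without materializing M_S - 1_n mu^T."""
    n = M_s.shape[0]
    mu = np.asarray(M_s.mean(axis=0)).ravel()      # column means, length p

    # Sparse-optimized Gram matrix M_S^T M_S plus the cheap dense correction.
    # Because M_S^T 1_n = n * mu, the correction terms collapse to -n * mu mu^T.
    gram = (M_s.T @ M_s).toarray() - n * np.outer(mu, mu)

    # M~^T y = M_S^T y - mu * (1_n^T y)
    rhs = M_s.T @ y - mu * y.sum()
    return np.linalg.solve(gram, rhs)

# Usage: agrees with a naive solve that materializes the dense centered matrix.
rng = np.random.default_rng(0)
M_s = sp.random(1000, 5, density=0.1, random_state=0, format="csr")
y = rng.normal(size=1000)
M_tilde = M_s.toarray() - M_s.toarray().mean(axis=0)
assert np.allclose(centered_ols(M_s, y), np.linalg.lstsq(M_tilde, y, rcond=None)[0])
```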
Suppose the regression is weighted by the vector $w$. Then centering refers to removing the weighted mean $1_n \mu_w^T$, where $\mu_w^T$ is the row vector of weighted column means of $M_S$ weighted by $w$. The weighted OLS problem becomes

$$\hat{\beta}_{\text{transformed}} = \left( (M_S - 1_n \mu_w^T)^T W (M_S - 1_n \mu_w^T) \right)^{-1} (M_S - 1_n \mu_w^T)^T W y$$
$$= \left( M_S^T W M_S - M_S^T W 1_n \mu_w^T - \mu_w 1_n^T W M_S + \mu_w 1_n^T W 1_n \mu_w^T \right)^{-1} \left( M_S^T W y - \mu_w 1_n^T W y \right).$$

In addition to centering, we may also wish to scale $M_S$ so that each column has weighted variance 1. For instance, the lasso (Tibshirani 1996) shrinks coefficients to zero and assumes the data has been centered and scaled. Let $\sigma_w^T$ be the row vector of weighted column standard deviations of $M_S$ weighted by $w$. Scaling $M_S$ to have weighted column variance 1 is equivalent to the operation $M_S^* = M_S \Sigma$ where $\Sigma = \operatorname{diag}(1/\sigma_{w,1}, \ldots, 1/\sigma_{w,p})$. The scaled and centered OLS estimator becomes

$$\hat{\beta}_{\text{transformed}} = \left( \left( (M_S - 1_n \mu_w^T)\Sigma \right)^T W \left( (M_S - 1_n \mu_w^T)\Sigma \right) \right)^{-1} \left( (M_S - 1_n \mu_w^T)\Sigma \right)^T W y$$
$$= \left( \Sigma M_S^T W M_S \Sigma - \Sigma M_S^T W 1_n \mu_w^T \Sigma - \Sigma \mu_w 1_n^T W M_S \Sigma + \Sigma \mu_w 1_n^T W 1_n \mu_w^T \Sigma \right)^{-1} \left( \Sigma M_S^T W y - \Sigma \mu_w 1_n^T W y \right)$$
$$= \left( M_S^{*T} W M_S^* - M_S^{*T} W 1_n \mu_w^T \Sigma - \Sigma \mu_w 1_n^T W M_S^* + \Sigma \mu_w 1_n^T W 1_n \mu_w^T \Sigma \right)^{-1} \left( M_S^{*T} W y - \Sigma \mu_w 1_n^T W y \right).$$

When computing homoskedastic covariances where $\operatorname{var}(\epsilon) = kI$, the covariance of $\hat{\beta}_{\text{transformed}}$ can be computed using the standard formula $\operatorname{Cov}(\hat{\beta}_{\text{transformed}}) = (\tilde{M}^T \tilde{M})^{-1} \hat{k}$, where $\hat{k} = \frac{1}{n - p} \sum_i (y_i - \hat{y}_i)^2$. For efficient computation, we reuse the expansion

$$(\tilde{M}^T \tilde{M})^{-1} = \left( \left( (M_S - 1_n \mu_w^T)\Sigma \right)^T W \left( (M_S - 1_n \mu_w^T)\Sigma \right) \right)^{-1} = \left( M_S^{*T} W M_S^* - M_S^{*T} W 1_n \mu_w^T \Sigma - \Sigma \mu_w 1_n^T W M_S^* + \Sigma \mu_w 1_n^T W 1_n \mu_w^T \Sigma \right)^{-1}.$$

Likewise, for heteroskedasticity-consistent covariances (White et al. 1980) we expand $(\tilde{M}^T \tilde{M})^{-1} (\tilde{M}^T \operatorname{diag}(\epsilon^2) \tilde{M}) (\tilde{M}^T \tilde{M})^{-1}$ by replacing $W$ with a diagonal matrix having diagonal entries $w_i \epsilon_i^2$.

Predictions on $M_S$

Fitting OLS on centered and scaled inputs with centers $\mu_w$ and standard deviations $\sigma_w$ yields coefficients that can be used to create predictions for the centered and scaled data, such as $\hat{y} = (M_S - 1_n \mu_w^T) \Sigma \hat{\beta}_{\text{transformed}}$. However, in practice we will receive new data in the form of $M_S$, not $(M_S - 1_n \mu_w^T)\Sigma$. To predict directly on $M_S$, we compute a $\hat{\beta}_{\text{original}}$ for the original scale of the data, where $\hat{\beta}_{\text{original}} = (I_{p \times p} - 1_p^{(1)} \mu_w^T) \Sigma \hat{\beta}_{\text{transformed}}$, and $1_p^{(1)}$ is a length-$p$ column vector with value 1 located at index 1 and 0 everywhere else.

Optimizations

There are four key terms throughout the OLS expansions that can be computed efficiently; a sketch combining them appears after this list.

1. $M_S^{*T} W M_S^*$. This product is symmetric, so only half of it needs to be computed. It can be optimized using sparse matrix multiplications, which are implemented in Eigen (Guennebaud, Jacob, et al. 2010).
2. $M_S^{*T} W 1_n \mu_w^T \Sigma$. The multiplication can be ordered specifically as $(M_S^{*T} W 1_n)(\mu_w^T \Sigma)$. $(M_S^{*T} W 1_n)$ reduces to weighted column sums of $M_S^*$ weighted by $w$. $(\mu_w^T \Sigma)$ reduces to elementwise multiplication between the length-$p$ vectors $\mu_w$ and $1/\sigma_w$.
3. $\Sigma \mu_w 1_n^T W 1_n \mu_w^T \Sigma$. This product is again symmetric. The multiplication can be ordered specifically as $(\Sigma \mu_w)(1_n^T W 1_n)(\mu_w^T \Sigma)$. From 2), $\Sigma \mu_w$ is computed using elementwise multiplication. $(1_n^T W 1_n)$ is the sum of the weights vector, $w$.
4. $W y$ and $1_n^T W y$. $W y$ is elementwise multiplication of the length-$n$ vectors $w$ and $y$. $1_n^T W y$ sums that elementwise multiplication.
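As an illustration of the ordering of operations, the sketch below assembles $\tilde{M}^T W \tilde{M}$ and $\tilde{M}^T W y$ from the four terms above using SciPy. The function name, the uncorrected weighted-variance convention, and the assumption that no column has zero weighted variance are ours; the paper's Eigen-based implementation is not reproduced here.

```python
# Sketch of the four optimized terms for the weighted, scaled expansion.
import numpy as np
import scipy.sparse as sp

def weighted_centered_scaled_ols(M_s, y, w):
    """Solve weighted OLS on centered, variance-scaled inputs without forming
    the dense matrix (M_S - 1_n mu_w^T) Sigma."""
    sw = w.sum()                                    # 1_n^T W 1_n
    mu_w = (M_s.T @ w) / sw                         # weighted column means
    var_w = (M_s.power(2).T @ w) / sw - mu_w**2     # weighted column variances
    inv_sigma = 1.0 / np.sqrt(var_w)                # diagonal of Sigma (assumes var_w > 0)

    M_star = M_s @ sp.diags(inv_sigma)              # M* = M_S Sigma, still sparse
    mu_scaled = mu_w * inv_sigma                    # mu_w^T Sigma as a length-p vector

    # Term 1: sparse-optimized, symmetric M*^T W M*
    term1 = (M_star.T @ sp.diags(w) @ M_star).toarray()
    # Term 2: (M*^T W 1_n)(mu_w^T Sigma), an outer product of two length-p vectors
    term2 = np.outer(M_star.T @ w, mu_scaled)
    # Term 3: (Sigma mu_w)(1_n^T W 1_n)(mu_w^T Sigma), a symmetric rank-one term
    term3 = sw * np.outer(mu_scaled, mu_scaled)

    gram = term1 - term2 - term2.T + term3          # equals M~^T W M~
    rhs = M_star.T @ (w * y) - mu_scaled * (w @ y)  # M*^T W y - Sigma mu_w 1_n^T W y
    return np.linalg.solve(gram, rhs)
```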
In general, the runtime complexity for estimating OLS is $O(np^2 + p^3)$. When $n > p$, it is dominated by constructing $M^T M$; otherwise it is dominated by the pseudoinverse for $(M^T M)^{-1}$. Despite requesting centered inputs, these linear algebra optimizations allow us to form $M_S^{*T} M_S^*$ using sparse inputs, which combined with simple dense operations can construct $\tilde{M}^T \tilde{M}$ efficiently. However, we cannot invert $\tilde{M}^T \tilde{M}$ efficiently without using dense algebra.

For large $n$, the memory requirements to materialize $\tilde{M}$ can be very large. Using the optimized operations, we only materialize $\tilde{M}^T \tilde{M}$, without materializing $\tilde{M}$. The memory requirement to materialize $(M_S, W, \mu_w^T, \sigma_w^T)$, which are the components for the OLS expansions, is $O(np \cdot \text{density} + n)$. Storing $\tilde{M}^T \tilde{M}$ only requires $O(p^2)$ memory.

Performance

In this section we compare the performance of the efficient solver against that of the naive solver, $\hat{\beta}_{\text{transformed}} = (\tilde{M}^T \tilde{M})^{-1} \tilde{M}^T y$, which materializes the dense $\tilde{M}$ matrix. The experiment simulates sparse model matrices with density rates of 0.01, 0.05, 0.1, 0.15, 0.2, and 0.25, with $p = 100$ features. We also vary the sample size from 0.1 million observations to 10 million observations. Results are displayed in the figure below. When data is large and sparse, the efficient solver estimates OLS much faster than the naive solver, up to 35 times as fast. Similarly, the memory used by the efficient solver is smaller than that of the naive solver by a factor of $1/\text{density}$.

[Figure: (a) Time to estimate centered regression for various sample sizes and density rates. (b) Memory usage for various sample sizes and density rates.]

Conclusion

Centered regression has benefits in interpretability and convergence, though centering the features can lead to a computationally expensive regression with dense inputs. We described a computational strategy that expands the standard OLS estimator to take advantage of sparse data structures while still centering the inputs. The resulting implementation resolves the conflict between the desirability of centered regression and the performance benefits of sparse data.
References

[AWR91] Leona S Aiken, Stephen G West, and Raymond R Reno. Multiple regression: Testing and interpreting interactions. Sage, 1991.
[G+10] Gaël Guennebaud, Benoît Jacob, et al. Eigen v3. http://eigen.tuxfamily.org. 2010.
[SP10] Skipper Seabold and Josef Perktold. "statsmodels: Econometric and statistical modeling with python". In: Proceedings of the 9th Python in Science Conference. 2010.
[Tib96] Robert Tibshirani. "Regression shrinkage and selection via the lasso". In: Journal of the Royal Statistical Society: Series B (Methodological) (1996).
[Whi80] Halbert White. "A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity". In: Econometrica (1980).
[WLW19] Wong, Lewis, and Wardrop. arXiv preprint arXiv:1910.01305. 2019.