Efficient Computation for Centered Linear Regression with Sparse Inputs
Jeffrey Wong
Experimentation Platform, Netflix, Inc.
October 30, 2019
Abstract
Regression with sparse inputs is a common theme for large scale models. Optimizing the underlying linear algebra for sparse inputs allows such models to be estimated faster. At the same time, centering the inputs has benefits in improving the interpretation and convergence of the model. However, centering the data naturally makes sparse data become dense, limiting opportunities for optimization. We propose an efficient strategy that estimates centered regression while taking advantage of sparse structure in data, improving computational performance and decreasing the memory footprint of the estimator.
Centering regression inputs is an operation done when estimating linear models, such as Ordinary Least Squares (OLS), that can improve interpretation and convergence. Given the model $y = \beta_0 + X\beta + \epsilon$ with $\epsilon \sim N(0, V)$, centering removes the average of $X$ from each feature vector in the dataset so that the centered feature represents the deviation from the average. This allows the parameter $\hat{\beta}_0$ to represent the mean of $y$ for the average observation. The coefficients $\hat{\beta}$ then represent offsets from the average observation (Aiken, West, and Reno 1991). In randomized and controlled trials, we often seek the average treatment effect, which is the average deviation between observations where treatment was applied and observations where treatment was not applied. Without centering, $\hat{\beta}_0$ represents the mean of $y$ for an arbitrary baseline group, and $\hat{\beta}$ would be the offset from that arbitrary baseline.

When the model uses features that are transformations of $X$, for example $X^2$, centering the regression makes it easier to estimate the distribution of $\hat{\beta} = (\hat{\beta}_1, \hat{\beta}_2)$. Removing the centers of the features shrinks the covariances between the features and decreases the condition number. As the covariances shrink, the covariance matrix behaves more like a diagonal matrix, which is easy to invert and solve.

Optimizing regression models for sparse inputs is an important component of performance (Wong, Lewis, and Wardrop 2019), especially when using one-hot encoded categorical variables; however, it can conflict with centered regression. Given a model matrix with arbitrary features, $M$, and a response vector, $y$, we wish to estimate OLS with centered inputs. The matrix $M$ can be stored as a dense matrix, $M_D$, or as a sparse matrix, $M_S$. If $M$ is stored as a dense matrix, centering the inputs and then estimating OLS incurs little overhead. However, if $M$ is stored as a sparse matrix, centering $M_S$ results in a dense matrix, limiting sparse optimizations that take advantage of the structure of $M_S$. In this paper we show how to expand OLS so that we take advantage of the structure of $M_S$ as much as possible. We show this expansion for three common cases, and then describe linear algebra optimizations to create an efficient solver.

Let $M_S \in \mathbb{R}^{n \times p}$ be a model matrix, $\mu^T \in \mathbb{R}^{1 \times p}$ be a row vector for the $p$ column means of $M_S$, and $1_n$ be a length-$n$ column vector of all ones. Then the centered model matrix is $\tilde{M} = (M_S - 1_n \mu^T)$. Estimating OLS with $\tilde{M}$ and $y$ yields the estimator

$$\hat{\beta}_{\text{transformed}} = (\tilde{M}^T \tilde{M})^{-1} \tilde{M}^T y.$$

This can be done by materializing $\tilde{M}$ and passing it as input to a standard OLS solver, such as StatsModels (Seabold and Perktold 2010). However, materializing $\tilde{M}$ transforms the inputs from sparse to dense. Instead, we use a strategy that splits operations into sparse optimized operations, and dense operations that are easy to compute. This section describes the linear algebra operations for three different use cases, then section 4 describes the optimizations.

To better utilize the sparsity in $M_S$ we write the first expansion as

$$\hat{\beta}_{\text{transformed}} = \left( (M_S - 1_n \mu^T)^T (M_S - 1_n \mu^T) \right)^{-1} (M_S - 1_n \mu^T)^T y$$
$$= \left( M_S^T M_S - M_S^T 1_n \mu^T - \mu 1_n^T M_S + \mu 1_n^T 1_n \mu^T \right)^{-1} \left( M_S^T y - \mu 1_n^T y \right),$$

where $M_S^T M_S$ is a sparse optimized matrix multiplication, and $- M_S^T 1_n \mu^T - \mu 1_n^T M_S + \mu 1_n^T 1_n \mu^T$ is an easy to compute dense operation, discussed in section 4.
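To make the unweighted expansion concrete, note that $M_S^T 1_n = n\mu$, so the three dense correction terms collapse to the rank-one update $-n\mu\mu^T$. The sketch below is a minimal illustration of this expansion using SciPy sparse matrices; the function name `centered_ols` and the use of SciPy are our assumptions for illustration, not the paper's Eigen-based implementation.

```python
# Minimal sketch of the unweighted centered-OLS expansion with sparse inputs.
import numpy as np
import scipy.sparse as sp

def centered_ols(M_s, y):
    """Estimate OLS on centered inputs without materializing M_S - 1_n mu^T."""
    n = M_s.shape[0]
    mu = np.asarray(M_s.mean(axis=0)).ravel()      # column means, length p

    # Sparse-optimized Gram matrix M_S^T M_S plus the cheap dense correction.
    # Because M_S^T 1_n = n * mu, the correction terms collapse to -n * mu mu^T.
    gram = (M_s.T @ M_s).toarray() - n * np.outer(mu, mu)

    # M~^T y = M_S^T y - mu * (1_n^T y)
    rhs = M_s.T @ y - mu * y.sum()
    return np.linalg.solve(gram, rhs)

# Usage: agrees with a naive solve that materializes the dense centered matrix.
rng = np.random.default_rng(0)
M_s = sp.random(1000, 5, density=0.1, random_state=0, format="csr")
y = rng.normal(size=1000)
M_tilde = M_s.toarray() - M_s.toarray().mean(axis=0)
assert np.allclose(centered_ols(M_s, y), np.linalg.lstsq(M_tilde, y, rcond=None)[0])
```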
Suppose the regression is weighted by the vector $w$. Then centering refers to removing the weighted mean $1_n \mu_w^T$, where $\mu_w^T$ is the row vector of weighted column means of $M_S$ weighted by $w$. The weighted OLS problem becomes

$$\hat{\beta}_{\text{transformed}} = \left( (M_S - 1_n \mu_w^T)^T W (M_S - 1_n \mu_w^T) \right)^{-1} (M_S - 1_n \mu_w^T)^T W y$$
$$= \left( M_S^T W M_S - M_S^T W 1_n \mu_w^T - \mu_w 1_n^T W M_S + \mu_w 1_n^T W 1_n \mu_w^T \right)^{-1} \left( M_S^T W y - \mu_w 1_n^T W y \right).$$

In addition to centering, we may also wish to scale $M_S$ so that each column has weighted variance 1. For instance, the lasso (Tibshirani 1996) shrinks coefficients to zero and assumes the data has been centered and scaled. Let $\sigma_w^T$ be the row vector of weighted column standard deviations of $M_S$ weighted by $w$. Scaling $M_S$ to have weighted column variance 1 is equivalent to the operation $M_S^* = M_S \Sigma$ where $\Sigma = \operatorname{diag}(1/\sigma_{w,1}, \ldots, 1/\sigma_{w,p})$. The scaled and centered OLS estimator becomes

$$\hat{\beta}_{\text{transformed}} = \left( \left( (M_S - 1_n \mu_w^T)\Sigma \right)^T W \left( (M_S - 1_n \mu_w^T)\Sigma \right) \right)^{-1} \left( (M_S - 1_n \mu_w^T)\Sigma \right)^T W y$$
$$= \left( \Sigma M_S^T W M_S \Sigma - \Sigma M_S^T W 1_n \mu_w^T \Sigma - \Sigma \mu_w 1_n^T W M_S \Sigma + \Sigma \mu_w 1_n^T W 1_n \mu_w^T \Sigma \right)^{-1} \left( \Sigma M_S^T W y - \Sigma \mu_w 1_n^T W y \right)$$
$$= \left( M_S^{*T} W M_S^* - M_S^{*T} W 1_n \mu_w^T \Sigma - \Sigma \mu_w 1_n^T W M_S^* + \Sigma \mu_w 1_n^T W 1_n \mu_w^T \Sigma \right)^{-1} \left( M_S^{*T} W y - \Sigma \mu_w 1_n^T W y \right).$$

When computing homoskedastic covariances where $\operatorname{var}(\epsilon) = kI$, the covariance of $\hat{\beta}_{\text{transformed}}$ can be computed using the standard formula $\operatorname{Cov}(\hat{\beta}_{\text{transformed}}) = (\tilde{M}^T \tilde{M})^{-1} \hat{k}$, where $\hat{k} = \frac{1}{n - p} \sum_i (y_i - \hat{y}_i)^2$. For efficient computation, we reuse the expansion

$$(\tilde{M}^T \tilde{M})^{-1} = \left( \left( (M_S - 1_n \mu_w^T)\Sigma \right)^T W \left( (M_S - 1_n \mu_w^T)\Sigma \right) \right)^{-1} = \left( M_S^{*T} W M_S^* - M_S^{*T} W 1_n \mu_w^T \Sigma - \Sigma \mu_w 1_n^T W M_S^* + \Sigma \mu_w 1_n^T W 1_n \mu_w^T \Sigma \right)^{-1}.$$

Likewise, for heteroskedasticity-consistent covariances (White et al. 1980) we expand $(\tilde{M}^T \tilde{M})^{-1} (\tilde{M}^T \operatorname{diag}(\epsilon^2) \tilde{M}) (\tilde{M}^T \tilde{M})^{-1}$ by replacing $W$ with a diagonal matrix having diagonal entries $w_i \epsilon_i^2$.

Predictions on $M_S$

Fitting OLS on centered and scaled inputs with centers $\mu_w$ and standard deviations $\sigma_w$ yields coefficients that can be used to create predictions for the centered and scaled data, such as $\hat{y} = (M_S - 1_n \mu_w^T) \Sigma \hat{\beta}_{\text{transformed}}$. However, in practice we will receive new data in the form of $M_S$, not $(M_S - 1_n \mu_w^T)\Sigma$. To predict directly on $M_S$, we compute a $\hat{\beta}_{\text{original}}$ for the original scale of the data, where $\hat{\beta}_{\text{original}} = (I_{p \times p} - 1_p^{(1)} \mu_w^T) \Sigma \hat{\beta}_{\text{transformed}}$, and $1_p^{(1)}$ is a length-$p$ column vector with value 1 located at index 1 and 0 everywhere else.

Optimizations

There are four key terms throughout the OLS expansions that can be computed efficiently; a sketch combining them appears after this list.

1. $M_S^{*T} W M_S^*$. This product is symmetric, so only half of it needs to be computed. It can be optimized using sparse matrix multiplications, which are implemented in Eigen (Guennebaud, Jacob, et al. 2010).
2. $M_S^{*T} W 1_n \mu_w^T \Sigma$. The multiplication can be ordered specifically as $(M_S^{*T} W 1_n)(\mu_w^T \Sigma)$. $(M_S^{*T} W 1_n)$ reduces to weighted column sums of $M_S^*$ weighted by $w$. $(\mu_w^T \Sigma)$ reduces to elementwise multiplication between the length-$p$ vectors $\mu_w$ and $1/\sigma_w$.
3. $\Sigma \mu_w 1_n^T W 1_n \mu_w^T \Sigma$. This product is again symmetric. The multiplication can be ordered specifically as $(\Sigma \mu_w)(1_n^T W 1_n)(\mu_w^T \Sigma)$. From 2), $\Sigma \mu_w$ is computed using elementwise multiplication. $(1_n^T W 1_n)$ is the sum of the weights vector, $w$.
4. $W y$ and $1_n^T W y$. $W y$ is elementwise multiplication of the length-$n$ vectors $w$ and $y$. $1_n^T W y$ sums that elementwise multiplication.
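As an illustration of the ordering of operations, the sketch below assembles $\tilde{M}^T W \tilde{M}$ and $\tilde{M}^T W y$ from the four terms above using SciPy. The function name, the uncorrected weighted-variance convention, and the assumption that no column has zero weighted variance are ours; the paper's Eigen-based implementation is not reproduced here.

```python
# Sketch of the four optimized terms for the weighted, scaled expansion.
import numpy as np
import scipy.sparse as sp

def weighted_centered_scaled_ols(M_s, y, w):
    """Solve weighted OLS on centered, variance-scaled inputs without forming
    the dense matrix (M_S - 1_n mu_w^T) Sigma."""
    sw = w.sum()                                    # 1_n^T W 1_n
    mu_w = (M_s.T @ w) / sw                         # weighted column means
    var_w = (M_s.power(2).T @ w) / sw - mu_w**2     # weighted column variances
    inv_sigma = 1.0 / np.sqrt(var_w)                # diagonal of Sigma (assumes var_w > 0)

    M_star = M_s @ sp.diags(inv_sigma)              # M* = M_S Sigma, still sparse
    mu_scaled = mu_w * inv_sigma                    # mu_w^T Sigma as a length-p vector

    # Term 1: sparse-optimized, symmetric M*^T W M*
    term1 = (M_star.T @ sp.diags(w) @ M_star).toarray()
    # Term 2: (M*^T W 1_n)(mu_w^T Sigma), an outer product of two length-p vectors
    term2 = np.outer(M_star.T @ w, mu_scaled)
    # Term 3: (Sigma mu_w)(1_n^T W 1_n)(mu_w^T Sigma), a symmetric rank-one term
    term3 = sw * np.outer(mu_scaled, mu_scaled)

    gram = term1 - term2 - term2.T + term3          # equals M~^T W M~
    rhs = M_star.T @ (w * y) - mu_scaled * (w @ y)  # M*^T W y - Sigma mu_w 1_n^T W y
    return np.linalg.solve(gram, rhs)
```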
In general, the runtime complexity for estimating OLS is $O(np^2 + p^3)$. When $n > p$, it is dominated by constructing $M^T M$; otherwise it is dominated by the pseudoinverse for $(M^T M)^{-1}$. Despite requesting centered inputs, these linear algebra optimizations allow us to form $M_S^{*T} M_S^*$ using sparse inputs, which combined with simple dense operations can construct $\tilde{M}^T \tilde{M}$ efficiently. However, we cannot invert $\tilde{M}^T \tilde{M}$ efficiently without using dense algebra.

For large $n$, the memory requirements to materialize $\tilde{M}$ can be very large. Using the optimized operations, we only materialize $\tilde{M}^T \tilde{M}$, without materializing $\tilde{M}$. The memory requirement to materialize $(M_S, W, \mu_w^T, \sigma_w^T)$, which are the components for the OLS expansions, is $O(np \cdot \text{density} + n)$. Storing $\tilde{M}^T \tilde{M}$ only requires $O(p^2)$ memory.

Performance

In this section we compare the performance of the efficient solver against that of the naive solver, $\hat{\beta}_{\text{transformed}} = (\tilde{M}^T \tilde{M})^{-1} \tilde{M}^T y$, which materializes the dense $\tilde{M}$ matrix. The experiment simulates sparse model matrices with density rates of 0.01, 0.05, 0.1, 0.15, 0.2, and 0.25, with $p = 100$ features. We also vary the sample size from 0.1 million observations to 10 million observations. Results are displayed in the figure below. When data is large and sparse, the efficient solver estimates OLS much faster than the naive solver, up to 35 times as fast. Similarly, the memory used by the efficient solver is smaller than that of the naive solver by a factor of $1/\text{density}$.

[Figure: (a) Time to estimate centered regression for various sample sizes and density rates. (b) Memory usage for various sample sizes and density rates.]

Conclusion

Centered regression has benefits in interpretability and convergence, though centering the features can lead to a computationally expensive regression with dense inputs. We described a computational strategy that expands the standard OLS estimator to take advantage of sparse data structures while still centering the inputs. The resulting implementation resolves the conflict between the desirability of centered regression and the performance benefits of sparse data.
References

[AWR91] Leona S Aiken, Stephen G West, and Raymond R Reno. Multiple regression: Testing and interpreting interactions. Sage, 1991.
[G+10] Gaël Guennebaud, Benoît Jacob, et al. Eigen v3. http://eigen.tuxfamily.org. 2010.
[SP10] Skipper Seabold and Josef Perktold. "statsmodels: Econometric and statistical modeling with python". In: Proceedings of the 9th Python in Science Conference. 2010.
[Tib96] Robert Tibshirani. "Regression shrinkage and selection via the lasso". In: Journal of the Royal Statistical Society: Series B (Methodological) (1996).
[Whi80] Halbert White. "A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity". In: Econometrica (1980).
[WLW19] Wong, Lewis, and Wardrop. arXiv preprint arXiv:1910.01305. 2019.