Fast generalised linear models by database sampling and one-step polishing
Thomas Lumley
Department of Statistics, University of Auckland

March 15, 2018
Abstract
In this note, I show how to fit a generalised linear model to $N$ observations on $p$ variables stored in a relational database, using one sampling query and one aggregation query, as long as $N^{1/2+\delta}$ observations can be stored in memory. The resulting estimator is fully efficient and asymptotically equivalent to the maximum likelihood estimator, and so its variance can be estimated from the Fisher information in the usual way. A proof-of-concept implementation uses R with MonetDB and with SQLite, and could easily be adapted to other popular databases. I illustrate the approach with examples of taxi-trip data in New York City and factors related to car colour in New Zealand.

Keywords: big data, maximum likelihood, Fisher scoring

1 Introduction
Generalised linear models became one of the basic tools of statistical modelling in part because the Newton-Raphson/Fisher scoring algorithm is easy to implement and behaves well. In this note, I take advantage of the simple form and good behaviour of the algorithm to propose an implementation for large data sets based on in-core fitting to a small subsample followed by a SQL aggregation query. I present an implementation in R (R Core Team, 2017) using MonetDB, a column-store database designed for scientific computing (Muehleisen et al., 2017), and a second implementation using SQLite. The R packages for both these systems allow for zero-configuration database setup and neither requires interprocess communication; they differ in that
MonetDBLite directly accesses the R heap and so requires less data copying, and also has an efficient primitive for random subsampling.

The algorithm has two steps; a minimal sketch in R follows the list:

1. Extract a random subsample of the data into R and compute the maximum likelihood estimator $\tilde\beta$.

2. Evaluate the score function at $\tilde\beta$ on the full data set, and perform a one-step Fisher scoring update.
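As a minimal illustration of these two steps, here is a self-contained sketch with simulated data and an in-memory logistic regression; it is not the dbglm implementation, just the same arithmetic without the database.

    ## Step 0: simulated data for a logistic regression with one predictor
    set.seed(1)
    N <- 1e6
    x <- rnorm(N)
    y <- rbinom(N, 1, plogis(-1 + 0.5 * x))
    X <- cbind(1, x)                       # design matrix with intercept

    ## Step 1: maximum likelihood on a subsample of size about N^(5/9)
    n <- ceiling(N^(5/9))
    s <- sample(N, n)
    beta0 <- coef(glm(y[s] ~ x[s], family = binomial))

    ## Step 2: score and expected information at beta0 on the full data,
    ## then one Fisher scoring step (w_i = 1 for the canonical logit link)
    mu    <- plogis(drop(X %*% beta0))     # fitted means at beta0
    U     <- crossprod(X, y - mu)          # score vector
    v     <- mu * (1 - mu)                 # variance function
    info  <- crossprod(X, v * X)           # expected Fisher information
    beta1 <- drop(beta0 + solve(info, U))  # one-step polished estimator

    ## beta1 agrees with coef(glm(y ~ x, family = binomial)) to o_p(N^{-1/2});
    ## standard errors come from the information: sqrt(diag(solve(info)))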
In Section 2 I describe the computation and the theory behind it in more detail, explaining why the one-step estimator is fully efficient. Section 4 gives two examples of the use of the method, with comparison to other approaches. Section 5 discusses possible extensions beyond simple random sampling and generalised linear models.

2 Theory and methods

Suppose we have a sample of size $N$ and want to estimate a finite-dimensional parameter $\beta$ in a generalised linear model, and that a starting estimator $\tilde\beta$ is available. It is well known that when $\tilde\beta$ is $\sqrt{N}$-consistent, one step of the Newton-Raphson or Fisher scoring algorithm will give an efficient estimator, one that differs by $o_p(N^{-1/2})$ from the maximum likelihood estimator. It is less well known that the same is true for a wide range of models when $\tilde\beta - \beta_0 = o_p(N^{-1/4})$ (Cheng, 2013), and in particular for many generalised linear models. The Appendix gives a proof of the latter case, based on the proof for robust linear-regression $M$-estimators by Simpson et al. (1992); the key step in the proof is that a first-order Taylor expansion of the score has remainder term $O_p(N\|\tilde\beta-\beta_0\|^2)$, not just $o_p(N\|\tilde\beta-\beta_0\|)$.

I take advantage of this result by using the maximum likelihood estimator in a random sample of size $n = N^{5/9}$ as a starting value. With this exponent, when $N$ is one billion, $n$ is only 100,000, and a subsample of size $n$ can easily be handled in memory. It is necessary to be able to take a random sample: in my implementation, I use MonetDB's SAMPLE n qualifier to the SELECT keyword for returning random subsamples, making this step easy. In many other systems it is possible to use a qualifier of the form WHERE RAND() < k or WHERE RANDOM() < k, with k chosen to give the correct sampling probability. The SQLite implementation uses the latter approach; the resulting subsample size $n$ will be random, but with standard deviation small enough to still ensure $\tilde\beta - \beta_0 = o_p(N^{-1/4})$.

The second key point is that in a generalised linear model the likelihood score is of the form
$$U(\beta) = \sum_{i=1}^{N} x_i\, w_i(\beta)\,(y_i - \mu_i(\beta)),$$
with $w_i = 1$ for the most widely used models (those with the canonical link). For all the commonly-used choices of link and variance function this can be computed with a single SQL query, as long as the exponentiation function is available. Although not part of the SQL standard, exponentiation is widely supported in relational databases, either as a built-in function or as an add-on.

It is also feasible to compute the expected Fisher information matrix with a single SQL query involving $\binom{p+1}{2}$ terms, since it is of the form
$$I(\beta) = \sum_{i=1}^{N} v_i(\beta)\, x_i x_i^T$$
for a scalar $v_i(\cdot)$. I will write $\hat\beta$ for the one-step estimator using $I_n(\tilde\beta)$ computed from the subsample, and $\hat\beta_{\mathrm{full}}$ for the one-step estimator using $I_N(\tilde\beta)$ from the whole sample. It turns out that $\hat\beta_{\mathrm{full}}$ and $\hat\beta$ are asymptotically equivalent, as demonstrated theoretically in the Appendix and empirically in Section 4, so that computing the information on the full sample is not necessary.

3 Implementation

Proof-of-concept code in the form of an R package is at github.com/tslumley/dbglm. The tidypredict (Ruiz, 2018) and dbplyr (Wickham, 2017) packages are used to simplify the second, aggregation query, but the sampling query, which is not supported by dbplyr, is written directly in SQL using the DBI interface (R Special Interest Group on Databases (R-SIG-DB) et al., 2017).
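To make the database version concrete, here is an independent sketch (not the dbglm code) of the two queries for a logistic regression of a 0/1 outcome y on a single predictor x, run against SQLite. The table and file names are hypothetical; the scaled-modulus test is one way to get a sampling fraction from SQLite's random(), which returns a signed 64-bit integer rather than a uniform fraction; and RSQLite's extension loader is used in case exp() is not built in.

    library(DBI)
    con <- DBI::dbConnect(RSQLite::SQLite(), "vehicles.db")  # hypothetical file
    RSQLite::initExtension(con)  # loads math functions such as exp(), if needed

    ## Query 1: approximate Bernoulli subsample with sampling fraction frac
    frac <- 1e5 / 1e9            # e.g. n = 100,000 out of N = one billion
    samp <- DBI::dbGetQuery(con, sprintf(
      "SELECT y, x FROM vehicles WHERE abs(random() %% 1000000) < %d",
      as.integer(round(frac * 1e6))))
    beta0 <- coef(glm(y ~ x, family = binomial, data = samp))

    ## Query 2: one aggregation pass for the score and information at beta0,
    ## with mu = 1/(1 + exp(-(b0 + b1 x))) written out in SQL
    mu <- sprintf("(1.0/(1.0 + exp(-(%.15g + %.15g * x))))", beta0[1], beta0[2])
    agg <- DBI::dbGetQuery(con, sprintf(
      "SELECT sum(y - %s)                AS u0,
              sum((y - %s) * x)          AS u1,
              sum(%s * (1 - %s))         AS i00,
              sum(%s * (1 - %s) * x)     AS i01,
              sum(%s * (1 - %s) * x * x) AS i11
         FROM vehicles", mu, mu, mu, mu, mu, mu, mu, mu))

    ## One-step Fisher scoring update from the aggregates
    info  <- matrix(c(agg$i00, agg$i01, agg$i01, agg$i11), 2, 2)
    beta1 <- beta0 + solve(info, c(agg$u0, agg$u1))

With MonetDB the sampling query would instead use the SAMPLE qualifier, and with $p$ predictors the aggregation grows to $p$ score terms and $\binom{p+1}{2}$ information terms.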
The code has certain limitations for speed and simplicity. First, indicator variables will be defined to represent categorical variables, but only for those levels that are present in the subsample; other values will effectively be combined with the reference category. Second, because different databases use different syntax for pseudorandom number generation, the code works only with the MonetDBLite and RSQLite database connectors. It should be straightforward to add other databases. Third, the code supports only a specific list of generalised linear models and does not allow user-defined link and variance functions. Adding a specific additional link/variance combination that uses only functions supported by the database should be straightforward. Fourth, the code does not support transformations. Variables corresponding to transformations need to be created before dbglm is called.
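For example, a log-transformed predictor could be materialised ahead of time with dbplyr; a sketch, assuming an open DBI connection con and an illustrative table name (log() is translated to the corresponding SQL function by the backend):

    library(dplyr)
    library(dbplyr)
    vehicles <- tbl(con, "vehicles") %>%
      mutate(log_mass = log(gross_vehicle_mass)) %>%  # evaluated in SQL, not R
      compute(name = "vehicles_log")  # materialise the new column in the database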
4 Examples

All analyses were conducted on a MacBook Air laptop with an Intel i7-4650U processor at 1.7GHz, 8GiB of memory, and a solid-state drive, running OS X 10.13.2 and using R version 3.4.0. Instructions for obtaining the data, and scripts for analysing it, are in the dbglm package, which can be obtained from https://github.com/tslumley/dbglm.

The New Zealand Transport Agency maintains a list of all vehicles (currently licensed or not) in the country and their owners. Information on the vehicles, the "Fleet Vehicle Statistic", is available for download. The version used here is current at 2017-11-30 and was downloaded on 2017-12-14. There are 5 million records; I selected the 3.3 million "PASSENGER CAR/VAN" records. The recorded number of seats varies from 0 to 999 and the recorded number of axles from 0 to 9; I kept only those with 2-6 seats and 2 axles, leaving 1.7 million records.
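The record selection can be done in the database; a sketch with dbplyr, where the table name, the vehicle-type column, and the axle-count column are assumptions about the NZTA schema rather than its documented names:

    library(dplyr)
    cars <- tbl(con, "nzta_fleet") %>%        # hypothetical table name
      filter(vehicle_type == "PASSENGER CAR/VAN",
             between(number_of_seats, 2, 6),  # keep 2-6 seats
             number_of_axles == 2)            # keep 2 axles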
I fitted a logistic regression model with outcome variable 1 if the basic_colour variable was RED and 0 otherwise, using power_rating, number_of_seats, and gross_vehicle_mass as predictors. Table 1 shows the results from three realisations of the one-step estimator and from a fit to the whole data.

Table 1: Log odds ratios from logistic regression predicting red colour from vehicle characteristics; data from New Zealand Transport Agency. Analysis of all N = 1726134 records and three realisations of a one-step estimator starting with a subsample of 2917 records.

                              Intercept   Power/1000   Seats    Mass/1000
  Full data    $\hat\beta$      -1.04       3.12      -0.149     -0.302
               SE × 100          2.9        3.3        0.59       0.42
  Replicate 1  $\hat\beta$      -0.99       3.10      -0.160     -0.296
               SE × 100          2.4        3.3        0.49       0.46
  Replicate 2  $\hat\beta$      -1.06       3.05      -0.143     -0.301
               SE × 100          2.8        3.5        0.58       0.41
  Replicate 3  $\hat\beta$      -1.01       3.08      -0.153     -0.302
               SE × 100          2.7        3.3        0.54       0.42

We see that cars with higher power are more likely to be red, and that this is not due to red cars being bigger; they have lower mass and fewer seats on average.

The one-step estimator took between 1.3 and 1.4 seconds using MonetDBLite and 3.1–3.4s using RSQLite. The full maximum likelihood estimator is feasible with a data set of this size; it took 7.6 seconds, not including time for data transfer from the database. The bigglm function from the biglm package (Lumley, 2013), which computes the full maximum likelihood estimator by reading the data set in chunks within each iteration, took 15s using the MonetDBLite package and 26s using RSQLite.

In response to Freedom of Information requests, the New York City Taxi Commission has provided data on taxi trips in New York. Here, I analyse the data from 2016 for traditional 'yellow cabs' that are authorised to pick up passengers from the street anywhere in the city. The data were downloaded on 2017-12-14 and -15 from https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-XX.csv, with
XX = 01 to 12. There are 131 million records. I filtered the data to only the trips paid by credit card (because these have tip information), and excluded trips of over 50 miles, leaving 86 million records. I defined the response variable bad_tip to indicate a tip of less than 20%, and defined night to be from 8pm to 4am, and weekend to be from 8pm Friday until midnight Sunday; a sketch of these derived variables follows.
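These derived variables can be constructed inside the database. Below is a sketch using dbplyr, assuming a DBI connection con, an illustrative table name, the standard TLC column names, and the TLC coding in which payment_type 1 is a credit-card payment; strftime() is SQLite-specific and is passed through to SQL untranslated.

    library(dplyr)
    library(dbplyr)
    trips <- tbl(con, "yellow_2016") %>%
      filter(payment_type == 1,       # credit-card trips have tip information
             trip_distance <= 50) %>% # drop trips over 50 miles
      mutate(
        bad_tip = as.integer(tip_amount < 0.2 * fare_amount),
        hour    = as.integer(strftime("%H", tpep_pickup_datetime)),
        dow     = as.integer(strftime("%w", tpep_pickup_datetime)),  # 0 = Sunday
        night   = as.integer(hour >= 20 | hour < 4),                 # 8pm-4am
        weekend = as.integer(dow %in% c(0, 6) | (dow == 5 & hour >= 20))
      )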
Table 2 compares the one-step approach to using bigglm with MonetDBLite. A low tip is less likely at night, and for airport trips. Long trips are more likely to attract a low tip, but the relationship is not strong. There is little difference between weekend and weekday trips, or with number of passengers. A low tip is much more likely for 'negotiated rate' trips: presumably either the rate includes the tip or the customer believes it should. In this relatively large data set the agreement between the estimators is mostly good, though the coefficient for rate code 3, 'Newark', does differ.

Table 2: Tipping in NY taxis: logistic regression model for odds of a <20% tip.

                                          One-step                    Full
                                    $\hat\beta$ (95% CI)       $\hat\beta$ (95% CI)
  (Intercept)                    -0.815 (-0.812, -0.808)   -0.814 (-0.817, -0.810)
  weekend                         0.029 ( 0.036,  0.043)    0.026 ( 0.019,  0.033)
  night                          -0.209 (-0.206, -0.203)   -0.205 (-0.208, -0.201)
  weekend × night                -0.005 ( 0.002,  0.009)    0.012 ( 0.005,  0.019)
  passenger count                -0.008 (-0.007, -0.005)   -0.009 (-0.010, -0.007)
  weekend × passenger count      -0.021 (-0.017, -0.014)   -0.002 (-0.006,  0.001)
  night × passenger count        -0.007 (-0.005, -0.004)   -0.004 (-0.006, -0.002)
  weekend × night × pass. count   0.014 ( 0.018,  0.021)    0.003 (-0.000,  0.007)
  trip distance                   0.033 ( 0.033,  0.033)    0.033 ( 0.033,  0.034)
  Compared to standard rate
    To/from JFK                  -0.266 (-0.262, -0.259)   -0.268 (-0.272, -0.264)
    To/from Newark               -0.138 (-0.130, -0.121)   -0.199 (-0.209, -0.189)
    Nassau & Westchester          0.002 ( 0.021,  0.040)   -0.095 (-0.117, -0.073)
    Negotiated rate               1.406 ( 1.413,  1.421)    1.401 ( 1.393,  1.409)
  Computation time               cpu: 430s, elapsed: 340s  cpu: 890s, elapsed: 920s

5 Discussion

If the data can be regarded as in random order, so that the first $n$ observations can be used as the subsample, a true one-pass implementation is possible; however, it is important that the subsample truly be representative.

In a cluster or cloud context it would clearly be straightforward to parallelise both database queries. Parallelising the aggregation query is likely to be helpful; whether parallelising the sampling query is helpful depends on whether the database already has efficient sequential sampling algorithms.

[…] $n$ might have been desirable; if data had to come across a network connection from a separate database server it would be desirable to keep $n$ as small as possible.

The same general approach can be used for other regression models fitted by maximum likelihood, such as those considered by Yee and Wild (1996), as long as the derivative of the log likelihood with respect to the linear predictors can be expressed in closed form in SQL. For example, it would be feasible to fit a Weibull model, or a zero-inflated Poisson, but fitting the overdispersion parameter of a negative binomial or the degrees of freedom of a Student's $t$ model would not be feasible, since the score for these models involves the digamma function. Even when the computations are feasible it may sometimes be necessary to use either two iterations or a larger subsample for some models; Cheng (2013) shows that a starting estimator with $o_p(N^{-1/4})$ error is in general required for a single iteration to give an efficient estimator, so $n$ larger than $N^{1/2}$ would be needed.

If efficient sampling on the outcome variable or predictors is possible, it may be possible to use a smaller subsample. For example, if $Y$ is a rare outcome with $M \ll N$ observations having $Y = 1$, a sample of $M^{1/2+\delta}$ cases and, say, $5M^{1/2+\delta}$ controls will suffice for the subsample; a sampling sketch follows. Whether this or more sophisticated two-phase sampling strategies are useful in practice will depend on details of the database setup, such as the cost of constructing new indexes.
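A sketch of the case-control subsample as a single query, using the same scaled-modulus trick as before; the table and column names are hypothetical. One caveat the note does not spell out: for logistic regression the unequal sampling fractions shift the subsample intercept by log(f1/f0), which should be removed before the polishing step.

    ## f1, f0: sampling fractions for cases (y = 1) and controls (y = 0),
    ## chosen to keep about M^(1/2+delta) cases and 5 M^(1/2+delta) controls
    samp <- DBI::dbGetQuery(con, sprintf(
      "SELECT * FROM tab WHERE y = 1 AND abs(random() %% 1000000) < %d
       UNION ALL
       SELECT * FROM tab WHERE y = 0 AND abs(random() %% 1000000) < %d",
      as.integer(round(f1 * 1e6)), as.integer(round(f0 * 1e6))))
    beta0 <- coef(glm(y ~ x, family = binomial, data = samp))
    beta0[1] <- beta0[1] - log(f1 / f0)  # undo the case-control intercept shift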
A Theorem and proof

Suppose we are fitting a generalised linear model with regression parameters $\beta$, outcome $Y$, and predictors $X$. Let $\beta_0$ be the true value of $\beta$. Assume the second partial derivatives of the loglikelihood have uniformly bounded second moments on a compact neighbourhood $K$ of $\beta_0$. Let $U_N(\beta)$ be the score evaluated at $\beta$ on $N$ observations and $I_N(\beta)$ be the expected Fisher information evaluated at $\beta$ on $N$ observations. Let $\Delta$ be the tensor of third partial derivatives of the log likelihood, and assume its elements
$$(\Delta)_{ijk} = \frac{\partial^3}{\partial\beta_i\,\partial\beta_j\,\partial\beta_k}\,\log\ell(Y; X, \beta)$$
have uniformly bounded second moments on $K$.

Theorem 1.
Let $n = N^{1/2+\delta}$ for some $\delta \in (0, 1/2]$, and let $\tilde\beta$ be the maximum likelihood estimator of $\beta$ on a subsample of size $n$. Define $\tilde I(\tilde\beta) = \frac{N}{n}\, I_n(\tilde\beta)$, the estimated full-sample information based on the subsample. The one-step estimators
$$\hat\beta_{\mathrm{full}} = \tilde\beta + I_N^{-1}(\tilde\beta)\, U_N(\tilde\beta)$$
and
$$\hat\beta = \tilde\beta + \tilde I^{-1}(\tilde\beta)\, U_N(\tilde\beta)$$
are first-order efficient.

Proof:
The score function at the true parameter value is
$$U_N(\beta_0) = \sum_{i=1}^{N} x_i\, w_i(\beta_0)\,(y_i - \mu_i(\beta_0)).$$
By the mean-value form of Taylor's theorem we have
$$U_N(\beta_0) = U_N(\tilde\beta) + I_N(\tilde\beta)(\tilde\beta - \beta_0) + \Delta(\beta^*)(\tilde\beta - \beta_0,\, \tilde\beta - \beta_0),$$
where $\beta^*$ is on the interval between $\tilde\beta$ and $\beta_0$. With probability 1, $\tilde\beta$ and thus $\beta^*$ is in $K$ for all sufficiently large $n$, so, since $\tilde\beta - \beta_0 = O_p(n^{-1/2})$, the remainder term is $O_p(N\, n^{-1/2}\, n^{-1/2}) = o_p(N^{1/2})$. Thus
$$I_N^{-1}(\tilde\beta)\, U_N(\beta_0) = I_N^{-1}(\tilde\beta)\, U_N(\tilde\beta) + \tilde\beta - \beta_0 + o_p(N^{-1/2}).$$

Let $\hat\beta_{\mathrm{MLE}}$ be the maximum likelihood estimator. It is a standard result that
$$\hat\beta_{\mathrm{MLE}} = \beta_0 + I_N^{-1}(\beta_0)\, U_N(\beta_0) + o_p(N^{-1/2}),$$
so
$$\hat\beta_{\mathrm{MLE}} = \tilde\beta + I_N^{-1}(\tilde\beta)\, U_N(\tilde\beta) + o_p(N^{-1/2}) = \hat\beta_{\mathrm{full}} + o_p(N^{-1/2}).$$

Now let $I(\tilde\beta) = E_{X,Y}[N^{-1} I_N]$ be the expected per-observation information. By the Central Limit Theorem we have
$$I_N(\tilde\beta) = I_n(\tilde\beta) + (N-n)\, I(\tilde\beta) + O_p\!\left((N-n)^{1/2}\right),$$
so
$$\left(\frac{N}{n}\, I_n(\tilde\beta)\right)^{-1} I_N(\tilde\beta) = \mathrm{Id}_p + O_p(n^{-1/2}),$$
where $\mathrm{Id}_p$ is the $p \times p$ identity matrix. We have
$$\hat\beta - \tilde\beta = \left(\frac{N}{n}\, I_n(\tilde\beta)\right)^{-1} I_N(\tilde\beta)\,(\hat\beta_{\mathrm{full}} - \tilde\beta) = \left(\mathrm{Id}_p + O_p(n^{-1/2})\right)(\hat\beta_{\mathrm{full}} - \tilde\beta) = (\hat\beta_{\mathrm{full}} - \tilde\beta) + O_p(n^{-1}) = (\hat\beta_{\mathrm{full}} - \tilde\beta) + o_p(N^{-1/2}),$$
so $\hat\beta$ is also asymptotically efficient.

References
Cheng, G. (2013). How many iterations are sufficient for efficient semiparametric estimation? Scandinavian Journal of Statistics, 40:592-618.

Lumley, T. (2013). biglm: bounded memory linear and generalized linear models. R package version 0.9-1.

Muehleisen, H., Damico, A., Raasveldt, M., Lumley, T., MonetDB B.V., CWI, The Regents of the University of California, Kungliga Tekniska Hogskolan, and Free Software Foundation, Inc. (2017). MonetDBLite: In-Process Version of 'MonetDB'. R package version 0.5.1.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

R Special Interest Group on Databases (R-SIG-DB), Wickham, H., and Müller, K. (2017). DBI: R Database Interface. R package version 0.7.

Ruiz, E. (2018). tidypredict: Run Predictions Inside the Database. R package version 0.2.0.

Simpson, D. G., Ruppert, D., and Carroll, R. J. (1992). On one-step GM estimates and stability of inferences in linear regression. Journal of the American Statistical Association, 87:439-450.

Wickham, H. (2017). dbplyr: A 'dplyr' Back End for Databases. R package version 1.1.0.

Yee, T. W. and Wild, C. J. (1996). Vector generalized additive models. Journal of the Royal Statistical Society, Series B, 58:481-493.