Fast generalised linear models by database sampling and one-step polishing
Thomas Lumley
Department of Statistics, University of Auckland

March 15, 2018
Abstract
In this note, I show how to fit a generalised linear model to $N$ observations on $p$ variables stored in a relational database, using one sampling query and one aggregation query, as long as $N^{1/2+\delta}$ observations can be stored in memory. The resulting estimator is fully efficient and asymptotically equivalent to the maximum likelihood estimator, and so its variance can be estimated from the Fisher information in the usual way. A proof-of-concept implementation uses R with MonetDB and with SQLite, and could easily be adapted to other popular databases. I illustrate the approach with examples of taxi-trip data in New York City and factors related to car colour in New Zealand.

Keywords: big data, maximum likelihood, Fisher scoring

1 Introduction
Generalised linear models became one of the basic tools of statistical modelling in part because the Newton-Raphson/Fisher scoring algorithm is easy to implement and behaves well. In this note, I take advantage of the simple form and good behaviour of the algorithm to propose an implementation for large data sets based on in-core fitting to a small subsample followed by a SQL aggregation query. I present an implementation in R (R Core Team, 2017) using MonetDB, a column-store database designed for scientific computing (Muehleisen et al., 2017), and a second implementation using SQLite. The R packages for both these systems allow for zero-configuration database setup and neither requires interprocess communication; they differ in that
MonetDBLite directly accesses the R heap and so requires less data copying, and also has an efficient primitive for random subsampling.

The algorithm has two steps; a minimal sketch in R follows the list:

1. Extract a random subsample of the data into R and compute the maximum likelihood estimator $\tilde\beta$.

2. Evaluate the score function at $\tilde\beta$ on the full data set, and perform a one-step Fisher scoring update.
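As a minimal illustration of these two steps, here is a self-contained sketch with simulated data and an in-memory logistic regression; it is not the dbglm implementation, just the same arithmetic without the database.

    ## Step 0: simulated data for a logistic regression with one predictor
    set.seed(1)
    N <- 1e6
    x <- rnorm(N)
    y <- rbinom(N, 1, plogis(-1 + 0.5 * x))
    X <- cbind(1, x)                       # design matrix with intercept

    ## Step 1: maximum likelihood on a subsample of size about N^(5/9)
    n <- ceiling(N^(5/9))
    s <- sample(N, n)
    beta0 <- coef(glm(y[s] ~ x[s], family = binomial))

    ## Step 2: score and expected information at beta0 on the full data,
    ## then one Fisher scoring step (w_i = 1 for the canonical logit link)
    mu    <- plogis(drop(X %*% beta0))     # fitted means at beta0
    U     <- crossprod(X, y - mu)          # score vector
    v     <- mu * (1 - mu)                 # variance function
    info  <- crossprod(X, v * X)           # expected Fisher information
    beta1 <- drop(beta0 + solve(info, U))  # one-step polished estimator

    ## beta1 agrees with coef(glm(y ~ x, family = binomial)) to o_p(N^{-1/2});
    ## standard errors come from the information: sqrt(diag(solve(info)))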
In Section 2 I describe the computation and the theory behind it in more detail, explaining why the one-step estimator is fully efficient. Section 4 gives two examples of the use of the method, with comparison to other approaches. Section 5 discusses possible extensions beyond simple random sampling and generalised linear models.

2 Theory and methods

Suppose we have a sample of size $N$ and want to estimate a finite-dimensional parameter $\beta$ in a generalised linear model, and that a starting estimator $\tilde\beta$ is available. It is well known that when $\tilde\beta$ is $\sqrt{N}$-consistent, one step of the Newton-Raphson or Fisher scoring algorithm will give an efficient estimator, one that differs by $o_p(N^{-1/2})$ from the maximum likelihood estimator. It is less well known that the same is true for a wide range of models when $\tilde\beta - \beta_0 = o_p(N^{-1/4})$ (Cheng, 2013), and in particular for many generalised linear models. The Appendix gives a proof of the latter case, based on the proof for robust linear-regression $M$-estimators by Simpson et al. (1992); the key step in the proof is that a first-order Taylor expansion of the score has remainder term $O_p(N\|\tilde\beta-\beta_0\|^2)$, not just $o_p(N\|\tilde\beta-\beta_0\|)$.

I take advantage of this result by using the maximum likelihood estimator in a random sample of size $n = N^{5/9}$ as a starting value. With this exponent, when $N$ is one billion, $n$ is only 100,000, and a subsample of size $n$ can easily be handled in memory. It is necessary to be able to take a random sample: in my implementation, I use MonetDB's SAMPLE n qualifier to the SELECT keyword for returning random subsamples, making this step easy. In many other systems it is possible to use a qualifier of the form WHERE RAND() < k or WHERE RANDOM() < k, with k chosen to give the correct sampling probability. The SQLite implementation uses the latter approach; the resulting subsample size $n$ will be random, but with standard deviation small enough to still ensure $\tilde\beta - \beta_0 = o_p(N^{-1/4})$.

The second key point is that in a generalised linear model the likelihood score is of the form
$$U(\beta) = \sum_{i=1}^{N} x_i\, w_i(\beta)\,(y_i - \mu_i(\beta)),$$
with $w_i = 1$ for the most widely used models (those with the canonical link). For all the commonly-used choices of link and variance function this can be computed with a single SQL query, as long as the exponentiation function is available. Although not part of the SQL standard, exponentiation is widely supported in relational databases, either as a built-in function or as an add-on.

It is also feasible to compute the expected Fisher information matrix with a single SQL query involving $\binom{p+1}{2}$ terms, since it is of the form
$$I(\beta) = \sum_{i=1}^{N} v_i(\beta)\, x_i x_i^T$$
for a scalar $v_i(\cdot)$. I will write $\hat\beta$ for the one-step estimator using $I_n(\tilde\beta)$ computed from the subsample, and $\hat\beta_{\mathrm{full}}$ for the one-step estimator using $I_N(\tilde\beta)$ from the whole sample. It turns out that $\hat\beta_{\mathrm{full}}$ and $\hat\beta$ are asymptotically equivalent, as demonstrated theoretically in the Appendix and empirically in Section 4, so that computing the information on the full sample is not necessary.

3 Implementation

Proof-of-concept code in the form of an R package is at github.com/tslumley/dbglm. The tidypredict (Ruiz, 2018) and dbplyr (Wickham, 2017) packages are used to simplify the second, aggregation query, but the sampling query, which is not supported by dbplyr, is written directly in SQL using the DBI interface (R Special Interest Group on Databases (R-SIG-DB) et al., 2017).
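To make the database version concrete, here is an independent sketch (not the dbglm code) of the two queries for a logistic regression of a 0/1 outcome y on a single predictor x, run against SQLite. The table and file names are hypothetical; the scaled-modulus test is one way to get a sampling fraction from SQLite's random(), which returns a signed 64-bit integer rather than a uniform fraction; and RSQLite's extension loader is used in case exp() is not built in.

    library(DBI)
    con <- DBI::dbConnect(RSQLite::SQLite(), "vehicles.db")  # hypothetical file
    RSQLite::initExtension(con)  # loads math functions such as exp(), if needed

    ## Query 1: approximate Bernoulli subsample with sampling fraction frac
    frac <- 1e5 / 1e9            # e.g. n = 100,000 out of N = one billion
    samp <- DBI::dbGetQuery(con, sprintf(
      "SELECT y, x FROM vehicles WHERE abs(random() %% 1000000) < %d",
      as.integer(round(frac * 1e6))))
    beta0 <- coef(glm(y ~ x, family = binomial, data = samp))

    ## Query 2: one aggregation pass for the score and information at beta0,
    ## with mu = 1/(1 + exp(-(b0 + b1 x))) written out in SQL
    mu <- sprintf("(1.0/(1.0 + exp(-(%.15g + %.15g * x))))", beta0[1], beta0[2])
    agg <- DBI::dbGetQuery(con, sprintf(
      "SELECT sum(y - %s)                AS u0,
              sum((y - %s) * x)          AS u1,
              sum(%s * (1 - %s))         AS i00,
              sum(%s * (1 - %s) * x)     AS i01,
              sum(%s * (1 - %s) * x * x) AS i11
         FROM vehicles", mu, mu, mu, mu, mu, mu, mu, mu))

    ## One-step Fisher scoring update from the aggregates
    info  <- matrix(c(agg$i00, agg$i01, agg$i01, agg$i11), 2, 2)
    beta1 <- beta0 + solve(info, c(agg$u0, agg$u1))

With MonetDB the sampling query would instead use the SAMPLE qualifier, and with $p$ predictors the aggregation grows to $p$ score terms and $\binom{p+1}{2}$ information terms.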
The code has certain limitations for speed and simplicity. First, indicator variables will be defined to represent categorical variables, but only for those levels that are present in the subsample; other values will effectively be combined with the reference category. Second, because different databases use different syntax for pseudorandom number generation, the code works only with the MonetDBLite and RSQLite database connectors. It should be straightforward to add other databases. Third, the code supports only a specific list of generalised linear models and does not allow user-defined link and variance functions. Adding a specific additional link/variance combination that uses only functions supported by the database should be straightforward. Fourth, the code does not support transformations. Variables corresponding to transformations need to be created before dbglm is called.
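For example, a log-transformed predictor could be materialised ahead of time with dbplyr; a sketch, assuming an open DBI connection con and an illustrative table name (log() is translated to the corresponding SQL function by the backend):

    library(dplyr)
    library(dbplyr)
    vehicles <- tbl(con, "vehicles") %>%
      mutate(log_mass = log(gross_vehicle_mass)) %>%  # evaluated in SQL, not R
      compute(name = "vehicles_log")  # materialise the new column in the database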
4 Examples

All analyses were conducted on a MacBook Air laptop with an Intel i7-4650U processor at 1.7GHz, 8GiB of memory, and a solid-state drive, running OS X 10.13.2 and using R version 3.4.0. Instructions for obtaining the data, and scripts for analysing it, are in the dbglm package, which can be obtained from https://github.com/tslumley/dbglm.

The New Zealand Transport Agency maintains a list of all vehicles (currently licensed or not) in the country and their owners. Information on the vehicles, the "Fleet Vehicle Statistic", is available for download. The version used here is current at 2017-11-30 and was downloaded on 2017-12-14. There are 5 million records; I selected the 3.3 million "PASSENGER CAR/VAN" records. The recorded number of seats varies from 0 to 999 and the recorded number of axles from 0 to 9; I kept only those with 2-6 seats and 2 axles, leaving 1.7 million records.
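The record selection can be done in the database; a sketch with dbplyr, where the table name, the vehicle-type column, and the axle-count column are assumptions about the NZTA schema rather than its documented names:

    library(dplyr)
    cars <- tbl(con, "nzta_fleet") %>%        # hypothetical table name
      filter(vehicle_type == "PASSENGER CAR/VAN",
             between(number_of_seats, 2, 6),  # keep 2-6 seats
             number_of_axles == 2)            # keep 2 axles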
I fitted a logistic regression model with outcome variable 1 if the basic_colour variable was RED and 0 otherwise, using power_rating, number_of_seats, and gross_vehicle_mass as predictors. Table 1 shows the results from three realisations of the one-step estimator and from a fit to the whole data.

Table 1: Log odds ratios from logistic regression predicting red colour from vehicle characteristics; data from New Zealand Transport Agency. Analysis of all N = 1726134 records and three realisations of a one-step estimator starting with a subsample of 2917 records.

                              Intercept   Power/1000   Seats    Mass/1000
  Full data    $\hat\beta$      -1.04       3.12      -0.149     -0.302
               SE × 100          2.9        3.3        0.59       0.42
  Replicate 1  $\hat\beta$      -0.99       3.10      -0.160     -0.296
               SE × 100          2.4        3.3        0.49       0.46
  Replicate 2  $\hat\beta$      -1.06       3.05      -0.143     -0.301
               SE × 100          2.8        3.5        0.58       0.41
  Replicate 3  $\hat\beta$      -1.01       3.08      -0.153     -0.302
               SE × 100          2.7        3.3        0.54       0.42

We see that cars with higher power are more likely to be red, and that this is not due to red cars being bigger; they have lower mass and fewer seats on average.

The one-step estimator took between 1.3 and 1.4 seconds using MonetDBLite and 3.1–3.4s using RSQLite. The full maximum likelihood estimator is feasible with a data set of this size; it took 7.6 seconds, not including time for data transfer from the database. The bigglm function from the biglm package (Lumley, 2013), which computes the full maximum likelihood estimator by reading the data set in chunks within each iteration, took 15s using the MonetDBLite package and 26s using RSQLite.

In response to Freedom of Information requests, the New York City Taxi Commission has provided data on taxi trips in New York. Here, I analyse the data from 2016 for traditional 'yellow cabs' that are authorised to pick up passengers from the street anywhere in the city. The data were downloaded on 2017-12-14 and -15 from https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-XX.csv, with
XX = 01 to 12. There are 131 million records. I filtered the data to only the trips paid by credit card (because these have tip information), and excluded trips of over 50 miles, leaving 86 million records. I defined the response variable bad_tip to indicate a tip of less than 20%, and defined night to be from 8pm to 4am, and weekend to be from 8pm Friday until midnight Sunday; a sketch of these derived variables follows.
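These derived variables can be constructed inside the database. Below is a sketch using dbplyr, assuming a DBI connection con, an illustrative table name, the standard TLC column names, and the TLC coding in which payment_type 1 is a credit-card payment; strftime() is SQLite-specific and is passed through to SQL untranslated.

    library(dplyr)
    library(dbplyr)
    trips <- tbl(con, "yellow_2016") %>%
      filter(payment_type == 1,       # credit-card trips have tip information
             trip_distance <= 50) %>% # drop trips over 50 miles
      mutate(
        bad_tip = as.integer(tip_amount < 0.2 * fare_amount),
        hour    = as.integer(strftime("%H", tpep_pickup_datetime)),
        dow     = as.integer(strftime("%w", tpep_pickup_datetime)),  # 0 = Sunday
        night   = as.integer(hour >= 20 | hour < 4),                 # 8pm-4am
        weekend = as.integer(dow %in% c(0, 6) | (dow == 5 & hour >= 20))
      )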
Table 2 compares the one-step approach to using bigglm with MonetDBLite. A low tip is less likely at night, and for airport trips. Long trips are more likely to attract a low tip, but the relationship is not strong. There is little difference between weekend and weekday trips, or with number of passengers. A low tip is much more likely for 'negotiated rate' trips: presumably either the rate includes the tip or the customer believes it should. In this relatively large data set the agreement between the estimators is mostly good, though the coefficient for rate code 3, 'Newark', does differ.

Table 2: Tipping in NY taxis: logistic regression model for odds of a <20% tip.

                                          One-step                    Full
                                    $\hat\beta$ (95% CI)       $\hat\beta$ (95% CI)
  (Intercept)                    -0.815 (-0.812, -0.808)   -0.814 (-0.817, -0.810)
  weekend                         0.029 ( 0.036,  0.043)    0.026 ( 0.019,  0.033)
  night                          -0.209 (-0.206, -0.203)   -0.205 (-0.208, -0.201)
  weekend × night                -0.005 ( 0.002,  0.009)    0.012 ( 0.005,  0.019)
  passenger count                -0.008 (-0.007, -0.005)   -0.009 (-0.010, -0.007)
  weekend × passenger count      -0.021 (-0.017, -0.014)   -0.002 (-0.006,  0.001)
  night × passenger count        -0.007 (-0.005, -0.004)   -0.004 (-0.006, -0.002)
  weekend × night × pass. count   0.014 ( 0.018,  0.021)    0.003 (-0.000,  0.007)
  trip distance                   0.033 ( 0.033,  0.033)    0.033 ( 0.033,  0.034)
  Compared to standard rate
    To/from JFK                  -0.266 (-0.262, -0.259)   -0.268 (-0.272, -0.264)
    To/from Newark               -0.138 (-0.130, -0.121)   -0.199 (-0.209, -0.189)
    Nassau & Westchester          0.002 ( 0.021,  0.040)   -0.095 (-0.117, -0.073)
    Negotiated rate               1.406 ( 1.413,  1.421)    1.401 ( 1.393,  1.409)
  Computation time               cpu: 430s, elapsed: 340s  cpu: 890s, elapsed: 920s

5 Discussion

If the data can be regarded as in random order, so that the first $n$ observations can be used as the subsample, a true one-pass implementation is possible; however, it is important that the subsample truly be representative.

In a cluster or cloud context it would clearly be straightforward to parallelise both database queries. Parallelising the aggregation query is likely to be helpful; whether parallelising the sampling query is helpful depends on whether the database already has efficient sequential sampling algorithms.

[…] $n$ might have been desirable; if data had to come across a network connection from a separate database server it would be desirable to keep $n$ as small as possible.

The same general approach can be used for other regression models fitted by maximum likelihood, such as those considered by Yee and Wild (1996), as long as the derivative of the log likelihood with respect to the linear predictors can be expressed in closed form in SQL. For example, it would be feasible to fit a Weibull model, or a zero-inflated Poisson, but fitting the overdispersion parameter of a negative binomial or the degrees of freedom of a Student's $t$ model would not be feasible, since the score for these models involves the digamma function. Even when the computations are feasible it may sometimes be necessary to use either two iterations or a larger subsample for some models; Cheng (2013) shows that a starting estimator with $o_p(N^{-1/4})$ error is in general required for a single iteration to give an efficient estimator, so $n$ larger than $N^{1/2}$ would be needed.

If efficient sampling on the outcome variable or predictors is possible, it may be possible to use a smaller subsample. For example, if $Y$ is a rare outcome with $M \ll N$ observations having $Y = 1$, a sample of $M^{1/2+\delta}$ cases and, say, $5M^{1/2+\delta}$ controls will suffice for the subsample; a sampling sketch follows. Whether this or more sophisticated two-phase sampling strategies are useful in practice will depend on details of the database setup, such as the cost of constructing new indexes.
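A sketch of the case-control subsample as a single query, using the same scaled-modulus trick as before; the table and column names are hypothetical. One caveat the note does not spell out: for logistic regression the unequal sampling fractions shift the subsample intercept by log(f1/f0), which should be removed before the polishing step.

    ## f1, f0: sampling fractions for cases (y = 1) and controls (y = 0),
    ## chosen to keep about M^(1/2+delta) cases and 5 M^(1/2+delta) controls
    samp <- DBI::dbGetQuery(con, sprintf(
      "SELECT * FROM tab WHERE y = 1 AND abs(random() %% 1000000) < %d
       UNION ALL
       SELECT * FROM tab WHERE y = 0 AND abs(random() %% 1000000) < %d",
      as.integer(round(f1 * 1e6)), as.integer(round(f0 * 1e6))))
    beta0 <- coef(glm(y ~ x, family = binomial, data = samp))
    beta0[1] <- beta0[1] - log(f1 / f0)  # undo the case-control intercept shift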
A Theorem and proof

Suppose we are fitting a generalised linear model with regression parameters $\beta$, outcome $Y$, and predictors $X$. Let $\beta_0$ be the true value of $\beta$. Assume the second partial derivatives of the loglikelihood have uniformly bounded second moments on a compact neighbourhood $K$ of $\beta_0$. Let $U_N(\beta)$ be the score evaluated at $\beta$ on $N$ observations and $I_N(\beta)$ be the expected Fisher information evaluated at $\beta$ on $N$ observations. Let $\Delta$ be the tensor of third partial derivatives of the log likelihood, and assume its elements
$$(\Delta)_{ijk} = \frac{\partial^3}{\partial\beta_i\,\partial\beta_j\,\partial\beta_k}\,\log\ell(Y; X, \beta)$$
have uniformly bounded second moments on $K$.

Theorem 1.
Let $n = N^{1/2+\delta}$ for some $\delta \in (0, 1/2]$, and let $\tilde\beta$ be the maximum likelihood estimator of $\beta$ on a subsample of size $n$. Define $\tilde I(\tilde\beta) = \frac{N}{n}\, I_n(\tilde\beta)$, the estimated full-sample information based on the subsample. The one-step estimators
$$\hat\beta_{\mathrm{full}} = \tilde\beta + I_N^{-1}(\tilde\beta)\, U_N(\tilde\beta)$$
and
$$\hat\beta = \tilde\beta + \tilde I^{-1}(\tilde\beta)\, U_N(\tilde\beta)$$
are first-order efficient.

Proof:
The score function at the true parameter value is
$$U_N(\beta_0) = \sum_{i=1}^{N} x_i\, w_i(\beta_0)\,(y_i - \mu_i(\beta_0)).$$
By the mean-value form of Taylor's theorem we have
$$U_N(\beta_0) = U_N(\tilde\beta) + I_N(\tilde\beta)(\tilde\beta - \beta_0) + \Delta(\beta^*)(\tilde\beta - \beta_0,\, \tilde\beta - \beta_0),$$
where $\beta^*$ is on the interval between $\tilde\beta$ and $\beta_0$. With probability 1, $\tilde\beta$ and thus $\beta^*$ is in $K$ for all sufficiently large $n$, so, since $\tilde\beta - \beta_0 = O_p(n^{-1/2})$, the remainder term is $O_p(N\, n^{-1/2}\, n^{-1/2}) = o_p(N^{1/2})$. Thus
$$I_N^{-1}(\tilde\beta)\, U_N(\beta_0) = I_N^{-1}(\tilde\beta)\, U_N(\tilde\beta) + \tilde\beta - \beta_0 + o_p(N^{-1/2}).$$

Let $\hat\beta_{\mathrm{MLE}}$ be the maximum likelihood estimator. It is a standard result that
$$\hat\beta_{\mathrm{MLE}} = \beta_0 + I_N^{-1}(\beta_0)\, U_N(\beta_0) + o_p(N^{-1/2}),$$
so
$$\hat\beta_{\mathrm{MLE}} = \tilde\beta + I_N^{-1}(\tilde\beta)\, U_N(\tilde\beta) + o_p(N^{-1/2}) = \hat\beta_{\mathrm{full}} + o_p(N^{-1/2}).$$

Now let $I(\tilde\beta) = E_{X,Y}[N^{-1} I_N]$ be the expected per-observation information. By the Central Limit Theorem we have
$$I_N(\tilde\beta) = I_n(\tilde\beta) + (N-n)\, I(\tilde\beta) + O_p\!\left((N-n)^{1/2}\right),$$
so
$$\left(\frac{N}{n}\, I_n(\tilde\beta)\right)^{-1} I_N(\tilde\beta) = \mathrm{Id}_p + O_p(n^{-1/2}),$$
where $\mathrm{Id}_p$ is the $p \times p$ identity matrix. We have
$$\hat\beta - \tilde\beta = \left(\frac{N}{n}\, I_n(\tilde\beta)\right)^{-1} I_N(\tilde\beta)\,(\hat\beta_{\mathrm{full}} - \tilde\beta) = \left(\mathrm{Id}_p + O_p(n^{-1/2})\right)(\hat\beta_{\mathrm{full}} - \tilde\beta) = (\hat\beta_{\mathrm{full}} - \tilde\beta) + O_p(n^{-1}) = (\hat\beta_{\mathrm{full}} - \tilde\beta) + o_p(N^{-1/2}),$$
so $\hat\beta$ is also asymptotically efficient.

References
Cheng, G. (2013). How many iterations are sufficient for efficient semiparametric estimation? Scandinavian Journal of Statistics, 40:592-618.

Lumley, T. (2013). biglm: bounded memory linear and generalized linear models. R package version 0.9-1.

Muehleisen, H., Damico, A., Raasveldt, M., Lumley, T., MonetDB B.V., CWI, The Regents of the University of California, Kungliga Tekniska Hogskolan, and Free Software Foundation, Inc. (2017). MonetDBLite: In-Process Version of 'MonetDB'. R package version 0.5.1.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

R Special Interest Group on Databases (R-SIG-DB), Wickham, H., and Müller, K. (2017). DBI: R Database Interface. R package version 0.7.

Ruiz, E. (2018). tidypredict: Run Predictions Inside the Database. R package version 0.2.0.

Simpson, D. G., Ruppert, D., and Carroll, R. J. (1992). On one-step GM estimates and stability of inferences in linear regression. Journal of the American Statistical Association, 87:439-450.

Wickham, H. (2017). dbplyr: A 'dplyr' Back End for Databases. R package version 1.1.0.

Yee, T. W. and Wild, C. J. (1996). Vector generalized additive models. Journal of the Royal Statistical Society, Series B, 58:481-493.