AC/DC: In-Database Learning Thunderstruck
Mahmoud Abo Khamis, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, Maximilian Schleich
Mahmoud Abo Khamis
RelationalAI, Inc
Hung Q. Ngo
RelationalAI, Inc
XuanLong Nguyen
University of Michigan
Dan Olteanu
University of Oxford
Maximilian Schleich
University of Oxford
ABSTRACT
We report on the design and implementation of the AC/DC gradient-descent solver for a class of optimization problems over normalized databases. AC/DC decomposes an optimization problem into a set of aggregates over the join of the database relations. It then uses the answers to these aggregates to iteratively improve the solution to the problem until it converges.

The challenges faced by AC/DC are the large database size, the mixture of continuous and categorical features, and the large number of aggregates to compute. AC/DC addresses these challenges by employing a sparse data representation, factorized computation, problem reparameterization under functional dependencies, and a data structure that supports shared computation of aggregates.

To train polynomial regression models and factorization machines of up to 154K features over the natural join of all relations from a real-world dataset of up to 86M tuples, AC/DC needs up to 30 minutes on one core of a commodity machine. This is up to three orders of magnitude faster than its competitors R, MADlib, libFM, and TensorFlow whenever they finish and thus do not exceed memory limitation, 24-hour timeout, or internal design limitations.
Rode down the highway
Broke the limit, we hit the town
Went through to Texas, yeah Texas, and we had some fun.
– Thunderstruck (AC/DC)
In this paper we report our on-going work on the design and implementation of AC/DC, a gradient descent solver for a class of optimization problems including ridge linear regression, polynomial regression, and factorization machines. It extends our prior system F for factorized learning of linear regression models [22] to capture non-linear models, categorical features, and model reparameterization under functional dependencies (FDs). Its design is but one fruit of our exploration of the design space for the AI engine currently under development at RelationalAI. It subscribes to a recent effort to bring analytics inside the database [8, 10, 13, 14, 22] and thereby avoid the non-trivial time spent on data import/export at the interface between database systems and statistical packages.

AC/DC solves optimization problems over design matrices defined by feature extraction queries over databases of possibly many relations. It is a unified approach for computing both the optimizations and the underlying database queries: the two tasks not only live in the same process space, they are intertwined in one execution plan with asymptotically lower complexity than that of either one in isolation. This is possible due to several key contributions.

(AC/DC supports both categorical and continuous features and fast processing; its name alludes to the duality of alternating and discrete currents and to the fast-paced sound of the homonymous rock band.)

First, AC/DC decomposes a given optimization problem into a set of aggregates whose answers are fed into a gradient descent solver that iteratively approximates the solution to the problem until it reaches convergence. The aggregates capture the combinations of features in the input data, as required for computing the gradients of an objective function. They are group-by aggregates in case of combinations with at least one categorical feature and plain scalars for combinations of continuous features only.
The former aggregates are grouped by the query variables with a categorical domain. Prior work on in-database machine learning mostly considered continuous features and one-hot encoded categorical features, e.g., [8, 10, 14, 22]. We avoid the expensive one-hot encoding of categorical features by a sparse representation using group-by aggregates [3]. Several tools, e.g., libFM [9, 21] for factorization machines and LIBSVM [7] for support vector machines, employ sparse data representations that avoid the redundancy introduced by one-hot encoding. These are computed on the result of the feature extraction query once it is exported out of the database.

Second, AC/DC factorizes the computation of these aggregates over the feature extraction query to achieve the lowest known complexity. We recently pinpointed the complexity of AC/DC [3]. The factorized computation of the queries obtained by decomposing optimization problems can be asymptotically faster than the computation of the underlying join query alone. This means that all machine learning approaches that work on a design matrix defined by the result of the database join are asymptotically suboptimal. The only other end-to-end in-database learning system [14] that may outperform the underlying database join works for generalized linear models over key-foreign key joins, does not decompose the task to AC/DC's granularity, and cannot recover the good complexity of AC/DC since it does not employ factorized computation.

Third, AC/DC massively shares computation across the aggregate-join queries. These different queries use the same underlying join and their aggregates have similar structures.

Fourth, AC/DC exploits functional dependencies (FDs) in the input database to reduce the dimensionality of the optimization problem. Prior work exploited FDs for Naïve Bayes classification and feature selection [15]. AC/DC can reparameterize (non)linear regression models with non-linear regularizers [3].
This reparameterization requires taking the inverses of matrices, which are sums of identity matrices and vector dot products. To achieve performance improvements by model reparameterization, AC/DC uses an interplay of its own data structure and the Eigen library for linear algebra [12].

In this paper we report on the performance of AC/DC against MADlib [13], R [20], libFM [21], TensorFlow [1], and our earlier prototype F [22]. We used a real-world dataset with five relations of 86M tuples in total and up to 154K features to train a ridge linear regression model, a polynomial regression model of degree two, and a factorization machine of degree two. AC/DC is the fastest system in our experiments. It is orders of magnitude faster than the competitors or can finish successfully when the others exceed memory limitation, 24-hour timeout, or internal design limitations. The performance gap is attributed to the optimizations of AC/DC, which none of our competitors support fully. TensorFlow, R, and libFM require the materialization of the feature extraction query, as well as exporting/importing the query result from the database system to their system. They also require a transformation of the data into a sparse representation of the one-hot encoded training data before learning. MADlib does not materialize and export the query, but still requires an upfront one-hot encoding of the input relations, which comes with higher asymptotic complexity and prohibitively large relations with lots of zero entries. None of these systems benefit from factorized computation nor exploit functional dependencies. F is designed for linear regression models. It uses factorized and shared computation of aggregates.
It however does not exploit functional dependencies and requires the same one-hot encoding of categorical features in the input relations as MADlib.

Our results confirm a counter-intuitive theoretical result from [3] stating that, under certain conditions, by exploiting query structures and smart aggregation algorithms one can train a model using batch gradient descent (BGD) faster than scanning through the data once. In particular, this means that BGD can be faster than one epoch of stochastic gradient descent (SGD), in contrast to the commonly accepted assumption that SGD is typically faster than BGD.

The paper is organized as follows. Section 2 presents the class of optimization problems supported by AC/DC. Section 3 overviews the foundations of AC/DC. Section 4 describes the data structures used by AC/DC and its optimizations for factorized and shared computation of aggregates. Section 5 discusses the reparameterization of optimization problems and their computation using the aggregates computed in a previous step. Section 6 reports on experiments.

We consider solving an optimization problem of a particular form inside a database; a more general problem formulation is presented in [3]. Suppose we have p parameters θ = (θ_1, ..., θ_p) and a training dataset T that contains tuples with n features x = (x_1, ..., x_n) and a response y. The features come in two flavors: continuous, e.g., price, and qualitative/categorical, e.g., city. The former are encoded as scalars, e.g., the price is 10.5. The latter are one-hot encoded as indicator vectors, e.g., if there were three cities then the vector [1, 0, 0] indicates that the first city appears in the training record.

The "learning" phase in machine learning typically comes down to solving the optimization problem θ* := argmin_θ J(θ), where J(θ) is the loss function. We use the square loss and the ℓ2-regularizer:

  J(θ) = 1/(2|T|) · Σ_{(x,y)∈T} (⟨g(θ), h(x)⟩ − y)² + (λ/2) · ‖θ‖²₂    (1)

The model is ⟨g(θ), h(x)⟩, where g and h are parameter-mapping and feature-mapping functions, respectively, that uniformly capture continuous and categorical variables. In the case of continuous features only, g and h are vector-valued functions g : R^p → R^m and h : R^n → R^m, where m is a positive integer. Each component function g_j of g = (g_j)_{j∈[m]} is a multivariate polynomial. Each component function h_j of h = (h_j)_{j∈[m]} is a multivariate monomial. If there are categorical variables as well, the components of g and h become tensors. See [3] and Example 2.1 below for details.

The training dataset T is the result of a feature extraction query Q over an input database D consisting of several relations. It is common for Q to be the natural join of the relations in D, selecting the columns that host the desired features. Furthermore, Q can be enhanced with additional selections and aggregates to construct new (intensional) features based on the input (extensional) ones. We next show how Eq. (1) captures various regression models.

Example 2.1.
Consider a feature extraction query that returns tuples over the variables {units_sold, city, country, price}, where units_sold and price are continuous and city and country are categorical. A ridge linear regression (LR) model with response units_sold and features {city, country, price} can be learned by optimizing (1) with the functions g(θ) = (θ_0, θ_city, θ_country, θ_price) and h(x) = (1, x_city, x_country, x_price). The entries in g and h for city and country are vectors, because the features x_city and x_country are indicator vectors.

To learn a degree-2 polynomial regression (PR) model, we extend the function h from the LR model with all pairwise combinations of features: x_city ⊗ x_country, x_city · x_price, x_country · x_price, and x_price². We do not include x_city ⊗ x_city and x_country ⊗ x_country because they encode the same information as the original indicator vectors. The function g from the LR model is correspondingly extended with parameters for each interaction, e.g., the parameter vector θ_(city,price) for the vector of interactions x_city · x_price.

Similarly, the degree-2 rank-r factorization machine (FaMa) model can be learned by extending the h function from LR with all pairwise interactions of distinct features as for PR, yet without the interaction x_price². In contrast to PR, the parameters corresponding to interactions are now factorized: the entry in g corresponding to the interaction x_city · x_price is Σ_{ℓ∈[r]} θ_city^(ℓ) · θ_price^(ℓ).

AC/DC is a batch gradient descent (BGD) solver that optimizes the objective J(θ) over the training dataset T defined by a feature extraction query Q over a database D. Its high-level structure is given in Algorithm 1. The inner loop repeatedly computes the loss function J(θ) and its gradient ∇J(θ). This can be sped up massively by factoring out the data-dependent computation from the optimization loop [3, 22].
The former is cast as computing aggregates over joins, which benefit from recent algorithmic advances [4, 5].

From Optimization to Aggregates.
Let us define the matrix Σ = (σ_ij)_{i,j∈[m]}, the vector c = (c_i)_{i∈[m]}, and the scalar s_Y by:

  σ_ij = 1/|Q(D)| · Σ_{(x,y)∈Q(D)} h_i(x) ⊗ h_j(x)    (2)
  c_i  = 1/|Q(D)| · Σ_{(x,y)∈Q(D)} y · h_i(x)          (3)
  s_Y  = 1/|Q(D)| · Σ_{(x,y)∈Q(D)} y²                  (4)

Algorithm 1: BGD with Armijo line search.

  θ ← a random point;
  while not converged yet do
      α ← next step size            // Barzilai-Borwein [6]
      d ← ∇J(θ);
      while ( J(θ − αd) ≥ J(θ) − (α/2)‖d‖² ) do
          α ← α/2;                  // line search
      θ ← θ − αd;

Then:

  J(θ)  = (1/2) g(θ)ᵀ Σ g(θ) − ⟨g(θ), c⟩ + s_Y/2 + (λ/2)‖θ‖²    (5)
  ∇J(θ) = (∂g(θ)ᵀ/∂θ) Σ g(θ) − (∂g(θ)ᵀ/∂θ) c + λθ               (6)

The quantity ∂g(θ)ᵀ/∂θ is a p × m matrix, and Σ is an m × m matrix of sparse tensors. Statistically, Σ is related to the covariance matrix, c to the correlation between the response and the regressors, and s_Y to the empirical second moment of the regressand. When all input features are continuous, each component function h_j(x) is a monomial giving a scalar value. In real-world workloads there is always a mix of categorical and continuous features, where the monomials h_j(x) become tensor products, and so the quantities h_i(x) ⊗ h_j(x) are also tensor products, represented by relational queries with group-by. The group-by variables for the aggregate query computing the sum of tensor products Σ_{(x,y)∈Q(D)} h_i(x) ⊗ h_j(x) in σ_ij are precisely the categorical variables occurring in the monomials defining the component functions h_i and h_j. For example, if h_i(x) = x_A and h_j(x) = x_B, where A and B are continuous features, then there is no group-by variable and the above sum of tensor products is expressed as follows in SQL:

  SELECT SUM(A*B) FROM Q;
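To make Algorithm 1 and Eqs. (5)-(6) concrete, here is a minimal sketch in Python, not AC/DC's implementation, for the continuous-features-only case with g(θ) = θ (ridge linear regression); the function name, toy inputs, and convergence thresholds are our illustrative assumptions:

```python
import numpy as np

def bgd(Sigma, c, s_Y, lam, iters=100):
    """Toy version of Algorithm 1 (BGD with Armijo line search) for the
    continuous-only case with g(theta) = theta, where by Eqs. (5)-(6):
      J(th)      = 1/2 th'Sigma th - <th, c> + s_Y/2 + lam/2 ||th||^2
      grad J(th) = Sigma th - c + lam th
    Sigma, c, s_Y are the data-dependent quantities of Eqs. (2)-(4),
    computed once inside the database before the loop starts."""
    J = lambda th: 0.5 * th @ Sigma @ th - th @ c + 0.5 * s_Y + 0.5 * lam * th @ th
    grad = lambda th: Sigma @ th - c + lam * th

    theta = np.random.rand(c.shape[0])      # a random starting point
    prev_theta = prev_d = None
    alpha = 1.0
    for _ in range(iters):
        d = grad(theta)
        if np.linalg.norm(d) < 1e-10:       # converged
            break
        if prev_theta is not None:          # Barzilai-Borwein step size
            dth, dd = theta - prev_theta, d - prev_d
            if abs(dth @ dd) > 1e-12:
                alpha = (dth @ dth) / (dth @ dd)
        # Armijo backtracking: halve alpha until sufficient decrease
        while J(theta - alpha * d) >= J(theta) - (alpha / 2) * (d @ d):
            alpha /= 2
            if alpha < 1e-12:
                break
        prev_theta, prev_d = theta, d
        theta = theta - alpha * d
    return theta
```

Note that the training data enters the loop only through Σ, c, and s_Y, which is why they can be computed once, up front, as database aggregates.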
On the other hand, if both A and B are categorical variables, then the SQL encoding has two group-by variables A and B:

  SELECT A, B, COUNT(*) FROM Q GROUP BY A, B;
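As a sketch of this sparse representation, ours rather than AC/DC's actual data structure, the group-by aggregate above can be viewed as a map keyed only by the (A, B) pairs that occur in the data:

```python
from collections import Counter

def sigma_entry_categorical(Q, A, B):
    """The sigma_ij entry for categorical h_i = x_A and h_j = x_B:
    the sum of tensor products x_A (x) x_B over Q is a sparse map from
    the (a, b) pairs occurring in Q to their counts -- the result of
    SELECT A, B, COUNT(*) FROM Q GROUP BY A, B.
    Q is modeled as a list of dicts (tuples of the query result)."""
    return Counter((t[A], t[B]) for t in Q)

# Toy query result with two categorical columns:
Q = [{"city": "Nice",   "country": "FR"},
     {"city": "Nice",   "country": "FR"},
     {"city": "Oxford", "country": "UK"}]
print(sigma_entry_categorical(Q, "city", "country"))
```

Only the two (city, country) pairs present in Q are stored, instead of the 2 × 2 entries that a one-hot encoded, dense representation would materialize.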
These aggregates exploit the sparsity of the representation of σ_ij over categorical features to achieve a succinct representation: the group-by clause ensures that only those combinations of categories for the query variables A and B that exist in the training dataset are considered. The aggregates c and s_Y are treated similarly.

The above rewriting allows us to compute the data-dependent quantities Σ, c, and s_Y in the loss function J and its gradient ∇J once for all iterations of AC/DC. They can be computed efficiently inside the database as aggregates over the query Q. Aggregates with different group-by clauses may require different evaluation strategies to attain the best known complexity [4, 5]. AC/DC settles instead for one strategy for all aggregates, cf. Section 4. This has two benefits. First, the underlying join is only computed once for all aggregates. Second, the computation of the aggregates can be shared massively. These benefits may easily dwarf the gain of using specialised evaluation strategies for individual aggregates in the case of very many aggregates (hundreds of millions) and large databases.

Reparameterization under Functional Dependencies (FDs).
AC/DC exploits the FDs among variables in the feature extraction query Q to reduce the dimensionality of the optimization problem by eliminating functionally determined variables and reparameterizing the model. We thus only compute the quantities Σ, c, and s_Y on the subset of the features that are not functionally determined and solve the lower-dimensional optimization problem. The effect on the loss function and its gradient is immediate and uniform across all optimization problems in our class: we have fewer terms to compute, since the functionally determined variables are dropped. The effect on the non-linear penalty term Ω is however non-trivial and depends on the model at hand [3].

We next explain the reparameterization of the ridge linear regression from Example 2.1 under the FD city → country [3]. The categorical features x_city and x_country are represented by indicator vectors, and the latter can be recovered from the former using the mapping between values for city and country in the input database. We can extract this map R(country, city), which is a sparse representation of a matrix R for which x_country = R x_city. The model becomes:

  ⟨θ, x⟩ = Σ_{j ∉ {city, country}} ⟨θ_j, x_j⟩ + ⟨θ_city + Rᵀθ_country, x_city⟩
         = Σ_{j ∉ {city, country}} ⟨θ_j, x_j⟩ + ⟨γ_city, x_city⟩.

The parameters θ_city and θ_country are replaced by the new parameters γ_city := θ_city + Rᵀθ_country for the categorical features x_city. This new form can be used in the loss function (1). We can further optimize out θ_country from the non-linear penalty term by setting the partial derivative of J with respect to θ_country to 0.
Then the new penalty term becomes:

  ‖θ‖² = Σ_{j ∉ {city, country}} ‖θ_j‖² + ‖γ_city − Rᵀθ_country‖² + ‖θ_country‖²
       = Σ_{j ∉ {city, country}} ‖θ_j‖² + ⟨(I_city + RᵀR)⁻¹ γ_city, γ_city⟩,

where I_city is the identity matrix in the order of the active domain size of city. A similar formulation of the penalty term holds for polynomial regression models.

AC/DC thus exploits FDs to compute fewer aggregates at the cost of a more complex regularizer term that requires matrix inversion and multiplication. While these matrix operations are expressible as database queries and can be evaluated similarly to the other aggregates [3], AC/DC uses instead the Eigen library for linear algebra to achieve significant speedups for model reparameterization over the strawman approach that does not exploit FDs.

An immediate approach to computing the aggregates in Σ, c, and s_Y for a given optimization problem is to first materialize the result of the feature extraction query Q using an efficient query engine, e.g., a worst-case optimal join algorithm, and then compute the aggregates in one pass over the query result. This approach, however, is suboptimal, since the listing representation of the query result is highly redundant and not necessary for the computation of the aggregates. AC/DC avoids this redundancy by factorizing the computation of aggregates over joins, as detailed in Section 4.1. In a nutshell, this factorized approach unifies three powerful ideas: worst-case optimality for join processing, query plans defined by fractional hypertree decompositions of join queries, and an optimization that partially pushes aggregates past joins. AC/DC further
exploits similarities across the aggregates to massively share their computation, as detailed in Section 4.2.

  aggregates(variable order ∆, varMap, relation ranges R_1[x_1,y_1], ..., R_d[x_d,y_d])
    A = root(∆); context = π_dep(A)(varMap); reset(aggregates_A);
    if (dep(A) ≠ anc(A)) {
        aggregates_A = cache_A[context];
        if (aggregates_A ≠ ∅) return; }
    foreach i ∈ [d] do R_i[x'_i, y'_i] = R_i[x_i, y_i];
    foreach a ∈ ⋂_{i∈[d] such that A∈vars(R_i)} π_A(R_i[x_i, y_i]) do {
        foreach i ∈ [d] such that A ∈ vars(R_i) do
            find range R_i[x'_i, y'_i] ⊆ R_i[x_i, y_i] such that π_A(R_i[x'_i, y'_i]) = {(A : a)};
        switch (A):
            continuous feature : λ_A = [{() ↦ 1}, {() ↦ a}, ..., {() ↦ a^(2·degree)}];
            categorical feature: λ_A = [{() ↦ 1}, {a ↦ 1}];
            no feature         : λ_A = [{() ↦ 1}];
        switch (∆):
            leaf node A:
                foreach ℓ ∈ [|aggregates_A|] do
                    { [i_0] = R_A[ℓ]; aggregates_A[ℓ] += λ_A[i_0]; }
            inner node A(∆_1, ..., ∆_k):
                foreach j ∈ [k] do
                    aggregates(∆_j, varMap × {(A : a)}, ranges R_1[x'_1,y'_1], ..., R_d[x'_d,y'_d]);
                if (∀j ∈ [k] : aggregates_root(∆_j) ≠ ∅)
                    foreach ℓ ∈ [|aggregates_A|] do
                        { [i_0, i_1, ..., i_k] = R_A[ℓ];
                          aggregates_A[ℓ] += λ_A[i_0] ⊗ ⊗_{j∈[k]} aggregates_root(∆_j)[i_j]; } }
    if (dep(A) ≠ anc(A)) cache_A[context] = aggregates_A;

Figure 1: Algorithm for computing the aggregates aggregates_A. Each aggregate is a map from tuples over its group-by variables to scalars. The parameters of the initial call are the variable order ∆ of the feature extraction query, an empty map from variables to values, and the full range of tuples for each relation R_1, ..., R_d in the input database.

Factorized aggregate computation relies on a variable order ∆ for the query Q to avoid redundant computation.
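As a toy illustration, with encoding and names of our choosing rather than AC/DC's, the variable order used in Example 4.2 below for the join R(A,B,C) ⋈ S(B,D) ⋈ T(A,E) can be written as a small tree with dep annotations:

```python
from dataclasses import dataclass, field

@dataclass
class VarOrderNode:
    """A variable order (Definition 4.1) as a rooted tree: one node per
    query variable, each annotated with its dep set -- the ancestors on
    which the subtree rooted at this variable actually depends."""
    var: str
    dep: frozenset = frozenset()
    children: list = field(default_factory=list)

def anc(node, target, ancestors=frozenset()):
    """Ancestors of `target` in the variable order rooted at `node`."""
    if node.var == target:
        return ancestors
    for child in node.children:
        found = anc(child, target, ancestors | {node.var})
        if found is not None:
            return found
    return None

# Variable order for R(A,B,C) |><| S(B,D) |><| T(A,E):
# A is the root with children B and E; B has children C and D.
delta = VarOrderNode("A", frozenset(), [
    VarOrderNode("B", frozenset({"A"}), [
        VarOrderNode("C", frozenset({"A", "B"})),
        VarOrderNode("D", frozenset({"B"})),   # D depends on B only
    ]),
    VarOrderNode("E", frozenset({"A"})),
])
```

Since dep(D) = {B} is strictly contained in anc(D) = {A, B}, an aggregate computed at D for a given B-value can be reused across A-values; this gap between dep and anc is exactly what the cache in Figure 1 exploits.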
In this paper, we assume that we are given a variable order. Prior work discusses this query optimization problem [4, 5, 18].

Variable Orders.
State-of-the-art query evaluation uses relation-at-a-time query plans. We use variable-at-a-time query plans, which we call variable orders. These are partial orders on the variables in the query; they capture the join dependencies in the query and dictate the order in which we solve each join variable. For each variable, we join all relations with that variable. Our choice is motivated by the complexity of join evaluation: relation-at-a-time query plans are provably suboptimal, whereas variable-at-a-time query plans can be chosen to be optimal [17].

Given a join query Q, a variable X depends on a variable Y if both are in the schema of a relation in Q.

Definition 4.1 (adapted from [19]).
A variable order ∆ for a join query Q is a pair (F, dep), where F is a rooted forest with one node per variable in Q, and dep is a function mapping each variable X to a set of variables in F. It satisfies the following constraints:

• For each relation in Q, its variables lie along the same root-to-leaf path in F.
• For each variable X, dep(X) is the subset of its ancestors in F on which the variables in the subtree rooted at X depend.

Without loss of generality, we use variable orders that are trees instead of forests. We can convert a forest into a tree by adding to each relation the same dummy join variable that takes a single value. For a variable X in the variable order ∆, anc(X) is the set of all ancestor variables of X in ∆. The set of variables in ∆ (the schema of a relation R) is denoted by vars(∆) (vars(R), respectively), and the variable at the root of ∆ is denoted by root(∆).

Example 4.2.
Figure 3(a) shows a variable order for the natural join of the relations R(A,B,C), S(B,D), and T(A,E). Then anc(D) = {A, B} and dep(D) = {B}, i.e., D has ancestors A and B, yet it only depends on B. Given B, the variables C and D are independent of each other. For queries with group-by variables, we choose a variable order where these variables sit above the other variables [5].

Figure 1 presents the AC/DC algorithm for factorized computation of SQL aggregates over the feature extraction query Q. The backbone of the algorithm, without the code in boxes, explores the factorized join of the input relations R_1, ..., R_d over a variable order ∆ of Q. As it traverses ∆ in depth-first preorder, it assigns values to the query variables. The assignments are kept in varMap and used to compute aggregates by the code in the boxes.

The relations are sorted following a depth-first pre-order traversal of ∆. Each call takes a range [x_i, y_i] of tuples in each relation R_i. Initially, these ranges span the entire relations. Once the root variable A of ∆ is assigned a value a from the intersection of possible A-values from the input relations, these ranges are narrowed down to those tuples with value a for A.

To compute an aggregate over the variable order ∆ rooted at A, we first initialize the aggregate to zeros. This is needed since the aggregates might have been used earlier for different assignments of ancestor variables in ∆. We next check whether we previously computed the aggregate for the same assignments of the variables in dep(A), denoted by context, and cached it in a map cache_A. Caching is useful when dep(A) is strictly contained in anc(A), since this means that the aggregate computed at A does not need to be recomputed for distinct assignments of the variables in anc(A) \ dep(A). In this case, we probe the cache using as key the assignments in varMap of the dep(A) variables: cache_A[context].
If we have already computed the aggregates over that assignment for dep(A), then we can just reuse the previously computed aggregates and avoid recomputation.

If A is a group-by variable, then we compute a map from each A-value a to a function of a and the aggregates computed at the children of A, if any. If A is not a group-by variable, then we compute a map from the empty value () to such a function; in this latter case, we could have just computed the aggregate instead of the map, though we use the map for uniformity. In case there are group-by variables under A, the computation at A returns maps whose keys are tuples over all these group-by variables in vars(∆).

Example 4.3.
Consider the query Q with the variable order ∆ in Figure 3(a). We first compute the assignments for A as Q_A = π_A R ⋈ π_A T. For each assignment a ∈ Q_A, we then find assignments for the variables under A within the narrow ranges of tuples that contain a. The assignments for B in the context of a are given by Q_B^a = π_B(σ_{A=a} R) ⋈ π_B S. For each b ∈ Q_B^a, the assignments for C and D are given by Q_C^{a,b} = π_C(σ_{A=a ∧ B=b} R) and Q_D^b = π_D(σ_{B=b} S). Since D depends on B and not on A, the assignments for D under a given b are repeated for every occurrence of b with assignments for A. The assignments for E given a ∈ Q_A are computed as Q_E^a = π_E(σ_{A=a} T).

Consider the aggregate COUNT(Q). The count at each variable X is computed as the sum over all value assignments of X of the product of the counts at the children of X in ∆; if X is a leaf in ∆, the product at children is considered 1. For our variable order, this computation is captured by the following factorized expression:

  COUNT = Σ_{a∈Q_A} [ Σ_{b∈Q_B^a} ( Σ_{c∈Q_C^{a,b}} 1 ) × V_D(b) ] × Σ_{e∈Q_E^a} 1    (7)

where V_D(b) = Σ_{d∈Q_D^b} 1 is computed once for each assignment b for B and reused for all subsequent occurrences of this assignment under assignments for A.

Summing all X-values in the result of Q for a variable X is done similarly, with the difference that at the variable X in ∆ we compute the sum of the values of X weighted by the product of the counts of their children.
For instance, the aggregate SUM(C∗E) is computed over our variable order by the following factorized expression:

  SUM(C·E) = Σ_{a∈Q_A} [ Σ_{b∈Q_B^a} ( Σ_{c∈Q_C^{a,b}} c ) × V_D(b) ] × Σ_{e∈Q_E^a} e    (8)

To compute the aggregate SUM(C∗E) GROUP BY A, we compute SUM(C∗E) for each assignment for A instead of marginalizing away A. The result is a map from A-values to values of SUM(C∗E).

A good variable order may include variables that are not explicitly used in the optimization problem. This is the case for join variables whose presence in the variable order ensures a good factorization. For instance, if we remove the variable B from the variable order in Figure 3(a), the variables C and D are no longer independent and we cannot factorize the computation over C and D. AC/DC exploits the conditional independence enabled by B, but computes no aggregate over B if this is not required in the problem.

Section 4.1 explains how to factorize the computation of one aggregate in Σ, c, and s_Y over the join of database relations. In this section we show how to share the computation across these aggregates.

Example 4.4.
Let us consider the factorized expressions of the sum aggregates SUM(C) and SUM(E) over ∆:

  SUM(C) = Σ_{a∈Q_A} [ Σ_{b∈Q_B^a} ( Σ_{c∈Q_C^{a,b}} c ) × V_D(b) ] × Σ_{e∈Q_E^a} 1    (9)
  SUM(E) = Σ_{a∈Q_A} [ Σ_{b∈Q_B^a} ( Σ_{c∈Q_C^{a,b}} 1 ) × V_D(b) ] × Σ_{e∈Q_E^a} e    (10)

We can share computation across the expressions (7) to (10) since they are similar. For instance, given an assignment b for B, all these aggregates need V_D(b). Similarly, for a given assignment a for A, the aggregates (8) and (10) can share the computation of the sum aggregate over Q_E^a. For assignments a ∈ Q_A and b ∈ Q_B^a, (8) and (9) can share the computation of the sum aggregate over Q_C^{a,b}.

AC/DC computes all aggregates together over a single variable order. It then shares as much computation as possible and significantly improves the data locality of the aggregate computation. AC/DC thus decidedly sacrifices the goal of achieving the lowest known complexity for individual aggregates for the sake of sharing as much computation as possible across these aggregates.

Aggregate Decomposition and Registration.
For a model of degree degree and a set of variables {A_l}_{l∈[n]}, we have aggregates of the form SUM(∏_{l∈[n]} A_l^{d_l}), possibly with a group-by clause, such that 0 ≤ Σ_{l∈[n]} d_l ≤ 2·degree, d_l ≥ 0, and all categorical variables are turned into group-by variables. The reason for 2·degree is due to the Σ matrix used to compute the gradient of the loss function (6), which pairs any two features of degree up to degree. Each aggregate is thus defined uniquely by a monomial ∏_{l∈[n]} A_l^{d_l}; we may discard the variables with exponent 0. For instance, the monomial for SUM(C∗E) is CE, while for SUM(C∗E) GROUP BY A it is ACE.

Aggregates can be decomposed into shareable components. Consider a variable order ∆ = A(∆_1, ..., ∆_k), with root A and subtrees ∆_1 to ∆_k. We can decompose any aggregate α to be computed over ∆ into k+1 components: one for the root A and one aggregate α_j for each subtree root root(∆_j), j ∈ [k]. Then α is computed as the product of its k+1 components. The monomial of each component is the projection of the monomial of α onto A, respectively vars(∆_j). The aggregate α_j is then pushed down the variable order and computed over the subtree ∆_j. If the projection of the monomial is empty, then the aggregate to be pushed down is SUM(1), which computes the size of the join defined by ∆_j. If several aggregates push the same aggregate to the subtree ∆_j, this is computed only once for all of them.

Figure 2: Index structure provided by the aggregate register for a particular aggregate α that is computed over the variable order ∆ = A(∆_1, ..., ∆_k). The computation of α is expressed as the sum of the Cartesian products of its aggregate components, located via the indices i_0, ..., i_k.

The decomposed aggregates form a hierarchy whose structure is that of the underlying variable order ∆. The aggregates at a variable X are denoted by aggregates_X. All aggregates are to be computed at the root of ∆, then fewer are computed at each of its children, and so on.
This structure is the same regardless of the input data and can be constructed before data processing. We therefore construct at compile time, for each variable X in ∆, an aggregate register R_X that is an array of all aggregates to be computed over the subtree of ∆ rooted at X. This register is used as an index structure to facilitate the computation of the actual aggregates. More precisely, an entry for an aggregate α in the register of X is labeled by the monomial of α and holds an array of indices of the components of α located in the registers at the children of X in ∆ and in the local register Λ_X of X. Figure 2 depicts this construction. The hierarchy of registers in Figure 3(b) forms an index structure that is used by AC/DC to compute the aggregates. This index structure is stored as one contiguous array in memory, where the entry for an aggregate α in the register comes with an auxiliary array holding the indices of α's aggregate components. The aggregates are ordered in the register so as to increase sequential access, and thus cache locality, when updating them.

Example 4.5.
Let us compute a regression model of degree 1 over a dataset defined by the join of the relations R(A,B,C), S(B,D), and T(A,E). We assume that B and E are categorical features, and all other variables are continuous. The quantities (Σ, c, s_Y) require the computation of the following aggregates: SUM(1), SUM(X) for each variable X, and SUM(X*Y) for each pair of variables X and Y. Figure 3(a) depicts a variable order ∆ for the natural join of the three relations, and Figure 3(b) illustrates the aggregate register that assigns a list of aggregates to each variable in ∆. The aggregates are identified by their respective monomials (the names in the register entries). The categorical variables are shown in bold. Since they are treated as group-by variables, we do not need aggregates whose monomials include categorical variables with exponents higher than 1. Any such aggregate is equivalent to the aggregate whose monomial includes the categorical variable with degree 1 only. The register R_A for the root A of ∆ has all aggregates needed to compute the model. The register R_B has all aggregates from R_A defined over the variables in the subtree of ∆ rooted at B. The variables C, D, and E are leaf nodes in ∆, so the monomials for the aggregates in the registers R_C, R_D, and R_E are the respective variables only. We use two additional registers Λ_A and Λ_B, which hold the aggregates corresponding to projections of the monomials of the aggregates in R_A, respectively R_B, onto A, respectively B. For a leaf node X, the registers Λ_X and R_X are the same. A path between two register entries in Figure 3(b) indicates that the aggregate in the register above uses the result of the aggregate in the register below. For instance, each aggregate in R_B is computed as the product of one aggregate each from Λ_B, R_C, and R_D.
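The aggregates of this example can be enumerated mechanically. The following sketch (illustrative Python, not AC/DC code) generates all monomials of total degree at most 2·degree for degree 1, turns the categorical variables B and E into group-by variables, and collapses categorical exponents above 1, as described above:

```python
from itertools import combinations_with_replacement

variables = ["A", "B", "C", "D", "E"]
categorical = {"B", "E"}

aggregates = set()
for deg in (0, 1, 2):  # total degree up to 2*degree for a degree-1 model
    for combo in combinations_with_replacement(variables, deg):
        # Categorical variables become group-by variables; taking the set
        # collapses exponents above 1, since such aggregates are equivalent.
        group_by = tuple(sorted({v for v in combo if v in categorical}))
        monomial = tuple(v for v in combo if v not in categorical)
        aggregates.add((group_by, monomial))

# ((), ())          -> SUM(1)
# ((), ("A", "C"))  -> SUM(A * C)
# (("B",), ("A",))  -> SUM(A) GROUP BY B
print(len(aggregates))  # → 19 distinct aggregates
```

The deduplication in the set mirrors why, e.g., SUM(1) GROUP BY B subsumes the monomial B² for the categorical variable B.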
The fan-in of a register entry thus denotes the amount of sharing of its aggregate: all aggregates from registers above with incoming edges to this aggregate share its computation. For instance, the aggregates with monomials AB, AC, and AD from R_A share the computation of the aggregate with monomial A from Λ_A as well as the count aggregate from R_E. Their computation uses a sequential pass over the register R_B. This improves performance and access locality, as R_B can be stored in cache and accessed to compute all these aggregates.

Aggregate Computation.
Once the aggregate registers are in place, we can ingest the input database and compute the aggregates over the join of the database relations following the factorized structure given by a variable order. The algorithm in Figure 1 does precisely this. Section 4.1 explained the factorized computation of a single aggregate over the join. We explain here the case of several aggregates organized into the aggregate registers. This is stated by the pseudocode in the red boxes. Each aggregate is uniformly stored as a map from tuples over its categorical variables to payloads that represent the sums over the projection of its monomial onto all continuous variables. If the aggregate has no categorical variables, the key is the empty tuple. For each possible A-value a, we first compute the array λ_A that consists of the projections of the monomials of the aggregates onto A. If A is categorical, then we only need to compute the 0th and 1st powers of a. If A is continuous, we need to compute all powers of a from 0 to 2·degree. If A is not a feature used in the model, then we only compute a trivial count aggregate. (Figure 3(a) shows the variable order ∆, with dep(A) = {}, dep(B) = {A}, dep(C) = {A, B}, dep(D) = {B}, and dep(E) = {A}.)
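Since every aggregate is such a map, combining the components of an aggregate amounts to a Cartesian product that concatenates group-by keys and multiplies payloads. A minimal Python sketch with made-up values (illustrative only; AC/DC implements this in C++ over its contiguous registers):

```python
def join_product(components):
    """Combine component aggregate maps into one update: concatenate the
    group-by keys and multiply the payloads (the product of the P columns)."""
    result = {(): 1.0}
    for comp in components:
        result = {key1 + key2: val1 * val2
                  for key1, val1 in result.items()
                  for key2, val2 in comp.items()}
    return result

lam_a = {(): 2.0}                     # entry of lambda_A: a power of the current a-value
agg_B = {("b1",): 3.0, ("b2",): 5.0}  # group-by aggregate from the register of B
agg_E = {(): 4.0}                     # count aggregate from the register of E

update = join_product([lam_a, agg_B, agg_E])
# → {("b1",): 24.0, ("b2",): 40.0}
```

An aggregate with no categorical variables keeps the empty tuple as its only key, so the product degenerates to plain scalar multiplication.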
Figure 3: (a) Variable order ∆ for the natural join of the relations R(A,B,C), S(B,D), and T(A,E); (b) aggregate registers for the aggregates needed to compute a linear regression model of degree 1 over ∆. Categorical variables are shown in bold.

We update the value of each aggregate α using the index structure depicted in Figure 2 as we traverse the variable order bottom-up. Assume we are at a variable A in the variable order. In case A is a leaf, the update is only a specific value in the local register λ_A. In case the variable A has children in the variable order, the aggregate is updated with the Cartesian product of all its component aggregates, i.e., one value from λ_A and one aggregate for each child of A. The update value can be expressed in SQL as follows. Assume the aggregate α has group-by variables C, which are partitioned across A and its k children. Assume also that α's components are α_0 and (α_j)_{j∈[k]}. Recall that all aggregates are maps, which we may represent as relations with columns for keys and one column P for the payload. Then, the update to α is:

SELECT C, (α_0.P * ... * α_k.P) AS P FROM α_0, ..., α_k;

Further Considerations.
The auxiliary arrays that provide the precomputed indices of aggregate components within registers speed up the computation of the aggregates. Nevertheless, they still represent one extra level of indirection, since each update to an aggregate first needs to fetch the indices and then use them to access the aggregate components in registers that may not necessarily be in the cache. We have been experimenting with an aggressive aggregate compilation approach that resolves all these indices at compile time and generates specific code for each aggregate update. In experiments with linear regression, this compilation leads to a 4× performance improvement. However, the downside is that the AC/DC code gets much larger and the C++ compiler needs much more time to compile it. For higher-degree models, it can get into situations where the C++ compiler crashes. We are currently working on a hybrid approach that partially resolves the indices while maintaining a reasonable code size.

As shown in Section 3, the gradient descent solver repeatedly computes J(θ) and ∇J(θ), which require matrix-vector and vector-vector multiplications over the quantities (Σ, c, s_Y). We discuss here the computation that involves the Σ matrix. It is possible that several entries σ_ij ∈ Σ map to the same aggregate query that is computed by AC/DC. Consider, for instance, the following scalar entries in the feature mapping vector h: h_i(x) = x_a, h_j(x) = x_b · x_c, h_k(x) = x_b, h_l(x) = x_a · x_c, h_m(x) = x_c, and h_n(x) = x_a · x_b. By definition of Σ, any pairwise product of these entries in h corresponds to one entry in Σ. The entries σ_ij, σ_lk, and σ_mn (as well as their symmetric counterparts) all map to the same aggregate SUM(A·B·C) = ∑_{(x,y)∈Q(D)} x_a · x_b · x_c. To avoid this
To avoid thisredundancy, we use a sparse representation of Σ , which assigns toeach distinct aggregate query a list of index pairs ( i , j ) that containsone pair for each entry σ ij ∈ Σ that maps to this query.AC/DC operates directly over the sparse representation of Σ .Consider the matrix vector product p = Σ д ( θ ) , and let A be theroot of the variable order ∆ . We compute p by iterating over allaggregate maps α ∈ aggregates A , and for each index pair ( i , j ) in Σ that is assigned to α , we add to the i ’s entry in p with the productof α and j ’s entry in д ( θ ) . If i (cid:44) j , we also add to j ’s entry in p withthe product of α and i ’s entry in д ( θ ) . Regularizer under FDs.
Section 3 explains how to rewrite the regularizer for a ridge linear regression model under the FD city → country. First, we need to construct the relation R(country, city), and then compute the inverse of the matrix D = (I_city + RᵀR). Note that entry (i, j) of RᵀR is 1 if the two cities city_i and city_j are in the same country; otherwise, the entry (i, j) is zero. To facilitate the construction of the matrix D, we construct R as a map that groups the tuples of R by country. Thus, we construct a mapping from each country to the set of cities in this country that occur in the dataset, which can be computed during the computation of the factorized aggregates over the variable order. We then use this representation of R to iterate over the payload sets and, for any two cities city_i, city_j in the same payload set, increment the corresponding entry (i, j) of D by one. We store D as a sparse matrix in the format used by the Eigen linear algebra library, and then use Eigen's sparse Cholesky decomposition to compute the inverse of D and ultimately the solution for the regularizer.

EXPERIMENTS
We report on the performance of learning regression models and factorization machines over a real dataset used in retail applications; cf. the extended technical report [2] for further experiments.
Systems.
We consider two variants of our system: the plain AC/DC and its extension AC/DC+FD that exploits functional dependencies. We also report on five competitors: F learns linear regression models and one-hot encodes the categorical features [22]; MADlib [13] 1.8 uses ols to compute the closed-form solution of polynomial regression models (MADlib also supports generalized linear models, but this is consistently slower than ols in our experiments and we do not report it here); R [20] 3.0.2 uses lm (linear model) based on QR decomposition [11]; libFM [21] 1.4.2 supports factorization machines; and TensorFlow [1] 1.6 uses the LinearRegressor estimator with FTRL optimization [16], which is based on the conventional SGD optimization algorithm.

The competitors come with strong limitations. MADlib inherits the limitation of at most 1600 columns per relation from its PostgreSQL host. The MADlib one-hot encoder transforms a categorical variable with n distinct values into n columns. Therefore, the number of distinct values across all categorical variables plus the number of continuous variables in the input data cannot exceed 1600. R limits the number of values in its data frames to 2^31 − 1. There exist R packages, e.g., ff, which work around this limitation by storing data structures on disk and mapping only chunks of data in main memory. The biglm package can compute the regression model by processing one ff-chunk at a time. Chunking the data, however, can lead to rank deficiencies within chunks (feature interactions missing from chunks), which causes biglm to fail. biglm fails in all our experiments due to this limitation, and we are thus unable to benchmark against it. libFM requires as input a zero-suppressed encoding of the join result. Computing this representation is an expensive intermediate step between exporting the query result from the database system and importing the data. To compute the model, we used its more stable MCMC variant with a fixed number of runs (300); its SGD implementation requires a fixed learning rate α and does not converge. AC/DC uses the adaptive learning rate from Algorithm 1 and runs until the parameters have converged with high accuracy (for FaMa, it uses 300 runs).

TensorFlow uses a user-defined iterator interface to load a batch of tuples from the training dataset at a time. This iterator defines a mapping from input tuples to (potentially one-hot encoded) features and is called directly by the learning algorithm. Learning over batches requires a random shuffling of the input data, which in TensorFlow requires loading the entire dataset into memory. This failed for our experiments, and we therefore report its performance without shuffling the input data. We benchmark TensorFlow for LR only, as it does not provide functionality to create all pairwise interaction terms for PR and FaMa; third-party implementations of these models relied on Python packages that failed to load our datasets. The optimal batch size for our experiments is 100,000 tuples: smaller batch sizes require loading too many batches, and very large batches cannot fit into memory.
Since TensorFlow requires a fixed number of iterations, we report the time to do one epoch over the dataset (i.e., computing 840 batches). This means that the algorithm learned over each input tuple once. In practice, it is often necessary to optimize over several epochs to get a good model.
Experimental Setup.
All experiments were performed on an Intel(R) Core(TM) i7-4770 3.40GHz/64bit/32GB machine with Linux 3.13.0 and g++ 4.8.4. We report wall-clock times by running each system once and then reporting the average of four subsequent runs with warm cache. We do not report the times to load the database into memory for the join, as they are orthogonal to this work. All relations are given sorted by their join attributes.
Dataset.
We experimented with a real-world dataset in the retail domain for forecasting user demands and sales. It has five relations:
Inventory (storing information about the inventory units for products (sku) in a store (locn) at a given date),
Census (storing demographics information per zipcode, such as population, median age, proportions per ethnicity, house units and how many are occupied, number of children per household, and numbers of males, females, and families),
Location (storing the zipcode for each store and distances to several other stores),
Item (storing the price and the category, subcategory, and categoryCluster for each product), and
Weather (storing weather conditions, such as mean temperature and whether it rains, snows, or thunders, for each store at different dates). The feature extraction query is the natural join of these five relations. It is acyclic and has 43 variables. We compute the join over the variable order (locn(zip(vars(Census), vars(Location)), date(sku(vars(Item)), vars(Weather)))). The following 8 variables are categorical: zip, sku, category, subcategory, categoryCluster, snow, rain, thunder. The variables locn and date are not features in our models. We design four fragments v1 to v4 of our dataset with an increasing number of categorical features. v1 is a partition of the entire dataset that is specifically tailored to work within the limitations of R. It includes all categorical variables as features except for sku and zip. v2 computes the same model as v1 but over all rows in the data (5× larger than v1). v3 extends v2 with zip; v2 and v3 are designed to work within the limitations of MADlib. v1 to v3 have no functional dependency. Finally, v4 has all variables but zip and the functional dependency sku → {category, subcategory, categoryCluster}. We learned LR, PR, and FaMa models that predict the amount of inventory units based on all other features.

Summary of findings.
Table 1 shows our findings. AC/DC+FD is the fastest system in our experiments. It needs up to 30 minutes and computes up to 46M aggregates. This is orders of magnitude faster than its competitors, whenever they do not exceed the memory limitation, the 24-hour timeout, or internal design limitations. The performance gap is due to the optimizations of AC/DC: (1) it avoids materializing the join and the export-import step between database systems and statistical packages, which take longer than computing an end-to-end LR model in AC/DC; instead, AC/DC performs the join together with the aggregates using one execution plan; (2) it factorizes the computation of the aggregates and the underlying join, which comes with a 20× compression factor; (3) it massively shares the computation of large (up to 46M for PR) sets of distinct non-zero aggregates, which makes their computation up to 16K× faster than computing them individually; (4) it decouples the computation of the aggregates on the input data from the parameter convergence step and thus avoids scanning the join result for each of the up to 400 iterations; (5) it avoids the upfront one-hot encoding, which comes with higher asymptotic complexity and prohibitively large covariance matrices, by only computing non-identical, non-zero matrix entries; for PR and our dataset v4, this leads to a 259× reduction in the number of aggregates to compute; (6) it exploits the FD in the input data to reduce the number of features of the model, which leads to a 3.5× improvement factor.

Categorical features.
As we move from v1/v2 to v4, we increase the number of categorical features by approx. 50× for LR (from 55 to 2.7K) and 65× for PR and FaMa (from 2.4K to 154K). For LR, this increase only led to a 7× decrease in performance of AC/DC and at least 9× for MADlib (we stopped MADlib after 24 hours). For PR, this yields a 13.7× performance decrease for AC/DC. This behavior remains the same for AC/DC's aggregate computation step with or without the convergence step, since the latter is dominated by the former by up to three orders of magnitude. This sub-linear behavior is partly explained by the ability of AC/DC to process many aggregates much faster in bulk than individually: it takes 34 seconds for 43 count aggregates, one per variable, but only 1,819 seconds for 37M sum aggregates! It is also explained by the same-order increase in the number of aggregates: 65× (51×) more distinct non-zero aggregates in v4 vs. v2 for LR (resp. PR and FaMa). The performance of TensorFlow is largely invariant to the increase in the number of categorical features, since its internal mapping from tuples in the training dataset to the sparse representation of the feature vector remains of similar size. Nevertheless, our system is consistently orders of magnitude faster than computing only a single epoch in TensorFlow.

Increasing database size.
A 5× increase in database size and join result from v1 to v2 leads to a similar decrease factor in performance for F and AC/DC on all models, since the number of features and aggregates stays roughly the same and the join is acyclic and processed in linear time. The performance of MADlib, TensorFlow, and libFM follows the same trend for LR and FaMa. MADlib runs out of time (24 hours) on both datasets for PR models. R cannot cope with the size increase due to internal design limitations.

One-hot encoding vs. sparse representations with group-by aggregates.
One-hot encoding categorical features leads to a large number of zero and/or redundant entries in the Σ matrix. For instance, for v4 and PR, the number of features is m = 154K, so Σ would have m(m+1)/2 ≈ 1.2 × 10^10 entries! Most of these are either zero or repeating. In contrast, AC/DC's sparse representation only considers 46M non-zero and distinct aggregates; the number of aggregates is reduced by 259×! Our competitors require the data to be one-hot encoded before learning. The static one-hot encoding took (in seconds): 28.42 for R on v1; 9.41 for F on v1 and v2; 2 for MADlib on v1 to v3; and slightly more than an hour for libFM, due to the expensive zero-suppression step. TensorFlow one-hot encodes on the fly during the learning phase, so its encoding time cannot be reported separately.

Functional dependencies.
The FD in our dataset v4 has a twofold effect on AC/DC (all other systems do not exploit FDs): it effectively reduces the number of features and aggregates, which leads to better performance of the in-database precomputation step; yet it requires a more elaborate convergence step due to the more complex regularizer. For LR, the aggregate step becomes 2.3× faster, while the convergence step takes 13× longer. Nevertheless, the convergence step takes at most 2% of the overall compute time in this case. For degree-2 models, the FD brings an improvement by a factor of 3.5× for PR and 3.87× for FaMa. This is due to a 10% decrease in the number of categorical features, which leads to a 20% decrease in the number of group-by aggregates.

Acknowledgments.
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 682588. XN is supported in part by grants NSF CAREER DMS-1351362, NSF CNS-1409303, and the Margaret and Herman Sokol Faculty Award.
REFERENCES
[1] Martín Abadi et al. 2016. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. CoRR abs/1603.04467 (2016). http://arxiv.org/abs/1603.04467
[2] Mahmoud Abo Khamis, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. 2018. AC/DC: In-Database Learning Thunderstruck. CoRR abs/1803.07480 (2018). https://arxiv.org/abs/1803.07480
[3] Mahmoud Abo Khamis, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. 2018. In-Database Learning with Sparse Tensors. In PODS.
[4] Mahmoud Abo Khamis, Hung Q. Ngo, and Atri Rudra. 2016. FAQ: Questions Asked Frequently. In PODS. 13–28.
[5] Nurzhan Bakibayev, Tomáš Kočiský, Dan Olteanu, and Jakub Závodný. 2013. Aggregation and Ordering in Factorised Databases. PVLDB 6, 14 (2013), 1990–2001.
[6] Jonathan Barzilai and Jonathan M. Borwein. 1988. Two-point step size gradient methods. IMA J. Numer. Anal. 8, 1 (1988), 141–148.
[7] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 3 (2011), 27:1–27:27.
[8] In CIDR.
[9] Rong-En Fan et al. 2008. LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn. Res. 9 (2008), 1871–1874.
[10] Xixuan Feng, Arun Kumar, Benjamin Recht, and Christopher Ré. 2012. Towards a Unified Architecture for in-RDBMS Analytics. In SIGMOD. 325–336.
[11] J. G. F. Francis. 1961. The QR transformation: A unitary analogue to the LR transformation, Part 1. Comput. J. 4, 3 (1961), 265–271.
[12] Gaël Guennebaud et al. 2010. Eigen v3. http://eigen.tuxfamily.org.
[13] Joseph M. Hellerstein et al. 2012. The MADlib Analytics Library or MAD Skills, the SQL. PVLDB 5, 12 (2012), 1700–1711.
[14] Arun Kumar, Jeffrey F. Naughton, and Jignesh M. Patel. 2015. Learning Generalized Linear Models Over Normalized Data. In SIGMOD. 1969–1984.
[15] Arun Kumar, Jeffrey F. Naughton, Jignesh M. Patel, and Xiaojin Zhu. 2016. To Join or Not to Join?: Thinking Twice about Joins before Feature Selection. In SIGMOD. 19–34.
[16] H. Brendan McMahan et al. 2013. Ad Click Prediction: A View from the Trenches. In KDD. 1222–1230.
[17] Hung Q. Ngo, Christopher Ré, and Atri Rudra. 2013. Skew Strikes Back: New Developments in the Theory of Join Algorithms. SIGMOD Rec. 42, 4 (2013), 5–16.
[18] Dan Olteanu and Maximilian Schleich. 2016. Factorized Databases. SIGMOD Rec. 45, 2 (2016), 5–16.
[19] Dan Olteanu and Jakub Závodný. 2015. Size Bounds for Factorised Representations of Query Results. ACM TODS 40, 1 (2015), 2.
[20] R Core Team. 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing.
[21] Steffen Rendle. 2012. Factorization Machines with libFM. ACM Trans. Intell. Syst. Technol. 3, 3 (2012), 57:1–57:22.
[22] Maximilian Schleich, Dan Olteanu, and Radu Ciucanu. 2016. Learning Linear Regression Models over Factorized Joins. In SIGMOD. 3–18.

                                                        v1           v2           v3           v4
Join representation (listing)                         774M       3.614G       3.614G       3.614G
  (compression factor)                                  …×           …×           …×           …×
Join computation (PSQL) for R, TensorFlow, libFM     50.63       216.56       216.56       216.56
Factorized computation of 43 counts over join         8.02        34.15        34.15        34.15

Linear regression (LR)
Features (continuous+categorical), without FDs       33+55        33+55      33+1340      33+3702
  with FDs                                        same as above (no FDs in v1 to v3)      33+3653
Aggregates (scalar+group-by), without FDs        595+2,418    595+2,421  595+111,549  595+157,735
  with FDs                                        same as above (no FDs in v1 to v3)  595+144,589
MADlib (ols)            Learn                     1,898.35     8,855.11   >86,400.00            –
R (QR)                  Export/Import               308.83            –            –            –
                        Learn                       490.13            –            –            –
TensorFlow (FTRL)       Export/Import                74.72       372.70       372.70       372.70
 (1 epoch, batch 100K)  Learn                     2,762.50    11,866.37    11,808.66    11,817.05
F                       Aggregate                    93.31       424.81          OOM          OOM
                        Converge (runs)         0.01 (359)   0.01 (359)
AC/DC                   Aggregate                    25.51       116.64       117.94       895.22
                        Converge (runs)         0.02 (343)   0.02 (367)   0.42 (337)   0.66 (365)
AC/DC+FD                Aggregate                 same as AC/DC (no FDs in v1 to v3)            …
Speedup of AC/DC+FD over:  MADlib                   74.36×           …×          >…×            ∞
                           R                        33.28×            ∞            ∞            ∞
                           TensorFlow              113.12×           …×           …×           …×
                           F                         3.65×           …×            ∞            ∞
                           AC/DC                  same as AC/DC (no FDs in v1 to v3)        2.30×

Polynomial regression (PR), degree 2
MADlib (ols)            Learn                   >86,400.00   >86,400.00   >86,400.00            –
AC/DC                   Aggregate                   131.86       512.00       820.57     7,012.84
                        Converge (runs)         2.01 (211)   2.04 (214) 208.87 (247) 115.65 (200)
AC/DC+FD                Aggregate                 same as AC/DC (no FDs in v1 to v3)            …
Speedup of AC/DC+FD over:  MADlib                     >…×          >…×          >…×             ∞
                           AC/DC                  same as AC/DC (no FDs in v1 to v3)        3.50×

Factorization machine (FaMa), degree 2, rank …
libFM (MCMC)            Learn                      … (300)  >86,400.00 (300)  >86,400.00 (300)  >86,400.00 (300)
AC/DC                   Aggregate                   128.97       498.79       772.42     6,869.47
                        Converge (runs)         3.03 (300)   3.05 (300) 262.54 (300) 166.60 (300)
AC/DC+FD                Aggregate                 same as AC/DC (no FDs in v1 to v3)            …
Speedup of AC/DC+FD over:  libFM                   152.70×     >175.51×      >86.68×      >49.53×
                           AC/DC                  same as AC/DC (no FDs in v1 to v3)        3.87×

Table 1: Time performance (seconds) for learning LR, PR, and FaMa models over increasingly larger fragments (v1 to v4) of Retailer. (–) means that the system failed to compute due to design limitations. The timeout is set to 24 hours (86,400 seconds). MADlib cannot compute any model on v4 since the one-hot encoding requires more than 1600 columns. R and MADlib do not support FaMa models. TensorFlow does not support PR and FaMa models.