Towards Expectation-Maximization by SQL in RDBMS
Kangfei Zhao, Jeffrey Xu Yu, Yu Rong, Ming Liao, Junzhou Huang
Kangfei Zhao
The Chinese University of Hong Kong
[email protected]

Jeffrey Xu Yu
The Chinese University of Hong Kong
[email protected]

Yu Rong
Tencent AI Lab
[email protected]

Ming Liao
The Chinese University of Hong Kong
[email protected]

Junzhou Huang
University of Texas at Arlington
[email protected]
ABSTRACT
Integrating machine learning techniques into RDBMSs is an important task, since there are many real applications that require modeling (e.g., business intelligence, strategic analysis) as well as querying data in RDBMSs. Without integration, it is necessary to export the data from RDBMSs, build a model using specialized machine learning toolkits and frameworks, and import the trained model back into RDBMSs for further querying. Such a process is not desirable, since it is time-consuming and must be repeated whenever the data changes. To support machine learning in RDBMSs, there are proposals that are platform-specific with limited functionalities to support certain modeling. In this paper, we provide an SQL solution that has the potential to support different machine learning modelings. As an example, we study how to support unsupervised probabilistic modeling, which has a wide range of applications in clustering, density estimation and data summarization, and focus on Expectation-Maximization (EM) algorithms, a general technique for finding maximum likelihood estimators. To train a model by EM, the model parameters are updated by an E-step and an M-step in a while-loop iteratively, until the model converges to a level controlled by some threshold or a certain number of iterations has been repeated. To support EM in RDBMSs, we give our answers to the matrix/vector representations in RDBMSs, the relational algebra operations to support the linear algebra operations required by EM, parameter updates by relational algebra, and the support of a while-loop. It is important to note that SQL'99 recursion cannot be used to handle such a while-loop, since the M-step is non-monotonic. In addition, assuming that a model has been trained by an EM algorithm, we further design an automatic in-database model maintenance mechanism to maintain the model when the underlying training data changes. We have conducted experimental studies and will report our findings in this paper.
1 INTRODUCTION

Nowadays, integrating advanced data analytical techniques into RDBMSs is an urgent requirement for data integration [19], business intelligence and strategic analysis [14, 41]. Among these techniques, machine learning models play a leading role in predictive and estimation tasks. Although many specialized machine learning toolkits and frameworks (e.g., scikit [1] and TensorFlow [5]) have been designed and developed, the approaches of building, utilizing and managing machine learning models in RDBMSs still need comprehensive exploration. First, in most enterprise applications, data is stored in a database system. It is cumbersome and time-consuming to export the data from the database system and feed it into models, as well as to import the prediction and estimation results back into the database system. Second, it is highly desirable that users can build a model in the same way as they query data in RDBMSs, and query their data by exploiting the analysis results of the trained models as a part of a query, in a seamless manner in RDBMSs. What we need is a flexible way to train/query a machine learning model together with data querying in a high-level query language (e.g., SQL). Third, the data maintained in RDBMSs is subject to change, and more data is collected frequently over time. The analysis results of models trained in a machine learning toolkit/framework may become outdated, which requires repeating the process of exporting data from RDBMSs followed by importing the trained model into RDBMSs. Given the fact that RDBMSs have techniques (e.g., triggers) to manage data updates automatically when data changes, a further consideration is how to manage machine learning model updates automatically using the available database techniques.

There are efforts to support machine learning in RDBMSs. Early in-database machine learning is developed based on UDFs or specific libraries like MADlib [28] for PostgreSQL, Oracle Data Mining [50], DB2 Intelligent Miner, etc. On one hand, these functions and libraries can achieve optimized performance. On the other hand, they are platform-specific and have limited functionalities, from the high-level syntax to the low-level implementation. It is difficult for database end-users to extend these libraries to support their own models that are not available in the libraries. To fulfill logical and physical data isolation, model-based views [18, 33] are proposed to support classification and regression analysis in database systems. Like regular views, model-based views support querying, materialization, and maintenance strategies. In brief, [18, 33] allow using an ad-hoc create view statement to declare a classification view. In this create view statement, [18] specifies the model by an as ... fit ... bases clause, and the training data is fed by an SQL query, while [33] specifies a model explicitly with a using svm clause, where the features and labels are fed by feature function and labels, respectively. Here, feature function takes database attributes as the input features, and labels are database attributes. Although these approaches provide optimized implementations for classification models, their create view statement lacks generality and deviates from the regular SQL syntax. In addition, the models supported are limited and implemented in a low-level form in the database system, which makes it difficult for ordinary database end-users to develop new models swiftly. In this work, we demonstrate that our SQL recursive query can define a model-based view in an explicit fashion and can be used to support many machine learning models. Different from [18, 33], we focus on unsupervised models for in-database clustering, density estimation and data summarization.

We take a two-step approach to support machine learning in RDBMSs. The first step is to design an in-database machine learning framework based on SQL, in particular SQL recursive queries, to show that SQL is capable of doing machine learning in RDBMSs. The second step is to further find an efficient way to support queries for machine learning in RDBMSs. In this paper, we concentrate on the first step, following our previous work on supporting graph analytics in RDBMSs using SQL recursive queries [57]. The focus of this paper is on how to train machine learning models in RDBMSs, given that the core computations of model training are linear algebra operations and parameter updating.

Consider training a machine learning model. In brief, training starts from an initial parameter setting for the model, and updates the parameters in a while-loop iteratively until the model converges to a level controlled by some threshold, or a certain number of iterations has been repeated. The trained model is the model with the parameters obtained at the end of the while-loop. To do so in RDBMSs, there are several things that need to be handled: the way to represent matrices/vectors in RDBMSs, the relational algebra operations to support the required linear algebra operations, the way to update parameters, and the support of a while-loop. In this paper, we provide our answers to these needs.

The main contributions of this work are summarized below. First, we study how to support unsupervised probabilistic modeling in RDBMSs, which has a wide range of applications in clustering, density estimation and data summarization, and we focus on Expectation-Maximization (EM) algorithms [39], a general technique for finding maximum likelihood estimators. In EM, the parameters to be trained are means, covariances, and mixing coefficients; there are two main steps in a while-loop, namely, an E-step for expectation and an M-step for maximization, and the parameters are updated in the while-loop. Second, we discuss how to represent data in RDBMSs in different ways, how to compute the E-step and M-step using relational algebra operations (e.g., natural join, group-by and aggregation), how to update parameters using relational algebra operations, and how to support the while-loop using SQL recursive queries. It is worth mentioning that the recursion for EM is a mutual recursion of the E-step and the M-step. Recall that the E-step computes the conditional posterior probability by Bayesian inference, which can be supported by SQL as a monotonic operation, whereas the M-step computes and updates the parameters of the model by a closed-form updating formula, which cannot be monotonic. This fact suggests that SQL'99 recursion cannot be used to support EM, since SQL'99 recursion (e.g., recursive with) only supports stratified negation, and therefore cannot support non-monotonic operations. We use XY-stratification [7, 55, 56], and provide an enhanced SQL recursion (i.e., with+), which can handle non-monotonic operations. We have implemented our approach as an SQL layer on top of PostgreSQL, and process our with+ using PostgreSQL. We show how to train a batch of classical statistical models [8], including the Gaussian Mixture model, the Bernoulli Mixture model, the mixture of linear regressions, the Hidden Markov model, and Mixtures of Experts, using recursive SQL queries. Third, given a model trained by an EM algorithm, we further design an automatic in-database model maintenance mechanism to maintain the model when the underlying training data changes. Inspired by the online and incremental EM algorithms [37, 42], we show how to obtain the sufficient statistics of the models to achieve incremental, and even decremental, model updating without re-building the model using all data. It is worth mentioning that our setting is different: unlike the incremental EM algorithms, which are designed to train the model during its iterative processing, we re-train the model by sufficient statistics using a portion of the data used to build the previous model, in addition to the new data. Fourth, we have conducted experimental studies and will report our findings in this paper.
Organization
Section 2 discusses related work. In Section 3, we introduce the preliminaries, including the EM algorithm and the requirements to support it in database systems. Our solution is then sketched in Section 4, and the SQL recursive query and the implementation details are introduced in Section 5. In Section 6, we design a view update mechanism, which is facilitated by triggers. We conduct extensive experimental studies in Section 7 and conclude the paper in Section 8.
2 RELATED WORK

Our study is closely related to the following research topics:
Machine Learning with SQL. There is a broad discussion on approaches to using SQL for ML, at different levels of abstraction, in the long-term research. In early years, Ordonez et al. presented pure SQL implementations of the EM algorithm in RDBMSs, including K-means [46] and the Gaussian Mixture Model [48]. Their approach, SQLEM [48], presents three strategies to implement EM in SQL: horizontal, vertical and a hybrid one. However, these implementations cannot support high-dimensional data and a large number of clusters effectively. Computations involving matrix and vector primitives are expressible in SQL with the aid of UDFs. For example, MAD [16, 28] is an in-database analytics library for matrix and vector operators. Luo et al. [38] extend SimSQL [12], a Hadoop-based relational database system, to enable linear algebra computations. Taking the functions manipulating matrix/vector data types as a set of building blocks, an SQL query can support basic machine learning tasks, e.g., least squares regression. Furthermore, UDFs and UDAs can be used to implement gradient methods [10, 24]: MAD and BISMARCK [24] use Python UDFs and UDAs, respectively, to support stochastic gradient descent. To deploy applications of stochastic models and analytics techniques, the Monte Carlo database system (MCDB) [12, 29] provides stochastic models to be directly used with the data stored in a large database. In MCDB, a UDF called a value-generating (VG) function is used to draw samples from databases, and Bayesian learning can be performed by SQL queries subsequently.
Query Optimization for Machine Learning. Using database techniques to improve the efficiency of machine learning applications is a current research focus. These techniques mainly aim at minimizing the computational redundancy incurred by the extra storage for normalized data and sparse data. [34, 49] learn linear models over multi-table normalized data, introducing the idea of learning over factorized databases. Morpheus [15], a database middleware, can automatically convert the linear algebra operators of a denormalized matrix/vector to normalized data by a set of rewriting rules. [35] utilizes tuple-oriented compression to reduce the data redundancy for mini-batch stochastic gradient descent. In addition, in ML system design, logical and physical plan optimization techniques, like plan simplification and rewriting [9, 30, 31], operator selection [9], physical operator fusion [9, 23], and delta updating rules [44], are widely used to improve the performance of the system. These optimization techniques for linear algebra and matrix calculus provide large potential to improve the performance of model-based views in different scenarios.
Declarative Language for Machine Learning. Apart from pure SQL, there have been efforts aimed at building statistical and machine learning applications with a declarative language in database and data processing systems. Microsoft Azure Data Lake Analytics provides an extension of SQL, named U-SQL [32], with tight integration of C#. SystemML [9] expresses machine learning algorithms in a simplified R- and Python-like language, and automatically translates the program into an execution plan on top of Spark. BUDS [26] is a language for Bayesian machine learning, specifically Markov chain simulation, allowing distributed computation on types of sets, maps, vectors and matrices. In addition, Datalog and its extensions [10, 22, 36] are also used to integrate statistical and machine learning into data management systems. In this paper, we focus on the SQL query, as it is the most widely-used declarative query language in databases. We show that users can build model-based views by SQL recursive queries with limited enhancement and the support of vector/matrix data types.
Model-based View in RDBMS. To provide adequate support for modeling data in database systems, the abstraction of the model-based view is proposed to persist models, in the database and data mining area. MauveDB [18] is an architecture that supports an SQL-based declarative language to define views for regression models and interpolation. Koc et al. [33] define and maintain statistical models for classification in RDBMSs, including least squares regression, logistic regression, ridge regression and SVM. Nikolic et al. [45] propose a unified incremental view maintenance approach for factorized databases, which can support linear regression models with gradient methods. Most of the existing studies focus on supervised statistical models, while little attention has been paid to supporting and maintaining unsupervised models in database systems.
3 PRELIMINARIES

In this paper, we focus on unsupervised probabilistic modeling, which has broad applications in clustering, density estimation and data summarization in the database and data mining area. Specifically, unsupervised models aim to reveal the relationship between the observed data and some latent variables by maximizing the data likelihood. The Expectation-Maximization (EM) algorithm, first introduced in [17], is a general technique for finding maximum likelihood estimators. It has a solid statistical basis, is robust to noisy data, and its complexity is linear in the data size. Here, we use the Gaussian mixture model [8], a widely used model in data mining, pattern recognition, and machine learning, as an example to illustrate the EM algorithm and our approach throughout this paper.
Suppose we have an observed dataset $X = \{x_1, x_2, \cdots, x_n\}$ of $n$ data points, where $x_i \in \mathbb{R}^d$. Given that $\mathcal{N}(x \mid \mu, \Sigma)$ is the probability density function of a Gaussian distribution with mean $\mu \in \mathbb{R}^d$ and covariance $\Sigma \in \mathbb{R}^{d \times d}$, the density of the Gaussian mixture model is a simple linear superposition of $K$ different Gaussian components in the form of Eq. (1):

$$p(x_i) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \quad (1)$$

Here, $\pi_k \in \mathbb{R}$ is the mixing coefficient, i.e., the prior of a data point belonging to component $k$, which satisfies $\sum_{k=1}^{K} \pi_k = 1$. To model this dataset $X$ using a mixture of Gaussians, the objective is to maximize the log-likelihood function in Eq. (2):

$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{i=1}^{n} \ln \Big[ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \Big] \quad (2)$$

    Algorithm 1: EM Algorithm for the Gaussian Mixture Model
    1: Initialize the means $\mu$, covariances $\Sigma$ and mixing coefficients $\pi$;
    2: Compute the initial log-likelihood $L$; $s \leftarrow 0$;
    3: while $\Delta L > \epsilon$ and $s <$ maxrecursion do
    4:   E-step: compute the responsibilities $\gamma(z_{ik})$ based on the
         current $\mu$, $\Sigma$ and $\pi$ by Eq. (3);
    5:   M-step: re-estimate $\mu$, $\Sigma$ and $\pi$ by Eq. (4)-(6);
    6:   re-compute the log-likelihood $L$; $s \leftarrow s + 1$;
    7: end while
    8: return $\mu$, $\Sigma$, $\pi$;

Algorithm 1 sketches the EM algorithm for training the Gaussian Mixture Model. First, in lines 1-2, the means $\mu_k$, covariances $\Sigma_k$ and mixing coefficients $\pi_k$ of the $K$ Gaussian distributions are initialized, and the initial value of the log-likelihood (Eq. (2)) is computed. In the while-loop of lines 3-7, the Expectation step (E-step) and the Maximization step (M-step) are executed alternately. In the E-step, we compute the responsibilities, i.e., the conditional probability that $x_i$ belongs to component $k$, denoted as $\gamma(z_{ik})$, by fixing the parameters and applying the Bayes rule in Eq. (3):

$$\gamma(z_{ik}) = \frac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} \quad (3)$$

In the M-step, we re-estimate a new set of parameters using the current responsibilities, by maximizing the log-likelihood (Eq. (2)), as follows:

$$\mu_k^{new} = \frac{1}{n_k} \sum_{i=1}^{n} \gamma(z_{ik})\, x_i \quad (4)$$

$$\Sigma_k^{new} = \frac{1}{n_k} \sum_{i=1}^{n} \gamma(z_{ik})\, (x_i - \mu_k^{new})(x_i - \mu_k^{new})^T \quad (5)$$

$$\pi_k^{new} = \frac{n_k}{n} \quad (6)$$

where $n_k = \sum_{i=1}^{n} \gamma(z_{ik})$. At the end of each iteration, the new value of the log-likelihood is evaluated and used for checking convergence. The algorithm ends when the log-likelihood converges or a given number of iterations is reached.

    Table 1: The relation representations — (a) the model relation GMM($K$, pie, mean, cov), one row per component; (b) the data relation $X$($ID$, $x$); (c) the responsibility relation $R$($ID$, $K$, $\gamma(z_{ik})$).
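As a toy illustration of Eq. (3) (the numbers here are ours, for exposition): take $K = 2$ one-dimensional components with $\pi_1 = \pi_2 = 0.5$, $\mu_1 = 0$, $\mu_2 = 4$ and unit variance. At $x_i = 1$, $\mathcal{N}(1 \mid 0, 1) \approx 0.2420$ and $\mathcal{N}(1 \mid 4, 1) \approx 0.0044$, so

$$\gamma(z_{i1}) = \frac{0.5 \times 0.2420}{0.5 \times 0.2420 + 0.5 \times 0.0044} \approx 0.982,$$

i.e., $x_i$ is assigned to the first component with high responsibility; this is exactly the per-row computation that fills Table 1(c).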
In an RDBMS, the learnt model, i.e., the parameters of the $K$ components, can be persisted in a relation of $K$ rows as shown in Table 1(a). Supposing the 1-dimensional dataset $X$ of Table 1(b) is given, the posterior probability that $x_i$ belongs to component $k$ can be computed as in Table 1(c), and clustering can be conducted by assigning $x_i$ to the component with the maximum $\gamma(z_{ik})$.

To fulfill the EM algorithm in database systems, several important issues need to be addressed: (1) the representation and storage of high-dimensional data in the database; (2) the relational algebra operations used to perform the linear algebra computation in EM; (3) the approach for iterative parameter updating; (4) the way to express and control the iterations of the EM algorithm; and (5) the mechanism to maintain the existing model when the underlying data evolves.

As an early attempt, Ordonez et al. [48] proposed an SQL implementation, SQLEM.
Their implementation is based on two strategies: horizontal and vertical. The horizontal approach organizes the data points in the horizontal representation shown in Table 2(b), where the relation has $n$ rows and $d$ attributes. The vertical approach organizes the data points in the coordinate representation shown in Table 2(a), where the relation has $nd$ rows and 3 attributes, for the identity of the data point, the index, and the value. The horizontal approach has efficient performance, while the vertical approach has flexible expressive power. Thereby, [48] further proposes a hybrid approach, which persists both the vertical and the horizontal representations simultaneously to achieve a tradeoff. Although it provides a feasible solution for the EM algorithm in RDBMSs, their approach has some drawbacks. First, the hybrid and vertical approaches only have limited flexibility in expressing linear algebra operations. For the hybrid approach, the users still need to specify the computation for each dimension $d$ in the E-step and for each $k$ in the M-step, leading to many repetitive and complicated expressions in the SQL queries. These queries also limit the supported models; for example, the covariance matrix $\Sigma$ must be diagonal. Second, it does not provide an effective parameter updating and iteration control mechanism inside the database system: the SQL queries for one iteration of the E-step and M-step are wrapped in a while-loop of a host language, e.g., Java or Python. This undermines the overall performance due to large communication and I/O costs. Furthermore, their approach does not support model updating.
4 OUR SOLUTION

In this section, we propose a complete solution that deals with the above issues in applying the EM algorithm and building model-based views inside an RDBMS.

    Table 2: The relation representations for a matrix — (a) the coordinate representation $X_c(ID, C, x)$, one row per dimension of each point; (b) the horizontal representation $X_h(ID, x_1, \cdots, x_d)$; (c) the row-major representation $X_r(ID, x)$, where $x$ is a vector value.

High Dimensional Representation. Regarding the issue of high-dimensional data, different from [48], we adopt the row-major representation, as shown in Table 2(c), which is endorsed by allowing an array/vector data type in the database. For one thing, this keeps the efficient performance of the horizontal representation by reducing I/O cost, especially for dense vectors. For another, we can use vector/matrix operations to support complicated linear algebra computation in a concise SQL query.
As most RDBMSs provide the array/vector datatype internally, apart from the built-in array functions, many extended libraries [16, 50] also provide additional statistical functions and vector/matrix operations for multivariate statistical analysis and basic linear algebra calculus. These high-level abstractions avoid letting end-users specify the arithmetic operations on each dimension of a data point, and thus serve as a set of building blocks for machine learning algorithms.

Consider computing the means $\mu$ in the M-step (Eq. (4)) with the 3 different representations $X_c$, $X_h$ and $X_r$ in Table 2. Suppose the responsibilities are in a relation $R(ID, K, r)$, where $ID$, $K$ and $r$ are the identifier of the data point, the identifier of the component, and the value of $\gamma(z_{ik})$, respectively. The relational algebra expressions that compute Eq. (4) over the three representations are shown in Eq. (7)-(9), respectively:

$$\mu_c \leftarrow \rho(K, C, \text{mean})\big({}_{K,C}\mathcal{G}_{\text{sum}(r \ast x)}(R \Join_{R.ID = X_c.ID} X_c)\big) \quad (7)$$

$$\mu_h \leftarrow \rho(K, \mu_1, \cdots, \mu_d)\big({}_{K}\mathcal{G}_{\text{sum}(r \ast x_1), \cdots, \text{sum}(r \ast x_d)}(R \Join_{R.ID = X_h.ID} X_h)\big) \quad (8)$$

$$\mu_r \leftarrow \rho(K, \text{mean})\big({}_{K}\mathcal{G}_{\text{sum}(r \,\cdot\, x)}(R \Join_{R.ID = X_r.ID} X_r)\big) \quad (9)$$

We elaborate on these expressions. In Eq. (7) and Eq. (8), the binary operator $\ast$ is the arithmetic multiplication, while in Eq. (9), the operator $\cdot$ denotes a scalar-vector multiplication. First, all of these 3 representations need to join $R$ and $X$ on the $ID$ attribute to compute $\gamma(z_{ik})\, x_i$. The differences lie in the group-by and aggregation for each component $k$. For the coordinate representation $X_c$ in Eq. (7), apart from $K$, we also need to group by the index $C$. For the horizontal representation $X_h$ in Eq. (8), we need to define the computation of each dimension $d$ in the aggregation expression. As the dimension increases, Eq. (7) faces the problem of high I/O cost, while Eq. (8) leads to a verbose query. Consider the row-major representation, which nests the separate dimension attributes into one vector-type attribute. By introducing the $\cdot$ operator for vector computation, Eq. (4) is expressed in an efficient and clear way (Eq. (9)).

Relational Algebra to Linear Algebra. On the basis of the array/vector data type and the derived statistical functions and linear algebra operations, complicated linear algebra computation can be expressed by the basic relational algebra operations (selection ($\sigma$), projection ($\Pi$), union ($\cup$), Cartesian product ($\times$), and rename ($\rho$)), together with group-by & aggregation. Let $V$ and $E$ ($E'$) be the relation representations of a vector and a matrix, such that $V(ID, v)$ and $E(F, T, e)$. Here, $ID$ is the tuple identifier in $V$, and $F$ and $T$ stand for the two indices of a matrix. [57] introduces two new operations to support the multiplication between a matrix and a vector (Eq. (10)) and between two matrices (Eq. (11)) in their relation representation:

$$E \;{\oplus(\otimes)}{\Join}_{T = ID}\; V = {}_{E.F}\mathcal{G}_{\oplus(\otimes)}(E \Join_{T = ID} V) \quad (10)$$

$$E \;{\oplus(\otimes)}{\Join}_{E.T = E'.F}\; E' = {}_{E.F,\, E'.T}\mathcal{G}_{\oplus(\otimes)}(E \Join_{E.T = E'.F} E') \quad (11)$$

The matrix-vector multiplication (Eq. (10)) consists of two steps. The first step is computing $v \otimes e$ between a tuple in $E$ and a tuple in $V$ under the join condition $E.T = V.ID$. The second step is aggregating all the $\otimes$ results by the operation $\oplus$ for every group, grouping by the attribute $E.F$.
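In SQL, Eq. (10) instantiated with $(\oplus, \otimes) = (\text{sum}, \ast)$ is a single join-aggregate query. The following is a minimal sketch over the relation representations $E(F, T, e)$ and $V(ID, v)$ introduced above; it illustrates the operator, and is not the system's generated code.

    -- Matrix-vector multiplication (Eq. (10)) with (sum, *):
    -- join on the inner index (E.T = V.ID), then aggregate per row index F.
    select E.F, sum(E.e * V.v) as v
    from E, V
    where E.T = V.ID
    group by E.F;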
Similarly, the matrix-matrix multiplication (Eq. (11)) is done in two steps. The first step computes $e \otimes e'$ between a tuple in $E$ and a tuple in $E'$ under the join condition $E.T = E'.F$. The second step aggregates all the $\otimes$ results by the operation $\oplus$ for every group, grouping by the attributes $E.F$ and $E'.T$. The formula for re-estimating the means $\mu$ (Eq. (4)) is a matrix-vector multiplication if the data is 1-dimensional, or a matrix-matrix multiplication otherwise. When high-dimensional data is in the coordinate representation (Table 2(a)), Eq. (7) is an application of Eq. (11). When high-dimensional data is nested in the row-major representation (Table 2(c)), the matrix-matrix multiplication is reduced to a matrix-vector multiplication, as shown in Eq. (9).

Re-estimating the covariance/standard deviation $\Sigma$ (Eq. (5)) involves an element-wise matrix multiplication if the data is 1-dimensional, or a tensor-matrix multiplication otherwise. The element-wise matrix multiplication can be expressed by joining the two matrices on their two indices to compute $E.e \otimes E'.e$. An extra aggregation is required to aggregate over each component $k$, as shown in Eq. (12):

$$E \;{\oplus(\otimes)}{\Join}_{E.F = E'.F,\, E.T = E'.T}\; E' = {}_{E.F}\mathcal{G}_{\oplus(\otimes)}(E \Join_{E.F = E'.F,\, E.T = E'.T} E') \quad (12)$$

Similarly, when $\otimes$ and $\oplus$ are a vector operation and a vector aggregation, Eq. (12) is reduced to a high-dimensional tensor-matrix multiplication.

Value Updating. So far, we still need to deal with the parameter update when training the model over multiple iterations. There is a new relational operation, union by update, denoted as $\uplus$ and first defined in [57] (Eq. (13)), to address value updates in iterative graph computation:

$$R \uplus_A S = (R - (R \ltimes_{R.A = S.A} S)) \cup S \quad (13)$$

Suppose $t_r$ is a tuple in $R$ and $t_s$ is a tuple in $S$. Different from the conventional union ($\cup$), union by update updates $t_r$ by $t_s$ if $t_r$ and $t_s$ are identical on some attributes $A$. If $t_s$ does not match any $t_r$, $t_s$ is merged into the resulting relation. Given the relation of parameters shown in Table 1(a), union by update replaces the set of old parameters by the set of new parameters when they have the identical component identifier $K$.
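Although union by update is not a standard SQL operator, its single-step effect in Eq. (13) can be emulated with standard operators. A minimal sketch, assuming relations GMM_old and GMM_new (names ours, for illustration) hold the old and the new parameter tuples keyed by K:

    -- Emulating GMM_old U+_K GMM_new = (GMM_old - (GMM_old <| GMM_new)) U GMM_new:
    -- a new tuple replaces the old tuple that shares its key K.
    select * from GMM_new
    union all
    select * from GMM_old
    where GMM_old.K not in (select K from GMM_new);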
Iterative Evaluation. In the following, we elaborate on our approach to supporting iterative model training by SQL recursive queries. Over decades, RDBMSs have provided the functionality to support recursive queries, based on SQL'99 [25, 40]. Recursive queries are expressed using the with clause in SQL. We introduce the with clause following the discussion given in [27]:

    with R as
      <R initialization>
      <recursive querying involving R>

Here, the recursive with clause defines a temporary recursive relation R in the initialization step, and then queries by referring to the recursive relation R iteratively in the recursive step, until R can no longer be changed. As an example, the edge transitive closure can be computed using with over the edge relation E(F, T), where F and T stand for "From" and "To".

    1. with
    2. TC(F, T) as (
    3.   (select F, T from E)
    4.   union all
    5.   (select TC.F, E.T from TC, E where TC.T = E.F))

    Figure 1: The recursive with statement

As shown in Fig. 1, the recursive relation is named TC. Initially, the recursive relation TC is defined by projecting the two attributes, F and T, from the relation E (line 3). Then, the query in every iteration unions the TC computed so far with a relation of the two attributes TC.F and E.T, obtained by joining the two relations TC and E over the join condition TC.T = E.F (line 5).
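For reference, Fig. 1 corresponds to the query below in PostgreSQL's concrete dialect. Note that with union all the recursion only terminates when the edge relation is acyclic; union additionally removes duplicates and therefore reaches a fixpoint on cyclic graphs as well.

    -- Fig. 1 in PostgreSQL's WITH RECURSIVE dialect.
    with recursive TC(F, T) as (
        select F, T from E
      union
        select TC.F, E.T from TC, E where TC.T = E.F
    )
    select * from TC;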
However, SQL'99 defines the recursion in a limited form. The with clause restricts the recursion to a stratified program, where non-monotonic operations, e.g., union by update, are not allowed. To support iterative model updates, we extend the existing looping structure following the "algebra + while" of [6]:

    initialize R;
    while (R changes) { ... ; R <- ... }

In brief, in the looping, R may be changed by the relational algebra in the body of the looping, and the looping terminates when R becomes stable. As discussed in [6], there are two semantics for "algebra + while", namely, noninflationary and inflationary. The assignment $R \leftarrow \mathcal{E}$ assigns to relation $R$ the result of evaluating the relational algebra expression $\mathcal{E}$. For the inflationary semantics, the assignment needs to be cumulative, like the conventional union ($\cup$). For the noninflationary semantics, the assignment can be destructive, i.e., the new value overwrites the old value, like union by update ($\uplus$). [57] proves that, under the "algebra + while" framework, $\uplus$ leads to a fixpoint in the enhanced recursive SQL queries by XY-stratification. The vector/matrix data type, as a nested attribute, does not violate this property in this paper. Meanwhile, the Datalog program of Eq. (12), and the other relational algebra expressions that consist of natural join, group-by and aggregation, can be proved to be XY-stratified in a similar way as Eq. (10) and Eq. (11) in [57].

To this end, the general syntax of the enhanced recursive with is sketched in Fig. 2:

    with R as
      select ... from R, T1, ... computed by ...   (Q1)
      union by update
      select ... from R, T2, ... computed by ...   (Q2)

    Figure 2: The general form of the enhanced recursive with

In the main body of the enhanced with, apart from union all, we also allow union by update to union the results of the initial query Q1 and the recursive query Q2. As discussed in [57], when union by update is used, it cannot be used more than once, and it cannot be used together with union all. Here, the computed by statement, as a new feature of the enhanced with, allows users to specify how a relation $T_{i,j}$ is computed by a sequence of queries. The queries wrapped in computed by must be non-recursive. In this paper, we mainly use union by update in the recursive query for parameter updates, instead of union all. In the following sections, we elaborate on the recursive query for EM and on model updating in detail.

5 THE SQL RECURSIVE QUERY

We show the details of supporting the model-based view by the "algebra + while" approach, using SQL recursive queries. First, we present the relational algebra expressions needed, followed by the enhanced recursive query and our implementation. Second, the queries for model inference are introduced.
For simplicity, here we consider the training data point $x_i$ to be a 1-dimensional scalar. It is natural to extend the query to high-dimensional input data when the matrix/vector data type and functions are supported by the database system. We represent the input data by a relation $X(ID, x)$, where $ID$ is the tuple identifier for data point $x_i$ and $x$ is the numeric value. The model-based view is persisted in the relation GMM($K$, pie, mean, cov), where $K$ is the identifier of the $k$-th component, and pie, mean and cov denote the corresponding parameters, i.e., the mixing coefficients, the means, and the covariances (standard deviations), respectively. The relation representations are shown in Table 1. The following relational algebra expressions describe the E-step (Eq. (14)), the M-step (Eq. (15)-(18)), and the parameter update (Eq. (19)) in one iteration:

$$R \leftarrow \rho(ID, K, r)\, \Pi_{ID,\, K,\, f}(GMM \times X) \quad (14)$$

$$P \leftarrow \rho(K, \text{pie})\big({}_{K}\mathcal{G}_{\text{sum}(r)}(R \Join_{R.ID = X.ID} X)\big) \quad (15)$$

$$M \leftarrow \rho(K, \text{mean})\big({}_{K}\mathcal{G}_{\text{sum}(r \ast x)/\text{sum}(r)}(R \Join_{R.ID = X.ID} X)\big) \quad (16)$$

$$T \leftarrow \Pi_{ID,\, K,\, \text{pow}(x - \text{mean})}(M \times X) \quad (17)$$

$$C \leftarrow \rho(K, \text{cov})\,{}_{K}\mathcal{G}_{\text{sum}(r \ast t)}(T \Join_{T.ID = R.ID,\, T.K = R.K} R) \quad (18)$$

$$GMM \leftarrow \rho(K, \text{pie}, \text{mean}, \text{cov})(P \Join_{P.K = M.K} M \Join_{M.K = C.K} C) \quad (19)$$

First, in Eq. (14), by performing a Cartesian product of GMM and $X$, each data point is associated with the parameters of each component. The responsibilities are evaluated by applying an analytical function $f$ that computes the normalized probability density (Eq. (3)) for each tuple, which is the E-step. The resulting relation $R(ID, K, r)$ is shown in Table 1(c). For the M-step, the mixing coefficients pie (Eq. (15)), the means mean (Eq. (16)) and the covariances cov (Eq. (17)-(18)) are re-estimated based on their update formulas in Eq. (4)-(6), respectively. In the end, in Eq. (19), the temporary relations $P$, $M$ and $C$ are joined on attribute $K$ to merge the parameters, and the result is assigned to the recursive relation GMM.

Fig. 3 shows the enhanced with query that supports the Gaussian Mixture Model by the EM algorithm.

    1.  with
    2.  GMM(K, pie, mean, cov) as (
    3.    (select K, pie, mean, cov from INIT_PARA)
    4.    union by update K
    5.    (select M.K, pie/n, mean, sqrt(cov/pie)
    6.     from M, C where M.K = C.K)
    7.    computed by
    8.      R(ID, K, r) as select ID, K, norm(x, mean, cov) * pie /
    9.        (sum(norm(x, mean, cov) * pie) over (partition by ID))
    10.       from GMM, X
    11.     M(K, pie, mean) as select K, sum(r), sum(r * x) / sum(r)
    12.       from R, X where R.ID = X.ID group by K
    13.     C(K, cov) as select T.K, sum(r * T.val)
    14.       from
    15.       (select ID, K, pow(x - mean) as val from M, X) as T, R
    16.       where T.ID = R.ID and T.K = R.K
    17.       group by T.K )
    18. maxrecursion
    19. select * from GMM

    Figure 3: The enhanced recursive SQL for Gaussian Mixtures

The recursive relation GMM specifies the parameters of the $K$ Gaussian distributions. In line 3, the initial query loads the initial parameters from the relation INIT_PARA. The new parameters are selected by the recursive query (lines 5-6), evaluated with the computed by statement, and update the recursive relation by union by update w.r.t. the component index $K$. It wraps the queries that compute the E-step and M-step of one EM iteration.

We elaborate on the queries in the computed by statement (lines 8-17). Specifically, the query in lines 8-10 performs the E-step, as the relational algebra in Eq. (14). Here, norm is the Gaussian (Normal) probability density function of data point $x$, given the mean and covariance as input. We can use the window function, introduced in SQL'03, to compute the responsibility by the Bayes rule in Eq. (3). In line 9, sum() over (partition by ...) is the window function, performing a calculation across a set of rows that are related to the current row.
As it does not group rows, and each row retains its separate identity, many RDBMSs allow it in a recursive query, e.g., PostgreSQL and Oracle. The window function partitions the rows of the Cartesian product result into partitions of the same $ID$ and computes the denominator of Eq. (3). In lines 11-13, the query computes the means (Eq. (4)) and the mixing coefficients together, by a matrix-matrix multiplication, due to their common join of $R$ and $X$. Then, lines 14-17 compute the covariances of Eq. (5). First, we compute the square of $x_i - \mu_k$ for each $x_i$ and $k$, which requires a Cartesian product of $M$ and $X$ (Eq. (17)). Second, the value is weighted by the responsibility and aggregated as specified in Eq. (18). The new parameters in the temporary relations $M$ and $C$ are merged by joining on the component index $K$ in line 6.

An acute reader may find that in Fig. 3, the recursive query does not compute and check the convergence of the log-likelihood explicitly. That is because the existing recursive query does not support checking value convergence or early stopping. However, the depth of the recursion can be controlled by the maxrecursion clause, which is adapted from SQL Server [3]. The maxrecursion clause can effectively prevent the infinite recursion caused by an infinite fixpoint, e.g., "with R(n) as ((select values(0)) union all (select n+1 from R))", a standard SQL'99 recursion. Users can check the convergence after training for a fixed number of recursions and resume the training from the current parameters if necessary.

The implementation: We sketch how to support recursive queries using the enhanced with in RDBMSs. First, for each subquery $Q_i$ used in $Q$, including those defined by the computed by statement, we construct a local dependency graph $G_i$. The graph $G_i$ constructed must be cycle-free, and we ensure that it is XY-stratified. Second, we create a PSM (Persistent Stored Modules) procedure following the recent SQL standard. With PSM, we create a unique procedure $F_Q$ for the recursive query $Q$ to be processed, as illustrated below.

    create procedure F_Q (
      declare C_1, ..., C_i, ...;
      create table T_{i,j} for all tables defined by as in a subquery Q_i;
      create SQL statements to compute the initial R by the union of
        all initial subqueries;
      loop
        insert into T_{i,j} select ... for every T_{i,j} used in Q_i;
        compute the condition C_i for each recursive subquery Q_i;
        if all C_i for the recursive subqueries are false then exit;
        compute the recursive relation R for the current iteration;
        union the current R with the previous R computed;
      end loop )

In the procedure $F_Q$, first we declare variables $C_1, \cdots, C_i, \cdots$ for every subquery $Q_i$, which are used to check the condition to exit the loop. Second, we create the temporary tables for the relations defined by as in the computed by statements. Third, we include SQL statements to compute the initial recursive relation $R$. Fourth, we create a loop. In the loop, we generate an insert for $T_{i,j}$, and check whether $Q_i$ is empty. If so, $C_i$ is set to 0, indicating that $Q_i$ generates 0 tuples, and the loop will be terminated. Otherwise, the recursive relation computed in this iteration is unioned with the one computed in the previous iteration, by either union all or union by update. An extra counter is maintained in the loop if maxrecursion is used. With the procedure defined, we can run the statements in the procedure $F_Q$ by issuing "call $F_Q$".
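To make the translation concrete, the sketch below shows roughly what one loop iteration of $F_Q$ executes for the query of Fig. 3, unrolled as plain SQL. It is an illustration under the schemas of Fig. 3: norm is the Gaussian-density UDF assumed there, the covariance step is elided, and the statements actually generated may differ.

    -- One EM iteration of Fig. 3, unrolled (a sketch).
    insert into R                          -- E-step, Eq. (3)/(14)
    select X.ID, GMM.K,
           pie * norm(x, mean, cov) /
           sum(pie * norm(x, mean, cov)) over (partition by X.ID)
    from GMM, X;

    insert into M                          -- Eq. (15)-(16)
    select R.K, sum(r), sum(r * x) / sum(r)
    from R, X where R.ID = X.ID group by R.K;

    -- ... C is filled analogously (Eq. (17)-(18)); the union by update
    -- on K then amounts to an in-place update of the recursive relation:
    update GMM
    set pie = M.pie / (select count(*) from X), mean = M.mean
    from M where GMM.K = M.K;              -- cov is updated likewise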
Once the model is trained by the recursive query in Fig. 3, it can be materialized in a view for online inference. In the inference phase, users can query the view by SQL to perform clustering, classification and density estimation. Given a batch of data in a relation $X$ and a view GMM computed by Fig. 3, the query below computes the posterior probability that component $K$ generated the data point with index $ID$. The query is similar to the one computing the E-step (Eq. (3)) in lines 8-10 of Fig. 3.

    create table R as
    select ID, K, norm(x, mean, cov) * pie /
           (sum(norm(x, mean, cov) * pie) over (partition by ID)) as r
    from GMM, X

Based on the relation $R(ID, K, r)$ above, we can further assign the data to $K$ clusters, where $x_i$ is assigned to cluster $k$ if the posterior probability $\gamma(z_{ik})$ is the maximum among $\{\gamma(z_{i1}), \cdots, \gamma(z_{iK})\}$. The query below creates a relation CLU($ID$, $K$) to persist the clustering result, where $ID$ and $K$ are the attributes of a data point and its assigned cluster, respectively. It first finds the maximum $\gamma(z_{ik})$ for each data point by a subquery on relation $R$. The result is renamed as $S$ and is joined with $R$ on the condition $R.ID = S.ID$ and $R.r = S.r$ to find the corresponding $K$.

    create table CLU as
    select R.ID, R.K
    from R, (select ID, max(r) as r from R group by ID) as S
    where R.ID = S.ID and R.r = S.r
It is worth noting that both of the queries above access the data exactly once. Thereby, it is possible to perform the inference on-the-fly and only for the data of interest. Besides density estimation and clustering, result evaluation, e.g., computing the purity, the normalized mutual information (NMI) and the Rand Index, can be conducted in the database by SQL queries.
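For instance, given ground-truth labels in a relation GT(ID, label) — an assumption for illustration — the clustering purity of the result in CLU can be evaluated by a single aggregation query:

    -- Purity: sum each cluster's majority-label count, divide by n.
    select sum(majority)::float8 / (select count(*) from CLU) as purity
    from (select K, max(cnt) as majority
          from (select CLU.K, GT.label, count(*) as cnt
                from CLU, GT where CLU.ID = GT.ID
                group by CLU.K, GT.label) as per_label
          group by K) as per_cluster;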
    Figure 4: Overview of Model Maintenance — the insert triggers route the inserted data and the selected original data into the model update, which maintains the sufficient statistics and the model.

    1. create trigger T1 before insert on X
    2.   for each statement
    3.   execute procedure DATA_SELECTION
    4. create trigger T2 before insert on X
    5.   for each row
    6.   execute procedure DATA_INSERTION
    7. create trigger T3 after insert on X
    8.   for each statement
    9.   execute procedure MODEL_UPDATE

    Figure 5: The triggers for incremental update
6 MODEL MAINTENANCE

In this section, we investigate automatic model/view updating. When the underlying data $X$ changes, a straightforward way is to re-estimate the model over the updated data. However, when only a small portion of the training data is updated, the change to the corresponding model is slight, and it is inefficient to re-estimate the model on-the-fly. Hence, a natural idea arises: whether we can update the existing model by exploring an "incremental variant" of the EM algorithm, maintained with the newly arriving data and a small portion of the data extracted from the original dataset. As the statistical models trained by the SQL queries can be represented by their sufficient statistics, the model is updated by maintaining the model and the sufficient statistics.

A sufficient statistic is a function of the data $X$ that contains all of the information relevant to estimating the model parameters. The sufficient statistics of the delta data are computed to update the statistics of the model. As the model is updated, the statistics of the data are also updated, following the change of the posterior probability $\gamma(z_{ik})$. This process repeats until the statistics converge. It is worth mentioning that this maintenance mechanism can support all the models that have sufficient statistics. We elaborate on the sufficient statistics updating rules in the following.

Suppose the training dataset of model $\theta$ is $\{x_1, x_2, \cdots, x_n\}$. Let $s$ be the sufficient statistics of $\theta$. Based on the Factorization Theorem [21], we can obtain

$$s = \sum_{i=1}^{n} \sum_{z} p(z \mid x_i, \theta)\, \phi(x_i, z) \quad (20)$$

where $z$ is the unobserved variable, and $\phi$ denotes the mapping function from an instance $(x_i, z)$ to the sufficient statistics contributed by $x_i$. Suppose the inserted data is $\{x_{n+1}, x_{n+2}, \cdots, x_m\}$. Let the new model for the overall data $\{x_1, \cdots, x_n, x_{n+1}, \cdots, x_m\}$ be $\tilde{\theta}$ and the corresponding sufficient statistics be $\tilde{s}$. The difference $\tilde{s} - s$, denoted as $\Delta s$, is

$$\Delta s = \sum_{i=1}^{m} \sum_{z} p(z \mid x_i, \tilde{\theta})\, \phi(x_i, z) - \sum_{i=1}^{n} \sum_{z} p(z \mid x_i, \theta)\, \phi(x_i, z)$$
$$\phantom{\Delta s} = \sum_{i=1}^{n} \sum_{z} [\, p(z \mid x_i, \tilde{\theta}) - p(z \mid x_i, \theta) \,]\, \phi(x_i, z) \quad (21)$$
$$\phantom{\Delta s} + \sum_{i=n+1}^{m} \sum_{z} p(z \mid x_i, \tilde{\theta})\, \phi(x_i, z) \quad (22)$$

According to the above equations, we observe that the delta of the sufficient statistics, $\Delta s$, consists of two parts: (1) the change of the sufficient statistics for the overall data points $\{x_1, x_2, \cdots, x_n\}$ in Eq. (21), and (2) the additional sufficient statistics for the newly inserted data points $\{x_{n+1}, \cdots, x_m\}$ in Eq. (22). Consider re-training a new model $\tilde{\theta}$ over $\{x_1, x_2, \cdots, x_m\}$ in $T$ iterations, taking $\theta$ as the initial parameters, i.e., $\theta^{(0)} = \theta$ and $\theta^{(T)} = \tilde{\theta}$. We have

$$\Delta s = \sum_{i=1}^{m} \sum_{z} [\, p(z \mid x_i, \theta^{(T)}) - p(z \mid x_i, \theta^{(0)}) \,]\, \phi(x_i, z) \quad (23)$$
$$\phantom{\Delta s} + \sum_{i=n+1}^{m} \sum_{z} p(z \mid x_i, \theta^{(0)})\, \phi(x_i, z) \quad (24)$$
$$\phantom{\Delta s} = \sum_{t=1}^{T} \sum_{i=1}^{m} \sum_{z} [\, p(z \mid x_i, \theta^{(t)}) - p(z \mid x_i, \theta^{(t-1)}) \,]\, \phi(x_i, z) \quad (25)$$
$$\phantom{\Delta s} + \sum_{i=n+1}^{m} \sum_{z} p(z \mid x_i, \theta^{(0)})\, \phi(x_i, z) \quad (26)$$

The above equations indicate how to compute $\Delta s$. For the inserted data $\{x_{n+1}, \cdots, x_m\}$, the delta can be directly computed by evaluating the original model $\theta^{(0)}$, as in Eq. (26), while for the original data, the delta can be computed by updating the model $\theta^{(t)}$ iteratively using all the data $\{x_1, x_2, \cdots, x_m\}$, as in Eq. (25).
Since most of the computational cost is concentrated in the iterations of Eq. (25), we use two tricks to approximate the computation. The first is a stochastic approximation, where the parameters are updated after the sufficient statistics of each new data point $x_i$ are computed, instead of after a full pass over the batch. This approach is widely used in many online and incremental EM algorithm variants [37, 43, 51]. The second is discarding the data points that are unlikely to change their cluster in the future, as scaling clustering algorithms do for speedup [11]. In other words, due to the slight changes of the data and the model, only a small portion of the original data points, those with unstable membership, needs to be retrained.

We discuss our strategy of selecting partial original data in $\{x_1, x_2, \cdots, x_n\}$ for the model update. It is a tradeoff between the accuracy of the model and the updating cost: the more data we use to update the model, the more accurate a model we can achieve, but the larger the computation and I/O costs we spend. The data points that are unlikely to change cluster membership can be discarded in future model updates. There are two strategies for the discarding, a distance-based and a density-based strategy. For the distance-based strategy, we use the Mahalanobis distance [20] to measure the distance between a data point and a distribution. For each data point $x_i$, we compute the Mahalanobis distance $D_k(x_i)$ to the $k$-th component with mean $\mu_k$ and covariance $\Sigma_k$:

$$D_k(x_i) = \sqrt{(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)} \quad (27)$$

We can filter the data within a given threshold radius of any component. Another measurement is the entropy of the posterior probability of data point $x_i$, as in Eq. (28), where $\gamma(z_{ik})$ is evaluated by the parameters $\theta^{(0)}$. The larger the entropy, the lower the possibility of assigning $x_i$ to any one of the components:

$$E(x_i) = -\sum_{k=1}^{K} \gamma(z_{ik}) \ln \gamma(z_{ik}) \quad (28)$$

It is worth mentioning that the data selection in trigger T1 can be performed offline, i.e., persisting a subset of the training data with a fixed budget size for future model updates. In addition, the sufficient statistics of the original model $\theta$ (line 1 of Algorithm 2) can be precomputed. These will improve the efficiency of online model maintenance significantly. Alternative fine-grained data selection strategies are beyond the scope of this paper.

Similarly, considering deleting $l$ data points $\{x_{n-l+1}, \cdots, x_n\}$ from $\{x_1, x_2, \cdots, x_n\}$, the difference of the sufficient statistics, $\Delta s$, is

$$\Delta s = \sum_{t=1}^{T} \sum_{i=1}^{n-l} \sum_{z} [\, p(z \mid x_i, \theta^{(t)}) - p(z \mid x_i, \theta^{(t-1)}) \,]\, \phi(x_i, z) - \sum_{i=n-l+1}^{n} \sum_{z} p(z \mid x_i, \theta^{(0)})\, \phi(x_i, z) \quad (29)$$

In RDBMSs, the automatic model updating mechanism is enabled by triggers built on the relation of the input data $X$. Fig. 4 illustrates the overview of our model/view updating mechanism. There are three triggers built on the relation of the training data $X$, whose definitions are shown in Fig. 5. Before the insertion is executed, two triggers, T1 (lines 1-3 in Fig. 5) and T2 (lines 4-6 in Fig. 5), prepare the data for model updating in a temporary relation $X'$. Here, T1 selects a subset of the original data $\{x_1, x_2, \cdots, x_n\}$ based on a selection criterion, and T2 inserts all the newly arriving data $\{x_{n+1}, \cdots, x_m\}$ into relation $X'$. After the data preparation finishes, another trigger, T3 (lines 7-9 in Fig. 5), calls a PSM to compute $\Delta s$ from $X'$. In the PSM, first, the delta of the newly inserted data (Eq. (26)) is computed and used to reinitialize the parameters of the model. Then, $T$ iterations of scanning relation $X'$ are performed, where in each iteration, $X'$ is randomly shuffled and each data point is used to instantly update both the sufficient statistics it contributes and the model. The actions of these triggers are transparent to the database users. Finally, we use the Gaussian Mixture model as an example to illustrate this procedure.

    Algorithm 2: MODEL_UPDATE
    1:  Initialize the original sufficient statistics $s$ by Eq. (30);
    2:  $s_k \leftarrow s_k + \sum_{i=n+1}^{m} \gamma(z_{ik})\, x_i$;
    3:  $S_k \leftarrow S_k + \sum_{i=n+1}^{m} \gamma(z_{ik})\, x_i x_i^T$;
    4:  Update the model parameters $\mu$, $\Sigma$ and $\pi$ by Eq. (31);
    5:  for $t \leftarrow 1$ to $T$ do
    6:    for $x_i \in X'$ in random order do
    7:      $s_k \leftarrow s_k + \gamma(z_{ik})^{(t)} x_i - \gamma(z_{ik})^{(t-1)} x_i$;
    8:      $S_k \leftarrow S_k + \gamma(z_{ik})^{(t)} x_i x_i^T - \gamma(z_{ik})^{(t-1)} x_i x_i^T$;
    9:      Update the model parameters $\mu$, $\Sigma$ and $\pi$ by Eq. (31);
    10:   end for
    11: end for
    12: return $\mu$, $\Sigma$, $\pi$;
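For the 1-dimensional case, the distance of Eq. (27) used by the selection in T1 is itself a single SQL expression. A minimal sketch under the schemas of Fig. 3, where cov stores the standard deviation; T1's selection predicate can then threshold this distance:

    -- Mahalanobis distance (Eq. (27)) of every point to every component,
    -- 1-dimensional case: D_k(x_i) = |x_i - mu_k| / sigma_k.
    select X.ID, GMM.K, abs(X.x - GMM.mean) / GMM.cov as D
    from X, GMM;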
Example 6.1. For the Gaussian Mixture model of $K$ components, the minimal sufficient statistics are $s = (s_1, S_1, \cdots, s_K, S_K)$, where $s_k \in \mathbb{R}^d$ and $S_k \in \mathbb{R}^{d \times d}$ for each $k \in \{1, \cdots, K\}$, as below:

$$s_k = \sum_{i=1}^{n} \gamma(z_{ik})\, x_i, \qquad S_k = \sum_{i=1}^{n} \gamma(z_{ik})\, x_i x_i^T \quad (30)$$

And the parameters can be computed from the sufficient statistics as shown in Eq. (31), where $n_k = \sum_{i=1}^{n} \gamma(z_{ik})$:

$$\mu_k = \frac{s_k}{n_k}, \qquad \pi_k = \frac{n_k}{n}, \qquad \Sigma_k = \frac{S_k}{n_k} - \mu_k \mu_k^T \quad (31)$$

When an insert command on relation $X$ is issued, at first, the triggers T1 and T2 prepare the temporary relation $X'$. T3 is triggered following the insertion command. The procedure T3 executes is illustrated in Algorithm 2. In lines 2-3, T3 first adds the sufficient statistics of the inserted data into $s$. Then it further updates $s$ by performing $T$ iterations over $X'$; in each iteration, each data point $x_i$ is used to instantly update $s$ as well as the model. Here, $\gamma(z_{ik})^{(t)}$ is the responsibility of $x_i$ (Eq. (3)) evaluated in the $t$-th iteration. □
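Under the same schemas, the sufficient statistics of Eq. (30) for the 1-dimensional case are one aggregation query away; a sketch (the result relation SS is our name, for illustration):

    -- Sufficient statistics of Eq. (30), 1-dimensional case.
    create table SS as
    select R.K,
           sum(R.r * X.x)       as s1,   -- s_k = sum_i gamma(z_ik) * x_i
           sum(R.r * X.x * X.x) as s2,   -- S_k = sum_i gamma(z_ik) * x_i^2
           sum(R.r)             as n_k   -- n_k, used by Eq. (31)
    from R, X where R.ID = X.ID
    group by R.K;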
7 EXPERIMENTAL STUDIES

In this section, we present our experimental studies of supporting model-based view training, inference, and maintenance in an RDBMS. We conduct extensive experiments to investigate the following facets:

• compare the performance of our enhanced with against looping controlled by a host language;
• test the scalability of the recursive queries for different models on synthetic data;
• conduct a case study on market segmentation over retail data;
• validate the efficiency of our model maintenance mechanism.

Experimental Setup: We report our performance studies on a PC with an Intel(R) Xeon(R) CPU E5-2697 v3 (2.60GHz) and 96GB RAM, running Linux CentOS 7.5 64-bit. We tested the enhanced recursive query on a PostgreSQL instance.

    Figure 6: with+ vs. Psycopg2 — training time of the Gaussian Mixture Model, (a) varying d, (b) varying n, (c) varying k, (d) varying the number of iterations.

with+ vs. Psycopg2.
We compare the enhanced with, which translates the recursive SQL query into SQL/PSM, against the implementation that uses a host language to control the looping, as adopted in the previous EM implementation [48]. We implement the latter with Psycopg2 [2], a popular PostgreSQL adapter for the Python language. Regarding the EM algorithm, the E-step, the M-step, and the parameter updating are wrapped in a Python for-loop and executed by a cursor alternately. We compare the running times of these two implementations, i.e., enhanced with and Psycopg2, for training the Gaussian Mixture Model, varying the dimension d of the data points (Fig. 6(a)), the scale of the training data n (Fig. 6(b)), the number of components k (Fig. 6(c)) and the number of iterations (Fig. 6(d)). The training data is evenly generated from 10 Gaussian distributions.

The evaluated time is the pure query execution time, where the costs of database connection, data loading and parameter initialization are excluded. The experiments show that the enhanced with outperforms Psycopg2 significantly, not only over multiple iterations in Fig. 6(d), but also per iteration in Fig. 6(a)-6(c). For one thing, the implementation on Psycopg2 calls the database multiple times per iteration, incurring much client-server communication and context-switch cost. For another, the queries issued from client to server are parsed, optimized and planned on-the-fly. These are the general problems of calling SQL queries from any host language. Meanwhile, we implemented the hybrid strategy of SQLEM [47] on PostgreSQL. For the Gaussian Mixture model, one iteration for 10,000 data points with 10 dimensions fails to terminate within 1 hour. In their implementation, 2k separate SQL queries evaluate the means and variances of the k components, respectively, which is a performance bottleneck.

Scalability Test.
We train the Gaussian Mixture model (GMM) [8], the mixture of linear regressions (MLR) [52], and a neural network model, the mixture of experts (MOE) [54], by evaluating SQL recursive queries in PostgreSQL.

    Figure 7: Scalability Test — training time per iteration of GMM, MLR and MOE, (a) k = 8, n = 10000, varying d; (b) k = 8, d = 20, varying n; (c) d = 20, n = 10000, varying k.

    Figure 8: Convergence — log-likelihood over 15 training iterations for (a) GMM, (b) MLR, (c) MOE.

Given an observed dataset $\{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, MLR models the density of $y$ given $x$ as

$$p(y_i \mid x_i) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(y_i \mid x_i^T \beta_k, \sigma_k) \quad (32)$$

and MOE models the density of $y$ given $x$ as

$$p(y_i \mid x_i) = \sum_{k=1}^{K} g_k(x_i)\, \mathcal{N}(y_i \mid x_i^T \beta_k, \sigma_k) \quad (33)$$

where $\beta_k \in \mathbb{R}^d$ are the parameters of a linear transformation and $\mathcal{N}$ is the probability density function of a Gaussian with mean $x_i^T \beta_k \in \mathbb{R}$ and standard deviation $\sigma_k \in \mathbb{R}$. In Eq. (33), $g_k(x)$ is called the gating function, given by computing the softmax in Eq. (34), where $w_k \in \mathbb{R}^d$ is a set of linear weights on $x_i$:

$$g_k(x_i) = \frac{e^{x_i^T w_k}}{\sum_{j=1}^{K} e^{x_i^T w_j}} \quad (34)$$

The intuition behind the gating functions is a set of "soft" learnable weights that determine the mixture of the $K$ local models. We adopt the single-loop EM algorithm [53] to estimate the parameters of MOE, which uses least squares regression to compute the gating network directly. For GMM, the training data is evenly drawn from 10 Gaussian distributions. For MLR and MOE, the training data is generated from 10 linear functions with Gaussian noise. The parameters of the Gaussians and of the linear functions are drawn from a uniform distribution, and so are the initial parameters.

Fig. 7 displays the training time per iteration of the 3 models, varying the data dimension d (Fig. 7(a)), the scale of the training data n (Fig. 7(b)) and the number of clusters k (Fig. 7(c)). In general, for the 3 models, the training time grows linearly as n and k increase, while an increase in the data dimension d has a more remarkable impact on the training time. When increasing n and k, the sizes of the intermediate relations, e.g., the relation R for computing the responsibilities in Eq. (14), grow linearly; therefore the training cost grows linearly with regard to n and k. However, in the 3 models, we need to deal with $d \times d$ matrices in the M-step. For GMM, it needs to compute the probability density of the multivariate Gaussians and re-estimate the covariance matrices. For MLR and MOE, they need to compute matrix inversions and least squares regressions. The training cost grows with regard to the size of these matrices. The comparison shows that it is still hard to scale high-dimensional analysis in a traditional database system. However, the efficiency can be improved on parallel/distributed platforms and new hardware.

Case Study: Market Segmentation.
We study building model-based view in
PostgreSQL for a real appli-cation, i.e., market segmentation, which partitions the consumersinto sub-groups based on their features to analyze their purchase be-havior and identify potential market. The data is collected from anonline retailer, containing 541,908 transactions of 4,308 consumers.Based on the RFM model [13] for consumer value analysis, for eachconsumers, a three dimensional feature of real value including re-cency, frequency, monetary is extracted. The feature is normalizedby corresponding means and variances.We apply the
We apply GMM (Eq. (1)), MLR (Eq. (32)) and MOE (Eq. (33)) over the 4,308 consumers, and the clustering result is visualized in Fig. 9. For MLR and MOE, recency and frequency serve as the features $x$ and monetary as $y$; these two models assume a linear relationship exists between the recency, frequency and monetary value of a consumer. All initial parameters are drawn from a uniform distribution. In Fig. 9, the consumers are segmented into 4 clusters, shown in different colors. The segmentation of GMM (Fig. 9(a)) fits the data points by Gaussian distributions over the 3 variables, while MLR (Fig. 9(b)) and MOE (Fig. 9(c)) fit them by linear planes. The neural model MOE tends to generate sharper cluster boundaries than MLR. Fig. 8 shows the convergence of the log-likelihood of the 3 models over 15 training iterations; in fact, convergence is fast and is achieved within 5 iterations.

Figure 9: Case Study: Market Segmentation. Consumers plotted by recency, frequency and monetary: (a) GMM; (b) MLR; (c) MOE.

Figure 10: Insert maintenance. Insertion time (s) vs. budget size: (a) n = 100K; (b) n = 1M; (c) n = 10M.
Finally, we test the performance of our trigger-based model updating mechanism; a schematic sketch of the trigger wiring is given at the end of this subsection. First, we train GMM on 1-dimensional data generated from 2 Gaussian distributions. The original models are trained over 100K, 1M and 10M data points, respectively, with 15 iterations. The overall training time is recorded as the "batch" mode training time, which is 54s, 501s and 4,841s, respectively. After the model is trained and persisted, we insert 10, 20, 30, 40 and 50 data points into the underlying data, varying the budget size of the selected data from 0 to 1,000.

Fig. 10 shows the insertion time w.r.t. the budget size of the selected data for the 3 models. The insertion time is the elapsed time from issuing the insert command to the transaction commit, including the cost of selecting data with the density-based strategy and computing the initial sufficient statistics (line 1 of Algorithm 2). As the number of processed tuples increases, the insertion time grows linearly. Compared to the retraining cost, i.e., the batch training time, updating the existing model is not always more efficient. The choice depends on two factors: the size of the overall data, and the budget size plus the insert size, i.e., the number of data points to be processed during the update. The updating mechanism may not be efficient and effective when the overall data size is small or there is a large volume of insertions. This is because, in batch training mode, parameter estimation dominates the cost, whereas in model updating, the sufficient statistics and the model are updated while processing each data point, so this per-point work becomes the main overhead. Meanwhile, we notice that the elapsed time of selecting data and computing the initial sufficient statistics is about 10s, 100s and 1,000s for data sizes of 100K, 1M and 10M, respectively. Precomputing and persisting these results would benefit larger datasets.

In this paper, we focus on testing the efficiency of the approximation for model updating. The convergence and approximation guarantees involve a wide range of research topics in statistics and machine learning, which we leave for future investigation.
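To make the wiring of the mechanism concrete, a trigger-based maintenance hook can be sketched as below. The function em_incremental_update and the points table are hypothetical stand-ins for the logic of Algorithm 2, not the paper's actual code, and EXECUTE FUNCTION assumes PostgreSQL 11 or later.

-- Sketch only: trigger wiring for incremental model maintenance.
CREATE FUNCTION em_on_insert() RETURNS trigger AS $$
BEGIN
  -- fold the inserted point into the persisted sufficient statistics
  -- and re-estimate the parameters of the materialized model view
  PERFORM em_incremental_update(NEW.id);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER model_maintain
AFTER INSERT ON points
FOR EACH ROW EXECUTE FUNCTION em_on_insert();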
CONCLUSION

Integrating machine learning techniques into database systems facilitates a wide range of applications in industry and academia. In this paper, we focus on supporting the EM algorithm in an RDBMS. Different from previous approaches, our approach wraps the E-step and M-step in an enhanced SQL recursive query, which is guaranteed to reach an iterative fixpoint. The learned model can be materialized as a database view and queried like a conventional view. Furthermore, to handle slight changes in the underlying training data, we propose an automatic view updating mechanism that exploits the incremental variant of the EM algorithm. The extensive experiments we conducted validate that our enhanced recursive query significantly outperforms the previous approach and supports multiple mixture models trained by the EM algorithm, and they also confirm the efficiency of the incremental model update. It is worth mentioning that the SQL recursive query is not only suitable for unsupervised learning like the EM algorithm, but also has the potential to support supervised learning such as classification and regression. In addition, the implementation of the query can be migrated to parallel and distributed platforms, e.g., Hadoop and Spark, to deploy large-scale machine learning applications. These directions, as well as convergence and performance guarantees, deserve future exploration.
ACKNOWLEDGEMENT
This work is supported by the Research Grants Council of Hong Kong, China under No. 14203618, No. 14202919 and No. 14205520.
REFERENCES
Proc. of OSDI'16, pages 265-283, 2016.
[6] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[7] F. Arni, K. Ong, S. Tsur, H. Wang, and C. Zaniolo. The deductive database system LDL++. TPLP, 3(1), 2003.
[8] C. M. Bishop. Pattern Recognition and Machine Learning, 5th edition. Information Science and Statistics. Springer, 2007.
[9] M. Boehm, M. Dusenberry, D. Eriksson, A. V. Evfimievski, F. M. Manshadi, N. Pansare, B. Reinwald, F. Reiss, P. Sen, A. Surve, and S. Tatikonda. SystemML: Declarative machine learning on Spark. PVLDB, 9(13):1425-1436, 2016.
[10] V. R. Borkar, Y. Bu, M. J. Carey, J. Rosen, N. Polyzotis, T. Condie, M. Weimer, and R. Ramakrishnan. Declarative systems for large-scale machine learning. IEEE Data Eng. Bull., 35(2):24-32, 2012.
[11] P. S. Bradley, U. M. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proc. of KDD'98, pages 9-15, 1998.
[12] Z. Cai, Z. Vagena, L. L. Perez, S. Arumugam, P. J. Haas, and C. M. Jermaine. Simulation of database-valued Markov chains using SimSQL. In Proc. of SIGMOD'13, pages 637-648, 2013.
[13] D. Chen, S. L. Sain, and K. Guo. Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing & Customer Strategy Management, 19(3):197-208, 2012.
[14] H. Chen, R. H. L. Chiang, and V. C. Storey. Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36(4):1165-1188, 2012.
[15] L. Chen, A. Kumar, J. F. Naughton, and J. M. Patel. Towards linear algebra over normalized data. PVLDB, 10(11):1214-1225, 2017.
[16] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD skills: New analysis practices for big data. PVLDB, 2(2):1481-1492, 2009.
[17] A. P. Dempster. Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 39:1-38, 1977.
[18] A. Deshpande and S. Madden. MauveDB: Supporting model-based user views in database systems. In Proc. of SIGMOD'06, pages 73-84, 2006.
[19] X. Dong, A. Y. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In Proc. of SIGMOD'05, pages 85-96, 2005.
[20] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. A Wiley-Interscience publication. Wiley, 1973.
[21] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, New York, 2nd edition, 2001.
[22] J. Eisner. Dyna: A non-probabilistic programming language for probabilistic AI. Extended abstract for talk at the NeurIPS*2008 Workshop on Probabilistic Programming, Dec. 2008.
[23] T. Elgamal, S. Luo, M. Boehm, A. V. Evfimievski, S. Tatikonda, B. Reinwald, and P. Sen. SPOOF: Sum-product optimization and operator fusion for large-scale machine learning. In Proc. of CIDR'17, 2017.
[24] X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a unified architecture for in-RDBMS analytics. In Proc. of SIGMOD'12, pages 325-336, 2012.
[25] S. J. Finkelstein, N. Mattos, I. Mumick, and H. Pirahesh. Expressing recursive queries in SQL. ISO-IEC JTC1/SC21 WG3 DBL MCI, (X3H2-96-075), 1996.
[26] Z. J. Gao, S. Luo, L. L. Perez, and C. Jermaine. The BUDS language for distributed Bayesian machine learning. In Proc. of SIGMOD'17, pages 961-976, 2017.
[27] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database Systems: The Complete Book. Prentice Hall, 2002.
[28] J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The MADlib analytics library or MAD skills, the SQL. PVLDB, 5(12):1700-1711, 2012.
[29] R. Jampani, F. Xu, M. Wu, L. L. Perez, C. Jermaine, and P. J. Haas. The Monte Carlo database system: Stochastic analysis close to the data. ACM Trans. Database Syst., 36(3):18:1-18:41, 2011.
[30] D. Kernert, F. Köhler, and W. Lehner. SpMachO - optimizing sparse linear algebra expressions with probabilistic density estimation. In Proc. of EDBT'15, pages 289-300, 2015.
[31] M. A. Khamis, H. Q. Ngo, and A. Rudra. FAQ: Questions asked frequently. In Proc. of PODS'16, pages 13-28, 2016.
[32] S. Klein. U-SQL. In IoT Solutions in Microsoft's Azure IoT Suite, pages 173-190. Springer, 2017.
[33] M. L. Koc and C. Ré. Incrementally maintaining classification using an RDBMS. PVLDB, 4(5):302-313, 2011.
[34] A. Kumar, J. F. Naughton, and J. M. Patel. Learning generalized linear models over normalized data. In Proc. of SIGMOD'15, pages 1969-1984, 2015.
[35] F. Li, L. Chen, Y. Zeng, A. Kumar, X. Wu, J. F. Naughton, and J. M. Patel. Tuple-oriented compression for large-scale mini-batch stochastic gradient descent. In Proc. of SIGMOD'19, pages 1517-1534, 2019.
[36] X. Li, B. Cui, Y. Chen, W. Wu, and C. Zhang. MLog: Towards declarative in-database machine learning. PVLDB, 10(12):1933-1936, 2017.
[37] P. Liang and D. Klein. Online EM for unsupervised models. In Proc. of NAACL'09, pages 611-619, 2009.
[38] S. Luo, Z. J. Gao, M. N. Gubanov, L. L. Perez, and C. M. Jermaine. Scalable linear algebra on a relational database system. In Proc. of ICDE'17, pages 523-534, 2017.
[39] G. McLachlan and T. Krishnan. The EM Algorithm and Extensions, volume 382. John Wiley & Sons, 2007.
[40] J. Melton and A. R. Simon. SQL:1999: Understanding Relational Language Components. Morgan Kaufmann, 2001.
[41] A. B. M. Moniruzzaman and S. A. Hossain. NoSQL database: New era of databases for big data analytics - classification, characteristics and comparison. CoRR, abs/1307.0191, 2013.
[42] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355-368. Springer, 1998.
[43] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355-368. 1998.
[44] M. Nikolic, M. Elseidy, and C. Koch. LINVIEW: Incremental view maintenance for complex analytical queries. In Proc. of SIGMOD'14, pages 253-264, 2014.
[45] M. Nikolic and D. Olteanu. Incremental view maintenance with triple lock factorization benefits. In Proc. of SIGMOD'18, pages 365-380, 2018.
[46] C. Ordonez. Integrating k-means clustering with a relational DBMS using SQL. IEEE TKDE, 18(2):188-201, 2006.
[47] C. Ordonez. Optimization of linear recursive queries in SQL. IEEE Trans. Knowl. Data Eng., 22(2), 2010.
[48] C. Ordonez and P. Cereghini. SQLEM: Fast clustering in SQL using the EM algorithm. In Proc. of SIGMOD, pages 559-570, 2000.
[49] M. Schleich, D. Olteanu, and R. Ciucanu. Learning linear regression models over factorized joins. In Proc. of SIGMOD'16, pages 3-18, 2016.
[50] P. Tamayo, C. Berger, M. M. Campos, J. Yarmus, B. L. Milenova, A. Mozes, M. Taft, M. F. Hornick, R. Krishnan, S. Thomas, M. Kelly, D. Mukhin, R. Haberstroh, S. Stephens, and J. Myczkowsji. Oracle data mining - data mining in the database environment. In The Data Mining and Knowledge Discovery Handbook, pages 1315-1329. 2005.
[51] D. M. Titterington. Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society: Series B (Methodological), 46(2):257-267, 1984.
[52] K. Viele and B. Tong. Modeling with mixtures of linear regressions. Statistics and Computing, 12(4):315-330, 2002.
[53] Y. Yang and J. Ma. A single loop EM algorithm for the mixture of experts architecture. In Advances in Neural Networks - ISNN 2009, 6th International Symposium on Neural Networks, Proceedings, Part II, pages 959-968, 2009.
[54] S. E. Yuksel, J. N. Wilson, and P. D. Gader. Twenty years of mixture of experts. IEEE Trans. Neural Netw. Learning Syst., 23(8):1177-1193, 2012.
[55] C. Zaniolo, N. Arni, and K. Ong. Negation and aggregates in recursive rules: The LDL++ approach. In Proc. of DOOD, 1993.
[56] C. Zaniolo, S. Ceri, C. Faloutsos, R. T. Snodgrass, V. S. Subrahmanian, and R. Zicari. Advanced Database Systems. Morgan Kaufmann, 1997.
[57] K. Zhao and J. X. Yu. All-in-one: Graph processing in RDBMSs revisited. In