Towards Expectation-Maximization by SQL in RDBMS
Kangfei Zhao, Jeffrey Xu Yu, Yu Rong, Ming Liao, Junzhou Huang
Kangfei Zhao
The Chinese University of Hong Kong
[email protected]

Jeffrey Xu Yu
The Chinese University of Hong Kong
[email protected]

Yu Rong
Tencent AI Lab
[email protected]

Ming Liao
The Chinese University of Hong Kong
[email protected]

Junzhou Huang
University of Texas at Arlington
[email protected]
ABSTRACT
Integrating machine learning techniques into RDBMSs is an important task, since there are many real applications that require modeling (e.g., business intelligence, strategic analysis) as well as querying data in RDBMSs. Without integration, it is necessary to export the data from RDBMSs, build a model using specialized machine learning toolkits and frameworks, and import the trained model back into RDBMSs for further querying. Such a process is not desirable, since it is time-consuming and must be repeated whenever the data changes. To support machine learning in RDBMSs, there are proposals that are platform-specific with limited functionalities to support certain modeling. In this paper, we provide an SQL solution that has the potential to support different machine learning modelings. As an example, we study how to support unsupervised probabilistic modeling, which has a wide range of applications in clustering, density estimation and data summarization, and focus on Expectation-Maximization (EM) algorithms, a general technique for finding maximum likelihood estimators. To train a model by EM, the model parameters are updated by an E-step and an M-step in a while-loop iteratively, until the model converges to a level controlled by some threshold or a certain number of iterations has been repeated. To support EM in RDBMSs, we give our answers to the matrix/vector representations in RDBMSs, the relational algebra operations to support the linear algebra operations required by EM, parameter updates by relational algebra, and the support of a while-loop. It is important to note that SQL'99 recursion cannot be used to handle such a while-loop, since the M-step is non-monotonic. In addition, assuming that a model has been trained by an EM algorithm, we further design an automatic in-database model maintenance mechanism to maintain the model when the underlying training data changes. We have conducted experimental studies and will report our findings in this paper.
1 INTRODUCTION

Nowadays, integrating advanced data analytical techniques into RDBMSs is an urgent requirement for data integration [19], business intelligence and strategic analysis [14, 41]. Among these techniques, machine learning models play a leading role in predictive and estimation tasks. Although many specialized machine learning toolkits and frameworks (e.g., scikit [1] and TensorFlow [5]) have been designed and developed, the approaches of building, utilizing and managing machine learning models in RDBMSs still need comprehensive exploration. First, in most enterprise applications, data is stored in a database system. It is cumbersome and time-consuming to export the data from the database system and feed it into models, as well as to import the prediction and estimation results back into the database system. Second, it is highly desirable that users can build a model in the same way as they query data in RDBMSs, and query their data by exploiting the analysis results of the trained models as a part of a query, in a seamless manner in RDBMSs. What we need is a flexible way to train/query a machine learning model together with data querying in a high-level query language (e.g., SQL). Third, the data maintained in RDBMSs is subject to change, and more data is collected frequently over time. The analysis results of models trained in a machine learning toolkit/framework may become outdated, which requires repeating the process of exporting data from RDBMSs followed by importing the trained model into RDBMSs. Given the fact that RDBMSs have techniques (e.g., triggers) to manage data updates automatically when data changes, a further consideration is how to manage machine learning model updates automatically using the available database techniques.

There are efforts to support machine learning in RDBMSs. Early in-database machine learning is developed based on UDFs or specific libraries like MADlib [28] for PostgreSQL, Oracle Data Mining [50], DB2 Intelligent Miner, etc. On one hand, these functions and libraries can achieve optimized performance. On the other hand, they are platform-specific and have limited functionalities, from the high-level syntax to the low-level implementation. It is difficult for database end-users to extend these libraries to support their own models that are not available in the libraries. To fulfill logical and physical data isolation, model-based views [18, 33] are proposed to support classification and regression analysis in database systems. Like regular views, model-based views support querying, materialization, and maintenance strategies. In brief, [18, 33] allow using an ad-hoc create view statement to declare a classification view. In this create view statement, [18] specifies the model by an as ... fit ... bases clause, and the training data is fed by an SQL query, while [33] specifies a model explicitly with a using svm clause, where the features and labels are fed by feature function and labels, respectively. Here, feature function takes database attributes as the input features, and labels are database attributes. Although these approaches provide optimized implementations for classification models, their create view statement lacks generality and deviates from the regular SQL syntax. In addition, the models supported are limited and implemented in a low-level form in the database system, which makes it difficult for ordinary database end-users to develop new models swiftly. In this work, we demonstrate that our SQL recursive query can define a model-based view in an explicit fashion and can be used to support many machine learning models. Different from [18, 33], we focus on unsupervised models for in-database clustering, density estimation and data summarization.

We take a two-step approach to support machine learning in RDBMSs. The first step is to design an in-database machine learning framework based on SQL, in particular SQL recursive queries, to show that SQL is capable of doing machine learning in RDBMSs. The second step is to further find an efficient way to support queries for machine learning in RDBMSs. In this paper, we concentrate on the first step, following our previous work on supporting graph analytics in RDBMSs using SQL recursive queries [57]. The focus of this paper is on how to train machine learning models in RDBMSs, given that the core computations of model training are linear algebra operations and parameter updating.

Consider training a machine learning model. In brief, training starts from an initial parameter setting for the model, and updates the parameters in a while-loop iteratively until the model converges to a level controlled by some threshold, or a certain number of iterations has been repeated. The trained model is the model with the parameters obtained at the end of the while-loop. To do so in RDBMSs, there are several things that need to be handled: the way to represent matrices/vectors in RDBMSs, the relational algebra operations to support the required linear algebra operations, the way to update parameters, and the support of a while-loop. In this paper, we provide our answers to these needs.

The main contributions of this work are summarized below. First, we study how to support unsupervised probabilistic modeling in RDBMSs, which has a wide range of applications in clustering, density estimation and data summarization, and we focus on Expectation-Maximization (EM) algorithms [39], a general technique for finding maximum likelihood estimators. In EM, the parameters to be trained are means, covariances, and mixing coefficients; there are two main steps in a while-loop, namely, an E-step for expectation and an M-step for maximization, and the parameters are updated in the while-loop. Second, we discuss how to represent data in RDBMSs in different ways, how to compute the E-step and M-step using relational algebra operations (e.g., natural join, group-by and aggregation), how to update parameters using relational algebra operations, and how to support the while-loop using SQL recursive queries. It is worth mentioning that the recursion for EM is a mutual recursion of the E-step and the M-step. Recall that the E-step computes the conditional posterior probability by Bayesian inference, which can be supported by SQL as a monotonic operation, whereas the M-step computes and updates the parameters of the model by a closed-form updating formula, which cannot be monotonic. This fact suggests that SQL'99 recursion cannot be used to support EM, since SQL'99 recursion (e.g., recursive with) only supports stratified negation, and therefore cannot support non-monotonic operations. We use XY-stratification [7, 55, 56], and provide an enhanced SQL recursion (i.e., with+), which can handle non-monotonic operations. We have implemented our approach as an SQL layer on top of PostgreSQL, and process our with+ using PostgreSQL. We show how to train a batch of classical statistical models [8], including the Gaussian Mixture model, the Bernoulli Mixture model, the mixture of linear regressions, the Hidden Markov model, and Mixtures of Experts, using recursive SQL queries. Third, given a model trained by an EM algorithm, we further design an automatic in-database model maintenance mechanism to maintain the model when the underlying training data changes. Inspired by the online and incremental EM algorithms [37, 42], we show how to obtain the sufficient statistics of the models to achieve incremental, and even decremental, model updating without re-building the model using all data. It is worth mentioning that our setting is different: unlike the incremental EM algorithms, which are designed to train the model during its iterative processing, we re-train the model by sufficient statistics using a portion of the data used to build the previous model, in addition to the new data. Fourth, we have conducted experimental studies and will report our findings in this paper.
Organization
Section 2 discusses related work. In Section 3, we introduce the preliminaries, including the EM algorithm and the requirements to support it in database systems. Our solution is then sketched in Section 4, and the SQL recursive query and the implementation details are introduced in Section 5. In Section 6, we design a view update mechanism, which is facilitated by triggers. We conduct extensive experimental studies in Section 7 and conclude the paper in Section 8.
2 RELATED WORK

Our study is closely related to the following research topics:
Machine Learning with SQL. There is a broad discussion on approaches to using SQL for ML, at different levels of abstraction, in the long-term research. In early years, Ordonez et al. presented pure SQL implementations of the EM algorithm in RDBMSs, including K-means [46] and the Gaussian Mixture Model [48]. Their approach, SQLEM [48], presents three strategies to implement EM in SQL: horizontal, vertical and a hybrid one. However, these implementations cannot support high-dimensional data and a large number of clusters effectively. Computations involving matrix and vector primitives are expressible in SQL with the aid of UDFs. For example, MAD [16, 28] is an in-database analytics library for matrix and vector operators. Luo et al. [38] extend SimSQL [12], a Hadoop-based relational database system, to enable linear algebra computations. Taking the functions manipulating matrix/vector data types as a set of building blocks, an SQL query can support basic machine learning tasks, e.g., least squares regression. Furthermore, UDFs and UDAs can be used to implement gradient methods [10, 24]: MAD and BISMARCK [24] use Python UDFs and UDAs, respectively, to support stochastic gradient descent. To deploy applications of stochastic models and analytics techniques, the Monte Carlo database system (MCDB) [12, 29] provides stochastic models to be directly used with the data stored in a large database. In MCDB, a UDF called a value-generating (VG) function is used to draw samples from databases, and Bayesian learning can be performed by SQL queries subsequently.
Query Optimization for Machine Learning. Using database techniques to improve the efficiency of machine learning applications is a current research focus. These techniques mainly aim at minimizing the computational redundancy incurred by the extra storage for normalized data and sparse data. [34, 49] learn linear models over multi-table normalized data, introducing the idea of learning over factorized databases. Morpheus [15], a database middleware, can automatically convert the linear algebra operators of a denormalized matrix/vector to normalized data by a set of rewriting rules. [35] utilizes tuple-oriented compression to reduce the data redundancy for mini-batch stochastic gradient descent. In addition, in ML system design, logical and physical plan optimization techniques, like plan simplification and rewriting [9, 30, 31], operator selection [9], physical operator fusion [9, 23], and delta updating rules [44], are widely used to improve the performance of the system. These optimization techniques for linear algebra and matrix calculus provide large potential to improve the performance of model-based views in different scenarios.
Declarative Language for Machine Learning. Apart from pure SQL, there have been efforts aimed at building statistical and machine learning applications with a declarative language in database and data processing systems. Microsoft Azure Data Lake Analytics provides an extension of SQL, named U-SQL [32], with tight integration of C#. SystemML [9] expresses machine learning algorithms in a simplified R- and Python-like language, and automatically translates the program into an execution plan on top of Spark. BUDS [26] is a language for Bayesian machine learning, specifically Markov chain simulation, allowing distributed computation on types of sets, maps, vectors and matrices. In addition, Datalog and its extensions [10, 22, 36] are also used to integrate statistical and machine learning into data management systems. In this paper, we focus on the SQL query, as it is the most widely-used declarative query language in databases. We show that users can build model-based views by SQL recursive queries with limited enhancement and the support of vector/matrix data types.
Model-based View in RDBMS. To provide adequate support for modeling data in database systems, the abstraction of the model-based view is proposed to persist models, in the database and data mining area. MauveDB [18] is an architecture that supports an SQL-based declarative language to define views for regression models and interpolation. Koc et al. [33] define and maintain statistical models for classification in RDBMSs, including least squares regression, logistic regression, ridge regression and SVM. Nikolic et al. [45] propose a unified incremental view maintenance approach for factorized databases, which can support linear regression models with gradient methods. Most of the existing studies focus on supervised statistical models, while little attention has been paid to supporting and maintaining unsupervised models in database systems.
3 PRELIMINARIES

In this paper, we focus on unsupervised probabilistic modeling, which has broad applications in clustering, density estimation and data summarization in the database and data mining area. Specifically, unsupervised models aim to reveal the relationship between the observed data and some latent variables by maximizing the data likelihood. The Expectation-Maximization (EM) algorithm, first introduced in [17], is a general technique for finding maximum likelihood estimators. It has a solid statistical basis, is robust to noisy data, and its complexity is linear in the data size. Here, we use the Gaussian mixture model [8], a widely used model in data mining, pattern recognition, and machine learning, as an example to illustrate the EM algorithm and our approach throughout this paper.
Suppose we have an observed dataset $X = \{x_1, x_2, \cdots, x_n\}$ of $n$ data points, where $x_i \in \mathbb{R}^d$. Given that $\mathcal{N}(x \mid \mu, \Sigma)$ is the probability density function of a Gaussian distribution with mean $\mu \in \mathbb{R}^d$ and covariance $\Sigma \in \mathbb{R}^{d \times d}$, the density of the Gaussian mixture model is a simple linear superposition of $K$ different Gaussian components in the form of Eq. (1):

$$p(x_i) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \quad (1)$$

Here, $\pi_k \in \mathbb{R}$ is the mixing coefficient, i.e., the prior of a data point belonging to component $k$, which satisfies $\sum_{k=1}^{K} \pi_k = 1$. To model this dataset $X$ using a mixture of Gaussians, the objective is to maximize the log-likelihood function in Eq. (2):

$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{i=1}^{n} \ln \Big[ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \Big] \quad (2)$$

    Algorithm 1: EM Algorithm for the Gaussian Mixture Model
    1: Initialize the means $\mu$, covariances $\Sigma$ and mixing coefficients $\pi$;
    2: Compute the initial log-likelihood $L$; $s \leftarrow 0$;
    3: while $\Delta L > \epsilon$ and $s <$ maxrecursion do
    4:   E-step: compute the responsibilities $\gamma(z_{ik})$ based on the
         current $\mu$, $\Sigma$ and $\pi$ by Eq. (3);
    5:   M-step: re-estimate $\mu$, $\Sigma$ and $\pi$ by Eq. (4)-(6);
    6:   re-compute the log-likelihood $L$; $s \leftarrow s + 1$;
    7: end while
    8: return $\mu$, $\Sigma$, $\pi$;

Algorithm 1 sketches the EM algorithm for training the Gaussian Mixture Model. First, in lines 1-2, the means $\mu_k$, covariances $\Sigma_k$ and mixing coefficients $\pi_k$ of the $K$ Gaussian distributions are initialized, and the initial value of the log-likelihood (Eq. (2)) is computed. In the while-loop of lines 3-7, the Expectation step (E-step) and the Maximization step (M-step) are executed alternately. In the E-step, we compute the responsibilities, i.e., the conditional probability that $x_i$ belongs to component $k$, denoted as $\gamma(z_{ik})$, by fixing the parameters and applying the Bayes rule in Eq. (3):

$$\gamma(z_{ik}) = \frac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} \quad (3)$$

In the M-step, we re-estimate a new set of parameters using the current responsibilities, by maximizing the log-likelihood (Eq. (2)), as follows:

$$\mu_k^{new} = \frac{1}{n_k} \sum_{i=1}^{n} \gamma(z_{ik})\, x_i \quad (4)$$

$$\Sigma_k^{new} = \frac{1}{n_k} \sum_{i=1}^{n} \gamma(z_{ik})\, (x_i - \mu_k^{new})(x_i - \mu_k^{new})^T \quad (5)$$

$$\pi_k^{new} = \frac{n_k}{n} \quad (6)$$

where $n_k = \sum_{i=1}^{n} \gamma(z_{ik})$. At the end of each iteration, the new value of the log-likelihood is evaluated and used for checking convergence. The algorithm ends when the log-likelihood converges or a given number of iterations is reached.

    Table 1: The relation representations — (a) the model relation GMM($K$, pie, mean, cov), one row per component; (b) the data relation $X$($ID$, $x$); (c) the responsibility relation $R$($ID$, $K$, $\gamma(z_{ik})$).
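As a toy illustration of Eq. (3) (the numbers here are ours, for exposition): take $K = 2$ one-dimensional components with $\pi_1 = \pi_2 = 0.5$, $\mu_1 = 0$, $\mu_2 = 4$ and unit variance. At $x_i = 1$, $\mathcal{N}(1 \mid 0, 1) \approx 0.2420$ and $\mathcal{N}(1 \mid 4, 1) \approx 0.0044$, so

$$\gamma(z_{i1}) = \frac{0.5 \times 0.2420}{0.5 \times 0.2420 + 0.5 \times 0.0044} \approx 0.982,$$

i.e., $x_i$ is assigned to the first component with high responsibility; this is exactly the per-row computation that fills Table 1(c).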
In an RDBMS, the learnt model, i.e., the parameters of the $K$ components, can be persisted in a relation of $K$ rows as shown in Table 1(a). Supposing the 1-dimensional dataset $X$ of Table 1(b) is given, the posterior probability that $x_i$ belongs to component $k$ can be computed as in Table 1(c), and clustering can be conducted by assigning $x_i$ to the component with the maximum $\gamma(z_{ik})$.

To fulfill the EM algorithm in database systems, several important issues need to be addressed: (1) the representation and storage of high-dimensional data in the database; (2) the relational algebra operations used to perform the linear algebra computation in EM; (3) the approach for iterative parameter updating; (4) the way to express and control the iterations of the EM algorithm; and (5) the mechanism to maintain the existing model when the underlying data evolves.

As an early attempt, Ordonez et al. [48] proposed an SQL implementation, SQLEM.
Their implementation is based on two strategies: horizontal and vertical. The horizontal approach organizes the data points in the horizontal representation shown in Table 2(b), where the relation has $n$ rows and $d$ attributes. The vertical approach organizes the data points in the coordinate representation shown in Table 2(a), where the relation has $nd$ rows and 3 attributes, for the identity of the data point, the index, and the value. The horizontal approach has efficient performance, while the vertical approach has flexible expressive power. Thereby, [48] further proposes a hybrid approach, which persists both the vertical and the horizontal representations simultaneously to achieve a tradeoff. Although it provides a feasible solution for the EM algorithm in RDBMSs, their approach has some drawbacks. First, the hybrid and vertical approaches only have limited flexibility in expressing linear algebra operations. For the hybrid approach, the users still need to specify the computation for each dimension $d$ in the E-step and for each $k$ in the M-step, leading to many repetitive and complicated expressions in the SQL queries. These queries also limit the supported models; for example, the covariance matrix $\Sigma$ must be diagonal. Second, it does not provide an effective parameter updating and iteration control mechanism inside the database system: the SQL queries for one iteration of the E-step and M-step are wrapped in a while-loop of a host language, e.g., Java or Python. This undermines the overall performance due to large communication and I/O costs. Furthermore, their approach does not support model updating.
4 OUR SOLUTION

In this section, we propose a complete solution that deals with the above issues in applying the EM algorithm and building model-based views inside an RDBMS.

    Table 2: The relation representations for a matrix — (a) the coordinate representation $X_c(ID, C, x)$, one row per dimension of each point; (b) the horizontal representation $X_h(ID, x_1, \cdots, x_d)$; (c) the row-major representation $X_r(ID, x)$, where $x$ is a vector value.

High Dimensional Representation. Regarding the issue of high-dimensional data, different from [48], we adopt the row-major representation, as shown in Table 2(c), which is endorsed by allowing an array/vector data type in the database. For one thing, this keeps the efficient performance of the horizontal representation by reducing I/O cost, especially for dense vectors. For another, we can use vector/matrix operations to support complicated linear algebra computation in a concise SQL query.
As most RDBMSs provide the array/vector datatype internally, apart from the built-in array functions, many extended libraries [16, 50] also provide additional statistical functions and vector/matrix operations for multivariate statistical analysis and basic linear algebra calculus. These high-level abstractions avoid letting end-users specify the arithmetic operations on each dimension of a data point, and thus serve as a set of building blocks for machine learning algorithms.

Consider computing the means $\mu$ in the M-step (Eq. (4)) with the 3 different representations $X_c$, $X_h$ and $X_r$ in Table 2. Suppose the responsibilities are in a relation $R(ID, K, r)$, where $ID$, $K$ and $r$ are the identifier of the data point, the identifier of the component, and the value of $\gamma(z_{ik})$, respectively. The relational algebra expressions that compute Eq. (4) over the three representations are shown in Eq. (7)-(9), respectively:

$$\mu_c \leftarrow \rho(K, C, \text{mean})\big({}_{K,C}\mathcal{G}_{\text{sum}(r \ast x)}(R \Join_{R.ID = X_c.ID} X_c)\big) \quad (7)$$

$$\mu_h \leftarrow \rho(K, \mu_1, \cdots, \mu_d)\big({}_{K}\mathcal{G}_{\text{sum}(r \ast x_1), \cdots, \text{sum}(r \ast x_d)}(R \Join_{R.ID = X_h.ID} X_h)\big) \quad (8)$$

$$\mu_r \leftarrow \rho(K, \text{mean})\big({}_{K}\mathcal{G}_{\text{sum}(r \,\cdot\, x)}(R \Join_{R.ID = X_r.ID} X_r)\big) \quad (9)$$

We elaborate on these expressions. In Eq. (7) and Eq. (8), the binary operator $\ast$ is the arithmetic multiplication, while in Eq. (9), the operator $\cdot$ denotes a scalar-vector multiplication. First, all of these 3 representations need to join $R$ and $X$ on the $ID$ attribute to compute $\gamma(z_{ik})\, x_i$. The differences lie in the group-by and aggregation for each component $k$. For the coordinate representation $X_c$ in Eq. (7), apart from $K$, we also need to group by the index $C$. For the horizontal representation $X_h$ in Eq. (8), we need to define the computation of each dimension $d$ in the aggregation expression. As the dimension increases, Eq. (7) faces the problem of high I/O cost, while Eq. (8) leads to a verbose query. Consider the row-major representation, which nests the separate dimension attributes into one vector-type attribute. By introducing the $\cdot$ operator for vector computation, Eq. (4) is expressed in an efficient and clear way (Eq. (9)).

Relational Algebra to Linear Algebra. On the basis of the array/vector data type and the derived statistical functions and linear algebra operations, complicated linear algebra computation can be expressed by the basic relational algebra operations (selection ($\sigma$), projection ($\Pi$), union ($\cup$), Cartesian product ($\times$), and rename ($\rho$)), together with group-by & aggregation. Let $V$ and $E$ ($E'$) be the relation representations of a vector and a matrix, such that $V(ID, v)$ and $E(F, T, e)$. Here, $ID$ is the tuple identifier in $V$, and $F$ and $T$ stand for the two indices of a matrix. [57] introduces two new operations to support the multiplication between a matrix and a vector (Eq. (10)) and between two matrices (Eq. (11)) in their relation representation:

$$E \;{\oplus(\otimes)}{\Join}_{T = ID}\; V = {}_{E.F}\mathcal{G}_{\oplus(\otimes)}(E \Join_{T = ID} V) \quad (10)$$

$$E \;{\oplus(\otimes)}{\Join}_{E.T = E'.F}\; E' = {}_{E.F,\, E'.T}\mathcal{G}_{\oplus(\otimes)}(E \Join_{E.T = E'.F} E') \quad (11)$$

The matrix-vector multiplication (Eq. (10)) consists of two steps. The first step is computing $v \otimes e$ between a tuple in $E$ and a tuple in $V$ under the join condition $E.T = V.ID$. The second step is aggregating all the $\otimes$ results by the operation $\oplus$ for every group, grouping by the attribute $E.F$.
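In SQL, Eq. (10) instantiated with $(\oplus, \otimes) = (\text{sum}, \ast)$ is a single join-aggregate query. The following is a minimal sketch over the relation representations $E(F, T, e)$ and $V(ID, v)$ introduced above; it illustrates the operator, and is not the system's generated code.

    -- Matrix-vector multiplication (Eq. (10)) with (sum, *):
    -- join on the inner index (E.T = V.ID), then aggregate per row index F.
    select E.F, sum(E.e * V.v) as v
    from E, V
    where E.T = V.ID
    group by E.F;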
Similarly, the matrix-matrix multiplication (Eq. (11)) is done in two steps. The first step computes $e \otimes e'$ between a tuple in $E$ and a tuple in $E'$ under the join condition $E.T = E'.F$. The second step aggregates all the $\otimes$ results by the operation $\oplus$ for every group, grouping by the attributes $E.F$ and $E'.T$. The formula for re-estimating the means $\mu$ (Eq. (4)) is a matrix-vector multiplication if the data is 1-dimensional, or a matrix-matrix multiplication otherwise. When high-dimensional data is in the coordinate representation (Table 2(a)), Eq. (7) is an application of Eq. (11). When high-dimensional data is nested in the row-major representation (Table 2(c)), the matrix-matrix multiplication is reduced to a matrix-vector multiplication, as shown in Eq. (9).

Re-estimating the covariance/standard deviation $\Sigma$ (Eq. (5)) involves an element-wise matrix multiplication if the data is 1-dimensional, or a tensor-matrix multiplication otherwise. The element-wise matrix multiplication can be expressed by joining the two matrices on their two indices to compute $E.e \otimes E'.e$. An extra aggregation is required to aggregate over each component $k$, as shown in Eq. (12):

$$E \;{\oplus(\otimes)}{\Join}_{E.F = E'.F,\, E.T = E'.T}\; E' = {}_{E.F}\mathcal{G}_{\oplus(\otimes)}(E \Join_{E.F = E'.F,\, E.T = E'.T} E') \quad (12)$$

Similarly, when $\otimes$ and $\oplus$ are a vector operation and a vector aggregation, Eq. (12) is reduced to a high-dimensional tensor-matrix multiplication.

Value Updating. So far, we still need to deal with the parameter update when training the model over multiple iterations. There is a new relational operation, union by update, denoted as $\uplus$ and first defined in [57] (Eq. (13)), to address value updates in iterative graph computation:

$$R \uplus_A S = (R - (R \ltimes_{R.A = S.A} S)) \cup S \quad (13)$$

Suppose $t_r$ is a tuple in $R$ and $t_s$ is a tuple in $S$. Different from the conventional union ($\cup$), union by update updates $t_r$ by $t_s$ if $t_r$ and $t_s$ are identical on some attributes $A$. If $t_s$ does not match any $t_r$, $t_s$ is merged into the resulting relation. Given the relation of parameters shown in Table 1(a), union by update replaces the set of old parameters by the set of new parameters when they have the identical component identifier $K$.
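Although union by update is not a standard SQL operator, its single-step effect in Eq. (13) can be emulated with standard operators. A minimal sketch, assuming relations GMM_old and GMM_new (names ours, for illustration) hold the old and the new parameter tuples keyed by K:

    -- Emulating GMM_old U+_K GMM_new = (GMM_old - (GMM_old <| GMM_new)) U GMM_new:
    -- a new tuple replaces the old tuple that shares its key K.
    select * from GMM_new
    union all
    select * from GMM_old
    where GMM_old.K not in (select K from GMM_new);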
Iterative Evaluation. In the following, we elaborate on our approach to supporting iterative model training by SQL recursive queries. Over decades, RDBMSs have provided the functionality to support recursive queries, based on SQL'99 [25, 40]. Recursive queries are expressed using the with clause in SQL. We introduce the with clause following the discussion given in [27]:

    with R as
      <R initialization>
      <recursive querying involving R>

Here, the recursive with clause defines a temporary recursive relation R in the initialization step, and then queries by referring to the recursive relation R iteratively in the recursive step, until R can no longer be changed. As an example, the edge transitive closure can be computed using with over the edge relation E(F, T), where F and T stand for "From" and "To".

    1. with
    2. TC(F, T) as (
    3.   (select F, T from E)
    4.   union all
    5.   (select TC.F, E.T from TC, E where TC.T = E.F))

    Figure 1: The recursive with statement

As shown in Fig. 1, the recursive relation is named TC. Initially, the recursive relation TC is defined by projecting the two attributes, F and T, from the relation E (line 3). Then, the query in every iteration unions the TC computed so far with a relation of the two attributes TC.F and E.T, obtained by joining the two relations TC and E over the join condition TC.T = E.F (line 5).
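For reference, Fig. 1 corresponds to the query below in PostgreSQL's concrete dialect. Note that with union all the recursion only terminates when the edge relation is acyclic; union additionally removes duplicates and therefore reaches a fixpoint on cyclic graphs as well.

    -- Fig. 1 in PostgreSQL's WITH RECURSIVE dialect.
    with recursive TC(F, T) as (
        select F, T from E
      union
        select TC.F, E.T from TC, E where TC.T = E.F
    )
    select * from TC;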
However, SQL'99 defines the recursion in a limited form. The with clause restricts the recursion to a stratified program, where non-monotonic operations, e.g., union by update, are not allowed. To support iterative model updates, we extend the existing looping structure following the "algebra + while" of [6]:

    initialize R;
    while (R changes) { ... ; R <- ... }

In brief, in the looping, R may be changed by the relational algebra in the body of the looping, and the looping terminates when R becomes stable. As discussed in [6], there are two semantics for "algebra + while", namely, noninflationary and inflationary. The assignment $R \leftarrow \mathcal{E}$ assigns to relation $R$ the result of evaluating the relational algebra expression $\mathcal{E}$. For the inflationary semantics, the assignment needs to be cumulative, like the conventional union ($\cup$). For the noninflationary semantics, the assignment can be destructive, i.e., the new value overwrites the old value, like union by update ($\uplus$). [57] proves that, under the "algebra + while" framework, $\uplus$ leads to a fixpoint in the enhanced recursive SQL queries by XY-stratification. The vector/matrix data type, as a nested attribute, does not violate this property in this paper. Meanwhile, the Datalog program of Eq. (12), and the other relational algebra expressions that consist of natural join, group-by and aggregation, can be proved to be XY-stratified in a similar way as Eq. (10) and Eq. (11) in [57].

To this end, the general syntax of the enhanced recursive with is sketched in Fig. 2:

    with R as
      select ... from R, T1, ... computed by ...   (Q1)
      union by update
      select ... from R, T2, ... computed by ...   (Q2)

    Figure 2: The general form of the enhanced recursive with

In the main body of the enhanced with, apart from union all, we also allow union by update to union the results of the initial query Q1 and the recursive query Q2. As discussed in [57], when union by update is used, it cannot be used more than once, and it cannot be used together with union all. Here, the computed by statement, as a new feature of the enhanced with, allows users to specify how a relation $T_{i,j}$ is computed by a sequence of queries. The queries wrapped in computed by must be non-recursive. In this paper, we mainly use union by update in the recursive query for parameter updates, instead of union all. In the following sections, we elaborate on the recursive query for EM and on model updating in detail.

5 THE SQL RECURSIVE QUERY

We show the details of supporting the model-based view by the "algebra + while" approach, using SQL recursive queries. First, we present the relational algebra expressions needed, followed by the enhanced recursive query and our implementation. Second, the queries for model inference are introduced.
For simplicity, here we consider the training data point $x_i$ to be a 1-dimensional scalar. It is natural to extend the query to high-dimensional input data when the matrix/vector data type and functions are supported by the database system. We represent the input data by a relation $X(ID, x)$, where $ID$ is the tuple identifier for data point $x_i$ and $x$ is the numeric value. The model-based view is persisted in the relation GMM($K$, pie, mean, cov), where $K$ is the identifier of the $k$-th component, and pie, mean and cov denote the corresponding parameters, i.e., the mixing coefficients, the means, and the covariances (standard deviations), respectively. The relation representations are shown in Table 1. The following relational algebra expressions describe the E-step (Eq. (14)), the M-step (Eq. (15)-(18)), and the parameter update (Eq. (19)) in one iteration:

$$R \leftarrow \rho(ID, K, r)\, \Pi_{ID,\, K,\, f}(GMM \times X) \quad (14)$$

$$P \leftarrow \rho(K, \text{pie})\big({}_{K}\mathcal{G}_{\text{sum}(r)}(R \Join_{R.ID = X.ID} X)\big) \quad (15)$$

$$M \leftarrow \rho(K, \text{mean})\big({}_{K}\mathcal{G}_{\text{sum}(r \ast x)/\text{sum}(r)}(R \Join_{R.ID = X.ID} X)\big) \quad (16)$$

$$T \leftarrow \Pi_{ID,\, K,\, \text{pow}(x - \text{mean})}(M \times X) \quad (17)$$

$$C \leftarrow \rho(K, \text{cov})\,{}_{K}\mathcal{G}_{\text{sum}(r \ast t)}(T \Join_{T.ID = R.ID,\, T.K = R.K} R) \quad (18)$$

$$GMM \leftarrow \rho(K, \text{pie}, \text{mean}, \text{cov})(P \Join_{P.K = M.K} M \Join_{M.K = C.K} C) \quad (19)$$

First, in Eq. (14), by performing a Cartesian product of GMM and $X$, each data point is associated with the parameters of each component. The responsibilities are evaluated by applying an analytical function $f$ that computes the normalized probability density (Eq. (3)) for each tuple, which is the E-step. The resulting relation $R(ID, K, r)$ is shown in Table 1(c). For the M-step, the mixing coefficients pie (Eq. (15)), the means mean (Eq. (16)) and the covariances cov (Eq. (17)-(18)) are re-estimated based on their update formulas in Eq. (4)-(6), respectively. In the end, in Eq. (19), the temporary relations $P$, $M$ and $C$ are joined on attribute $K$ to merge the parameters, and the result is assigned to the recursive relation GMM.

Fig. 3 shows the enhanced with query that supports the Gaussian Mixture Model by the EM algorithm.

    1.  with
    2.  GMM(K, pie, mean, cov) as (
    3.    (select K, pie, mean, cov from INIT_PARA)
    4.    union by update K
    5.    (select M.K, pie/n, mean, sqrt(cov/pie)
    6.     from M, C where M.K = C.K)
    7.    computed by
    8.      R(ID, K, r) as select ID, K, norm(x, mean, cov) * pie /
    9.        (sum(norm(x, mean, cov) * pie) over (partition by ID))
    10.       from GMM, X
    11.     M(K, pie, mean) as select K, sum(r), sum(r * x) / sum(r)
    12.       from R, X where R.ID = X.ID group by K
    13.     C(K, cov) as select T.K, sum(r * T.val)
    14.       from
    15.       (select ID, K, pow(x - mean) as val from M, X) as T, R
    16.       where T.ID = R.ID and T.K = R.K
    17.       group by T.K )
    18. maxrecursion
    19. select * from GMM

    Figure 3: The enhanced recursive SQL for Gaussian Mixtures

The recursive relation GMM specifies the parameters of the $K$ Gaussian distributions. In line 3, the initial query loads the initial parameters from the relation INIT_PARA. The new parameters are selected by the recursive query (lines 5-6), evaluated with the computed by statement, and update the recursive relation by union by update w.r.t. the component index $K$. It wraps the queries that compute the E-step and M-step of one EM iteration.

We elaborate on the queries in the computed by statement (lines 8-17). Specifically, the query in lines 8-10 performs the E-step, as the relational algebra in Eq. (14). Here, norm is the Gaussian (Normal) probability density function of data point $x$, given the mean and covariance as input. We can use the window function, introduced in SQL'03, to compute the responsibility by the Bayes rule in Eq. (3). In line 9, sum() over (partition by ...) is the window function, performing a calculation across a set of rows that are related to the current row.
As it does not group rows, and each row retains its separate identity, many RDBMSs allow it in a recursive query, e.g., PostgreSQL and Oracle. The window function partitions the rows of the Cartesian product result into partitions of the same $ID$ and computes the denominator of Eq. (3). In lines 11-13, the query computes the means (Eq. (4)) and the mixing coefficients together, by a matrix-matrix multiplication, due to their common join of $R$ and $X$. Then, lines 14-17 compute the covariances of Eq. (5). First, we compute the square of $x_i - \mu_k$ for each $x_i$ and $k$, which requires a Cartesian product of $M$ and $X$ (Eq. (17)). Second, the value is weighted by the responsibility and aggregated as specified in Eq. (18). The new parameters in the temporary relations $M$ and $C$ are merged by joining on the component index $K$ in line 6.

An acute reader may find that in Fig. 3, the recursive query does not compute and check the convergence of the log-likelihood explicitly. That is because the existing recursive query does not support checking value convergence or early stopping. However, the depth of the recursion can be controlled by the maxrecursion clause, which is adapted from SQL Server [3]. The maxrecursion clause can effectively prevent the infinite recursion caused by an infinite fixpoint, e.g., "with R(n) as ((select values(0)) union all (select n+1 from R))", a standard SQL'99 recursion. Users can check the convergence after training for a fixed number of recursions and resume the training from the current parameters if necessary.

The implementation: We sketch how to support recursive queries using the enhanced with in RDBMSs. First, for each subquery $Q_i$ used in $Q$, including those defined by the computed by statement, we construct a local dependency graph $G_i$. The graph $G_i$ constructed must be cycle-free, and we ensure that it is XY-stratified. Second, we create a PSM (Persistent Stored Modules) procedure following the recent SQL standard. With PSM, we create a unique procedure $F_Q$ for the recursive query $Q$ to be processed, as illustrated below.

    create procedure F_Q (
      declare C_1, ..., C_i, ...;
      create table T_{i,j} for all tables defined by as in a subquery Q_i;
      create SQL statements to compute the initial R by the union of
        all initial subqueries;
      loop
        insert into T_{i,j} select ... for every T_{i,j} used in Q_i;
        compute the condition C_i for each recursive subquery Q_i;
        if all C_i for the recursive subqueries are false then exit;
        compute the recursive relation R for the current iteration;
        union the current R with the previous R computed;
      end loop )

In the procedure $F_Q$, first we declare variables $C_1, \cdots, C_i, \cdots$ for every subquery $Q_i$, which are used to check the condition to exit the loop. Second, we create the temporary tables for the relations defined by as in the computed by statements. Third, we include SQL statements to compute the initial recursive relation $R$. Fourth, we create a loop. In the loop, we generate an insert for $T_{i,j}$, and check whether $Q_i$ is empty. If so, $C_i$ is set to 0, indicating that $Q_i$ generates 0 tuples, and the loop will be terminated. Otherwise, the recursive relation computed in this iteration is unioned with the one computed in the previous iteration, by either union all or union by update. An extra counter is maintained in the loop if maxrecursion is used. With the procedure defined, we can run the statements in the procedure $F_Q$ by issuing "call $F_Q$".
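To make the translation concrete, the sketch below shows roughly what one loop iteration of $F_Q$ executes for the query of Fig. 3, unrolled as plain SQL. It is an illustration under the schemas of Fig. 3: norm is the Gaussian-density UDF assumed there, the covariance step is elided, and the statements actually generated may differ.

    -- One EM iteration of Fig. 3, unrolled (a sketch).
    insert into R                          -- E-step, Eq. (3)/(14)
    select X.ID, GMM.K,
           pie * norm(x, mean, cov) /
           sum(pie * norm(x, mean, cov)) over (partition by X.ID)
    from GMM, X;

    insert into M                          -- Eq. (15)-(16)
    select R.K, sum(r), sum(r * x) / sum(r)
    from R, X where R.ID = X.ID group by R.K;

    -- ... C is filled analogously (Eq. (17)-(18)); the union by update
    -- on K then amounts to an in-place update of the recursive relation:
    update GMM
    set pie = M.pie / (select count(*) from X), mean = M.mean
    from M where GMM.K = M.K;              -- cov is updated likewise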
Once the model is trained by the recursive query in Fig. 3, it can be materialized in a view for online inference. In the inference phase, users can query the view by SQL to perform clustering, classification and density estimation. Given a batch of data in a relation $X$ and a view GMM computed by Fig. 3, the query below computes the posterior probability that component $K$ generated the data point with index $ID$. The query is similar to the one computing the E-step (Eq. (3)) in lines 8-10 of Fig. 3.

    create table R as
    select ID, K, norm(x, mean, cov) * pie /
           (sum(norm(x, mean, cov) * pie) over (partition by ID)) as r
    from GMM, X

Based on the relation $R(ID, K, r)$ above, we can further assign the data to $K$ clusters, where $x_i$ is assigned to cluster $k$ if the posterior probability $\gamma(z_{ik})$ is the maximum among $\{\gamma(z_{i1}), \cdots, \gamma(z_{iK})\}$. The query below creates a relation CLU($ID$, $K$) to persist the clustering result, where $ID$ and $K$ are the attributes of a data point and its assigned cluster, respectively. It first finds the maximum $\gamma(z_{ik})$ for each data point by a subquery on relation $R$. The result is renamed as $S$ and is joined with $R$ on the condition $R.ID = S.ID$ and $R.r = S.r$ to find the corresponding $K$.

    create table CLU as
    select R.ID, R.K
    from R, (select ID, max(r) as r from R group by ID) as S
    where R.ID = S.ID and R.r = S.r
It is worth noting that both of the queries above access the data exactly once. Thereby, it is possible to perform the inference on-the-fly and only for the data of interest. Besides density estimation and clustering, result evaluation, e.g., computing the purity, the normalized mutual information (NMI) and the Rand Index, can be conducted in the database by SQL queries.
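For instance, given ground-truth labels in a relation GT(ID, label) — an assumption for illustration — the clustering purity of the result in CLU can be evaluated by a single aggregation query:

    -- Purity: sum each cluster's majority-label count, divide by n.
    select sum(majority)::float8 / (select count(*) from CLU) as purity
    from (select K, max(cnt) as majority
          from (select CLU.K, GT.label, count(*) as cnt
                from CLU, GT where CLU.ID = GT.ID
                group by CLU.K, GT.label) as per_label
          group by K) as per_cluster;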
    Figure 4: Overview of Model Maintenance — the insert triggers route the inserted data and the selected original data into the model update, which maintains the sufficient statistics and the model.

    1. create trigger T1 before insert on X
    2.   for each statement
    3.   execute procedure DATA_SELECTION
    4. create trigger T2 before insert on X
    5.   for each row
    6.   execute procedure DATA_INSERTION
    7. create trigger T3 after insert on X
    8.   for each statement
    9.   execute procedure MODEL_UPDATE

    Figure 5: The triggers for incremental update
6 MODEL MAINTENANCE

In this section, we investigate automatic model/view updating. When the underlying data $X$ changes, a straightforward way is to re-estimate the model over the updated data. However, when only a small portion of the training data is updated, the change to the corresponding model is slight, and it is inefficient to re-estimate the model on-the-fly. Hence, a natural idea arises: whether we can update the existing model by exploring an "incremental variant" of the EM algorithm, maintained with the newly arriving data and a small portion of the data extracted from the original dataset. As the statistical models trained by the SQL queries can be represented by their sufficient statistics, the model is updated by maintaining the model and the sufficient statistics.

A sufficient statistic is a function of the data $X$ that contains all of the information relevant to estimating the model parameters. The sufficient statistics of the delta data are computed to update the statistics of the model. As the model is updated, the statistics of the data are also updated, following the change of the posterior probability $\gamma(z_{ik})$. This process repeats until the statistics converge. It is worth mentioning that this maintenance mechanism can support all the models that have sufficient statistics. We elaborate on the sufficient statistics updating rules in the following.

Suppose the training dataset of model $\theta$ is $\{x_1, x_2, \cdots, x_n\}$. Let $s$ be the sufficient statistics of $\theta$. Based on the Factorization Theorem [21], we can obtain

$$s = \sum_{i=1}^{n} \sum_{z} p(z \mid x_i, \theta)\, \phi(x_i, z) \quad (20)$$

where $z$ is the unobserved variable, and $\phi$ denotes the mapping function from an instance $(x_i, z)$ to the sufficient statistics contributed by $x_i$. Suppose the inserted data is $\{x_{n+1}, x_{n+2}, \cdots, x_m\}$. Let the new model for the overall data $\{x_1, \cdots, x_n, x_{n+1}, \cdots, x_m\}$ be $\tilde{\theta}$ and the corresponding sufficient statistics be $\tilde{s}$. The difference $\tilde{s} - s$, denoted as $\Delta s$, is

$$\Delta s = \sum_{i=1}^{m} \sum_{z} p(z \mid x_i, \tilde{\theta})\, \phi(x_i, z) - \sum_{i=1}^{n} \sum_{z} p(z \mid x_i, \theta)\, \phi(x_i, z)$$
$$\phantom{\Delta s} = \sum_{i=1}^{n} \sum_{z} [\, p(z \mid x_i, \tilde{\theta}) - p(z \mid x_i, \theta) \,]\, \phi(x_i, z) \quad (21)$$
$$\phantom{\Delta s} + \sum_{i=n+1}^{m} \sum_{z} p(z \mid x_i, \tilde{\theta})\, \phi(x_i, z) \quad (22)$$

According to the above equations, we observe that the delta of the sufficient statistics, $\Delta s$, consists of two parts: (1) the change of the sufficient statistics for the overall data points $\{x_1, x_2, \cdots, x_n\}$ in Eq. (21), and (2) the additional sufficient statistics for the newly inserted data points $\{x_{n+1}, \cdots, x_m\}$ in Eq. (22). Consider re-training a new model $\tilde{\theta}$ over $\{x_1, x_2, \cdots, x_m\}$ in $T$ iterations, taking $\theta$ as the initial parameters, i.e., $\theta^{(0)} = \theta$ and $\theta^{(T)} = \tilde{\theta}$. We have

$$\Delta s = \sum_{i=1}^{m} \sum_{z} [\, p(z \mid x_i, \theta^{(T)}) - p(z \mid x_i, \theta^{(0)}) \,]\, \phi(x_i, z) \quad (23)$$
$$\phantom{\Delta s} + \sum_{i=n+1}^{m} \sum_{z} p(z \mid x_i, \theta^{(0)})\, \phi(x_i, z) \quad (24)$$
$$\phantom{\Delta s} = \sum_{t=1}^{T} \sum_{i=1}^{m} \sum_{z} [\, p(z \mid x_i, \theta^{(t)}) - p(z \mid x_i, \theta^{(t-1)}) \,]\, \phi(x_i, z) \quad (25)$$
$$\phantom{\Delta s} + \sum_{i=n+1}^{m} \sum_{z} p(z \mid x_i, \theta^{(0)})\, \phi(x_i, z) \quad (26)$$

The above equations indicate how to compute $\Delta s$. For the inserted data $\{x_{n+1}, \cdots, x_m\}$, the delta can be directly computed by evaluating the original model $\theta^{(0)}$, as in Eq. (26), while for the original data, the delta can be computed by updating the model $\theta^{(t)}$ iteratively using all the data $\{x_1, x_2, \cdots, x_m\}$, as in Eq. (25).
Since most of the computational cost is concentrated in the iterations of Eq. (25), we use two tricks to approximate the computation. The first is a stochastic approximation, where the parameters are updated after the sufficient statistics of each new data point $x_i$ are computed, instead of after a full pass over the batch. This approach is widely used in many online and incremental EM algorithm variants [37, 43, 51]. The second is discarding the data points that are unlikely to change their cluster in the future, as scaling clustering algorithms do for speedup [11]. In other words, due to the slight changes of the data and the model, only a small portion of the original data points, those with unstable membership, needs to be retrained.

We discuss our strategy of selecting partial original data in $\{x_1, x_2, \cdots, x_n\}$ for the model update. It is a tradeoff between the accuracy of the model and the updating cost: the more data we use to update the model, the more accurate a model we can achieve, but the larger the computation and I/O costs we spend. The data points that are unlikely to change cluster membership can be discarded in future model updates. There are two strategies for the discarding, a distance-based and a density-based strategy. For the distance-based strategy, we use the Mahalanobis distance [20] to measure the distance between a data point and a distribution. For each data point $x_i$, we compute the Mahalanobis distance $D_k(x_i)$ to the $k$-th component with mean $\mu_k$ and covariance $\Sigma_k$:

$$D_k(x_i) = \sqrt{(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)} \quad (27)$$

We can filter the data within a given threshold radius of any component. Another measurement is the entropy of the posterior probability of data point $x_i$, as in Eq. (28), where $\gamma(z_{ik})$ is evaluated by the parameters $\theta^{(0)}$. The larger the entropy, the lower the possibility of assigning $x_i$ to any one of the components:

$$E(x_i) = -\sum_{k=1}^{K} \gamma(z_{ik}) \ln \gamma(z_{ik}) \quad (28)$$

It is worth mentioning that the data selection in trigger T1 can be performed offline, i.e., persisting a subset of the training data with a fixed budget size for future model updates. In addition, the sufficient statistics of the original model $\theta$ (line 1 of Algorithm 2) can be precomputed. These will improve the efficiency of online model maintenance significantly. Alternative fine-grained data selection strategies are beyond the scope of this paper.

Similarly, considering deleting $l$ data points $\{x_{n-l+1}, \cdots, x_n\}$ from $\{x_1, x_2, \cdots, x_n\}$, the difference of the sufficient statistics, $\Delta s$, is

$$\Delta s = \sum_{t=1}^{T} \sum_{i=1}^{n-l} \sum_{z} [\, p(z \mid x_i, \theta^{(t)}) - p(z \mid x_i, \theta^{(t-1)}) \,]\, \phi(x_i, z) - \sum_{i=n-l+1}^{n} \sum_{z} p(z \mid x_i, \theta^{(0)})\, \phi(x_i, z) \quad (29)$$

In RDBMSs, the automatic model updating mechanism is enabled by triggers built on the relation of the input data $X$. Fig. 4 illustrates the overview of our model/view updating mechanism. There are three triggers built on the relation of the training data $X$, whose definitions are shown in Fig. 5. Before the insertion is executed, two triggers, T1 (lines 1-3 in Fig. 5) and T2 (lines 4-6 in Fig. 5), prepare the data for model updating in a temporary relation $X'$. Here, T1 selects a subset of the original data $\{x_1, x_2, \cdots, x_n\}$ based on a selection criterion, and T2 inserts all the newly arriving data $\{x_{n+1}, \cdots, x_m\}$ into relation $X'$. After the data preparation finishes, another trigger, T3 (lines 7-9 in Fig. 5), calls a PSM to compute $\Delta s$ from $X'$. In the PSM, first, the delta of the newly inserted data (Eq. (26)) is computed and used to reinitialize the parameters of the model. Then, $T$ iterations of scanning relation $X'$ are performed, where in each iteration, $X'$ is randomly shuffled and each data point is used to instantly update both the sufficient statistics it contributes and the model. The actions of these triggers are transparent to the database users. Finally, we use the Gaussian Mixture model as an example to illustrate this procedure.

    Algorithm 2: MODEL_UPDATE
    1:  Initialize the original sufficient statistics $s$ by Eq. (30);
    2:  $s_k \leftarrow s_k + \sum_{i=n+1}^{m} \gamma(z_{ik})\, x_i$;
    3:  $S_k \leftarrow S_k + \sum_{i=n+1}^{m} \gamma(z_{ik})\, x_i x_i^T$;
    4:  Update the model parameters $\mu$, $\Sigma$ and $\pi$ by Eq. (31);
    5:  for $t \leftarrow 1$ to $T$ do
    6:    for $x_i \in X'$ in random order do
    7:      $s_k \leftarrow s_k + \gamma(z_{ik})^{(t)} x_i - \gamma(z_{ik})^{(t-1)} x_i$;
    8:      $S_k \leftarrow S_k + \gamma(z_{ik})^{(t)} x_i x_i^T - \gamma(z_{ik})^{(t-1)} x_i x_i^T$;
    9:      Update the model parameters $\mu$, $\Sigma$ and $\pi$ by Eq. (31);
    10:   end for
    11: end for
    12: return $\mu$, $\Sigma$, $\pi$;
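For the 1-dimensional case, the distance of Eq. (27) used by the selection in T1 is itself a single SQL expression. A minimal sketch under the schemas of Fig. 3, where cov stores the standard deviation; T1's selection predicate can then threshold this distance:

    -- Mahalanobis distance (Eq. (27)) of every point to every component,
    -- 1-dimensional case: D_k(x_i) = |x_i - mu_k| / sigma_k.
    select X.ID, GMM.K, abs(X.x - GMM.mean) / GMM.cov as D
    from X, GMM;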
Example 6.1. For the Gaussian Mixture model of $K$ components, the minimal sufficient statistics are $s = (s_1, S_1, \cdots, s_K, S_K)$, where $s_k \in \mathbb{R}^d$ and $S_k \in \mathbb{R}^{d \times d}$ for each $k \in \{1, \cdots, K\}$, as below:

$$s_k = \sum_{i=1}^{n} \gamma(z_{ik})\, x_i, \qquad S_k = \sum_{i=1}^{n} \gamma(z_{ik})\, x_i x_i^T \quad (30)$$

And the parameters can be computed from the sufficient statistics as shown in Eq. (31), where $n_k = \sum_{i=1}^{n} \gamma(z_{ik})$:

$$\mu_k = \frac{s_k}{n_k}, \qquad \pi_k = \frac{n_k}{n}, \qquad \Sigma_k = \frac{S_k}{n_k} - \mu_k \mu_k^T \quad (31)$$

When an insert command on relation $X$ is issued, at first, the triggers T1 and T2 prepare the temporary relation $X'$. T3 is triggered following the insertion command. The procedure T3 executes is illustrated in Algorithm 2. In lines 2-3, T3 first adds the sufficient statistics of the inserted data into $s$. Then it further updates $s$ by performing $T$ iterations over $X'$; in each iteration, each data point $x_i$ is used to instantly update $s$ as well as the model. Here, $\gamma(z_{ik})^{(t)}$ is the responsibility of $x_i$ (Eq. (3)) evaluated in the $t$-th iteration. □
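Under the same schemas, the sufficient statistics of Eq. (30) for the 1-dimensional case are one aggregation query away; a sketch (the result relation SS is our name, for illustration):

    -- Sufficient statistics of Eq. (30), 1-dimensional case.
    create table SS as
    select R.K,
           sum(R.r * X.x)       as s1,   -- s_k = sum_i gamma(z_ik) * x_i
           sum(R.r * X.x * X.x) as s2,   -- S_k = sum_i gamma(z_ik) * x_i^2
           sum(R.r)             as n_k   -- n_k, used by Eq. (31)
    from R, X where R.ID = X.ID
    group by R.K;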
7 EXPERIMENTAL STUDIES

In this section, we present our experimental studies of supporting model-based view training, inference, and maintenance in an RDBMS. We conduct extensive experiments to investigate the following facets:

• compare the performance of our enhanced with against looping controlled by a host language;
• test the scalability of the recursive queries for different models on synthetic data;
• conduct a case study on market segmentation over retail data;
• validate the efficiency of our model maintenance mechanism.

Experimental Setup: We report our performance studies on a PC with an Intel(R) Xeon(R) CPU E5-2697 v3 (2.60GHz) and 96GB RAM, running Linux CentOS 7.5 64-bit. We tested the enhanced recursive query on a PostgreSQL instance.

    Figure 6: with+ vs. Psycopg2 — training time of the Gaussian Mixture Model, (a) varying d, (b) varying n, (c) varying k, (d) varying the number of iterations.

with+ vs. Psycopg2.
We compare the enhanced with, which translates the recursive SQL query into SQL/PSM, against the implementation that uses a host language to control the looping, as adopted in the previous EM implementation [48]. We implement the latter with Psycopg2 [2], a popular PostgreSQL adapter for the Python language. Regarding the EM algorithm, the E-step, the M-step, and the parameter updating are wrapped in a Python for-loop and executed by a cursor alternately. We compare the running times of these two implementations, i.e., enhanced with and Psycopg2, for training the Gaussian Mixture Model, varying the dimension d of the data points (Fig. 6(a)), the scale of the training data n (Fig. 6(b)), the number of components k (Fig. 6(c)) and the number of iterations (Fig. 6(d)). The training data is evenly generated from 10 Gaussian distributions.

The evaluated time is the pure query execution time, where the costs of database connection, data loading and parameter initialization are excluded. The experiments show that the enhanced with outperforms Psycopg2 significantly, not only over multiple iterations in Fig. 6(d), but also per iteration in Fig. 6(a)-6(c). For one thing, the implementation on Psycopg2 calls the database multiple times per iteration, incurring much client-server communication and context-switch cost. For another, the queries issued from client to server are parsed, optimized and planned on-the-fly. These are the general problems of calling SQL queries from any host language. Meanwhile, we implemented the hybrid strategy of SQLEM [47] on PostgreSQL. For the Gaussian Mixture model, one iteration for 10,000 data points with 10 dimensions fails to terminate within 1 hour. In their implementation, 2k separate SQL queries evaluate the means and variances of the k components, respectively, which is a performance bottleneck.

Scalability Test.
We train the Gaussian Mixture model (GMM) [8], the mixture of linear regressions (MLR) [52], and a neural network model, the mixture of experts (MOE) [54], by evaluating SQL recursive queries in PostgreSQL.

    Figure 7: Scalability Test — training time per iteration of GMM, MLR and MOE, (a) k = 8, n = 10000, varying d; (b) k = 8, d = 20, varying n; (c) d = 20, n = 10000, varying k.

    Figure 8: Convergence — log-likelihood over 15 training iterations for (a) GMM, (b) MLR, (c) MOE.

Given an observed dataset $\{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, MLR models the density of $y$ given $x$ as

$$p(y_i \mid x_i) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(y_i \mid x_i^T \beta_k, \sigma_k) \quad (32)$$

and MOE models the density of $y$ given $x$ as

$$p(y_i \mid x_i) = \sum_{k=1}^{K} g_k(x_i)\, \mathcal{N}(y_i \mid x_i^T \beta_k, \sigma_k) \quad (33)$$

where $\beta_k \in \mathbb{R}^d$ are the parameters of a linear transformation and $\mathcal{N}$ is the probability density function of a Gaussian with mean $x_i^T \beta_k \in \mathbb{R}$ and standard deviation $\sigma_k \in \mathbb{R}$. In Eq. (33), $g_k(x)$ is called the gating function, given by computing the softmax in Eq. (34), where $w_k \in \mathbb{R}^d$ is a set of linear weights on $x_i$:

$$g_k(x_i) = \frac{e^{x_i^T w_k}}{\sum_{j=1}^{K} e^{x_i^T w_j}} \quad (34)$$

The intuition behind the gating functions is a set of "soft" learnable weights that determine the mixture of the $K$ local models. We adopt the single-loop EM algorithm [53] to estimate the parameters of MOE, which uses least squares regression to compute the gating network directly. For GMM, the training data is evenly drawn from 10 Gaussian distributions. For MLR and MOE, the training data is generated from 10 linear functions with Gaussian noise. The parameters of the Gaussians and of the linear functions are drawn from a uniform distribution, and so are the initial parameters.

Fig. 7 displays the training time per iteration of the 3 models, varying the data dimension d (Fig. 7(a)), the scale of the training data n (Fig. 7(b)) and the number of clusters k (Fig. 7(c)). In general, for the 3 models, the training time grows linearly as n and k increase, while an increase in the data dimension d has a more remarkable impact on the training time. When increasing n and k, the sizes of the intermediate relations, e.g., the relation R for computing the responsibilities in Eq. (14), grow linearly; therefore the training cost grows linearly with regard to n and k. However, in the 3 models, we need to deal with $d \times d$ matrices in the M-step. For GMM, it needs to compute the probability density of the multivariate Gaussians and re-estimate the covariance matrices. For MLR and MOE, they need to compute matrix inversions and least squares regressions. The training cost grows with regard to the size of these matrices. The comparison shows that it is still hard to scale high-dimensional analysis in a traditional database system. However, the efficiency can be improved on parallel/distributed platforms and new hardware.

Case Study: Market Segmentation.
We study building model-based view in
PostgreSQL for a real appli-cation, i.e., market segmentation, which partitions the consumersinto sub-groups based on their features to analyze their purchase be-havior and identify potential market. The data is collected from anonline retailer, containing 541,908 transactions of 4,308 consumers.Based on the RFM model [13] for consumer value analysis, for eachconsumers, a three dimensional feature of real value including re-cency, frequency, monetary is extracted. The feature is normalizedby corresponding means and variances.We apply the
We apply GMM (Eq. (1)), MLR (Eq. (32)) and MOE (Eq. (33)) over the 4,308 consumers, and the clustering result is visualized in Fig. 9. For MLR and MOE, recency and frequency serve as the features $x$ and monetary as $y$; these two models assume a linear relationship exists between the recency, frequency and monetary value of a consumer. All initial parameters are drawn from a uniform distribution. In Fig. 9, the consumers are segmented into 4 clusters, shown in different colors. The segmentation of GMM (Fig. 9(a)) fits the data points by Gaussian distributions over the 3 variables, while MLR (Fig. 9(b)) and MOE (Fig. 9(c)) fit them by linear planes. The neural model MOE tends to generate sharper cluster boundaries than MLR. Fig. 8 shows the convergence of the log-likelihood of the 3 models over 15 training iterations; in fact, convergence is fast and is achieved within 5 iterations.

Figure 9: Case Study: Market Segmentation. Consumers plotted by recency, frequency and monetary: (a) GMM; (b) MLR; (c) MOE.

Figure 10: Insert maintenance. Insertion time (s) vs. budget size: (a) n = 100K; (b) n = 1M; (c) n = 10M.
Finally, we test the performance of our trigger-based model updating mechanism; a schematic sketch of the trigger wiring is given at the end of this subsection. First, we train GMM on 1-dimensional data generated from 2 Gaussian distributions. The original models are trained over 100K, 1M and 10M data points, respectively, with 15 iterations. The overall training time is recorded as the "batch" mode training time, which is 54s, 501s and 4,841s, respectively. After the model is trained and persisted, we insert 10, 20, 30, 40 and 50 data points into the underlying data, varying the budget size of the selected data from 0 to 1,000.

Fig. 10 shows the insertion time w.r.t. the budget size of the selected data for the 3 models. The insertion time is the elapsed time from issuing the insert command to the transaction commit, including the cost of selecting data with the density-based strategy and computing the initial sufficient statistics (line 1 of Algorithm 2). As the number of processed tuples increases, the insertion time grows linearly. Compared to the retraining cost, i.e., the batch training time, updating the existing model is not always more efficient. The choice depends on two factors: the size of the overall data, and the budget size plus the insert size, i.e., the number of data points to be processed during the update. The updating mechanism may not be efficient and effective when the overall data size is small or there is a large volume of insertions. This is because, in batch training mode, parameter estimation dominates the cost, whereas in model updating, the sufficient statistics and the model are updated while processing each data point, so this per-point work becomes the main overhead. Meanwhile, we notice that the elapsed time of selecting data and computing the initial sufficient statistics is about 10s, 100s and 1,000s for data sizes of 100K, 1M and 10M, respectively. Precomputing and persisting these results would benefit larger datasets.

In this paper, we focus on testing the efficiency of the approximation for model updating. The convergence and approximation guarantees involve a wide range of research topics in statistics and machine learning, which we leave for future investigation.
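To make the wiring of the mechanism concrete, a trigger-based maintenance hook can be sketched as below. The function em_incremental_update and the points table are hypothetical stand-ins for the logic of Algorithm 2, not the paper's actual code, and EXECUTE FUNCTION assumes PostgreSQL 11 or later.

-- Sketch only: trigger wiring for incremental model maintenance.
CREATE FUNCTION em_on_insert() RETURNS trigger AS $$
BEGIN
  -- fold the inserted point into the persisted sufficient statistics
  -- and re-estimate the parameters of the materialized model view
  PERFORM em_incremental_update(NEW.id);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER model_maintain
AFTER INSERT ON points
FOR EACH ROW EXECUTE FUNCTION em_on_insert();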
CONCLUSION

Integrating machine learning techniques into database systems facilitates a wide range of applications in industry and academia. In this paper, we focus on supporting the EM algorithm in an RDBMS. Different from previous approaches, our approach wraps the E-step and M-step in an enhanced SQL recursive query, which is guaranteed to reach an iterative fixpoint. The learned model can be materialized as a database view and queried like a conventional view. Furthermore, to handle slight changes in the underlying training data, we propose an automatic view updating mechanism that exploits the incremental variant of the EM algorithm. The extensive experiments we conducted validate that our enhanced recursive query significantly outperforms the previous approach and supports multiple mixture models trained by the EM algorithm, and they also confirm the efficiency of the incremental model update. It is worth mentioning that the SQL recursive query is not only suitable for unsupervised learning like the EM algorithm, but also has the potential to support supervised learning such as classification and regression. In addition, the implementation of the query can be migrated to parallel and distributed platforms, e.g., Hadoop and Spark, to deploy large-scale machine learning applications. These directions, as well as convergence and performance guarantees, deserve future exploration.
ACKNOWLEDGEMENT
This work is supported by the Research Grants Council of Hong Kong, China under No. 14203618, No. 14202919 and No. 14205520.
REFERENCES
Proc. of OSDI'16, pages 265-283, 2016.
[6] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[7] F. Arni, K. Ong, S. Tsur, H. Wang, and C. Zaniolo. The deductive database system LDL++. TPLP, 3(1), 2003.
[8] C. M. Bishop. Pattern Recognition and Machine Learning, 5th edition. Information Science and Statistics. Springer, 2007.
[9] M. Boehm, M. Dusenberry, D. Eriksson, A. V. Evfimievski, F. M. Manshadi, N. Pansare, B. Reinwald, F. Reiss, P. Sen, A. Surve, and S. Tatikonda. SystemML: Declarative machine learning on Spark. PVLDB, 9(13):1425-1436, 2016.
[10] V. R. Borkar, Y. Bu, M. J. Carey, J. Rosen, N. Polyzotis, T. Condie, M. Weimer, and R. Ramakrishnan. Declarative systems for large-scale machine learning. IEEE Data Eng. Bull., 35(2):24-32, 2012.
[11] P. S. Bradley, U. M. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proc. of KDD'98, pages 9-15, 1998.
[12] Z. Cai, Z. Vagena, L. L. Perez, S. Arumugam, P. J. Haas, and C. M. Jermaine. Simulation of database-valued Markov chains using SimSQL. In Proc. of SIGMOD'13, pages 637-648, 2013.
[13] D. Chen, S. L. Sain, and K. Guo. Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing & Customer Strategy Management, 19(3):197-208, 2012.
[14] H. Chen, R. H. L. Chiang, and V. C. Storey. Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36(4):1165-1188, 2012.
[15] L. Chen, A. Kumar, J. F. Naughton, and J. M. Patel. Towards linear algebra over normalized data. PVLDB, 10(11):1214-1225, 2017.
[16] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. MAD skills: New analysis practices for big data. PVLDB, 2(2):1481-1492, 2009.
[17] A. P. Dempster. Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 39:1-38, 1977.
[18] A. Deshpande and S. Madden. MauveDB: Supporting model-based user views in database systems. In Proc. of SIGMOD'06, pages 73-84, 2006.
[19] X. Dong, A. Y. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In Proc. of SIGMOD'05, pages 85-96, 2005.
[20] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. A Wiley-Interscience publication. Wiley, 1973.
[21] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, New York, 2nd edition, 2001.
[22] J. Eisner. Dyna: A non-probabilistic programming language for probabilistic AI. Extended abstract for talk at the NeurIPS*2008 Workshop on Probabilistic Programming, Dec. 2008.
[23] T. Elgamal, S. Luo, M. Boehm, A. V. Evfimievski, S. Tatikonda, B. Reinwald, and P. Sen. SPOOF: Sum-product optimization and operator fusion for large-scale machine learning. In Proc. of CIDR'17, 2017.
[24] X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a unified architecture for in-RDBMS analytics. In Proc. of SIGMOD'12, pages 325-336, 2012.
[25] S. J. Finkelstein, N. Mattos, I. Mumick, and H. Pirahesh. Expressing recursive queries in SQL. ISO-IEC JTC1/SC21 WG3 DBL MCI, (X3H2-96-075), 1996.
[26] Z. J. Gao, S. Luo, L. L. Perez, and C. Jermaine. The BUDS language for distributed Bayesian machine learning. In Proc. of SIGMOD'17, pages 961-976, 2017.
[27] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database Systems: The Complete Book. Prentice Hall, 2002.
[28] J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li, and A. Kumar. The MADlib analytics library or MAD skills, the SQL. PVLDB, 5(12):1700-1711, 2012.
[29] R. Jampani, F. Xu, M. Wu, L. L. Perez, C. Jermaine, and P. J. Haas. The Monte Carlo database system: Stochastic analysis close to the data. ACM Trans. Database Syst., 36(3):18:1-18:41, 2011.
[30] D. Kernert, F. Köhler, and W. Lehner. SpMachO - optimizing sparse linear algebra expressions with probabilistic density estimation. In Proc. of EDBT'15, pages 289-300, 2015.
[31] M. A. Khamis, H. Q. Ngo, and A. Rudra. FAQ: Questions asked frequently. In Proc. of PODS'16, pages 13-28, 2016.
[32] S. Klein. U-SQL. In IoT Solutions in Microsoft's Azure IoT Suite, pages 173-190. Springer, 2017.
[33] M. L. Koc and C. Ré. Incrementally maintaining classification using an RDBMS. PVLDB, 4(5):302-313, 2011.
[34] A. Kumar, J. F. Naughton, and J. M. Patel. Learning generalized linear models over normalized data. In Proc. of SIGMOD'15, pages 1969-1984, 2015.
[35] F. Li, L. Chen, Y. Zeng, A. Kumar, X. Wu, J. F. Naughton, and J. M. Patel. Tuple-oriented compression for large-scale mini-batch stochastic gradient descent. In Proc. of SIGMOD'19, pages 1517-1534, 2019.
[36] X. Li, B. Cui, Y. Chen, W. Wu, and C. Zhang. MLog: Towards declarative in-database machine learning. PVLDB, 10(12):1933-1936, 2017.
[37] P. Liang and D. Klein. Online EM for unsupervised models. In Proc. of NAACL'09, pages 611-619, 2009.
[38] S. Luo, Z. J. Gao, M. N. Gubanov, L. L. Perez, and C. M. Jermaine. Scalable linear algebra on a relational database system. In Proc. of ICDE'17, pages 523-534, 2017.
[39] G. McLachlan and T. Krishnan. The EM Algorithm and Extensions, volume 382. John Wiley & Sons, 2007.
[40] J. Melton and A. R. Simon. SQL:1999: Understanding Relational Language Components. Morgan Kaufmann, 2001.
[41] A. B. M. Moniruzzaman and S. A. Hossain. NoSQL database: New era of databases for big data analytics - classification, characteristics and comparison. CoRR, abs/1307.0191, 2013.
[42] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355-368. Springer, 1998.
[43] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, pages 355-368. 1998.
[44] M. Nikolic, M. Elseidy, and C. Koch. LINVIEW: Incremental view maintenance for complex analytical queries. In Proc. of SIGMOD'14, pages 253-264, 2014.
[45] M. Nikolic and D. Olteanu. Incremental view maintenance with triple lock factorization benefits. In Proc. of SIGMOD'18, pages 365-380, 2018.
[46] C. Ordonez. Integrating k-means clustering with a relational DBMS using SQL. IEEE TKDE, 18(2):188-201, 2006.
[47] C. Ordonez. Optimization of linear recursive queries in SQL. IEEE Trans. Knowl. Data Eng., 22(2), 2010.
[48] C. Ordonez and P. Cereghini. SQLEM: Fast clustering in SQL using the EM algorithm. In Proc. of SIGMOD, pages 559-570, 2000.
[49] M. Schleich, D. Olteanu, and R. Ciucanu. Learning linear regression models over factorized joins. In Proc. of SIGMOD'16, pages 3-18, 2016.
[50] P. Tamayo, C. Berger, M. M. Campos, J. Yarmus, B. L. Milenova, A. Mozes, M. Taft, M. F. Hornick, R. Krishnan, S. Thomas, M. Kelly, D. Mukhin, R. Haberstroh, S. Stephens, and J. Myczkowsji. Oracle data mining - data mining in the database environment. In The Data Mining and Knowledge Discovery Handbook, pages 1315-1329. 2005.
[51] D. M. Titterington. Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society: Series B (Methodological), 46(2):257-267, 1984.
[52] K. Viele and B. Tong. Modeling with mixtures of linear regressions. Statistics and Computing, 12(4):315-330, 2002.
[53] Y. Yang and J. Ma. A single loop EM algorithm for the mixture of experts architecture. In Advances in Neural Networks - ISNN 2009, 6th International Symposium on Neural Networks, Proceedings, Part II, pages 959-968, 2009.
[54] S. E. Yuksel, J. N. Wilson, and P. D. Gader. Twenty years of mixture of experts. IEEE Trans. Neural Netw. Learning Syst., 23(8):1177-1193, 2012.
[55] C. Zaniolo, N. Arni, and K. Ong. Negation and aggregates in recursive rules: The LDL++ approach. In Proc. of DOOD, 1993.
[56] C. Zaniolo, S. Ceri, C. Faloutsos, R. T. Snodgrass, V. S. Subrahmanian, and R. Zicari. Advanced Database Systems. Morgan Kaufmann, 1997.
[57] K. Zhao and J. X. Yu. All-in-one: Graph processing in RDBMSs revisited. In