A Relational Matrix Algebra and its Implementation in a Column Store
Oksana Dolmatova
University of Zürich, Zürich, Switzerland
[email protected]

Nikolaus Augsten
University of Salzburg, Salzburg, Austria
[email protected]

Michael H. Böhlen
University of Zürich, Zürich, Switzerland
[email protected]
ABSTRACT
Analytical queries often require a mixture of relational and linear algebra operations applied to the same data. This poses a challenge to analytic systems that must bridge the gap between relations and matrices. Previous work has mainly strived to fix the problem at the implementation level. This paper proposes a principled solution at the logical level. We introduce the relational matrix algebra (RMA), which seamlessly integrates linear algebra operations into the relational model and eliminates the dichotomy between matrices and relations. RMA is closed: All our relational matrix operations are performed on relations and result in relations; no additional data structure is required. Our implementation in MonetDB shows the feasibility of our approach, and empirical evaluations suggest that in-database analytics performs well for mixed workloads.
ACM Reference format:
Oksana Dolmatova, Nikolaus Augsten, and Michael H. Böhlen. 2020. A Relational Matrix Algebra and its Implementation in a Column Store. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, Portland, OR, USA, June 14-19, 2020 (SIGMOD'20), 16 pages. DOI: 10.1145/3318464.3389747
Much of the data stored in relational databases includes numerical parts that must be analyzed, for example, sensor data from industrial plants, scientific observations, or point of sales data. The analysis of these data, which are not purely numerical but also include important non-numerical values, demands mixed queries that apply relational and linear algebra operations to the same data.
Dealing with mixed workloads is challenging since the gap between relations and matrices must be bridged. Current relational systems are poorly equipped for this task. Previous attempts to deal with mixed workloads have focused on the implementation level, for example, by introducing ordered data types; by storing matrices in special relations or key-value structures; or by splitting queries into their relational and matrix parts. This paper resolves the gap between relations and matrices.

We propose a principled solution for mixed workloads and introduce the relational matrix algebra (RMA) to support complex data analysis within the relational model. The goal is to (1) solve the integration of relations and linear algebra at the logical level, (2) thus achieve independence from the implementation at the physical level, and (3) prove the feasibility of our model by extending an existing system. We are the first to achieve these goals: Other works focus on facilitating the transition between the relational and the linear algebra model. We eliminate the dichotomy between matrices and relations by seamlessly integrating linear algebra into the relational model. Our implementation of RMA in MonetDB shows the feasibility of our approach.

We define linear operations over relations and systematically process and maintain non-numerical information. We show that the relational model is well-suited for complex data analysis if ordering and contextual information are dealt with properly. RMA is purely based on relations and does not introduce any ordered data structures. Instead, the relevant row order for matrix operations is computed from contextual information in the argument relations. All relational matrix operations return relations with origins.
Origins are constructed from the contextual information (attribute names and non-numerical values) of the input relations and uniquely identify and describe each cell in the result relation.

We extend the syntax of SQL to support relational matrix operations. As an example, consider a relation rating with schema (User, Balto, Heat, Net) that stores users and their ratings for three films ("Balto", "Heat", and "Net", one column per film). The SQL query

SELECT * FROM INV(rating BY User);

orders the relation by users and computes the inversion of the matrix formed by the values of the ordered numerical columns. The result is a relation with the same schema: The values of attribute User are preserved, and the values of the remaining three attributes are provided by matrix inversion (see Section 5 for details). The origin of a numerical result value is given by the user name in its table row and the attribute name of its column.

At the system level, we have integrated our solution into MonetDB. Specifically, we extended the kernel with relational matrix operations implemented over binary association tables (BATs). The physical implementation of matrix operations is flexible and may be transparently delegated to specialized libraries that leverage the underlying hardware (e.g., MKL [2] for CPUs or cuBLAS [1] for GPUs). The new functionality is introduced without changing the main data structures and the processing pipeline of MonetDB, and without affecting existing functionality.

Our technical contributions are as follows:

• We propose the relational matrix algebra (RMA), which extends the relational model with matrix operations. This is the first approach to show that the relational model is sufficient to support matrix operations. The new set of operations is closed: All relational matrix operations are performed on relations and result in relations, and no additional data structure is required.

• We show that matrix operations are shape restricted, which allows us to systematically define the results of matrix operations over relations. We define row and column origins, the part of contextual information that describes values in the result relation, and prove that all our operations return relations with origins.

• We implement and evaluate our solution in detail. We show that our solution is feasible and leverages existing data structures and optimizations.

RMA opens new opportunities for advanced data analytics that combine relational and linear algebra functionality, speeds up analytical queries, and triggers the development of new logical and physical optimization techniques.

The paper is organized as follows. Sect. 2 discusses related work.
We introduce basics in Sect. 3 and the relational matrix algebra (RMA) in Sect. 4. We show an application example in Sect. 5, discuss important properties of RMA in Sect. 6, and its implementation in MonetDB in Sect. 7. We evaluate our solution in Sect. 8 and conclude in Sect. 9.
Relational DBMSs offer simple linear algebra operations, such as the pair-wise addition of attribute values in a relation. Some operations, e.g., matrix multiplication, can be expressed via syntactically complex and slow SQL queries. The set of operations is limited and does not include operations whose results depend on the row order. For instance, there are no SQL solutions for inversion or determinant computation. Complex operations must be programmed as UDFs. Ordonez et al. [25] suggest UDFs for linear regression with a matrix-like result type. The UTL_NLA package [24] for the Oracle DBMS offers linear algebra operations defined over UTL_NLA_ARRAY. UDFs provide a technical interface but do not define matrix operations over relations. No systematic approach to maintain contextual information is provided.

Luo et al. [19] extend SimSQL [8], a Hadoop-based relational system, with linear algebra functionality. RasDaMan [6, 21] manages and processes image raster data. Both systems introduce matrices as ordered numeric-only attribute types. Although relations and matrices coexist, operations are defined over different objects. Linear operations are not defined over unordered objects, and they do not support contextual information for individual cells of a matrix.

SciQL [32, 33] extends MonetDB [22] with a new data type, ARRAY, as a first-class object. An array is stored as an object on its own. Arrays have a fixed schema: The last attribute stores the data values of a matrix; all other attributes are dimension attributes and must be numeric. Arrays come with a limited set of operations, such as addition, filtering, and aggregation, and they must be converted to relations to perform relational operations. The presence of contextual information and its inheritance are not addressed.

The MADlib library [15] for in-database analytics offers a broad range of linear and statistical operations, defined as either UDFs with C++ implementations or Eigen library calls.
Matrix operations require a specific input format: Tables must have one attribute with a row id value and another array-valued attribute for matrix rows. Matrix operations return purely numeric results and cannot be nested.

Hutchison et al. [16] propose LARA, an algebra with tuple-wise operations, attribute-wise operations, and tuple extensions. LARA defines linear and relational algebra operations using the same set of primitives. This is a good basis for inter-algebra optimizations that span linear and relational operations. LARA offers a strong theoretical basis, works out properties of the solution, and allows to store row and column descriptions during the operations. The maintenance of contextual information is not considered for operations that change the number of rows or columns.

LevelHeaded [4, 5], an engine for relational and linear algebra operations, uses a special key-value structure: Each object has keys (dimension attributes) and annotations (value attributes). (Some approaches support multi-dimensional arrays; since we target linear algebra, we focus on two dimensions and use the term matrix throughout.) Dimension and value attributes are stored in a trie and a flat columnar buffer, respectively. Linear operations are available through an extended SQL syntax. Key values guarantee contextual information for rows. However, the trie key structure restricts relational operations: For example, aggregations of keys and join predicates over non-key attributes (i.e., subselects in SQL) are not allowed.

SciDB [30] is a DBMS that is based on arrays. Matrices and relations are implemented as nested arrays. SciDB focuses on efficient array processing and performs linear algebra operations over arrays. SciDB supports element-wise operations and selected linear operations, such as SVD. The system also offers relational algebra operations on arrays but cannot compete with relational DBMSs such as MonetDB in terms of performance.
A systematic approach to maintain contextual information is not considered.

Statistical packages, such as R [26] and pandas [20], offer a broad range of linear and relational algebra operations over arrays. Each cell may be associated with descriptive information, but this information is not always inherited as part of operations (e.g., usv). No systematic solution for associating contextual information with numeric results is provided. The most important relational operations are supported, but even basic optimizations (e.g., join ordering) are missing.

The R package RIOT-DB [31] uses MySQL as a backend and translates linear computations to SQL. RIOT-DB addresses the main memory limitations of R, and the optimization of SQL statements yields inter-operation optimization. However, it is difficult (or sometimes impossible) to express linear algebra operations in SQL, and only a few simple operations, such as subtraction and multiplication, are discussed.

AIDA [10] integrates MonetDB and NumPy [3] and exploits the fact that both systems use C arrays as an internal data structure: To avoid copying NumPy data to MonetDB, AIDA passes pointers to arrays. Data copying is still needed to pass MonetDB results to NumPy since MonetDB does not guarantee that multiple columns are contiguous in memory, which is required by NumPy. AIDA offers a Python-like procedural language for relational and linear operations. Sequences of relational operations are evaluated lazily, which allows AIDA to combine and optimize sequences of relational operations. The optimization does not include linear algebra operations.

SystemML [13] offers a set of linear algebra primitives that are expressed in a high-level, declarative language and are implemented on MapReduce. SystemML includes linear algebra optimizations that are similar to relational optimizations (e.g., selecting the order of execution of matrix multiplications). The system considers only linear algebra operations.
[Figure 1: Relation r with schema (O, V, W) and tuples (A, 30, 1), (C, 22, 5), (B, 10, 1); matrices d, e, and their concatenation d □ e.]

This section presents notation for relations and matrices, and introduces the basic matrix algebra operations.
A relation r is a set of tuples r_i with schema R. A schema, R = (A, B, ...), is a finite, ordered set of attribute names. A tuple r_i ∈ r has a value from the appropriate domain for each attribute in the schema. We write r_i.A to denote the value of attribute A in tuple r_i and r.A to denote the set of all values r_i.A in relation r. Ordered subsets of a schema, U ⊆ R, are typeset in bold. |r| is the number of tuples in relation r.

Let r be a relation and U ⊆ R be attributes that form a key of R. We write r_{U,k} to denote the k-th tuple of relation r sorted by the values of attributes U (in ascending order):

r_i = r_{U,k} ⟺ r_i ∈ r ∧ |{r_j | r_j ∈ r ∧ r_j.U < r_i.U}| = k − 1   (1)

The column cast ▽_U creates an ordered set L from the sorted values of an attribute U that forms a key in relation r:

L = ▽_U(r) ⟺ |L| = |r| ∧ ∀ 1 ≤ i ≤ |r| (L[i] = r_{U,i}.U)   (2)

The column cast is used to generate a schema from a set of values. We use this for operations tra, usv, and opd (see Table 2). The column cast is applicable if the cardinality of a list of attributes U is one.

Example 3.1.
Consider relation r in Figure 1. The third tuple of relation r sorted by the values of attribute V is r_{(V),3} = (A, 30, 1), the column cast of O is ▽_O(r) = (A, B, C), and the values of attribute W are r.W = {1, 5, 1}.

We use set notation and apply it to bags. Bags can be ordered or unordered. To emphasize the difference, parentheses are used for ordered bags (or lists), e.g., (3, 2, 3), and curly braces for unordered bags, e.g., {3, 2, 3}. When transitioning from unordered to ordered bags, the order is specified explicitly.

An n × k matrix m is a two-dimensional array with n rows and k columns. |m| is the number of rows, ‖m‖ the number of columns. The element in the i-th row and the j-th column of matrix m is m[i, j]; the i-th row is m[i, ∗]; the j-th column is m[∗, j].

We consider the operations from the R Matrix Algebra [27]: element-wise multiplication (EMU), matrix multiplication (MMU), outer product (OPD), cross product (CPD), matrix addition (ADD), matrix subtraction (SUB), transpose (TRA), solve equation (SOL), inversion (INV), eigenvectors (EVC), eigenvalues (EVL), QR decomposition (QQR, RQR), singular value decomposition, SVD (DSV, USV, VSV), determinant (DET), rank (RNK), and Choleski factorization (CHF). Note that QR and SVD return more than one matrix; therefore we split the operations: QQR and RQR return matrix Q and matrix R of the QR decomposition, respectively; DSV, USV, and VSV return vector D with the singular values, matrix U with the left singular vectors, and matrix V with the right singular vectors of SVD, respectively.

The matrix concatenation of matrices m and n with k rows each returns a matrix h with k rows. The i-th row of h is the concatenation of the i-th row of m and the i-th row of n:

h = m □ n ⟺ |h| = |m| ∧ ∀ 1 ≤ i ≤ |h| (h[i, ∗] = m[i, ∗] ∘ n[i, ∗])   (3)

The schema cast ∆_U of attributes U creates a matrix m (with a single column) from the attribute names of U:

m = ∆_U ⟺ ‖m‖ = 1 ∧ |m| = |U| ∧ ∀ 1 ≤ i ≤ |U| (m[i, 1] = U[i])   (4)

Example 3.2.
Consider attributes U = (D, B). Matrix d in Figure 1 is the result of the schema cast d = ∆_U. The result of concatenating matrix d and matrix e is d □ e. Note that the row and column numbers (cells shaded in gray) in the matrix illustrations are not part of the matrix.

Matrix operations are shape restricted, i.e., the number of result rows is equal to the number of rows of one of the input matrices (r), the number of columns of one of the input matrices (c), or one (1). The same holds for the number of result columns.

The dimensionality of result matrices defines the shape type of matrix operations. We write r1 if the result dimensionality is equal to the number of rows in the first matrix, r2 if the result dimensionality is equal to the number of rows in the second matrix, and r∗ if the result dimensionality is equal to the number of rows in the first and second matrix (i.e., r1 = r2). The same notation holds for the number of columns. Table 1 summarizes the shape types of matrix operations.

Example 3.3. Matrix multiplication has shape type (r1, c2), which states that the number of result rows is equal to the number of rows of the first argument matrix, and the number of columns is equal to the number of columns of the second argument matrix. Matrix addition has shape type (r∗, c∗), which states that the number of result rows is equal to the number of rows of the first matrix and the number of rows of the second matrix.

Table 1: Shape types of matrix operations

Cardinalities                    | Shape type | Operations
|i×j| → |i×i|                    | (r1, r1)   | USV
|i×j|, |i2×j| → |i×i2|           | (r1, r2)   | OPD
|i×i| → |i×i|                    | (r1, c1)   | INV, EVC, CHF
|i×j| → |i×j|                    | (r1, c1)   | QQR
|i×j|, |j×j2| → |i×j2|           | (r1, c2)   | MMU
|i×i| → |i×1|                    | (r1, 1)    | EVL
|i×j| → |i×1|                    | (r1, 1)    | VSV
|i×j| → |j×i|                    | (c1, r1)   | TRA
|i×j| → |j×j|                    | (c1, c1)   | RQR, DSV
|i×j|, |i×j2| → |j×j2|           | (c1, c2)   | CPD
|i×j|, |i×1| → |j×1|             | (c1, c2)   | SOL
|i×j|, |i×j| → |i×j|             | (r∗, c∗)   | EMU, ADD, SUB
|i×j| → |1×1|                    | (1, 1)     | DET
|i×j| → |1×1|                    | (1, 1)     | RNK

All operations of the matrix algebra are shape restricted. This follows directly from the definitions of the matrix operations [14]. The first column of Table 1 lists the relevant cardinalities from these definitions. We use shape restriction to determine the inheritance of contextual information. It has also been used in size propagation techniques [7] for the purpose of cost-based optimization of chains of matrix operations.

To seamlessly integrate matrix operations into the relational model, we extend the relational algebra to the relational matrix algebra (RMA). For each of the matrix operations we define a corresponding relational matrix operation in RMA: emu, mmu, opd, cpd, add, sub, tra, sol, inv, evc, evl, qqr, rqr, dsv, usv, vsv, det, rnk, chf. We use upper case for matrix operations (e.g., TRA) and lower case for RMA operations (e.g., tra). RMA includes both relational algebra and relational matrix operations. The new operations behave like regular operations with relations as input and output.

For each argument relation, r, of a relational matrix operation one parameter must be specified: The order schema U ⊆ R imposes an order on the tuples for the purpose of the operation. The attributes of the order schema must
The attributes of the order schema mustform a key . The attributes of relation r that are not partof the order schema U , i.e., U = R − U form the applica-tion schema . The application schema identifies the attributeswith the data to which the matrix operation is applied. Attributes that neither belong to the order schema nor the applicationschema must be dropped explicitly with a projection (or added to the orderschema, thus, forming a super key). e order schema U ⊆ R splits relation r into four non-overlapping areas: Order schema U ; order part r . U ; applica-tion schema U ; and application part r . U . The parts of r thatdo not include matrix values, i.e., the order and applicationschemas ( U and U ) and the order part ( r . U ), form the con-textual information for application part r . U . Intuitively, theorder schema and application schema provide context forcolumns while the order part provides context for rows. r T H W applicationschemaapplicationpartorderpartorderschema
Figure 2: Structure of a relation instance
Example 4.1.
Order schema U = (T) splits relation r in Figure 2 into four parts: order schema U = (T), application schema U̅ = (H, W), order part r.U = r.(T) = {5am, 6am, 7am, 8am}, and application part r.U̅ = r.(H, W) = {(1, 3), (1, 4), (6, 7), (8, 5)}.

Figure 3 summarizes our approach for the inv operation and example relation r′ = σ_{T>6am}(r): (1) Two matrix constructors define matrices m and n that correspond to order and application part of r′, respectively; (2) INV inverts matrix n, resulting in matrix h; and (3) the relation constructor combines m □ h and R into result relation v.

Definition 4.2. (Matrix constructor) Let r be a relation and U be an order schema. The matrix constructor µ_U(r) returns a matrix that includes the values of r.U̅ sorted by U:

m = µ_U(r) ⟺ |m| = |r| ∧ ∀ 1 ≤ i ≤ |r| (m[i, ∗] = r_{U,i}.U̅)

We use the complement notation µ̄_U(r) to denote the matrix that includes the values of r.U sorted by U.

Example 4.3.
Consider Figure 3 with relation instance r′ = σ_{T>6am}(r) and schema R = (T, H, W). The matrix constructor µ_T(r′) returns matrix n.

Definition 4.4. (Relation constructor) Let m be a matrix with unique rows, and R be a relation schema with ‖m‖ attributes. The relation constructor γ(m, R) returns relation r with schema R:

r = γ(m, R) ⟺ |m| = |r| ∧ ∀t (t ∈ r ⟺ ∃ 1 ≤ i ≤ |m| (t = m[i, ∗]))

Example 4.5. In Figure 3, a relation constructor is applied to schema R and the concatenated matrices m □ h to construct the result relation: v = γ(m □ h, R).

Matrix and relation constructors map between relations and matrices. We use constructors and matrices to define relational matrix operations and to analyze their properties. At the implementation level, constructors are very efficient since they split and combine lists of attribute names and do not access the data (cf. Section 7).

Relational matrix operations offer the functionality of matrix operations in a relational context. The general form of a unary relational matrix operation is op_U(r), where U is the order schema. A binary operation op_{U;V}(r, s) has an additional order schema V for argument relation s.

The result of a relational matrix operation is a relation that consists of (a) the base result of the corresponding matrix operation, and (b) contextual information, appropriately morphed from the contextual information of the argument relations to reflect the semantics of the operation.

Definition 4.6. (Base result) Consider a unary relational matrix operation op_U(r). The matrix that is the result of matrix operation OP(µ_U(r)) is the base result of op_U(r). The base result for binary operations is defined analogously.

Example 4.7.
Consider inv_T(σ_{T>6am}(r)) in Fig. 3. The base result is matrix h, which results from INV(µ_T(σ_{T>6am}(r))).

Table 2 defines the details of how contextual information is maintained in relational matrix operations. All definitions follow the structure illustrated in Figure 3. A result relation is composed from order parts, base result, and schemas with the help of a relation constructor. For example, inv is defined according to its shape type in Table 2: inv_U(r) = γ(µ̄_U(r) □ INV(µ_U(r)), U ∘ U̅), where µ̄_U(r) are the rows of the order part, INV(µ_U(r)) is the base result, and U ∘ U̅ is the result schema.

Operations that have a different number of rows than any of the input relations add a new attribute C to the result relation. This attribute C is for contextual information (cf. Example 4.8): Its values are either the attribute names of the application schema of an input relation or the operation name. The operations add, sub, emu require union compatible application schemas and non-overlapping order schemas. Operations usv, opd, and tra construct the application schema of the result from the order schema of an input relation. Therefore the cardinality of the order schemas U of tra and usv, and V of opd must be one.

Example 4.8.
Consider Figures 2 and 4. Figure 4a illustrates the result of qqr_T(r). The values of T define the ordering of tuples for this operation. The values of H and W are the values of matrix Q computed as part of the QR decomposition. Figure 4b illustrates the result of tra_T(r). The column cast ▽_T(r) of ordering attribute T provides names for the attributes in the transposed relation. The result relation has a new attribute C whose values are the names of the attributes in the application schema of r. Note that all result relations come with sufficient contextual information for each value. For example, relation r in Figure 2 records that Humidity (H) was 1 at 6am, which is also recorded in the transposed relation in Figure 4b.

[Figure 3: Structure of our solution for the inversion example, v = inv_T(σ_{T>6am}(r)): r′ = σ_{T>6am}(r) with schema R = (T, H, W); m = µ̄_T(r′); n = µ_T(r′); h = INV(n); g = m □ h; v = γ(g, T ∘ T̅).]

Table 2: Splitting and morphing relations and matrices

Shape type | Operations           | Definition
(r1, r1)   | usv                  | op_U(r) = γ(µ̄_U(r) □ OP(µ_U(r)), U ∘ ▽_U(r))
(r1, r2)   | opd                  | op_{U;V}(r, s) = γ(µ̄_U(r) □ OP(µ_U(r), µ_V(s)), U ∘ ▽_V(s))
(r1, c1)   | inv, evc, chf, qqr   | op_U(r) = γ(µ̄_U(r) □ OP(µ_U(r)), U ∘ U̅)
(r1, c2)   | mmu                  | op_{U;V}(r, s) = γ(µ̄_U(r) □ OP(µ_U(r), µ_V(s)), U ∘ V̅)
(r1, 1)    | evl, vsv             | op_U(r) = γ(µ̄_U(r) □ OP(µ_U(r)), U ∘ ('op'))
(c1, r1)   | tra                  | op_U(r) = γ(∆_U̅ □ OP(µ_U(r)), (C) ∘ ▽_U(r))
(c1, c1)   | rqr, dsv             | op_U(r) = γ(∆_U̅ □ OP(µ_U(r)), (C) ∘ U̅)
(c1, c2)   | cpd, sol             | op_{U;V}(r, s) = γ(∆_U̅ □ OP(µ_U(r), µ_V(s)), (C) ∘ V̅)
(r∗, c∗)   | emu, add, sub        | op_{U;V}(r, s) = γ(µ̄_U(r) □ µ̄_V(s) □ OP(µ_U(r), µ_V(s)), U ∘ V ∘ U̅)
(1, 1)     | det, rnk             | op_U(r) = γ(('r') ∘ OP(µ_U(r)), (C, 'op'))

(a) QR decomposition qqr_T(r) with schema (T, H, W); (b) Transpose tra_T(r):

C | 5am | 6am | 7am | 8am
H |  1  |  1  |  6  |  8
W |  3  |  4  |  7  |  5
Figure 4: Examples of relational matrix operations
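The matrix and relation constructors and the composition in Figure 3 can be sketched in Python/NumPy. This is a simplified illustration under our own encoding (relations as lists of dicts, matrices as NumPy arrays; the helpers `mu` and `gamma` are hypothetical names, not the paper's implementation):

```python
import numpy as np

# Sketch of the matrix constructor µ_U(r) (Definition 4.2) and the relation
# constructor γ(m, R) (Definition 4.4), followed by the inversion pipeline
# of Figure 3.
def mu(r, order, application):
    """µ_U(r): application-part values, rows sorted by the order schema U."""
    rows = sorted(r, key=lambda t: tuple(t[a] for a in order))
    return np.array([[t[a] for a in application] for t in rows], dtype=float)

def gamma(m, schema):
    """γ(m, R): turn a matrix with unique rows into a relation over schema R."""
    return [dict(zip(schema, row)) for row in m.tolist()]

r = [{"T": "5am", "H": 1, "W": 3}, {"T": "6am", "H": 1, "W": 4},
     {"T": "7am", "H": 6, "W": 7}, {"T": "8am", "H": 8, "W": 5}]

r1 = [t for t in r if t["T"] > "6am"]        # r' = σ_{T>6am}(r)
n = mu(r1, ["T"], ["H", "W"])                # application part of r'
h = np.linalg.inv(n)                         # base result INV(n)
order_part = sorted(t["T"] for t in r1)      # order part (matrix m)
v = [{"T": t, **rel} for t, rel in zip(order_part, gamma(h, ["H", "W"]))]
```

For r′ this yields the tuples (7am, −0.19, 0.27) and (8am, 0.31, −0.23) (rounded), matching the result relation v discussed in the text.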
This section gives an application example with a mixed workload that combines relational and linear algebra operations. It maintains all data in regular relations and illustrates the importance of maintaining contextual information.

Consider relations u, f, and r in Figure 5. Relation u records name, state, and year of birth of users; relation f records title, release year, and director of films; relation r records user ratings for films. Tuple u1 states that user Ann lives in California and was born in 1980; tuple f1 states that film Heat was directed by Lee and was released in 1995; tuple r1 states that Ann's ratings for Balto, Heat, and Net are, respectively, 2.0, 1.5, and 0.5.

u (user):
   User | State | YoB
u1 Ann  | CA    | 1980
u2 Tom  | FL    | 1965
u3 Jan  | CA    | 1970

f (film):
   Title | RelY | Director
f1 Heat  | 1995 | Lee
f2 Balto | 1995 | Lee
f3 Net   | 1995 | Smith

r (rating):
   User | Balto | Heat | Net
r1 Ann  | 2.0   | 1.5  | 0.5
r2 Tom  | 0.0   | 0.0  | 1.5
r3 Jan  | 1.0   | 4.0  | 1.0
Figure 5: Example database
The task is to determine how similar each of Lee's films is to any other film, based on the ratings from California users. The covariance [18] is used to compute this similarity. In addition, we need relational algebra operations (e.g., selection σ, aggregation ϑ, rename ρ, and join ⋈) to retrieve selected ratings and films, aggregate ratings, and combine information from different tables. The key observation is that a mixture of matrix and relational operations is required to determine the similarities of the ratings.

The solution in Figure 6 includes three key steps: data preparation (w1, w2), covariance computation (w3 to w7), and the extraction of Lee's films (w8). The entire process frequently switches between linear and relational operations.

w1 = π_{U,B,H,N}(σ_{S='CA'}(u ⋈ r))
w2 = ϑ_{AVG(B),AVG(H),AVG(N)}(w1)
w3 = π_{U,B,H,N}(sub_{U;V}(w1, ρ_V(π_U(w1)) × w2))
w4 = tra_U(w3)
w5 = mmu_{C;U}(w4, w3)
w6 = w5 × ρ_M(ϑ_{COUNT(∗)}(w3))
w7 = π_{C,B/(M−1),H/(M−1),N/(M−1)}(w6)
w8 = π_{T,B,H,N}(σ_{D='Lee'}(w7 ⋈_{C=T} f))

Figure 6: Computing the similarity of the ratings
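The numerical core of this pipeline can be sketched with NumPy (a simplified stand-in, not the RMA implementation: the relational steps of Figure 6 are done in plain Python, and only the CA ratings from the example database are used):

```python
import numpy as np

# Sketch of Figure 6: covariance of film ratings over California users.
# sub, tra, and mmu correspond to the centering, transpose, and product.
users = {"Ann": "CA", "Tom": "FL", "Jan": "CA"}
ratings = {"Ann": [2.0, 1.5, 0.5], "Tom": [0.0, 0.0, 1.5], "Jan": [1.0, 4.0, 1.0]}

# w1: join u and r, keep ratings of California users (columns: Balto, Heat, Net)
ca = sorted(u for u, state in users.items() if state == "CA")
x = np.array([ratings[u] for u in ca])
# w2, w3: subtract the per-film average rating (sub)
xc = x - x.mean(axis=0)
# w4, w5: transpose (tra) and multiply (mmu)
prod = xc.T @ xc
# w6, w7: divide by n - 1 with relational operations
cov = prod / (len(ca) - 1)
# w8 would select the rows of cov that correspond to Lee's films
```

The result agrees with the sample covariance of the rating columns; the RMA version additionally carries the film names as contextual information through every step.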
In the following we discuss the algebra expressions in Figure 6. First, we join u and r to select ratings from California users (w1). The covariance of two attributes is cov(X, Y) = 1/(n−1) · [(X − E[X])ᵀ ∗ (Y − E[Y])]. The expectation of an attribute, e.g., E(H) = ϑ_{AVG(H)}(...), is computed via aggregation (w2). Relational matrix operations sub, tra, and mmu are used to subtract (X − E[X]), transpose (ᵀ), and multiply (∗) relations (w3, w4, w5). Relational operations divide the result by n − 1 (w6, w7). Finally, we join w7 with f to select Lee's films (w8).

Figure 7 illustrates relations w3, w4, and w8. Consider transpose tra_U(w3) with order schema U and application schema U̅ = (B, H, N). The result of this operation is a relation w4 with schema (C, Ann, Jan). The values of attribute C are the attribute names in the application schema of w3. The final covariances are stored in w8. For example, tuple z1 states that Lee's film Balto has the smallest covariance to film Net.

This section defines two crucial requirements for relational matrix operations.
[Figure 7: Steps during the computation — relations w3 with schema (U, B, H, N), w4 with schema (C, Ann, Jan), and w8 with schema (T, B, H, N; tuples z1, z2). We use the first character of the attribute name to refer to attributes.]

Matrix consistency guarantees that the result of a relational matrix operation can be reduced to the result of the corresponding matrix operation. Origins guarantee that each result relation includes sufficient inherited contextual information to relate argument and result relation. We prove that each relational matrix operation is matrix consistent and returns a relation with origins.
Matrix consistency ensures that the result relation includes all cell values that are present in the base result and that the order of rows in the base result can be derived from contextual information in the result relation. First, we define reducibility to transition from relations to matrices.

Definition 6.1. (Reducible) Let r be a relation and U be an order schema. Relation r is reducible to matrix m iff m can be constructed from the attribute values of U̅ in relation r sorted by U:

r →_U m ⟺ µ_U(r) = m

Example 6.2. Consider Fig. 3 with relation r′ = σ_{T>6am}(r), matrix n, and order schema T. From Example 4.3 we have µ_T(r′) = n. Relation r′ is reducible to matrix n since n can be constructed from the values of H and W in the argument relation sorted by T, i.e., r′ →_T n.

Definition 6.3. (Matrix consistency) Consider a unary matrix operation OP(m). The corresponding relational matrix operation op is matrix consistent iff for all relations r that are reducible to matrix m, the result relation op_U(r) is reducible to OP(m):

∀ r, m, U (r →_U m ⟹ ∃ U′ (op_U(r) →_{U′} OP(m)))

A binary relational matrix operation is matrix consistent iff its result is reducible to the result of the corresponding binary matrix operation.
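A minimal executable check of this property for inv on the running example, under the same NumPy encoding as before (our own helper code, not the paper's implementation), reduces the result relation back to a matrix and compares it with INV of the argument matrix:

```python
import numpy as np

# Matrix consistency (Definition 6.3) for inv on r' = σ_{T>6am}(r):
# the result relation of inv_T is reducible (via order schema T)
# to INV of the matrix that r' reduces to.
r_prime = [("7am", 6.0, 7.0), ("8am", 8.0, 5.0)]        # tuples (T, H, W)
n = np.array([[h, w] for _, h, w in sorted(r_prime)])   # µ_T(r'): sort by T
inv_rel = [(t, *row) for (t, _, _), row in
           zip(sorted(r_prime), np.linalg.inv(n).tolist())]
# reduce the result relation back to a matrix and compare with INV(n)
reduced = np.array([[h, w] for _, h, w in sorted(inv_rel)])
assert np.allclose(reduced, np.linalg.inv(n))
```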
Example 6.4.
Consider Figures 2 and 8 with relation r, matrix g, matrix RQR(g), and relation rqr_T(r).

• r →_T g: relation r is reducible to matrix g.
• rqr_T(r) →_C RQR(g): relation rqr_T(r) is reducible to matrix RQR(g).

[Figure 8: Example of matrix consistency — matrix RQR(g) and relation rqr_T(r) with schema (C, H, W) and rows (H, -10.1, -8.8) and (W, 0.0, -4.6)]

The result of a relational matrix operation is a relation that, in addition to the base result, includes a row origin and a column origin. Origins (1) uniquely define the relative positioning of result values, (2) give a meaning to values with respect to the applied operation, and (3) establish a connection between the argument relations of an operation and its result relation.
Example 6.5.
Consider inversion and result relation v in Figure 3. Values 7am and 8am show that (1) value -0.19 precedes value 0.31 because 7am precedes 8am; (2) -0.19 is the inversion value for humidity and for time 7am; (3) value -0.19 in relation v is connected to value 6 in the argument relation since they have the same origins (7am and H).

Origins are either inherited order schemas, application schemas from argument relations, or constants. The shape type of an operation determines the cardinality of the inherited contextual information. The indices in the shape type specify the input relation from which an origin is inherited. For example, if the first element of the shape type is c1, the row origin is the schema cast of the application schema of the first argument relation. Note that indices * and 2 are only possible for binary operations.

Definition 6.6 (Origins). Consider a unary, v = op_U(r), or binary, v = op_{U;V}(r, s), matrix consistent operation with shape type (x, y), base result m, and attribute list U′ such that v →_{U′} m. Consider Table 3. v.U′ is a row origin iff v.U′ is equal to ro for the given shape type x. An attribute list U″ is a column origin iff U″ is equal to co for the given shape type y.

Table 3: Definition of origins for shape type (x, y)

  x:   r1     r2     c1    c2    r*            c*          1
  ro:  r.U    s.V    ∆U    ∆V    (r.U, s.V)    (∆U, ∆V)    'r'

  y:   r1     r2     c1    c2    r*    c*    1
  co:  ▽U     ▽V     U     V     ▽U    U     'op'

Example 6.7.
Figure 9 illustrates relation r and the origins for the operations

• rnk_H(π_{H,W}(r)) with shape type (1, 1)
• usv_T(r) with shape type (r1, r1)
• qqr_{W,T}(r) with shape type (r1, c1)

Column origins (co) are marked by rectangles (all values inside a rectangle together form the column origin of the relation). Row origins (ro) are marked by ellipses.

[Figure 9: Examples of origins —
p = rnk_H(π_{H,W}(r)): ro = 'r' = (r), co = 'op' = (rnk);
p = usv_T(r): ro = r.U = (5am, 6am, 7am, 8am), co = ▽U = (5am, 6am, 7am, 8am);
p = qqr_{W,T}(r): ro = r.U = ((3,5am), (4,6am), (5,8am), (7,7am)), co = U = (H)]
For p = usv_T(r), we have U = (T), ∆U = (H, W), U′ = (T), and U″ = (5am, 6am, 7am, 8am). The shape type of usv is (r1, r1) (see Table 2); this makes p.T = r.T a row origin and ▽T = (5am, 6am, 7am, 8am) a column origin. Theorem 6.8.
All relational matrix operations return matrix consistent relations with a row and a column origin.
Proof.
First, we prove that relational matrix operations are matrix consistent. Second, we show that the inherited contextual information in the result corresponds to Definition 6.6.

(1) Consider a unary operation with shape type (x, y). We start with the definition of matrix consistency (Definition 6.3), with U′ equal to U (x = r1) or C (x = c1), and U″ equal to U (y = c1) or ▽U (y = r1). We instantiate the implication with the definition of the operation in Table 2. Then, we simplify the right-hand side of the implication, substituting it with the equality in Definition 6.1. Next, we expand the equality with the definitions of the relational and matrix constructors (Definitions 4.2 and 4.4). A series of simplifications yields the equality of the right- and left-hand sides.

(2) Consider operation v = qqr_U(r) with shape type (r1, c1). Matrix consistency of qqr was shown in the first part of the proof. The shape type of qqr is (r1, c1) and we get U′ = U and U″ = U. v.U is the row origin because x = r1, and U is the column origin because y = c1 (see Table 3).

The same reasoning applies to the other types of operations in Table 2. Thus, each relational matrix operation returns a result relation with a row and a column origin. □

The following example uses a sequence of tra operations to illustrate the importance of origins. A result relation with origins inherits sufficient contextual information such that each value can be interpreted. Origins also carry sufficient information about the order of rows, such that in sequences of relational matrix operations no ordering information is lost between operations. Example 6.9.
Consider a relation instance r that is reducible to matrix n. Figure 10 shows the matrix expression TRA(TRA(n)) and the respective relational matrix expression tra_C(tra_T(r)). Operation tra_T(r) returns relation r1, which in addition to the application schema (attributes 5am, 6am, 7am, 8am) also includes attribute C, which is preserved together with the application schema.

[Figure 10: Origins and matrix consistency — relation r (T, H, W) is reducible to matrix n (r →_T n); tra_T(r) yields relation r1 (C, 5am, 6am, 7am, 8am) with r1 →_C TRA(n); applying tra_C to r1 corresponds to TRA(TRA(n))]
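The commutation behind Figure 10 can be checked with a small sketch: reducing the relation and transposing the matrix gives the same result as transposing the relation and then reducing it. The list-of-dicts representation and helper names are illustrative assumptions (a single-attribute order schema for simplicity); the data follows the running example.

```python
def mu(rel, U):
    """Reduce a relation (list of dicts) to a matrix, sorted by U."""
    attrs = list(rel[0].keys())
    app = [a for a in attrs if a not in U]
    rows = sorted(rel, key=lambda t: tuple(t[a] for a in U))
    return [[t[a] for a in app] for t in rows]

def TRA(m):
    """Plain matrix transpose."""
    return [list(col) for col in zip(*m)]

def tra(rel, U):
    """Relational transpose (sketch, |U| = 1): order values of U become
    column names, application attribute names move into a new column C
    that serves as the inherited origin."""
    attrs = list(rel[0].keys())
    app = [a for a in attrs if a not in U]
    rows = sorted(rel, key=lambda t: tuple(t[a] for a in U))
    out = []
    for a in app:
        tup = {"C": a}
        for t in rows:
            tup[t[U[0]]] = t[a]
        out.append(tup)
    return out

r = [{"T": "5am", "H": 1, "W": 3}, {"T": "6am", "H": 1, "W": 4},
     {"T": "7am", "H": 6, "W": 7}, {"T": "8am", "H": 8, "W": 5}]
r1 = tra(r, ["T"])
# Matrix consistency: the result relation reduces to TRA of the base matrix.
assert mu(r1, ["C"]) == TRA(mu(r, ["T"]))
```

Because r1 keeps attribute C, the second transpose tra_C(r1) has the ordering information it needs; without origins, the row order of the intermediate result would be lost.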
We discuss the integration of our solution into MonetDB. The implementation of relational matrix operations includes the processing of contextual information and the computation of the base result. Contextual information is handled inside MonetDB, while the computation of the base result can be done in MonetDB or delegated to external libraries (e.g., MKL). The integration of each relational matrix operation requires extensions throughout the system, but does not change the query processing pipeline, and no new data structures are introduced. To extend MonetDB with addition, QR decomposition, linear regression, and the transformation of numerical data to the MKL format, we touch 20 (out of 4500) files and add 2500 lines of code.
MonetDB stores each column of a table as a binary association table (BAT). A BAT is a table with two columns: head and tail. The head is a column with object identifiers (OIDs), while the tail is a column with attribute values. All attribute values of a tuple in a relation have the same OID value. Thus, a tuple can be constructed by concatenating all tail values with the same OID. MonetDB operations manipulate BATs, and relational operations are represented and executed as sequences of BAT operations. Example BAT operations are B1 ∗ B2, B1 / B2, and B1 − B2 for element-wise multiplication, division, and subtraction, and sum(B) to sum the values in BAT B. Example 7.1.
MonetDB stores relation σ_{T>6am}(r) from Figure 3 in three BATs as shown in Figure 11.

[Figure 11: BAT representation of σ_{T>6am}(r) — one BAT with columns (OID, Val) for each of the attributes T, H, and W]

One important BAT operation is leftfetchjoin (↓), which returns a BAT with OIDs sorted according to the order of OIDs of another BAT from the same relation. For instance, X↓Y returns BAT X, whose OIDs have the same order as the OIDs of BAT Y. X↓X denotes X sorted by its own values.

As a first step, we have extended the SQL parser to make the relational matrix operations available in the from clause of SQL [9]. The syntax (r BY U) specifies the ordering for an argument relation r. As an example, consider relations r and s and ordering attributes U and V. The unary operation inv_U and the binary operation mmu_{U;V} are expressed as:

SELECT * FROM INV (r BY U);
SELECT * FROM MMU (r BY U, s BY V);

These basic constructs can be composed into more complex expressions. For instance, the following expression folds relations w1, w3, and w4 (cf. Figure 7):

π_{C, B/(M−1), H/(M−1), N/(M−1)}(mmu_{C;U}(w4, w3) × ρ_M(ϑ_{COUNT(∗)}(w1)))

The SQL translation of this expression is:
SELECT C, B/(M-1), H/(M-1), N/(M-1)
FROM MMU(w4 BY C, w3 BY U) AS w5
     CROSS JOIN (SELECT COUNT(*) AS M FROM w1) AS t;

Algorithm 1 processes a node that represents a unary relational matrix operation op_U(r) and translates it to a list of BAT expressions. In lines 2-7, the BATs of relation r are split, sorted, and morphed to get BATs X with row origins and BATs Y with the application part. Splitting (lines 2 and 4) divides a relation into two parts: the application part, on which the matrix operations are performed, and the contextual information, which gives a meaning to the application part. BATs B are split into application part and order part according to U. Sorting (lines 3 and 4) determines the order of the tuples for a specific matrix operation. The order schema U is used to sort the BATs: BATs in U are sorted according to their values, while the other BATs in B are sorted according to the OIDs of the BATs in U. The order is established for each operation based on the contextual values in the relation. Morphing (lines 5-7) transforms contextual information so that it can be added to the base result. Finally, the matrix operation is applied to Y (line 8). Merging (line 9) combines the result of a matrix operation with the relevant contextual information and constructs the result relation with row and column origins. Merging and splitting are efficient operations that work at the schema level and do not access the data.
Algorithm 1: UnaryRMA(op, U, r)
1: B ← BATs(r); Y ← {}
2: D ← {b | b ∈ B, b is in order schema U}
3: G ← sort(D)
4: for b ∈ B \ D do Y ← Y ∪ b↓G
5: if ShapeType(op) ∈ {(r,r), (r,c), (r,1)} then X ← G
6: else if ShapeType(op) ∈ {(c,r), (c,c)} then X ← newBAT(Y)
7: else X ← newBAT(r)
8: F ← eval(op, Y)
9: return Concat(X, F)

Example 7.2.
Figure 12 illustrates Algorithm 1 for v = inv_T(σ_{T>6am}(r)). Splitting: input list B = (T, H, W) is split into order list D = (T) and application list B \ D = (H, W). Sorting: BAT T is sorted, producing G. Then, (H, W) are sorted according to G, returning (H↓T, W↓T). Morphing: since inv is of shape type (r1, c1), the row contextual information is the order schema: X = T↓T. Merging: BATs X are concatenated with BATs F to form the result.

Line 8 of Algorithm 1 calls the procedure that computes the matrix operation. The computation can be done either in the kernel of MonetDB or by calling an external library (e.g., MKL [2]). Calling an external library requires copying data from BATs to the external format and copying the result back. The query optimizer decides about external library calls based on the complexity of the operation, the amount of data to be copied, and the relative performance of the matrix operation in MonetDB compared to the external library. The no-copy implementation of matrix operations in the kernel of MonetDB operates over BATs directly. Essentially, standard algorithms must be reduced to BAT operations. The reduction process is highly dependent on the operation. The goal is to design algorithms that access entire columns and minimize accesses to single elements of BATs. To achieve this, standard value-based algorithms must be transformed into vectorized BAT operations.
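The split/sort/morph/merge pipeline of Algorithm 1 can be sketched in a few lines. BATs are modeled here as lists of (oid, value) pairs, and `unary_rma`, `leftfetchjoin`, and the toy operation are illustrative assumptions, not MonetDB code; the sketch covers only the shape-type (r1, c1) case.

```python
def leftfetchjoin(x, y):
    """X down-arrow Y: reorder BAT x so its OIDs follow the OID order of BAT y."""
    by_oid = {oid: val for oid, val in x}
    return [(oid, by_oid[oid]) for oid, _ in y]

def unary_rma(op, U, bats):
    """Sketch of Algorithm 1 for a shape-type (r1, c1) operation."""
    D = {a: b for a, b in bats.items() if a in U}                   # splitting
    G = {a: sorted(b, key=lambda p: p[1]) for a, b in D.items()}    # sorting
    order = G[U[0]]
    Y = {a: leftfetchjoin(b, order)                                 # app part follows
         for a, b in bats.items() if a not in U}                    # the order BAT
    X = G                                                           # morphing: row origin
    F = op(Y)                                                       # eval base result
    return {**X, **F}                                               # merging

# toy base operation: negate every application value
neg = lambda Y: {a: [(o, -v) for o, v in b] for a, b in Y.items()}

bats = {"T": [(0, "8am"), (1, "7am")],
        "H": [(0, 8), (1, 6)],
        "W": [(0, 5), (1, 7)]}
v = unary_rma(neg, ["T"], bats)
# T is sorted (7am, 8am); H and W follow T's OID order before negation.
```

Splitting and merging only regroup column references, which matches the paper's observation that these steps work at the schema level without touching the data.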
Algorithm 2:
INV(B)
1: n ← B.length; BR ← IDmatrix(n)
2: for i = 1 to n do
3:   v ← sel(B_i, i); B_i ← B_i / v; BR_i ← BR_i / v
4:   for j = 1 to n do
5:     if i ≠ j then
6:       v ← sel(B_j, i); B_j ← B_j − B_i ∗ v; BR_j ← BR_j − BR_i ∗ v
7: return BR
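A runnable sketch of Algorithm 2, with columns as plain Python lists (an illustrative model of BATs, not MonetDB code; like the pseudocode it omits pivoting):

```python
def inv(B):
    """Gauss-Jordan elimination expressed as whole-column (BAT-style)
    operations. B is a list of n columns of length n; returns the
    columns of the inverse."""
    n = len(B)
    B = [col[:] for col in B]  # work on copies
    BR = [[1.0 if r == i else 0.0 for r in range(n)] for i in range(n)]
    for i in range(n):
        v = B[i][i]                                  # sel(B_i, i)
        B[i] = [x / v for x in B[i]]                 # B_i <- B_i / v
        BR[i] = [x / v for x in BR[i]]
        for j in range(n):
            if i != j:
                w = B[j][i]                          # sel(B_j, i)
                B[j] = [x - y * w for x, y in zip(B[j], B[i])]
                BR[j] = [x - y * w for x, y in zip(BR[j], BR[i])]
    return BR

# columns of M = [[1, 2], [3, 4]]; M^{-1} = [[-2, 1], [1.5, -0.5]]
out = inv([[1.0, 3.0], [2.0, 4.0]])
# out holds the columns of the inverse: [[-2.0, 1.5], [1.0, -0.5]]
```

Each update touches a whole column at once, which is the vectorized access pattern the text asks for; only sel reads a single element.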
INV computation. The al-gorithm takes a list of BATs B = ( B , B , .., B n ) and returnsthe inversion as a list of BATs BR of the same size. Func-tion IDmatrix ( n ) creates a list of BATs that represents theidentity matrix of size n × n . The selection operation sel ( B , i ) returns the i th value in B . With the exception of the sel oper-ation, all operations are standard MonetDB BAT operationsthat are also used for relational queries. For example, theoperation on B i ← B i / v ; divides each element of a BATwith a scalar value. Setup.
All runtimes are averages over 3 runs on an Intel(R)Xeon(R) E5-2603 CPU, 1.7 GHz, 12 cores, no hyper-threading,98 GB RAM (L1: 32+32K, L2:256K, L3:15360K), Debian 4.9.189.
Competitors.
We empirically compare the implementationsof our relational matrix algebra (
RMA+ ) with the statisti-cal package R , the array database SciDB , and two state-of-the-art in-database solutions,
AIDA [10] and
MADlib [15]. (1) We implemented
RMA+ in MonetDB (v11.29.8) with two options for matrix operations: (a) BATs (RMA+BAT): no-copy implementation in the kernel of MonetDB; (b) MKL (RMA+MKL): copy BATs to an MKL (v2019.5.281) [2] compatible format (contiguous array of doubles), then copy the result back to BATs. We execute linear operations (add, sub, emu) on BATs and use MKL for more complex operations. The implementation can be found here: https://github.com/oksdolm/RMA

[Figure 12: Splitting, sorting, morphing, and merging for query v = inv_T(σ_{T>6am}(r)) — the input BATs (T, H, W) are split into D = (T) and B \ D = (H, W); sorting yields G and Y = (H↓T, W↓T); morphing yields X = T↓T; eval produces F; merging returns Concat(X, F)]

When the matrices do not fit into main memory, we switch to BATs. Due to the full integration of RMA+, MonetDB takes care of core usage and work distribution, and all cores are used both for relational and for matrix operations. (2)
SciDB [30] uses an array data model, and queries are expressed in the high-level, declarative language AQL (Array Query Language) [29]. SciDB uses all available cores. (3)
AIDA is a state-of-the-art solution for the integration of matrix operations into a relational database and was shown to outperform other solutions like Spark or the pandas library for Python [10]. AIDA executes matrix operations in Python and offers a Python-like syntax for relational operations, which are then translated into SQL and executed in MonetDB (v11.29.3). AIDA uses all cores both in MonetDB and in Python. We also integrate our solution into MonetDB, which makes AIDA a particularly interesting competitor. (4) MADlib [15] (v1.10) provides a collection of UDFs for PostgreSQL (v9.6) for in-database matrix and statistical calculations. MADlib does not use multiple cores, which affects its overall performance. (5) The R package (v3.2.3) is highly tuned for matrix operations and is a representative of a non-database solution. R performs all relational operations with data.tables structures and transforms the relevant columns to matrices to compute the matrix operations. An alternative approach that uses character matrices for all operations is very inefficient (cf. Section 8.5). R uses all cores for matrix operations but runs relational operations on a single core. Data.
BIXI [17] stores trips and stations of Montreal's public bicycle sharing system for the years 2014-2017. DBLP [23] stores authors with their publication counts per conference as well as conference rankings. The synthetic dataset used in the experiment to measure the effect of sparsity includes values between 0 and 5,000,000. All other synthetic datasets include real-valued numeric attributes with uniformly distributed values between 0 and 10,000.
A salient feature of our approach is that contextual information is maintained during matrix operations. We analyze the scalability of maintaining context and study an optimization that avoids sorting. To this end, we generate relations with a single application column and an increasing number of order columns. We compute add and qqr on these relations. Since add and qqr are inexpensive for single-column matrices, the main cost is the maintenance of the order part.

To handle contextual information we split, sort, morph, and merge lists of BATs (cf. Section 7.2). Sorting is the most expensive operation. Fortunately, sorting is not always necessary. For example, permuting the input rows for the qqr operation will affect the order of the result rows, but will not change their values. Therefore, sorting is not required. In element-wise operations like add, emu, or sol, only the relative order of the rows in the two input relations matters. Thus, only the order part of the second relation requires sorting (to get the same order).

Figure 13 shows the results. (1) Handling contextual information is efficient and scales to large numbers of attributes. (2) The optimized operators that (partially) avoid sorting clearly outperform their non-optimized counterparts.
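The relative-sorting optimization for element-wise operations can be sketched as follows. The dict-based relations and the function `add_relative` are illustrative assumptions, not the system's API: the first relation keeps its physical order, and only the second relation's rows are fetched to match it.

```python
def add_relative(r, s, U):
    """Element-wise add with relative sorting: the rows of s are fetched
    in the order of r's order values, so r itself is never sorted.
    r, s: lists of dicts; U: common order schema (single attribute here)."""
    key = U[0]
    s_index = {t[key]: t for t in s}     # one pass over s
    out = []
    for t in r:                          # r keeps its physical order
        u = s_index[t[key]]
        merged = {key: t[key]}
        for a in t:
            if a != key:
                merged[a] = t[a] + u[a]  # element-wise addition
        out.append(merged)
    return out

r = [{"T": "6am", "H": 1}, {"T": "5am", "H": 2}]
s = [{"T": "5am", "H": 10}, {"T": "6am", "H": 20}]
print(add_relative(r, s, ["T"]))
# [{'T': '6am', 'H': 21}, {'T': '5am', 'H': 12}] -- r's order is preserved
```

Since only the relative alignment of the two inputs matters for add, the absolute sort of the result can be skipped entirely, which is what makes the optimized operators in Figure 13 cheaper.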
[Figure 13: Handling contextual information — runtime (sec) of add and qqr with and without the sorting optimizations (relative sorting for add, no sorting for qqr); (a) 100K tuples, (b) 1M tuples]
Note that a number of operations (cpd, sol, rqr, dsv, tra, det, rnk) do not preserve row context since the number of rows changes. Instead, a single column with predefined values (the operation name or the attribute names of the application schema) is created, which is negligible in the overall runtime. Wide relations.
Current databases scale better in the number of tuples than in the number of attributes. We test our RMA+ implementation in MonetDB on wide relations. We generate relations with 1000 tuples, one order attribute, and a varying number of application attributes. In Table 4, we increase the number of attributes from 1K to 10K and measure the runtime of the add operation. MonetDB can handle wide relations with several thousands of attributes, even though the runtime per column increases with the number of attributes.
Table 4: add over wide relations in RMA+
Sparse relations.
We analyze the effect of MonetDB's built-in compression on relations with many zeros. We add two relations (5M tuples, one order attribute, 10 application attributes) with uniformly distributed non-zero values (range 1-5M). In Table 5 we increase the percentage of zero values (the positions of the zeros are random) and measure the runtime: the add operation on sparse matrices is up to two times faster than the same operation on dense matrices. Thus, RMA+ leverages MonetDB's compression features.
Table 5: add over sparse relations in RMA+ (percentage of zero values from 0 to 100; runtime in sec)
We study the scalability of RMA+ to large relations and compare to R as a non-database solution for matrix operations. In Table 6 we measure the runtime for qqr on tables with up to 100M tuples and 70 attributes in the application schema. For relations up to a size of 50Mx40, RMA+ delegates the matrix computation to MKL; the runtime includes copying the data. RMA+ is consistently faster than R since MKL can better leverage the hardware. R fails for sizes above 50Mx40 since it runs out of memory. In RMA+ we switch to the BAT implementation, which leverages the memory management of MonetDB. The Gram-Schmidt qqr baseline [12] that we implemented over BATs is slower than the MKL algorithm (e.g., 834 vs. 61.4 sec for 50Mx40), which explains the increase in runtime. RMA+ scales to large relations that do not fit into memory (e.g., relation size 100Mx70 requires 56GB). Table 6: Runtimes of qqr in seconds in R and RMA+
              10 attr          40 attr          70 attr
System        R      RMA+      R      RMA+      R      RMA+
5M tup        ...
50M tup       37     21.3      221    61.4      fail   2018
100M tup      74     40        fail   1690      fail   4064
We study the performance of RMA+ vs. SciDB [30] as a representative of array databases. We compute add on two matrices with 10 columns and a varying number of rows, followed by a selection. The resulting runtimes for Ubuntu are shown in Table 7. RMA+ outperforms SciDB by more than an order of magnitude. RMA+ performs the addition directly over pairs of relations, while SciDB must compute a so-called array join [29] over the input arrays in order to add their values. Table 7: add followed by a selection: RMA+ vs. SciDB
We investigate the overhead of data transformation for various matrix operations in a mixed relational/matrix scenario. RMA+ is free to execute matrix operations directly on BATs or to rearrange the numerical data in main memory and delegate the matrix operations to specialized packages like MKL [2]. R does not enjoy this flexibility: R uses the matrix data type for matrix operations and the data.tables storage structure for relational operations. While data.tables supports simple linear operations like linear model construction, the data must be transformed to the matrix type for more complex operations like CPD, OPD, or MMU. Matrices cannot store a mix of numerical and non-numerical values, which is required when working with tables; R offers character matrices, but they are very inefficient, e.g., joining trips and stations in the BIXI dataset takes 40 sec for the character matrix type and less than 2 sec for data.tables.

Figure 14 shows the percentage of time spent on data transformations for relations with 50 columns and a varying number of rows (100K to 500K). For R we measure the time of transforming the relation from data.table to matrix and back as a percentage of the overall query time, which includes the actual matrix operation. For RMA+ we measure the time share for copying the data from a list of BATs to a contiguous, one-dimensional array for MKL, and for copying the result back; the overall runtime in addition includes the matrix computation in MKL (but excludes the MonetDB query pipeline of query parsing, query tree creation, etc.).

Clearly, the overhead of transforming data matters for both R and RMA+. We draw the following conclusions: (a) Transforming data between data structures is costly. (b) For simple operations like ADD and EMU, the transformation overhead dominates the overall runtime (up to 92%). (c) For complex operations, the performance of the matrix operation dominates the overall runtime.

[Figure 14: Data transformation share for ADD, EMU, MMU, QQR, DSV, and VSV — (a) R: data.table and matrix; (b) RMA+: list of BATs and 1D array]

(Footnote: We run the SciDB experiment of Table 7 on Ubuntu 14.04 since SciDB does not support Debian; Ubuntu runs on a server with 4 cores and 16GB of RAM.)
We analyze four workloads that require a mix of relational and matrix operations, and we compare our implementation of RMA (RMA+) to its competitors (R, AIDA, MADlib). The workloads stem from applications on our real-world datasets and differ in the complexity of the relational vs. the matrix part. On the BIXI dataset, we compute (1) the linear regression between distance and duration for individual trips, and (2) journeys connecting up to 5 trips; on DBLP we compute (3) the covariance between conferences based on the publication counts per conference and author; (4) on a synthetic dataset based on BIXI we count trips per rider. (1) Trips – Ordinary Linear Regression.
Trips in BIXI include start date and start station, end date and end station, duration, and a membership flag for the rider; stations have a code, a name, and coordinates. At the level of relations, we need to perform the following data preparation steps: (a) aggregate the trips and select those trips that were performed at least 50 times; (b) join trips and stations to retrieve the station coordinates and compute the distance. We use the OLS method [28] to compute the linear regression between distance and duration. OLS uses cross product, matrix multiplication, and inversion:
MMU(INV(CPD(A, A)), CPD(A, V)), where A is the matrix with the independent variables and V is the vector with the dependent variable.

Figure 15a shows the runtime results for trips reported in the years 2014 (3.1M trips), 2014-2015 (6.1M trips), 2014-2016 (10.5M trips), and 2014-2017 (14.5M trips), respectively. The input data consists of numeric and non-numeric types such as date and time. We break the runtime down into data preparation (solid area of the bar) and matrix computation time (dashed light area) for RMA+, R, and AIDA; for R we also show the load time from a CSV file (dark area). RMA+ and AIDA outperform R and MADlib in all scenarios. R performs poorly on the relational operations of the data preparation step: the join implementation of R does not leverage multiple cores, and R lacks a query optimizer, which adversely affects the relational performance. MADlib is outperformed by all other solutions due to the slow computation of the linear regression. RMA+ outperforms AIDA on all datasets. Although both RMA+ and AIDA compute the relational operations in MonetDB, RMA+ is up to 6.3 times faster: while AIDA passes pointers to access numerical Python data in MonetDB, this does not work for other data types (e.g., date, time, string) due to different storage formats [11]. Therefore, expensive data transformations must be applied.
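The OLS expression MMU(INV(CPD(A, A)), CPD(A, V)), i.e., β = (AᵀA)⁻¹ AᵀV, can be illustrated with a small self-contained sketch. The plain-Python matrix helpers are illustrative assumptions (a toy version of the math, not the RMA+/MKL implementation).

```python
def transpose(m):
    return [list(c) for c in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def inverse(m):
    """Gauss-Jordan inversion of a small square matrix (no pivoting)."""
    n = len(m)
    a = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(m)]
    for i in range(n):
        p = a[i][i]
        a[i] = [x / p for x in a[i]]
        for j in range(n):
            if j != i:
                f = a[j][i]
                a[j] = [x - f * y for x, y in zip(a[j], a[i])]
    return [row[n:] for row in a]

# beta = MMU(INV(CPD(A, A)), CPD(A, V)) = (A^T A)^{-1} A^T V
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # intercept column + distance
V = [[1.0], [3.0], [5.0]]                  # duration = 1 + 2 * distance
AtA = matmul(transpose(A), A)              # CPD(A, A)
AtV = matmul(transpose(A), V)              # CPD(A, V)
beta = matmul(inverse(AtA), AtV)           # beta close to [[1.0], [2.0]]
```

With an exact linear relationship in the toy data, the fitted coefficients recover intercept 1 and slope 2 (up to floating-point rounding).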
[Figure 15 panels — (a) System Comparison: runtime (sec) of RMA+, AIDA, R, and MADlib; (b) RMA+BAT vs. RMA+MKL]
Figure 15: Trips (Ordinary Linear Regression)

(2) Journeys – Multiple Linear Regression.
We compose trips that meet in a station into journeys. We start from 15M one-trip journeys of the form (start station, end station, duration); all attributes are numerical. During data preparation, we perform joins to create journeys of up to five trips, select those that appear at least 50 times, and join stations with their coordinates to compute the distances between subsequent stations in a journey. At the matrix level, we do a multiple linear regression analysis with the distances as independent variables and the overall duration as the dependent variable.

Figure 16a shows the runtime for journey lengths of 1 to 5 trips (i.e., 1 to 5 independent variables). The solid part is the time for data preparation (relational operations); the dashed light part is the time for multiple linear regression (matrix operations). RMA+ and AIDA again outperform R on the relational part of the query. The relational part operates on purely numerical data, and AIDA shows join performance comparable to RMA+. MADlib spends about two thirds of the relational runtime on distance computations and is therefore slower than its competitors also on the relational part. (3) Conferences – Covariance Computation.
We compute the covariance between conferences with A++ rating and lower rated conferences based on the number of publications per author and conference. The data includes two tables: ranking stores a rating (e.g., A++, A+, B) for each conference; publication stores the number of publications per author and conference; the first attribute is the author, the other attributes are conference names (i.e., the result of SQL PIVOT over a count-aggregate by conference and author). The query computes the covariance matrix on publication and joins the result with ranking to select A++ conferences.

[Figure 16: Journeys (Multiple Linear Regression) — runtime (sec) for journeys of 1 to 5 trips; (a) system comparison of RMA+, AIDA, R, and MADlib; (b) RMA+BAT vs. RMA+MKL]

We measure the runtime for publication tables of increasing sizes: (1) 337363x266 (i.e., 337363 authors and 266 conferences), (2) 550085x519, (3) 722891x744, and (4) 876559x882. The ranking table stores 882 tuples. Note that the number of result rows of covariance is identical to the number of input columns, e.g., covariance of publications with 266 columns returns a relation (or matrix) of size 266x266.

Figure 17a shows the runtime results for RMA+, R, and AIDA. MADlib runs for 77, 429, 1086, resp. 1814 seconds on the different relation sizes and, thus, is omitted from the figure. In all systems, the covariance computation dominates the overall runtime with at least 90%. Since AIDA does not support covariance, we implement covariance via cross product [18] in all systems except MADlib, which has a cov() function but does not support cross product. For the cross product in RMA+ we use the routine cblas_dsyrk() since the result of the multiplication is symmetric; in AIDA we use a.t @ a; in R we use crossproduct.

Note that the covariance computations in AIDA and R do not return contextual information. In order to join the result with ranking and to select all A++ conferences, the conference names must be manually added as a new column.

[Figure 17: Conferences (Covariance Computation) — runtime (sec) for publication table sizes 1-4; (a) system comparison of RMA+, AIDA, and R; (b) RMA+BAT vs. RMA+MKL]

(4) Trip Count.
In Figure 18 we compute the number of trips per rider to 10 different destinations. Each tuple in the input relations stores a rider and the number of trips to each of the 10 locations for one year. We use add on the relations of two different years to get the trip count for a period of two years. We vary the number of riders from 1M to 15M and measure the runtime. Since add is a simple operation, RMA+ uses the no-copy implementation on BATs (RMA+BAT). RMA+ is faster than AIDA and R because it does not transfer data to Python (as AIDA does) and does not translate data.tables to matrices (as R does). MADlib takes 23, 119, 299, resp. 480 seconds for the different input sizes and, thus, is again omitted from the figure. (Footnote: We do not use MADlib's cov() function in workload (3) since it uses a single core only and is slower.)
[Figure 18 panels — (a) Systems Comparison: runtime (sec) of RMA+, AIDA, and R for 1M to 15M riders; (b) RMA+BAT vs. RMA+MKL]
Figure 18: Trip Count (Matrix Addition)
RMA+BAT vs. RMA+MKL.
Following our policy, RMA+ delegates matrix operations to MKL (RMA+MKL) in Figures 15a, 16a, and 17a (the operations are complex and we do not run out of memory), and uses the no-copy implementation on BATs (RMA+BAT) in Figure 18a (add is a linear operation). We compare RMA+BAT to RMA+MKL in all scenarios. RMA+MKL outperforms RMA+BAT for the queries on trips (factor 1.8-3.8, cf. Figure 15b) and journeys (factor 1.4-1.9, cf. Figure 16b). For the conference query, RMA+MKL is 24 to 70 times faster since the cross product requires single-element access and operates on relations with a large number of attributes. For the trip count, RMA+BAT outperforms RMA+MKL in all settings (cf. Figure 18b). Although element-wise addition is highly efficient in MKL, the transformation overhead cannot be amortized.
The key learnings from our empirical evaluation are the following: (1) RMA+ excels for mixed workloads that include both standard relational and matrix operations. (2) Only RMA+ can avoid data transformations in mixed workloads; data transformations may be costly and consume more than 90% of the overall runtime. (3) For complex matrix operations, however, transforming the data to a suitable format may pay off: in our approach, we are free to transform the data whenever beneficial. (4) In terms of scalability to large relations/matrices, our solution outperforms all competitors since it relies on the memory management of the database system for both the standard relational and the matrix operations. (5) Finally, the handling of contextual information, a key feature of RMA, is efficient and can leverage optimizations that avoid expensive sorting.
CONCLUSION
In this paper, we targeted applications that store data in relations and must deal with queries that mix relational and linear algebra operations. We proposed the relational matrix algebra (RMA), an extension of the relational model with matrix operations that maintain important contextual information. RMA operations are defined over relations and can be nested. We implemented RMA over the internal data structures of MonetDB without requiring changes in the query processing pipeline. Our integration is competitive with state-of-the-art approaches and excels for mixed workloads.

RMA opens new opportunities for cross-algebra optimizations that involve both relational and linear algebra operations. It is also interesting to investigate the handling of wide tables, e.g., by storing them as skinny tables that are accessed accordingly or by combining operations to avoid the generation of wide intermediate tables.
ACKNOWLEDGMENTS
We thank Joseph Vinish D'silva for providing the source code of AIDA and helping out with the system. We thank the MonetDB team for their support. We thank Roland Kwitt for the discussion about application scenarios. The project was partially supported by the Swiss National Science Foundation (SNSF) through project number 407550 167177.