Query Log Compression for Workload Analytics
Ting Xie
University at Buffalo, SUNY
[email protected]

Varun Chandola
University at Buffalo, SUNY
[email protected]

Oliver Kennedy
University at Buffalo, SUNY
[email protected]
ABSTRACT
Analyzing database access logs is a key part of performance tuning, intrusion detection, benchmark development, and many other database administration tasks. Unfortunately, it is common for production databases to deal with millions or even more queries each day, so these logs must be summarized before they can be used. Designing an appropriate summary encoding requires trading off between conciseness and information content. For example: simple workload sampling may miss rare, but high impact queries. In this paper, we present LogR, a lossy log compression scheme suitable for use by many automated log analytics tools, as well as for human inspection. We formalize and analyze the space/fidelity trade-off in the context of a broader family of "pattern" and "pattern mixture" log encodings to which LogR belongs. We show through a series of experiments that LogR compressed encodings can be created efficiently, come with provable information-theoretic bounds on their accuracy, and outperform state-of-art log summarization strategies.
1. INTRODUCTION
Automated analysis of database access logs is critical for solving a wide range of problems, from database performance tuning [12], to compliance validation [18] and query recommendation [15]. For example, the Peloton self-tuning database [44] searches for optimal configurations by repeatedly simulating database performance based on statistical properties of historical queries. Unfortunately, query logs for production databases can grow to be large (a recent study of queries at a major US bank for a period of 19 hours found nearly 17 million SQL queries and over 60 million stored procedure executions [35]), and computing these properties from the log itself is slow. Tracking only a sample of these queries is not sufficient, as rare queries can disproportionately affect database performance, for example, if they benefit from an otherwise unnecessary index. Rather, we need a compressed summary of the log on which we can compute aggregate statistical properties. The problems of compression and summarization have been studied extensively (e.g., [51, 52, 26, 19, 10, 47, 34]). However, these schemes either require the use of heavyweight inference to recover the desired statistical measures, or produce unnecessarily large encodings.

In this paper, we adapt ideas from pattern mining and summarization [40, 20] to propose a middle ground: LogR, a summarization scheme that facilitates efficient (both in terms of storage and time) approximation of workload statistics. By adjusting a tunable parameter in LogR, users can choose to obtain a high-fidelity, albeit large, summary, or obtain a more compact summary with lower fidelity. Constructing the summary that best balances compactness and fidelity is challenging, as the search space of candidate summaries is combinatorially large [40, 20]. LogR offers a new approach to summary construction that avoids searching this space, making inexpensive, accurate computation of aggregate workload statistics possible. As a secondary benefit, the resulting summaries are also human-interpretable.

LogR does not admit closed-form solutions to classical fidelity measures like information loss, so we propose an alternative called Reproduction Error. We show through a combination of analytical and experimental evidence that Reproduction Error is highly correlated with several classical measures of encoding fidelity.
LogR-compressed data relies on a codebook based on structural elements like SELECT items, FROM tables, or conjunctive WHERE clauses [3]. This codebook provides a bi-directional mapping from SQL queries to a bit-vector encoding and back again, reducing the compression problem to one of compactly encoding a collection of feature-vectors. We further simplify the problem by observing that a common theme in use cases like automated performance tuning or query recommendation is the need for predominantly aggregate workload statistics. As these are order-independent, we are able to focus exclusively on compactly representing bags of feature-vectors.

LogR works by identifying groups of co-occurring structural elements that we call patterns. We define a family of pattern encodings of access logs, which map patterns to their frequencies in the log. For pattern encodings, we consider two idealized measures of fidelity: (1) Ambiguity, which measures how much room the encoding leaves for interpretation; and (2) Deviation, which measures how reliably the encoding approximates the original log. Neither Ambiguity nor Deviation can be computed efficiently for pattern encodings. Hence we propose a measure called Reproduction Error that is efficiently computable and that closely tracks both Ambiguity and Deviation. In general, the size of the encoding is inversely related with Reproduction Error: the more detailed the encoding, the more faithfully it represents the original log. Thus, log compression may be defined as a search over the space of pattern based encodings to identify the one that best trades off between these two properties. Unfortunately, searching for such an ideal encoding from the space can be computationally expensive [20, 40]. To overcome this limitation, we reduce the search space by first clustering entries in the log and then encoding each cluster separately, an approach that we call pattern mixture encoding. Finally we identify a simple approach to encoding individual clusters that we call naive mixture encodings, and show experimentally that it produces results competitive with more powerful techniques for log compression and summarization.

Concretely, in this paper we make the following contributions: (1) We define two families of compression for query logs: pattern and pattern mixture, (2) We define a computationally efficient measure, Reproduction Error, and demonstrate that it is a close approximation of Ambiguity and Deviation (two commonly used measures), (3) We propose a clustering-based approach to efficiently search for naive mixture encodings, and show how these encodings can be further optimized, and, (4) We experimentally validate LogR and show that it produces more precise encodings faster than several state-of-the-art pattern encoding algorithms.
Roadmap.
The paper is organized as follows: Section 2 formally defines the log compression problem and the summary representation; Section 3 then defines information loss for the summaries; Section 4 explains the difficulty in computing the theoretical loss measure and provides a practical alternative; Section 5 motivates data partitioning and generalizes the practical loss measure to partitioned data; Section 6 then introduces the proposed LogR compression scheme; Section 7 empirically validates the practical loss measure and evaluates the effectiveness of LogR compression by comparing it with two state-of-the-art summarization methods; Section 8 empirically verifies the effectiveness of LogR by evaluating it under the applications of the two comparison methods; Section 9 discusses related work and Section 10 concludes the paper.
2. PROBLEM DEFINITION
In this section, we introduce and formally define the log compression problem. We begin by exploring several applications that need to repeatedly analyze query logs.
Index Selection.
Selecting an appropriate set of indexes requires trading off between update costs, access costs, and limitations on available storage space. Existing strategies for selecting a (near-)optimal set of indexes typically repeatedly simulate database performance under different combinations of indexes, which in turn requires repeatedly estimating the frequency with which specific predicates appear in the workload. For example, if status = ? occurs in 90% of the queries in a workload, a hash index on status is beneficial.
Materialized View Selection.
The results of joins or highly selective selection predicates are good candidates for materialization when they appear frequently in the workload. Like index selection, view selection is a non-convex optimization problem, typically requiring exploration by repeated simulation, which in turn requires repeated frequency estimation over the workload.
Online Database Monitoring.
In production settings, it is common to monitor databases for atypical usage patterns that could indicate a serious bug or security threat. When query logs are monitored, it is often done retrospectively, some hours after-the-fact [35]. To support real-time monitoring it is necessary to quickly compute the frequency of a particular class of query in the system's typical workload. In each case, the application's interactions with the log amount to counting queries that have specific features: selection predicates, joins, or similar.
Let L be a log, or a finite collection of queries q ∈ L. We write f ∈ q to indicate that q has some feature f, such as a specific predicate or table in its FROM clause. We assume (1) that the universe of features in both a log and a query is enumerable and finite, (2) that the features are selected to suit specific applications, and (3) optionally that a query is isomorphic to its feature set (motivated in Section 2.3.2). We outline one approach to extracting features that satisfies all three assumptions below. We abuse syntax and write q to denote both the query itself, as well as the set of its features.

Let b denote some set of features f ∈ b, which we call a pattern. We write these sets using vector notation: b = (x_1, ..., x_n), where n is the number of distinct features appearing in the entire log and x_i indicates the presence (absence) of the i-th feature with a 1 (resp., 0). For any two patterns b, b', we say that b' is contained in b if b' ⊆ b. Equivalently, with b = (x_1, ..., x_n) and b' = (x'_1, ..., x'_n):

b' ⊆ b  ≡  ∀ i, x'_i ≤ x_i

Our goal then is to be able to query logs for the number of times a pattern b appears: |{ q | q ∈ L ∧ b ⊆ q }|.

For this paper, we specifically adopt the feature-extraction conventions of a query summarization scheme by Aligon et al. [3]. In this scheme, each feature is one of the following three query elements: (1) a table or sub-query in the FROM clause, (2) a column in the SELECT clause, and (3) a conjunctive atom of the WHERE clause.
Example. Consider the following example query.
SELECT _id, sms_type, _time FROM Messages
WHERE status = ? AND transport_type = ?
This query uses 6 features: ⟨sms_type, SELECT⟩, ⟨_id, SELECT⟩, ⟨_time, SELECT⟩, ⟨Messages, FROM⟩, ⟨status=?, WHERE⟩, and ⟨transport_type=?, WHERE⟩.
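To make the feature convention concrete, the following Python sketch extracts ⟨element, clause⟩ features from a simple conjunctive query. It only illustrates the convention (it is not the parser used by LogR): it assumes a single SELECT-FROM-WHERE block whose WHERE clause is a conjunction of atoms, and the function name and regular expression are ours.

import re

def aligon_features(sql):
    # Illustrative only: split a simple conjunctive query into
    # <element, clause> features; no general SQL parsing is attempted.
    m = re.match(r"\s*SELECT\s+(?P<select>.+?)\s+FROM\s+(?P<from>.+?)"
                 r"(?:\s+WHERE\s+(?P<where>.+?))?\s*;?\s*$",
                 sql, re.IGNORECASE | re.DOTALL)
    if m is None:
        raise ValueError("not a simple conjunctive query")
    features = {(c.strip(), "SELECT") for c in m.group("select").split(",")}
    features |= {(t.strip(), "FROM") for t in m.group("from").split(",")}
    if m.group("where"):
        atoms = re.split(r"\s+AND\s+", m.group("where"), flags=re.IGNORECASE)
        features |= {(a.strip(), "WHERE") for a in atoms}
    return features

print(aligon_features("SELECT _id, sms_type, _time FROM Messages "
                      "WHERE status = ? AND transport_type = ?"))
# prints the six features listed above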
Although this scheme is simple and limited to conjunctive queries (or queries with a conjunctive equivalent), it fulfills all three assumptions we make on feature extraction schemes. The features of a query (and consequently a log) are enumerable and finite, and the feature set of the query is isomorphic (modulo commutativity and column order) to the original query. Furthermore, even if a query is not itself conjunctive, it often has a conjunctive equivalent. We quantify this statement with Table 1, which provides two relevant data points from production query logs; in both cases, all logged queries can be rewritten into equivalent queries compatible with the Aligon scheme.
Figure 1: Example Encoding Visualizations. (a) Correlation-ignorant: features are highlighted independently (e.g., SELECT sms_type, external_ids, _time, _id FROM messages WHERE (sms_type=?) ∧ (status=?), with each feature shaded by its frequency). (b) Correlation-aware: pattern groups are highlighted together (e.g., SELECT sms_type FROM messages WHERE sms_type=? and SELECT sms_type FROM messages WHERE status=?).

Although we do not explore more advanced feature encoding schemes in detail here, we direct the interested reader to work on query summarization [39, 8, 35]. For example, a scheme by Makiyama et al. [39] also captures aggregation-related features like group-by columns, while an approach by Kul et al. [35] encodes partial tree-structures in the query.
As a lossy form of compression, LogR only approximates the information content of a query log. We next develop a simplified form of LogR that we call pattern-based encoding, and develop a framework for reasoning about the fidelity of a LogR-compressed log. As a basis for this framework, we first reframe the information content of a query log to allow us to adapt classical information-theoretical measures of information content.
We define the information content of the log as a distribution p(Q | L) of queries Q drawn uniformly from the log.

Example. Consider the following query log, which consists of four conjunctive queries.
1. SELECT _id FROM Messages WHERE status = ?
2. SELECT _time FROM Messages WHERE status = ? AND sms_type = ?
3. SELECT _id FROM Messages WHERE status = ?
4. SELECT sms_type, _time FROM Messages WHERE sms_type = ?
Drawing uniformly from the log, each entry will appear with probability 1/4 = 0.25. The query q_1 (= q_3) occurs twice, so the probability of drawing it is double that of the others (i.e., p(q_1 | L) = p(q_3 | L) = 2/4 = 0.5).

Treating a query as a vector of its component features, we can define a query q = (x_1, ..., x_n) to be an observation of the multivariate distribution over variables Q = (X_1, ..., X_n) corresponding to features. The event X_i = 1 occurs if feature i appears in a uniformly drawn query.

Example. Continuing, the universe of features for this query log is (1) ⟨_id, SELECT⟩, (2) ⟨_time, SELECT⟩, (3) ⟨sms_type, SELECT⟩, (4) ⟨status = ?, WHERE⟩, (5) ⟨sms_type = ?, WHERE⟩, and (6) ⟨Messages, FROM⟩. Accordingly, the queries can be encoded as feature vectors, with fields counting each feature's occurrences: q_1 = ⟨1, 0, 0, 1, 0, 1⟩, q_2 = ⟨0, 1, 0, 1, 1, 1⟩, q_3 = ⟨1, 0, 0, 1, 0, 1⟩, q_4 = ⟨0, 1, 1, 0, 1, 1⟩.
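The following sketch mirrors this example: it encodes a small log of feature sets as 0/1 vectors over the feature universe and computes the empirical distribution p(Q | L). The data layout and helper names are ours, chosen only for illustration.

from collections import Counter

log = [
    {("_id", "SELECT"), ("status=?", "WHERE"), ("Messages", "FROM")},
    {("_time", "SELECT"), ("status=?", "WHERE"), ("sms_type=?", "WHERE"), ("Messages", "FROM")},
    {("_id", "SELECT"), ("status=?", "WHERE"), ("Messages", "FROM")},
    {("sms_type", "SELECT"), ("_time", "SELECT"), ("sms_type=?", "WHERE"), ("Messages", "FROM")},
]

universe = sorted(set().union(*log))          # enumerable, finite feature universe
vectors = [tuple(int(f in q) for f in universe) for q in log]

# Empirical distribution p(Q | L): the repeated query receives twice the mass.
p = {v: c / len(log) for v, c in Counter(vectors).items()}
for v, prob in sorted(p.items()):
    print(v, prob)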
Patterns.
Our target applications require us to count the number of times features (co-)occur in a query. For example, materialized view selection requires counting tables used together in queries. Motivated by this observation, we begin by defining a broad class of pattern based encodings that directly encode co-occurrence probabilities. A pattern is an arbitrary set of features b = (x_1, ..., x_n) that may co-occur together. Each pattern captures a piece of information from the distribution p(Q | L). In particular, we are interested in the probability of uniformly drawing a query q from the log that contains the pattern b (i.e., q ⊇ b):

p(Q ⊇ b | L) = Σ_{q ∈ L ∧ q ⊇ b} p(q | L)

When it is clear from context, we abuse notation and write p(·) instead of p(· | L). Recall that p(Q) can be represented as a joint distribution over variables (X_1, ..., X_n), and the probability p(Q ⊇ b) is thus equivalent to the marginal probability p(X_1 ≥ x_1, ..., X_n ≥ x_n) of pattern b.

Pattern-Based Encodings.
Denote by E_max : {0, 1}^n → [0, 1] the full mapping from all patterns b ∈ {0, 1}^n to their marginals. A pattern based encoding E is any such partial mapping E ⊆ E_max. We denote the marginal of pattern b in encoding E_L by E_L[b] (= p(Q ⊇ b | L)). When it is clear from context, we abuse syntax and also use E to denote the set of patterns it maps (i.e., domain(E)). Hence, |E| is the number of mapped patterns, which we call the encoding's Verbosity. A pattern based encoder is any algorithm encode(L, ε) ↦ E whose input is a log L and whose output is a set of patterns E, with Verbosity thresholded at some integer ε. Many pattern mining algorithms [20, 40] can be used for this purpose.

A side-benefit of pattern based encodings is that, under the assumption of isomorphism in Section 2.1, patterns can be translated to their query representations and used for human analysis of the log. Figure 1 shows two examples. The approach illustrated in Figure 1a uses shading to show each feature's frequency in the log, and communicates frequently occurring constraints or attributes. This approach might, for example, help a human to manually select indexes. A second approach illustrated in Figure 1b conveys correlations, showing the frequency of entire patterns. Appendix E explores interpretable visualizations of pattern based summaries in greater depth.
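A pattern's marginal E[b] = p(Q ⊇ b | L) can be read directly off the encoded log. The sketch below computes it for a log of 0/1 feature vectors; the feature ordering and names are the ones we chose for the running example.

# Feature order: <_id,SELECT>, <_time,SELECT>, <sms_type,SELECT>,
#                <status=?,WHERE>, <sms_type=?,WHERE>, <Messages,FROM>
vectors = [
    (1, 0, 0, 1, 0, 1),
    (0, 1, 0, 1, 1, 1),
    (1, 0, 0, 1, 0, 1),
    (0, 1, 1, 0, 1, 1),
]

def marginal(vectors, pattern):
    # Fraction of logged queries whose feature vector dominates the
    # pattern component-wise, i.e., p(Q ⊇ b | L).
    hits = sum(all(x >= b for x, b in zip(v, pattern)) for v in vectors)
    return hits / len(vectors)

# Pattern b = {<status=?, WHERE>, <Messages, FROM>}
print(marginal(vectors, (0, 0, 0, 1, 0, 1)))    # 0.75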
3. INFORMATION LOSS
Our goal is to encode the distribution p(Q) as a set of patterns: obtaining a less verbose encoding (i.e., with fewer patterns), while also ensuring that the encoding captures p(Q) with minimal information loss. In this section, we define information loss for pattern based encodings. To establish a baseline for measuring information loss, we begin with the extreme cases. At one extreme, an empty encoding (|E| = 0) conveys no information. At the other extreme, we have the encoding E_max, which is the full mapping from all patterns. Having this encoding is a sufficient condition to exactly reconstruct the original distribution p(Q).

Proposition 1. For any query q = (x_1, ..., x_n) ∈ N^n, the probability of drawing exactly q at random from the log (i.e., p(X_1 = x_1, ..., X_n = x_n)) is computable, given E_max.

See Appendix B for proof of the proposition.

3.2 Lossy Summaries
Although lossless, E_max is also verbose. Hence, we will focus on lossy encodings that can be less verbose. A lossy encoding E ⊂ E_max may not be able to precisely identify the distribution p(Q), but can still be used to approximate it. We characterize the information content of a lossy encoding E by defining a space (denoted by Ω_E) of distributions ρ ∈ Ω_E allowed by an encoding E. This space is defined by constraints as follows: First, we have the general properties of probability distributions:

∀ q ∈ N^n : ρ(q) ≥ 0        Σ_{q ∈ N^n} ρ(q) = 1

Each pattern b in the encoding E constrains the marginal probability over its component features:

∀ b ∈ domain(E) : E[b] = Σ_{q ⊇ b} ρ(q)

Note that the dual constraints 1 − E[b] = Σ_{q ⊉ b} ρ(q) are redundant under the constraint Σ_{q ∈ N^n} ρ(q) = 1. The resulting space Ω_E is the set of all query logs, or equivalently the set of all possible distributions of queries, that obey these constraints. From the outside observer's perspective, the distribution ρ ∈ Ω_E that the encoding conveys is ambiguous: we model this ambiguity with a random variable P_E with support Ω_E. The true distribution p(Q) derived from the query log must appear in Ω_E, denoted p(Q) ≡ ρ* ∈ Ω_E (i.e., p(P_E = ρ*) > 0). Of the distributions ρ admitted by Ω_E, it is possible that some are more likely than others. For example, a query containing a column (e.g., status) is only valid if it also references a table that contains the column (e.g., Messages). This prior knowledge may be modeled as a prior on the distribution of P_E or by an additional constraint. However, for the purposes of this paper, we take the uninformed prior by assuming that P_E is uniformly distributed over Ω_E:

p(P_E = ρ) = 1/|Ω_E| if ρ ∈ Ω_E, and 0 otherwise.

Naive Encodings.
One specific family of lossy encodings that treats each feature as being independent (e.g., as in Figure 1a) is of particular interest to us. We call this family naive encodings, and return to it throughout the rest of the paper. A naive encoding is composed of all patterns that have exactly one feature with non-zero marginal:

{ b = (0, ..., 0, x_i, 0, ..., 0) | i ∈ [1, n], x_i = 1 }

Based on the space of distributions constrained by the encoding, the information loss of an encoding can be considered from two related, but subtly distinct perspectives: (1) Ambiguity measures how much room the encoding leaves for interpretation, and (2) Deviation measures how reliably the encoding approximates the target distribution p(Q).

Ambiguity.
We define the Ambiguity I(E) of an encoding as the entropy of the random variable P_E. The higher the entropy, the less precisely E identifies a specific distribution.

I(E) = − Σ_ρ p(P_E = ρ) · log(p(P_E = ρ))

Deviation.
The deviation from any permitted distribution ρ to the true distribution ρ* can be measured by the Kullback-Leibler (K-L) divergence [37] (denoted D_KL(ρ* || ρ)). We define the Deviation d(E) of an encoding as the expectation of the K-L divergence over all permitted ρ ∈ Ω_E:

d(E) = E_{P_E}[ D_KL(ρ* || P_E) ] = Σ_{ρ ∈ Ω_E} p(P_E = ρ) · D_KL(ρ* || ρ)

Limitations.
There are two limitations to these idealized measures in practice. First, K-L divergence is only defined when the probability measure ρ* is absolutely continuous with respect to the second measure ρ (denoted ρ* ≪ ρ). Second, neither Deviation nor Ambiguity has a closed-form formula.
4. PRACTICAL LOSS MEASURE
Computing either Ambiguity or Deviation requires enumerating the entire space of possible distributions, or an approximation thereof. One approach to estimating either measure is repeatedly sampling from, rather than enumerating, the space. However, accurate measures require a large number of samples, rendering this approach similarly infeasible. In this section, we propose a faster approach to assessing the fidelity of a pattern encoding. Specifically, we select a single representative distribution ρ_E from the space Ω_E, and use ρ_E to approximate both Ambiguity and Deviation.
The representative distribution is chosen by applying the maximum entropy principle [28], which is commonly used in pattern-based summarization [20, 40]. That is, we select the distribution ρ_E with maximum entropy:

ρ_E = argmin_{ρ ∈ Ω_E} −H(ρ)    where    H(ρ) = − Σ_{q ∈ N^n} ρ(q) log ρ(q)

The maximum entropy distribution ρ_E best represents the current state of knowledge. That is, a distribution with lower entropy assumes additional constraints derived from patterns that we do not know, and one with higher entropy violates the constraints from patterns we do know. Maximizing an objective function belonging to the exponential family (entropy in our case) under a mixture of linear equality/inequality constraints is a convex optimization problem [11], which guarantees a unique solution and can be efficiently solved [17] using the cvx toolkit [23][43], and/or by iterative scaling [20, 40]. For naive encodings specifically, we can assume independence between each feature X_i. Under this assumption, ρ_E has a closed-form solution:

ρ_E(q) = Π_i p(X_i = x_i)    where q = (x_1, ..., x_n)    (1)

Using (1), we define Reproduction Error e(E) as the entropy difference between the representative and true distributions:

e(E) = H(ρ_E) − H(ρ*)    where    ρ_E = argmin_{ρ ∈ Ω_E} −H(ρ)

In this section we prove that Reproduction Error closely parallels Ambiguity. We define a partial order lattice over encodings and show that for any pair of encodings on which the partial order is defined, a like relationship is implied for both Reproduction Error and Ambiguity. We supplement the proofs given in this section with an empirical analysis relating Reproduction Error to Deviation in Section 7.1.
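For naive encodings, both the maximum entropy distribution and Reproduction Error therefore have closed forms. The sketch below computes e(E) for a small log of binary feature vectors; it assumes strictly binary features and natural logarithms (the text does not fix a base), and all names and the toy data are ours.

import math
from collections import Counter

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def reproduction_error(vectors):
    N, n = len(vectors), len(vectors[0])
    marginals = [sum(v[i] for v in vectors) / N for i in range(n)]
    # H(rho_E): under independence, the entropy of the maximum entropy
    # distribution is the sum of per-feature Bernoulli entropies.
    h_model = sum(entropy([p, 1 - p]) for p in marginals)
    # H(rho*): entropy of the empirical distribution of whole queries.
    h_true = entropy([c / N for c in Counter(map(tuple, vectors)).values()])
    return h_model - h_true

vectors = [(1, 0, 1, 1), (1, 0, 1, 0), (0, 1, 1, 0)]   # a toy three-query log
print(reproduction_error(vectors))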
Containment.
We define a partial order over encodings ≤_Ω based on containment of their induced spaces Ω_E:

E_1 ≤_Ω E_2  ≡  Ω_{E_1} ⊆ Ω_{E_2}

That is, one encoding (i.e., E_1) precedes another (i.e., E_2) when all distributions admitted by the former encoding are also admitted by the latter.

Containment Captures Reproduction Error.
We first prove that the total order given by Reproduction Error is a superset of the partial order ≤_Ω.

Lemma 1. For any two encodings E_1, E_2 with induced spaces Ω_{E_1}, Ω_{E_2} and maximum entropy distributions ρ_{E_1}, ρ_{E_2}, it holds that E_1 ≤_Ω E_2 → e(E_1) ≤ e(E_2).

Proof. Firstly, Ω_{E_2} ⊇ Ω_{E_1} → ρ_{E_1} ∈ Ω_{E_2}. Since ρ_{E_2} has the maximum entropy among all distributions ρ ∈ Ω_{E_2}, we have H(ρ_{E_1}) ≤ H(ρ_{E_2}), which is equivalent to e(E_1) ≤ e(E_2).

Containment Captures Ambiguity.
Next, we show that the partial order based on containment implies a like relationship between the Ambiguities of pairs of encodings.

Lemma 2. Given encodings E_1, E_2 with an uninformed prior on P_{E_1}, P_{E_2}, it holds that E_1 ≤_Ω E_2 → I(E_1) ≤ I(E_2).

Proof. Given an uninformed prior, I(E) = log |Ω_E|. Hence E_1 ≤_Ω E_2 → |Ω_{E_1}| ≤ |Ω_{E_2}| → I(E_1) ≤ I(E_2).
5. PATTERN MIXTURE ENCODINGS
Thus far we have defined the problem of log compression, treating the query log as a single joint distribution p(Q) that captures the frequency of feature occurrence and/or co-occurrence. Patterns capture positive information about correlations. However, in cases like logs of mixed workloads, there are also many cases of anti-correlation between features. For example, consider a log that includes queries drawn from two workloads with disjoint feature sets. Pattern based summaries cannot convey such anti-correlations easily. As a result, while patterns including features from both workloads never actually occur in the log, a pattern-based summary of the log will indicate a non-zero marginal. Identifying significant workload variation, as might be caused by misuse or malicious workload-injection (mixture), is relevant to intrusion detection systems [36]. In addition, capturing anti-correlations helps to reduce data dimensionality and improves both the runtime and accuracy of state-of-the-art pattern mining algorithms (see Sections 8.1.2 and 8.1.3).

In this section, we propose a generalization of pattern encodings where the log is modeled not as a single probability distribution, but rather as a mixture of several simpler distributions. The resulting encoding is likewise a mixture: each component of the mixture of distributions is stored independently. Hence, we refer to it as a pattern mixture encoding, and it forms the basis of LogR compression. We first focus on a simplified form of this problem, where we only mix naive pattern encodings (we explore more general mixtures in Section 6.4). We refer to the resulting scheme as naive mixture encodings, and give examples of the encoding, as well as potential visualizations, in Section 5.1. Then we generalize Reproduction Error and Verbosity for pattern mixture encodings in Section 5.2. Finally, with generalized encoding evaluation measures, we evaluate several encoding strategies based on different clustering methods for creating naive mixture encodings.
Consider a toy query log with only 3 conjunctive queries:

1. SELECT id FROM Messages WHERE status = ?
2. SELECT id FROM Messages
3. SELECT sms_type FROM Messages

The vocabulary of this log consists of 4 features: ⟨id, SELECT⟩, ⟨sms_type, SELECT⟩, ⟨Messages, FROM⟩, and ⟨status = ?, WHERE⟩. Re-encoding the three queries as vectors, we get:

1. ⟨1, 0, 1, 1⟩
2. ⟨1, 0, 1, 0⟩
3. ⟨0, 1, 1, 0⟩

A naive encoding of this log can be expressed as:

⟨2/3, 1/3, 1, 1/3⟩

This encoding captures that all queries in the log pertain to the Messages table, but obscures the relationship between the remaining features. For example, this encoding obscures the anti-correlation between id and sms_type. Similarly, the encoding hides the association between status = ? and id. Such relationships are critical for evaluating the effectiveness of views or indexes.

Example. The maximal entropy distribution for a naive encoding assumes that features are independent. Assuming independence, the probability of query 1 from the log is:

p(id) · p(¬sms_type) · p(Messages) · p(status=?) = 4/27 ≈ 0.148

This is a significant difference from the true probability of this query (i.e., 1/3). Conversely, queries not in the log, such as the following, have non-zero probability in the encoding:

SELECT sms_type FROM Messages WHERE status = ?

p(¬id) · p(sms_type) · p(Messages) · p(status=?) = 1/27 ≈ 0.037

Now consider partitioning the log into two parts and encoding each separately:

Partition 1 (L_1): (1, 0, 1, 1), (1, 0, 1, 0)  →  ⟨1, 0, 1, 1/2⟩
Partition 2 (L_2): (0, 1, 1, 0)  →  ⟨0, 1, 1, 0⟩

The resulting encoding only has one non-integral probability: p(status = ? | L_1) = 0.5. Although there are now two encodings, the encodings are not ambiguous. The feature status = ? appears in exactly half of the log entries, and is indeed independent of the other features. All other attributes in each encoding appear in all queries in their respective partitions. Furthermore, the maximum entropy distribution induced by each encoding is exactly the distribution of queries in the compressed log. Hence, the Reproduction Error is zero for both of the two encodings.
We next generalize our definitions of Reproduction Error and Verbosity from pattern to pattern mixture encodings. Suppose query log L has been partitioned into K clusters, with L_i, S_i, ρ_{S_i}, and ρ*_i (where i ∈ [1, K]) representing the log of queries, encoding, maximum entropy distribution, and true distribution (respectively) for the i-th cluster. First, observe that the distribution for the whole log (i.e., ρ*) is the sum of distributions for each partition (i.e., ρ*_i), weighted by the proportion of queries (i.e., |L_i| / |L|) in the partition:

ρ*(q) = Σ_{i=1,...,K} w_i · ρ*_i(q)    where w_i = |L_i| / |L|

Generalized Reproduction Error.
Similarly, the maximum entropy distribution ρ_S for the whole log is:

ρ_S(q) = Σ_{i=1,...,K} w_i · ρ_{S_i}(q)

We define the Generalized Reproduction Error of a pattern mixture encoding similarly, as the weighted sum of the errors for each partition:

H(ρ_S) − H(ρ*) = Σ_i w_i · H(ρ_{S_i}) − Σ_i w_i · H(ρ*_i) = Σ_i w_i · e(S_i)

As in the base case, a pattern mixture encoding with low Generalized Reproduction Error indicates a high-fidelity representation of the original log. A process can infer the probability of any query p(Q = q | L) drawn from the original distribution simply by inferring its probability drawn from each cluster i (i.e., p(Q = q | L_i)) and taking a weighted average over all inferences. When it is clear from context, we refer to Generalized Reproduction Error simply as Error in the rest of this paper.
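The Generalized Reproduction Error is thus just a size-weighted average of per-cluster Errors. The sketch below, under the same assumptions as the earlier Reproduction Error sketch (binary features, natural logarithms, names ours), shows that splitting the Section 5.1 toy log into its two workloads drives the Error to zero.

import math
from collections import Counter

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def reproduction_error(vectors):
    N, n = len(vectors), len(vectors[0])
    marginals = [sum(v[i] for v in vectors) / N for i in range(n)]
    h_model = sum(entropy([p, 1 - p]) for p in marginals)     # independence
    h_true = entropy([c / N for c in Counter(map(tuple, vectors)).values()])
    return h_model - h_true

def generalized_error(clusters):
    # Weighted sum of per-cluster Errors; the weights are |L_i| / |L|.
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * reproduction_error(c) for c in clusters)

whole_log = [(1, 0, 1, 1), (1, 0, 1, 0), (0, 1, 1, 0)]
partitioned = [[(1, 0, 1, 1), (1, 0, 1, 0)], [(0, 1, 1, 0)]]
print(reproduction_error(whole_log))    # > 0: one naive encoding loses information
print(generalized_error(partitioned))   # 0.0: each partition is exactly captured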
Generalized Verbosity.
We generalize verbosity to mixture encodings as the Total Verbosity (Σ_i |S_i|), or the total size of the encoded representation. This approach is ideal for our target applications, where our aim is to reduce the representational size of the query log.
6. PATTERN MIXTURE COMPRESSION
We are now ready to describe the LogR compression scheme. Broadly, LogR attempts to identify a pattern mixture encoding that optimizes for some target trade-off between Total Verbosity and Error. A naive, though impractical, approach to finding such an encoding would be to search the entire space of possible pattern mixture encodings. Instead, LogR approximates the same outcome by identifying the naive pattern mixture encoding that is closest to optimal for the desired trade-off. As we show experimentally, the naive mixture encoding produced by this first stage is competitive with more complicated, slower techniques for summarizing query logs. We also explore a hypothetical second stage, where LogR refines the naive mixture encoding to further reduce error. The outcome of this hypothetical stage has a slightly lower Error and Verbosity, but does not admit efficient computation of database statistics.
LogR compression searches for a naive mixture encoding that best optimizes for a requested tradeoff between Total Verbosity and Error. As a way to make this search efficient, we observe that a log (or log partition) uniquely determines its naive mixture encoding. Thus the problem of searching for a naive mixture encoding reduces to the problem of searching for the corresponding log partitioning. We further observe that the Error of a naive mixture encoding is proportional to the diversity of the queries in the log being encoded; the more uniform the log (or partition), the lower the corresponding error. Hence, the partitioning problem further reduces to the problem of clustering queries in the log by feature overlap. To identify a suitable clustering scheme, we next evaluate four commonly used partitioning/clustering methods: (1) KMeans [27] with Euclidean distance (i.e., the l2-norm), and Spectral Clustering [31] with (2) Manhattan (i.e., l1-norm), (3) Minkowski (i.e., lp-norm, with p = 4), and (4) Hamming (Count(x ≠ y) / (Count(x ≠ y) + Count(x = y))) distances. Specifically, we evaluate these four strategies with respect to their ability to create naive mixture encodings with low Error and low Verbosity.
Spectral and KMeans clustering algorithms are implemented by sklearn [45] in Python. We gradually increase K (i.e., the number of clusters) configured for each clustering algorithm to mimic the process of continuously sub-clustering the log, tolerating higher Total Verbosity for lower Error. To compare clustering methods fairly, we reduce randomness in clustering (e.g., random initialization in KMeans) by running each of them 10 times for each K and averaging the Error of the resulting encodings. We used two datasets: "US Bank" and "PocketData." We describe both datasets and the data preparation process in detail in Section 7. All results for our clustering experiments are shown in Figure 2. We next show that clustering is an effective way to consistently reduce Error, although no one clustering method is ideal along all three of Error, Verbosity, and runtime.
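As a concrete illustration of this setup, the sketch below clusters binary query-feature vectors with sklearn's KMeans and derives one naive encoding (per-feature marginals plus a cluster weight) for each cluster. The synthetic data, the choice of K, and the dictionary layout are ours; the actual experiments run over the PocketData and US Bank feature matrices.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 50))       # one 0/1 row per query (synthetic)

K = 5
labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)

naive_mixture = []
for k in range(K):
    part = X[labels == k]
    naive_mixture.append({
        "weight": len(part) / len(X),          # |L_i| / |L|
        "marginals": part.mean(axis=0),        # p(X_i = 1 | L_i): the naive encoding
    })
print([round(c["weight"], 3) for c in naive_mixture])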
Adding more clusters reduces Error.
Figure 2a compares the relationship between the number of clusters (x-axis) and Error (y-axis), showing the varying rates of convergence to zero Error for each clustering method. We observe that adding more clusters does consistently reduce Error for both data sets, regardless of clustering method and distance measure. We note that the US Bank dataset is significantly more diverse than the PocketData dataset with respect to the total number of features (see Table 1), and that more than 30 clusters may be required to reach near-zero Error. In general, Hamming distance converges faster than the other methods on PocketData.
Adding more clusters increases Verbosity.
Figure 2b compares the relationship between the number of clusters (x-axis) and Verbosity (y-axis). We observe that Verbosity increases with the number of clusters. This is because when a partition is split, each feature common to both resulting partitions increases the Verbosity by 1.
Hierarchical Clustering.
Classical clustering methods produce non-monotonic cluster assignments. That is, the ratio of Error to Verbosity can grow with more clusters, as seen in Figures 2a and 2b. An alternative is to use hierarchical clustering [29], which forces monotonic assignments and offers more dynamic control over the Error/Verbosity tradeoff.
Run Time Comparison.
The total run time (y-axis) in Figure 2c includes both distance matrix computation time (if any) and clustering time. Note the log scale: K-Means is orders of magnitude faster than the others, with Hamming distance also performing competitively. We also evaluated Spectral Clustering with Euclidean, Chebyshev and Canberra distances; these did not perform better and we omit them in the interest of conciseness.
Figure 2: Distance Measure Comparison. (a) Error v. Number of Clusters (Left: PocketData, Right: US Bank). (b) Total Verbosity v. Number of Clusters (Left: PocketData, Right: US Bank); each point is the verbosity of one of ten trials, and the y-axis' lower bound is the verbosity at 1 cluster to better show the change in verbosity. (c) Runtime v. Number of Clusters (Left: PocketData, Right: US Bank).

Take-Aways.
Hamming distance provides the best trade-off between Error and runtime. For time-sensitive applications, KMeans is preferred to Spectral Clustering.
Visualizing Naive Mixture Encoding.
As with normal pattern summaries, naive mixture summaries are also interpretable. For example, a visualization like that of Figure 1a can be repeated, once for each cluster. For more details, see Appendix E.
Recall that our primary goal is estimating statistical properties. In particular, we are interested in counting the occurrences (i.e., the marginal) of some pattern b in the log:

Γ_b(L) = |{ q | q ∈ L ∧ b ⊆ q }|

Recall that a naive encoding E_L includes only single-feature patterns. Assuming log distributions allowed by this encoding are equally likely, the maximal entropy distribution ρ_{E_L} is the representative distribution. Hence, we estimate Γ_b(L) by multiplying the probability of the pattern occurring by the size of the log (|L| · ρ_{E_L}), or:

est[Γ_b(L) | E_L] = |L| · Π_{f ∈ E_L, f ⊆ b} E_L[f]

This process trivially generalizes to naive pattern mixture encodings by mixing distributions. Specifically, given a set of partitions L_1 ∪ ... ∪ L_K = L, the estimated count for Γ_b(L) under each individual partition L_i can be computed based on the partition's encoding E_i, and we sum up the estimated counts over all partitions:

est[Γ_b(L) | E_1, ..., E_K] = Σ_{i ∈ [1,K]} est[Γ_b(L_i) | E_i]
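This estimator is straightforward to implement. The sketch below evaluates est[Γ_b(L)] from a naive mixture encoding stored as per-cluster weights and marginals (the same layout as the clustering sketch above); the data values and names are ours.

import numpy as np

def estimate_count(mixture, total_queries, pattern):
    # est[Gamma_b(L)]: for each cluster, |L_i| times the product of the
    # marginals of the features present in b (independence within a cluster).
    idx = np.flatnonzero(pattern)
    return sum(comp["weight"] * total_queries * np.prod(comp["marginals"][idx])
               for comp in mixture)

mixture = [
    {"weight": 2 / 3, "marginals": np.array([1.0, 0.0, 1.0, 0.5])},
    {"weight": 1 / 3, "marginals": np.array([0.0, 1.0, 1.0, 0.0])},
]
b = np.array([1, 0, 1, 0])                 # e.g., {<id, SELECT>, <Messages, FROM>}
print(estimate_count(mixture, 3, b))       # 2.0: both partition-1 queries contain b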
In this section, we empirically verify the effectiveness of naive mixture encodings in approximating log statistics from two related perspectives. The first perspective focuses on synthesis error: it measures whether patterns synthesized by the naive mixture encoding actually exist in the log. From the second perspective, we investigate the marginal deviation of patterns contained in the log of queries; this evaluates whether a naive mixture encoding will compute the correct marginal for patterns of interest to a client application. Specifically, synthesis error is measured as 1 − M/N, where N is the total number of randomly synthesized patterns and M is the number of synthesized patterns with positive marginals in the log. Marginal deviation is measured as |EST_M − T_M| / T_M, where T_M stands for the True Marginal of a pattern and EST_M is the one estimated by the naive mixture encoding. Experimental results are shown in Figure 3. Both synthesis error and marginal deviation consistently decrease given more clusters. Furthermore, as we vary the number of clusters, both measures are correlated with Reproduction Error.

Figure 3: Effectiveness of Naive Mixture Encoding. (a) Synthesis Error v. Reproduction Error. (b) Marginal Deviation v. Reproduction Error.
Synthesis Error.
Figure 3a shows synthesis error (y-axis) versus Reproduction Error (x-axis). The figure is generated by randomly synthesizing N = 10000 patterns from each partition of the log. Note that different values of N give similar observations. The overall synthesis error is measured as the average synthesis error over partitions, weighted by the proportion of queries in each partition.

Marginal Deviation.
Figure 3b shows marginal deviation (y-axis) versus Reproduction Error (x-axis). It is not feasible to enumerate all patterns that exist in the data. As an alternative, we treat each distinct query in the log as a pattern and treat the marginal deviation on it as the worst case for all patterns that it may contain; marginal deviation tends to be smaller when measured on a pattern that is contained in another. For each cluster, we sum up the marginal deviation over all distinct queries, and the final marginal deviation for the whole log is a weighted average (as for synthesis error) over all clusters.
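Both measures are easy to compute against a single naive encoding; the sketch below gives a simplified, single-cluster version (the experiments weight the per-cluster values by partition size). The sampling count, seed, averaging choice, and names are ours.

import numpy as np

def synthesis_error(vectors, marginals, n_samples=10000, seed=0):
    # 1 - M/N: fraction of patterns drawn from the naive encoding that never
    # occur in the log (zero true marginal).
    rng = np.random.default_rng(seed)
    X = np.asarray(vectors)
    sampled = (rng.random((n_samples, X.shape[1])) < marginals).astype(int)
    occurs = np.array([np.any(np.all(X >= s, axis=1)) for s in sampled])
    return 1.0 - occurs.mean()

def marginal_deviation(vectors, marginals):
    # |EST_M - T_M| / T_M averaged over the distinct queries in the log,
    # each treated as a pattern.
    X = np.asarray(vectors)
    devs = []
    for q in sorted({tuple(v) for v in vectors}):
        q = np.array(q)
        true = np.mean(np.all(X >= q, axis=1))
        est = np.prod(np.where(q == 1, marginals, 1.0))
        devs.append(abs(est - true) / true)
    return float(np.mean(devs))

vectors = [(1, 0, 1, 1), (1, 0, 1, 0), (0, 1, 1, 0)]
marg = np.asarray(vectors, dtype=float).mean(axis=0)
print(synthesis_error(vectors, marg), marginal_deviation(vectors, marg))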
Naive mixture encodings can already achieve close to near-zero Error (Figure 2a), have low Verbosity, and admit efficiently computable log statistics Γ_b(L). Although doing so makes estimating statistics more computationally expensive, as a thought experiment, we next consider how much of an improvement we could achieve in the Error/Verbosity trade-off by exploring a hypothetical second stage that enriches naive mixture encodings by adding non-naive patterns.

Feature-Correlation Refinement.
The first challenge is that our closed-form formula for Reproduction Error only works for naive encodings. Hence, we first consider the simpler problem of identifying the individual pattern that most reduces the Reproduction Error of a naive encoding. Recall that the closed-form representation for Reproduction Error arises from independence between features (i.e., ρ_S(Q = q) = Π_i p(X_i = x_i)). Similarly, under naive encodings we have a closed-form estimation of marginals p(Q ⊇ b) (i.e., ρ_S(Q ⊇ b) = Π_i p(X_i ≥ x_i)). We define the feature-correlation of pattern b as the log-difference from its actual marginal to the estimation according to the naive encoding:

W_C(b, S) = log(p(Q ⊇ b)) − log(ρ_S(Q ⊇ b))

Intuitively, patterns with higher feature-correlations create higher Errors, which in turn makes them ideal candidates for addition to the compressed log encoding. For two patterns with the same feature-correlation, the one that occurs more frequently will have greater impact on Error [24]. As a result, we compute an overall score for ranking patterns involving feature-correlation:

corr_rank(b) = p(Q ⊇ b) · W_C(b, S)

Table 1: Summary of Data sets
Statistics PocketData US bank
We show in Section 7.1 that corr_rank closely correlates with Reproduction Error. That is, a higher corr_rank value indicates that a pattern produces a greater Reproduction Error reduction if introduced into the naive encoding.
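The score is cheap to evaluate directly from the log. The sketch below computes corr_rank for a candidate pattern over binary feature vectors; returning 0 for patterns that never occur (where the log-difference is undefined) is a boundary choice of ours, and the toy data and names are also ours.

import numpy as np

def corr_rank(vectors, pattern):
    # p(Q ⊇ b) * ( log p(Q ⊇ b) - log rho_S(Q ⊇ b) ), where rho_S assumes
    # feature independence (the naive encoding's estimate).
    X = np.asarray(vectors)
    idx = np.flatnonzero(pattern)
    actual = np.mean(np.all(X[:, idx] == 1, axis=1))
    naive = np.prod(X[:, idx].mean(axis=0))
    if actual == 0.0:
        return 0.0
    return actual * (np.log(actual) - np.log(naive))

vectors = [(1, 0, 1, 1), (1, 0, 1, 0), (0, 1, 1, 0)]
# <id> and <status=?> co-occur more often than independence predicts:
print(corr_rank(vectors, (1, 0, 0, 1)))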
Pattern Diversification.
This greedy approach only allows us to add a single pattern to each cluster. In general, we would like to identify a set of patterns. We cannot sum up the corr_rank of each pattern in the set to estimate its Reproduction Error, as the information content carried by patterns may overlap. To counter such overlap, or equivalently to diversify patterns, a search through the space of pattern sets is needed. This type of diversification is commonly used in pattern mining applications, but can quickly become expensive. As we show experimentally in Section 7.2, the benefit that can be obtained from diversification is minimal.
7. EXPERIMENTS
In this section, we design experiments to empirically (1) validate that Reproduction Error correlates with Deviation and (2) evaluate the effectiveness of LogR compression. We use two specific datasets in the experiments: (1) SQL query logs of the Google+ Android app extracted from the PocketData public dataset [32], and (2) SQL query logs that capture all query activity on the majority of databases at a major US bank over a period of approximately 19 hours. A summary of these two datasets is given in Table 1.
The PocketData-Google+ query log.
The dataset consists of SQL logs that capture all database activities of 11 Android phones. We selected the Google+ application for our study since it is one of the few applications where all users created a workload. This dataset can be characterized as a stable workload of exclusively machine-generated queries.
The US bank query log.
This log is an anonymized record of queries processed by multiple relational database servers at a major US bank [35] over a period of 19 hours. Of the nearly 73 million database operations captured, 58 million are not directly queries, but rather invocations of stored procedures, and 13 million are not able to be parsed by a standard SQL parser. Among the remaining 2.3 million parsed SQL queries, since we are focusing on conjunctive queries, we base our analysis on the 1.25 million valid SELECT queries. This dataset can be characterized as a diverse workload of both machine- and human-generated queries.

Figure 4: Validating The Reproduction Error Metric. (a) Containment captures Deviation (US bank). (b) Containment captures Deviation (PocketData). (c) Error captures Deviation (US bank). (d) Error captures Deviation (PocketData). (e) Error captures Correlation (US bank). (f) Error captures Correlation (PocketData).
Common Experiment Settings.
Experiments were performed on a 2.8 GHz Intel Core i7 CPU with 16 GB of 1600 MHz DDR3 memory and an SSD, running macOS Sierra.
Constant Removal.
A number of queries in US Bank differ only in hard-coded constant values. Table 1 shows the total number of queries, as well as the number of distinct queries if we ignore constants. By comparison, queries in PocketData all use JDBC parameters. For these experiments, we ignore constant values in queries.
Query Regularization.
We apply query rewrite rules (similar to [14]) to regularize queries into equivalent conjunctive forms, where possible. Table 1 shows the fractions of distinct queries that are in conjunctive form for PocketData and US bank, respectively. After regularization, all queries in both data sets can either be simplified into conjunctive queries or be re-written into a UNION of conjunctive queries compatible with Aligon et al.'s feature scheme [3].
Convex Optimization Solving.
All convex optimization problems involved in measuring Reproduction Error and Deviation are solved by the successive approximation heuristic implemented in the CVX toolbox [23] with the SeDuMi solver.
In this section, we validate that Reproduction Error is a practical alternative to Deviation. In addition, we also offer measurements of its correlation with Deviation, as well as with the feature-correlation described in Section 6.4. Since Deviation cannot be measured exactly (see Section 3.3), we approximate it using sampling, which is further explained in Appendix C. Because it is impractical to enumerate all possible encodings, we choose a subset of encodings for both datasets. Specifically, we first select all features with marginals in the range [0. , 0.99] and use these features to construct patterns. We then enumerate combinations of K (up to 3) patterns as our chosen encodings.
Containment Captures Deviation.
Here we empirically verify that containment (Section 4.2) captures Deviation (i.e., E_1 ≤_Ω E_2 → d(E_1) ≤ d(E_2)), to complete the chain of reasoning that Reproduction Error captures Deviation. Figures 4a and 4b show all pairs of encodings where E_2 ⊃ E_1. The y-axis shows the difference in Deviation values (i.e., d(E_1) − d(E_2)). Deviation d(S) is approximated by drawing 1,000,000 samples from the space of possible patterns. For clarity, we bin pairs of encodings by the degree of overlap between the encodings, measured by the Deviation of the set-difference between the two encodings, d(E_2 \ E_1); higher d(E_2 \ E_1) implies less overlap. Y-axis values are grouped into bins and visualized as boxplots (the blue box indicates the range within one standard deviation, and red/black crosses are outliers). Intuitively, all points above zero on the y-axis (i.e., d(E_1) − d(E_2) > 0) are pairs of encodings where the Deviation order agrees with the containment order. This is the case for virtually all encoding pairs.
Additive Separability of Deviation.
We also observe from Figures 4a and 4b that agreement between Deviation and containment order is correlated with overlap; more similar encodings are more likely to agree. Combined with Proposition 1, this shows first that, for similar encodings, Reproduction Error is likely to be a reliable indicator of Deviation. This also suggests that Deviation is additively separable: the information loss (measured as d(E_1) − d(E_2)) incurred by excluding the encoding E_2 \ E_1 from E_2 closely correlates with the quality (i.e., d(E_2 \ E_1)) of the encoding E_2 \ E_1 itself:

E_2 ⊃ E_1  →  d(E_1) − d(E_2) ≈ d(E_2 \ E_1)

Error correlates with Deviation.
As a supplement, Figures 4c and 4d empirically confirm that Reproduction Error (x-axis) indeed closely correlates with Deviation (y-axis). Mirroring our findings above, the correlation between them is tighter at lower Reproduction Error.
Error and Feature-Correlation.
Figure 4e and 4f showthe relationship between Reproduction Error (y-axis) and9he feature-correlation score corr rank (x-axis), as definedin Section 6.4. Values of y-axis are computed from the naiveencoding extended by a single pattern b containing multiplesfeatures (up to 3). One can observe that the ReproductionError of extended naive encodings almost linearly correlateswith corr rank ( b ). In addition, one can also observe that corr rank becomes higher when the pattern b encodes morecorrelated features. In this section, we design experiments serving two pur-poses: (1) Evaluating the potential reduction in Error fromrefining naive mixture encodings through state-of-the-artpattern based summarizers, and (2) Evaluating whether wecan replace naive mixture encodings by the encodings cre-ated from summarizers that we have plugged-in.
Experiment Setup.
To serve both purposes, we construct pattern mixture encodings under three different configurations: (1) naive mixture encoding; (2) pattern based encoding; and (3) naive mixture encoding refined by pattern based encoding. Naive mixture encodings are constructed by KMeans clustering. Pattern based encodings are generated by two state-of-the-art pattern based summarizers: (1) the Laserlight algorithm [20], which aims at summarizing multi-dimensional data D = (X_1, ..., X_n) augmented with an additional binary attribute A; and (2) the MTV algorithm [40], which aims at mining maximally informative patterns that summarize multi-dimensional binary data. The experiment results are shown in Figure 5, which contains 3 sub-figures. All sub-figures share the same x-axis, i.e., the number of clusters. Figure 5a evaluates the possible change in Error (y-axis) from plugging in MTV and Laserlight. Figure 5b compares the Error (y-axis) between the naive mixture encoding and the pattern mixture encoding obtained from only using patterns from MTV and Laserlight. Figure 5c compares the running time (y-axis) between constructing naive mixture encodings and applying the pattern based summarizers. We only show the results for the US bank data set, as results for PocketData give similar observations.
Figures 5b and 5c suggest that naive mixture encodings outperform pattern based encodings in two ways.
Computation Efficiency.
Furthermore, as one can observe from Figure 5c, the running time of constructing naive mixture encodings is significantly lower than that of Laserlight and MTV.

Reproduction Error.
We observe from Figure 5b that the Reproduction Error of naive mixture encodings is orders of magnitude lower than that obtained from summarizing using the patterns generated by Laserlight or MTV.

Verbosity.
The one way in which pattern based encodings outperform naive pattern mixtures is in verbosity. Both Laserlight and MTV produce encodings with significantly fewer patterns, as the naive pattern mixture summary requires at least one pattern for each feature (e.g., 5290 patterns in the US bank dataset). Conversely, mining this number of patterns is not computationally feasible (Figure 5c).
The experiment result is shown in Figure 5a. Note that the y-axis is offset to show the change in Error. We observe from the figure that the reduction in Error contributed by plugging in pattern based summarizers is small for both algorithms.

Figure 5: Naive Mixture Encodings (US bank). (a) Naive Mixture v. Naive Mixture+LaserLight/MTV (the y-axis is offset, i.e., has a non-zero start). (b) Naive Mixture v. LaserLight/MTV alone (the y-axis is in log scale). (c) Runtime Comparison (y-axis in log scale).
Dimensionality Restriction.
For Laserlight, this observation is partially due to the fact that we only keep the top 100 features (in terms of variability) of the data as its input, since Laserlight is implemented in PostgreSQL 9.1, which has a threshold of 100 arguments (one argument for each feature) that can be passed to a function.
Pattern Restriction.
For MTV, this is due to a limitation of 15 patterns that we have experienced in configuring it. We refer the reader to Section 4.5 of the paper [40], which explains the difficulty of inferring the maximum entropy distribution with an increasing number of patterns.
8. ALTERNATIVE APPLICATIONS
To fairly evaluate Laserlight and MTV, we incorporate their own data sets and empirically evaluate them against naive mixture encoding under their own applications.
Data Sets.
Specifically, we choose the Mushroom data set used in MTV [40], which is obtained from the FIMI dataset repository, and U.S. Census data on income (or simply the Income data set), which is downloaded from IPUMS-USA at https://usa.ipums.org/usa/ and used in Laserlight [20]. The basic statistics of the data sets are given in Table 2.
All experiments involving Laserlight and MTV will be evaluated under their own Error measures and data sets, unless otherwise stated. The experiments are organized as follows: First, we establish baselines by evaluating classical Laserlight and MTV on their original data; then we show that classical Laserlight and MTV can be generalized to partitioned data and that the generalization improves on their Error measures and also runtime; at last, we compare their generalized versions with naive mixture encoding to show that naive mixture encoding is a reasonable alternative.

Table 2: Data Sets of Alternative Applications (Statistics | Income | Mushroom).
We first explain how naive mixture encoding is evaluated based on the Error measures defined by Laserlight and MTV.

Evaluating Naive Encoding on Laserlight Error.
The Laserlight algorithm summarizes data D, which consists of feature vectors t augmented by some binary feature v. Denote the valuation of the binary feature v for each feature vector t as v(t). The goal is to mine a summary encoding E, which is a set of patterns contained in t ∈ D that offer predictive power on v(t). Denote the estimation (based on E) of v(t) as u_E(t) ∈ [0, 1]. The Laserlight Error is measured by

Σ_t ( v(t) log(v(t) / u_E(t)) + (1 − v(t)) log((1 − v(t)) / (1 − u_E(t))) )

Since the naive encoding E_L is equivalent to an independence assumption on features, u_{E_L}(t) is simply the probability of v(t) = 1, namely u_{E_L}(t) = |{ t | v(t) = 1, t ∈ D }| / |D|, regardless of t. Consequently, the Laserlight Error of the naive encoding is

−|D| ( u_{E_L} log u_{E_L} + (1 − u_{E_L}) log(1 − u_{E_L}) )

Evaluating Naive Encoding on MTV Error.
We denote the data also as D with some summary encoding E; the MTV Error of E is

−|D| · H(ρ*) + ½ |E| log |D|

where H(ρ*) stands for the entropy of the maximum entropy distribution ρ* defined in the paper. The second term in the MTV Error penalizes the verbosity of the encoding E. Defining the entropy of a feature f as H(f) = −p log p − (1 − p) log(1 − p), where p is the probability of the feature being present, H(ρ*) of the naive encoding is simply the summation Σ_f H(f) over all feature entropies.

Evaluating Naive Mixture Encoding.
Evaluation of the naive encoding can be generalized to the naive mixture by taking a weighted average over the resulting clusters (see Section 5.2).
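Both closed forms are one-liners once the relevant counts are known. The sketch below evaluates them for a naive encoding; it uses natural logarithms, and the ½ constant on the MTV verbosity penalty follows the BIC-style reading of the formula above, which is our assumption. All names and example numbers are illustrative.

import math

def laserlight_error_naive(d_size, positives):
    # -|D| ( u log u + (1-u) log(1-u) ), with u the fraction of rows
    # having v(t) = 1.
    u = positives / d_size
    if u in (0.0, 1.0):
        return 0.0
    return -d_size * (u * math.log(u) + (1 - u) * math.log(1 - u))

def mtv_error_naive(d_size, feature_probs):
    # -|D| * sum of per-feature entropies, plus the verbosity penalty
    # (1/2) |E| log |D|; for a naive encoding |E| equals the number of features.
    h = sum(-p * math.log(p) - (1 - p) * math.log(1 - p)
            for p in feature_probs if 0.0 < p < 1.0)
    return -d_size * h + 0.5 * len(feature_probs) * math.log(d_size)

print(laserlight_error_naive(1000, 400))
print(mtv_error_naive(1000, [0.9, 0.5, 0.1]))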
Establishing Baselines.
To establish baselines, we evaluate Laserlight and MTV on their own data sets. The results related to Error and run time are given in Figure 6 and Figure 7, respectively.

Figure 6: Error v. Number of Patterns. (a) Laserlight Error v. Number of Patterns, with the naive encoding as a reference. (b) MTV Error v. Number of Patterns, with the naive encoding as a reference. Note: the y-axis is offset to better visualize differences.
Figure 7: Run Time of Laserlight and MTV. (a) Laserlight run time on Income data. (b) MTV run time on Mushroom data.
In Figure 6, the x-axis is the number of patterns and the y-axis represents the Error measure of Laserlight and MTV in Figures 6a and 6b, respectively. We incorporate the naive encoding in Figure 6a as the reference method. Since there are 783 total features for the Income data set, the verbosity of its naive encoding is 783, which is shown as a vertical dotted line in Figure 6a. The Laserlight Error of the naive encoding is shown as the horizontal dotted line accordingly. For the Mushroom data set (Figure 6b), the verbosity of its naive encoding is 96. However, MTV quits with an error message if it is requested to mine over 15 patterns. Hence for Figure 6b, the limit of the x-axis is 15 and we only show the Error of the naive encoding as a reference line without marking out its verbosity. We observe in Figure 6a that the naive encoding outperforms Laserlight when their verbosity is equal (i.e., 783). In addition, approximately after 100 patterns, the slope of Error reduction becomes relatively flat. Similar observations can be made from Figure 6b. In Figure 7, we observe that the running time increases exponentially with the number of patterns, for both Laserlight and MTV. The take-aways from Figure 6 and Figure 7 are that (1) the naive encoding is faster and more accurate than classical Laserlight and MTV; and (2) the runtime increases superlinearly with the number of patterns mined from both Laserlight and MTV.

Anti-correlation and Dimensionality Reduction.
Recall from Section 7.2.2 that Laserlight is restricted to 100 features. For its own Income data set, Laserlight can be applied with its full set of 783 features. This is due to the prior knowledge that the 783 features belong to 9 groups; within each group, the features are mutually anti-correlated and can therefore be reduced to a single feature. Similarly, the Mushroom data set can be reduced from 95 to 21 features (see Table 2).
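The reduction can be sketched as follows (hypothetical group layout; within a group, at most one feature is assumed to be set per row, so the group collapses to one categorical feature recording which member, if any, is set):

import numpy as np

def collapse_group(X: np.ndarray, group_cols):
    # X: 0/1 instance-by-feature matrix; group_cols: indices of one anti-correlated group
    block = X[:, group_cols]
    collapsed = block.argmax(axis=1) + 1     # 1-based index of the active feature
    collapsed[block.sum(axis=1) == 0] = 0    # 0 means no feature in the group is set
    return collapsed

X = np.array([[1, 0, 0, 1],
              [0, 1, 0, 0],
              [0, 0, 1, 1]])
print(collapse_group(X, [0, 1, 2]))          # -> [1 2 3]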
We generalize Laserlight and MTV to partitioned data by running them on each cluster. We then combine the Errors over all clusters by taking a weighted average, as described in Section 5.2. Depending on how many patterns are mined from each cluster, Laserlight and MTV can be generalized in two ways: (1) the number of patterns mined from each cluster is scaled to equal the verbosity of the naive encoding; and (2) the total number of patterns mined from all clusters is fixed to a given number. We name the first type Laserlight (MTV) Mixture Scaled, which is comparable to the naive mixture encoding. We name the second type Laserlight (MTV) Mixture Fixed, which is comparable to the classical Laserlight (MTV) algorithm.

The Error reduction of Laserlight becomes relatively slow beyond 100 patterns (see Figure 6a). Hence we configure Classical Laserlight and Laserlight Mixture Fixed to mine a total of 100 patterns from the original data and the partitioned data respectively, in order to avoid underestimating Laserlight's performance. We describe the strategy for distributing the 100 patterns to partitions of the data sets in Appendix D.3. The experiment results are given in Figure 8. Figures 8a and 8b show Laserlight Mixture Fixed versus Classical Laserlight in Error and run time respectively, as the number of data partitions (i.e., clusters) is gradually increased. We observe an exponentially decreasing trend (i.e., an improvement) in both Laserlight Error and run time. We omit the experiment results for MTV as they yield similar observations.
Take-away.
As the data is partitioned into more clusters, both the runtime and the Error of Laserlight (MTV) Mixture Fixed improve exponentially. This observation can potentially be generalized to other pattern-mining-based algorithms.
At last, we compare Laserlight (MTV) Mixture Scaled with the naive mixture encoding. Since it is time-consuming for Laserlight to mine the same number of patterns as the naive encoding on the Income data (see the runtime analysis in Figure 7a), we choose the Mushroom data for Laserlight Mixture Scaled instead.

Figure 8: Laserlight Mixture Fixed v. Classical Laserlight ((a) Laserlight Error; (b) total running time; x-axis: number of clusters).

The experiment results are given in Figure 9. The x-axes of all sub-figures in Figure 9 represent the number of clusters, and the y-axes stand for Laserlight and MTV Error respectively. We include the baselines (i.e., naive encoding, classical Laserlight and MTV) as reference lines in Figures 9a and 9b respectively. We again ran into the 15-pattern limitation when configuring MTV. Hence the comparison between MTV Mixture Scaled and naive mixture encoding is not strictly on equal footing, as MTV Mixture Scaled cannot reach the same Total Verbosity as the naive mixture encoding. Their difference in verbosity is mitigated by the fact that the MTV Error measure penalizes verbosity.
Figure 9: Naive Mixture v. Laserlight/MTV Mixture ((a) Laserlight Error v. number of clusters; (b) MTV Error v. number of clusters; naive mixture encoding, naive encoding, Laserlight/MTV Mixture Scaled, and classical Laserlight/MTV shown for reference).
Figure 9a shows that both the naive mixture encoding and Laserlight Mixture Scaled have lower Error than their baselines. In addition, Laserlight Mixture Scaled has lower Error than the naive mixture encoding when the number of clusters is less than 4, and the two become close after 6 clusters. In other words, Laserlight is more accurate on lightly partitioned data. As the data is further partitioned, clusters become 'easier' to summarize, and the naive encoding becomes more similar to Laserlight. Figure 9b shows that the naive mixture encoding marginally outperforms MTV Mixture Scaled.

Take-away. Naive mixture encoding is faster and has similar (or lower) Error than Laserlight (MTV) Mixture Scaled.
9. RELATED WORK

9.1 Workload Analysis
Existing approaches related to workload analysis are frequently aimed at specific tasks like query recommendation [41, 22, 33, 50, 4], performance optimization [9, 13], outlier detection [30], or visual analysis [39].
Query Recommendation.
This task aims to track historical querying behavior and generate query recommendations. Related approaches [41, 33] flatten a query abstract syntax tree into a bag of fragments [41] or snippets [33] and adopt a feature vector representation of queries. User profiles are then built from the query log by grouping and summarizing feature vectors by user in order to make personalized recommendations. In OLAP systems, profiles are also built for workloads of similar OLAP sessions [4].
Performance Optimization.
Index selection [16, 21] and materialized view selection [2, 9, 13] are typical performance optimization tasks. The configuration search space is usually large, but can be reduced with appropriate summaries.
Outlier Detection.
Kamra et al. [30] aim at detecting anomalous behavior of queries in the log by summarizing query logs into profiles of normal user behavior.
Visual Analysis.
Makiyama et al. [39] provide a set of visualizations that facilitate further workload analysis on the Sloan Digital Sky Survey (SDSS) dataset. QueryScope [25] aims at finding better tuning opportunities by helping human experts identify patterns shared among queries.

In these approaches, queries are commonly encoded as feature vectors or bit-maps, where a bit array is mapped to a list of features, with a 1 in a position if the corresponding feature appears in the query and a 0 otherwise. Workloads under the bit-map encoding must then be compressed before they can be efficiently queried or visualized for analysis.
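As an illustration only (the feature list below is hypothetical), the bit-map encoding amounts to:

# Each query is mapped to a bit array with a 1 in position i iff feature i appears in it.
FEATURES = ["SELECT id", "FROM messages", "WHERE status=?", "ORDER BY timestamp"]

def to_bitmap(query_features):
    return [1 if f in query_features else 0 for f in FEATURES]

print(to_bitmap({"FROM messages", "WHERE status=?"}))   # -> [0, 1, 1, 0]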
Run-length Encoding.
Run-length encoding (RLE) is a loss-less compression scheme commonly used in inverted index compression [48, 53] and column-oriented compression [1]. RLE-based compression algorithms include, but are not limited to, the Byte-aligned Bitmap Code (BBC) used in Oracle systems [7], Word-Aligned Hybrid (WAH) [49], and many others [42, 5, 6]. In general, RLE-based methods focus on column-wise compression and require additional heavyweight inference to obtain the frequencies of cross-column (i.e., row-wise) patterns used for workload analysis.
Lempel-Ziv Encoding.
Lempel-Ziv [51, 52] is the loss-less compression algorithm used by gzip. It takes variable-sized patterns (row-wise in our case) and replaces them with fixed-length codes, in contrast to Huffman encoding [26]. Lempel-Ziv encoding does not require knowledge of pattern frequencies in advance and builds the pattern dictionary dynamically. There are many other similar schemes for compressing files represented as sequential bit-maps, e.g., [46].
Dictionary Encoding.
Dictionary encoding is a more general form of Lempel-Ziv. It has the advantage that the patterns and frequencies stored in the dictionary can be interpreted as workload statistics useful for analysis. In this paper, we extend dictionary encoding and focus on using a dictionary to infer the frequencies of patterns not in it. Mampaey et al. proposed the MTV algorithm [40], which finds the dictionary (of a given size) with the lowest Bayesian Information Criterion (BIC) score; a lower BIC score indicates better accuracy for frequency inference. El Gebaly et al. proposed the Laserlight algorithm [20], which builds a pattern dictionary for a slightly different goal: the quality of the dictionary depends on whether it can correctly infer the truth-value of some binary feature.
Generative Models.
A generative model is a lossy compressed representation of the original log. Typical generative models include probabilistic topic models [10, 47] and the noisy-channel model [34]. Generative models can infer pattern frequencies, but they lack a model-independent measure for efficiently evaluating overall inference accuracy.
Low-Rank Matrix Decomposition.
Low-rank matrix decomposition [19], e.g., Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF) [38], offers lossy data compression, but the resulting matrices after decomposition are not well suited for inferring workload statistics.
10. CONCLUSION
In this paper, we introduced the problem of log compression and defined a family of pattern-based log encodings. We precisely characterized the information content of logs and offered three principled measures and one practical measure of encoding quality: Verbosity, Ambiguity, Deviation, and Reproduction Error. To reduce the search space of pattern-based encodings, we introduced the idea of partitioning logs into separate components, which induces the family of pattern mixture encodings as well as its simplified form, naive mixture encodings. Finally, we experimentally showed that naive mixture encodings are more informative and can be constructed more efficiently than encodings produced by state-of-the-art pattern-based summarization techniques.
Future Work.
The use of mixture models for summarization has potential implications for work on pattern mining; as we show, existing techniques can be substantially improved. We also expect that making accurate correlated-feature counting efficient will enable a range of more powerful database tuning and intrusion detection systems.
11. REFERENCES

[1]
Abadi, D., Madden, S., and Ferreira, M.
Integrating compression and execution incolumn-oriented database systems. In
Proceedings ofthe 2006 ACM SIGMOD International Conference onManagement of Data (New York, NY, USA, 2006),SIGMOD ’06, ACM, pp. 671–682.[2]
Agrawal, S., Chaudhuri, S., and Narasayya,V. R.
Automated selection of materialized views andindexes in sql databases. In
Proceedings of the 26thInternational Conference on Very Large Data Bases (San Francisco, CA, USA, 2000), VLDB ’00, MorganKaufmann Publishers Inc., pp. 496–505.[3]
Aligon, J., Golfarelli, M., Marcel, P., Rizzi,S., and Turricchia, E.
Similarity measures for olapsessions.
Knowledge and Information Systems 39 , 2(May 2014), 463–489.[4]
Aligon, J., Golfarelli, M., Marcel, P., Rizzi,S., and Turricchia, E.
Similarity measures for olapsessions.
Knowledge and information systems 39 , 2(2014), 463–489.[5]
Amer-Yahia, S., and Johnson, T.
Optimizingqueries on compressed bitmaps. In
Proceedings of the26th International Conference on Very Large DataBases (San Francisco, CA, USA, 2000), VLDB ’00,Morgan Kaufmann Publishers Inc., pp. 329–338.[6]
Antoshenkov, G.
Byte-aligned bitmap compression.In
Data Compression Conference, 1995. DCC ’95.Proceedings (March 1995), pp. 476–.[7]
Antoshenkov, G., and Ziauddin, M.
Queryprocessing and optimization in oracle rdb.
The VLDBJournal 5 , 4 (Dec 1996), 229–237.[8]
Aouiche, K., Jouve, P.-E., and Darmont, J.
Clustering-based materialized view selection in datawarehouses. In
ADBIS (2006).[9]
Aouiche, K., Jouve, P.-E., and Darmont, J.
Clustering-based materialized view selection in datawarehouses. In
East European Conference onAdvances in Databases and Information Systems (2006), Springer, pp. 81–95.[10]
Blei, D. M.
Probabilistic topic models.
Commun.ACM 55 , 4 (Apr. 2012), 77–84.[11]
Boyd, S., and Vandenberghe, L.
ConvexOptimization . Cambridge University Press, New York,NY, USA, 2004.[12]
Bruno, N., and Chaudhuri, S.
Automatic physicaldatabase tuning: A relaxation-based approach. In
ACM SIGMOD (2005).[13]
Bruno, N., Chaudhuri, S., and Gravano, L.
Stholes: a multidimensional workload-awarehistogram. In
ACM SIGMOD Record (2001), vol. 30, ACM, pp. 211–222. [14]
Chandra, B., Joseph, M., Radhakrishnan, B.,Acharya, S., and Sudarshan, S.
Partial markingfor automated grading of sql queries.
Proc. VLDBEndow. 9 , 13 (Sept. 2016), 1541–1544.[15]
Chatzopoulou, G., Eirinaki, M., Koshy, S.,Mittal, S., Polyzotis, N., and Varman, J. S. V.
The QueRIE system for personalized queryrecommendations.
IEEE Data Eng. Bull. (2011).[16]
Chaudhuri, S., and Narasayya, V. R.
An efficient,cost-driven index selection tool for microsoft sql server. In
VLDB (1997), vol. 97, Citeseer, pp. 146–155.[17]
Darroch, J. N., and Ratcliff, D.
Generalizediterative scaling for log-linear models.
Ann. Math.Statist. 43 , 5 (10 1972), 1470–1480.[18]
Dwork, C.
ICALP 2006, Proceedings, Part II .Springer, 2006, ch. Differential Privacy.[19]
Eckart, C., and Young, G.
The approximation ofone matrix by another of lower rank.
Psychometrika 1 ,3 (1936), 211–218.[20]
El Gebaly, K., Agrawal, P., Golab, L., Korn,F., and Srivastava, D.
Interpretable andinformative explanations of outcomes.
Proc. VLDBEndow. 8 , 1 (Sept. 2014), 61–72.[21]
Finkelstein, S., Schkolnick, M., and Tiberio, P.
Physical database design for relational databases.
ACM Transactions on Database Systems (TODS) 13 ,1 (1988), 91–128.[22]
Giacometti, A., Marcel, P., Negre, E., andSoulet, A.
Query recommendations for olapdiscovery driven analysis. In
Proceedings of the ACMTwelfth International Workshop on Data Warehousingand OLAP (New York, NY, USA, 2009), DOLAP ’09,ACM, pp. 81–88.[23]
Grant, M., and Boyd, S.
CVX: Matlab software fordisciplined convex programming, version 2.1. http://cvxr.com/cvx , Mar. 2014.[24]
Han, J., Cheng, H., Xin, D., and Yan, X.
Frequentpattern mining: Current status and future directions.
Data Min. Knowl. Discov. 15 , 1 (Aug. 2007), 55–86.[25]
Hu, L., Ross, K. A., Chang, Y.-C., Lang, C. A.,and Zhang, D.
Queryscope: visualizing queries forrepeatable database tuning. pVLDB (2008).[26]
Huffman, D. A.
A method for the construction ofminimum-redundancy codes.
Proceedings of the IRE40 , 9 (1952), 1098–1101.[27]
Jain, A. K.
Data clustering: 50 years beyondk-means.
Pattern Recogn. Lett. 31 , 8 (June 2010),651–666.[28]
Jaynes, E., and Bretthorst, G.
ProbabilityTheory: The Logic of Science . Cambridge UniversityPress, 2003.[29]
Johnson, S. C.
Hierarchical clustering schemes.
Psychometrika 32 , 3 (Sep 1967), 241–254.[30]
Kamra, A., Terzi, E., and Bertino, E.
Detecting anomalous access patterns in relational databases.
The VLDB Journal: The International Journal on Very Large Data Bases 17, 5 (2008), 1063–1077. [31]
Kannan, R., Vempala, S., and Vetta, A.
Onclusterings: Good, bad and spectral.
J. ACM 51 , 3(May 2004), 497–515.[32]
Kennedy, O., Ajay, J. A., Challen, G., andZiarek, L.
Pocket data: The need for tpc-mobile. In
TPC-TC (2015).[33]
Khoussainova, N., Kwon, Y., Balazinska, M.,and Suciu, D.
Snipsuggest: Context-awareautocompletion for sql.
Proc. VLDB Endow. 4 , 1(Oct. 2010), 22–33.[34]
Knight, K., and Marcu, D.
Summarization beyondsentence extraction: A probabilistic approach tosentence compression.
Artif. Intell. 139, 1 (July 2002), 91–107. [35]
Kul, G., Luong, D., Xie, T., Coonan, P.,Chandola, V., Kennedy, O., and Upadhyaya, S.
Ettu: Analyzing query intents in corporate databases.In
Proceedings of the 25th International ConferenceCompanion on World Wide Web (2016), pp. 463–466.[36]
Kul, G., Upadhyaya, S. J., and Chandola, V.
Detecting data leakage from databases on androidapps with concept drift. In
IEEE TrustCom (2018),pp. 905–913.[37]
Kullback, S., and Leibler, R. A.
On informationand sufficiency.
Ann. Math. Statist. 22 , 1 (03 1951),79–86.[38]
Lee, D. D., and Seung, H. S.
Learning the parts ofobjects by non-negative matrix factorization.
Nature401 , 6755 (1999), 788.[39]
Makiyama, V. H., Raddick, J., and Santos, R. D.
Text mining applied to sql queries: A case study forthe sdss skyserver. In
SIMBig (2015), pp. 66–72.[40]
Mampaey, M., Vreeken, J., and Tatti, N.
Summarizing data succinctly with the mostinformative itemsets.
ACM Trans. Knowl. Discov.Data 6 , 4 (Dec. 2012), 16:1–16:42.[41]
Mittal, S., Varman, J. S. V., Chatzopoulou, G.,Eirinaki, M., and Polyzotis, N.
Querie: A queryrecommender system supporting interactive databaseexploration. In (Dec 2010),pp. 1411–1414.[42]
Moffat, A., and Zobel, J.
Compression and fastindexing for multi-gigabyte text databases.
AustralianComputer Journal 26 (1994).[43]
O’Donoghue, B., Chu, E., Parikh, N., and Boyd,S.
Conic optimization via operator splitting andhomogeneous self-dual embedding.
Journal ofOptimization Theory and Applications 169 , 3 (Jun2016), 1042–1068.[44]
Pavlo, A., Angulo, G., Arulraj, J., Lin, H., Lin,J., Ma, L., Menon, P., Mowry, T. C., Perron,M., Quah, I., Santurkar, S., Tomasic, A., Toor,S., Aken, D. V., Wang, Z., Wu, Y., Xian, R., andZhang, T.
Self-driving database managementsystems. In
CIDR (2017).[45]
Pedregosa, F., Varoquaux, G., Gramfort, A.,Michel, V., Thirion, B., Grisel, O., Blondel,M., Prettenhofer, P., Weiss, R., Dubourg, V.,Vanderplas, J., Passos, A., Cournapeau, D.,Brucher, M., Perrot, M., and Duchesnay, E.
Scikit-learn: Machine learning in Python.
Journal ofMachine Learning Research 12 (2011), 2825–2830.[46]
Skibiński, P., and Swacha, J.
Fast and efficient logfile compression. In
CEUR Workshop Proceedings ofthe 11th East-European Conference on Advances inDatabases and Information Systems (ADBIS) (2007),pp. 330–342.[47]
Wang, D., Zhu, S., Li, T., and Gong, Y.
Multi-document summarization using sentence-basedtopic models. In
Proceedings of the ACL-IJCNLP2009 Conference Short Papers (Stroudsburg, PA,USA, 2009), ACLShort ’09, Association forComputational Linguistics, pp. 297–300.[48]
Witten, I. H., Moffat, A., and Bell, T. C.
Managing gigabytes: Compressing and indexing documents and images - errata, 1996.[49]
Wu, K., Otoo, E. J., and Shoshani, A.
Compressing bitmap indexes for faster searchoperations. In
Proceedings 14th InternationalConference on Scientific and Statistical DatabaseManagement (2002), pp. 99–108.[50]
Yang, X., Procopiuc, C. M., and Srivastava, D.
Recommending join queries via query log analysis. In
Data Engineering, 2009. ICDE’09. IEEE 25thInternational Conference on (2009), IEEE,pp. 964–975.[51]
Ziv, J., and Lempel, A.
A universal algorithm forsequential data compression.
IEEE Transactions onInformation Theory 23 , 3 (May 1977), 337–343.[52]
Ziv, J., and Lempel, A.
Compression of individualsequences via variable-rate coding.
IEEE Trans. Inf.Theor. 24 , 5 (Sept. 2006), 530–536.[53]
Zobel, J., and Moffat, A.
Inverted files for textsearch engines.
ACM Comput. Surv. 38 , 2 (July 2006).
APPENDIX

A. NOMENCLATURE

Symbol : Meaning
f : Feature
b : Pattern
b′ ⊆ b : b′ is contained in b
q : Query
L : Log, a bag of queries
Q : Query randomly drawn from L
p(Q | L) : Query distribution of L
p(Q ⊇ b | L) : Marginal probability of Q ⊇ b
E_L[b] : Same as p(Q ⊇ b | L)
corr_rank(b) : Feature-correlation score
E_max : Mapping from all patterns to their marginals
E : Encoding, a partial mapping E ⊆ E_max
domain(·) : Domain of a mapping
ρ : An arbitrary query distribution
Ω_E : Space of ρ constrained by E
E ≤ E′ : Ω_E ⊆ Ω_{E′}
P_E : A random ρ drawn from Ω_E
ρ* : Same as p(Q | L); ρ* ∈ Ω_E
H(·) : Entropy of a distribution
ρ_E : Representative distribution of Ω_E
d(E) : Deviation
I(E) : Ambiguity
e(E) : Reproduction Error
|E| : |domain(E)|, Verbosity of the encoding
ρ ≪ ρ′ : ρ is absolutely continuous w.r.t. ρ′
D_KL(ρ || ρ′) : K-L divergence from ρ′ to ρ

B. PROOF OF PROPOSITION 1
Denote by B_n = {0, 1}^n the space of possible 0-1 vectors of size n, and define an encoding E_q with patterns:

domain(E_q) = { (x_1 + b_1, ..., x_n + b_n) | (b_1, ..., b_n) ∈ B_n }

We will show that E_q ⊆ E_max contains sufficient information to compute p = p(X_1 = x_1, ..., X_n = x_n) through several steps. First, we define a new pair of marginal probabilities p_1⟨b_1⟩ = p(X_1 ≥ x_1 + b_1, X_2 = x_2, ..., X_n = x_n). Since x_1 is integral, p = p_1⟨0⟩ − p_1⟨1⟩. Generalizing, we can define:

p_k⟨b_1, ..., b_k⟩ = p(X_1 ≥ x_1 + b_1, ..., X_k ≥ x_k + b_k, X_{k+1} = x_{k+1}, ..., X_n = x_n)

Again, x_k being integral gives us that:

p_{k−1}⟨b_1, ..., b_{k−1}⟩ = p_k⟨b_1, ..., b_{k−1}, 0⟩ − p_k⟨b_1, ..., b_{k−1}, 1⟩

Finally, when k = n, the probability p_n⟨b_1, ..., b_n⟩ is the marginal probability p(Q ⊇ b | L) of a pattern b = (x_1 + b_1, ..., x_n + b_n), which by definition is offered by E_q for any (b_1, ..., b_n) ∈ B_n. The resulting encoding E = ⋃_{q ∈ L} E_q identifies the distribution p(Q | L), which we refer to as a lossless encoding. Clearly any encoding that extends E (including E_max) is lossless.

C. SAMPLING FROM SPACE OF DISTRIBUTIONS
Here we describe how we sample a random distribution ρ from the space Ω_E of probability distributions.

C.1 Preliminary Sampling
To sample a random distribution ρ, we need to assign a probability value to each element in the vector space B_n. The naive way is to treat ρ as a random multi-dimensional vector (ρ(q_1), ..., ρ(q_{|B_n|})) whose entries sum up to 1. However, |B_n| is exponentially large (i.e., 2^n), so we reduce the number of elements in ρ by grouping them (i.e., q_1, ..., q_{|N_n|}) into equivalence classes.

Algorithm 1 Sampling
procedure TwoStepSampling
  Step 1:
  for each v ∈ B_m with C_v ≠ ∅ do
    V ← V ∪ {v}
  end for
  class_p ← UniRandDistribProb(V, 1)
  Step 2:
  for each v ∈ V do
    ρ ← ρ ∪ UniRandDistribProb(C_v, class_p(v))
  end for
  return ρ
end procedure

procedure UniRandDistribProb(Set S, double prob)
  for each element e ∈ S do
    p(e) ← UniformRandNum(range = [0, 1])
  end for
  for each element e ∈ S do
    p(e) ← prob × p(e) ÷ Σ_{e′ ∈ S} p(e′)
  end for
  return p
end procedure

Encoding-equivalent Classes.
The basic idea for grouping is based on the containment relationship between a query q_i and the patterns b ∈ E in the encoding E. More precisely, if q_i ⊇ b, the assignment ρ(q_i) on the i-th dimension is constrained by the marginal E[b] (see Section 3.2). As a result, if queries q_i, q_j share the same containment relationship with a pattern b, the assignments ρ(q_i), ρ(q_j) on the i-th and j-th dimensions make no difference for satisfying the constraint of pattern b. We thus define pattern-equivalence as

q_i ≡_b q_j ⇔ BI(q_i, b) = BI(q_j, b)

where BI(q_i, b) is the binary indicator function satisfying BI(q_i, b) = 1 ≡ q_i ⊇ b. Queries are encoding-equivalent, q_i ≡_E q_j, if they are pattern-equivalent for all patterns in the encoding. Numbering the patterns in the encoding as b_1, ..., b_m, any binary vector v ∈ B_m maps to an equivalence class C_v = { q | (BI(q, b_1), ..., BI(q, b_m)) = v ∧ q ∈ N_n }. Though the number of non-empty equivalence classes may grow as large as O(2^m), it is much smaller than 2^n in most cases, and sampling a random distribution ρ can be divided into two steps as shown in Algorithm 1. Note that class_p in the algorithm, which is produced by the first step, is a randomly sampled distribution over all non-empty equivalence classes. The second step redistributes the probabilities randomly assigned to each equivalence class to its class members in an unbiased way.
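The following Python sketch is our own port of Algorithm 1 (the names `two_step_sampling`, `uni_rand_distrib_prob`, and the `contains` predicate mirror the pseudocode but are not the authors' implementation):

import random
from collections import defaultdict

def uni_rand_distrib_prob(elements, prob):
    # Assign each element a random weight and rescale so the weights sum to `prob`.
    weights = {e: random.random() for e in elements}
    total = sum(weights.values())
    return {e: prob * w / total for e, w in weights.items()}

def two_step_sampling(queries, patterns, contains):
    # `contains(q, b)` plays the role of the binary indicator BI(q, b).
    # Group queries into encoding-equivalence classes keyed by their BI vector.
    classes = defaultdict(list)
    for q in queries:
        classes[tuple(int(contains(q, b)) for b in patterns)].append(q)
    # Step 1: random distribution over the non-empty equivalence classes.
    class_p = uni_rand_distrib_prob(list(classes), 1.0)
    # Step 2: spread each class's probability mass over its members.
    rho = {}
    for v, members in classes.items():
        rho.update(uni_rand_distrib_prob(members, class_p[v]))
    return rho

# Toy usage: queries and patterns are feature sets, containment is the superset relation.
qs = [frozenset({"a"}), frozenset({"a", "b"}), frozenset({"b"})]
print(two_step_sampling(qs, [frozenset({"a"})], lambda q, b: b <= q))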
C.2 Incorporating Constraints

So far we have been creating random samples from an unconstrained space of distributions. To make sure that the ρ produced by the two-step sampling falls within the space Ω_E, the probabilities distributed over the equivalence classes (denoted class_p) must obey the linear equality constraints derived from the encoding E. Denote the space of candidate class_p as U and the subspace allowed by the encoding as U_E ⊆ U. One naive solution is to reject any class_p ∉ U_E. However, the subspace U_E constrained by linear equality constraints is equivalent to an intersection of hyperplanes in the full space U. The volume of U_E is thus infinitely small compared to that of U, so a random sample class_p ∈ U will almost never fall within U_E. To make sampling feasible, we do not reject a sample class_p ∈ U, but instead project it onto the hyperplane of U_E by finding its closest (in Euclidean distance) counterpart class_p′ ∈ U_E:

class_p′ = arg min_{class_p′ ∈ U_E} || class_p′ − class_p ||

Finding the projection point class_p′ of class_p can be achieved with standard constrained optimization (it is a least-squares problem under linear equality constraints).
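One way to compute such a projection, assuming the equality constraints are collected into a matrix A and vector b, is the closed-form least-squares projection sketched below (it handles only the equality constraints; any non-negativity requirement would need an extra step):

import numpy as np

def project_onto_constraints(x: np.ndarray, A: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Euclidean projection onto {x : A x = b}: x' = x - A^T (A A^T)^{-1} (A x - b)
    correction = A.T @ np.linalg.solve(A @ A.T, A @ x - b)
    return x - correction

# Toy usage: force the three class probabilities to sum to 1.
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])
print(project_onto_constraints(np.array([0.2, 0.5, 0.5]), A, b))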
D. ALGORITHM CONFIGURATIONS

Here we give a detailed description of the selected state-of-the-art pattern-based summarizers (i.e., Laserlight and MTV) and specify how we configured them in the experiments discussed in Section 7.2.
Common Configuration.
We set up both algorithms to mine 15 patterns from the target clusters. This is because we found empirically that MTV quits with an error message when asked to mine more than 15 patterns. For a fair comparison, we set the same number of patterns for Laserlight.
Description.
Laserlight algorithm is proposed in [20] forsummarizing multi-dimension data (i.e., D = ( X , . . . , X n ))augmented by a binary attribute A . The goal is to searchfor a set of patterns (i.e., encoding) from the data D thatprovide maximum information for predicting augmented at-tribute A , which is a sub-problem of summarizing the joint16istribution p ( D, A ). Another algorithm
Flashlight is alsoproposed in the same paper but we omit it in our experimentdue its inferior scalability. The implementation of
Laserlight has been incorporated into PostgreSQL 9.1 and the sourcecode is only available upon request.
Experiment Settings.
Due to the restriction on the maximum number of data dimensions in the PostgreSQL implementation of Laserlight, we project the distribution p(Q | L) onto a limited set of 100 features. The selection criterion is based on feature entropy, or variability. More precisely, regarding the existence of the i-th feature as a random binary variable X_i, features are ranked by their entropy H(X_i). The feature with the highest entropy H(X_i) is chosen as the augmented attribute A. The algorithm heuristically selects a limited set of samples from the space of candidate patterns, from which the most informative pattern is selected to be added to the encoding. Note that when we applied Laserlight in our experiments, we set the number of samples to 16, as suggested in [20] based on its own data sets.
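A minimal sketch of this selection step (not the authors' code; whether the augmented attribute A counts toward the 100 kept features is our assumption):

import numpy as np

def entropy(p: float) -> float:
    return 0.0 if p in (0.0, 1.0) else float(-p * np.log(p) - (1 - p) * np.log(1 - p))

def select_features(X: np.ndarray, k: int = 100):
    # Rank binary features by entropy; the top feature becomes the augmented attribute A.
    scores = np.array([entropy(p) for p in X.mean(axis=0)])
    ranked = np.argsort(-scores)              # highest entropy first
    return ranked[0], ranked[1:k]             # (attribute A, kept feature indices)

X = (np.random.rand(500, 300) < 0.3).astype(int)
a, kept = select_features(X)
print(a, kept[:5])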
D.2 MTV Algorithm
Description.
The MTV algorithm is proposed in [40] for summarizing multi-dimensional data with binary attributes. The goal is to mine a succinct set of patterns (i.e., an encoding) that conveys the most important information (see the paper for the definition). The implementation of this algorithm can be obtained at http://adrem.ua.ac.be/succinctsummary.
Experiment Settings.
MTV requires setting a minimum support threshold for patterns. That is, patterns whose marginal is less than the threshold are ignored, in order to reduce the search space of candidate patterns. We set the minimum support threshold to 0.05 in our experiments, so that any pattern contained in more than 5% of queries is considered a candidate.
D.3 Configuring Laserlight Mixture Fixed
Given a data partitioning and a fixed total number of patterns to mine from all clusters, in order to determine the number of patterns mined from each cluster we need to assign weights w_i, with Σ_i w_i = 1, to each cluster i. The Reproduction Error e(E_L) of the naive encoding E_L for a cluster reflects how 'easy' the cluster is for pattern mining; the intuition is that e(E_L) = 0 indicates no need for additional pattern mining. Reproduction Error e(E_L) is also affected by the number of features n that ever occur in the cluster. Consider a toy data set with only two feature vectors v_1 = (0, 0) and v_2 = (1, 1): extending v_1 with new features of value 0 and v_2 with new features of value 1 will increase e(E_L), but not necessarily the number of patterns needed to accurately summarize the data. Hence we normalize e(E_L) by dividing it by the number of features n, which gives us w_i = e(E_L) / n. In addition, since the generalized measure for Laserlight and MTV gives each cluster weight proportional to its number of distinct data instances m, we also adjust w_i by multiplying it by the number of distinct data instances. The final weight assignment becomes w_i ∝ (m / n) · e(E_L).
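A minimal sketch of this allocation (the helper `allocate_patterns` is hypothetical; rounding can leave the budget off by one or two):

def allocate_patterns(total_patterns, m, n, e):
    # Weight cluster i by (m_i / n_i) * e_i, then split the fixed pattern budget proportionally.
    raw = [mi / ni * ei for mi, ni, ei in zip(m, n, e)]
    s = sum(raw)
    return [round(total_patterns * r / s) for r in raw]

# Toy usage: distribute 100 patterns over three clusters.
print(allocate_patterns(100, m=[500, 200, 300], n=[40, 25, 60], e=[0.8, 0.1, 0.4]))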
E. INTERPRETING NAIVE MIXTURE ENCODING

Due to sensitive information contained in the US bank data set, we only provide a visualization of PocketData. The visualization of PocketData is based on its naive mixture encoding under 8 clusters. The result is given in Figure 10. There are 5 sub-figures, each representing the naive encoding of one cluster. Note that we use shading to represent the magnitude of marginals; features with marginals that are too small become invisible and are omitted. The question mark '?' is a placeholder for constants. Three of the eight clusters are not shown in the figure: one cluster is too messy (i.e., further sub-clustering is needed) and two clusters give visualizations similar to Figures 10a and 10e. The caption of each sub-figure expresses our understanding of the task that the queries in the cluster are performing, obtained by visualizing the corresponding naive encodings. For simplicity, we also omit features of the SELECT category if they neither participate in the WHERE clause nor are intuitively related to other features in SELECT.

SELECT conversation id, participants type, first name, chat id, blocked, active
FROM conversation participants view
WHERE (chat id != ?) ∧ (conversation id = ?) ∧ (active = 1)
(a) Check the person who is active in a specific conversation and not participating in a specified chat.

SELECT status, timestamp, expiration timestamp, sms raw sender, message id, text
FROM conversations, message notifications view, messages view
ORDER BY Descend on timestamp
LIMIT
WHERE (expiration timestamp > ?) ∧ (status != 5) ∧ (conversation id = ?) ∧ (conversations.conversation id = conversation id)
(b) Check sender information for the most recent SMS messages that participate in a given conversation.

SELECT status, timestamp, conversation id, chat watermark, message id, sms type
FROM conversations, message notifications view
WHERE (conversation status != 1) ∧ (conversation pending leave != 1) ∧ (conversation notification level != 10) ∧ (timestamp > 1355...) ∧ (timestamp > chat watermark) ∧ (conversation id = ?) ∧ (conversations.conversation id = conversation id)
(c) Check recent messages in conversations of a specific type.

SELECT suggestion type, name, chat id
FROM suggested contacts
LIMIT
ORDER BY Ascend on upper(name)
WHERE (chat id != ?) ∧ (name != ?)
(d) Suggest contacts that avoid certain names and a certain chat.

SELECT sms type, timestamp, id
FROM messages
WHERE (sms type = 1) ∧ (status = 4) ∧ (transport type = 3) ∧ (timestamp >= ?)
(e) Check messages under type/status conditions.

Figure 10: Visualizing PocketData by its naive mixture encoding.