Query Log Compression for Workload Analytics
Ting Xie
University at Buffalo, SUNY
[email protected]

Varun Chandola
University at Buffalo, SUNY
[email protected]

Oliver Kennedy
University at Buffalo, SUNY
[email protected]
ABSTRACT
Analyzing database access logs is a key part of performance tuning, intrusion detection, benchmark development, and many other database administration tasks. Unfortunately, it is common for production databases to deal with millions or even more queries each day, so these logs must be summarized before they can be used. Designing an appropriate summary encoding requires trading off between conciseness and information content. For example: simple workload sampling may miss rare, but high impact queries. In this paper, we present LogR, a lossy log compression scheme suitable for use by many automated log analytics tools, as well as for human inspection. We formalize and analyze the space/fidelity trade-off in the context of a broader family of "pattern" and "pattern mixture" log encodings to which LogR belongs. We show through a series of experiments that LogR compressed encodings can be created efficiently, come with provable information-theoretic bounds on their accuracy, and outperform state-of-art log summarization strategies.
1. INTRODUCTION
Automated analysis of database access logs is critical for solving a wide range of problems, from database performance tuning [12], to compliance validation [18] and query recommendation [15]. For example, the Peloton self-tuning database [44] searches for optimal configurations by repeatedly simulating database performance based on statistical properties of historical queries. Unfortunately, query logs for production databases can grow to be large (a recent study of queries at a major US bank for a period of 19 hours found nearly 17 million SQL queries and over 60 million stored procedure executions [35]), and computing these properties from the log itself is slow. Tracking only a sample of these queries is not sufficient, as rare queries can disproportionately affect database performance, for example, if they benefit from an otherwise unnecessary index. Rather, we need a compressed summary of the log on which we can compute aggregate statistical properties. The problems of compression and summarization have been studied extensively (e.g., [51, 52, 26, 19, 10, 47, 34]). However, these schemes either require the use of heavyweight inference to recover the desired statistical measures, or produce unnecessarily large encodings.

In this paper, we adapt ideas from pattern mining and summarization [40, 20] to propose a middle ground: LogR, a summarization scheme that facilitates efficient (both in terms of storage and time) approximation of workload statistics. By adjusting a tunable parameter in LogR, users can choose to obtain a high-fidelity, albeit large, summary, or obtain a more compact summary with lower fidelity. Constructing the summary that best balances compactness and fidelity is challenging, as the search space of candidate summaries is combinatorially large [40, 20]. LogR offers a new approach to summary construction that avoids searching this space, making inexpensive, accurate computation of aggregate workload statistics possible. As a secondary benefit, the resulting summaries are also human-interpretable.

LogR does not admit closed-form solutions to classical fidelity measures like information loss, so we propose an alternative called Reproduction Error. We show through a combination of analytical and experimental evidence that Reproduction Error is highly correlated with several classical measures of encoding fidelity.
LogR-compressed data relies on a codebook based on structural elements like SELECT items, FROM tables, or conjunctive WHERE clauses [3]. This codebook provides a bi-directional mapping from SQL queries to a bit-vector encoding and back again, reducing the compression problem to one of compactly encoding a collection of feature-vectors. We further simplify the problem by observing that a common theme in use cases like automated performance tuning or query recommendation is the need for predominantly aggregate workload statistics. As these are order-independent, we are able to focus exclusively on compactly representing bags of feature-vectors.

LogR works by identifying groups of co-occurring structural elements that we call patterns. We define a family of pattern encodings of access logs, which map patterns to their frequencies in the log. For pattern encodings, we consider two idealized measures of fidelity: (1) Ambiguity, which measures how much room the encoding leaves for interpretation; and (2) Deviation, which measures how reliably the encoding approximates the original log. Neither Ambiguity nor Deviation can be computed efficiently for pattern encodings. Hence we propose a measure called Reproduction Error that is efficiently computable and that closely tracks both Ambiguity and Deviation. In general, the size of the encoding is inversely related with Reproduction Error: the more detailed the encoding, the more faithfully it represents the original log. Thus, log compression may be defined as a search over the space of pattern based encodings to identify the one that best trades off between these two properties. Unfortunately, searching for such an ideal encoding from the space can be computationally expensive [20, 40]. To overcome this limitation, we reduce the search space by first clustering entries in the log and then encoding each cluster separately, an approach that we call pattern mixture encoding. Finally we identify a simple approach to encoding individual clusters that we call naive mixture encodings, and show experimentally that it produces results competitive with more powerful techniques for log compression and summarization.

Concretely, in this paper we make the following contributions: (1) We define two families of compression for query logs: pattern and pattern mixture, (2) We define a computationally efficient measure, Reproduction Error, and demonstrate that it is a close approximation of Ambiguity and Deviation (two commonly used measures), (3) We propose a clustering-based approach to efficiently search for naive mixture encodings, and show how these encodings can be further optimized, and, (4) We experimentally validate LogR and show that it produces more precise encodings faster than several state-of-the-art pattern encoding algorithms.
Roadmap.
The paper is organized as follows: Section 2 formally defines the log compression problem and the summary representation; Section 3 then defines information loss for the summaries; Section 4 explains the difficulty in computing the theoretical loss measure and provides a practical alternative; Section 5 motivates data partitioning and generalizes the practical loss measure to partitioned data; Section 6 then introduces the proposed LogR compression scheme; Section 7 empirically validates the practical loss measure and evaluates the effectiveness of LogR compression by comparing it with two state-of-the-art summarization methods; Section 8 empirically verifies the effectiveness of LogR by evaluating it under the applications of the two comparison methods; Section 9 discusses related work and Section 10 concludes the paper.
2. PROBLEM DEFINITION
In this section, we introduce and formally define the log compression problem. We begin by exploring several applications that need to repeatedly analyze query logs.
Index Selection.
Selecting an appropriate set of indexes requires trading off between update costs, access costs, and limitations on available storage space. Existing strategies for selecting a (near-)optimal set of indexes typically repeatedly simulate database performance under different combinations of indexes, which in turn requires repeatedly estimating the frequency with which specific predicates appear in the workload. For example, if status = ? occurs in 90% of the queries in a workload, a hash index on status is beneficial.
Materialized View Selection.
The results of joins or highly selective selection predicates are good candidates for materialization when they appear frequently in the workload. Like index selection, view selection is a non-convex optimization problem, typically requiring exploration by repeated simulation, which in turn requires repeated frequency estimation over the workload.
Online Database Monitoring.
In production settings, it is common to monitor databases for atypical usage patterns that could indicate a serious bug or security threat. When query logs are monitored, it is often done retrospectively, some hours after-the-fact [35]. To support real-time monitoring it is necessary to quickly compute the frequency of a particular class of query in the system's typical workload. In each case, the application's interactions with the log amount to counting queries that have specific features: selection predicates, joins, or similar.
Let L be a log, or a finite collection of queries q ∈ L. We write f ∈ q to indicate that q has some feature f, such as a specific predicate or table in its FROM clause. We assume (1) that the universe of features in both a log and a query is enumerable and finite, (2) that the features are selected to suit specific applications, and (3) optionally that a query is isomorphic to its feature set (motivated in Section 2.3.2). We outline one approach to extracting features that satisfies all three assumptions below. We abuse syntax and write q to denote both the query itself, as well as the set of its features.

Let b denote some set of features f ∈ b, which we call a pattern. We write these sets using vector notation: b = (x_1, ..., x_n), where n is the number of distinct features appearing in the entire log and x_i indicates the presence (absence) of the i-th feature with a 1 (resp., 0). For any two patterns b, b', we say that b' is contained in b if b' ⊆ b. Equivalently, with b = (x_1, ..., x_n) and b' = (x'_1, ..., x'_n):

b' ⊆ b  ≡  ∀ i, x'_i ≤ x_i

Our goal then is to be able to query logs for the number of times a pattern b appears: |{ q | q ∈ L ∧ b ⊆ q }|.

For this paper, we specifically adopt the feature-extraction conventions of a query summarization scheme by Aligon et al. [3]. In this scheme, each feature is one of the following three query elements: (1) a table or sub-query in the FROM clause, (2) a column in the SELECT clause, and (3) a conjunctive atom of the WHERE clause.
Example. Consider the following example query.
SELECT _id, sms_type, _time FROM Messages
WHERE status = ? AND transport_type = ?
This query uses 6 features: ⟨sms_type, SELECT⟩, ⟨_id, SELECT⟩, ⟨_time, SELECT⟩, ⟨Messages, FROM⟩, ⟨status=?, WHERE⟩, and ⟨transport_type=?, WHERE⟩.
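To make the feature convention concrete, the following Python sketch extracts ⟨element, clause⟩ features from a simple conjunctive query. It only illustrates the convention (it is not the parser used by LogR): it assumes a single SELECT-FROM-WHERE block whose WHERE clause is a conjunction of atoms, and the function name and regular expression are ours.

import re

def aligon_features(sql):
    # Illustrative only: split a simple conjunctive query into
    # <element, clause> features; no general SQL parsing is attempted.
    m = re.match(r"\s*SELECT\s+(?P<select>.+?)\s+FROM\s+(?P<from>.+?)"
                 r"(?:\s+WHERE\s+(?P<where>.+?))?\s*;?\s*$",
                 sql, re.IGNORECASE | re.DOTALL)
    if m is None:
        raise ValueError("not a simple conjunctive query")
    features = {(c.strip(), "SELECT") for c in m.group("select").split(",")}
    features |= {(t.strip(), "FROM") for t in m.group("from").split(",")}
    if m.group("where"):
        atoms = re.split(r"\s+AND\s+", m.group("where"), flags=re.IGNORECASE)
        features |= {(a.strip(), "WHERE") for a in atoms}
    return features

print(aligon_features("SELECT _id, sms_type, _time FROM Messages "
                      "WHERE status = ? AND transport_type = ?"))
# prints the six features listed above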
Although this scheme is simple and limited to conjunctive queries (or queries with a conjunctive equivalent), it fulfills all three assumptions we make on feature extraction schemes. The features of a query (and consequently a log) are enumerable and finite, and the feature set of the query is isomorphic (modulo commutativity and column order) to the original query. Furthermore, even if a query is not itself conjunctive, it often has a conjunctive equivalent. We quantify this statement with Table 1, which provides two relevant data points from production query logs; in both cases, all logged queries can be rewritten into equivalent queries compatible with the Aligon scheme.
Figure 1: Example Encoding Visualizations. (a) Correlation-ignorant: features are highlighted independently (e.g., SELECT sms_type, external_ids, _time, _id FROM messages WHERE (sms_type=?) ∧ (status=?), with each feature shaded by its frequency). (b) Correlation-aware: pattern groups are highlighted together (e.g., SELECT sms_type FROM messages WHERE sms_type=? and SELECT sms_type FROM messages WHERE status=?).

Although we do not explore more advanced feature encoding schemes in detail here, we direct the interested reader to work on query summarization [39, 8, 35]. For example, a scheme by Makiyama et al. [39] also captures aggregation-related features like group-by columns, while an approach by Kul et al. [35] encodes partial tree-structures in the query.
As a lossy form of compression, LogR only approximates the information content of a query log. We next develop a simplified form of LogR that we call pattern-based encoding, and develop a framework for reasoning about the fidelity of a LogR-compressed log. As a basis for this framework, we first reframe the information content of a query log to allow us to adapt classical information-theoretical measures of information content.
We define the information content of the log as a distribution p(Q | L) of queries Q drawn uniformly from the log.

Example. Consider the following query log, which consists of four conjunctive queries.
1. SELECT _id FROM Messages WHERE status = ?
2. SELECT _time FROM Messages WHERE status = ? AND sms_type = ?
3. SELECT _id FROM Messages WHERE status = ?
4. SELECT sms_type, _time FROM Messages WHERE sms_type = ?
Drawing uniformly from the log, each entry will appear with probability 1/4 = 0.25. The query q_1 (= q_3) occurs twice, so the probability of drawing it is double that of the others (i.e., p(q_1 | L) = p(q_3 | L) = 2/4 = 0.5).

Treating a query as a vector of its component features, we can define a query q = (x_1, ..., x_n) to be an observation of the multivariate distribution over variables Q = (X_1, ..., X_n) corresponding to features. The event X_i = 1 occurs if feature i appears in a uniformly drawn query.

Example. Continuing, the universe of features for this query log is (1) ⟨_id, SELECT⟩, (2) ⟨_time, SELECT⟩, (3) ⟨sms_type, SELECT⟩, (4) ⟨status = ?, WHERE⟩, (5) ⟨sms_type = ?, WHERE⟩, and (6) ⟨Messages, FROM⟩. Accordingly, the queries can be encoded as feature vectors, with fields counting each feature's occurrences: q_1 = ⟨1, 0, 0, 1, 0, 1⟩, q_2 = ⟨0, 1, 0, 1, 1, 1⟩, q_3 = ⟨1, 0, 0, 1, 0, 1⟩, q_4 = ⟨0, 1, 1, 0, 1, 1⟩.
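The following sketch mirrors this example: it encodes a small log of feature sets as 0/1 vectors over the feature universe and computes the empirical distribution p(Q | L). The data layout and helper names are ours, chosen only for illustration.

from collections import Counter

log = [
    {("_id", "SELECT"), ("status=?", "WHERE"), ("Messages", "FROM")},
    {("_time", "SELECT"), ("status=?", "WHERE"), ("sms_type=?", "WHERE"), ("Messages", "FROM")},
    {("_id", "SELECT"), ("status=?", "WHERE"), ("Messages", "FROM")},
    {("sms_type", "SELECT"), ("_time", "SELECT"), ("sms_type=?", "WHERE"), ("Messages", "FROM")},
]

universe = sorted(set().union(*log))          # enumerable, finite feature universe
vectors = [tuple(int(f in q) for f in universe) for q in log]

# Empirical distribution p(Q | L): the repeated query receives twice the mass.
p = {v: c / len(log) for v, c in Counter(vectors).items()}
for v, prob in sorted(p.items()):
    print(v, prob)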
Patterns.
Our target applications require us to count the number of times features (co-)occur in a query. For example, materialized view selection requires counting tables used together in queries. Motivated by this observation, we begin by defining a broad class of pattern based encodings that directly encode co-occurrence probabilities. A pattern is an arbitrary set of features b = (x_1, ..., x_n) that may co-occur together. Each pattern captures a piece of information from the distribution p(Q | L). In particular, we are interested in the probability of uniformly drawing a query q from the log that contains the pattern b (i.e., q ⊇ b):

p(Q ⊇ b | L) = Σ_{q ∈ L ∧ q ⊇ b} p(q | L)

When it is clear from context, we abuse notation and write p(·) instead of p(· | L). Recall that p(Q) can be represented as a joint distribution over variables (X_1, ..., X_n), and the probability p(Q ⊇ b) is thus equivalent to the marginal probability p(X_1 ≥ x_1, ..., X_n ≥ x_n) of pattern b.

Pattern-Based Encodings.
Denote by E_max : {0, 1}^n → [0, 1] the full mapping from all patterns b ∈ {0, 1}^n to their marginals. A pattern based encoding E is any such partial mapping E ⊆ E_max. We denote the marginal of pattern b in encoding E_L by E_L[b] (= p(Q ⊇ b | L)). When it is clear from context, we abuse syntax and also use E to denote the set of patterns it maps (i.e., domain(E)). Hence, |E| is the number of mapped patterns, which we call the encoding's Verbosity. A pattern based encoder is any algorithm encode(L, ε) ↦ E whose input is a log L and whose output is a set of patterns E, with Verbosity thresholded at some integer ε. Many pattern mining algorithms [20, 40] can be used for this purpose.

A side-benefit of pattern based encodings is that, under the assumption of isomorphism in Section 2.1, patterns can be translated to their query representations and used for human analysis of the log. Figure 1 shows two examples. The approach illustrated in Figure 1a uses shading to show each feature's frequency in the log, and communicates frequently occurring constraints or attributes. This approach might, for example, help a human to manually select indexes. A second approach illustrated in Figure 1b conveys correlations, showing the frequency of entire patterns. Appendix E explores interpretable visualizations of pattern based summaries in greater depth.
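A pattern's marginal E[b] = p(Q ⊇ b | L) can be read directly off the encoded log. The sketch below computes it for a log of 0/1 feature vectors; the feature ordering and names are the ones we chose for the running example.

# Feature order: <_id,SELECT>, <_time,SELECT>, <sms_type,SELECT>,
#                <status=?,WHERE>, <sms_type=?,WHERE>, <Messages,FROM>
vectors = [
    (1, 0, 0, 1, 0, 1),
    (0, 1, 0, 1, 1, 1),
    (1, 0, 0, 1, 0, 1),
    (0, 1, 1, 0, 1, 1),
]

def marginal(vectors, pattern):
    # Fraction of logged queries whose feature vector dominates the
    # pattern component-wise, i.e., p(Q ⊇ b | L).
    hits = sum(all(x >= b for x, b in zip(v, pattern)) for v in vectors)
    return hits / len(vectors)

# Pattern b = {<status=?, WHERE>, <Messages, FROM>}
print(marginal(vectors, (0, 0, 0, 1, 0, 1)))    # 0.75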
3. INFORMATION LOSS
Our goal is to encode the distribution p(Q) as a set of patterns: obtaining a less verbose encoding (i.e., with fewer patterns), while also ensuring that the encoding captures p(Q) with minimal information loss. In this section, we define information loss for pattern based encodings. To establish a baseline for measuring information loss, we begin with the extreme cases. At one extreme, an empty encoding (|E| = 0) conveys no information. At the other extreme, we have the encoding E_max, which is the full mapping from all patterns. Having this encoding is a sufficient condition to exactly reconstruct the original distribution p(Q).

Proposition 1. For any query q = (x_1, ..., x_n) ∈ N^n, the probability of drawing exactly q at random from the log (i.e., p(X_1 = x_1, ..., X_n = x_n)) is computable, given E_max.

See Appendix B for proof of the proposition.

3.2 Lossy Summaries
Although lossless, E_max is also verbose. Hence, we will focus on lossy encodings that can be less verbose. A lossy encoding E ⊂ E_max may not be able to precisely identify the distribution p(Q), but can still be used to approximate it. We characterize the information content of a lossy encoding E by defining a space (denoted by Ω_E) of distributions ρ ∈ Ω_E allowed by an encoding E. This space is defined by constraints as follows: First, we have the general properties of probability distributions:

∀ q ∈ N^n : ρ(q) ≥ 0        Σ_{q ∈ N^n} ρ(q) = 1

Each pattern b in the encoding E constrains the marginal probability over its component features:

∀ b ∈ domain(E) : E[b] = Σ_{q ⊇ b} ρ(q)

Note that the dual constraints 1 − E[b] = Σ_{q ⊉ b} ρ(q) are redundant under the constraint Σ_{q ∈ N^n} ρ(q) = 1. The resulting space Ω_E is the set of all query logs, or equivalently the set of all possible distributions of queries, that obey these constraints. From the outside observer's perspective, the distribution ρ ∈ Ω_E that the encoding conveys is ambiguous: we model this ambiguity with a random variable P_E with support Ω_E. The true distribution p(Q) derived from the query log must appear in Ω_E, denoted p(Q) ≡ ρ* ∈ Ω_E (i.e., p(P_E = ρ*) > 0). Of the distributions ρ admitted by Ω_E, it is possible that some are more likely than others. For example, a query containing a column (e.g., status) is only valid if it also references a table that contains the column (e.g., Messages). This prior knowledge may be modeled as a prior on the distribution of P_E or by an additional constraint. However, for the purposes of this paper, we take the uninformed prior by assuming that P_E is uniformly distributed over Ω_E:

p(P_E = ρ) = 1/|Ω_E| if ρ ∈ Ω_E, and 0 otherwise.

Naive Encodings.
One specific family of lossy encodings that treats each feature as being independent (e.g., as in Figure 1a) is of particular interest to us. We call this family naive encodings, and return to it throughout the rest of the paper. A naive encoding is composed of all patterns that have exactly one feature with non-zero marginal:

{ b = (0, ..., 0, x_i, 0, ..., 0) | i ∈ [1, n], x_i = 1 }

Based on the space of distributions constrained by the encoding, the information loss of an encoding can be considered from two related, but subtly distinct perspectives: (1) Ambiguity measures how much room the encoding leaves for interpretation, and (2) Deviation measures how reliably the encoding approximates the target distribution p(Q).

Ambiguity.
We define the Ambiguity I(E) of an encoding as the entropy of the random variable P_E. The higher the entropy, the less precisely E identifies a specific distribution.

I(E) = − Σ_ρ p(P_E = ρ) · log(p(P_E = ρ))

Deviation.
The deviation from any permitted distribution ρ to the true distribution ρ* can be measured by the Kullback-Leibler (K-L) divergence [37] (denoted D_KL(ρ* || ρ)). We define the Deviation d(E) of an encoding as the expectation of the K-L divergence over all permitted ρ ∈ Ω_E:

d(E) = E_{P_E}[ D_KL(ρ* || P_E) ] = Σ_{ρ ∈ Ω_E} p(P_E = ρ) · D_KL(ρ* || ρ)

Limitations.
There are two limitations to these idealized measures in practice. First, K-L divergence is only defined when the probability measure ρ* is absolutely continuous with respect to the second measure ρ (denoted ρ* ≪ ρ). Second, neither Deviation nor Ambiguity has a closed-form formula.
4. PRACTICAL LOSS MEASURE
Computing either Ambiguity or Deviation requires enumerating the entire space of possible distributions, or an approximation thereof. One approach to estimating either measure is repeatedly sampling from, rather than enumerating, the space. However, accurate measures require a large number of samples, rendering this approach similarly infeasible. In this section, we propose a faster approach to assessing the fidelity of a pattern encoding. Specifically, we select a single representative distribution ρ_E from the space Ω_E, and use ρ_E to approximate both Ambiguity and Deviation.
The representative distribution is chosen by applying the maximum entropy principle [28], which is commonly used in pattern-based summarization [20, 40]. That is, we select the distribution ρ_E with maximum entropy:

ρ_E = argmin_{ρ ∈ Ω_E} −H(ρ)    where    H(ρ) = − Σ_{q ∈ N^n} ρ(q) log ρ(q)

The maximum entropy distribution ρ_E best represents the current state of knowledge. That is, a distribution with lower entropy assumes additional constraints derived from patterns that we do not know, and one with higher entropy violates the constraints from patterns we do know. Maximizing an objective function belonging to the exponential family (entropy in our case) under a mixture of linear equality/inequality constraints is a convex optimization problem [11], which guarantees a unique solution and can be efficiently solved [17] using the cvx toolkit [23][43], and/or by iterative scaling [20, 40]. For naive encodings specifically, we can assume independence between each feature X_i. Under this assumption, ρ_E has a closed-form solution:

ρ_E(q) = Π_i p(X_i = x_i)    where q = (x_1, ..., x_n)    (1)

Using (1), we define Reproduction Error e(E) as the entropy difference between the representative and true distributions:

e(E) = H(ρ_E) − H(ρ*)    where    ρ_E = argmin_{ρ ∈ Ω_E} −H(ρ)

In this section we prove that Reproduction Error closely parallels Ambiguity. We define a partial order lattice over encodings and show that for any pair of encodings on which the partial order is defined, a like relationship is implied for both Reproduction Error and Ambiguity. We supplement the proofs given in this section with an empirical analysis relating Reproduction Error to Deviation in Section 7.1.
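For naive encodings, both the maximum entropy distribution and Reproduction Error therefore have closed forms. The sketch below computes e(E) for a small log of binary feature vectors; it assumes strictly binary features and natural logarithms (the text does not fix a base), and all names and the toy data are ours.

import math
from collections import Counter

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def reproduction_error(vectors):
    N, n = len(vectors), len(vectors[0])
    marginals = [sum(v[i] for v in vectors) / N for i in range(n)]
    # H(rho_E): under independence, the entropy of the maximum entropy
    # distribution is the sum of per-feature Bernoulli entropies.
    h_model = sum(entropy([p, 1 - p]) for p in marginals)
    # H(rho*): entropy of the empirical distribution of whole queries.
    h_true = entropy([c / N for c in Counter(map(tuple, vectors)).values()])
    return h_model - h_true

vectors = [(1, 0, 1, 1), (1, 0, 1, 0), (0, 1, 1, 0)]   # a toy three-query log
print(reproduction_error(vectors))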
Containment.
We define a partial order over encodings ≤_Ω based on containment of their induced spaces Ω_E:

E_1 ≤_Ω E_2  ≡  Ω_{E_1} ⊆ Ω_{E_2}

That is, one encoding (i.e., E_1) precedes another (i.e., E_2) when all distributions admitted by the former encoding are also admitted by the latter.

Containment Captures Reproduction Error.
We first prove that the total order given by Reproduction Error is a superset of the partial order ≤_Ω.

Lemma 1. For any two encodings E_1, E_2 with induced spaces Ω_{E_1}, Ω_{E_2} and maximum entropy distributions ρ_{E_1}, ρ_{E_2}, it holds that E_1 ≤_Ω E_2 → e(E_1) ≤ e(E_2).

Proof. Firstly, Ω_{E_2} ⊇ Ω_{E_1} → ρ_{E_1} ∈ Ω_{E_2}. Since ρ_{E_2} has the maximum entropy among all distributions ρ ∈ Ω_{E_2}, we have H(ρ_{E_1}) ≤ H(ρ_{E_2}), which is equivalent to e(E_1) ≤ e(E_2).

Containment Captures Ambiguity.
Next, we show that the partial order based on containment implies a like relationship between the Ambiguities of pairs of encodings.

Lemma 2. Given encodings E_1, E_2 with an uninformed prior on P_{E_1}, P_{E_2}, it holds that E_1 ≤_Ω E_2 → I(E_1) ≤ I(E_2).

Proof. Given an uninformed prior, I(E) = log |Ω_E|. Hence E_1 ≤_Ω E_2 → |Ω_{E_1}| ≤ |Ω_{E_2}| → I(E_1) ≤ I(E_2).
5. PATTERN MIXTURE ENCODINGS
Thus far we have defined the problem of log compression, treating the query log as a single joint distribution p(Q) that captures the frequency of feature occurrence and/or co-occurrence. Patterns capture positive information about correlations. However, in cases like logs of mixed workloads, there are also many cases of anti-correlation between features. For example, consider a log that includes queries drawn from two workloads with disjoint feature sets. Pattern based summaries cannot convey such anti-correlations easily. As a result, while patterns including features from both workloads never actually occur in the log, a pattern-based summary of the log will indicate a non-zero marginal. Identifying significant workload variation, as might be caused by misuse or malicious workload-injection (mixture), is relevant to intrusion detection systems [36]. In addition, capturing anti-correlations helps to reduce data dimensionality and improves both the runtime and accuracy of state-of-the-art pattern mining algorithms (see Sections 8.1.2 and 8.1.3).

In this section, we propose a generalization of pattern encodings where the log is modeled not as a single probability distribution, but rather as a mixture of several simpler distributions. The resulting encoding is likewise a mixture: each component of the mixture of distributions is stored independently. Hence, we refer to it as a pattern mixture encoding, and it forms the basis of LogR compression. We first focus on a simplified form of this problem, where we only mix naive pattern encodings (we explore more general mixtures in Section 6.4). We refer to the resulting scheme as naive mixture encodings, and give examples of the encoding, as well as potential visualizations, in Section 5.1. Then we generalize Reproduction Error and Verbosity for pattern mixture encodings in Section 5.2. Finally, with generalized encoding evaluation measures, we evaluate several encoding strategies based on different clustering methods for creating naive mixture encodings.
Consider a toy query log with only 3 conjunctive queries:

1. SELECT id FROM Messages WHERE status = ?
2. SELECT id FROM Messages
3. SELECT sms_type FROM Messages

The vocabulary of this log consists of 4 features: ⟨id, SELECT⟩, ⟨sms_type, SELECT⟩, ⟨Messages, FROM⟩, and ⟨status = ?, WHERE⟩. Re-encoding the three queries as vectors, we get:

1. ⟨1, 0, 1, 1⟩
2. ⟨1, 0, 1, 0⟩
3. ⟨0, 1, 1, 0⟩

A naive encoding of this log can be expressed as:

⟨2/3, 1/3, 1, 1/3⟩

This encoding captures that all queries in the log pertain to the Messages table, but obscures the relationship between the remaining features. For example, this encoding obscures the anti-correlation between id and sms_type. Similarly, the encoding hides the association between status = ? and id. Such relationships are critical for evaluating the effectiveness of views or indexes.

Example. The maximal entropy distribution for a naive encoding assumes that features are independent. Assuming independence, the probability of query 1 from the log is:

p(id) · p(¬sms_type) · p(Messages) · p(status=?) = 4/27 ≈ 0.148

This is a significant difference from the true probability of this query (i.e., 1/3). Conversely, queries not in the log, such as the following, have non-zero probability in the encoding:

SELECT sms_type FROM Messages WHERE status = ?

p(¬id) · p(sms_type) · p(Messages) · p(status=?) = 1/27 ≈ 0.037

Now consider partitioning the log into two parts and encoding each separately:

Partition 1 (L_1): (1, 0, 1, 1), (1, 0, 1, 0)  →  ⟨1, 0, 1, 1/2⟩
Partition 2 (L_2): (0, 1, 1, 0)  →  ⟨0, 1, 1, 0⟩

The resulting encoding only has one non-integral probability: p(status = ? | L_1) = 0.5. Although there are now two encodings, the encodings are not ambiguous. The feature status = ? appears in exactly half of the log entries, and is indeed independent of the other features. All other attributes in each encoding appear in all queries in their respective partitions. Furthermore, the maximum entropy distribution induced by each encoding is exactly the distribution of queries in the compressed log. Hence, the Reproduction Error is zero for both of the two encodings.
We next generalize our definitions of Reproduction Error and Verbosity from pattern to pattern mixture encodings. Suppose query log L has been partitioned into K clusters, with L_i, S_i, ρ_{S_i}, and ρ*_i (where i ∈ [1, K]) representing the log of queries, encoding, maximum entropy distribution, and true distribution (respectively) for the i-th cluster. First, observe that the distribution for the whole log (i.e., ρ*) is the sum of distributions for each partition (i.e., ρ*_i), weighted by the proportion of queries (i.e., |L_i| / |L|) in the partition:

ρ*(q) = Σ_{i=1,...,K} w_i · ρ*_i(q)    where w_i = |L_i| / |L|

Generalized Reproduction Error.
Similarly, the maximum entropy distribution ρ_S for the whole log is:

ρ_S(q) = Σ_{i=1,...,K} w_i · ρ_{S_i}(q)

We define the Generalized Reproduction Error of a pattern mixture encoding similarly, as the weighted sum of the errors for each partition:

H(ρ_S) − H(ρ*) = Σ_i w_i · H(ρ_{S_i}) − Σ_i w_i · H(ρ*_i) = Σ_i w_i · e(S_i)

As in the base case, a pattern mixture encoding with low Generalized Reproduction Error indicates a high-fidelity representation of the original log. A process can infer the probability of any query p(Q = q | L) drawn from the original distribution simply by inferring its probability drawn from each cluster i (i.e., p(Q = q | L_i)) and taking a weighted average over all inferences. When it is clear from context, we refer to Generalized Reproduction Error simply as Error in the rest of this paper.
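The Generalized Reproduction Error is thus just a size-weighted average of per-cluster Errors. The sketch below, under the same assumptions as the earlier Reproduction Error sketch (binary features, natural logarithms, names ours), shows that splitting the Section 5.1 toy log into its two workloads drives the Error to zero.

import math
from collections import Counter

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def reproduction_error(vectors):
    N, n = len(vectors), len(vectors[0])
    marginals = [sum(v[i] for v in vectors) / N for i in range(n)]
    h_model = sum(entropy([p, 1 - p]) for p in marginals)     # independence
    h_true = entropy([c / N for c in Counter(map(tuple, vectors)).values()])
    return h_model - h_true

def generalized_error(clusters):
    # Weighted sum of per-cluster Errors; the weights are |L_i| / |L|.
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * reproduction_error(c) for c in clusters)

whole_log = [(1, 0, 1, 1), (1, 0, 1, 0), (0, 1, 1, 0)]
partitioned = [[(1, 0, 1, 1), (1, 0, 1, 0)], [(0, 1, 1, 0)]]
print(reproduction_error(whole_log))    # > 0: one naive encoding loses information
print(generalized_error(partitioned))   # 0.0: each partition is exactly captured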
Generalized Verbosity.
We generalize verbosity to mixture encodings as the Total Verbosity (Σ_i |S_i|), or the total size of the encoded representation. This approach is ideal for our target applications, where our aim is to reduce the representational size of the query log.
6. PATTERN MIXTURE COMPRESSION
We are now ready to describe the LogR compression scheme. Broadly, LogR attempts to identify a pattern mixture encoding that optimizes for some target trade-off between Total Verbosity and Error. A naive, though impractical, approach to finding such an encoding would be to search the entire space of possible pattern mixture encodings. Instead, LogR approximates the same outcome by identifying the naive pattern mixture encoding that is closest to optimal for the desired trade-off. As we show experimentally, the naive mixture encoding produced by this first stage is competitive with more complicated, slower techniques for summarizing query logs. We also explore a hypothetical second stage, where LogR refines the naive mixture encoding to further reduce error. The outcome of this hypothetical stage has a slightly lower Error and Verbosity, but does not admit efficient computation of database statistics.
LogR compression searches for a naive mixture encoding that best optimizes for a requested tradeoff between Total Verbosity and Error. As a way to make this search efficient, we observe that a log (or log partition) uniquely determines its naive mixture encoding. Thus the problem of searching for a naive mixture encoding reduces to the problem of searching for the corresponding log partitioning. We further observe that the Error of a naive mixture encoding is proportional to the diversity of the queries in the log being encoded; the more uniform the log (or partition), the lower the corresponding error. Hence, the partitioning problem further reduces to the problem of clustering queries in the log by feature overlap. To identify a suitable clustering scheme, we next evaluate four commonly used partitioning/clustering methods: (1) KMeans [27] with Euclidean distance (i.e., the l2-norm), and Spectral Clustering [31] with (2) Manhattan (i.e., l1-norm), (3) Minkowski (i.e., lp-norm, with p = 4), and (4) Hamming (Count(x ≠ y) / (Count(x ≠ y) + Count(x = y))) distances. Specifically, we evaluate these four strategies with respect to their ability to create naive mixture encodings with low Error and low Verbosity.
Spectral and KMeans clustering algorithms are implemented by sklearn [45] in Python. We gradually increase K (i.e., the number of clusters) configured for each clustering algorithm to mimic the process of continuously sub-clustering the log, tolerating higher Total Verbosity for lower Error. To compare clustering methods fairly, we reduce randomness in clustering (e.g., random initialization in KMeans) by running each of them 10 times for each K and averaging the Error of the resulting encodings. We used two datasets: "US Bank" and "PocketData." We describe both datasets and the data preparation process in detail in Section 7. All results for our clustering experiments are shown in Figure 2. We next show that clustering is an effective way to consistently reduce Error, although no one clustering method is ideal along all three of Error, Verbosity, and runtime.
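As a concrete illustration of this setup, the sketch below clusters binary query-feature vectors with sklearn's KMeans and derives one naive encoding (per-feature marginals plus a cluster weight) for each cluster. The synthetic data, the choice of K, and the dictionary layout are ours; the actual experiments run over the PocketData and US Bank feature matrices.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 50))       # one 0/1 row per query (synthetic)

K = 5
labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)

naive_mixture = []
for k in range(K):
    part = X[labels == k]
    naive_mixture.append({
        "weight": len(part) / len(X),          # |L_i| / |L|
        "marginals": part.mean(axis=0),        # p(X_i = 1 | L_i): the naive encoding
    })
print([round(c["weight"], 3) for c in naive_mixture])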
Adding more clusters reduces Error.
Figure 2a compares the relationship between the number of clusters (x-axis) and Error (y-axis), showing the varying rates of convergence to zero Error for each clustering method. We observe that adding more clusters does consistently reduce Error for both data sets, regardless of clustering method and distance measure. We note that the US Bank dataset is significantly more diverse than the PocketData dataset with respect to the total number of features (see Table 1), and that more than 30 clusters may be required to reach near-zero Error. In general, Hamming distance converges faster than the other methods on PocketData.
Adding more clusters increases Verbosity.
Figure 2b compares the relationship between the number of clusters (x-axis) and Verbosity (y-axis). We observe that Verbosity increases with the number of clusters. This is because when a partition is split, each feature common to both resulting partitions increases the Verbosity by 1.
Hierarchical Clustering.
Classical clustering methods produce non-monotonic cluster assignments. That is, the ratio of Error to Verbosity can grow with more clusters, as seen in Figures 2a and 2b. An alternative is to use hierarchical clustering [29], which forces monotonic assignments and offers more dynamic control over the Error/Verbosity tradeoff.
Run Time Comparison.
The total run time (y-axis) in Figure 2c includes both distance matrix computation time (if any) and clustering time. Note the log scale: K-Means is orders of magnitude faster than the others, with Hamming distance also performing competitively. We also evaluated Spectral Clustering with Euclidean, Chebyshev and Canberra distances; these did not perform better and we omit them in the interest of conciseness.
Figure 2: Distance Measure Comparison. (a) Error v. Number of Clusters (Left: PocketData, Right: US Bank). (b) Total Verbosity v. Number of Clusters (Left: PocketData, Right: US Bank); each point is the verbosity of one of ten trials, and the y-axis' lower bound is the verbosity at 1 cluster to better show the change in verbosity. (c) Runtime v. Number of Clusters (Left: PocketData, Right: US Bank).

Take-Aways.
Hamming distance provides the best trade-off between Error and runtime. For time-sensitive applications, KMeans is preferred to Spectral Clustering.
Visualizing Naive Mixture Encoding.
As with normal pattern summaries, naive mixture summaries are also interpretable. For example, a visualization like that of Figure 1a can be repeated, once for each cluster. For more details, see Appendix E.
Recall that our primary goal is estimating statistical properties. In particular, we are interested in counting the occurrences (i.e., the marginal) of some pattern b in the log:

Γ_b(L) = |{ q | q ∈ L ∧ b ⊆ q }|

Recall that a naive encoding E_L includes only single-feature patterns. Assuming log distributions allowed by this encoding are equally likely, the maximal entropy distribution ρ_{E_L} is the representative distribution. Hence, we estimate Γ_b(L) by multiplying the probability of the pattern occurring by the size of the log (|L| · ρ_{E_L}), or:

est[Γ_b(L) | E_L] = |L| · Π_{f ∈ E_L, f ⊆ b} E_L[f]

This process trivially generalizes to naive pattern mixture encodings by mixing distributions. Specifically, given a set of partitions L_1 ∪ ... ∪ L_K = L, the estimated count for Γ_b(L) under each individual partition L_i can be computed based on the partition's encoding E_i, and we sum up the estimated counts over all partitions:

est[Γ_b(L) | E_1, ..., E_K] = Σ_{i ∈ [1,K]} est[Γ_b(L_i) | E_i]
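This estimator is straightforward to implement. The sketch below evaluates est[Γ_b(L)] from a naive mixture encoding stored as per-cluster weights and marginals (the same layout as the clustering sketch above); the data values and names are ours.

import numpy as np

def estimate_count(mixture, total_queries, pattern):
    # est[Gamma_b(L)]: for each cluster, |L_i| times the product of the
    # marginals of the features present in b (independence within a cluster).
    idx = np.flatnonzero(pattern)
    return sum(comp["weight"] * total_queries * np.prod(comp["marginals"][idx])
               for comp in mixture)

mixture = [
    {"weight": 2 / 3, "marginals": np.array([1.0, 0.0, 1.0, 0.5])},
    {"weight": 1 / 3, "marginals": np.array([0.0, 1.0, 1.0, 0.0])},
]
b = np.array([1, 0, 1, 0])                 # e.g., {<id, SELECT>, <Messages, FROM>}
print(estimate_count(mixture, 3, b))       # 2.0: both partition-1 queries contain b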
In this section, we empirically verify the effectiveness of naive mixture encodings in approximating log statistics from two related perspectives. The first perspective focuses on synthesis error: it measures whether patterns synthesized by the naive mixture encoding actually exist in the log. From the second perspective, we investigate the marginal deviation of patterns contained in the log of queries; this evaluates whether a naive mixture encoding will compute the correct marginal for patterns of interest to a client application. Specifically, synthesis error is measured as 1 − M/N, where N is the total number of randomly synthesized patterns and M is the number of synthesized patterns with positive marginals in the log. Marginal deviation is measured as |EST_M − T_M| / T_M, where T_M stands for the True Marginal of a pattern and EST_M is the one estimated by the naive mixture encoding. Experimental results are shown in Figure 3. Both synthesis error and marginal deviation consistently decrease given more clusters. Furthermore, as we vary the number of clusters, both measures are correlated with Reproduction Error.

Figure 3: Effectiveness of Naive Mixture Encoding. (a) Synthesis Error v. Reproduction Error. (b) Marginal Deviation v. Reproduction Error.
Synthesis Error.
Figure 3a shows synthesis error (y-axis) versus Reproduction Error (x-axis). The figure is generated by randomly synthesizing N = 10000 patterns from each partition of the log. Note that different values of N give similar observations. The overall synthesis error is measured as the average synthesis error over partitions, weighted by the proportion of queries in each partition.

Marginal Deviation.
Figure 3b shows marginal deviation (y-axis) versus Reproduction Error (x-axis). It is not feasible to enumerate all patterns that exist in the data. As an alternative, we treat each distinct query in the log as a pattern and treat the marginal deviation on it as the worst case for all patterns that it may contain; marginal deviation tends to be smaller when measured on a pattern that is contained in another. For each cluster, we sum up the marginal deviation over all distinct queries, and the final marginal deviation for the whole log is a weighted average (as for synthesis error) over all clusters.
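Both measures are easy to compute against a single naive encoding; the sketch below gives a simplified, single-cluster version (the experiments weight the per-cluster values by partition size). The sampling count, seed, averaging choice, and names are ours.

import numpy as np

def synthesis_error(vectors, marginals, n_samples=10000, seed=0):
    # 1 - M/N: fraction of patterns drawn from the naive encoding that never
    # occur in the log (zero true marginal).
    rng = np.random.default_rng(seed)
    X = np.asarray(vectors)
    sampled = (rng.random((n_samples, X.shape[1])) < marginals).astype(int)
    occurs = np.array([np.any(np.all(X >= s, axis=1)) for s in sampled])
    return 1.0 - occurs.mean()

def marginal_deviation(vectors, marginals):
    # |EST_M - T_M| / T_M averaged over the distinct queries in the log,
    # each treated as a pattern.
    X = np.asarray(vectors)
    devs = []
    for q in sorted({tuple(v) for v in vectors}):
        q = np.array(q)
        true = np.mean(np.all(X >= q, axis=1))
        est = np.prod(np.where(q == 1, marginals, 1.0))
        devs.append(abs(est - true) / true)
    return float(np.mean(devs))

vectors = [(1, 0, 1, 1), (1, 0, 1, 0), (0, 1, 1, 0)]
marg = np.asarray(vectors, dtype=float).mean(axis=0)
print(synthesis_error(vectors, marg), marginal_deviation(vectors, marg))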
Naive mixture encodings can already achieve close to near-zero Error (Figure 2a), have low Verbosity, and admit efficiently computable log statistics Γ_b(L). Although doing so makes estimating statistics more computationally expensive, as a thought experiment, we next consider how much of an improvement we could achieve in the Error/Verbosity trade-off by exploring a hypothetical second stage that enriches naive mixture encodings by adding non-naive patterns.

Feature-Correlation Refinement.
The first challenge is that our closed-form formula for Reproduction Error only works for naive encodings. Hence, we first consider the simpler problem of identifying the individual pattern that most reduces the Reproduction Error of a naive encoding. Recall that the closed-form representation for Reproduction Error arises from independence between features (i.e., ρ_S(Q = q) = Π_i p(X_i = x_i)). Similarly, under naive encodings we have a closed-form estimation of marginals p(Q ⊇ b) (i.e., ρ_S(Q ⊇ b) = Π_i p(X_i ≥ x_i)). We define the feature-correlation of pattern b as the log-difference from its actual marginal to the estimation according to the naive encoding:

W_C(b, S) = log(p(Q ⊇ b)) − log(ρ_S(Q ⊇ b))

Intuitively, patterns with higher feature-correlations create higher Errors, which in turn makes them ideal candidates for addition to the compressed log encoding. For two patterns with the same feature-correlation, the one that occurs more frequently will have greater impact on Error [24]. As a result, we compute an overall score for ranking patterns involving feature-correlation:

corr_rank(b) = p(Q ⊇ b) · W_C(b, S)

Table 1: Summary of Data sets
Statistics PocketData US bank
We show in Section 7.1 that corr_rank closely correlates with Reproduction Error. That is, a higher corr_rank value indicates that a pattern produces a greater Reproduction Error reduction if introduced into the naive encoding.
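The score is cheap to evaluate directly from the log. The sketch below computes corr_rank for a candidate pattern over binary feature vectors; returning 0 for patterns that never occur (where the log-difference is undefined) is a boundary choice of ours, and the toy data and names are also ours.

import numpy as np

def corr_rank(vectors, pattern):
    # p(Q ⊇ b) * ( log p(Q ⊇ b) - log rho_S(Q ⊇ b) ), where rho_S assumes
    # feature independence (the naive encoding's estimate).
    X = np.asarray(vectors)
    idx = np.flatnonzero(pattern)
    actual = np.mean(np.all(X[:, idx] == 1, axis=1))
    naive = np.prod(X[:, idx].mean(axis=0))
    if actual == 0.0:
        return 0.0
    return actual * (np.log(actual) - np.log(naive))

vectors = [(1, 0, 1, 1), (1, 0, 1, 0), (0, 1, 1, 0)]
# <id> and <status=?> co-occur more often than independence predicts:
print(corr_rank(vectors, (1, 0, 0, 1)))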
Pattern Diversification.
This greedy approach only allows us to add a single pattern to each cluster. In general, we would like to identify a set of patterns. We cannot sum up the corr_rank of each pattern in the set to estimate its Reproduction Error, as the information content carried by patterns may overlap. To counter such overlap, or equivalently to diversify patterns, a search through the space of pattern sets is needed. This type of diversification is commonly used in pattern mining applications, but can quickly become expensive. As we show experimentally in Section 7.2, the benefit that can be obtained from diversification is minimal.
7. EXPERIMENTS
In this section, we design experiments to empirically (1) validate that Reproduction Error correlates with Deviation and (2) evaluate the effectiveness of LogR compression. We use two specific datasets in the experiments: (1) SQL query logs of the Google+ Android app extracted from the PocketData public dataset [32], and (2) SQL query logs that capture all query activity on the majority of databases at a major US bank over a period of approximately 19 hours. A summary of these two datasets is given in Table 1.
The PocketData-Google+ query log.
The dataset consists of SQL logs that capture all database activities of 11 Android phones. We selected the Google+ application for our study since it is one of the few applications where all users created a workload. This dataset can be characterized as a stable workload of exclusively machine-generated queries.
The US bank query log.
This log is an anonymized record of queries processed by multiple relational database servers at a major US bank [35] over a period of 19 hours. Of the nearly 73 million database operations captured, 58 million are not directly queries, but rather invocations of stored procedures, and 13 million are not able to be parsed by a standard SQL parser. Among the remaining 2.3 million parsed SQL queries, since we are focusing on conjunctive queries, we base our analysis on the 1.25 million valid SELECT queries. This dataset can be characterized as a diverse workload of both machine- and human-generated queries.

Figure 4: Validating The Reproduction Error Metric. (a) Containment captures Deviation (US bank). (b) Containment captures Deviation (PocketData). (c) Error captures Deviation (US bank). (d) Error captures Deviation (PocketData). (e) Error captures Correlation (US bank). (f) Error captures Correlation (PocketData).
Common Experiment Settings.
Experiments were performed on a 2.8 GHz Intel Core i7 CPU with 16 GB of 1600 MHz DDR3 memory and an SSD, running macOS Sierra.
Constant Removal.
A number of queries in US Bank differ only in hard-coded constant values. Table 1 shows the total number of queries, as well as the number of distinct queries if we ignore constants. By comparison, queries in PocketData all use JDBC parameters. For these experiments, we ignore constant values in queries.
Query Regularization.
We apply query rewrite rules (similar to [14]) to regularize queries into equivalent conjunctive forms, where possible. Table 1 shows the fractions of distinct queries that are in conjunctive form for PocketData and US bank, respectively. After regularization, all queries in both data sets can either be simplified into conjunctive queries or be re-written into a UNION of conjunctive queries compatible with Aligon et al.'s feature scheme [3].
Convex Optimization Solving.
All convex optimization problems involved in measuring Reproduction Error and Deviation are solved by the successive approximation heuristic implemented in the CVX toolbox [23] with the SeDuMi solver.
In this section, we validate that Reproduction Error is a practical alternative to Deviation. In addition, we also offer measurements of its correlation with Deviation, as well as with the feature-correlation described in Section 6.4. Since Deviation cannot be measured exactly (see Section 3.3), we approximate it using sampling, which is further explained in Appendix C. Because it is impractical to enumerate all possible encodings, we choose a subset of encodings for both datasets. Specifically, we first select all features with marginals in the range [0. , 0.99] and use these features to construct patterns. We then enumerate combinations of K (up to 3) patterns as our chosen encodings.
Containment Captures Deviation.
Here we empirically verify that containment (Section 4.2) captures Deviation (i.e., E_1 ≤_Ω E_2 → d(E_1) ≤ d(E_2)), to complete the chain of reasoning that Reproduction Error captures Deviation. Figures 4a and 4b show all pairs of encodings where E_2 ⊃ E_1. The y-axis shows the difference in Deviation values (i.e., d(E_1) − d(E_2)). Deviation d(S) is approximated by drawing 1,000,000 samples from the space of possible patterns. For clarity, we bin pairs of encodings by the degree of overlap between the encodings, measured by the Deviation of the set-difference between the two encodings, d(E_2 \ E_1); higher d(E_2 \ E_1) implies less overlap. Y-axis values are grouped into bins and visualized as boxplots (the blue box indicates the range within one standard deviation, and red/black crosses are outliers). Intuitively, all points above zero on the y-axis (i.e., d(E_1) − d(E_2) > 0) are pairs of encodings where the Deviation order agrees with the containment order. This is the case for virtually all encoding pairs.
Additive Separability of Deviation.
We also observe from Figures 4a and 4b that agreement between Deviation and containment order is correlated with overlap; more similar encodings are more likely to agree. Combined with Proposition 1, this shows first that, for similar encodings, Reproduction Error is likely to be a reliable indicator of Deviation. This also suggests that Deviation is additively separable: the information loss (measured as d(E_1) − d(E_2)) incurred by excluding the encoding E_2 \ E_1 from E_2 closely correlates with the quality (i.e., d(E_2 \ E_1)) of the encoding E_2 \ E_1 itself:

E_2 ⊃ E_1  →  d(E_1) − d(E_2) ≈ d(E_2 \ E_1)

Error correlates with Deviation.
As a supplement, Figures 4c and 4d empirically confirm that Reproduction Error (x-axis) indeed closely correlates with Deviation (y-axis). Mirroring our findings above, the correlation between them is tighter at lower Reproduction Error.
Error and Feature-Correlation.
Figure 4e and 4f showthe relationship between Reproduction Error (y-axis) and9he feature-correlation score corr rank (x-axis), as definedin Section 6.4. Values of y-axis are computed from the naiveencoding extended by a single pattern b containing multiplesfeatures (up to 3). One can observe that the ReproductionError of extended naive encodings almost linearly correlateswith corr rank ( b ). In addition, one can also observe that corr rank becomes higher when the pattern b encodes morecorrelated features. In this section, we design experiments serving two pur-poses: (1) Evaluating the potential reduction in Error fromrefining naive mixture encodings through state-of-the-artpattern based summarizers, and (2) Evaluating whether wecan replace naive mixture encodings by the encodings cre-ated from summarizers that we have plugged-in.
Experiment Setup.
To serve both purposes, we construct pattern mixture encodings under three different configurations: (1) naive mixture encoding; (2) pattern based encoding; and (3) naive mixture encoding refined by pattern based encoding. Naive mixture encodings are constructed by KMeans clustering. Pattern based encodings are generated by two state-of-the-art pattern based summarizers: (1) the Laserlight algorithm [20], which aims at summarizing multi-dimensional data D = (X_1, ..., X_n) augmented with an additional binary attribute A; and (2) the MTV algorithm [40], which aims at mining maximally informative patterns that summarize multi-dimensional binary data. The experiment results are shown in Figure 5, which contains 3 sub-figures. All sub-figures share the same x-axis, i.e., the number of clusters. Figure 5a evaluates the possible change in Error (y-axis) from plugging in MTV and Laserlight. Figure 5b compares the Error (y-axis) between the naive mixture encoding and the pattern mixture encoding obtained from only using patterns from MTV and Laserlight. Figure 5c compares the running time (y-axis) between constructing naive mixture encodings and applying the pattern based summarizers. We only show the results for the US bank data set, as results for PocketData give similar observations.
Figures 5b and 5c suggest that naive mixture encodings outperform pattern based encodings in two ways.
Computation Efficiency.
Furthermore, as one can observe from Figure 5c, the running time of constructing naive mixture encodings is significantly lower than that of Laserlight and MTV.

Reproduction Error.
We observe from Figure 5b that the Reproduction Error of naive mixture encodings is orders of magnitude lower than that obtained from summarizing using the patterns generated by Laserlight or MTV.

Verbosity.
The one way in which pattern based encodings outperform naive pattern mixtures is in verbosity. Both Laserlight and MTV produce encodings with significantly fewer patterns, as the naive pattern mixture summary requires at least one pattern for each feature (e.g., 5290 patterns in the US bank dataset). Conversely, mining this number of patterns is not computationally feasible (Figure 5c).
The experiment result is shown in Figure 5a. Note that the y-axis is offset to show the change in Error. We observe from the figure that the reduction in Error contributed by plugging in pattern based summarizers is small for both algorithms.

Figure 5: Naive Mixture Encodings (US bank). (a) Naive Mixture v. Naive Mixture+LaserLight/MTV (the y-axis is offset, i.e., has a non-zero start). (b) Naive Mixture v. LaserLight/MTV alone (the y-axis is in log scale). (c) Runtime Comparison (y-axis in log scale).
Dimensionality Restriction.
For Laserlight, this observation is partially due to the fact that we only keep the top 100 features (in terms of variability) of the data as its input, since Laserlight is implemented in PostgreSQL 9.1, which has a threshold of 100 arguments (one argument for each feature) that can be passed to a function.
Pattern Restriction.
For MTV, this is due to a limitation of 15 patterns that we have experienced in configuring it. We refer the reader to Section 4.5 of the paper [40], which explains the difficulty of inferring the maximum entropy distribution with an increasing number of patterns.
8. ALTERNATIVE APPLICATIONS
To fairly evaluate Laserlight and MTV, we incorporate their own data sets and empirically evaluate them against naive mixture encoding under their own applications.
Data Sets.
Specifically, we choose the Mushroom data set used in MTV [40], which is obtained from the FIMI dataset repository, and U.S. Census data on income (or simply the Income data set), which is downloaded from IPUMS-USA at https://usa.ipums.org/usa/ and used in Laserlight [20]. The basic statistics of the data sets are given in Table 2.
All experiments involving Laserlight and MTV will be evaluated under their own Error measures and data sets, unless otherwise stated. The experiments are organized as follows: First, we establish baselines by evaluating classical Laserlight and MTV on their original data; then we show that classical Laserlight and MTV can be generalized to partitioned data and that the generalization improves on their Error measures and also runtime; at last, we compare their generalized versions with naive mixture encoding to show that naive mixture encoding is a reasonable alternative.

Table 2: Data Sets of Alternative Applications (Statistics | Income | Mushroom).
We first explain how naive mixture encoding is evaluated based on the Error measures defined by Laserlight and MTV.

Evaluating Naive Encoding on Laserlight Error.
The Laserlight algorithm summarizes data D, which consists of feature vectors t augmented by some binary feature v. Denote the valuation of the binary feature v for each feature vector t as v(t). The goal is to mine a summary encoding E, which is a set of patterns contained in t ∈ D that offer predictive power on v(t). Denote the estimation (based on E) of v(t) as u_E(t) ∈ [0, 1]. The Laserlight Error is measured by

Σ_t ( v(t) log(v(t) / u_E(t)) + (1 − v(t)) log((1 − v(t)) / (1 − u_E(t))) )

Since the naive encoding E_L is equivalent to an independence assumption on features, u_{E_L}(t) is simply the probability of v(t) = 1, namely u_{E_L}(t) = |{ t | v(t) = 1, t ∈ D }| / |D|, regardless of t. Consequently, the Laserlight Error of the naive encoding is

−|D| ( u_{E_L} log u_{E_L} + (1 − u_{E_L}) log(1 − u_{E_L}) )

Evaluating Naive Encoding on MTV Error.
We denote the data also as D with some summary encoding E; the MTV Error of E is

−|D| · H(ρ*) + ½ |E| log |D|

where H(ρ*) stands for the entropy of the maximum entropy distribution ρ* defined in the paper. The second term in the MTV Error penalizes the verbosity of the encoding E. Defining the entropy of a feature f as H(f) = −p log p − (1 − p) log(1 − p), where p is the probability of the feature being present, H(ρ*) of the naive encoding is simply the summation Σ_f H(f) over all feature entropies.

Evaluating Naive Mixture Encoding.
Evaluation of the naive encoding can be generalized to the naive mixture by taking a weighted average over the resulting clusters (see Section 5.2).
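Both closed forms are one-liners once the relevant counts are known. The sketch below evaluates them for a naive encoding; it uses natural logarithms, and the ½ constant on the MTV verbosity penalty follows the BIC-style reading of the formula above, which is our assumption. All names and example numbers are illustrative.

import math

def laserlight_error_naive(d_size, positives):
    # -|D| ( u log u + (1-u) log(1-u) ), with u the fraction of rows
    # having v(t) = 1.
    u = positives / d_size
    if u in (0.0, 1.0):
        return 0.0
    return -d_size * (u * math.log(u) + (1 - u) * math.log(1 - u))

def mtv_error_naive(d_size, feature_probs):
    # -|D| * sum of per-feature entropies, plus the verbosity penalty
    # (1/2) |E| log |D|; for a naive encoding |E| equals the number of features.
    h = sum(-p * math.log(p) - (1 - p) * math.log(1 - p)
            for p in feature_probs if 0.0 < p < 1.0)
    return -d_size * h + 0.5 * len(feature_probs) * math.log(d_size)

print(laserlight_error_naive(1000, 400))
print(mtv_error_naive(1000, [0.9, 0.5, 0.1]))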
Establishing Baselines.
To establish baselines, we evaluate Laserlight and MTV on their own data sets. The results related to Error and run time are given in Figure 6 and Figure 7, respectively.

Figure 6: Error v. Number of Patterns. (a) Laserlight Error v. Number of Patterns, with the naive encoding as a reference. (b) MTV Error v. Number of Patterns, with the naive encoding as a reference. Note: the y-axis is offset to better visualize differences.
Figure 7: Run Time of Laserlight and MTV. (a) Laserlight run time on Income data. (b) MTV run time on Mushroom data.
In Figure 6, the x-axis is the number of patterns and the y-axis represents the Error measure of Laserlight and MTV in Figures 6a and 6b, respectively. We incorporate the naive encoding in Figure 6a as the reference method. Since there are 783 total features for the Income data set, the verbosity of its naive encoding is 783, which is shown as a vertical dotted line in Figure 6a. The Laserlight Error of the naive encoding is shown as the horizontal dotted line accordingly. For the Mushroom data set (Figure 6b), the verbosity of its naive encoding is 96. However, MTV quits with an error message if it is requested to mine over 15 patterns. Hence for Figure 6b, the limit of the x-axis is 15 and we only show the Error of the naive encoding as a reference line without marking out its verbosity. We observe in Figure 6a that the naive encoding outperforms Laserlight when their verbosity is equal (i.e., 783). In addition, approximately after 100 patterns, the slope of Error reduction becomes relatively flat. Similar observations can be made from Figure 6b. In Figure 7, we observe that the running time increases exponentially with the number of patterns, for both Laserlight and MTV. The take-aways from Figure 6 and Figure 7 are that (1) the naive encoding is faster and more accurate than classical Laserlight and MTV; and (2) the runtime increases superlinearly with the number of patterns mined from both Laserlight and MTV.

Anti-correlation and Dimensionality Reduction.
Recall from Section 7.2.2 that Laserlight is restricted to 100 features. For its own Income data set, Laserlight can be applied with its full set of 783 features. This is due to the prior knowledge that the 783 features belong to 9 groups; within each group, the features are mutually anti-correlated and can therefore be reduced to a single feature. Similarly, the Mushroom data set can be reduced from 95 to 21 features (see Table 2).
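The reduction can be sketched as follows (hypothetical group layout; within a group, at most one feature is assumed to be set per row, so the group collapses to one categorical feature recording which member, if any, is set):

import numpy as np

def collapse_group(X: np.ndarray, group_cols):
    # X: 0/1 instance-by-feature matrix; group_cols: indices of one anti-correlated group
    block = X[:, group_cols]
    collapsed = block.argmax(axis=1) + 1     # 1-based index of the active feature
    collapsed[block.sum(axis=1) == 0] = 0    # 0 means no feature in the group is set
    return collapsed

X = np.array([[1, 0, 0, 1],
              [0, 1, 0, 0],
              [0, 0, 1, 1]])
print(collapse_group(X, [0, 1, 2]))          # -> [1 2 3]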
We generalize Laserlight and MTV to partitioned data by running them on each cluster. We then combine the Errors over all clusters by taking a weighted average, as described in Section 5.2. Depending on how many patterns are mined from each cluster, Laserlight and MTV can be generalized in two ways: (1) the number of patterns mined from each cluster is scaled to equal the verbosity of the naive encoding; and (2) the total number of patterns mined from all clusters is fixed to a given number. We name the first type Laserlight (MTV) Mixture Scaled, which is comparable to the naive mixture encoding. We name the second type Laserlight (MTV) Mixture Fixed, which is comparable to the classical Laserlight (MTV) algorithm.

The Error reduction of Laserlight becomes relatively slow beyond 100 patterns (see Figure 6a). Hence we configure Classical Laserlight and Laserlight Mixture Fixed to mine a total of 100 patterns from the original data and the partitioned data respectively, in order to avoid underestimating Laserlight's performance. We describe the strategy for distributing the 100 patterns to partitions of the data sets in Appendix D.3. The experiment results are given in Figure 8. Figures 8a and 8b show Laserlight Mixture Fixed versus Classical Laserlight in Error and run time respectively, as the number of data partitions (i.e., clusters) is gradually increased. We observe an exponentially decreasing trend (i.e., an improvement) in both Laserlight Error and run time. We omit the experiment results for MTV as they yield similar observations.
Take-away.
As the data is partitioned into more clusters, both the runtime and the Error of Laserlight (MTV) Mixture Fixed improve exponentially. This observation can potentially be generalized to other pattern-mining-based algorithms.
At last, we compare Laserlight (MTV) Mixture Scaled with the naive mixture encoding. Since it is time-consuming for Laserlight to mine the same number of patterns as the naive encoding on the Income data (see the runtime analysis in Figure 7a), we choose the Mushroom data for Laserlight Mixture Scaled instead.

Figure 8: Laserlight Mixture Fixed v. Classical Laserlight ((a) Laserlight Error; (b) total running time; x-axis: number of clusters).

The experiment results are given in Figure 9. The x-axes of all sub-figures in Figure 9 represent the number of clusters, and the y-axes stand for Laserlight and MTV Error respectively. We include the baselines (i.e., naive encoding, classical Laserlight and MTV) as reference lines in Figures 9a and 9b respectively. We again ran into the 15-pattern limitation when configuring MTV. Hence the comparison between MTV Mixture Scaled and naive mixture encoding is not strictly on equal footing, as MTV Mixture Scaled cannot reach the same Total Verbosity as the naive mixture encoding. Their difference in verbosity is mitigated by the fact that the MTV Error measure penalizes verbosity.
Figure 9: Naive Mixture v. Laserlight/MTV Mixture ((a) Laserlight Error v. number of clusters; (b) MTV Error v. number of clusters; naive mixture encoding, naive encoding, Laserlight/MTV Mixture Scaled, and classical Laserlight/MTV shown for reference).
Figure 9a shows that both the naive mixture encoding and Laserlight Mixture Scaled have lower Error than their baselines. In addition, Laserlight Mixture Scaled has lower Error than the naive mixture encoding when the number of clusters is less than 4, and the two become close after 6 clusters. In other words, Laserlight is more accurate on lightly partitioned data. As the data is further partitioned, clusters become 'easier' to summarize, and the naive encoding becomes more similar to Laserlight. Figure 9b shows that the naive mixture encoding marginally outperforms MTV Mixture Scaled.

Take-away. Naive mixture encoding is faster and has similar (or lower) Error than Laserlight (MTV) Mixture Scaled.
9. RELATED WORK

9.1 Workload Analysis
Existing approaches related to workload analysis are frequently aimed at specific tasks like query recommendation [41, 22, 33, 50, 4], performance optimization [9, 13], outlier detection [30], or visual analysis [39].
Query Recommendation.
This task aims to track historical querying behavior and generate query recommendations. Related approaches [41, 33] flatten a query abstract syntax tree into a bag of fragments [41] or snippets [33] and adopt a feature vector representation of queries. User profiles are then built from the query log by grouping and summarizing feature vectors by user in order to make personalized recommendations. In OLAP systems, profiles are also built for workloads of similar OLAP sessions [4].
Performance Optimization.
Index selection [16, 21] and materialized view selection [2, 9, 13] are typical performance optimization tasks. The configuration search space is usually large, but can be reduced with appropriate summaries.
Outlier Detection.
Kamra et al. [30] aim at detecting anomalous behavior of queries in the log by summarizing query logs into profiles of normal user behavior.
Visual Analysis.
Makiyama et al. [39] provide a set of visualizations that facilitate further workload analysis on the Sloan Digital Sky Survey (SDSS) dataset. QueryScope [25] aims at finding better tuning opportunities by helping human experts identify patterns shared among queries.

In these approaches, queries are commonly encoded as feature vectors or bit-maps, where a bit array is mapped to a list of features, with a 1 in a position if the corresponding feature appears in the query and a 0 otherwise. Workloads under the bit-map encoding must then be compressed before they can be efficiently queried or visualized for analysis.
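As an illustration only (the feature list below is hypothetical), the bit-map encoding amounts to:

# Each query is mapped to a bit array with a 1 in position i iff feature i appears in it.
FEATURES = ["SELECT id", "FROM messages", "WHERE status=?", "ORDER BY timestamp"]

def to_bitmap(query_features):
    return [1 if f in query_features else 0 for f in FEATURES]

print(to_bitmap({"FROM messages", "WHERE status=?"}))   # -> [0, 1, 1, 0]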
Run-length Encoding.
Run-length encoding (RLE) is a loss-less compression scheme commonly used in inverted index compression [48, 53] and column-oriented compression [1]. RLE-based compression algorithms include, but are not limited to, the Byte-aligned Bitmap Code (BBC) used in Oracle systems [7], Word-Aligned Hybrid (WAH) [49], and many others [42, 5, 6]. In general, RLE-based methods focus on column-wise compression and require additional heavyweight inference to obtain the frequencies of cross-column (i.e., row-wise) patterns used for workload analysis.
Lempel-Ziv Encoding.
Lempel-Ziv [51, 52] is the loss-less compression algorithm used by gzip. It takes variable-sized patterns (row-wise in our case) and replaces them with fixed-length codes, in contrast to Huffman encoding [26]. Lempel-Ziv encoding does not require knowledge of pattern frequencies in advance and builds the pattern dictionary dynamically. There are many other similar schemes for compressing files represented as sequential bit-maps, e.g., [46].
Dictionary Encoding.
Dictionary encoding is a more general form of Lempel-Ziv. It has the advantage that the patterns and frequencies stored in the dictionary can be interpreted as workload statistics useful for analysis. In this paper, we extend dictionary encoding and focus on using a dictionary to infer the frequencies of patterns not in it. Mampaey et al. proposed the MTV algorithm [40], which finds the dictionary (of a given size) with the lowest Bayesian Information Criterion (BIC) score; a lower BIC score indicates better accuracy for frequency inference. El Gebaly et al. proposed the Laserlight algorithm [20], which builds a pattern dictionary for a slightly different goal: the quality of the dictionary depends on whether it can correctly infer the truth-value of some binary feature.
Generative Models.
A generative model is a lossy compressed representation of the original log. Typical generative models include probabilistic topic models [10, 47] and the noisy-channel model [34]. Generative models can infer pattern frequencies, but they lack a model-independent measure for efficiently evaluating overall inference accuracy.
Low-Rank Matrix Decomposition.
Low-rank matrix decomposition [19], e.g., Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF) [38], offers lossy data compression, but the resulting matrices after decomposition are not well suited for inferring workload statistics.
10. CONCLUSION
In this paper, we introduced the problem of log compression and defined a family of pattern-based log encodings. We precisely characterized the information content of logs and offered three principled measures and one practical measure of encoding quality: Verbosity, Ambiguity, Deviation, and Reproduction Error. To reduce the search space of pattern-based encodings, we introduced the idea of partitioning logs into separate components, which induces the family of pattern mixture encodings as well as its simplified form, naive mixture encodings. Finally, we experimentally showed that naive mixture encodings are more informative and can be constructed more efficiently than encodings produced by state-of-the-art pattern-based summarization techniques.
Future Work.
The use of mixture models for summarization has potential implications for work on pattern mining; as we show, existing techniques can be substantially improved. We also expect that making accurate correlated-feature counting efficient will enable a range of more powerful database tuning and intrusion detection systems.
11. REFERENCES

[1]
Abadi, D., Madden, S., and Ferreira, M.
Integrating compression and execution incolumn-oriented database systems. In
Proceedings ofthe 2006 ACM SIGMOD International Conference onManagement of Data (New York, NY, USA, 2006),SIGMOD ’06, ACM, pp. 671–682.[2]
Agrawal, S., Chaudhuri, S., and Narasayya,V. R.
Automated selection of materialized views andindexes in sql databases. In
Proceedings of the 26thInternational Conference on Very Large Data Bases (San Francisco, CA, USA, 2000), VLDB ’00, MorganKaufmann Publishers Inc., pp. 496–505.[3]
Aligon, J., Golfarelli, M., Marcel, P., Rizzi,S., and Turricchia, E.
Similarity measures for olapsessions.
Knowledge and Information Systems 39 , 2(May 2014), 463–489.[4]
Aligon, J., Golfarelli, M., Marcel, P., Rizzi,S., and Turricchia, E.
Similarity measures for olapsessions.
Knowledge and information systems 39 , 2(2014), 463–489.[5]
Amer-Yahia, S., and Johnson, T.
Optimizingqueries on compressed bitmaps. In
Proceedings of the26th International Conference on Very Large DataBases (San Francisco, CA, USA, 2000), VLDB ’00,Morgan Kaufmann Publishers Inc., pp. 329–338.[6]
Antoshenkov, G.
Byte-aligned bitmap compression.In
Data Compression Conference, 1995. DCC ’95.Proceedings (March 1995), pp. 476–.[7]
Antoshenkov, G., and Ziauddin, M.
Queryprocessing and optimization in oracle rdb.
The VLDBJournal 5 , 4 (Dec 1996), 229–237.[8]
Aouiche, K., Jouve, P.-E., and Darmont, J.
Clustering-based materialized view selection in datawarehouses. In
ADBIS (2006).[9]
Aouiche, K., Jouve, P.-E., and Darmont, J.
Clustering-based materialized view selection in datawarehouses. In
East European Conference onAdvances in Databases and Information Systems (2006), Springer, pp. 81–95.[10]
Blei, D. M.
Probabilistic topic models.
Commun.ACM 55 , 4 (Apr. 2012), 77–84.[11]
Boyd, S., and Vandenberghe, L.
ConvexOptimization . Cambridge University Press, New York,NY, USA, 2004.[12]
Bruno, N., and Chaudhuri, S.
Automatic physicaldatabase tuning: A relaxation-based approach. In
ACM SIGMOD (2005).[13]
Bruno, N., Chaudhuri, S., and Gravano, L.
Stholes: a multidimensional workload-awarehistogram. In
ACM SIGMOD Record (2001), vol. 30, ACM, pp. 211–222. [14]
Chandra, B., Joseph, M., Radhakrishnan, B.,Acharya, S., and Sudarshan, S.
Partial markingfor automated grading of sql queries.
Proc. VLDBEndow. 9 , 13 (Sept. 2016), 1541–1544.[15]
Chatzopoulou, G., Eirinaki, M., Koshy, S.,Mittal, S., Polyzotis, N., and Varman, J. S. V.
The QueRIE system for personalized queryrecommendations.
IEEE Data Eng. Bull. (2011).[16]
Chaudhuri, S., and Narasayya, V. R.
An efficient,cost-driven index selection tool for microsoft sql server. In
VLDB (1997), vol. 97, Citeseer, pp. 146–155.[17]
Darroch, J. N., and Ratcliff, D.
Generalizediterative scaling for log-linear models.
Ann. Math.Statist. 43 , 5 (10 1972), 1470–1480.[18]
Dwork, C.
ICALP 2006, Proceedings, Part II .Springer, 2006, ch. Differential Privacy.[19]
Eckart, C., and Young, G.
The approximation ofone matrix by another of lower rank.
Psychometrika 1 ,3 (1936), 211–218.[20]
El Gebaly, K., Agrawal, P., Golab, L., Korn,F., and Srivastava, D.
Interpretable andinformative explanations of outcomes.
Proc. VLDBEndow. 8 , 1 (Sept. 2014), 61–72.[21]
Finkelstein, S., Schkolnick, M., and Tiberio, P.
Physical database design for relational databases.
ACM Transactions on Database Systems (TODS) 13 ,1 (1988), 91–128.[22]
Giacometti, A., Marcel, P., Negre, E., andSoulet, A.
Query recommendations for olapdiscovery driven analysis. In
Proceedings of the ACMTwelfth International Workshop on Data Warehousingand OLAP (New York, NY, USA, 2009), DOLAP ’09,ACM, pp. 81–88.[23]
Grant, M., and Boyd, S.
CVX: Matlab software fordisciplined convex programming, version 2.1. http://cvxr.com/cvx , Mar. 2014.[24]
Han, J., Cheng, H., Xin, D., and Yan, X.
Frequentpattern mining: Current status and future directions.
Data Min. Knowl. Discov. 15 , 1 (Aug. 2007), 55–86.[25]
Hu, L., Ross, K. A., Chang, Y.-C., Lang, C. A.,and Zhang, D.
Queryscope: visualizing queries forrepeatable database tuning. pVLDB (2008).[26]
Huffman, D. A.
A method for the construction ofminimum-redundancy codes.
Proceedings of the IRE40 , 9 (1952), 1098–1101.[27]
Jain, A. K.
Data clustering: 50 years beyondk-means.
Pattern Recogn. Lett. 31 , 8 (June 2010),651–666.[28]
Jaynes, E., and Bretthorst, G.
ProbabilityTheory: The Logic of Science . Cambridge UniversityPress, 2003.[29]
Johnson, S. C.
Hierarchical clustering schemes.
Psychometrika 32 , 3 (Sep 1967), 241–254.[30]
Kamra, A., Terzi, E., and Bertino, E.
Detecting anomalous access patterns in relational databases.
The VLDB Journal: The International Journal on Very Large Data Bases 17, 5 (2008), 1063–1077. [31]
Kannan, R., Vempala, S., and Vetta, A.
Onclusterings: Good, bad and spectral.
J. ACM 51 , 3(May 2004), 497–515.[32]
Kennedy, O., Ajay, J. A., Challen, G., andZiarek, L.
Pocket data: The need for tpc-mobile. In
TPC-TC (2015).[33]
Khoussainova, N., Kwon, Y., Balazinska, M.,and Suciu, D.
Snipsuggest: Context-awareautocompletion for sql.
Proc. VLDB Endow. 4 , 1(Oct. 2010), 22–33.[34]
Knight, K., and Marcu, D.
Summarization beyondsentence extraction: A probabilistic approach tosentence compression.
Artif. Intell. 139, 1 (July 2002), 91–107. [35]
Kul, G., Luong, D., Xie, T., Coonan, P.,Chandola, V., Kennedy, O., and Upadhyaya, S.
Ettu: Analyzing query intents in corporate databases.In
Proceedings of the 25th International ConferenceCompanion on World Wide Web (2016), pp. 463–466.[36]
Kul, G., Upadhyaya, S. J., and Chandola, V.
Detecting data leakage from databases on androidapps with concept drift. In
IEEE TrustCom (2018),pp. 905–913.[37]
Kullback, S., and Leibler, R. A.
On informationand sufficiency.
Ann. Math. Statist. 22 , 1 (03 1951),79–86.[38]
Lee, D. D., and Seung, H. S.
Learning the parts ofobjects by non-negative matrix factorization.
Nature401 , 6755 (1999), 788.[39]
Makiyama, V. H., Raddick, J., and Santos, R. D.
Text mining applied to sql queries: A case study forthe sdss skyserver. In
SIMBig (2015), pp. 66–72.[40]
Mampaey, M., Vreeken, J., and Tatti, N.
Summarizing data succinctly with the mostinformative itemsets.
ACM Trans. Knowl. Discov.Data 6 , 4 (Dec. 2012), 16:1–16:42.[41]
Mittal, S., Varman, J. S. V., Chatzopoulou, G.,Eirinaki, M., and Polyzotis, N.
Querie: A queryrecommender system supporting interactive databaseexploration. In (Dec 2010),pp. 1411–1414.[42]
Moffat, A., and Zobel, J.
Compression and fastindexing for multi-gigabyte text databases.
AustralianComputer Journal 26 (1994).[43]
O’Donoghue, B., Chu, E., Parikh, N., and Boyd,S.
Conic optimization via operator splitting andhomogeneous self-dual embedding.
Journal ofOptimization Theory and Applications 169 , 3 (Jun2016), 1042–1068.[44]
Pavlo, A., Angulo, G., Arulraj, J., Lin, H., Lin,J., Ma, L., Menon, P., Mowry, T. C., Perron,M., Quah, I., Santurkar, S., Tomasic, A., Toor,S., Aken, D. V., Wang, Z., Wu, Y., Xian, R., andZhang, T.
Self-driving database managementsystems. In
CIDR (2017).[45]
Pedregosa, F., Varoquaux, G., Gramfort, A.,Michel, V., Thirion, B., Grisel, O., Blondel,M., Prettenhofer, P., Weiss, R., Dubourg, V.,Vanderplas, J., Passos, A., Cournapeau, D.,Brucher, M., Perrot, M., and Duchesnay, E.
Scikit-learn: Machine learning in Python.
Journal ofMachine Learning Research 12 (2011), 2825–2830.[46]
Skibiński, P., and Swacha, J.
Fast and efficient logfile compression. In
CEUR Workshop Proceedings ofthe 11th East-European Conference on Advances inDatabases and Information Systems (ADBIS) (2007),pp. 330–342.[47]
Wang, D., Zhu, S., Li, T., and Gong, Y.
Multi-document summarization using sentence-basedtopic models. In
Proceedings of the ACL-IJCNLP2009 Conference Short Papers (Stroudsburg, PA,USA, 2009), ACLShort ’09, Association forComputational Linguistics, pp. 297–300.[48]
Witten, I. H., Moffat, A., and Bell, T. C.
Managing gigabytes: Compressing and indexing documents and images - errata, 1996.[49]
Wu, K., Otoo, E. J., and Shoshani, A.
Compressing bitmap indexes for faster searchoperations. In
Proceedings 14th InternationalConference on Scientific and Statistical DatabaseManagement (2002), pp. 99–108.[50]
Yang, X., Procopiuc, C. M., and Srivastava, D.
Recommending join queries via query log analysis. In
Data Engineering, 2009. ICDE’09. IEEE 25thInternational Conference on (2009), IEEE,pp. 964–975.[51]
Ziv, J., and Lempel, A.
A universal algorithm forsequential data compression.
IEEE Transactions onInformation Theory 23 , 3 (May 1977), 337–343.[52]
Ziv, J., and Lempel, A.
Compression of individualsequences via variable-rate coding.
IEEE Trans. Inf.Theor. 24 , 5 (Sept. 2006), 530–536.[53]
Zobel, J., and Moffat, A.
Inverted files for textsearch engines.
ACM Comput. Surv. 38 , 2 (July 2006).
APPENDIX

A. NOMENCLATURE

Symbol : Meaning
f : Feature
b : Pattern
b′ ⊆ b : b′ is contained in b
q : Query
L : Log, a bag of queries
Q : Query randomly drawn from L
p(Q | L) : Query distribution of L
p(Q ⊇ b | L) : Marginal probability of Q ⊇ b
E_L[b] : Same as p(Q ⊇ b | L)
corr_rank(b) : Feature-correlation score
E_max : Mapping from all patterns to their marginals
E : Encoding, a partial mapping E ⊆ E_max
domain(·) : Domain of a mapping
ρ : An arbitrary query distribution
Ω_E : Space of ρ constrained by E
E ≤ E′ : Ω_E ⊆ Ω_{E′}
P_E : A random ρ drawn from Ω_E
ρ* : Same as p(Q | L); ρ* ∈ Ω_E
H(·) : Entropy of a distribution
ρ_E : Representative distribution of Ω_E
d(E) : Deviation
I(E) : Ambiguity
e(E) : Reproduction Error
|E| : |domain(E)|, Verbosity of the encoding
ρ ≪ ρ′ : ρ is absolutely continuous w.r.t. ρ′
D_KL(ρ || ρ′) : K-L divergence from ρ′ to ρ

B. PROOF OF PROPOSITION 1
Denote by B_n = {0, 1}^n the space of possible 0-1 vectors of size n, and define an encoding E_q with patterns:

domain(E_q) = { (x_1 + b_1, ..., x_n + b_n) | (b_1, ..., b_n) ∈ B_n }

We will show that E_q ⊆ E_max contains sufficient information to compute p = p(X_1 = x_1, ..., X_n = x_n) through several steps. First, we define a new pair of marginal probabilities p_1⟨b_1⟩ = p(X_1 ≥ x_1 + b_1, X_2 = x_2, ..., X_n = x_n). Since x_1 is integral, p = p_1⟨0⟩ − p_1⟨1⟩. Generalizing, we can define:

p_k⟨b_1, ..., b_k⟩ = p(X_1 ≥ x_1 + b_1, ..., X_k ≥ x_k + b_k, X_{k+1} = x_{k+1}, ..., X_n = x_n)

Again, x_k being integral gives us that:

p_{k−1}⟨b_1, ..., b_{k−1}⟩ = p_k⟨b_1, ..., b_{k−1}, 0⟩ − p_k⟨b_1, ..., b_{k−1}, 1⟩

Finally, when k = n, the probability p_n⟨b_1, ..., b_n⟩ is the marginal probability p(Q ⊇ b | L) of a pattern b = (x_1 + b_1, ..., x_n + b_n), which by definition is offered by E_q for any (b_1, ..., b_n) ∈ B_n. The resulting encoding E = ⋃_{q ∈ L} E_q identifies the distribution p(Q | L), which we refer to as a lossless encoding. Clearly any encoding that extends E (including E_max) is lossless.

C. SAMPLING FROM SPACE OF DISTRIBUTIONS
Here we describe how we sample a random distribution ρ from the space Ω_E of probability distributions.

C.1 Preliminary Sampling
To sample a random distribution ρ, we need to assign a probability value to each element in the vector space B_n. The naive way is to treat ρ as a random multi-dimensional vector (ρ(q_1), ..., ρ(q_{|B_n|})) whose entries sum up to 1. However, |B_n| is exponentially large (i.e., 2^n), so we reduce the number of elements in ρ by grouping them (i.e., q_1, ..., q_{|N_n|}) into equivalence classes.

Algorithm 1 Sampling
procedure TwoStepSampling
  Step 1:
  for each v ∈ B_m with C_v ≠ ∅ do
    V ← V ∪ {v}
  end for
  class_p ← UniRandDistribProb(V, 1)
  Step 2:
  for each v ∈ V do
    ρ ← ρ ∪ UniRandDistribProb(C_v, class_p(v))
  end for
  return ρ
end procedure

procedure UniRandDistribProb(Set S, double prob)
  for each element e ∈ S do
    p(e) ← UniformRandNum(range = [0, 1])
  end for
  for each element e ∈ S do
    p(e) ← prob × p(e) ÷ Σ_{e′ ∈ S} p(e′)
  end for
  return p
end procedure

Encoding-equivalent Classes.
The basic idea for grouping is based on the containment relationship between a query q_i and the patterns b ∈ E in the encoding E. More precisely, if q_i ⊇ b, the assignment ρ(q_i) on the i-th dimension is constrained by the marginal E[b] (see Section 3.2). As a result, if queries q_i, q_j share the same containment relationship with a pattern b, the assignments ρ(q_i), ρ(q_j) on the i-th and j-th dimensions make no difference for satisfying the constraint of pattern b. We thus define pattern-equivalence as

q_i ≡_b q_j ⇔ BI(q_i, b) = BI(q_j, b)

where BI(q_i, b) is the binary indicator function satisfying BI(q_i, b) = 1 ≡ q_i ⊇ b. Queries are encoding-equivalent, q_i ≡_E q_j, if they are pattern-equivalent for all patterns in the encoding. Numbering the patterns in the encoding as b_1, ..., b_m, any binary vector v ∈ B_m maps to an equivalence class C_v = { q | (BI(q, b_1), ..., BI(q, b_m)) = v ∧ q ∈ N_n }. Though the number of non-empty equivalence classes may grow as large as O(2^m), it is much smaller than 2^n in most cases, and sampling a random distribution ρ can be divided into two steps as shown in Algorithm 1. Note that class_p in the algorithm, which is produced by the first step, is a randomly sampled distribution over all non-empty equivalence classes. The second step redistributes the probabilities randomly assigned to each equivalence class to its class members in an unbiased way.
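The following Python sketch is our own port of Algorithm 1 (the names `two_step_sampling`, `uni_rand_distrib_prob`, and the `contains` predicate mirror the pseudocode but are not the authors' implementation):

import random
from collections import defaultdict

def uni_rand_distrib_prob(elements, prob):
    # Assign each element a random weight and rescale so the weights sum to `prob`.
    weights = {e: random.random() for e in elements}
    total = sum(weights.values())
    return {e: prob * w / total for e, w in weights.items()}

def two_step_sampling(queries, patterns, contains):
    # `contains(q, b)` plays the role of the binary indicator BI(q, b).
    # Group queries into encoding-equivalence classes keyed by their BI vector.
    classes = defaultdict(list)
    for q in queries:
        classes[tuple(int(contains(q, b)) for b in patterns)].append(q)
    # Step 1: random distribution over the non-empty equivalence classes.
    class_p = uni_rand_distrib_prob(list(classes), 1.0)
    # Step 2: spread each class's probability mass over its members.
    rho = {}
    for v, members in classes.items():
        rho.update(uni_rand_distrib_prob(members, class_p[v]))
    return rho

# Toy usage: queries and patterns are feature sets, containment is the superset relation.
qs = [frozenset({"a"}), frozenset({"a", "b"}), frozenset({"b"})]
print(two_step_sampling(qs, [frozenset({"a"})], lambda q, b: b <= q))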
C.2 Incorporating Constraints

So far we have been creating random samples from an unconstrained space of distributions. To make sure that the ρ produced by the two-step sampling falls within the space Ω_E, the probabilities distributed over the equivalence classes (denoted class_p) must obey the linear equality constraints derived from the encoding E. Denote the space of candidate class_p as U and the subspace allowed by the encoding as U_E ⊆ U. One naive solution is to reject any class_p ∉ U_E. However, the subspace U_E constrained by linear equality constraints is equivalent to an intersection of hyperplanes in the full space U. The volume of U_E is thus infinitely small compared to that of U, so a random sample class_p ∈ U will almost never fall within U_E. To make sampling feasible, we do not reject a sample class_p ∈ U, but instead project it onto the hyperplane of U_E by finding its closest (in Euclidean distance) counterpart class_p′ ∈ U_E:

class_p′ = arg min_{class_p′ ∈ U_E} || class_p′ − class_p ||

Finding the projection point class_p′ of class_p can be achieved with standard constrained optimization (it is a least-squares problem under linear equality constraints).
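One way to compute such a projection, assuming the equality constraints are collected into a matrix A and vector b, is the closed-form least-squares projection sketched below (it handles only the equality constraints; any non-negativity requirement would need an extra step):

import numpy as np

def project_onto_constraints(x: np.ndarray, A: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Euclidean projection onto {x : A x = b}: x' = x - A^T (A A^T)^{-1} (A x - b)
    correction = A.T @ np.linalg.solve(A @ A.T, A @ x - b)
    return x - correction

# Toy usage: force the three class probabilities to sum to 1.
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([1.0])
print(project_onto_constraints(np.array([0.2, 0.5, 0.5]), A, b))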
D. ALGORITHM CONFIGURATIONS

Here we give a detailed description of the selected state-of-the-art pattern-based summarizers (i.e., Laserlight and MTV) and specify how we configured them in the experiments discussed in Section 7.2.
Common Configuration.
We set up both algorithms to mine 15 patterns from the target clusters. This is because we found empirically that MTV quits with an error message when asked to mine more than 15 patterns. For a fair comparison, we set the same number of patterns for Laserlight.
Description.
Laserlight algorithm is proposed in [20] forsummarizing multi-dimension data (i.e., D = ( X , . . . , X n ))augmented by a binary attribute A . The goal is to searchfor a set of patterns (i.e., encoding) from the data D thatprovide maximum information for predicting augmented at-tribute A , which is a sub-problem of summarizing the joint16istribution p ( D, A ). Another algorithm
Flashlight is alsoproposed in the same paper but we omit it in our experimentdue its inferior scalability. The implementation of
Laserlight has been incorporated into PostgreSQL 9.1 and the sourcecode is only available upon request.
Experiment Settings.
Due to the restriction on the maximum number of data dimensions in the PostgreSQL implementation of Laserlight, we project the distribution p(Q | L) onto a limited set of 100 features. The selection criterion is based on feature entropy, or variability. More precisely, regarding the existence of the i-th feature as a random binary variable X_i, features are ranked by their entropy H(X_i). The feature with the highest entropy H(X_i) is chosen as the augmented attribute A. The algorithm heuristically selects a limited set of samples from the space of candidate patterns, from which the most informative pattern is selected to be added to the encoding. Note that when we applied Laserlight in our experiments, we set the number of samples to 16, as suggested in [20] based on its own data sets.
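A minimal sketch of this selection step (not the authors' code; whether the augmented attribute A counts toward the 100 kept features is our assumption):

import numpy as np

def entropy(p: float) -> float:
    return 0.0 if p in (0.0, 1.0) else float(-p * np.log(p) - (1 - p) * np.log(1 - p))

def select_features(X: np.ndarray, k: int = 100):
    # Rank binary features by entropy; the top feature becomes the augmented attribute A.
    scores = np.array([entropy(p) for p in X.mean(axis=0)])
    ranked = np.argsort(-scores)              # highest entropy first
    return ranked[0], ranked[1:k]             # (attribute A, kept feature indices)

X = (np.random.rand(500, 300) < 0.3).astype(int)
a, kept = select_features(X)
print(a, kept[:5])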
D.2 MTV Algorithm
Description.
The MTV algorithm is proposed in [40] for summarizing multi-dimensional data with binary attributes. The goal is to mine a succinct set of patterns (i.e., an encoding) that conveys the most important information (see the paper for the definition). The implementation of this algorithm can be obtained at http://adrem.ua.ac.be/succinctsummary.
Experiment Settings.
MTV requires setting a minimum support threshold for patterns. That is, patterns whose marginal is less than the threshold are ignored, in order to reduce the search space of candidate patterns. We set the minimum support threshold to 0.05 in our experiments, so that any pattern contained in more than 5% of queries is considered a candidate.
D.3 Configuring Laserlight Mixture Fixed
Given a data partitioning and a fixed total number of patterns to mine from all clusters, in order to determine the number of patterns mined from each cluster we need to assign weights w_i, with Σ_i w_i = 1, to each cluster i. The Reproduction Error e(E_L) of the naive encoding E_L for a cluster reflects how 'easy' the cluster is for pattern mining; the intuition is that e(E_L) = 0 indicates no need for additional pattern mining. Reproduction Error e(E_L) is also affected by the number of features n that ever occur in the cluster. Consider a toy data set with only two feature vectors v_1 = (0, 0) and v_2 = (1, 1): extending v_1 with new features of value 0 and v_2 with new features of value 1 will increase e(E_L), but not necessarily the number of patterns needed to accurately summarize the data. Hence we normalize e(E_L) by dividing it by the number of features n, which gives us w_i = e(E_L) / n. In addition, since the generalized measure for Laserlight and MTV gives each cluster weight proportional to its number of distinct data instances m, we also adjust w_i by multiplying it by the number of distinct data instances. The final weight assignment becomes w_i ∝ (m / n) · e(E_L).
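A minimal sketch of this allocation (the helper `allocate_patterns` is hypothetical; rounding can leave the budget off by one or two):

def allocate_patterns(total_patterns, m, n, e):
    # Weight cluster i by (m_i / n_i) * e_i, then split the fixed pattern budget proportionally.
    raw = [mi / ni * ei for mi, ni, ei in zip(m, n, e)]
    s = sum(raw)
    return [round(total_patterns * r / s) for r in raw]

# Toy usage: distribute 100 patterns over three clusters.
print(allocate_patterns(100, m=[500, 200, 300], n=[40, 25, 60], e=[0.8, 0.1, 0.4]))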
E. INTERPRETING NAIVE MIXTURE ENCODING

Due to sensitive information contained in the US bank data set, we only provide a visualization of PocketData. The visualization of PocketData is based on its naive mixture encoding under 8 clusters. The result is given in Figure 10. There are 5 sub-figures, each representing the naive encoding of one cluster. Note that we use shading to represent the magnitude of marginals; features with marginals that are too small become invisible and are omitted. The question mark '?' is a placeholder for constants. Three of the eight clusters are not shown in the figure: one cluster is too messy (i.e., further sub-clustering is needed) and two clusters give visualizations similar to Figures 10a and 10e. The caption of each sub-figure expresses our understanding of the task that the queries in the cluster are performing, obtained by visualizing the corresponding naive encodings. For simplicity, we also omit features of the SELECT category if they neither participate in the WHERE clause nor are intuitively related to other features in SELECT.

SELECT conversation id, participants type, first name, chat id, blocked, active
FROM conversation participants view
WHERE (chat id != ?) ∧ (conversation id = ?) ∧ (active = 1)
(a) Check the person who is active in a specific conversation and not participating in a specified chat.

SELECT status, timestamp, expiration timestamp, sms raw sender, message id, text
FROM conversations, message notifications view, messages view
ORDER BY Descend on timestamp
LIMIT
WHERE (expiration timestamp > ?) ∧ (status != 5) ∧ (conversation id = ?) ∧ (conversations.conversation id = conversation id)
(b) Check sender information for the most recent SMS messages that participate in a given conversation.

SELECT status, timestamp, conversation id, chat watermark, message id, sms type
FROM conversations, message notifications view
WHERE (conversation status != 1) ∧ (conversation pending leave != 1) ∧ (conversation notification level != 10) ∧ (timestamp > 1355...) ∧ (timestamp > chat watermark) ∧ (conversation id = ?) ∧ (conversations.conversation id = conversation id)
(c) Check recent messages in conversations of a specific type.

SELECT suggestion type, name, chat id
FROM suggested contacts
LIMIT
ORDER BY Ascend on upper(name)
WHERE (chat id != ?) ∧ (name != ?)
(d) Suggest contacts that avoid certain names and a certain chat.

SELECT sms type, timestamp, id
FROM messages
WHERE (sms type = 1) ∧ (status = 4) ∧ (transport type = 3) ∧ (timestamp >= ?)
(e) Check messages under type/status conditions.

Figure 10: Visualizing PocketData by its naive mixture encoding.