Approximate Computation for Big Data Analytics
Shuai Ma, Jinpeng Huai
Abstract—Over the past few years, research and development has made significant progress on big data analytics. A fundamental issue for big data analytics is efficiency. If the optimal solution is unattainable, not required, or has a price too high to pay, it is reasonable to sacrifice optimality with a "good" feasible solution that can be computed efficiently. Existing approximation techniques can in general be classified into approximation algorithms, approximate query processing for aggregate SQL queries, and approximate computing for multiple layers of the system stack. In this article, we systematically introduce approximate computation, i.e., query approximation and data approximation, for efficient and effective big data analytics. We first explain the idea and rationale of query approximation, and show that efficiency can be obtained with high effectiveness in practice with three analytic tasks: graph pattern matching, trajectory compression and dense subgraph computation. We then explain the idea and rationale of data approximation, and show that efficiency can be obtained even without sacrificing effectiveness in practice with three analytic tasks: shortest paths/distances, network anomaly detection and link prediction.
Index Terms—Big data, query approximation, data approximation
INTRODUCTION
Over the past few years, research and development has made significant progress on big data analytics with support from both governments and industries all over the world, such as Spark (https://spark.apache.org), IBM Watson and Google AlphaGo (https://deepmind.com/research/alphago). A fundamental issue for big data analytics is efficiency, and various advances towards attacking this issue have been achieved recently, from theory to algorithms to systems [15], [29], [48]. However, if the optimal solution is unattainable, not required, or has a price too high to pay, it is reasonable to sacrifice optimality with a "good" feasible solution that can be computed efficiently. Hence, various approximation techniques have been developed, and they can in general be classified into three aspects: algorithms, SQL aggregate queries and multiple layers of the system stack.

(1) Approximation algorithms were formally defined in the 1970s [20], [28]. An approximation algorithm is necessarily polynomial, and is evaluated by the worst-case relative error over all possible instances of the NP-hard optimization problem, under the widely believed P ≠ NP conjecture. This is a relatively mature research field in the algorithm community, and many approximation algorithms have been designed for optimization problems (see books [4], [24], [45]).

(2)
Approximate query processing supports a slightly constrained set of SQL-style declarative queries, and it specifically provides approximate results for standard SQL aggregate queries, e.g., queries involving COUNT, AVG, SUM and PERCENTILE. Over the past two decades, approximate query processing has been successfully studied, among which sampling techniques are heavily employed [9], [21], [30], [42]. Not only do traditional DBMS systems, such as Oracle (https://oracle-base.com/articles/12c/approximate-query-processing-12cr2), provide approximate functions to support approximate results, but emerging new systems specially designed for approximate queries, such as BlinkDB (http://blindb.org/), Verdict (http://verdictdb.org/) and Simba (https://initialdlab.github.io/Simba/index.html), have been designed as well. However, as pointed out in [9], "it seems impossible to have an approximate query processing system that supports the richness of SQL with significant saving of work while providing an accuracy guarantee that is acceptable to a broad set of application workloads."

(3) Approximate computing is a recent technique that returns a possibly inaccurate result rather than a guaranteed accurate result, from a system point of view. It involves multiple layers of the system stack, from software to hardware to systems (such as approximate circuits, approximate storage and loop perforation), and can be used for applications where an approximate result is sufficient for their purpose [2], [41]. Since 2014, a workshop on approximate computing across the stack has been held for research on hardware, programming language and compiler support for approximate computing (see, e.g., http://approximate.computer/wax2016/, http://approximate.computer/wax2017/ and http://approximate.computer/wax2018/). Besides various task-oriented quality metrics, the quality-energy trade-off is also a concern for approximate computing.

(S. Ma and J. Huai are with the SKLSDE Lab & Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing, China. E-mail: {mashuai, huaijp}@buaa.edu.cn.)
For instance, allowing only a 5% loss of classification accuracy can provide 50 times the energy saving for the clustering algorithm k-means [41].

In this article, we present the idea of approximate computation for efficient and effective big data analytics: query approximation and data approximation, based on our recent research experiences [13], [14], [25], [31], [33]–[39]. Approximation algorithms ask for feasible solutions that are theoretically bounded with respect to optimal solutions, from an algorithm design aspect. Approximate query processing and approximate computing relax the need for accuracy guarantees for aggregate SQL queries and for multiple layers of the system stack, respectively. Similarly, our approximate computation is not necessarily theoretically bounded with respect to optimal solutions, but it is also taken from an algorithm design point of view. That is, we focus on approximate computation for big data analytics in situations where an approximate result is sufficient for the purpose.

Figure 1. Query approximation
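The perforation-style trade-off illustrated above with k-means can be sketched in a few lines: skip a fraction of the points in each iteration and accept slightly worse centers. This is our own minimal sketch under stated assumptions (function name, initialization scheme and perforation knob are ours, not the setup evaluated in [41]):

```python
import random

def kmeans_perforated(points, k, iters=20, perforation=0.0, seed=7):
    """Toy k-means over 2-D points. `perforation` is the fraction of points
    skipped in each iteration (0.0 = exact); skipping work trades accuracy
    for savings, in the spirit of loop perforation."""
    rng = random.Random(seed)
    centers = list(points[:k])  # simple init: assumes the first k points are spread out
    for _ in range(iters):
        # loop perforation: run the assignment step on a random subsample
        sample = [p for p in points if rng.random() >= perforation]
        clusters = [[] for _ in range(k)]
        for x, y in sample:
            c = min(range(k),
                    key=lambda i: (x - centers[i][0]) ** 2 + (y - centers[i][1]) ** 2)
            clusters[c].append((x, y))
        for c in range(k):  # move each center to the mean of its cluster
            if clusters[c]:
                xs, ys = zip(*clusters[c])
                centers[c] = (sum(xs) / len(xs), sum(ys) / len(ys))
    return centers
```

On well-separated data, dropping half of the points per iteration roughly halves the distance computations while the returned centers typically stay close to the exact ones.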
QUERY APPROXIMATION
Query approximation deals with complex queries involved in big data analytic tasks. Given a class Q of data analytic queries with a high computational complexity, query approximation transforms Q into another class Q′ of queries with a low computational complexity and satisfiable approximate answers, as depicted in Fig. 1, in which Q, Q′, D and R denote the original query, approximate query, data and query result, respectively. Query approximation needs to reach a balance between query efficiency and answer quality when approximating Q with Q′.

The rationale behind query approximation lies in that inexact or approximate answers are sufficient or acceptable for many big data analytic tasks. On one hand, when the volume of data is extremely large, it may be impossible or unnecessary to compute the exact answers. Observe that nobody would try each and every store to find a pair of shoes with the best cost-performance ratio; that is, inexact (approximate) solutions are good enough in certain cases. On the other hand, when taking noise (very common in big data) into account, it may not always be a good idea to compute exact answers for those data analytic tasks whose answers are rare or hard to identify, such as the detection of homegrown violent extremists (HVEs) who seek to commit acts of terrorism in the United States and abroad [26], as finding exact solutions may have a high chance of missing or ignoring possible candidates.

We next explain query approximation in more detail using three different data analytic tasks.

(1) Strong Simulation [33], [34]. Given a pattern graph Q and a data graph G, graph pattern matching is to find all subgraphs of G that match Q, and it is increasingly used in various applications, e.g., biology and social networks.
Here matching is typically defined in terms of subgraph isomorphism [19]: a subgraph G_s of G matches Q if there exists a bijective function f from the nodes of Q to the nodes of G_s such that (a) for each pattern node u in Q, u and f(u) have the same label, and (b) there exists an edge (u, u′) in Q if and only if there exists an edge (f(u), f(u′)) in G_s.

The strength of subgraph isomorphism is that all matched subgraphs are exactly the same as the pattern graph, i.e., the topology structure between the pattern graph and the data graph is completely preserved. However, subgraph isomorphism is NP-complete, and may return exponentially many matched subgraphs. Further, subgraph isomorphism is too restrictive to find sensible matches in certain scenarios, as observed in [17]. Even worse, online data in many cases represents only a partial world (e.g., terrorist collaboration networks and homosexual networks are often accompanied by a large amount of offline data). Exact computations on online data, whose offline counterpart is extremely hard to collect, typically decrease the chance of identifying candidate answers.
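The definition above can be checked naively by enumerating injective mappings from pattern nodes to data nodes, which makes the exponential cost explicit. A toy sketch with our own plain-Python graph encoding (node lists, edge pair lists and a label dictionary), not tied to any system:

```python
from itertools import permutations

def subgraph_isomorphic(q_nodes, q_edges, g_nodes, g_edges, labels):
    """Naive matcher per the definition: search for an injective mapping f
    with matching labels under which every pattern edge is preserved
    (G_s is then the image of Q). Enumerating the mappings is exponential
    in the pattern size -- the source of the NP-completeness noted above."""
    g_set = set(g_edges) | {(v, u) for (u, v) in g_edges}  # undirected edges
    for image in permutations(g_nodes, len(q_nodes)):
        f = dict(zip(q_nodes, image))
        if all(labels[u] == labels[f[u]] for u in q_nodes) and \
           all((f[u], f[v]) in g_set for (u, v) in q_edges):
            return True
    return False
```

Even this small sketch inspects up to |G|!/(|G|-|Q|)! candidate mappings, which is why cubic-time relaxations such as graph simulation are attractive.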
These drawbacks hinder the usability of graph pattern matching in emerging applications.

To lower the high complexity of subgraph isomorphism, substitutes for subgraph isomorphism [16], [17], which allow graph pattern matching to be conducted in cubic time, have been proposed by extending graph simulation [23]. However, they fall short of capturing the topology of data graphs, i.e., matched graphs may have a structure drastically different from the pattern graphs that they match, and the matches found are often too large to analyze.

To rectify these problems, strong simulation, an "approximate" substitute for subgraph isomorphism, is proposed for graph pattern matching [34], which (a) theoretically preserves the key topology of pattern graphs and finds a bounded number of matches, (b) retains the same complexity as earlier extensions of graph simulation [16], [17], by providing a cubic-time algorithm for strong simulation computation, and (c) has a locality property that allows us to develop an effective distributed algorithm to conduct graph pattern matching on distributed graphs.

Strong simulation is experimentally verified to identify sensible matches that are not found by subgraph isomorphism, and to find high quality matches that retain graph topology: indeed, 70%-80% of the matches found by subgraph isomorphism are retrieved by strong simulation. Further, strong simulation is significantly faster than subgraph isomorphism, and has a bounded number of matches.

(2) One-Pass Trajectory Compression [31]. Trajectory compression (a.k.a. trajectory simplification) is to compress the data points of a trajectory into a set of continuous line segments, and is commonly used in practice. The compression ratios of lossless methods are poor, and querying the compressed data is time consuming due to the reconstruction of the original data [43].
Hence, lossy techniques, which provide approximate solutions with good compression ratios and bounded errors, are the mainstream.

Piece-wise line simplification (LS) comes from computational geometry; its target is to approximate a given finer piece-wise linear curve by another coarser piece-wise linear curve (normally a subset of the former), such that the maximum distance of the former from the latter is bounded by a user-specified constant (i.e., an error bound). It is widely used due to its distinct advantages: (a) it is simple and easy to implement, (b) it needs no extra knowledge and is suitable for freely moving objects, and (c) it has bounded errors with good compression ratios.

LS algorithms fall into two categories: optimal and approximate. Optimal methods [27] find the minimum number of points or segments to represent the original polygonal lines w.r.t. an error bound ε. They have higher time and space complexities, and are not practical for large trajectory data. Hence, various approximate LS algorithms have been developed, from batch algorithms (e.g., [12]) to online algorithms (e.g., [32]) and to one-pass algorithms (e.g., [31]).

An LS algorithm is one-pass if it processes each point in a trajectory once and only once when compressing the trajectory. Obviously, one-pass algorithms have low time and space complexities, and are more appropriate for online processing. The difficulty comes from the need to achieve effective compression ratios. Existing batch trajectory simplification algorithms (e.g., [12]) and online algorithms (e.g., [32]) essentially employ global distance checking, although online algorithms restrict the checking within a window. That is, whenever a new line segment is formed, these algorithms check its distances to all or a subset of the data points, and, therefore, a data point is checked multiple times, depending on its order in the trajectory and the number of directed line segments formed.
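As a concrete instance of global distance checking, the batch algorithm of [12] recursively keeps the point farthest from the tentative segment, so a point is re-examined at every level of the recursion. A minimal sketch with plain point-to-segment Euclidean distance (trajectory systems may use other distance functions):

```python
def point_segment_dist(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    # project p onto the segment, clamped to [0, 1]
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    cx, cy = ax + t * dx, ay + t * dy
    return ((px - cx) ** 2 + (py - cy) ** 2) ** 0.5

def douglas_peucker(points, eps):
    """Batch simplification in the style of [12]: every interior point is
    checked against each tentative segment (global distance checking)."""
    if len(points) < 3:
        return list(points)
    # farthest point from the segment joining the two endpoints
    i, d = max(((k, point_segment_dist(points[k], points[0], points[-1]))
                for k in range(1, len(points) - 1)), key=lambda kd: kd[1])
    if d <= eps:
        return [points[0], points[-1]]
    left = douglas_peucker(points[:i + 1], eps)
    right = douglas_peucker(points[i:], eps)
    return left[:-1] + right  # drop the duplicated split point
```

Each recursive call rescans its interior points, which is precisely the repeated checking that one-pass methods avoid.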
Hence, an appropriate local distance checking approach is needed in the first place for one-pass trajectory simplification. We develop a local distance checking method, referred to as the fitting function, such that a data point is checked only once in the entire process. Based on the fitting function, we develop the one-pass error-bounded trajectory simplification algorithms OPERB and OPERB-A, which scan each data point in a trajectory once and only once, allowing the interpolation of new data points or not, respectively. Comparing our algorithms with FBQS (the fastest existing online LS algorithm [32]) and DP (the best existing batch LS algorithm in terms of compression ratio [12]), our one-pass algorithms OPERB and OPERB-A are over four times faster than FBQS, and have compression ratios comparable with DP.

(3) Dense Temporal Subgraph Computation [39]. We study dense subgraphs in a special type of temporal networks whose nodes and edges are kept fixed, but whose edge weights constantly and regularly vary with timestamps [39]. Essentially, a temporal network with T timestamps can be viewed as T snapshots of a static network such that the nodes and edges are kept the same among these T snapshots, while the edge weights vary with the snapshots. Road traffic networks typically fall into this category of temporal networks, and dense subgraphs are used for road traffic analyses that are of particular importance for the transportation management of large cities.

Dense subgraphs are a general concept, and their concrete semantics highly depends on the studied problems and applications. Though dense subgraphs have been widely studied in static networks, how to properly define their semantics over temporal networks is still at an early stage, not to mention effective and efficient analytic algorithms. We adopt the form of dense temporal subgraphs initially defined and studied in [5], such that a temporal subgraph corresponds to a connected subgraph measured by the sum of all its edge weights in a time interval, i.e., a continuous sequence of timestamps. Intuitively, a dense subgraph that we consider corresponds to a collection of connected slow or jammed roads (i.e., a jam area) in road networks, lasting for a continuous sequence of snapshots.

The problem of finding dense subgraphs in temporal networks is non-trivial: it is already NP-complete even for a temporal network with a single snapshot and with +1 or −1 edge weights only, as observed in [5]. Even worse, it remains hard to approximate for temporal networks with single snapshots [39].
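The interval-based objective above can be made concrete: for a fixed candidate edge set, its score over a time interval is simply the summed edge weights, and a brute-force scan already has to touch every one of the T(T+1)/2 intervals. A toy sketch with our own encoding (`weights[t]` maps edge names to weights at timestamp t); the actual problem must additionally search over connected subgraphs, which is the source of its NP-completeness:

```python
def interval_score(weights, edges, i, j):
    """Sum of the weights of `edges` over timestamps i..j (inclusive)."""
    return sum(weights[t][e] for t in range(i, j + 1) for e in edges)

def best_interval(weights, edges):
    """Scan all T*(T+1)/2 intervals for a *fixed* edge set and return the
    densest one as (start, end, score)."""
    T = len(weights)
    return max(((i, j, interval_score(weights, edges, i, j))
                for i in range(T) for j in range(i, T)),
               key=lambda x: x[2])
```

This quadratic number of intervals is exactly what a filter-and-verification framework tries to prune, and what a data-driven approach sidesteps by predicting a few promising intervals directly.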
Moreover, given a temporal network with T timestamps, there are in total T(T+1)/2 time intervals to consider, which further aggravates the difficulty. The state-of-the-art solution MEDEN [5] adopts a Filter-And-Verification (FAV) framework: even if a large portion of the time intervals are filtered, there often remain a large number of time intervals to verify. Hence, this method is not big data friendly, and is not scalable when temporal networks have a large number of nodes/edges or a large number T of timestamps.

We develop a data-driven approach (referred to as FIDES), instead of filter-and-verification, to identify the most likely k time intervals among the T(T+1)/2 time intervals, in which T is the number of snapshots and k is a small constant, e.g., 10. This is achieved by exploring the characteristics of the time intervals involved with dense subgraphs, based on the observation of an evolving convergence phenomenon in traffic data, inspired by convergent evolution in nature (https://en.wikipedia.org/wiki/Convergent_evolution). That is, our method provides time intervals with probabilistic guarantees, instead of the exact ones of FAV. Using both real-life and synthetic data, we experimentally show that our method FIDES is over 1000 times faster than MEDEN [5], while the quality of the dense subgraphs found is comparable with MEDEN.

DATA APPROXIMATION
Big data has a large volume, and, hence, the space complexity [11] of big data analytic tasks starts raising more concerns. Given a class Q of queries on data D, data approximation transforms D into smaller D′ such that Q on D′ returns a sufficient or satisfiable approximate answer in a more efficient way. Further, it is typically common that a query Q needs to be (slightly) modified to Q′ to accommodate the change from D to D′, as shown in Fig. 2. Similar to query approximation, data approximation needs to reach a balance between query efficiency and answer quality.

Figure 2. Data approximation

The rationale behind data approximation has roots in the Pareto principle (https://en.wikipedia.org/wiki/Pareto_principle), which "states that, for many events, roughly 80% of the effects come from 20% of the causes". The critical thing for data approximation is to determine which part of the data is relevant to the task (i.e., belongs to the 20%). By this principle, for many big data analytic tasks, one may only need to keep a small amount of the data to derive high quality answers. For example, when we are to build a predictive model for the stock of razors of an online store based on the order history of customers, the orders from men are good enough, while for the stock of lipsticks, those from women are good enough. That is to say, it is not necessary to use the entire data for certain data analytic tasks.

However, it should be pointed out that there are data analytic tasks for which data approximation may not work well. For example, an online store needs to count the total number of goods in its catalog. Essentially all goods should be considered for this task, and if only a (small) portion of the goods are chosen, it is hard to obtain a satisfiable result.

We next explain data approximation in more detail using three different data analytic tasks.
(1) Proxies for Shortest Paths and Distances [36], [37]. Computing shortest paths and distances is one of the fundamental problems on graphs. We study the node-to-node shortest path (distance) problem on large graphs: given a weighted undirected graph G(V, E) with non-negative edge weights and two nodes of G, the source s and the target t, find the shortest path (distance) from s to t in G. Dijkstra's algorithm with Fibonacci heaps runs in O(n log n + m) time due to Fredman & Tarjan [11], where n and m denote the numbers of nodes and edges in a graph, respectively, and this remains asymptotically the fastest known solution on arbitrary undirected graphs with non-negative edge weights. However, computing shortest paths and distances remains challenging, in terms of both time and space cost, on large-scale graphs. Hence, various optimizations have been developed to speed up the computation.

To speed up shortest path and distance queries, we propose proxies that have the following properties: (a) each proxy captures a set of nodes in a graph, referred to as a DRA, (b) a small number of proxies can represent a large number of nodes in a graph, (c) shortest paths and distances within the set of nodes represented by the same proxy can be answered efficiently, and (d) the proxies and the sets of nodes being represented can be computed efficiently, in linear time.

The framework for speeding up shortest path and distance queries with proxies consists of two modules, preprocessing and query answering, as follows.

(a) Preprocessing: Given a graph G(V, E), it first computes all DRAs and their maximal proxies in linear time, then it computes and stores all the shortest paths and distances between any node and its proxy. It finally computes the reduced subgraph G′ by removing all DRAs from graph G, i.e., keeping the proxies only.

(b) Query answering: Given two nodes s and t in graph G(V, E) and the pre-computed information, the query answering module essentially executes the following. The shortest path path(s, t) is the concatenation of path(s, u_s), path(u_s, u_t) and path(u_t, t), where u_s and u_t are the proxies of s and t, respectively. As path(s, u_s) and path(u_t, t) are pre-computed, path(u_s, u_t) can be computed on the reduced subgraph G′ by invoking any existing algorithm (e.g., AH [49]). The shortest distance dist(s, t) = dist(s, u_s) + dist(u_s, u_t) + dist(u_t, t) can be computed along the same lines.

Essentially, we propose a light-weight data reduction technique for speeding up (exact) shortest path and distance queries on large weighted undirected graphs [36]. We experimentally show that a large fraction of the nodes of real-life social and road networks are captured by proxies.

(2) Network Anomaly Detection [25]. Anomaly (or outlier) detection aims at identifying those objects in a dataset that are unusual, i.e., different from the majority and therefore suspected of resulting from a contamination, error, or fraud [50]. Network anomaly detection has become very popular recently because of the importance of discovering key regions of structural inconsistency in networks. In addition to the application-specific information carried by anomalies, the presence of such structural inconsistency is often an impediment to the effective application of data mining algorithms such as community detection and classification. Networks are inherently complex entities, and, hence, anomalies may be defined in a wide variety of ways.
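For instance, given any vectors describing each node's affinity to clustered regions of the network, one simplistic anomaly notion counts how many distinct regions a node's neighborhood is anchored in. The sketch below is our own toy illustration of such a definition, not the estimator of [25]; `emb` maps nodes to hypothetical region-affinity vectors:

```python
def anomaly_score(node, adj, emb):
    """Count the distinct dominant regions (argmax dimensions) among a
    node's neighbors: a node bridging more clustered regions is treated
    as more structurally inconsistent. A toy stand-in, not [25]'s score."""
    dominant = {max(range(len(emb[v])), key=lambda i: emb[v][i])
                for v in adj[node]}
    return len(dominant)
```

A node whose neighbors all peak in one dimension scores 1, while a "broker" touching several regions scores higher, which is the intuition behind embedding-based detection of structural inconsistencies.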
Our goal is to discover structural inconsistencies, i.e., anomalous nodes that connect to a number of diverse influential communities, inspired by the concept of social brokers across groups, which provide social capital in networks [7]. While a variety of graph embeddings, such as multidimensional scaling [6], are available in the literature, they aim to preserve (global) pairwise similarities and are not optimized for networks or for the problem of anomaly detection. Hence, they cannot be directly used for the detection of the structural inconsistencies studied here. We propose a novel graph embedding method, specifically designed to ferret out the anomalous nodes in large networks.

Our embedding approach is based on a model in which each dimension of the embedding corresponds to a clustered region in the network. In other words, the similarity of different nodes along a particular dimension indicates their similarity to a particular clustered region. Therefore, this embedding retains a very high level of interpretability in terms of the original graph data, which is very useful from an application-specific perspective. The nature of the embedding also makes it possible to detect anomalous nodes by examining the interaction of each node with the different regions in terms of the embedding. In particular, we measure the level of anomalousness of a node in terms of the embedding imposed on the node and its neighbors.

Each node in a graph embedding is represented as a d-dimensional vector, and the dimensionality d can be large. Hence, such an approach is rather hard to apply to large networks, because the complexity of the approach is proportional to the square of the number of nodes when optimizing the embedding, and because the noise in the embedding seriously impairs the accuracy of the detected anomalous nodes. Hence, we incorporate data approximation techniques (sampling, graph partitioning and, moreover, a novel dimension reduction technique) to make the approach more scalable and effective for large networks.

Essentially, in our graph embedding, d represents the number of communities. As the anomalous nodes are only determined by influential communities and nodes typically connect to a limited number of communities, the complete d dimensions are unnecessary, and a limited number of communities suffice to ascertain anomalies. Thus, our dimension reduction technique (referred to as k+β reduction) only maintains (k+β) dimensions for the embedding of each node, where k is the maximum number of communities to connect to, and β is to tolerate mistakes when determining the k communities and is removed after the computation process. Here k, β ≪ d, e.g., k = 10 and β = 2 when d = 600.

Using both real-life and synthetic data, we conduct an extensive experimental study. (a) The modularity [10] was increased by about 4.9% and 3.6% with our approach and OddBall [3], respectively. (b) Our approach scales to graphs with a large number of communities, while the traditional multidimensional scaling approach [6] ran out of memory.

(3) Ensemble Enabled Link Prediction [13], [14]. Link prediction is the task of predicting the formation of future links in a dynamic and evolving network, and it has been extensively studied due to its numerous applications, such as the recommendation of friends in a social network, images in a multimedia network, or collaborators in a scientific network. Link prediction methods are often applied to very large and sparse networks, which have a large search space of size O(n²), where n is the number of nodes. Hence, scalability is a big challenge. In fact, an often overlooked point is that most existing link prediction algorithms evaluate the link propensities only over a subset of the possibilities rather than all propensities over the entire network.
Consider a large network with n nodes: the number of possible links is of the order of n². For networks with hundreds of millions of nodes, a 3GHz processor would thus require days just to allocate one machine cycle to every pair of nodes. This implies that, in order to determine the top-ranked link predictions over the entire network, the running time would be far longer still.

It is noteworthy that most existing link prediction algorithms are not designed to search over all O(n²) possibilities. A closer examination of the relevant studies shows that, even for networks of modest size, these algorithms perform benchmark evaluations over a sample of the possibilities for links. In other words, the complete ranking problem for link prediction in very large networks remains challenging, at least from a computational point of view.

Latent factor models have proven a great success for collaborative filtering, but not for link prediction, in spite of the obvious similarity between the two problems and the obvious effectiveness of latent factor models. One of the reasons why latent factor models are rarely used for link prediction is their complexity: in collaborative filtering applications, the items number at most a few hundred thousand, whereas even the smallest real-world networks contain more than a million nodes. Even worse, we also experimentally verify that the quality of link prediction with latent factor models decreases as data sparsity increases, and networks typically become sparser as their sizes grow larger.

We explore an ensemble approach to making latent factor models practical for link prediction by decomposing the search space into a set of smaller matrices with three structural bagging methods with performance guarantees, which has obvious effectiveness advantages. In this way, latent factor models only need to deal with networks of small sizes (and denser ones), and they retain their effectiveness and efficiency.
By incorporating the characteristics of link prediction, the bagging methods maintain high prediction accuracy while reducing the network size via graph sampling techniques. Further, the use of an ensemble approach has obvious robustness advantages as well. We experimentally show that our ensemble approach is both substantially faster and more accurate than BIGCLAM [47] on real-life social networks.
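The flavor of this decomposition can be sketched as follows: score candidate pairs on several node-sampled subnetworks (bags) and average the per-bag scores, so each base model only sees a smaller graph. This is our own minimal sketch; common-neighbor counting stands in for a latent factor model, and all names and parameters are assumptions:

```python
import random
from collections import defaultdict

def common_neighbors(adj, u, v):
    """Simple base predictor: number of shared neighbors of u and v."""
    return len(adj[u] & adj[v])

def ensemble_predict(adj, pairs, bags=5, keep=0.7, seed=1):
    """Bagging sketch: each bag keeps a random `keep` fraction of the nodes,
    scores the candidate pairs on the induced subnetwork, and the per-bag
    scores are averaged. `adj` maps each node to a set of neighbors."""
    rng = random.Random(seed)
    nodes = list(adj)
    scores = defaultdict(float)
    for _ in range(bags):
        kept = set(rng.sample(nodes, int(keep * len(nodes))))
        sub = {u: adj[u] & kept for u in kept}  # induced subnetwork
        for u, v in pairs:
            if u in sub and v in sub:
                scores[(u, v)] += common_neighbors(sub, u, v) / bags
    return dict(scores)
```

Averaging over bags also illustrates the robustness benefit: a pair that scores well in most sampled subnetworks is a more reliable prediction than one that scores well only once.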
BEYOND APPROXIMATION TECHNIQUES
For big data analytics, there are no one-size-fits-all techniques, and it is often necessary to combine different techniques to obtain good solutions. We have seen that sampling helps to achieve a balance between efficiency and effectiveness for approximate query processing [9], [21], [30], [42] and link prediction [13], [14]. Other techniques, such as incremental computation [17], [38], [44], distributed computing [18], [35] and system techniques, e.g., caching [46] and hardware [1], [22], can also be utilized for big data analytics, and can even be combined when designing query and data approximation techniques for big data analytics.

It is worth pointing out that (1) for all kinds of big data analytics techniques, the various computing resources should be seriously considered, e.g., using bounded resources for approximation [8] and for incremental computation [40], and (2) theoretical analyses are also important for developing approximation techniques. For instance, our query and data approximation techniques are based on solid theoretical results [31], [33], [34], [36], [37].
CONCLUSIONS
In this article we have systematically introduced approximate computation techniques for efficient and effective big data analytics. Although approximate computation does not impose theoretical bounds with respect to optimal solutions, it does expect a balance between efficiency and effectiveness. Indeed, (a) our query approximation techniques [31], [33], [34], [39] show that efficiency can be obtained with high accuracy in practice, and (b) our data approximation techniques [13], [14], [25], [36], [37] show that efficiency and accuracy can be obtained simultaneously for certain data analytic tasks. That is, though approximate computation is for situations where an approximate result is sufficient for the purpose, its design policy is not always to sacrifice effectiveness for efficiency.
Acknowledgement. This work is supported in part by the 973 program and NSFC (U1636210 & 61421003). We would also like to thank our colleagues Charu Aggarwal, Wenfei Fan, Xuelian Lin and our students Yang Cao, Liang Duan, Kaiyu Feng, Renjun Hu, Han Zhang for their joint efforts.

REFERENCES
[1] C. R. Aberger, A. Lamb, S. Tu, A. Nötzli, K. Olukotun, and C. Ré. Emptyheaded: A relational engine for graph processing.
ACM Trans. Database Syst., 42(4):20:1–20:44, 2017.
[2] A. Agrawal, J. Choi, K. Gopalakrishnan, S. Gupta, R. Nair, J. Oh, D. A. Prener, S. Shukla, V. Srinivasan, and Z. Sura. Approximate computing: Challenges and opportunities. In ICRC, 2016.
[3] L. Akoglu, H. Tong, and D. Koutra. Graph based anomaly detection and description: a survey. Data Min. Knowl. Discov., 29(3):626–688, 2015.
[4] G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti-Spaccamela, and M. Protasi. Complexity and Approximation: Combinatorial Optimization Problems and Their Approximability Properties. Springer, 1999.
[5] P. Bogdanov, M. Mongiovì, and A. K. Singh. Mining heavy subgraphs in time-evolving networks. In ICDM, 2011.
[6] I. Borg and P. Groenen. Modern Multidimensional Scaling: Theory and Applications (2nd ed.). Springer, 2005.
[7] R. S. Burt. Structural holes and good ideas. American Journal of Sociology, 110(2):349–399, 2004.
[8] Y. Cao and W. Fan. Data driven approximation with bounded resources. PVLDB, 10(9):973–984, 2017.
[9] S. Chaudhuri, B. Ding, and S. Kandula. Approximate query processing: No silver bullet. In SIGMOD, 2017.
[10] A. Clauset, M. E. J. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E, 70:066111, 2004.
[11] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. The MIT Press, 2001.
[12] D. H. Douglas and T. K. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. The Canadian Cartographer, 10(2):112–122, 1973.
[13] L. Duan, C. C. Aggarwal, S. Ma, R. Hu, and J. Huai. Scaling up link prediction with ensembles. In WSDM, 2016.
[14] L. Duan, S. Ma, C. Aggarwal, T. Ma, and J. Huai. An ensemble approach to link prediction. TKDE, 29(11):2402–2416, 2017.
[15] W. Fan, F. Geerts, and F. Neven. Making queries tractable on big data with preprocessing. PVLDB, 6(9):685–696, 2013.
[16] W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Adding regular expressions to graph reachability and pattern queries. In ICDE, 2011.
[17] W. Fan, J. Li, S. Ma, N. Tang, Y. Wu, and Y. Wu. Graph pattern matching: From intractable to polynomial time. PVLDB, 3(1), 2010.
[18] W. Fan, J. Xu, Y. Wu, W. Yu, J. Jiang, Z. Zheng, B. Zhang, Y. Cao, and C. Tian. Parallelizing sequential graph computations. In SIGMOD, 2017.
[19] B. Gallagher. Matching structure and semantics: A survey on graph-based pattern matching. AAAI FS, 2006.
[20] M. R. Garey, R. L. Graham, and J. D. Ullman. Worst-case analysis of memory allocation algorithms. In STOC, 1972.
[21] M. N. Garofalakis and P. B. Gibbons. Approximate query processing: Taming the terabytes. In VLDB, 2001.
[22] S. Han, L. Zou, and J. X. Yu. Speeding up set intersections in graph algorithms using SIMD instructions. In SIGMOD, 2018.
[23] M. R. Henzinger, T. A. Henzinger, and P. W. Kopke. Computing simulations on finite and infinite graphs. In FOCS, 1995.
[24] D. S. Hochbaum. Approximation Algorithms for NP-Hard Problems. Springer, 1996.
[25] R. Hu, C. C. Aggarwal, S. Ma, and J. Huai. An embedding approach to anomaly detection. In ICDE, 2016.
[26] B. W. K. Hung and A. P. Jayasumana. Investigative simulation: Towards utilizing graph pattern matching for investigative search. In ASONAM, 2016.
[27] H. Imai and M. Iri. Computational-geometric methods for polygonal approximations of a curve. Computer Vision, Graphics, and Image Processing, 36:31–41, 1986.
[28] D. S. Johnson. Approximation algorithms for combinatorial problems. J. Comput. Syst. Sci., 9(3):256–278, 1974.
[29] M. I. Jordan. Computational thinking, inferential thinking and "big data". In
PODS , 2015.[30] T. Kraska. Approximate query processing for interactive datascience. In
SIGMOD , 2017. [31] X. Lin, S. Ma, H. Zhang, T. Wo, and J. Huai. One-pass errorbounded trajectory simplification.
PVLDB , 10(7):841–852, 2017.[32] J. Liu, K. Zhao, P. Sommer, S. Shang, B. Kusy, and R. Jurdak.Bounded quadrant system: Error-bounded trajectory compressionon the go. In
ICDE , 2015.[33] S. Ma, Y. Cao, W. Fan, J. Huai, and T. Wo. Capturing topology ingraph pattern matching.
PVLDB , 5(4):310–321, 2011.[34] S. Ma, Y. Cao, W. Fan, J. Huai, and T. Wo. Strong simulation:Capturing topology in graph pattern matching.
TODS , 39(1):4:1–4:46, 2014.[35] S. Ma, Y. Cao, J. Huai, and T. Wo. Distributed graph patternmatching. In
WWW , 2012.[36] S. Ma, K. Feng, J. Li, H. Wang, G. Cong, and J. Huai. Proxies forshortest path and distance queries.
TKDE , 28(7):1835–1850, 2016.[37] S. Ma, K. Feng, J. Li, H. Wang, G. Cong, and J. Huai. Proxies forshortest path and distance queries. In
ICDE , 2017.[38] S. Ma, C. Gong, R. Hu, D. Luo, C. Hu, and J. Huai. Queryindependent scholarly article ranking. In
ICDE , 2018.[39] S. Ma, R. Hu, L. Wang, X. Lin, and J. Huai. Fast computation ofdense temporal subgraphs. In
ICDE , 2017.[40] S. Ma, J. Li, C. Hu, X. Liu, and J. Huai. Graph pattern matchingfor dynamic team formation.
CoRR , abs/1801.01012, 2018.[41] S. Mittal. A survey of techniques for approximate computing.
ACM Comput. Surv. , 48(4):62:1–62:33, 2016.[42] B. Mozafari. Approximate query engines: Commercial challengesand research opportunities. In
SIGMOD , 2017.[43] A. Nibali and Z. He. Trajic: An effective compression system fortrajectory data.
TKDE , 27(11):3138–3151, 2015.[44] G. Ramalingam and T. Reps. On the computational complexity ofdynamic graph problems.
TCS , 158(1-2), 1996.[45] V. V. Vazirani.
Approximation Algorithms . Springer, 2003.[46] J. Wang, Z. Liu, S. Ma, N. Ntarmos, and P. Triantafillou. GC: Agraph caching system for subgraph/supergraph queries.
PVLDB ,11(12):2022–2025, 2018.[47] J. Yang and J. Leskovec. Overlapping community detection atscale: A nonnegative matrix factorization approach. In
WSDM ,2013.[48] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave,X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi,J. Gonzalez, S. Shenker, and I. Stoica. Apache spark: a unifiedengine for big data processing.
Commun. ACM , 59(11):56–65, 2016.[49] A. D. Zhu, H. Ma, X. Xiao, S. Luo, Y. Tang, and S. Zhou. Shortestpath and distance queries on road networks: towards bridgingtheory and practice. In
SIGMOD , 2013.[50] A. Zimek and E. Schubert.
Outlier Detection , pages 1–5. SpringerNew York, 2017.
Shuai Ma is a professor at the School of Computer Science and Engineering, Beihang University, China. He obtained his PhD degrees from the University of Edinburgh in 2010 and from Peking University in 2004, respectively. He was a postdoctoral research fellow in the database group, University of Edinburgh, a summer intern at Bell Labs, Murray Hill, USA, and a visiting researcher at MSRA. He is a recipient of the best paper award for VLDB 2010 and the best challenge paper award for WISE 2013. His current research interests include database theory and systems, social data and graph analysis, and data intensive computing.