Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Zhewei Wei is active.

Publication


Featured research published by Zhewei Wei.


Symposium on Principles of Database Systems | 2012

Mergeable summaries

Pankaj K. Agarwal; Graham Cormode; Zengfeng Huang; Jeff M. Phillips; Zhewei Wei; Ke Yi

We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means that the summaries can be merged like other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the data sets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this paper, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ε-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ε); for ε-approximate quantiles, there is a deterministic summary of size O((1/ε) log(εn)) that has a restricted form of mergeability, and a randomized one of size O((1/ε) log^(3/2)(1/ε)) with full mergeability. We also extend our results to geometric summaries such as ε-approximations and ε-kernels. We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ε-approximate quantiles that depends only on ε, of size O((1/ε) log^(3/2)(1/ε)), and (2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic.
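As an illustration of the heavy-hitter result, here is a minimal Python sketch of merging two Misra-Gries (MG) summaries of size at most k: combine the counters, then subtract the (k+1)-th largest count from every counter and drop the non-positive ones. This is an illustrative sketch of the merge procedure the paper analyzes, not the authors' implementation.

```python
def mg_merge(s1, s2, k):
    # Combine the counters of the two summaries.
    merged = dict(s1)
    for item, c in s2.items():
        merged[item] = merged.get(item, 0) + c
    # If more than k counters remain, subtract the (k+1)-th largest
    # count from every counter and keep only the positive ones.
    if len(merged) > k:
        sub = sorted(merged.values(), reverse=True)[k]
        merged = {i: c - sub for i, c in merged.items() if c - sub > 0}
    return merged
```

The subtraction step is what keeps the merged summary within the O(1/ε) size bound while only adding the two inputs' errors.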


Very Large Data Bases | 2017

Trajectory similarity join in spatial networks

Shuo Shang; Lisi Chen; Zhewei Wei; Christian S. Jensen; Kai Zheng; Panos Kalnis

The matching of similar pairs of objects, called similarity join, is fundamental functionality in data management. We consider the case of trajectory similarity join (TS-Join), where the objects are trajectories of vehicles moving in road networks. Thus, given two sets of trajectories and a threshold θ, the TS-Join returns all pairs of trajectories from the two sets with similarity above θ. This join targets applications such as trajectory near-duplicate detection, data cleaning, ridesharing recommendation, and traffic congestion prediction. With these applications in mind, we provide a purposeful definition of similarity. To enable efficient TS-Join processing on large sets of trajectories, we develop search space pruning techniques and take into account the parallel processing capabilities of modern processors. Specifically, we present a two-phase divide-and-conquer algorithm. For each trajectory, the algorithm first finds similar trajectories. Then it merges the results to achieve a final result. The algorithm exploits an upper bound on the spatiotemporal similarity and a heuristic scheduling strategy for search space pruning. The algorithm's per-trajectory searches are independent of each other and can be performed in parallel, and the merging has constant cost. An empirical study with real data offers insight into the performance of the algorithm and demonstrates that it is capable of outperforming a well-designed baseline algorithm by an order of magnitude.
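The two-phase scheme can be sketched as follows. This is a hedged illustration, not the paper's algorithm: a hypothetical `similarity` function stands in for the spatiotemporal measure, and the upper-bound pruning and scheduling heuristics are omitted.

```python
from concurrent.futures import ThreadPoolExecutor

def ts_join(set_p, set_q, theta, similarity):
    # Phase 1: per-trajectory searches, independent and run in parallel.
    def search(i):
        ti = set_p[i]
        return [(i, j) for j, tj in enumerate(set_q)
                if similarity(ti, tj) > theta]
    with ThreadPoolExecutor() as pool:
        partial = list(pool.map(search, range(len(set_p))))
    # Phase 2: merge the per-trajectory results (cheap concatenation).
    result = []
    for pairs in partial:
        result.extend(pairs)
    return result
```

For example, with trajectories modeled as point sets and Jaccard similarity, `ts_join` returns the index pairs whose similarity exceeds the threshold.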


Symposium on Principles of Database Systems | 2011

Beyond simple aggregates: indexing for summary queries

Zhewei Wei; Ke Yi

Database queries can be broadly classified into two categories: reporting queries and aggregation queries. The former retrieves a collection of records from the database that match the query's conditions, while the latter returns an aggregate, such as count, sum, average, or max (min), of a particular attribute of these records. Aggregation queries are especially useful in business intelligence and data analysis applications, where users are interested not in the actual records but in some statistics of them. They can also be executed much more efficiently than reporting queries, by embedding properly precomputed aggregates into an index. However, reporting and aggregation queries provide only two extremes for exploring the data. Data analysts often need more insight into the data distribution than what those simple aggregates provide, and yet certainly do not want the sheer volume of data returned by reporting queries. In this paper, we design indexing techniques that allow for extracting a statistical summary of all the records in the query. The summaries we support include frequent items, quantiles, various sketches, and wavelets, all of which are of central importance in massive data analysis. Our indexes require linear space and extract a summary with the optimal or near-optimal query cost.
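A minimal sketch of the idea, with exact per-node frequency counters standing in for the paper's compact summaries: a binary index stores a summary at every node, and a range query merges the O(log n) node summaries that cover the range, instead of scanning the records.

```python
from collections import Counter

class SummaryIndex:
    # Iterative segment tree over the records; each node holds a
    # frequency summary (here an exact Counter, for illustration)
    # of the attribute values beneath it.
    def __init__(self, values):
        n = self.n = len(values)
        self.tree = [None] * n + [Counter([v]) for v in values]
        for i in range(n - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def query(self, lo, hi):
        # Summary of values[lo:hi], merged from O(log n) node summaries.
        res = Counter()
        lo += self.n; hi += self.n
        while lo < hi:
            if lo & 1:
                res += self.tree[lo]; lo += 1
            if hi & 1:
                hi -= 1; res += self.tree[hi]
            lo //= 2; hi //= 2
        return res
```

Replacing the `Counter` at each node with a mergeable sketch (frequent items, quantiles, etc.) gives the flavor of the paper's linear-space indexes.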


ACM Symposium on Parallel Algorithms and Architectures | 2009

Dynamic external hashing: the limit of buffering

Zhewei Wei; Ke Yi; Qin Zhang

Hash tables are one of the most fundamental data structures in computer science, in both theory and practice. They are especially useful in external memory, where their query performance approaches the ideal cost of just one disk access. Knuth [16] gave an elegant analysis showing that with some simple collision resolution strategies such as linear probing or chaining, the expected average number of disk I/Os of a lookup is merely 1 + 1/2^Ω(b), where each I/O can read and/or write a disk block containing b items. Inserting a new item into the hash table also costs 1 + 1/2^Ω(b) I/Os, which is again almost the best one can do if the hash table is entirely stored on disk. However, this requirement is unrealistic, since any algorithm operating on an external hash table must have some internal memory (at least Ω(1) blocks) to work with. The availability of a small internal memory buffer can dramatically reduce the amortized insertion cost to o(1) I/Os for many external memory data structures. In this paper we study the inherent query-insertion tradeoff of external hash tables in the presence of a memory buffer. In particular, we show that for any constant c > 1, if the expected average successful query cost is targeted at 1 + O(1/b^c) I/Os, then it is not possible to support insertions in less than 1 − O(1/b^(c−1/6)) I/Os amortized, which means that the memory buffer is essentially useless. On the other hand, if the query cost is relaxed to 1 + O(1/b^c) I/Os for any constant c < 1, there is a simple dynamic hash table with o(1) insertion cost.
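To see why a memory buffer can help insertions, here is a toy simulation (an illustrative model, not the paper's lower-bound construction): inserts accumulate in an internal buffer and are flushed to disk in bulk, so one block write carries many items, while a lookup still costs about one I/O.

```python
class BufferedHashTable:
    """Toy model of an external hash table with an internal write buffer."""

    def __init__(self, num_blocks, buffer_size):
        self.blocks = [[] for _ in range(num_blocks)]  # the "disk"
        self.buffer = []                               # internal memory
        self.num_blocks = num_blocks
        self.buffer_size = buffer_size
        self.io_count = 0                              # disk I/Os so far

    def insert(self, key):
        self.buffer.append(key)                # no I/O yet
        if len(self.buffer) >= self.buffer_size:
            self._flush()

    def _flush(self):
        touched = set()
        for key in self.buffer:
            b = hash(key) % self.num_blocks
            self.blocks[b].append(key)
            touched.add(b)
        self.io_count += len(touched)          # one write per distinct block
        self.buffer = []

    def lookup(self, key):
        if key in self.buffer:                 # memory only: free
            return True
        self.io_count += 1                     # about one disk read
        return key in self.blocks[hash(key) % self.num_blocks]
```

With a buffer of 100 items over 10 blocks, the amortized insertion cost drops to roughly 0.1 I/Os; the paper shows that this saving evaporates once queries are required to be very close to a single I/O.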


International Conference on Management of Data | 2016

Matrix Sketching Over Sliding Windows

Zhewei Wei; Xuancheng Liu; Feifei Li; Shuo Shang; Xiaoyong Du; Ji-Rong Wen

Large-scale matrix computation has become essential for many data applications, and hence the problem of sketching a matrix with small space and high precision has received extensive study over the past few years. This problem is often considered in the row-update streaming model, where the data set is a matrix A ∈ R^(n×d), and the processor receives a row (1 × d) of A at each timestamp. The goal is to maintain a smaller matrix (termed the approximation matrix, or simply approximation) B ∈ R^(l×d) as an approximation to A, such that the covariance error ||A^T A − B^T B|| is small and l ≪ n. This paper studies continuously tracking approximations to the matrix defined by a sliding window of the most recent rows. We consider both sequence-based and time-based windows. We show that maintaining A^T A exactly requires linear space in the sliding window model, as opposed to O(d^2) space in the streaming model. With this observation, we present three general frameworks for matrix sketching on sliding windows. The sampling techniques give random samples of the rows in the window according to their squared norms. The Logarithmic Method converts a mergeable streaming matrix sketch into a matrix sketch on time-based sliding windows. The Dyadic Interval framework converts an arbitrary streaming matrix sketch into a matrix sketch on sequence-based sliding windows. In addition to proving all algorithmic properties theoretically, we also conduct an extensive empirical study with real data sets to demonstrate the efficiency of these algorithms.
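The core primitive of the sampling framework, squared-norm row sampling, can be sketched as follows (an illustrative version without the sliding-window machinery): rescaling each sampled row by sqrt(l·p_i) makes B^T B an unbiased estimator of A^T A.

```python
import numpy as np

def sample_rows(A, l, rng):
    # Sample l rows with probability proportional to squared row norm,
    # rescaled so that E[B^T B] = A^T A.
    norms = np.einsum('ij,ij->i', A, A)       # squared norm of each row
    probs = norms / norms.sum()
    idx = rng.choice(A.shape[0], size=l, p=probs)
    return A[idx] / np.sqrt(l * probs[idx, None])
```

The covariance error ||A^T A − B^T B|| shrinks as the number of sampled rows l grows, which is the guarantee the window-aware versions preserve.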


International Conference on Management of Data | 2015

Persistent Data Sketching

Zhewei Wei; Ge Luo; Ke Yi; Xiaoyong Du; Ji-Rong Wen

A persistent data structure, also known as a multiversion data structure in the database literature, is a data structure that preserves all its previous versions as it is updated over time. Every update (inserting, deleting, or changing a data record) to the data structure creates a new version, while all the versions are kept in the data structure so that any previous version can still be queried. Persistent data structures aim at recording all versions accurately, which results in a space requirement that is at least linear in the number of updates. In many of today's big data applications, in particular for high-speed streaming data, the volume and velocity of the data are so high that we cannot afford to store everything. Therefore, streaming algorithms have received a lot of attention in the research community; they use only sublinear space by sacrificing slightly on accuracy. All streaming algorithms work by maintaining a small data structure in memory, which is usually called a sketch, summary, or synopsis. The sketch is updated upon the arrival of every element in the stream, and is thus ephemeral, meaning that it can only answer queries about the current status of the stream. In this paper, we aim at designing persistent sketches, thereby giving streaming algorithms the ability to answer queries about the stream at any prior time.
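A toy illustration of the ephemeral-versus-persistent distinction (exact and linear-space, unlike the paper's sublinear persistent sketches): keeping each key's count timeline lets any past version be queried by binary search over time.

```python
import bisect

class PersistentCounter:
    # Toy persistent frequency counter: for each key, keep the timeline
    # of its count so count_at(key, t) can be answered for any past t.
    def __init__(self):
        self.history = {}   # key -> ([times], [counts]), both increasing
        self.t = 0

    def add(self, key):
        self.t += 1
        times, counts = self.history.setdefault(key, ([], []))
        prev = counts[-1] if counts else 0
        times.append(self.t)
        counts.append(prev + 1)

    def count_at(self, key, t):
        # Count of `key` as of time t (binary search on the timeline).
        if key not in self.history:
            return 0
        times, counts = self.history[key]
        i = bisect.bisect_right(times, t)
        return counts[i - 1] if i else 0
```

The paper's contribution, roughly, is achieving this "query the past" ability with sublinear space, by persisting a sketch rather than exact counts.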


Journal of Computer Science and Technology | 2016

Dynamic Shortest Path Monitoring in Spatial Networks

Shuo Shang; Lisi Chen; Zhewei Wei; Danhuai Guo; Ji-Rong Wen

With the increasing availability of real-time traffic information, dynamic spatial networks are pervasive nowadays, and path planning in dynamic spatial networks has become an important issue. In this light, we propose and investigate a novel problem of dynamically monitoring shortest paths in spatial networks (the DSPM query). When a traveler heads to a destination, his/her shortest path to the destination may change for two reasons: 1) the travel costs of some edges have been updated, and 2) the traveler deviates from the pre-planned path. Our target is to accelerate shortest path computation in dynamic spatial networks, and we believe that this study may be useful in many mobile applications, such as route planning and recommendation, car navigation and tracking, and location-based services in general. This problem is challenging for two reasons: 1) how to maintain and reuse the existing computation results to accelerate subsequent computations, and 2) how to prune the search space effectively. To overcome these challenges, a filter-and-refinement paradigm is adopted. We maintain an expansion tree and define a pair of upper and lower bounds to prune the search space. A series of optimization techniques are developed to accelerate the shortest path computation. The performance of the developed methods is studied in extensive experiments based on real spatial data.
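A minimal sketch of the monitoring idea (hypothetical code, not the paper's expansion-tree method): keep the current shortest path and recompute only when an edge update can actually affect it, i.e., when a path edge changes or an off-path edge becomes cheaper.

```python
import heapq

def dijkstra(graph, src, dst):
    # graph: {node: {neighbor: cost}}; returns (path, cost).
    dist, prev, pq = {src: 0}, {}, [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist.get(u, float('inf')):
            continue                      # stale queue entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    path, node = [], dst
    while node != src:
        path.append(node)
        node = prev[node]
    path.append(src)
    return path[::-1], dist[dst]

class PathMonitor:
    # Recompute only when an update can affect the current path:
    # off-path edges that only got more expensive are safely ignored.
    def __init__(self, graph, src, dst):
        self.graph, self.src, self.dst = graph, src, dst
        self.path, self.cost = dijkstra(graph, src, dst)

    def update_edge(self, u, v, new_cost):
        old = self.graph[u][v]
        self.graph[u][v] = new_cost
        on_path = any(self.path[i] == u and self.path[i + 1] == v
                      for i in range(len(self.path) - 1))
        if on_path or new_cost < old:
            self.path, self.cost = dijkstra(self.graph, self.src, self.dst)
```

The paper goes further by reusing partial search results via an expansion tree and bound-based pruning rather than rerunning the search from scratch.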


Very Large Data Bases | 2018

Parallel trajectory similarity joins in spatial networks

Shuo Shang; Lisi Chen; Zhewei Wei; Christian S. Jensen; Kai Zheng; Panos Kalnis

The matching of similar pairs of objects, called similarity join, is fundamental functionality in data management. We consider two cases of trajectory similarity joins (TS-Joins), including a threshold-based join (Tb-TS-Join) and a top-k TS-Join (k-TS-Join), where the objects are trajectories of vehicles moving in road networks. Given two sets of trajectories and a threshold θ


Asia-Pacific Web Conference | 2016

Probabilistic Nearest Neighbor Query in Traffic-Aware Spatial Networks

Shuo Shang; Zhewei Wei; Ji-Rong Wen; Shunzhi Zhu


Very Large Data Bases | 2017

ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Yu Liu; Bolong Zheng; Xiaodong He; Zhewei Wei; Xiaokui Xiao; Kai Zheng; Jiaheng Lu


Collaboration


Dive into Zhewei Wei's collaborations.

Top Co-Authors

Ke Yi
Hong Kong University of Science and Technology

Shuo Shang
King Abdullah University of Science and Technology

Ji-Rong Wen
Renmin University of China

Lisi Chen
Nanyang Technological University

Panos Kalnis
King Abdullah University of Science and Technology

Xiaokui Xiao
National University of Singapore

Xiaodong He
Renmin University of China

Xiaoyong Du
Renmin University of China

Jiaheng Lu
University of Helsinki