Bojian Xu

Eastern Washington University

Publication

Featured research published by Bojian Xu.

principles of distributed computing | 2006

Sketching asynchronous streams over a sliding window

Srikanta Tirthapura; Bojian Xu; Costas Busch

We study the problem of maintaining sketches of recent elements of a data stream. Motivated by applications involving network data, we consider streams that are asynchronous, in which the observed order of data is not the same as the time order in which the data was generated. The notion of recent elements of a stream is modeled by the sliding timestamp window, which is the set of elements with timestamps that are close to the current time. We design algorithms for maintaining sketches of all elements within the sliding timestamp window that can give provably accurate estimates of two basic aggregates, the sum and the median, of a stream of numbers. The space taken by the sketches, the time needed for querying the sketch, and the time for inserting new elements into the sketch are all polylog with respect to the maximum window size and the values of the data items in the window. Our sketches can be easily combined in a lossless and compact way, making them useful for distributed computations over data streams. Previous works on sketching recent elements of a data stream have all considered the more restrictive scenario of synchronous streams, where the observed order of data is the same as the time order in which the data was generated. Our notion of recency of elements is more general than that studied in previous work, and thus our sketches are more robust to network delays and asynchrony.
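The sliding timestamp window model described in this abstract can be illustrated with an exact baseline. The sketch below is not the paper's algorithm: it stores every element (linear space, not polylogarithmic) and only shows how elements that arrive out of timestamp order are still counted in the right window. The class name `ExactTimestampWindow` and the half-open window `(now - window, now]` are assumptions made for illustration.

```python
from collections import defaultdict


class ExactTimestampWindow:
    """Exact (non-sketch) baseline for a sliding timestamp window.

    Hypothetical helper for illustration only. Elements may be inserted
    in any order; membership in the window depends on the timestamp
    carried by the element, not on its arrival order.
    """

    def __init__(self, window):
        self.window = window              # window length in time units
        self.by_time = defaultdict(list)  # timestamp -> values seen

    def insert(self, timestamp, value):
        # Arrival order is irrelevant: the element is filed by timestamp.
        self.by_time[timestamp].append(value)

    def _in_window(self, now):
        # Assumed window: timestamps t with now - window < t <= now.
        lo = now - self.window
        return [v for t, vs in self.by_time.items() if lo < t <= now for v in vs]

    def window_sum(self, now):
        return sum(self._in_window(now))

    def window_median(self, now):
        # Lower median of the values in the window, or None if empty.
        vals = sorted(self._in_window(now))
        return vals[(len(vals) - 1) // 2] if vals else None


w = ExactTimestampWindow(window=10)
w.insert(5, 3)
w.insert(12, 4)
w.insert(8, 7)    # arrives after the timestamp-12 element (asynchronous)
w.insert(1, 100)  # too old to count once the current time reaches 12
```

A sketch in the sense of the abstract would answer the same `window_sum` and `window_median` queries approximately while keeping far less state than this baseline.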

principles of distributed computing | 2007

Time-decaying sketches for sensor data aggregation

Graham Cormode; Srikanta Tirthapura; Bojian Xu

We present a new sketch for summarizing network data. The sketch has the following properties which make it useful in communication-efficient aggregation in distributed streaming scenarios, such as sensor networks: the sketch is duplicate-insensitive, i.e. re-insertions of the same data will not affect the sketch, and hence the estimates of aggregates. Unlike previous duplicate-insensitive sketches for sensor data aggregation [26,12], it is also time-decaying, so that the weight of a data item in the sketch can decrease with time according to a user-specified decay function. The sketch can give provably approximate guarantees for various aggregates of data, including the sum, median, quantiles, and frequent elements. The size of the sketch and the time taken to update it are both polylogarithmic in the size of the relevant data. Further, multiple sketches computed over distributed data can be combined without losing the accuracy guarantees. To our knowledge, this is the first sketch that combines all the above properties.
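Three of the properties listed in this abstract (duplicate insensitivity, time decay under a user-specified function, and mergeability) can be made concrete with an exact baseline. The code below is an illustration under stated assumptions, not the sketch from the paper: it keeps every item, so its size is linear in the data and not polylogarithmic. The class `DecayedSet`, the item identifiers, and the choice of exponential decay are all hypothetical.

```python
class DecayedSet:
    """Exact baseline showing duplicate-insensitive, time-decayed sums.

    Hypothetical helper for illustration only. Each item is keyed by an
    identifier, so re-inserting the same item leaves the state unchanged.
    """

    def __init__(self, decay):
        self.decay = decay  # user-specified function: age -> weight
        self.items = {}     # item_id -> (timestamp, value)

    def insert(self, item_id, timestamp, value):
        # Re-insertion of the same item overwrites an identical entry.
        self.items[item_id] = (timestamp, value)

    def merge(self, other):
        # Union of the two item sets; shared items are counted once.
        merged = DecayedSet(self.decay)
        merged.items = {**self.items, **other.items}
        return merged

    def decayed_sum(self, now):
        # Each value is weighted by the decay function of its age.
        return sum(v * self.decay(now - t) for t, v in self.items.values())


def halve_per_unit(age):
    # Assumed decay function: weight halves for every unit of age.
    return 0.5 ** age


node_a = DecayedSet(halve_per_unit)
node_a.insert("a", 0, 8)
node_a.insert("a", 0, 8)  # duplicate delivery of the same reading

node_b = DecayedSet(halve_per_unit)
node_b.insert("a", 0, 8)  # the same reading observed at another node
node_b.insert("b", 2, 4)

combined = node_a.merge(node_b)
```

Because the merge is a set union, the order in which nodes combine their summaries does not change the result, which is the behavior the abstract requires of sketches aggregated across a sensor network.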

IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2012

Efficient Maximal Repeat Finding Using the Burrows-Wheeler Transform and Wavelet Tree

M. O. Kulekci; Jeffrey Scott Vitter; Bojian Xu

Finding repetitive structures in genomes and proteins is important to understand their biological functions. Many data compressors for modern genomic sequences rely heavily on finding repeats in the sequences. Small-scale and local repetitive structures are better understood than large and complex interspersed ones. The notion of maximal repeats captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array along with other auxiliary data structures. Their space usage is 19-50 times the text size with the best engineering efforts, prohibiting their usability on massive data such as the whole human genome. We focus on finding all the maximal repeats from massive texts in a time- and space-efficient manner. Our technique uses the Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural language texts and protein data, the space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per base, the space usage of our method is less than double the sequence size. Our space-efficient method keeps the timing performance fast. In fact, our method is orders of magnitude faster than the prior methods for processing massive texts such as the whole human genome, since the prior methods must use external memory. For the first time, our method enables a desktop computer with 8 GB internal memory (actual internal memory usage is less than 6 GB) to find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as general-purpose open-source software for public use.
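To make the notion of a maximal repeat concrete, here is a brute-force sketch. It assumes the usual definition: a substring that occurs at least twice and whose occurrences can be extended neither to the left nor to the right without losing an occurrence. It does not use the Burrows-Wheeler Transform or wavelet trees and is far too slow for genomic inputs; the function name `maximal_repeats` is hypothetical.

```python
def maximal_repeats(text):
    """Return the set of maximal repeats of text by brute force.

    Illustration only: polynomial time, suitable for short strings.
    A repeat is kept when its occurrences are preceded by at least two
    distinct characters and followed by at least two distinct
    characters, where the text boundary counts as its own character.
    """
    n = len(text)
    result = set()
    for length in range(1, n):
        for start in range(n - length + 1):
            w = text[start:start + length]
            if w in result:
                continue
            # All (possibly overlapping) occurrences of w.
            occ = [i for i in range(n - length + 1) if text.startswith(w, i)]
            if len(occ) < 2:
                continue
            # None stands for the boundary before or after the text.
            left = {text[i - 1] if i > 0 else None for i in occ}
            right = {text[i + length] if i + length < n else None for i in occ}
            if len(left) > 1 and len(right) > 1:
                result.add(w)
    return result


print(sorted(maximal_repeats("abcabcd")))
print(sorted(maximal_repeats("banana")))
```

In "abcabcd", the substring "bc" repeats but is always preceded by "a", so it can be extended to the left and is not maximal. Only "abc" is reported.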

data compression, communications and processing | 2011

Wavelet Trees: From Theory to Practice

Roberto Grossi; Jeffrey Scott Vitter; Bojian Xu

The wavelet tree data structure is a space-efficient technique for rank and select queries that generalizes from binary characters to an arbitrary multicharacter alphabet. It has become a key tool in modern full-text indexing and data compression because of its capabilities in compressing, indexing, and searching. We present a comparative study of its practical performance regarding a wide range of options on the dimensions of different coding schemes and tree shapes. Our results are both theoretical and experimental: (1) We show that the run-length δ coding size of wavelet trees achieves the 0-order empirical entropy size of the original string with leading constant 1, when the string's 0-order empirical entropy is asymptotically less than the logarithm of the alphabet size. This result complements the previous works that are dedicated to analyzing run-length γ-encoded wavelet trees. It also reveals the scenarios when run-length δ encoding becomes practical. (2) We introduce a full generic package of wavelet trees for a wide range of options on the dimensions of coding schemes and tree shapes. Our experimental study reveals the practical performance of the various modifications.

SIAM Journal on Computing | 2009

Time-decaying Sketches for Robust Aggregation of Sensor Data

Graham Cormode; Srikanta Tirthapura; Bojian Xu

Distributed Computing | 2008

Sketching asynchronous data streams over sliding windows

Bojian Xu; Srikanta Tirthapura; Costas Busch

combinatorial pattern matching | 2014

Shortest Unique Substring Query Revisited

Atalay Mert İleri; M. Oğuzhan Külekci; Bojian Xu

BMC Genomics | 2011

Ψ-RA: a parallel sparse index for genomic read alignment

M. Oğuzhan Külekci; Wing-Kai Hon; Rahul Shah; Jeffrey Scott Vitter; Bojian Xu

Theoretical Computer Science | 2015

A simple yet time-optimal and linear-space algorithm for shortest unique substring queries

Atalay Mert İleri; M. Oğuzhan Külekci; Bojian Xu

international symposium on algorithms and computation | 2015

An In-place Framework for Exact and Approximate Shortest Unique Substring Queries

Wing-Kai Hon; Sharma V. Thankachan; Bojian Xu
