Bojian Xu
Eastern Washington University
Publication
Featured research published by Bojian Xu.
principles of distributed computing | 2006
Srikanta Tirthapura; Bojian Xu; Costas Busch
We study the problem of maintaining sketches of recent elements of a data stream. Motivated by applications involving network data, we consider streams that are asynchronous, in which the observed order of data is not the same as the time order in which the data was generated. The notion of recent elements of a stream is modeled by the sliding timestamp window, which is the set of elements with timestamps that are close to the current time. We design algorithms for maintaining sketches of all elements within the sliding timestamp window that can give provably accurate estimates of two basic aggregates, the sum and the median, of a stream of numbers. The space taken by the sketches, the time needed for querying the sketch, and the time for inserting new elements into the sketch are all polylog with respect to the maximum window size and the values of the data items in the window. Our sketches can be easily combined in a lossless and compact way, making them useful for distributed computations over data streams. Previous works on sketching recent elements of a data stream have all considered the more restrictive scenario of synchronous streams, where the observed order of data is the same as the time order in which the data was generated. Our notion of recency of elements is more general than that studied in previous work, and thus our sketches are more robust to network delays and asynchrony.
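To make the sliding timestamp window concrete, the following is a minimal exact baseline, not the sketch from the paper: it stores every element and therefore uses linear rather than polylogarithmic space. The class name `ExactTimestampWindowSum`, the half-open window convention `(now - window, now]`, and the assumption that query times never decrease are all illustrative choices, not details taken from the abstract.

```python
import heapq


class ExactTimestampWindowSum:
    """Exact sum over a sliding timestamp window (illustrative baseline only)."""

    def __init__(self, window):
        self.window = window
        self._heap = []   # min-heap of (timestamp, value), ordered by timestamp
        self._total = 0   # sum of all values currently stored

    def insert(self, timestamp, value):
        # Elements may arrive in any order (asynchronous stream); the heap
        # restores timestamp order regardless of arrival order.
        heapq.heappush(self._heap, (timestamp, value))
        self._total += value

    def query(self, now):
        # Drop elements whose timestamps fall outside (now - window, now].
        # Assumes `now` is non-decreasing across queries.
        cutoff = now - self.window
        while self._heap and self._heap[0][0] <= cutoff:
            _, value = heapq.heappop(self._heap)
            self._total -= value
        return self._total


w = ExactTimestampWindowSum(window=10)
w.insert(5, 3)
w.insert(2, 4)   # observed after timestamp 5, but generated earlier
w.insert(8, 1)
print(w.query(10))
print(w.query(13))
```

A late arrival whose timestamp is still inside the window is counted correctly by later queries, which is the behavior a synchronous-stream method cannot assume.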
principles of distributed computing | 2007
Graham Cormode; Srikanta Tirthapura; Bojian Xu
We present a new sketch for summarizing network data. The sketch has the following properties which make it useful in communication-efficient aggregation in distributed streaming scenarios, such as sensor networks: the sketch is duplicate-insensitive, i.e. re-insertions of the same data will not affect the sketch, and hence the estimates of aggregates. Unlike previous duplicate-insensitive sketches for sensor data aggregation [26,12], it is also time-decaying, so that the weight of a data item in the sketch can decrease with time according to a user-specified decay function. The sketch can give provably approximate guarantees for various aggregates of data, including the sum, median, quantiles, and frequent elements. The size of the sketch and the time taken to update it are both polylogarithmic in the size of the relevant data. Further, multiple sketches computed over distributed data can be combined without losing the accuracy guarantees. To our knowledge, this is the first sketch that combines all the above properties.
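The two properties named in this abstract, duplicate insensitivity and time decay, can be illustrated with an exact computation. This is only a sketch of the semantics under stated assumptions, not the paper's data structure: it keeps every observation, the tuple layout `(item_id, timestamp, value)` is hypothetical, and the step decay function in the usage lines is an arbitrary example of a user-specified decay.

```python
def decayed_sum(observations, now, decay):
    """Exact time-decayed sum that ignores repeated observations.

    observations: iterable of (item_id, timestamp, value) tuples
                  (hypothetical layout chosen for this example).
    decay:        function mapping an age (now - timestamp) to a weight.
    """
    unique = set(observations)  # re-insertions of the same tuple have no effect
    return sum(value * decay(now - ts) for (_item_id, ts, value) in unique)


def halve_every_ten(age):
    # Example decay: the weight halves for every 10 time units of age.
    return 0.5 ** (age // 10)


data = [("a", 0, 4), ("b", 10, 2), ("a", 0, 4)]  # "a" was reported twice
print(decayed_sum(data, now=10, decay=halve_every_ten))
```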
IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2012
M. O. Kulekci; Jeffrey Scott Vitter; Bojian Xu
Finding repetitive structures in genomes and proteins is important to understand their biological functions. Many data compressors for modern genomic sequences rely heavily on finding repeats in the sequences. Small-scale and local repetitive structures are better understood than large and complex interspersed ones. The notion of maximal repeats captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array along with other auxiliary data structures. Their space usage is 19-50 times the text size with the best engineering efforts, prohibiting their usability on massive data such as the whole human genome. We focus on finding all the maximal repeats from massive texts in a time- and space-efficient manner. Our technique uses the Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural language texts and protein data, the space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per base, the space usage of our method is less than double the sequence size. Our space-efficient method keeps the timing performance fast. In fact, our method is orders of magnitude faster than the prior methods for processing massive texts such as the whole human genome, since the prior methods must use external memory. For the first time, our method enables a desktop computer with 8 GB internal memory (actual internal memory usage is less than 6 GB) to find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as general-purpose open-source software for public use.
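For readers unfamiliar with maximal repeats, here is a brute-force illustration under one common definition: a substring that occurs at least twice and whose occurrences are preceded by at least two different characters and followed by at least two different characters, with the text boundaries counted as distinct characters. This is an assumption about the definition and is not the BWT and wavelet tree method of the paper; the approach below enumerates all substrings and is practical only for tiny inputs.

```python
from collections import defaultdict


def maximal_repeats(text):
    """Return the maximal repeats of `text` by brute force (tiny inputs only)."""
    # For each substring: [occurrence count, left contexts, right contexts].
    info = defaultdict(lambda: [0, set(), set()])
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            entry = info[text[i:j]]
            entry[0] += 1
            # None stands for the text boundary, treated as a unique character.
            entry[1].add(text[i - 1] if i > 0 else None)
            entry[2].add(text[j] if j < n else None)
    return sorted(
        s
        for s, (count, left, right) in info.items()
        if count >= 2 and len(left) >= 2 and len(right) >= 2
    )


print(maximal_repeats("banana"))
```

In "banana", for example, "an" repeats but is always followed by "a", so it can be extended to the right without losing occurrences and is not maximal.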
data compression, communications and processing | 2011
Roberto Grossi; Jeffrey Scott Vitter; Bojian Xu
The wavelet tree data structure is a space-efficient technique for rank and select queries that generalizes from binary characters to an arbitrary multicharacter alphabet. It has become a key tool in modern full-text indexing and data compression because of its capabilities in compressing, indexing, and searching. We present a comparative study of its practical performance regarding a wide range of options on the dimensions of different coding schemes and tree shapes. Our results are both theoretical and experimental: (1) We show that the run-length $\delta$ coding size of wavelet trees achieves the 0-order empirical entropy size of the original string with leading constant 1, when the string's 0-order empirical entropy is asymptotically less than the logarithm of the alphabet size. This result complements the previous works that are dedicated to analyzing run-length $\gamma$-encoded wavelet trees. It also reveals the scenarios when run-length $\delta$ encoding becomes practical. (2) We introduce a full generic package of wavelet trees for a wide range of options on the dimensions of coding schemes and tree shapes. Our experimental study reveals the practical performance of the various modifications.
SIAM Journal on Computing | 2009
Graham Cormode; Srikanta Tirthapura; Bojian Xu
Distributed Computing | 2008
Bojian Xu; Srikanta Tirthapura; Costas Busch
combinatorial pattern matching | 2014
Atalay Mert İleri; M. Oğuzhan Külekci; Bojian Xu
BMC Genomics | 2011
M. Oğuzhan Külekci; Wing-Kai Hon; Rahul Shah; Jeffrey Scott Vitter; Bojian Xu
Theoretical Computer Science | 2015
Atalay Mert İleri; M. Oğuzhan Külekci; Bojian Xu
international symposium on algorithms and computation | 2015
Wing-Kai Hon; Sharma V. Thankachan; Bojian Xu
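The run-length gamma and delta codings named in the wavelet tree abstract above build on the standard Elias gamma and Elias delta codes for positive integers. The following is a minimal sketch of those two codes applied to the run lengths of a bit string. The function names and the convention of emitting the first bit before the coded run lengths are assumptions made for this illustration, not details of the authors' package.

```python
from itertools import groupby


def elias_gamma(n):
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("Elias codes are defined for positive integers")
    binary = bin(n)[2:]
    # (length - 1) zeros announce the length, then the binary digits follow.
    return "0" * (len(binary) - 1) + binary


def elias_delta(n):
    """Elias delta code: gamma-code the bit length, then the digits after the leading 1."""
    if n < 1:
        raise ValueError("Elias codes are defined for positive integers")
    binary = bin(n)[2:]
    return elias_gamma(len(binary)) + binary[1:]


def run_length_encode(bits, coder):
    """Encode a '0'/'1' string as its first bit followed by coded run lengths."""
    if not bits:
        return ""
    runs = [len(list(group)) for _, group in groupby(bits)]
    return bits[0] + "".join(coder(r) for r in runs)


print(elias_gamma(5), elias_delta(5))
print(len(elias_gamma(100)), len(elias_delta(100)))
print(run_length_encode("0001111011", elias_gamma))
```

For small run lengths the gamma code is no longer than the delta code, while for long runs the delta code is shorter, which gives some intuition for why the choice between the two depends on the input.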