Featured Researches

Data Structures And Algorithms

Breaking the Quadratic Barrier for Matroid Intersection

The matroid intersection problem is a fundamental problem that has been extensively studied for half a century. In the classic version of this problem, we are given two matroids M1 = (V, I1) and M2 = (V, I2) on a common ground set V of n elements, and we have to find the largest common independent set S ∈ I1 ∩ I2 by making independence oracle queries of the form "Is S ∈ I1?" or "Is S ∈ I2?" for S ⊆ V. The goal is to minimize the number of queries. Beating the existing Õ(n^2) bound, known as the quadratic barrier, is an open problem that captures the limits of techniques from two lines of work. The first is Cunningham's classic algorithm [SICOMP 1986], whose Õ(n^2)-query implementations were shown by CLS+ [FOCS 2019] and Nguyen [2019]. The other is the general cutting-plane method of Lee, Sidford, and Wong [FOCS 2015]. The only progress toward breaking the quadratic barrier requires either approximation algorithms or the more powerful rank oracle [CLS+ FOCS 2019]; no exact algorithm with o(n^2) independence queries was known. In this work, we break the quadratic barrier with a randomized algorithm guaranteeing Õ(n^{9/5}) independence queries with high probability, and a deterministic algorithm guaranteeing Õ(n^{11/6}) independence queries. Our key insight is simple and fast algorithms for a graph reachability problem that arises in the standard augmenting-path framework [Edmonds 1968]. Combining these with previous exact and approximation algorithms yields our results.
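
The augmenting-path framework the abstract refers to can be sketched in a few dozen lines. The following is a minimal, query-inefficient implementation over an independence oracle; the exchange-graph construction follows Edmonds' framework, while the function names and oracle interface are illustrative, and this naive version makes far more queries than the algorithms in the paper.

```python
from collections import deque

def matroid_intersection(ground, ind1, ind2):
    """Find a maximum common independent set of two matroids given only
    independence oracles ind1(S), ind2(S) -> bool, via shortest augmenting
    paths in the exchange graph (Edmonds' framework). The naive graph
    construction here costs O(n^2) queries per augmentation; it is meant
    to show where the graph reachability subproblem comes from."""
    S = set()
    while True:
        inn = set(ground) - S
        sources = {y for y in inn if ind1(S | {y})}   # free to add w.r.t. M1
        sinks = {y for y in inn if ind2(S | {y})}     # free to add w.r.t. M2
        if sources & sinks:                           # one element extends both
            S.add(next(iter(sources & sinks)))
            continue
        # Exchange graph: x in S -> y not in S if S - x + y is independent
        # in M1; y -> x if S - x + y is independent in M2.
        adj = {v: [] for v in ground}
        for x in S:
            for y in inn:
                if ind1((S - {x}) | {y}):
                    adj[x].append(y)
                if ind2((S - {x}) | {y}):
                    adj[y].append(x)
        # BFS: a shortest source-to-sink path is a valid augmenting path.
        parent = {s: None for s in sources}
        queue, path = deque(sources), None
        while queue:
            v = queue.popleft()
            if v in sinks:
                path = []
                while v is not None:
                    path.append(v)
                    v = parent[v]
                break
            for w in adj[v]:
                if w not in parent:
                    parent[w] = v
                    queue.append(w)
        if path is None:
            return S                 # no augmenting path: S is maximum
        S ^= set(path)               # augment: |S| grows by one
```

For example, bipartite matching is the intersection of two partition matroids (distinct left endpoints, distinct right endpoints), so the routine above doubles as a toy matching algorithm.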

Data Structures And Algorithms

Bucket Oblivious Sort: An Extremely Simple Oblivious Sort

We propose conceptually simple oblivious sort and oblivious random permutation algorithms called bucket oblivious sort and bucket oblivious random permutation. Bucket oblivious sort uses 6n log n time (measured by the number of memory accesses) and 2Z client storage, with an error probability exponentially small in Z. This runtime is only 3× slower than a non-oblivious merge sort baseline; for 2^30 elements, it is 5× faster than bitonic sort, the de facto oblivious sorting algorithm in practical implementations.
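
The data flow of the bucket approach can be sketched as follows: elements draw random bucket keys, are routed through a butterfly network of buckets of capacity Z, and the shuffled output is then sorted non-obliviously. This is a plain-Python model of the routing only; a real oblivious implementation pads buckets with dummy elements and accesses memory in a fixed, input-independent pattern.

```python
import random

def bucket_oblivious_sort(a, Z=64):
    """Sketch of bucket oblivious sort: shuffle the input through a
    butterfly network of buckets (capacity Z), then run an ordinary
    non-oblivious sort on the shuffled data. Models the data flow,
    not the memory-access obliviousness."""
    n = len(a)
    B = 1
    while B * Z < 2 * n:                  # buckets: power of two, avg load <= Z/2
        B *= 2
    # Each element draws a uniformly random destination bucket (its key).
    tagged = [(random.randrange(B), x) for x in a]
    buckets = [[] for _ in range(B)]
    for i, t in enumerate(tagged):        # spread the input evenly over buckets
        buckets[i * B // n].append(t)
    # Route through log B butterfly levels: at level j, paired buckets
    # (indices differing in bit j) exchange elements by bit j of the key.
    for level in range(B.bit_length() - 1):
        bit = 1 << level
        nxt = [[] for _ in range(B)]
        for j in range(B):
            for key, x in buckets[j]:
                nxt[(j & ~bit) | (key & bit)].append((key, x))
        buckets = nxt
        assert all(len(b) <= Z for b in buckets)  # overflow prob. exp. small in Z
    shuffled = [x for b in buckets for _, x in b]  # keys now equal bucket indices
    return sorted(shuffled)               # shuffled input: safe to sort plainly
```

After the routing, every element sits in the bucket named by its random key, so concatenating the buckets yields a shuffled sequence that hides the input order, and any fast comparison sort can finish the job.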

Data Structures And Algorithms

Budget-Smoothed Analysis for Submodular Maximization

The greedy algorithm for submodular function maximization subject to a cardinality constraint is guaranteed to approximate the optimal solution to within a 1 − 1/e factor. For worst-case instances, it is well known that this guarantee is essentially tight -- for greedy and, in fact, for any efficient algorithm. Motivated by the question of why greedy performs better in practice, we introduce a new notion of budget-smoothed analysis. Our framework requires larger perturbations of budgets than traditional smoothed analysis does for, e.g., linear programming. Nevertheless, we show that under realistic budget distributions, greedy and related algorithms enjoy provably better approximation guarantees that hold even for worst-case submodular functions.
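
For intuition, here is the greedy algorithm on a coverage function, a canonical monotone submodular function; the (1 − 1/e) worst-case guarantee applies to the size of the covered universe.

```python
def greedy_max_coverage(sets, k):
    """Greedy for monotone submodular maximization under a cardinality
    constraint, shown on a coverage function: repeatedly pick the set
    with the largest marginal gain. Covers at least (1 - 1/e) times
    the optimal k-set coverage in the worst case."""
    covered, chosen = set(), []
    for _ in range(k):
        best = max(range(len(sets)), key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:
            break                          # no remaining marginal gain
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered
```

Here the "budget" of the smoothed analysis corresponds to the cardinality parameter k; the paper's point is that perturbing k (rather than the function) already breaks the worst-case tightness.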

Data Structures And Algorithms

Buffered Streaming Graph Partitioning

Partitioning graphs into blocks of roughly equal size is a widely used tool when processing large graphs. Currently there is a gap in the space of available partitioning algorithms. On the one hand, there are streaming algorithms that have been adapted to partition massive graph data on small machines. In the streaming model, vertices arrive one at a time, including their neighborhood, and then have to be assigned directly to a block. These algorithms can partition huge graphs quickly with little memory, but they produce partitions of low quality. On the other hand, there are offline (shared-memory) multilevel algorithms that produce partitions of high quality but need a machine with enough memory to hold the network. In this work, we take a first step toward closing this gap by presenting an algorithm that computes high-quality partitions of huge graphs using a single machine with little memory. First, we extend the streaming model to an approach more reasonable in practice: the buffered streaming model. In this model, a processing element can store a batch of nodes (including their neighborhoods) before making assignment decisions. When our algorithm receives a batch of nodes, we build a model graph that represents the nodes of the batch and the partition structure already present. This model enables us to apply multilevel algorithms and, in turn, compute high-quality solutions for huge graphs on cheap machines. To partition the model, we develop a multilevel algorithm that optimizes an objective function previously shown to be effective in the streaming setting. Surprisingly, this also removes the dependency of the running time on the number of blocks. Overall, our algorithm computes solutions that are on average 55% better than Fennel's using a very small batch size. In addition, our algorithm is significantly faster than one of the main one-pass partitioning algorithms for larger numbers of blocks.
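
For context, a one-pass Fennel-style assignment (the kind of streaming baseline discussed above) can be sketched as follows. The objective and the choice of alpha follow the published Fennel heuristic; this compact version is illustrative, not the paper's buffered algorithm.

```python
def fennel_stream(adj, k, gamma=1.5):
    """One-pass Fennel-style streaming partitioner: each arriving vertex
    goes to the block i maximizing
        |N(v) & block_i| - alpha * gamma * |block_i|^(gamma - 1),
    trading edge locality against balance. alpha is set from n, m, k as
    in the Fennel heuristic."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2   # undirected edge count
    alpha = m * k ** (gamma - 1) / n ** gamma
    blocks = [set() for _ in range(k)]
    part = {}
    for v, neigh in adj.items():     # vertices arrive with their neighborhoods
        neigh = set(neigh)
        def score(i):
            return (len(blocks[i] & neigh)
                    - alpha * gamma * len(blocks[i]) ** (gamma - 1))
        i = max(range(k), key=score)
        blocks[i].add(v)
        part[v] = i
    return part
```

Each vertex is placed once and never revisited; the buffered model instead holds a whole batch, which is what allows multilevel refinement before committing assignments.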

Data Structures And Algorithms

Cadences in Grammar-Compressed Strings

Cadences are structurally maximal arithmetic progressions of indices corresponding to equal characters in an underlying string. This paper provides a polynomial-time detection algorithm for 3-cadences in grammar-compressed binary strings. This algorithm also translates to a linear-time detection algorithm for 3-cadences in uncompressed binary strings. Furthermore, this paper proves that several variants of the cadence detection problem are NP-complete for grammar-compressed strings. As a consequence, the equidistant subsequence matching problem with patterns of length three is NP-complete for grammar-compressed ternary strings.
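
Using one common formalization of "structurally maximal" (the progression would leave the string on both sides if extended by one step), a brute-force quadratic-time detector follows directly from the definition; the linear-time algorithm for uncompressed strings mentioned above is far more involved.

```python
def has_3cadence(s):
    """Brute-force 3-cadence detector, straight from the definition:
    indices i, i+d, i+2d (0-indexed) holding equal characters such that
    extending the progression leaves the string on both sides, i.e.
    i - d < 0 and i + 3d >= len(s). Quadratic time."""
    n = len(s)
    for d in range(1, n):
        # i < d ensures i - d < 0; i <= n - 1 - 2d keeps i + 2d in range.
        for i in range(min(d, n - 2 * d)):
            if i + 3 * d >= n and s[i] == s[i + d] == s[i + 2 * d]:
                return True
    return False
```

For instance, "abaaa" contains the 3-cadence (1, 3, 5) in 1-indexed terms: positions 1, 3, 5 all hold 'a' and the progression exits the string on both sides.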

Data Structures And Algorithms

Cantor Mapping Technique

A new technique for string ordering, based on a method called "Cantor Mapping," is explained in this paper and used to perform comparison-based string sorting in log-linear time while using linear extra space.

Data Structures And Algorithms

Cardinality estimation using Gumbel distribution

Cardinality estimation is the task of approximating the number of distinct elements in a large dataset with possibly repeating elements. LogLog and HyperLogLog (cf. Durand and Flajolet [ESA 2003], Flajolet et al. [Discrete Math. Theor. 2007]) are small-space sketching schemes for cardinality estimation that have strong theoretical guarantees and are highly effective in practice. This makes them a highly popular solution, with many implementations in big-data systems (e.g., Algebird, Apache DataSketches, BigQuery, Presto, and Redis). However, despite their simple and elegant formulation, the analyses of both LogLog and HyperLogLog are extremely involved, spanning tens of pages of analytic combinatorics and complex function analysis. We propose a modification to both LogLog and HyperLogLog that replaces the discrete geometric distribution with a continuous Gumbel distribution. This leads to a very short, simple, and elementary analysis of the estimation guarantees, and to smoother behavior of the estimator.
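
A minimal sketch of the Gumbel idea follows. The bucketing, the hash-to-uniform construction, and the exact estimator here are illustrative simplifications, not the paper's tuned scheme; the key fact used is that the maximum of j i.i.d. standard Gumbel variables is Gumbel(log j, 1), with mean log j + γ (the Euler-Mascheroni constant).

```python
import hashlib
import math

def gumbel_estimate(stream, m=256):
    """Gumbel-flavored cardinality sketch: each element hashes to one of
    m buckets and to a deterministic uniform u in (0,1), giving a Gumbel
    variate g = -log(-log u); each bucket keeps its maximum. Averaging
    the bucket maxima and inverting the Gumbel-max mean recovers an
    estimate of the distinct count."""
    maxima = [None] * m
    for x in stream:
        h = int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:12], "big")
        b = h % m                                # bucket choice (low bits)
        u = ((h >> 16) % 2**64 + 0.5) / 2**64    # deterministic uniform in (0,1)
        g = -math.log(-math.log(u))              # standard Gumbel variate
        if maxima[b] is None or g > maxima[b]:
            maxima[b] = g
    gamma = 0.5772156649015329                   # Euler-Mascheroni constant
    assert all(M is not None for M in maxima)    # requires n >> m log m
    # Each bucket saw ~n/m distinct elements, so E[max] ~ log(n/m) + gamma.
    return m * math.exp(sum(maxima) / m - gamma)
```

Duplicates hash to the same Gumbel variate and so never change a bucket maximum, which is exactly the property that makes the sketch insensitive to repetitions.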

Data Structures And Algorithms

Chunk List: Concurrent Data Structures

Chunking data is obviously no new concept; however, I had never found any data structures that used chunking as the basis of their implementation. I figured that by combining chunking with concurrency, I could achieve extremely fast run-times for particular methods such as searching and sorting. Using chunking and concurrency to my advantage, I came up with the chunk list: a dynamic list-based data structure that separates large amounts of data into specifically sized chunks, each of which can be searched at the same time by searching each chunk on a separate thread. By implementing this concept in its own class, I was able to create something that almost consistently gives around 20x-300x faster results than a regular ArrayList. However, should speed still be an issue after implementation, users can modify the chunk size and benchmark the speed of smaller or larger chunks, depending on the amount of data being stored.
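
The idea can be sketched as a toy chunk list whose membership test fans out across chunks with a thread pool. Note that in CPython, pure-Python searches are serialized by the GIL, so this shows the structure rather than the reported speedups; the class and method names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

class ChunkList:
    """Toy chunk list: data is split into fixed-size chunks, and
    contains() searches all chunks concurrently, one task per chunk."""

    def __init__(self, chunk_size=1024):
        self.chunk_size = chunk_size
        self.chunks = [[]]

    def add(self, item):
        if len(self.chunks[-1]) >= self.chunk_size:
            self.chunks.append([])        # start a new chunk when full
        self.chunks[-1].append(item)

    def contains(self, item):
        with ThreadPoolExecutor() as pool:   # search every chunk in parallel
            return any(pool.map(lambda chunk: item in chunk, self.chunks))
```

Tuning `chunk_size` trades per-task overhead against parallelism, which matches the benchmarking advice above.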

Data Structures And Algorithms

Circular Trace Reconstruction

Trace reconstruction is the problem of learning an unknown string x from independent traces of x, where traces are generated by independently deleting each bit of x with some deletion probability q. In this paper, we initiate the study of circular trace reconstruction, where the unknown string x is circular and traces are now rotated by a random cyclic shift. Trace reconstruction is related to many computational biology problems studying DNA, which is a primary motivation for this problem as well, as many types of DNA are known to be circular. Our main results are as follows. First, we prove that we can reconstruct arbitrary circular strings of length n using exp(Õ(n^{1/3})) traces for any constant deletion probability q, as long as n is prime or the product of two primes. For n of this form, this nearly matches what was the best known bound of exp(O(n^{1/3})) for standard trace reconstruction when this paper was initially released. We note, however, that Chase very recently improved the standard trace reconstruction bound to exp(Õ(n^{1/5})). Next, we prove that we can reconstruct random circular strings with high probability using n^{O(1)} traces for any constant deletion probability q. Finally, we prove a lower bound of Ω̃(n^3) traces for arbitrary circular strings, which is greater than the best known lower bound of Ω̃(n^{3/2}) in standard trace reconstruction.
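
The channel described above (a uniformly random cyclic shift followed by independent deletions) is easy to simulate; this sketch is just the trace-generating process, not a reconstruction algorithm.

```python
import random

def circular_trace(x, q, rng=random):
    """Generate one trace of a circular string x: apply a uniformly
    random cyclic shift, then delete each bit independently with
    probability q."""
    s = rng.randrange(len(x))
    rotated = x[s:] + x[:s]                     # random rotation of x
    return "".join(c for c in rotated if rng.random() >= q)
```

With q = 0 a trace is simply a rotation of x, which is why rotation-invariant statistics are central to the reconstruction results.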

Data Structures And Algorithms

Clan Embeddings into Trees, and Low Treewidth Graphs

In low-distortion metric embeddings, the goal is to embed a "hard" host metric space into a "simpler" target space while approximately preserving pairwise distances. A highly desirable target space is that of a tree metric. Unfortunately, such an embedding can result in huge distortion. A celebrated bypass to this problem is stochastic embedding with logarithmic expected distortion. Another bypass is Ramsey-type embedding, where the distortion guarantee applies only to a subset of the points. However, both of these solutions fail to provide an embedding into a single tree with a worst-case distortion guarantee on all pairs. In this paper, we propose a novel third bypass called \emph{clan embedding}. Here each point x is mapped to a subset of points f(x), called a \emph{clan}, with a special \emph{chief} point χ(x) ∈ f(x). The clan embedding has multiplicative distortion t if for every pair (x,y) some copy y′ ∈ f(y) in the clan of y is close to the chief of x: min_{y′∈f(y)} d(y′, χ(x)) ≤ t · d(x,y). Our first result is a clan embedding into a tree with multiplicative distortion O(log n / ε) such that each point has 1+ε copies (in expectation). In addition, we provide a "spanning" version of this theorem for graphs and use it to devise the first compact routing scheme with constant-size routing tables. We then focus on minor-free graphs of diameter parameterized by D, which were known to be stochastically embeddable into bounded-treewidth graphs with expected additive distortion εD. We devise Ramsey-type embedding and clan embedding analogs of this stochastic embedding. We use these embeddings to construct the first (bicriteria quasi-polynomial time) approximation scheme for the metric ρ-dominating set and metric ρ-independent set problems in minor-free graphs.
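
The distortion definition can be made concrete with a small checker. The names (`f`, `chief`, `dT`) are illustrative, and the target metric is passed as an arbitrary function on copies rather than an actual tree metric; this checks the definition on toy instances, it is not the paper's construction.

```python
from itertools import permutations

def clan_distortion(X, d, dT, f, chief):
    """Compute the multiplicative distortion of a clan embedding, straight
    from the definition above: chief[x] must lie in the clan f[x], and the
    distortion is the smallest t such that, for every ordered pair (x, y),
        min over y' in f[y] of dT(y', chief[x]) <= t * d(x, y).
    d is the source metric; dT is the target metric on the copies."""
    assert all(chief[x] in f[x] for x in X)
    return max(
        min(dT(yc, chief[x]) for yc in f[y]) / d(x, y)
        for x, y in permutations(X, 2)
    )
```

The asymmetry is the point: only the chief of x needs a nearby copy of y, so extra copies in a clan can only help, at the cost of growing the embedded space.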

