Featured Research

Data Structures And Algorithms

Frequency Estimation in Data Streams: Learning the Optimal Hashing Scheme

We present a novel approach for the problem of frequency estimation in data streams that is based on optimization and machine learning. Contrary to state-of-the-art streaming frequency estimation algorithms, which rely heavily on random hashing to maintain the frequency distribution of the data stream using limited storage, the proposed approach exploits an observed stream prefix to near-optimally hash elements and compress the target frequency distribution. We develop an exact mixed-integer linear optimization formulation, which enables us to compute optimal or near-optimal hashing schemes for elements seen in the observed stream prefix; then, we use machine learning to hash unseen elements. Further, we develop an efficient block coordinate descent algorithm, which, as we empirically show, produces high-quality solutions, and, in a special case, we are able to solve the proposed formulation exactly in linear time using dynamic programming. We empirically evaluate the proposed approach both on synthetic datasets and on real-world search query data. We show that the proposed approach outperforms existing approaches by one to two orders of magnitude in terms of its average (per element) estimation error and by 45-90% in terms of its expected magnitude of estimation error.
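For context on the random-hashing baseline that the abstract contrasts against, the following is a minimal sketch of a Count-Min sketch, one well-known random-hashing frequency estimator. The abstract does not name a specific baseline, so the choice of Count-Min, the class name `CountMinSketch`, and the `width`/`depth` parameters are assumptions made for illustration; this is not the learned hashing scheme the paper proposes.

```python
import hashlib


class CountMinSketch:
    """Illustrative random-hashing sketch: `depth` rows of `width` counters."""

    def __init__(self, width: int, depth: int):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, item: str, row: int) -> int:
        # Salt the hash with the row index to obtain one hash function per row.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        # Every row increments the counter that the item hashes to.
        for row in range(self.depth):
            self.table[row][self._bucket(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest available upper bound on the true frequency.
        return min(
            self.table[row][self._bucket(item, row)] for row in range(self.depth)
        )


# Hypothetical toy stream, used only to exercise the sketch.
stream = ["a"] * 50 + ["b"] * 20 + ["c"] * 5
sketch = CountMinSketch(width=16, depth=4)
for element in stream:
    sketch.add(element)
```

Because elements are assigned to buckets without regard to their frequencies, a heavy element can share a counter with a light one and inflate its estimate; choosing the assignment from an observed prefix, as the abstract describes, targets that source of error.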


Fully Dynamic Algorithms for Knapsack Problems with Polylogarithmic Update Time

Knapsack problems are among the most fundamental problems in optimization. In the Multiple Knapsack problem, we are given multiple knapsacks with different capacities and items with values and sizes. The task is to find a subset of items of maximum total value that can be packed into the knapsacks without exceeding the capacities. We investigate this problem and special cases thereof in the context of dynamic algorithms and design data structures that efficiently maintain near-optimal knapsack solutions for dynamically changing input. More precisely, we handle the arrival and departure of individual items or knapsacks during the execution of the algorithm with worst-case update time polylogarithmic in the number of items. As the optimal and any approximate solution may change drastically, we only maintain implicit solutions and support certain queries in polylogarithmic time, such as the packing of an item and the solution value. While dynamic algorithms are well-studied in the context of graph problems, there is hardly any work on packing problems and generally much less on non-graph problems. Given the theoretical interest in knapsack problems and their practical relevance, it is somewhat surprising that Knapsack has not been addressed before in the context of dynamic algorithms and our work bridges this gap.


Fully Dynamic Approximate Maximum Independent Set on Massive Graphs

Computing a maximum independent set (MaxIS) is a fundamental NP-hard problem in graph theory, with important applications in a wide range of areas such as social network analysis, graphical information systems and coding theory. Since the underlying graphs of numerous applications change continuously, the problem of maintaining a MaxIS over dynamic graphs has received increasing attention in recent years. Due to the intractability of maintaining an exact MaxIS, this paper studies the problem of maintaining an approximate MaxIS over fully dynamic graphs, where four graph update operations are allowed, i.e., adding or deleting a vertex or an edge. Based on the swap operation, we present a novel framework for maintaining an approximate maximum independent set that contains no $k$-swaps, and we analyze in depth the performance ratio it achieves. We implement a dynamic $(\Delta/2 + 1)$-approximate algorithm and a more effective algorithm, based on one-swap vertices and two-swap vertex sets respectively, and further analyze their performance under the power-law random graph model. Extensive experiments are conducted over real graphs to confirm the effectiveness and efficiency of the proposed algorithms.


Fully Dynamic Electrical Flows: Sparse Maxflow Faster Than Goldberg-Rao

We give an algorithm for computing exact maximum flows on graphs with $m$ edges and integer capacities in the range $[1, U]$ in $\tilde{O}(m^{3/2 - 1/328} \log U)$ time. For sparse graphs with polynomially bounded integer capacities, this is the first improvement over the $\tilde{O}(m^{1.5} \log U)$ time bound from [Goldberg-Rao JACM '98]. Our algorithm revolves around dynamically maintaining the augmenting electrical flows at the core of the interior point method based algorithm from [Mądry JACM '16]. This entails designing data structures that, in limited settings, return edges with large electric energy in a graph undergoing resistance updates.


Fully-Dynamic Submodular Cover with Bounded Recourse

In submodular covering problems, we are given a monotone, nonnegative submodular function $f: 2^N \to \mathbb{R}_+$ and wish to find the min-cost set $S \subseteq N$ such that $f(S) = f(N)$. This captures SetCover when $f$ is a coverage function. We introduce a general framework for solving such problems in a fully-dynamic setting where the function $f$ changes over time, and only a bounded number of updates to the solution (recourse) is allowed. For concreteness, suppose a nonnegative monotone submodular function $g_t$ is added or removed from an active set $G^{(t)}$ at each time $t$. If $f^{(t)} = \sum_{g \in G^{(t)}} g$ is the sum of all active functions, we wish to maintain a competitive solution to SubmodularCover for $f^{(t)}$ as this active set changes, and with low recourse. We give an algorithm that maintains an $O(\log(f_{\max}/f_{\min}))$-competitive solution, where $f_{\max}, f_{\min}$ are the largest/smallest marginals of $f^{(t)}$. The algorithm guarantees a total recourse of $O(\log(c_{\max}/c_{\min}) \cdot \sum_{t \le T} g_t(N))$, where $c_{\max}, c_{\min}$ are the largest/smallest costs of elements in $N$. This competitive ratio is best possible even in the offline setting, and the recourse bound is optimal up to the logarithmic factor. For monotone submodular functions that also have positive mixed third derivatives, we show an optimal recourse bound of $O(\sum_{t \le T} g_t(N))$. This structured class includes set-coverage functions, so our algorithm matches the known $O(\log n)$-competitiveness and $O(1)$ recourse guarantees for fully-dynamic SetCover. Our work simultaneously simplifies and unifies previous results, as well as generalizes to a significantly larger class of covering problems. Our key technique is a new potential function inspired by Tsallis entropy. We also extensively use the idea of Mutual Coverage, which generalizes the classic notion of mutual information.


Further Unifying the Landscape of Cell Probe Lower Bounds

In a landmark paper, Pǎtraşcu demonstrated how a single lower bound for the static data structure problem of reachability in the butterfly graph could be used to derive a wealth of new and previous lower bounds via reductions. These lower bounds are tight for numerous static data structure problems. Moreover, he also showed that reachability in the butterfly graph reduces to dynamic marked ancestor, a classic problem used to prove lower bounds for dynamic data structures. Unfortunately, Pǎtraşcu's reduction to marked ancestor loses a lg lg n factor and therefore falls short of fully recovering all the previous dynamic data structure lower bounds that follow from marked ancestor. In this paper, we revisit Pǎtraşcu's work and give a new lossless reduction to dynamic marked ancestor, thereby establishing reachability in the butterfly graph as a single seed problem from which a range of tight static and dynamic data structure lower bounds follow.


GRMR: Generalized Regret-Minimizing Representatives

Extracting a small subset of representative tuples from a large database is an important task in multi-criteria decision making. The regret-minimizing set (RMS) problem was recently proposed for representative discovery from databases. Specifically, for a set of tuples (points) in d dimensions, the RMS problem asks for the smallest subset such that, for any possible ranking function, the relative difference in scores between the top-ranked point in the subset and the top-ranked point in the entire database is within a parameter ε ∈ (0,1). Although RMS and its variations have been extensively investigated in the literature, existing approaches only consider the class of nonnegative (monotonic) linear functions for ranking, which limits their ability to model user preferences and decision-making processes. To address this issue, we define the generalized regret-minimizing representative (GRMR) problem, which extends RMS by taking into account all linear functions, including non-monotonic ones with negative weights. For two-dimensional databases, we propose an optimal algorithm for GRMR via a transformation into the shortest cycle problem in a directed graph. Since GRMR is proven to be NP-hard even in three dimensions, we further develop a polynomial-time heuristic algorithm for GRMR on databases in arbitrary dimensions. Finally, we conduct extensive experiments on real and synthetic datasets to confirm the efficiency, effectiveness, and scalability of our proposed algorithms.
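The "relative difference in scores" that defines the regret criterion can be made concrete for a single linear ranking function. The sketch below is an illustration only: the function names `score` and `regret_ratio` are hypothetical, the normalization by the database's top score is an assumption about how the relative difference is measured, and the code requires that top score to be positive (a case that needs extra care once negative weights are allowed, as in GRMR). It evaluates one ranking function and does not implement the paper's algorithms.

```python
from typing import Sequence

Point = Sequence[float]


def score(point: Point, weights: Sequence[float]) -> float:
    """Linear ranking function: weighted sum of the point's coordinates."""
    return sum(w * x for w, x in zip(weights, point))


def regret_ratio(
    database: Sequence[Point], subset: Sequence[Point], weights: Sequence[float]
) -> float:
    """Relative score gap between the best point overall and the best in `subset`.

    Assumption: the gap is normalized by the database's top score, which
    must be positive for the ratio to be meaningful here.
    """
    best_overall = max(score(p, weights) for p in database)
    best_in_subset = max(score(p, weights) for p in subset)
    if best_overall <= 0:
        raise ValueError("this sketch assumes a positive top score")
    return (best_overall - best_in_subset) / best_overall


# Hypothetical two-dimensional database and a one-point representative set.
database = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.6)]
subset = [(0.6, 0.6)]
ratio = regret_ratio(database, subset, weights=(1.0, 0.0))
```

A subset satisfies the criterion for parameter ε when this ratio is at most ε for every ranking function in the class under consideration, not only for the single weight vector checked here.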


Gapped Indexing for Consecutive Occurrences

The classic string indexing problem is to preprocess a string $S$ into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in $S$), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). In this paper we consider a variant of string indexing, where the goal is to compactly represent the string such that given two patterns $P_1$ and $P_2$ and a gap range $[\alpha, \beta]$ we can quickly find the consecutive occurrences of $P_1$ and $P_2$ with distance in $[\alpha, \beta]$, i.e., pairs of occurrences immediately following each other and with distance within the range. We present data structures that use $\tilde{O}(n)$ space and query time $\tilde{O}(|P_1| + |P_2| + n^{2/3})$ for existence and counting and $\tilde{O}(|P_1| + |P_2| + n^{2/3} \mathrm{occ}^{1/3})$ for reporting. We complement this with a conditional lower bound based on the set intersection problem showing that any solution using $\tilde{O}(n)$ space must use $\tilde{\Omega}(|P_1| + |P_2| + \sqrt{n})$ query time. To obtain our results we develop new techniques and ideas of independent interest including a new suffix tree decomposition and hardness of a variant of the set intersection problem.
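To make the query itself concrete, here is a brute-force reference that scans the string directly instead of using an index. It is a sketch under stated assumptions: a pair (i, j) is treated as consecutive when i is a start position of the first pattern, j > i is a start position of the second, and no occurrence of either pattern starts strictly between them; the distance is taken as j - i. The function names are hypothetical, and the paper's exact definitions may differ in edge cases.

```python
from bisect import bisect_right
from typing import List, Tuple


def occurrences(s: str, pattern: str) -> List[int]:
    """All start positions of `pattern` in `s` (overlapping matches included)."""
    return [i for i in range(len(s) - len(pattern) + 1) if s.startswith(pattern, i)]


def consecutive_gapped(
    s: str, p1: str, p2: str, alpha: int, beta: int
) -> List[Tuple[int, int]]:
    """Consecutive occurrences (i, j) of p1 then p2 with alpha <= j - i <= beta."""
    occ1 = occurrences(s, p1)
    occ2 = set(occurrences(s, p2))
    all_positions = sorted(set(occ1) | occ2)
    pairs = []
    for i in occ1:
        # The next start position of either pattern strictly after i.
        k = bisect_right(all_positions, i)
        if k == len(all_positions):
            continue
        j = all_positions[k]
        # Only a p2 occurrence directly following i forms a consecutive pair.
        if j in occ2 and alpha <= j - i <= beta:
            pairs.append((i, j))
    return pairs


# Hypothetical example: "ab" starts at 0, 3, 6 and "x" starts at 5.
pairs = consecutive_gapped("abcabxab", "ab", "x", alpha=1, beta=3)
```

This baseline takes time proportional to the string length for every query, which is the cost an index for this problem is meant to avoid.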


Generalized Parametric Path Problems

Parametric path problems arise independently in diverse domains, ranging from transportation to finance, where they are studied under various assumptions. We formulate a general path problem with relaxed assumptions, and describe how this formulation is applicable in these domains. We study the complexity of the general problem, and a variant of it where preprocessing is allowed. We show that when the parametric weights are linear functions, algorithms remain tractable even under our relaxed assumptions. Furthermore, we show that if the weights are allowed to be non-linear, the problem becomes NP-hard. We also study the multi-dimensional version of the problem where the weight functions are parameterized by multiple parameters. We show that even with two parameters, the problem is NP-hard.


GeoTree: a data structure for constant time geospatial search enabling a real-time mix-adjusted median property price index

A common problem appearing across the field of data science is k-NN (k-nearest neighbours), particularly within the context of Geographic Information Systems. In this article, we present a novel data structure, the GeoTree, which holds a collection of geohashes (string encodings of GPS co-ordinates). This enables a constant-time, O(1), search algorithm that returns a set of geohashes surrounding a given geohash in the GeoTree, representing the approximate k-nearest neighbours of that geohash. Furthermore, the GeoTree data structure retains an O(n) memory requirement. We apply the data structure to a property price index algorithm focused on price comparison with historical neighbouring sales, demonstrating enhanced performance. The results show that this data structure allows for the development of a real-time property price index, and can be scaled to larger datasets with ease.
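One way to picture a tree over geohashes is a prefix tree in which geohashes sharing a longer prefix lie in a smaller common cell. The sketch below is an assumption-laden illustration, not the GeoTree as specified in the article: the class names, the choice to store at each node every geohash inserted beneath it, and the `prefix_len` parameter are all hypothetical. With geohashes of a fixed length, the walk to a node takes a bounded number of steps, and each geohash is stored a bounded number of times.

```python
from typing import Dict, List


class PrefixNode:
    """One node per geohash prefix; `items` holds every geohash below it."""

    def __init__(self) -> None:
        self.children: Dict[str, "PrefixNode"] = {}
        self.items: List[str] = []


class GeohashPrefixTree:
    """Illustrative prefix tree over geohash strings (hypothetical design)."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def insert(self, geohash: str) -> None:
        node = self.root
        for ch in geohash:
            node = node.children.setdefault(ch, PrefixNode())
            # Record the geohash at every prefix level it passes through.
            node.items.append(geohash)

    def neighbours(self, geohash: str, prefix_len: int) -> List[str]:
        """Other stored geohashes sharing the first `prefix_len` characters."""
        node = self.root
        for ch in geohash[:prefix_len]:
            child = node.children.get(ch)
            if child is None:
                return []
            node = child
        return [g for g in node.items if g != geohash]


# Hypothetical geohash strings, used only to exercise the structure.
tree = GeohashPrefixTree()
for gh in ["gc7x9", "gc7x3", "gc7w1", "u10hb"]:
    tree.insert(gh)

close = tree.neighbours("gc7x9", prefix_len=4)
wider = tree.neighbours("gc7x9", prefix_len=3)
```

Shortening the shared prefix widens the search cell, which is one simple way to trade precision for more candidate neighbours; prefix matching is approximate, since two nearby points can fall on opposite sides of a cell boundary.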

