Ambuj K. Singh
University of California, Santa Barbara
international conference on management of data | 1998
K. V. Ravi Kanth; Divyakant Agrawal; Ambuj K. Singh
Databases are increasingly being used to store multi-media objects such as maps, images, audio and video. Storage and retrieval of these objects is accomplished using multi-dimensional index structures such as R*-trees and SS-trees. As dimensionality increases, query performance in these index structures degrades. This phenomenon, generally referred to as the dimensionality curse, can be circumvented by reducing the dimensionality of the data. Such a reduction is however accompanied by a loss of precision of query results. Current techniques such as QBIC use SVD transform-based dimensionality reduction to ensure high query precision. The drawback of this approach is that SVD is expensive to compute, and therefore not readily applicable to dynamic databases. In this paper, we propose novel techniques for performing SVD-based dimensionality reduction in dynamic databases. When the data distribution changes considerably so as to degrade query precision, we recompute the SVD transform and incorporate it in the existing index structure. For recomputing the SVD-transform, we propose a novel technique that uses aggregate data from the existing index rather than the entire data. This technique reduces the SVD-computation time without compromising query precision. We then explore efficient ways to incorporate the recomputed SVD-transform in the existing index structure without degrading subsequent query response times. These techniques reduce the computation time by a factor of 20 in experiments on color and texture image vectors. The error due to approximate computation of SVD is less than 10%.
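The idea of recomputing the transform from aggregate data rather than from every stored vector can be illustrated with a small sketch. The per-node aggregate layout below (count, sum vector, sum of outer products) and all function names are assumptions made for illustration, not the paper's actual summaries. With this layout the covariance is reproduced exactly, whereas the paper reports an approximate computation with under 10% error.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Aggregate = Tuple[int, Vector, Matrix]


def node_aggregate(points: Sequence[Vector]) -> Aggregate:
    """Summarize one hypothetical index node: count, sum vector, sum of outer products."""
    d = len(points[0])
    total = [0.0] * d
    outer = [[0.0] * d for _ in range(d)]
    for p in points:
        for i in range(d):
            total[i] += p[i]
            for j in range(d):
                outer[i][j] += p[i] * p[j]
    return len(points), total, outer


def covariance_from_aggregates(aggs: Sequence[Aggregate]) -> Matrix:
    """Population covariance assembled from node summaries, without touching raw points."""
    n = sum(a[0] for a in aggs)
    d = len(aggs[0][1])
    mean = [sum(a[1][i] for a in aggs) / n for i in range(d)]
    return [
        [sum(a[2][i][j] for a in aggs) / n - mean[i] * mean[j] for j in range(d)]
        for i in range(d)
    ]


def principal_direction(cov: Matrix, iters: int = 200) -> Vector:
    """Dominant eigenvector of the covariance matrix via power iteration."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


# Two hypothetical leaf nodes whose points lie on the line y = 2x.
nodes = [[[0.0, 0.0], [1.0, 2.0]], [[2.0, 4.0], [3.0, 6.0]]]
cov = covariance_from_aggregates([node_aggregate(pts) for pts in nodes])
direction = principal_direction(cov)
```

The dominant eigenvector of the covariance matrix is the first axis an SVD-based reduction would keep, so projecting onto `direction` reduces these 2-D points to one dimension.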
international conference on data engineering | 2006
Huahai He; Ambuj K. Singh
Graphs have become popular for modeling structured data. As a result, graph queries are becoming common and graph indexing has come to play an essential role in query processing. We introduce the concept of a graph closure, a generalized graph that represents a number of graphs. Our indexing technique, called Closure-tree, organizes graphs hierarchically where each node summarizes its descendants by a graph closure. Closure-tree can efficiently support both subgraph queries and similarity queries. Subgraph queries find graphs that contain a specific subgraph, whereas similarity queries find graphs that are similar to a query graph. For subgraph queries, we propose a technique called pseudo subgraph isomorphism which approximates subgraph isomorphism with high accuracy. For similarity queries, we measure graph similarity through edit distance using heuristic graph mapping methods. We implement two kinds of similarity queries: K-NN query and range query. Our experiments on chemical compounds and synthetic graphs show that for subgraph queries, Closure-tree outperforms existing techniques by up to two orders of magnitude in terms of candidate answer set size and index size. For similarity queries, our experiments validate the quality and efficiency of the presented algorithms.
international conference on management of data | 2008
Huahai He; Ambuj K. Singh
With the prevalence of graph data in a variety of domains, there is an increasing need for a language to query and manipulate graphs with heterogeneous attributes and structures. We propose a query language for graph databases that supports arbitrary attributes on nodes, edges, and graphs. In this language, graphs are the basic unit of information and each query manipulates one or more collections of graphs. To allow for flexible compositions of graph structures, we extend the notion of formal languages from strings to the graph domain. We present a graph algebra extended from the relational algebra in which the selection operator is generalized to graph pattern matching and a composition operator is introduced for rewriting matched graphs. Then, we investigate access methods of the selection operator. Pattern matching over large graphs is challenging due to the NP-completeness of subgraph isomorphism. We address this by a combination of techniques: use of neighborhood subgraphs and profiles, joint reduction of the search space, and optimization of the search order. Experimental results on real and synthetic large graphs demonstrate that our graph specific optimizations outperform an SQL-based implementation by orders of magnitude.
Bioinformatics | 2010
Kristian Kvilekval; Dmitry Fedorov; Boguslaw Obara; Ambuj K. Singh; B. S. Manjunath
MOTIVATION: Advances in the field of microscopy have brought about the need for better image management and analysis solutions. Novel imaging techniques have created vast stores of images and metadata that are difficult to organize, search, process and analyze. These tasks are further complicated by conflicting and proprietary image and metadata formats that impede analyzing and sharing of images and any associated data. These obstacles have resulted in research resources being locked away in digital media and file cabinets. Current image management systems do not address the pressing needs of researchers who must quantify image data on a regular basis. RESULTS: We present Bisque, a web-based platform specifically designed to provide researchers with organizational and quantitative analysis tools for 5D image data. Users can extend Bisque with both data model and analysis extensions in order to adapt the system to local needs. Bisque's extensibility stems from two core concepts: a flexible metadata facility and an open web-based architecture. Together these empower researchers to create, develop and share novel bioimage analyses. Several case studies using Bisque with specific applications are presented as an indication of how users can expect to extend Bisque for their own purposes.
international conference on data engineering | 2001
Tamer Kahveci; Ambuj K. Singh
Finding similar patterns in a time sequence is a well-studied problem. Most of the current techniques work well for queries of a prespecified length, but not for variable length queries. We propose a new indexing technique that works well for variable length queries. The central idea is to store index structures at different resolutions for a given dataset. The resolutions are based on wavelets. For a given query, a number of subqueries at different resolutions are generated. The ranges of the subqueries are progressively refined based on results from previous subqueries. Our experiments show that the total cost for our method is 4 to 20 times less than the current techniques including linear scan. Because of the need to store information at multiple resolution levels, the storage requirement of our method could potentially be large. In the second part of the paper we show how the index information can be compressed with minimal information loss. According to our experimental results, even after compressing the size of the index to one fifth, the total cost of our method is 3 to 15 times less than the current techniques.
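A minimal sketch of the multi-resolution idea follows, assuming Haar-style pairwise averaging and fixed-length, power-of-two sequences. It does not cover the variable-length subquery generation described in the abstract, and the function names are hypothetical. Coarse levels give a cheap lower bound on Euclidean distance, so a candidate can be discarded before finer levels are examined.

```python
import math
from typing import List, Sequence


def haar_levels(series: Sequence[float]) -> List[List[float]]:
    """Successive pairwise averages; level k averages blocks of 2**k values."""
    n = len(series)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    levels = [list(series)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([(prev[i] + prev[i + 1]) / 2 for i in range(0, len(prev), 2)])
    return levels


def lower_bound(level_x: Sequence[float], level_y: Sequence[float], block: int) -> float:
    """Lower bound on Euclidean distance computed from block averages."""
    return math.sqrt(block * sum((a - b) ** 2 for a, b in zip(level_x, level_y)))


def range_search(query: Sequence[float], database: Sequence[Sequence[float]],
                 radius: float) -> List[int]:
    """Indices of sequences within `radius` of the query, checked coarse to fine."""
    q_levels = haar_levels(query)
    hits = []
    for idx, seq in enumerate(database):
        # A real index would store these levels instead of recomputing them.
        s_levels = haar_levels(seq)
        pruned = False
        for k in range(len(q_levels) - 1, -1, -1):
            if lower_bound(q_levels[k], s_levels[k], 2 ** k) > radius:
                pruned = True  # a coarse level already rules this sequence out
                break
        if not pruned:
            hits.append(idx)
    return hits


database = [[1, 2, 3, 4], [1, 2, 3, 5], [10, 10, 10, 10]]
hits = range_search([1, 2, 3, 4], database, radius=1.0)
```

Level 0 is the raw sequence, so the final check is the exact distance and the filter produces no false positives. The coarser levels only save work.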
international conference on data engineering | 2009
Sayan Ranu; Ambuj K. Singh
Graphs are being increasingly used to model a wide range of scientific data. Such widespread usage of graphs has generated considerable interest in mining patterns from graph databases. While an array of techniques exists to mine frequent patterns, we still lack a scalable approach to mine statistically significant patterns, specifically patterns with low p-values, that occur at low frequencies. We propose a highly scalable technique, called GraphSig, to mine significant subgraphs from large graph databases. We convert each graph into a set of feature vectors where each vector represents a region within the graph. Domain knowledge is used to select a meaningful feature set. Prior probabilities of features are computed empirically to evaluate statistical significance of patterns in the feature space. Following analysis in the feature space, only a small portion of the exponential search space is accessed for further analysis. This enables the use of existing frequent subgraph mining techniques to mine significant patterns in a scalable manner even when they are infrequent. Extensive experiments are carried out on the proposed techniques, and empirical results demonstrate that GraphSig is effective and efficient for mining significant patterns. To further demonstrate the power of significant patterns, we develop a classifier using patterns mined by GraphSig. Experimental results show that the proposed classifier achieves superior performance, both in terms of quality and computation cost, over state-of-the-art classifiers.
international conference on data engineering | 2007
Vebjorn Ljosa; Ambuj K. Singh
The ability to store and query uncertain information is of great benefit to databases that infer values from a set of observations, including databases of moving objects, sensor readings, historical business transactions, and biomedical images. These observations are often inexact to begin with, and even if they are exact, a set of observations of an attribute of an object is better represented by a probability distribution than by a single number, such as a mean. In this paper, we present adaptive, piecewise-linear approximations (APLAs), which represent arbitrary probability distributions compactly with guaranteed quality. We also present the APLA-tree, an index structure for APLAs. Because APLA is more precise than existing approximation techniques, the APLA-tree can answer probabilistic range queries twice as fast. APLA generalizes to multiple dimensions, and the APLA-tree can index multivariate distributions using either one-dimensional or multidimensional APLAs. Finally, we propose a new definition of k-NN queries on uncertain data. The new definition allows APLA and the APLA-tree to answer k-NN queries quickly, even on arbitrary probability distributions. No efficient k-NN search was previously possible on such distributions.
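To show how a piecewise-linear approximation can answer a probabilistic range query, here is a small sketch. It assumes the approximation is applied to the cumulative distribution function and that the breakpoints are already given. Both are assumptions for illustration: the abstract does not say which function is approximated, and the adaptive choice of breakpoints that gives APLA its quality guarantee is not reproduced here.

```python
from bisect import bisect_right
from typing import List, Tuple

Breakpoints = List[Tuple[float, float]]  # sorted (x, F(x)) pairs, F rising from 0 to 1


def cdf_at(breaks: Breakpoints, x: float) -> float:
    """Linearly interpolate the approximate CDF at x."""
    xs = [b[0] for b in breaks]
    if x <= xs[0]:
        return breaks[0][1]
    if x >= xs[-1]:
        return breaks[-1][1]
    i = bisect_right(xs, x)  # xs[i - 1] <= x < xs[i]
    x0, f0 = breaks[i - 1]
    x1, f1 = breaks[i]
    return f0 + (f1 - f0) * (x - x0) / (x1 - x0)


def range_probability(breaks: Breakpoints, lo: float, hi: float) -> float:
    """Approximate P(lo <= X <= hi) for a continuous attribute X."""
    return max(0.0, cdf_at(breaks, hi) - cdf_at(breaks, lo))


# Hypothetical three-breakpoint approximation of one object's distribution.
breaks = [(0.0, 0.0), (1.0, 0.5), (3.0, 1.0)]
prob = range_probability(breaks, 0.5, 2.0)
```

A probabilistic range query with threshold t would then return the objects whose `range_probability` is at least t.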
international conference on data engineering | 2003
Ahmet Bulut; Ambuj K. Singh
The problem of statistics and aggregate maintenance over data streams has gained popularity in recent years, especially in telecommunications network monitoring, trend-related analysis, Web-click streams, stock tickers, and other time-variant data. The amount of data generated in such applications can become too large to store or, if stored, too large to scan multiple times. We consider queries over data streams that are biased towards the more recent values. We develop a technique that summarizes a dynamic stream incrementally at multiple resolutions. This approximation can be used to answer point queries, range queries, and inner product queries. Moreover, the precision of answers can be changed adaptively by a client. Later, we extend the above technique to work in a distributed setting, specifically in a large network where a central site summarizes the stream and clients ask queries. We minimize the message overhead by deciding what and where to replicate by using an adaptive replication scheme. We maintain a hierarchy of approximations that change adaptively based on the query and update rates. We show experimentally that our technique performs better than existing techniques: up to 50 times better in terms of approximation quality, up to four orders of magnitude better in response time, and up to five times better in terms of message complexity.
conference on object-oriented programming systems, languages, and applications | 1997
Raimondas Lencevicius; Urs Hölzle; Ambuj K. Singh
Object relationships in modern software systems are becoming increasingly numerous and complex. Programmers who try to find violations of such relationships need new tools that allow them to explore objects in a large system more efficiently. Many existing debuggers present only a low-level, one-object-at-a-time view of objects and their relationships. We propose a new solution to overcome these problems: query-based debugging. The implementation of the query-based debugger described here offers programmers an effective query tool that allows efficient searching of large object spaces and quick verification of complex relationships. Even for programs that have large numbers of objects, the debugger achieves interactive response times for common queries by using a combination of fast searching primitives, query optimization, and incremental result delivery.
international conference on data engineering | 2008
Vebjorn Ljosa; Ambuj K. Singh
Probabilistic data have recently become popular in applications such as scientific and geospatial databases. For images and other spatial datasets, probabilistic values can capture the uncertainty in extent and class of the objects in the images. Relating one such dataset to another by spatial joins is an important operation for data management systems. We consider probabilistic spatial join (PSJ) queries, which rank the results according to a score that incorporates both the uncertainties associated with the objects and the distances between them. We present algorithms for two kinds of PSJ queries: threshold PSJ queries, which return all pairs that score above a given threshold, and top-k PSJ queries, which return the k top-scoring pairs. For threshold PSJ queries, we propose a plane sweep algorithm that, because it exploits the special structure of the problem, runs in O(n (log n + k)) time, where n is the number of points and k is the number of results. We extend the algorithms to 2-D data and to top-k PSJ queries. To further speed up top-k PSJ queries, we develop a scheduling technique that estimates the scores at the level of blocks, then hands the blocks to the plane sweep algorithm. By finding high-scoring pairs early, the scheduling allows a large portion of the datasets to be pruned. Experiments demonstrate speed-ups of two orders of magnitude.
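The threshold query can be sketched in one dimension with a sort followed by a sliding-window sweep. The score function used here, the product of the two probabilities times exp(-distance), is a hypothetical stand-in chosen for illustration; the abstract does not define the paper's score, and this sketch makes no claim to the stated running time. Because each probability is at most 1, any pair farther apart than -ln(tau) cannot reach the threshold, which bounds the sweep window.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (position, probability in (0, 1])


def threshold_join(a_pts: List[Point], b_pts: List[Point],
                   tau: float) -> List[Tuple[Point, Point, float]]:
    """All pairs (a, b) whose hypothetical score is at least tau."""
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must be in (0, 1]")
    window = -math.log(tau)  # farther pairs cannot score >= tau
    a_sorted = sorted(a_pts)
    b_sorted = sorted(b_pts)
    results = []
    start = 0
    for xa, pa in a_sorted:
        # Drop B points that fell out of the window; later A points sit further right.
        while start < len(b_sorted) and b_sorted[start][0] < xa - window:
            start += 1
        j = start
        while j < len(b_sorted) and b_sorted[j][0] <= xa + window:
            xb, pb = b_sorted[j]
            score = pa * pb * math.exp(-abs(xa - xb))
            if score >= tau:
                results.append(((xa, pa), (xb, pb), score))
            j += 1
    return results


a_pts = [(0.0, 1.0), (5.0, 0.9)]
b_pts = [(0.5, 1.0), (5.1, 0.5), (9.0, 1.0)]
pairs = threshold_join(a_pts, b_pts, tau=0.5)
```

Only the pair at positions 0.0 and 0.5 qualifies. The pair at 5.0 and 5.1 is close enough to be examined but its probabilities pull the score below the threshold.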