Nick Koudas | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Nick Koudas is active.

Explore More

Publication

Featured researches published by Nick Koudas.

international conference on data engineering | 2002

Structural joins: a primitive for efficient XML query pattern matching

Shurug Al-Khalifa; H. V. Jagadish; Nick Koudas; Jignesh M. Patel; Divesh Srivastava; Yuqing Wu

XML queries typically specify patterns of selection predicates on multiple elements that have some specified tree structured relationships. The primitive tree structured relationships are parent-child and ancestor-descendant, and finding all occurrences of these relationships in an XML database is a core operation for XML query processing. We develop two families of structural join algorithms for this task: tree-merge and stack-tree. The tree-merge algorithms are a natural extension of traditional merge joins and the multi-predicate merge joins, while the stack-tree algorithms have no counterpart in traditional relational join processing. We present experimental results on a range of data and queries using the TIMBER native XML query engine built on top of SHORE. We show that while, in some cases, tree-merge algorithms can have performance comparable to stack-tree algorithms, in many cases they are considerably worse. This behavior is explained by analytical results that demonstrate that, on sorted inputs, the stack-tree algorithms have worst-case I/O and CPU complexities linear in the sum of the sizes of inputs and output, while the tree-merge algorithms do not have the same guarantee.

symposium on the theory of computing | 2001

Data-streams and histograms

Sudipto Guha; Nick Koudas; Kyuseok Shim

Histograms have been used widely to capture data distribution, to represent the data by a small number of step functions. Dynamic programming algorithms which provide optimal construction of these histograms exist, albeit running in quadratic time and linear space. In this paper we provide linear time construction of 1 + ε approximation of optimal histograms, running in polylogarithmic space.nOur results extend to the context of data-streams, and in fact generalize to give 1 + ε approximation of several problems in data-streams which require partitioning the index set into intervals. The only assumptions required are that the cost of an interval is monotonic under inclusion (larger interval has larger cost) and that the cost can be computed or approximated in small space. This exhibits a nice class of problems for which we can have near optimal data-stream algorithms.

international conference on management of data | 2002

Dynamic multidimensional histograms

Nitin Thaper; Sudipto Guha; Piotr Indyk; Nick Koudas

Histograms are a concise and flexible way to construct summary structures for large data sets. They have attracted a lot of attention in database research due to their utility in many areas, including query optimization, and approximate query answering. They are also a basic tool for data visualization and analysis.In this paper, we present a formal study of dynamic multidimensional histogram structures over continuous data streams. At the heart of our proposal is the use of a dynamic summary data structure (vastly different from a histogram) maintaining a succinct approximation of the data distribution of the underlying continuous stream. On demand, an accurate histogram is derived from this dynamic data structure. We propose algorithms for extracting such an accurate histogram and we analyze their behavior and tradeoffs. The proposed algorithms are able to provide approximate guarantees about the quality of the estimation of the histograms they extract.We complement our analytical results with a thorough experimental evaluation using real data sets.

international conference on data engineering | 2001

Counting twig matches in a tree

Zhiyuan Chen; H. V. Jagadish; Flip Korn; Nick Koudas; S. Muthukrishnan; Raymond T. Ng; Divesh Srivastava

Describes efficient algorithms for accurately estimating the number of matches of a small node-labeled tree, i.e. a twig, in a large node-labeled tree, using a summary data structure. This problem is of interest for queries on XML and other hierarchical data, to provide query feedback and for cost-based query optimization. Our summary data structure scalably represents approximate frequency information about twiglets (i.e. small twigs) in the data tree. Given a twig query, the number of matches is estimated by creating a set of query twiglets, and combining two complementary approaches: set hashing, used to estimate the number of matches of each query twiglet, and maximal overlap, used to combine the query twiglet estimates into an estimate for the twig query. We propose several estimation algorithms that apply these approaches on query twiglets formed using variations on different twiglet decomposition techniques. We present an extensive experimental evaluation using several real XML data sets, with a variety of twig queries. Our results demonstrate that accurate and robust estimates can be achieved, even with limited space.

international conference on data engineering | 2003

Navigation- vs. index-based XML multi-query processing

Nicolas Bruno; Luis Gravano; Nick Koudas; Divesh Srivastava

XML path queries form the basis of complex filtering of XML data. Most current XML path query processing techniques can be divided in two groups. Navigation-based algorithms compute results by analyzing an input document one tag at a time. In contrast, index-based algorithms take advantage of precomputed numbering schemes over the input XML document. We introduce a new index-based technique, index-filter, to answer multiple XML path queries. Index-filter uses indexes built over the document tags to avoid processing large portions of the input document that are guaranteed not to be part of any match. We analyze index-filter and compare it against Y-filter, a state-of-the-art navigation-based technique. We show that both techniques have their advantages, and we discuss the scenarios under which each technique is superior to the other one. In particular, we show that while most XML path query processing techniques work off SAX events, in some cases it pays off to preprocess the input document, augmenting it with auxiliary information that can be used to evaluate the queries faster. We present experimental results over real and synthetic XML documents that validate our claims.

knowledge discovery and data mining | 2003

Correlating synchronous and asynchronous data streams

Sudipto Guha; Dimitrios Gunopulos; Nick Koudas

In a variety of modern mining applications, data are commonly viewed as infinite time ordered data streams rather as finite data sets stored on disk. This view challenges fundamental assumptions commonly made in the context of several data mining algorithms.In this paper, we study the problem of identifying correlations between multiple data streams. In particular, we propose algorithms capable of capturing correlations between multiple continuous data streams in a highly efficient and accurate manner. Our algorithms and techniques are applicable in the case of both synchronous and asynchronous data streaming environments. We capture correlations between multiple streams using the well known technique of Singular Value Decomposition (SVD). Correlations between data items, and the SVD technique in particular, have been repeatedly utilized in an off-line (non stream) data mining problems, for example forecasting, approximate query answering, and data reduction.We propose a methodology based on a combination of dimensionality reduction and sampling to make the SVD technique suitable for a data stream context. Our techniques are approximate, trading accuracy with performance, and we analytically quantify this tradeoff. We present a through experimental evaluation, using both real and synthetic data sets, from a prototype implementation of our technique, investigating the impact of various parameters in the accuracy of the overall computation. Our results indicate, that correlations between multiple data streams can be identified very efficiently and accurately. The algorithms proposed herein, are presented as generic tools, with a multitude of applications on data stream mining problems.

international conference on data engineering | 2003

Text joins for data cleansing and integration in an RDBMS

Luis Gravano; Panagiotis G. Ipeirotis; Nick Koudas; Divesh Srivastava

An organizations data records are often noisy because of transcription errors, incomplete information, lack of standard formats for textual data or combinations thereof. A fundamental task in a data cleaning system is matching textual attributes that refer to the same entity (e.g., organization name or address). This matching is effectively performed via the cosine similarity metric from the information retrieval field. For robustness and scalability, these text joins are best done inside an RDBMS, which is where the data is likely to reside. Unfortunately, computing an exact answer to a text join can be expensive. We propose an approximate, sampling-based text join execution strategy that can be robustly executed in a standard, unmodified RDBMS.

data and knowledge engineering | 2007

Index structures for matching XML twigs using relational query processors

Zhiyuan Chen; Johannes Gehrke; Flip Korn; Nick Koudas; Jayavel Shanmugasundaram; Divesh Srivastava

Various index structures have been proposed to speed up the evaluation of XML path expressions. However, existing XML path indices suffer from at least one of three limitations: they focus only on indexing the structure (relying on a separate index for node content), they are useful only for simple path expressions such as root-to-leaf paths, or they cannot be tightly integrated with a relational query processor. Moreover, there is no unified framework to compare these index structures. In this paper, we present a framework defining a family of index structures that includes most existing XML path indices. We also propose two novel index structures in this family, with different space-time tradeoffs, that are effective for the evaluation of XML branching path expressions (i.e., twigs) with value conditions. We also show how this family of index structures can be implemented using the access methods of the underlying relational database system. Finally, we present an experimental evaluation that shows the performance tradeoff between index space and matching time. The experimental results show that our novel indices achieve orders of magnitude improvement in performance for evaluating twig queries, albeit at a higher space cost, over the use of previously proposed XML path indices that can be tightly integrated with a relational query processor.

international conference on data engineering | 2001

An efficient approximation scheme for data mining tasks

George Kollios; D. Gunupulos; Nick Koudas; Stefan Berchtold

We investigate the use of biased sampling according to the density of the dataset, to speed up the operation of general data mining tasks, such as clustering and outlier detection in large multidimensional datasets. In density biased sampling, the probability that a given point will be included in the sample depends on the local density of the dataset. We propose a general technique for density-biased sampling that can factor in user requirements to sample for properties of interest, and can be tuned for specific data mining tasks. This allows great flexibility and improved accuracy of the results over simple random sampling. We describe our approach in detail, we analytically evaluate it, and show how it can be optimized for approximate clustering and outlier detection. Finally we present a thorough experimental evaluation of the proposed method, applying density-biased sampling on real and synthetic data sets, and employing clustering and outlier detection algorithms, thus highlighting the utility of our approach.

international conference on data engineering | 2003

Index-based approximate XML joins

Sudipto Guha; Nick Koudas; Divesh Srivastava; Ting Yu

XML data integration tools are facing a variety of challenges for their efficient and effective operation. Among these is the requirement to handle a variety of inconsistencies or mistakes present in the data sets. We study the problem of integrating XML data sources through index assisted join operations, using notions of approximate match in the structure and content of XML documents as the join predicate. We show how a well known and widely deployed index structure, namely the R-tree, can be adopted to improve the performance of such operations. We propose novel search and join algorithms for R-trees adopted to index XML document collections. We also propose novel optimization objectives for R-tree construction, making R-trees better suited for this application.

Explore More