Johannes Gehrke | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Johannes Gehrke is active.

Explore More

Publication

Featured researches published by Johannes Gehrke.

international conference on management of data | 1998

Automatic subspace clustering of high dimensional data for data mining applications

Rakesh Agrawal; Johannes Gehrke; Dimitrios Gunopulos; Prabhakar Raghavan

Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate cluster in large high dimensional datasets.

international conference on management of data | 2002

The cougar approach to in-network query processing in sensor networks

Yong Yao; Johannes Gehrke

The widespread distribution and availability of small-scale sensors, actuators, and embedded processors is transforming the physical world into a computing platform. One such example is a sensor network consisting of a large number of sensor nodes that combine physical sensing capabilities such as temperature, light, or seismic sensors with networking and computation capabilities. Applications range from environmental control, warehouse inventory, and health care to military environments. Existing sensor networks assume that the sensors are preprogrammed and send data to a central frontend where the data is aggregated and stored for offline querying and analysis. This approach has two major drawbacks. First, the user cannot change the behavior of the system on the fly. Second, conservation of battery power is a major design factor, but a central system cannot make use of in-network programming, which trades costly communication for cheap local computation.In this paper, we introduce the Cougar approach to tasking sensor networks through declarative queries. Given a user query, a query optimizer generates an efficient query plan for in-network query processing, which can vastly reduce resource usage and thus extend the lifetime of a sensor network. In addition, since queries are asked in a declarative language, the user is shielded from the physical characteristics of the network. We give a short overview of sensor networks, propose a natural architecture for a data management system for sensor networks, and describe open research problems in this area.

foundations of computer science | 2003

Gossip-based computation of aggregate information

David Kempe; Alin Dobra; Johannes Gehrke

Over the last decade, we have seen a revolution in connectivity between computers, and a resulting paradigm shift from centralized to highly distributed systems. With massive scale also comes massive instability, as node and link failures become the norm rather than the exception. For such highly volatile systems, decentralized gossip-based protocols are emerging as an approach to maintaining simplicity and scalability while achieving fault-tolerant information dissemination. In this paper, we study the problem of computing aggregates with gossip-style protocols. Our first contribution is an analysis of simple gossip-based protocols for the computation of sums, averages, random samples, quantiles, and other aggregate functions, and we show that our protocols converge exponentially fast to the true answer when using uniform gossip. Our second contribution is the definition of a precise notion of the speed with which a nodes data diffuses through the network. We show that this diffusion speed is at the heart of the approximation guarantees for all of the above problems. We analyze the diffusion speed of uniform gossip in the presence of node and link failures, as well as for flooding-based mechanisms. The latter expose interesting connections to random walks on graphs.

knowledge discovery and data mining | 2002

Sequential PAttern mining using a bitmap representation

Jay Ayres; Jason Flannick; Johannes Gehrke; Tomi Yiu

We introduce a new algorithm for mining sequential patterns. Our algorithm is especially efficient when the sequential patterns in the database are very long. We introduce a novel depth-first search strategy that integrates a depth-first traversal of the search space with effective pruning mechanisms.Our implementation of the search strategy combines a vertical bitmap representation of the database with efficient support counting. A salient feature of our algorithm is that it incrementally outputs new frequent itemsets in an online fashion.In a thorough experimental evaluation of our algorithm on standard benchmark data from the literature, our algorithm outperforms previous work up to an order of magnitude.

mobile data management | 2001

Towards Sensor Database Systems

Philippe Bonnet; Johannes Gehrke; Praveen Seshadri

Sensor networks are being widely deployed for measurement, detection and surveillance applications. In these new applications, users issue long-running queries over a combination of stored data and sensor data. Most existing applications rely on a centralized system for collecting sensor data. These systems lack flexibility because data is extracted in a predefined way; also, they do not scale to a large number of devices because large volumes of raw data are transferred regardless of the queries that are submitted. In our new concept of sensor database system, queries dictate which data is extracted from the sensors. In this paper, we define the concept of sensor databases mixing stored data represented as relations and sensor data represented as time series. Each long-running query formulated over a sensor database defines a persistent view, which is maintained during a given time interval. We also describe the design and implementation of the COUGAR sensor database system.

international conference on data engineering | 2001

MAFIA: a maximal frequent itemset algorithm for transactional databases

Douglas Burdick; Manuel Calimlim; Johannes Gehrke

We present a new algorithm for mining maximal frequent itemsets from a transactional database. Our algorithm is especially efficient when the itemsets in the database are very long. The search strategy of our algorithm integrates a depth-first traversal of the itemset lattice with effective pruning mechanisms. Our implementation of the search strategy combines a vertical bitmap representation of the database with an efficient relative bitmap compression schema. In a thorough experimental analysis of our algorithm on real data, we isolate the effect of the individual components of the algorithm. Our performance numbers show that our algorithm outperforms previous work by a factor of three to five.

IEEE Pervasive Computing | 2004

Query processing in sensor networks

Johannes Gehrke; Samuel Madden

Smart sensors are small wireless computing devices that sense information such as light and humidity at extremely high resolutions. A smart sensor query-processing architecture using database technology can facilitate deployment of sensor networks. Smart-sensor technology enables a broad range of ubiquitous computing applications. Their low cost, small size, and untethered nature lets them sense information at previously unobtainable resolutions. We discuss about query processing in sensor networks.

very large data bases | 2004

Detecting change in data streams

Daniel Kifer; Shai Ben-David; Johannes Gehrke

Detecting changes in a data stream is an important area of research with many applications. In this paper, we present a novel method for the detection and estimation of change. In addition to providing statistical guarantees on the reliability of detected changes, our method also provides meaningful descriptions and quantification of these changes. Our approach assumes that the points in the stream are independently generated, but otherwise makes no assumptions on the nature of the generating distribution. Thus our techniques work for both continuous and discrete data. In an experimental study we demonstrate the power of our techniques.

IEEE Personal Communications | 2000

Querying the physical world

Philippe Bonnet; Johannes Gehrke; Praveen Seshadri

In the next decade, millions of sensors and small-scale mobile devices will integrate processors, memory, and communication capabilities. Networks of devices will be widely deployed for monitoring applications. In these new applications, users need to query very large collections of devices in an ad hoc manner. Most existing systems rely on a centralized system for collecting device data. These systems lack flexibility because data is extracted in a predefined way. Also, they do not scale to a large number of devices because large volumes of raw data are transferred. In our new concept of a device database system, distributed query execution techniques are applied to leverage the computing capabilities of devices, and to reduce communication. We define an abstraction that allows us to represent a device network as a database and we describe how distributed query processing techniques are applied in this new context.

knowledge discovery and data mining | 1999

CACTUS—clustering categorical data using summaries

Venkatesh Ganti; Johannes Gehrke; Raghu Ramakrishnan

Clustering is an important data mining problem. Most of the earlier work on clustering focussed on numeric attributes which have a natural ordering on their attribute values. Recently, clustering data with categorical attributes, whose attribute values do not have a natural ordering, has received some attention. However, previous algorithms do not give a formal description of the clusters they discover and some of them assume that the user post-processes the output of the algorithm to identify the final clusters. In this paper, we introduce a novel formalization of a cluster for categorical attributes by generalizing a definition of a cluster for numerical attributes. We then describe a very fast summarizationbased algorithm called CACTUS that discovers exactly such clusters in the data. CACTUS has two important characteristics. First, the algorithm requires only two scans of the dataset, and hence is very fast and scalable. Our experiments on a variety of datasets show that CACTUS outperforms previous work by a factor of 3 to 10. Second, CACTUS can find clusters in subsets of all attributes and can thus perform a subspace clustering of the data. This feature is important if clusters do not span all attributes, a likely scenario if the number of attributes is very large. In a thorough experimental evaluation, we study the performance of CACTUS on real and synthetic datasets.

Explore More