Arvind Arasu | Researchain

Publication

Featured research published by Arvind Arasu.

very large data bases | 2006

The CQL continuous query language: semantic foundations and query execution

Arvind Arasu; Shivnath Babu; Jennifer Widom

CQL, a continuous query language, is supported by the STREAM prototype data stream management system (DSMS) at Stanford. CQL is an expressive SQL-based declarative language for registering continuous queries against streams and stored relations. We begin by presenting an abstract semantics that relies only on “black-box” mappings among streams and relations. From these mappings we define a precise and general interpretation for continuous queries. CQL is an instantiation of our abstract semantics using SQL to map from relations to relations, window specifications derived from SQL-99 to map from streams to relations, and three new operators to map from relations to streams. Most of the CQL language is operational in the STREAM system. We present the structure of CQL's query execution plans as well as details of the most important components: operators, interoperator queues, synopses, and sharing of components among multiple operators and queries. Examples throughout the paper are drawn from the Linear Road benchmark recently proposed for DSMSs. We also curate a public repository of data stream applications that includes a wide variety of queries expressed in CQL. The relative ease of capturing these applications in CQL is one indicator that the language contains an appropriate set of constructs for data stream processing.
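The three kinds of mappings named in the abstract above (stream to relation, relation to relation, relation to stream) can be illustrated with a small Python sketch. This is not code from the STREAM system: the function names `rows_window`, `select`, and `istream` are hypothetical, the window is a simple tuple-based one, and the sketch assumes exactly one tuple arrives per timestamp. The relation-to-stream step imitates an insert-stream style operator, one of the three relation-to-stream operators the abstract mentions, by emitting tuples that appear in the current relation but not in the previous one.

```python
from collections import Counter, deque


def rows_window(stream, n):
    """Stream-to-relation: keep the last n tuples (tuple-based sliding window).

    `stream` is an iterable of (timestamp, tuple) pairs; after every arrival
    the current window contents are yielded as a time-varying relation.
    """
    window = deque(maxlen=n)
    for ts, tup in stream:
        window.append(tup)
        yield ts, list(window)


def select(relations, predicate):
    """Relation-to-relation: filter each instantaneous relation."""
    for ts, rel in relations:
        yield ts, [t for t in rel if predicate(t)]


def istream(relations):
    """Relation-to-stream: emit tuples that are in the relation at time ts
    but were not in it at the previous instant (multiset difference)."""
    prev = Counter()
    for ts, rel in relations:
        cur = Counter(rel)
        for tup, k in (cur - prev).items():
            for _ in range(k):
                yield ts, tup
        prev = cur


# Hypothetical input: (timestamp, (car_id, speed)) readings.
readings = [(1, ("a", 60)), (2, ("b", 40)), (3, ("c", 70)), (4, ("a", 60))]
fast = select(rows_window(readings, 2), lambda t: t[1] > 50)
result = list(istream(fast))
```

Composing the three generators mirrors how a continuous query chains a window, a relational operator, and a relation-to-stream operator. A real DSMS would share state between queries and handle several tuples per timestamp, which this sketch deliberately omits.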

international conference on management of data | 2003

Extracting structured data from Web pages

Arvind Arasu; Hector Garcia-Molina

Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title,...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated web pages without any learning examples or other similar human input. We formally define a template, and propose a model that describes how values are encoded into pages using a template. We present an algorithm that takes, as input, a set of template-generated pages, deduces the unknown template used to generate the pages, and extracts, as output, the values encoded in the pages. Experimental evaluation on a large number of real input page collections indicates that our algorithm correctly extracts data in most cases.

ACM Transactions on Internet Technology | 2001

Searching the Web

Arvind Arasu; Junghoo Cho; Hector Garcia-Molina; Andreas Paepcke; Sriram Raghavan

We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. For this presentation we draw from the literature and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts and the results of several performance analyses we conducted to compare different designs.

Data Stream Management | 2016

STREAM: The Stanford Data Stream Management System

Arvind Arasu; Brian Babcock; Shivnath Babu; John Cieslewicz; Mayur Datar; Keith Ito; Rajeev Motwani; Utkarsh Srivastava; Jennifer Widom

Traditional database management systems are best equipped to run one-time queries over finite stored data sets. However, many modern applications such as network monitoring, financial analysis, manufacturing, and sensor networks require long-running, or continuous, queries over continuous unbounded streams of data. In the STREAM project at Stanford, we are investigating data management and query processing for this class of applications. As part of the project we are building a general-purpose prototype Data Stream Management System (DSMS), also called STREAM, that supports a large class of declarative continuous queries over continuous streams and traditional stored data sets. The STREAM prototype targets environments where streams may be rapid, stream characteristics and query loads may vary over time, and system resources may be limited.

symposium on principles of database systems | 2004

Approximate counts and quantiles over sliding windows

Arvind Arasu; Gurmeet Singh Manku

We consider the problem of maintaining ε-approximate counts and quantiles over a stream sliding window using limited space. We consider two types of sliding windows depending on whether the number of elements N in the window is fixed (fixed-size sliding window) or variable (variable-size sliding window). In a fixed-size sliding window, both the ends of the window slide synchronously over the stream. In a variable-size sliding window, an adversary slides the window ends independently, and therefore has the ability to vary the number of elements N in the window. We present various deterministic and randomized algorithms for approximate counts and quantiles. All of our algorithms require O(1/ε polylog(1/ε, N)) space. For quantiles, this space requirement is an improvement over the previous best bound of O(1/ε² polylog(1/ε, N)). We believe that no previous work on space-efficient approximate counts over sliding windows exists.
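To make the problem statement above concrete, the following Python sketch shows an exact fixed-size sliding window and two checks for whether an answer is ε-approximate. It is only an O(N)-space baseline for illustration, not any of the paper's space-efficient algorithms. The class and function names are hypothetical, and the definitions used (a count estimate within εN of the true count, a quantile answer whose rank lies within εN of φN) are one common formulation assumed here; the paper's exact definitions may differ.

```python
from collections import deque


class ExactWindow:
    """Exact fixed-size sliding window over the last n stream elements."""

    def __init__(self, n):
        self.items = deque(maxlen=n)

    def add(self, x):
        self.items.append(x)

    def count(self, x):
        """True number of occurrences of x in the window."""
        return sum(1 for y in self.items if y == x)

    def rank_range(self, x):
        """1-based ranks (lo, hi) that x occupies in sorted order.

        hi < lo means x does not occur in the window.
        """
        less = sum(1 for y in self.items if y < x)
        equal = sum(1 for y in self.items if y == x)
        return less + 1, less + equal


def is_eps_approx_count(window, x, estimate, eps):
    """Assumed definition: |estimate - true count| <= eps * N."""
    n = len(window.items)
    return abs(estimate - window.count(x)) <= eps * n


def is_eps_approx_quantile(window, x, phi, eps):
    """Assumed definition: x has some rank r with (phi - eps) * N <= r <= (phi + eps) * N."""
    n = len(window.items)
    lo, hi = window.rank_range(x)
    if hi < lo:
        return False  # x is not in the window
    return lo <= (phi + eps) * n and hi >= (phi - eps) * n


# Hypothetical usage: window of size 5 over the stream 1..10 holds 6..10.
w = ExactWindow(5)
for value in range(1, 11):
    w.add(value)
```

With this window, 7 and 8 are acceptable answers for the median at ε = 0.2, while 6 and 10 are not. A space-efficient algorithm would have to return such answers without storing all N elements.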

database programming languages | 2003

CQL: A Language for Continuous Queries over Streams and Relations

Arvind Arasu; Shivnath Babu; Jennifer Widom

Despite the recent surge of research in query processing over data streams, little attention has been devoted to defining precise semantics for continuous queries over streams. We first present an abstract semantics based on several building blocks: formal definitions for streams and relations, mappings among them, and any relational query language. From these basics we define a precise interpretation for continuous queries over streams and relations. We then propose a concrete language, CQL (for Continuous Query Language), which instantiates the abstract semantics using SQL as the relational query language and window specifications derived from SQL-99 to map from streams to relations. We have implemented most of the CQL language in a Data Stream Management System at Stanford, and we have developed a public repository of data stream applications that includes a wide variety of queries expressed in CQL.

international conference on management of data | 2003

STREAM: the Stanford stream data manager (demonstration description)

Arvind Arasu; Brian Babcock; Shivnath Babu; Mayur Datar; Keith Ito; Itaru Nishizawa; Justin Rosenstein; Jennifer Widom

STREAM is a general-purpose relational Data Stream Management System (DSMS). STREAM supports a declarative query language and flexible query execution plans. It is designed to cope with high data rates and large numbers of continuous queries through careful resource allocation and use, and by degrading gracefully to approximate answers as necessary. A description of language design, algorithms, system design, and implementation as of late 2002 can be found in [3]. The demonstration focuses on two aspects:

very large data bases | 2004

Resource sharing in continuous sliding-window aggregates

Arvind Arasu; Jennifer Widom

We consider the problem of resource sharing when processing large numbers of continuous queries. We specifically address sliding-window aggregates over data streams, an important class of continuous operators for which sharing has not been addressed. We present a suite of sharing techniques that cover a wide range of possible scenarios: different classes of aggregation functions (algebraic, distributive, holistic), different window types (time-based, tuple-based, suffix, historical), and different input models (single stream, multiple substreams). We provide precise theoretical performance guarantees for our techniques, and show their practical effectiveness through experimental study.

symposium on principles of database systems | 2002

Characterizing memory requirements for queries over continuous data streams

Arvind Arasu; Brian Babcock; Shivnath Babu; Jon McAlister; Jennifer Widom

We consider conjunctive queries with arithmetic comparisons over multiple continuous data streams. We specify an algorithm for determining whether or not a query can be evaluated using a bounded amount of memory for all possible instances of the data streams. When a query can be evaluated using bounded memory, we produce an execution strategy based on constant-sized synopses of the data streams.

international conference on management of data | 2010

On active learning of record matching packages

Arvind Arasu; Michaela Götz; Raghav Kaushik

We consider the problem of learning a record matching package (classifier) in an active learning setting. In active learning, the learning algorithm picks the set of examples to be labeled, unlike the more traditional passive learning setting, where a user selects the labeled examples. Active learning is important for record matching since manually identifying a suitable set of labeled examples is difficult. Previous algorithms that use active learning for record matching have serious limitations: the packages that they learn lack quality guarantees, and the algorithms do not scale to large input sizes. We present new algorithms for this problem that overcome these limitations. Our algorithms are fundamentally different from traditional active learning approaches, and are designed from the ground up to exploit problem characteristics specific to record matching. We include a detailed experimental evaluation on real-world data demonstrating the effectiveness of our algorithms.
