Biswanath Panda | Researchain

Publication

Featured research published by Biswanath Panda.

very large data bases | 2009

PLANET: massively parallel learning of tree ensembles with MapReduce

Biswanath Panda; Joshua Seth Herbach; Sugato Basu; Roberto J. Bayardo

Classification and regression tree learning on massive datasets is a common data mining task at Google, yet many state-of-the-art tree learning algorithms require training data to reside in memory on a single machine. While more scalable implementations of tree learning have been proposed, they typically require specialized parallel computing architectures. In contrast, the majority of Google's computing infrastructure is based on commodity hardware. In this paper, we describe PLANET: a scalable distributed framework for learning tree models over large datasets. PLANET defines tree learning as a series of distributed computations, and implements each one using the MapReduce model of distributed computation. We show how this framework supports scalable construction of classification and regression trees, as well as ensembles of such models. We discuss the benefits and challenges of using a MapReduce compute cluster for tree learning, and demonstrate the scalability of this approach by applying it to a real-world learning task from the domain of computational advertising.
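
To illustrate what "tree learning as a series of distributed computations" could look like, here is a minimal single-process sketch of one such computation: choosing the best split for one regression-tree node with a map step and a reduce step. It is not taken from the paper. The function names (`map_partition`, `reduce_stats`, `best_split`), the use of count/sum/sum-of-squares statistics, and the squared-error gain criterion are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sketch, not PLANET's actual code: each "mapper" sees one data
# partition and emits small sufficient statistics per candidate split, so the
# full dataset never has to fit in memory on one machine.

def map_partition(rows, candidate_splits):
    """Map step: per split, accumulate (n, sum_y, sum_y2) of left-branch rows.

    The key None holds the same statistics for all rows in the partition.
    """
    out = defaultdict(lambda: (0, 0.0, 0.0))
    for x, y in rows:
        n, s, q = out[None]
        out[None] = (n + 1, s + y, q + y * y)
        for feat, thr in candidate_splits:
            if x[feat] < thr:
                n, s, q = out[(feat, thr)]
                out[(feat, thr)] = (n + 1, s + y, q + y * y)
    return dict(out)

def reduce_stats(partials):
    """Reduce step: add up the per-partition statistics key by key."""
    merged = defaultdict(lambda: (0, 0.0, 0.0))
    for part in partials:
        for key, (n, s, q) in part.items():
            a, b, c = merged[key]
            merged[key] = (a + n, b + s, c + q)
    return dict(merged)

def sse(n, s, q):
    """Sum of squared errors around the mean, from sufficient statistics."""
    return q - s * s / n if n else 0.0

def best_split(merged, candidate_splits):
    """Controller step: pick the split with the largest reduction in SSE."""
    total = merged[None]
    best = None
    for split in candidate_splits:
        left = merged.get(split, (0, 0.0, 0.0))
        right = tuple(t - l for t, l in zip(total, left))
        if left[0] == 0 or right[0] == 0:
            continue  # a split that sends every row one way is useless
        gain = sse(*total) - sse(*left) - sse(*right)
        if best is None or gain > best[1]:
            best = (split, gain)
    return best
```

In this sketch only the small per-split tuples cross the map/reduce boundary, which is the property that would let the same logic run on a commodity cluster; growing a full tree or an ensemble would repeat the step for each node to be expanded.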

international conference on management of data | 2007

Cayuga: a high-performance event processing engine

Lars Brenna; Alan J. Demers; Johannes Gehrke; Mingsheng Hong; Joel Ossher; Biswanath Panda; Mirek Riedewald; Mohit Thatte; Walker M. White

We propose a demonstration of Cayuga, a complex event monitoring system for high speed data streams. Our demonstration will show Cayuga applied to monitoring Web feeds; the demo will illustrate the expressiveness of the Cayuga query language, the scalability of its query processing engine to high stream rates, and a visualization of the internals of the query processing engine.

international conference on data mining | 2007

High-Speed Function Approximation

Biswanath Panda; Mirek Riedewald; Johannes Gehrke; Stephen B. Pope

We address a new learning problem where the goal is to build a predictive model that minimizes prediction time (the time taken to make a prediction) subject to a constraint on model accuracy. Our solution is a generic framework that leverages existing data mining algorithms without requiring any modifications to these algorithms. We show a first application of our framework to a combustion simulation problem. Our experimental evaluation shows significant improvements over existing methods; prediction time typically is improved by a factor between 2 and 6.
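
One way to write the learning problem described above as a constrained optimization is sketched below. The symbols are assumptions introduced here for illustration and do not come from the paper: $\mathcal{F}$ is the set of candidate models, $T(f)$ the prediction time of model $f$, $\mathrm{Err}(f)$ its prediction error, and $\varepsilon$ the accuracy tolerance.

```latex
\[
  f^{*} \;=\; \arg\min_{f \in \mathcal{F}} \; T(f)
  \quad \text{subject to} \quad \mathrm{Err}(f) \le \varepsilon
\]
```

Read this way, the usual roles are swapped: accuracy is the constraint and prediction time is the objective, rather than the other way around.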

international conference on data engineering | 2010

The model-summary problem and a solution for trees

Biswanath Panda; Mirek Riedewald; Daniel Fink

Modern science is collecting massive amounts of data from sensors, instruments, and through computer simulation. It is widely believed that analysis of this data will hold the key to future scientific breakthroughs. Unfortunately, deriving knowledge from large high-dimensional scientific datasets is difficult. One emerging answer is exploratory analysis using data mining; but data mining models that accurately capture natural processes tend to be very complex and are usually not intelligible. Scientists therefore generate model summaries to find the most important patterns learned by the model. We formalize the model-summary problem and introduce it as a novel problem to the database community. Generating model summaries creates serious data management challenges: Scientists usually want to analyze patterns in different “slices” and “dices” of the data space, comparing the effects of various input variables on the output. We propose novel techniques for efficiently generating such summaries for the popular class of tree-based models. Our techniques leverage workload structure on multiple levels. We also propose a scalable implementation of our techniques in MapReduce. For both sequential and parallel implementations, we achieve speedups of one or more orders of magnitude over the naive algorithm, while guaranteeing the exact same results.

very large data bases | 2008

Large-scale collaborative analysis and extraction of web data

Felix Weigel; Biswanath Panda; Mirek Riedewald; Johannes Gehrke; Manuel Calimlim

Archived web data is a great resource for scientific research, but poses serious challenges in data processing and management. We demonstrate the Web Lab Collaboration Server, a platform and service for large-scale collaborative web data analysis in a distributed computing environment, and show how it seamlessly supports non-technical users during search, data extraction and analysis.

very large data bases | 2013

Scolopax: exploratory analysis of scientific data

Alper Okcan; Mirek Riedewald; Biswanath Panda; Daniel Fink

The formulation of hypotheses based on patterns found in data is an essential component of scientific discovery. As larger and richer data sets become available, new scalable and user-friendly tools for scientific discovery through data analysis are needed. We demonstrate Scolopax, which explores the idea of a search engine for hypotheses. It has an intuitive user interface that supports sophisticated queries. Scolopax can explore a huge space of possible hypotheses, returning a ranked list of those that best match the user preferences. To scale to large and complex data sets, Scolopax relies on parallel data management and mining techniques. These include model training, efficient model summary generation, and novel parallel join techniques that together with traditional approaches such as clustering manipulate massive model-summary collections to find the most interesting hypotheses. This demonstration of Scolopax uses a real observational data set, provided by the Cornell Lab of Ornithology. It contains more than 3.3 million bird sightings reported by citizen scientists and has almost 2500 attributes. Conference attendees have the opportunity to make novel discoveries in this data set, ranging from identifying variables that strongly affect bird populations in specific regions to detecting more sophisticated patterns such as habitat competition and migration.

conference on innovative data systems research | 2007