Leonid Glimcher | Researchain

Publication

Featured research published by Leonid Glimcher.

International Conference on Data Engineering | 2006

New Sampling-Based Estimators for OLAP Queries

Ruoming Jin; Leonid Glimcher; Chris Jermaine; Gagan Agrawal

One important way in which sampling for approximate query processing in a database environment differs from traditional applications of sampling is that in a database, it is feasible to collect accurate summary statistics from the data in addition to the sample. This paper describes a set of sampling-based estimators for approximate query processing that make use of simple summary statistics to greatly increase the accuracy of sampling-based estimators. Our estimators are able to give tight probabilistic guarantees on estimation accuracy. They are suitable for low or high dimensional data, and work with categorical or numerical attributes. Furthermore, the information used by our estimators can easily be gathered in a single pass, making them suitable for use in a streaming environment.

Journal of Parallel and Distributed Computing | 2008

Middleware for data mining applications on clusters and grids

Leonid Glimcher; Ruoming Jin; Gagan Agrawal

This paper gives an overview of two middleware systems that have been developed over the last 6 years to address the challenges involved in developing parallel and distributed implementations of data mining algorithms. FREERIDE (FRamework for Rapid Implementation of Data mining Engines) focuses on data mining in a cluster environment. FREERIDE is based on the observation that parallel versions of several well-known data mining techniques share a relatively similar structure, and can be parallelized by dividing the data instances (or records or transactions) among the nodes. The computation on each node involves reading the data instances in an arbitrary order, processing each data instance, and performing a local reduction. The reduction involves only commutative and associative operations, which means the result is independent of the order in which the data instances are processed. After the local reduction on each node, a global reduction is performed. This similarity in the structure can be exploited by the middleware system to execute the data mining tasks efficiently in parallel, starting from a relatively high-level specification of the technique. To enable processing of data sets stored in remote data repositories, we have extended FREERIDE middleware into FREERIDE-G (FRamework for Rapid Implementation of Data mining Engines in Grid). FREERIDE-G supports a high-level interface for developing data mining and scientific data processing applications that involve data stored in remote repositories. The added functionality in FREERIDE-G aims at abstracting the details of remote data retrieval, movements, and caching from application developers.
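The processing structure described above (a local reduction on each node followed by a global reduction) can be sketched in a few lines of Python. This is a minimal illustration of the pattern only, not the FREERIDE API: the function names `local_reduction`, `generalized_reduction`, `new_histogram`, `add_item`, and `merge_histograms` are hypothetical, and the "nodes" are simulated by plain lists processed one after another.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, Sequence, TypeVar

T = TypeVar("T")  # a data instance (record, transaction, ...)
R = TypeVar("R")  # a reduction object


def local_reduction(chunk: Iterable[T], init: Callable[[], R],
                    accumulate: Callable[[R, T], R]) -> R:
    """Fold every data instance of one partition into a fresh reduction object."""
    state = init()
    for instance in chunk:
        state = accumulate(state, instance)
    return state


def generalized_reduction(partitions: Sequence[Iterable[T]],
                          init: Callable[[], R],
                          accumulate: Callable[[R, T], R],
                          merge: Callable[[R, R], R]) -> R:
    """Run one local reduction per partition, then merge the local results.

    Here the partitions are processed sequentially; in a cluster each
    local reduction would run on a different node.
    """
    local_results = [local_reduction(p, init, accumulate) for p in partitions]
    return reduce(merge, local_results, init())


# Hypothetical example: counting item occurrences. Addition is commutative
# and associative, so the result does not depend on the processing order.
def new_histogram() -> Dict[str, int]:
    return {}


def add_item(hist: Dict[str, int], item: str) -> Dict[str, int]:
    hist[item] = hist.get(item, 0) + 1
    return hist


def merge_histograms(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    for key, count in b.items():
        a[key] = a.get(key, 0) + count
    return a


partitions = [["a", "b", "a"], ["b", "c"], ["a"]]
result = generalized_reduction(partitions, new_histogram, add_item, merge_histograms)
```

Because the accumulate and merge steps only add counts, reordering the partitions or the instances inside a partition leaves `result` unchanged, which is the property the abstract relies on.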

International Parallel and Distributed Processing Symposium | 2004

Scaling and parallelizing a scientific feature mining application using a cluster middleware

Leonid Glimcher; Xuan Zhang; Gagan Agrawal

Summary form only given. As scientific simulations generate large amounts of data, analyzing this data to gain insights into scientific phenomena is increasingly becoming a challenge. We present a case study on the use of a cluster middleware for rapidly creating a scalable and parallel implementation of a scientific data analysis application. Using FREERIDE (framework for rapid implementation of data mining engines), we parallelize a feature extraction algorithm and scale it to disk-resident datasets. We have developed a parallel algorithm for this problem which matches the communication and computation structure supported by the FREERIDE system. The main observations from our experimental results are as follows: 1) the overhead of using the middleware is quite small in most cases, 2) there is an overhead associated with breaking the datasets into more partitions or chunks, and 3) if the dataset is partitioned into the same number of chunks, the execution time stays proportional to the size of the dataset and inversely proportional to the number of nodes, i.e. the overhead of communication or reading disk-resident datasets is very small.

European Conference on Parallel Processing | 2004

Parallelizing EM Clustering Algorithm on a Cluster of SMPs

Leonid Glimcher; Gagan Agrawal

In this paper, we report on parallelization of the EM clustering algorithm using the FREERIDE middleware developed in our prior work. FREERIDE is based upon the observation that the processing structure of a large number of data mining algorithms involves generalized reductions. FREERIDE offers a high-level interface and supports both distributed memory and shared memory parallelization, besides efficient execution on disk-resident datasets. We show how the main processing loops in both the E and M steps of the EM algorithm essentially involve a generalized reduction, and therefore, the algorithm can be parallelized using FREERIDE.
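To make the reduction structure concrete, here is a minimal Python sketch of one EM iteration for a one-dimensional Gaussian mixture, written so that each data chunk contributes additive sufficient statistics that are then merged. It is one common way to express EM as a reduction and is not taken from the paper: it fuses the E step and the accumulation needed by the M step into a single pass, and all names (`local_e_step`, `merge_stats`, `m_step`, `em_iteration`) as well as the sample data are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Params = List[Tuple[float, float, float]]  # (weight, mean, variance) per component
Stats = List[List[float]]                  # per component: [sum r, sum r*x, sum r*x*x]


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def local_e_step(chunk: Sequence[float], params: Params) -> Stats:
    """Local reduction: accumulate responsibility-weighted sums for one chunk."""
    stats = [[0.0, 0.0, 0.0] for _ in params]
    for x in chunk:
        dens = [w * gaussian_pdf(x, m, v) for w, m, v in params]
        total = sum(dens)
        for k, d in enumerate(dens):
            r = d / total  # responsibility of component k for point x
            stats[k][0] += r
            stats[k][1] += r * x
            stats[k][2] += r * x * x
    return stats


def merge_stats(a: Stats, b: Stats) -> Stats:
    """Global reduction: element-wise addition (commutative and associative)."""
    return [[u + v for u, v in zip(sa, sb)] for sa, sb in zip(a, b)]


def m_step(stats: Stats, n: int) -> Params:
    """Turn the merged sufficient statistics into updated parameters."""
    params = []
    for s0, s1, s2 in stats:
        mean = s1 / s0
        var = max(s2 / s0 - mean * mean, 1e-6)  # small floor keeps the variance positive
        params.append((s0 / n, mean, var))
    return params


def em_iteration(chunks: Sequence[Sequence[float]], params: Params) -> Params:
    merged = [[0.0, 0.0, 0.0] for _ in params]
    n = 0
    for chunk in chunks:  # each chunk could be handled by a different node
        merged = merge_stats(merged, local_e_step(chunk, params))
        n += len(chunk)
    return m_step(merged, n)


data = [0.9, 1.1, 1.0, 5.0, 5.2, 4.8]
start = [(0.5, 0.0, 1.0), (0.5, 6.0, 1.0)]
single = em_iteration([data], start)
chunked = em_iteration([data[:2], data[2:]], start)
```

Splitting the data into chunks changes only the order of the floating-point additions, so `single` and `chunked` agree up to rounding.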

IEEE International Conference on High Performance Computing, Data, and Analytics | 2009

Supporting load balancing for distributed data-intensive applications

Leonid Glimcher; Vignesh T. Ravi; Gagan Agrawal

In data-intensive computing, an important problem that has received relatively little attention is the transparent processing of data stored in remote data repositories. Interesting load balancing considerations arise for these scenarios. In particular, depending on where data is generated and how it is shared, a dataset of interest can be divided across multiple data repositories, which may be geographically distributed, and the data may be partitioned in a number of ways. This paper focuses on enabling such distributed processing of data from distributed resources. We have developed a load balancing algorithm, which minimizes the total time spent on processing the data. We consider a weighted sum of two factors, a load balancing factor and a term that captures the amount of time spent by processing nodes waiting for the data. Our solutions have been implemented and evaluated in the context of FREERIDE-G (FRamework for Rapid Implementation of Datamining Engines in Grid). We have extensively evaluated our techniques using two data-intensive applications.
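The abstract does not define the two factors or the weights, so the following Python sketch only illustrates the general shape of a weighted-sum objective. Everything in it is an assumption: the imbalance measure (busiest node's time minus the mean), the waiting term (sum of per-node wait times), the `weight` parameter, and the names `load_imbalance`, `schedule_cost`, and `pick_schedule` are hypothetical and are not the paper's algorithm.

```python
from typing import Dict, List, Sequence


def load_imbalance(node_times: Sequence[float]) -> float:
    """Assumed imbalance measure: busiest node's time minus the mean time."""
    return max(node_times) - sum(node_times) / len(node_times)


def schedule_cost(node_times: Sequence[float], wait_times: Sequence[float],
                  weight: float) -> float:
    """Weighted sum of an imbalance term and a data-wait term (both assumed)."""
    return weight * load_imbalance(node_times) + (1.0 - weight) * sum(wait_times)


def pick_schedule(candidates: List[Dict[str, List[float]]],
                  weight: float = 0.5) -> Dict[str, List[float]]:
    """Return the candidate schedule with the lowest weighted cost."""
    return min(candidates,
               key=lambda c: schedule_cost(c["node_times"], c["wait_times"], weight))


# Two made-up candidate schedules for two processing nodes (times in seconds).
balanced = {"node_times": [10.0, 10.0], "wait_times": [4.0, 4.0]}
low_wait = {"node_times": [14.0, 8.0], "wait_times": [1.0, 1.0]}

chosen_equal = pick_schedule([balanced, low_wait], weight=0.5)
chosen_balance_heavy = pick_schedule([balanced, low_wait], weight=0.9)
```

With equal weights the low-wait schedule has the smaller cost, while a weight that emphasizes balance selects the evenly loaded one, which shows how the weighting trades the two factors against each other.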

International Parallel and Distributed Processing Symposium | 2007

A Performance Prediction Framework for Grid-Based Data Mining Applications

Leonid Glimcher; Gagan Agrawal

For a grid middleware to perform resource allocation, prediction models are needed, which can determine how long an application will take for completion on a particular platform or configuration. In this paper, we take the approach that by focusing on the characteristics of the class of applications a middleware is suited for, we can develop simple performance models that can be very accurate in practice. The particular middleware we consider is FREERIDE-G (FRamework for Rapid Implementation of Datamining Engines in Grid), which supports a high-level interface for developing data mining and scientific data processing applications that involve data stored in remote repositories. The FREERIDE-G system needs detailed performance models for performing resource selection, i.e., choosing computing nodes and a replica of the dataset. This paper presents and evaluates such a performance model. By exploiting the fact that the processing structure of data mining and scientific data analysis applications developed on FREERIDE-G involves generalized reductions, we are able to develop an accurate performance prediction model. We have evaluated our model using implementations of three well-known data mining algorithms and two scientific data analysis applications developed using FREERIDE-G. Results from these five applications show that we are able to accurately predict execution times for applications as we vary the number of storage nodes, number of nodes available for computation, the dataset size, the network bandwidth, and the underlying hardware.

International Parallel and Distributed Processing Symposium | 2005

Parallelizing a defect detection and categorization application

Leonid Glimcher; Gagan Agrawal; Sameep Mehta; Ruoming Jin; Raghu Machiraju

This paper presents a case study in creating a parallel and scalable implementation of a scientific data analysis application. We focus on a defect detection and categorization application which analyzes datasets produced by molecular dynamics (MD) simulations. In parallelizing this application, we had the following three goals. First, we obviously wanted to achieve high parallel efficiency. Second, we wanted to create an implementation that can scale to disk-resident datasets. Third, we wanted to create an implementation that is easy to maintain and modify, which is possible only through using high-level interfaces. We used a number of techniques for organizing the input data, achieving load balance, and efficiently parallelizing the step for updating and matching with the defect catalog. To meet our third goal, we used a system called FREERIDE (FRamework for Rapid Implementation of Datamining Engines), which was originally developed for parallelizing data mining algorithms. We have carried out a detailed evaluation of our implementation. The main observations from our experiments are as follows: 1) our implementation achieves high parallel efficiency, 2) the execution time remains proportional to the amount of computation even as the dataset becomes disk-resident, and 3) our scheme for load balancing and the method we use for parallelizing updating and matching of the defect catalog are crucial for parallel efficiency of the defect categorization phase.

Cluster Computing and the Grid | 2008

A Middleware for Developing and Deploying Scalable Remote Mining Services

Leonid Glimcher; Gagan Agrawal

In this paper, we consider the problem of developing service-oriented implementations of data-intensive applications that process data on remote servers. While the existing grid and web-service frameworks allow interoperability and flexible resource utilization, achieving efficiency and scalability remains a critical challenge. Similarly, the existing grid and web-service frameworks do not provide transparency in accessing and processing data from grid-based data servers. We present the design and evaluation of a system that supports a high-level interface for developing data mining and scientific data processing grid services and targets data residing on SRB servers. Results of our evaluation using two data mining applications and one scientific data processing application show two important observations. First, each of the applications we evaluated demonstrated good scalability with respect to dataset size, as well as changing numbers of both data host and compute nodes. Second, there is only a small overhead associated with deploying our middleware-based applications using MPICH-G2 and Globus. This overhead varied between 14% and 22% and is primarily because of a larger memory footprint. Thus, overall, our work shows that it is feasible to develop and deploy scalable and efficient grid services that process data from remote servers.

Proceedings of the 2008 International Workshop on Data-Aware Distributed Computing | 2008

FREERIDE-G: enabling distributed processing of large datasets

Leonid Glimcher; Gagan Agrawal

We have been developing a middleware which enables development, support, and deployment of services that can transparently access and process data from remote servers, are compatible with grid standards and frameworks, and yet are efficient and scalable. Our middleware is referred to as FREERIDE-G (FRamework for Rapid Implementation of Datamining Engines in Grid). We have integrated the middleware with the grid computing standards through the use of the Globus Toolkit, more specifically, MPICH-G2. Another possibility that our middleware needs to consider is that the available data may be spread across multiple clusters. Thus, we need to develop schedules for data movement and processing, which minimize the overheads and achieve load balancing. Since the datasets may be vertically partitioned, we also need to generate wrappers automatically to bridge format differences.

Archive | 2006