Publication


Featured research published by Sangmi Lee Pallickara.


Future Generation Computer Systems | 2013

Exploiting geospatial and chronological characteristics in data streams to enable efficient storage and retrievals

Matthew Malensek; Sangmi Lee Pallickara; Shrideep Pallickara

We describe the design of a high-throughput storage system, Galileo, for data streams generated in observational settings. To cope with data volumes, the shared-nothing architecture in Galileo supports incremental assimilation of nodes, while accounting for heterogeneity in their capabilities. To achieve efficient storage and retrievals of data, Galileo accounts for the geospatial and chronological characteristics of such time-series observational data streams. Our benchmarks demonstrate that Galileo supports high-throughput storage and efficient retrievals of specific portions of large datasets while supporting different types of queries.
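
The abstract describes the partitioning strategy only at a high level. As an illustration of the general idea, the sketch below combines a coarse geospatial prefix (a geohash) with a chronological bucket to form a partition key, so that spatio-temporal queries touch few partitions. The precision, time-bucket format, and function names are assumptions for this example, not Galileo's actual implementation:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=4):
    """Interleave longitude/latitude bits into a base-32 geohash prefix;
    nearby points share prefixes, so they land in the same partition."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    code, bits, ch, even = [], 0, 0, True
    while len(code) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        ch = ch * 2 + (1 if val >= mid else 0)
        if val >= mid:
            rng[0] = mid
        else:
            rng[1] = mid
        even = not even
        bits += 1
        if bits == 5:  # five bits make one base-32 character
            code.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(code)

def partition_key(lat, lon, time_bucket):
    """Combine the coarse spatial prefix with a chronological bucket so a
    spatio-temporal query only needs to visit a few partitions."""
    return f"{geohash(lat, lon)}:{time_bucket}"
```

A query constrained to a region and time window then resolves to a small, enumerable set of such keys rather than a scan of the whole dataset.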


Computing in Science and Engineering | 2005

Cooperating services for data-driven computational experimentation

Beth Plale; Dennis Gannon; Yi Huang; Gopi Kandaswamy; Sangmi Lee Pallickara; Aleksander Slominski

The Linked Environments for Atmospheric Discovery (LEAD) project seeks to provide on-demand weather forecasting. A triad of cooperating services provides the core functionality needed to execute experiments and manage the data. In this article, we focus on three MyLEAD services (the metadata catalog service, notification service, and workflow service) that together form the core services for managing complex experimental meteorological investigations and managing the data products used in and generated during the computational experimentation. We show how the services work together on the users' behalf, easing the technological burden on the scientists and freeing them to focus on more of the science that compels them.


IEEE/ACM International Conference on Utility and Cloud Computing | 2013

Polygon-Based Query Evaluation over Geospatial Data Using Distributed Hash Tables

Matthew Malensek; Sangmi Lee Pallickara; Shrideep Pallickara

Data volumes in the geosciences and related domains have grown significantly as sensing equipment designed to continuously gather readings and produce data streams for geographic regions have proliferated. The storage requirements imposed by these datasets vastly outstrip the capabilities of a single computing resource, leading to the use and development of distributed storage frameworks composed of commodity hardware. In this paper, we explore the challenges associated with supporting geospatial retrievals constrained by arbitrary polygonal bounds on a distributed hash table architecture. Our solution involves novel distribution and partitioning of these voluminous datasets, thus enabling the use of a lightweight, distributed spatial indexing structure, the geoavailability grid. Geoavailability grids provide global, coarse-grained representations of the spatial information stored within these ever-expanding datasets, allowing the search space of distributed queries to be reduced by eliminating storage resources that do not hold relevant information. This results in improved response times and more effective utilization of available resources. Geoavailability grids are also applicable in non-distributed settings for local lookup functionality, performing competitively with other leading spatial indexing technology.
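
The geoavailability grid described above can be pictured with a small sketch: each storage node advertises a bitmap of coarse grid cells for which it holds data, and a polygon query, once rasterized to cells, prunes every node whose bitmap does not intersect those cells. The grid resolution, class names, and pruning API below are hypothetical simplifications of the paper's design:

```python
GRID_W, GRID_H = 36, 18  # 10-degree cells over longitude/latitude

def cell(lat, lon):
    """Map a coordinate to a coarse global grid cell index."""
    col = min(int((lon + 180) // 10), GRID_W - 1)
    row = min(int((lat + 90) // 10), GRID_H - 1)
    return row * GRID_W + col

class GeoavailabilityGrid:
    """One bit per grid cell: set if this node stores data in the cell."""
    def __init__(self):
        self.bits = 0
    def add_point(self, lat, lon):
        self.bits |= 1 << cell(lat, lon)
    def intersects(self, query_cells):
        mask = 0
        for c in query_cells:
            mask |= 1 << c
        return (self.bits & mask) != 0

def prune(nodes, query_cells):
    """Keep only the nodes whose grids may hold relevant data, shrinking
    the search space of a distributed query."""
    return [name for name, grid in nodes.items() if grid.intersects(query_cells)]
```

The coarse grain is the point: the bitmap is tiny and cheap to exchange, while false positives are tolerable because pruned-in nodes still apply their exact local indexes.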


IEEE International Conference on Utility and Cloud Computing | 2012

Expressive Query Support for Multidimensional Data in Distributed Hash Tables

Matthew Malensek; Sangmi Lee Pallickara; Shrideep Pallickara

The quantity and precision of geospatial and time series observational data being collected has increased in tandem with the steady expansion of processing and storage capabilities in modern computing hardware. The storage requirements for this information are vastly greater than the capabilities of a single computer, and are primarily met in a distributed manner. However, distributed solutions often impose strict constraints on retrieval semantics. In this paper, we investigate the factors that influence storage and retrieval operations on large datasets in a cloud setting, and propose a lightweight data partitioning and indexing scheme to facilitate these operations. Our solution provides expressive retrieval support through range-based and exact-match queries and can be applied over massive quantities of multidimensional data. We provide benchmarks to illustrate the relative advantage of using our solution over an established cloud storage engine in a distributed network of heterogeneous computing resources.
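
A minimal sketch of the general technique: quantize each dimension into buckets so that every record maps to one DHT key, and a range query maps to an enumerable set of keys. The feature space, bucket counts, and function names are illustrative assumptions, not the paper's actual scheme:

```python
import itertools

def bucket(value, lo, hi, n):
    """Quantize a value into one of n equal-width buckets over [lo, hi)."""
    b = int((value - lo) / (hi - lo) * n)
    return max(0, min(n - 1, b))

DIMS = {  # hypothetical feature space: name -> (lo, hi, bucket count)
    "temperature": (-40.0, 60.0, 10),
    "humidity": (0.0, 100.0, 10),
}

def storage_key(record):
    """The DHT key for a record is the tuple of its bucket indices."""
    return tuple(bucket(record[d], *DIMS[d]) for d in DIMS)

def keys_for_range(ranges):
    """Enumerate every DHT key a range query must visit; unconstrained
    dimensions span all of their buckets."""
    per_dim = []
    for d, (lo, hi, n) in DIMS.items():
        qlo, qhi = ranges.get(d, (lo, hi))
        per_dim.append(range(bucket(qlo, lo, hi, n), bucket(qhi, lo, hi, n) + 1))
    return list(itertools.product(*per_dim))
```

This recovers exact-match and range semantics on top of a plain hash table, at the cost of contacting one node per enumerated key; choosing bucket widths is the tuning knob.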


IEEE International Conference on eScience | 2008

SWARM: Scheduling Large-Scale Jobs over the Loosely-Coupled HPC Clusters

Sangmi Lee Pallickara; Marlon E. Pierce

Compute-intensive scientific applications are heavily reliant on the available quantity of computing resources. The Grid paradigm provides a large-scale computing environment for scientific users. However, conventional Grid job submission tools do not provide a high-level job scheduling environment for these users across multiple institutions. For extremely large numbers of jobs, a more scalable job scheduling framework that can leverage highly distributed clusters and supercomputers is required. In this paper, we propose a high-level job scheduling Web service framework, Swarm. Swarm is developed for scientific applications that must submit a massive number of high-throughput jobs or workflows to highly distributed computing clusters. The Swarm service itself is designed to be extensible, lightweight, and easily installable on a desktop or small server. As a Web service, derivative services based on Swarm can be straightforwardly integrated with Web portals and science gateways. This paper provides the motivation for this research, the architecture of the Swarm framework, and a performance evaluation of the system prototype.
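
Swarm's scheduling internals are not spelled out in the abstract; as an illustration of the underlying problem only, here is a toy greedy scheduler that spreads jobs across heterogeneous clusters in proportion to their capacity. The cluster names and the capacity model are invented for the example:

```python
import heapq

def schedule(jobs, clusters):
    """Assign each job to the cluster with the lowest load-to-capacity
    ratio, so larger clusters absorb proportionally more jobs."""
    heap = [(0.0, name) for name in clusters]  # (relative load, cluster)
    heapq.heapify(heap)
    loads = {name: 0 for name in clusters}
    assignment = {}
    for job in jobs:
        _, name = heapq.heappop(heap)
        assignment[job] = name
        loads[name] += 1
        heapq.heappush(heap, (loads[name] / clusters[name], name))
    return assignment
```

A real framework must additionally handle job failure, queue-depth feedback from each cluster's local batch system, and credential management; the sketch captures only the load-balancing core.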


International Conference on Information Technology: Coding and Computing | 2005

An analysis of reliable delivery specifications for Web services

Shrideep Pallickara; Geoffrey C. Fox; Sangmi Lee Pallickara

Reliable delivery of messages is now a key component of the Web Services roadmap, with two promising, and competing, specifications in this area: WS-Reliability (WSR) from OASIS and WS-ReliableMessaging (WSRM) from IBM and Microsoft. In this paper we provide an analysis of these specifications. Our investigations have been aimed at identifying the similarities and the divergences in philosophy between these specifications. We also include a gap analysis and recommendations regarding the gaps it identifies.


IEEE International Conference on Utility and Cloud Computing | 2011

Galileo: A Framework for Distributed Storage of High-Throughput Data Streams

Matthew Malensek; Sangmi Lee Pallickara; Shrideep Pallickara

We describe the design of a high-throughput storage system, Galileo, for data streams generated in observational settings. The shared-nothing architecture in Galileo supports incremental assimilation of nodes, while accounting for heterogeneity in their capabilities, to cope with data volumes. To achieve efficient storage and retrievals of data, Galileo accounts for the geospatial and chronological characteristics of such time-series observational data streams. Our benchmarks demonstrate that Galileo supports high-throughput storage and efficient retrievals of specific portions of large datasets while supporting different types of queries.


Future Generation Computer Systems | 2016

Predictive analytics using statistical, learning, and ensemble methods to support real-time exploration of discrete event simulations

Walid Budgaga; Matthew Malensek; Sangmi Lee Pallickara; Neil Harvey; F. Jay Breidt; Shrideep Pallickara

Discrete event simulations (DES) provide a powerful means for modeling complex systems and analyzing their behavior. DES capture all possible interactions between the entities they manage, which makes them highly expressive but also compute-intensive. These computational requirements often impose limitations on the breadth and/or depth of research that can be conducted with a discrete event simulation.

This work describes our approach for leveraging the vast quantity of computing and storage resources available in both private organizations and public clouds to enable real-time exploration of discrete event simulations. Rather than directly targeting simulation execution speeds, we autonomously generate and execute novel scenario variants to explore a representative subset of the simulation parameter space. The corresponding outputs from this process are analyzed and used by our framework to produce models that accurately forecast simulation outcomes in real time, providing interactive feedback and facilitating exploratory research.

Our framework distributes the workloads associated with generating and executing scenario variants across a range of commodity hardware, including public and private cloud resources. Once the models have been created, we evaluate their performance and improve prediction accuracy by employing dimensionality reduction techniques and ensemble methods. To make these models highly accessible, we provide a user-friendly interface that allows modelers and epidemiologists to modify simulation parameters and see projected outcomes in real time.

Highlights: Our approach enables fast, accurate forecasts of discrete event simulations. The framework copes with high dimensionality and voluminous datasets. We facilitate simulation execution with cycle scavenging and cloud resources. We create and evaluate several predictive models, including ensemble methods. Our framework is made accessible to end users through an interactive web interface.
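
The pipeline can be pictured, very loosely, as a surrogate model that stands in for the expensive simulation: sample the parameter space offline, record outcomes, then answer interactive queries with a cheap estimate. The k-nearest-neighbour model below is an invented stand-in for the statistical, learning, and ensemble methods the paper actually evaluates:

```python
import math

class SimulationSurrogate:
    """Answer 'what would the simulation produce for these parameters?'
    from previously recorded runs, without re-running the simulation."""
    def __init__(self, k=3):
        self.samples = []  # list of (parameter tuple, outcome)
        self.k = k
    def observe(self, params, outcome):
        """Record one completed simulation run."""
        self.samples.append((params, outcome))
    def predict(self, params):
        """Average the outcomes of the k nearest recorded runs."""
        dists = sorted((math.dist(params, p), y) for p, y in self.samples)
        nearest = dists[: self.k]
        return sum(y for _, y in nearest) / len(nearest)
```

The interactive property follows from the split: the expensive scenario runs happen offline on cloud or scavenged cycles, while `predict` is cheap enough to call on every slider movement in a web interface.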


Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference | 2013

Autonomously improving query evaluations over multidimensional data in distributed hash tables

Matthew Malensek; Sangmi Lee Pallickara; Shrideep Pallickara

The proliferation of observational devices and sensors with networking capabilities has led to growth in both the rates and sources of data that ultimately contribute to extreme-scale data volumes. Datasets generated in such settings are often multidimensional, with each dimension accounting for a feature of interest. We posit that efficient evaluation of queries over such datasets must account for both the distribution of data values and the patterns in the queries themselves. Configuring query evaluation by hand is infeasible given the data volumes, dimensionality, and the rates at which new data and queries arrive. In this paper, we describe our algorithm to autonomously improve query evaluations over voluminous, distributed datasets. Our approach autonomously tunes for the most dominant query patterns and distribution of values across a dimension. We evaluate our algorithm in the context of our system, Galileo, which is a hierarchical distributed hash table used for managing geospatial, time-series data. Our system strikes a balance between memory utilization, fast evaluations, and search space reductions. Empirical evaluations reported here are performed on a dataset that is multidimensional and comprises a billion files. The schemes described in this work are broadly applicable to any system that leverages distributed hash tables as a storage mechanism.
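
As a rough illustration of the autonomous-tuning idea, a system might track which dimensions incoming queries constrain most often and direct finer index granularity at the dominant one. The sketch below is an invented simplification, not the paper's algorithm:

```python
from collections import Counter

class QueryTuner:
    """Observe query patterns and surface the most frequently constrained
    dimension as the candidate for finer-grained indexing."""
    def __init__(self):
        self.counts = Counter()
    def record(self, constrained_dims):
        """Note which dimensions one incoming query placed constraints on."""
        self.counts.update(constrained_dims)
    def dominant_dimension(self):
        if not self.counts:
            return None
        (dim, _), = self.counts.most_common(1)
        return dim
```

The point of doing this autonomously is scale: with billions of files and continuously arriving queries, no administrator can re-derive these statistics by hand.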


Future Generation Computer Systems | 2012

Towards efficient data search and subsetting of large-scale atmospheric datasets

Sangmi Lee Pallickara; Shrideep Pallickara; Milija Zupanski

Discovering the correct dataset in an efficient fashion is critical for effective simulations in the atmospheric sciences. Unlike text-based web documents, many large scientific datasets contain binary-encoded data that is hard to discover using popular search engines. In the atmospheric sciences, there has been significant growth in public data hosting services. However, the ability to index and search has been limited by the metadata provided by the data host. We have developed an infrastructure, the Atmospheric Data Discovery System (ADDS), that provides an efficient data discovery environment for observational datasets in the atmospheric sciences. To support complex querying capabilities, we automatically extract and index fine-grained metadata. Datasets are indexed based on periodic crawling of popular sites and also of files requested by users. Users are allowed to access subsets of a large dataset through our data customization feature. Our focus is the overall architecture, the data subsetting scheme, and a performance evaluation of our system.
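
The fine-grained metadata indexing described above can be pictured as an inverted index from (attribute, value) pairs to dataset identifiers; conjunctive queries then intersect posting sets. The record fields and the API below are hypothetical, not ADDS's actual schema:

```python
from collections import defaultdict

class MetadataIndex:
    """Inverted index over metadata extracted from (e.g. binary-encoded)
    atmospheric files, enabling attribute-based dataset discovery."""
    def __init__(self):
        self.postings = defaultdict(set)  # (key, value) -> dataset ids
    def add(self, dataset_id, metadata):
        for key, value in metadata.items():
            self.postings[(key, value)].add(dataset_id)
    def search(self, **criteria):
        """Return dataset ids matching every supplied attribute."""
        results = None
        for key, value in criteria.items():
            matches = self.postings.get((key, value), set())
            results = matches if results is None else results & matches
        return results or set()
```

Extraction is the hard part in practice, since the attributes must be pulled out of domain-specific binary formats during crawling; the index itself is straightforward once records exist.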

Collaboration


Dive into Sangmi Lee Pallickara's collaborations.

Top Co-Authors

Beth Plale, Indiana University Bloomington
Geoffrey C. Fox, Indiana University Bloomington
Scott Jensen, Indiana University Bloomington
Yiming Sun, Indiana University Bloomington
Dennis Gannon, Indiana University Bloomington
Yogesh Simmhan, Indian Institute of Science