Matthew Malensek
Colorado State University
Publications
Featured research published by Matthew Malensek.
Future Generation Computer Systems | 2013
Matthew Malensek; Sangmi Lee Pallickara; Shrideep Pallickara
We describe the design of a high-throughput storage system, Galileo, for data streams generated in observational settings. To cope with data volumes, the shared-nothing architecture in Galileo supports incremental assimilation of nodes, while accounting for heterogeneity in their capabilities. To achieve efficient storage and retrievals of data, Galileo accounts for the geospatial and chronological characteristics of such time-series observational data streams. Our benchmarks demonstrate that Galileo supports high-throughput storage and efficient retrievals of specific portions of large datasets while supporting different types of queries.
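The abstract does not detail Galileo's partitioning scheme, but the idea of keying storage on both geospatial and chronological attributes can be illustrated with a short sketch. Everything below is hypothetical (the cell size, bucket width, node names, and routing function are invented for illustration, not Galileo's implementation):

```python
import hashlib
from datetime import datetime, timezone

def partition_key(lat, lon, timestamp, cell_deg=1.0, bucket_hours=24):
    """Derive a partition key from coarse spatial and temporal buckets.

    cell_deg and bucket_hours are invented knobs for this sketch.
    """
    row = int((lat + 90.0) // cell_deg)
    col = int((lon + 180.0) // cell_deg)
    bucket = int(timestamp.timestamp() // (bucket_hours * 3600))
    return f"{row}:{col}:{bucket}"

def route(key, nodes):
    """Hash the key onto one of the cluster's storage nodes."""
    digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

nodes = ["node-a", "node-b", "node-c"]  # hypothetical cluster members
obs_time = datetime(2013, 6, 1, 12, 0, tzinfo=timezone.utc)
print(route(partition_key(40.57, -105.08, obs_time), nodes))
```

Keying on space and time together keeps observations for the same region and period co-located, which is the property that makes range-constrained retrievals over such streams cheap.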
IEEE/ACM International Conference on Utility and Cloud Computing | 2013
Matthew Malensek; Sangmi Lee Pallickara; Shrideep Pallickara
Data volumes in the geosciences and related domains have grown significantly as sensing equipment designed to continuously gather readings and produce data streams for geographic regions have proliferated. The storage requirements imposed by these datasets vastly outstrip the capabilities of a single computing resource, leading to the use and development of distributed storage frameworks composed of commodity hardware. In this paper, we explore the challenges associated with supporting geospatial retrievals constrained by arbitrary polygonal bounds on a distributed hash table architecture. Our solution involves novel distribution and partitioning of these voluminous datasets, thus enabling the use of a lightweight, distributed spatial indexing structure, the geoavailability grid. Geoavailability grids provide global, coarse-grained representations of the spatial information stored within these ever-expanding datasets, allowing the search space of distributed queries to be reduced by eliminating storage resources that do not hold relevant information. This results in improved response times and more effective utilization of available resources. Geoavailability grids are also applicable in non-distributed settings for local lookup functionality, performing competitively with other leading spatial indexing technology.
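As a rough sketch of the coarse-grained indexing idea, the structure below records which storage nodes hold data for each grid cell so a query can skip irrelevant nodes. It is a simplified stand-in (a dictionary of sets rather than a true bitmap, and point-based rather than polygon-based lookups), not the paper's geoavailability grid implementation:

```python
class GeoavailabilityGrid:
    """Coarse spatial availability index: which nodes hold data per grid cell.

    A simplified stand-in for the paper's bitmap grid (hypothetical API).
    """
    def __init__(self, cell_deg=1.0):
        self.cell_deg = cell_deg
        self.cells = {}  # (row, col) -> set of node ids with data in that cell

    def _cell(self, lat, lon):
        return (int((lat + 90.0) // self.cell_deg),
                int((lon + 180.0) // self.cell_deg))

    def register(self, node, lat, lon):
        """Record that `node` stores at least one reading in this cell."""
        self.cells.setdefault(self._cell(lat, lon), set()).add(node)

    def candidates(self, points):
        """Nodes that may hold data for any (lat, lon) in the query region."""
        hits = set()
        for lat, lon in points:
            hits |= self.cells.get(self._cell(lat, lon), set())
        return hits

grid = GeoavailabilityGrid()
grid.register("node-a", 40.5, -105.5)
grid.register("node-b", 35.2, -97.4)
print(grid.candidates([(40.6, -105.3)]))  # only node-a needs to be queried
```

Because the grid is coarse, it can only rule nodes out, never in: any node it returns must still be queried, but every node it omits is guaranteed to hold no relevant data.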
Utility and Cloud Computing | 2012
Matthew Malensek; Sangmi Lee Pallickara; Shrideep Pallickara
The quantity and precision of geospatial and time series observational data being collected has increased in tandem with the steady expansion of processing and storage capabilities in modern computing hardware. The storage requirements for this information are vastly greater than the capabilities of a single computer, and are primarily met in a distributed manner. However, distributed solutions often impose strict constraints on retrieval semantics. In this paper, we investigate the factors that influence storage and retrieval operations on large datasets in a cloud setting, and propose a lightweight data partitioning and indexing scheme to facilitate these operations. Our solution provides expressive retrieval support through range-based and exact-match queries and can be applied over massive quantities of multidimensional data. We provide benchmarks to illustrate the relative advantage of using our solution over an established cloud storage engine in a distributed network of heterogeneous computing resources.
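The paper's partitioning and indexing scheme is not spelled out in the abstract; a minimal single-node sketch of an index supporting both exact-match and range queries over one dimension might look like the following (class and method names are invented):

```python
import bisect
from collections import defaultdict

class DimensionIndex:
    """Sorted single-dimension index for exact-match and range queries (sketch)."""
    def __init__(self):
        self.keys = []                      # sorted distinct attribute values
        self.postings = defaultdict(list)   # value -> ids of matching records

    def insert(self, value, record_id):
        if value not in self.postings:
            bisect.insort(self.keys, value)
        self.postings[value].append(record_id)

    def exact(self, value):
        return list(self.postings.get(value, []))

    def range(self, lo, hi):
        """All record ids with lo <= value <= hi, via binary search."""
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_right(self.keys, hi)
        return [rid for key in self.keys[i:j] for rid in self.postings[key]]

idx = DimensionIndex()
for rid, temp in enumerate([12.1, 18.4, 25.0, 17.2]):
    idx.insert(temp, rid)
print(idx.range(15.0, 20.0))  # -> [3, 1] (the records with 17.2 and 18.4)
```

In a distributed deployment, one such index per dimension per node, combined with a partitioning function that bounds which nodes a query can touch, is the general shape of the lightweight scheme the abstract describes.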
Utility and Cloud Computing | 2011
Matthew Malensek; Sangmi Lee Pallickara; Shrideep Pallickara
We describe the design of a high-throughput storage system, Galileo, for data streams generated in observational settings. The shared-nothing architecture in Galileo supports incremental assimilation of nodes, while accounting for heterogeneity in their capabilities, to cope with data volumes. To achieve efficient storage and retrievals of data, Galileo accounts for the geospatial and chronological characteristics of such time-series observational data streams. Our benchmarks demonstrate that Galileo supports high-throughput storage and efficient retrievals of specific portions of large datasets while supporting different types of queries.
Future Generation Computer Systems | 2016
Walid Budgaga; Matthew Malensek; Sangmi Lee Pallickara; Neil Harvey; F. Jay Breidt; Shrideep Pallickara
Discrete event simulations (DES) provide a powerful means for modeling complex systems and analyzing their behavior. DES capture all possible interactions between the entities they manage, which makes them highly expressive but also compute-intensive. These computational requirements often impose limitations on the breadth and/or depth of research that can be conducted with a discrete event simulation.

This work describes our approach for leveraging the vast quantity of computing and storage resources available in both private organizations and public clouds to enable real-time exploration of discrete event simulations. Rather than directly targeting simulation execution speeds, we autonomously generate and execute novel scenario variants to explore a representative subset of the simulation parameter space. The corresponding outputs from this process are analyzed and used by our framework to produce models that accurately forecast simulation outcomes in real time, providing interactive feedback and facilitating exploratory research.

Our framework distributes the workloads associated with generating and executing scenario variants across a range of commodity hardware, including public and private cloud resources. Once the models have been created, we evaluate their performance and improve prediction accuracy by employing dimensionality reduction techniques and ensemble methods. To make these models highly accessible, we provide a user-friendly interface that allows modelers and epidemiologists to modify simulation parameters and see projected outcomes in real time.

Highlights:
- Our approach enables fast, accurate forecasts of discrete event simulations.
- The framework copes with high dimensionality and voluminous datasets.
- We facilitate simulation execution with cycle scavenging and cloud resources.
- We create and evaluate several predictive models, including ensemble methods.
- Our framework is made accessible to end users through an interactive web interface.
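As an illustration of the general surrogate-modeling workflow the abstract describes (sample scenario variants, execute them, then train a reduced-dimension ensemble model that forecasts outcomes in real time), here is a minimal sketch using scikit-learn. The simulation stub, parameter counts, and model choices are assumptions, not the paper's configuration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

def run_simulation(params):
    """Stand-in for one full discrete event simulation run (hypothetical)."""
    return params @ np.array([0.5, -1.2, 2.0, 0.3]) + rng.normal(scale=0.1)

# Autonomously sampled scenario variants covering the parameter space.
variants = rng.uniform(0.0, 1.0, size=(500, 4))
outcomes = np.array([run_simulation(v) for v in variants])

# Dimensionality reduction followed by an ensemble surrogate model.
pca = PCA(n_components=3).fit(variants)
surrogate = RandomForestRegressor(n_estimators=100, random_state=0)
surrogate.fit(pca.transform(variants), outcomes)

# Real-time forecast for a new scenario, without executing the simulation.
new_scenario = rng.uniform(0.0, 1.0, size=(1, 4))
print(surrogate.predict(pca.transform(new_scenario)))
```

The expensive step (running simulations) happens once, offline and in parallel; afterwards every what-if question costs only a model prediction, which is what makes the interactive interface feasible.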
Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference | 2013
Matthew Malensek; Sangmi Lee Pallickara; Shrideep Pallickara
The proliferation of observational devices and sensors with networking capabilities has led to growth in both the rates and sources of data that ultimately contribute to extreme-scale data volumes. Datasets generated in such settings are often multidimensional, with each dimension accounting for a feature of interest. We posit that efficient evaluation of queries over such datasets must account for both the distribution of data values and the patterns in the queries themselves. Configuring query evaluation by hand is infeasible given the data volumes, dimensionality, and the rates at which new data and queries arrive. In this paper, we describe our algorithm to autonomously improve query evaluations over voluminous, distributed datasets. Our approach autonomously tunes for the most dominant query patterns and the distribution of values across a dimension. We evaluate our algorithm in the context of our system, Galileo, which is a hierarchical distributed hash table used for managing geospatial, time-series data. Our system strikes a balance between memory utilization, fast evaluations, and search space reductions. Empirical evaluations reported here are performed on a dataset that is multidimensional and comprises a billion files. The schemes described in this work are broadly applicable to any system that leverages distributed hash tables as a storage mechanism.
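One way to picture autonomous tuning toward dominant query patterns is a counter that promotes the most frequently queried dimensions to indexed status under a fixed memory budget. This is a toy sketch of the concept, not the algorithm evaluated in the paper:

```python
from collections import Counter

class AdaptiveIndexer:
    """Promote the most frequently queried dimensions to indexed status.

    A toy model of pattern-driven tuning; `budget` caps memory spent on indexes.
    """
    def __init__(self, budget=2):
        self.budget = budget
        self.query_counts = Counter()
        self.indexed = set()

    def observe_query(self, dimensions):
        self.query_counts.update(dimensions)
        hot = {dim for dim, _ in self.query_counts.most_common(self.budget)}
        if hot != self.indexed:
            self.indexed = hot  # in a real system: (re)build those indexes

idx = AdaptiveIndexer()
for q in [("temperature", "humidity"), ("temperature",), ("temperature", "wind")]:
    idx.observe_query(q)
print(idx.indexed)  # the two dominant dimensions, e.g. {'temperature', 'humidity'}
```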
IEEE International Conference on Cloud Computing Technology and Science | 2017
Matthew Malensek; Sangmi Lee Pallickara; Shrideep Pallickara
Networked observational devices and remote sensing equipment continue to proliferate and contribute to the accumulation of extreme-scale datasets. Both the rate and resolution of the readings produced by these devices have grown over time, exacerbating the issues surrounding their storage and management. In many cases, the sheer scale of the information being maintained makes timely analysis infeasible due to the computational workloads required to process the data. While distributed solutions provide a scalable way to cope with data volumes, the communication and latency involved when inspecting large portions of an overall dataset limit applications that require frequent or rapid responses to incoming queries. This study investigates the challenges associated with providing approximate or exploratory answers to distributed queries. In many situations, this requires striking a balance between response times and error rates to produce meaningful results. To enable these use cases, we outline several expressive query constructs and describe their implementation; rather than relying on summary tables or pre-computed samples, our solution involves a coarse-grained global index that maintains statistics and models the relationships across dimensions in the dataset. To illustrate the benefits of these techniques, we include performance benchmarks on a real-world dataset in a production environment.
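A coarse-grained index that maintains statistics rather than raw samples might build on streaming aggregates. The sketch below uses Welford's online algorithm to keep a running mean and variance per index entry, yielding an approximate answer plus a rough error bound; it is an illustrative stand-in for the paper's richer cross-dimensional models:

```python
import math

class OnlineStats:
    """Welford's streaming mean/variance: statistics without storing raw data."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def estimate(self):
        """Approximate mean plus its standard error as a rough error bound."""
        if self.n < 2:
            return self.mean, float("inf")
        variance = self.m2 / (self.n - 1)
        return self.mean, math.sqrt(variance / self.n)

cell = OnlineStats()            # e.g. one entry of a coarse global index
for reading in [21.3, 19.8, 22.6, 20.1, 21.9]:
    cell.update(reading)
print(cell.estimate())          # (mean, standard error)
```

Answering a query from such synopses touches only the index, never the underlying files, which is how response time is traded against error rate.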
IEEE Transactions on Knowledge and Data Engineering | 2016
Matthew Malensek; Sangmi Lee Pallickara; Shrideep Pallickara
As remote sensing equipment and networked observational devices continue to proliferate, their corresponding data volumes have surpassed the storage and processing capabilities of commodity computing hardware. This trend has led to the development of distributed storage frameworks that incrementally scale out by assimilating resources as necessary. While challenging in its own right, storing and managing voluminous datasets is only the precursor to a broader field of research: extracting insights, relationships, and models from the underlying datasets. The focus of this study is twofold: exploratory and predictive analytics over voluminous, multidimensional datasets in a distributed environment. Both of these types of analysis represent a higher-level abstraction over standard query semantics; rather than indexing every discrete value for subsequent retrieval, our framework autonomously learns the relationships and interactions between dimensions in the dataset and makes the information readily available to users. This functionality includes statistical synopses, correlation analysis, hypothesis testing, probabilistic structures, and predictive models that not only enable the discovery of nuanced relationships between dimensions, but also allow future events and trends to be predicted. The algorithms presented in this work were evaluated empirically on a real-world geospatial time-series dataset in a production environment, and are broadly applicable across other storage frameworks.
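To make the kinds of analysis listed above concrete, the snippet below runs a correlation analysis with an accompanying significance test and fits a simple predictive model over two synthetic dimensions. The data and variable names are fabricated for illustration; the paper's framework computes such synopses online inside the storage system:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
temperature = rng.normal(15.0, 8.0, size=1000)                    # synthetic dimension
humidity = 80.0 - 1.5 * temperature + rng.normal(0, 5, size=1000)

# Correlation analysis with an accompanying hypothesis test (p-value).
r, p_value = pearsonr(temperature, humidity)
print(f"r = {r:.3f}, p = {p_value:.2e}")

# A simple predictive model over the discovered relationship: least-squares fit.
slope, intercept = np.polyfit(temperature, humidity, 1)
print(f"predicted humidity at 20 C: {slope * 20 + intercept:.1f}")
```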
Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference | 2013
Matthew Malensek; Zhiquan Sui; Neil Harvey; Shrideep Pallickara
Discrete event simulations model the behavior of complex, real-world systems. Simulating a wide range of relevant events and conditions naturally provides a more accurate model, but also increases the computational workload associated with the simulation. To manage these processing requirements in a scalable manner, a discrete event simulation can be distributed across a number of computing resources. However, individual tasks in the simulation are stateful, and therefore require inter-task communication and synchronization to produce an accurate model. This property not only complicates the orchestration of the discrete event simulation in a distributed setting, but also makes providing reliable, fault-tolerant execution a challenge, especially when compared to conventional distributed fault tolerance schemes. In this paper, we propose an autonomous agent that provides fault tolerance functionality for discrete event simulations by predicting state changes in the simulation and adjusting its fault tolerance policy accordingly. This allows the system to avoid negatively impacting overall execution times while preserving reliability guarantees. To underscore the viability of our solution, we provide benchmarks of a production discrete event simulation that can sustain failures while running under the supervision of our fault tolerance framework.
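The adaptive fault tolerance policy described above can be sketched as a checkpointing agent that forecasts state-change activity and shortens or lengthens its checkpoint interval accordingly. The smoothing-based predictor and its constants below are assumptions for illustration, not the paper's prediction mechanism:

```python
class AdaptiveCheckpointer:
    """Shorten or lengthen the checkpoint interval based on predicted activity.

    The smoothing predictor and constants are invented for this sketch.
    """
    def __init__(self, base_interval=100, alpha=0.3):
        self.base_interval = base_interval   # ticks between checkpoints at rest
        self.alpha = alpha                   # smoothing factor for the forecast
        self.predicted_rate = 0.0

    def observe(self, state_changes_this_tick):
        # Exponentially weighted forecast of upcoming state-change activity.
        self.predicted_rate = (self.alpha * state_changes_this_tick
                               + (1 - self.alpha) * self.predicted_rate)

    def next_interval(self):
        # Busy simulation -> checkpoint more often; quiescent -> back off.
        return max(10, int(self.base_interval / (1.0 + self.predicted_rate)))

agent = AdaptiveCheckpointer()
for changes in [0, 2, 8, 15]:       # simulated burst of stateful activity
    agent.observe(changes)
print(agent.next_interval())        # shorter interval during the burst
```

Backing off during quiescent periods is what lets the system preserve reliability guarantees without paying a constant checkpointing penalty on execution time.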
IEEE Cloud Computing | 2016
Matthew Malensek; Sangmi Lee Pallickara; Shrideep Pallickara
The breadth and depth of information being generated and stored continues to grow rapidly, causing an information explosion. Observational devices and remote sensing equipment are no exception here, giving researchers new avenues for detecting and predicting phenomena at a global scale. To cope with increasing storage loads, hybrid clouds offer an elastic solution that also satisfies processing and budgetary needs. In this article, the authors describe their algorithms and system design for dealing with voluminous datasets in a hybrid cloud setting. Their distributed storage framework autonomously tunes in-memory data structures and query parameters to ensure efficient retrievals and minimize resource consumption. To circumvent processing hotspots, they predict changes in incoming traffic and federate their query resolution structures to the public cloud for processing. They demonstrate their framework's efficacy on a real-world, petabyte dataset consisting of more than 20 billion files.
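A minimal sketch of the federation decision, assuming a simple smoothed traffic forecast and a fixed headroom threshold (both invented here; the authors' predictor is not described at this level in the abstract):

```python
class FederationController:
    """Decide when to push query resolution to the public cloud (sketch).

    Capacity, threshold, and the smoothing forecast are assumptions here.
    """
    def __init__(self, private_capacity_qps, alpha=0.5, headroom=0.8):
        self.capacity = private_capacity_qps
        self.alpha = alpha
        self.headroom = headroom
        self.predicted_qps = 0.0

    def observe(self, qps):
        # Lightweight traffic forecast via exponential smoothing.
        self.predicted_qps = self.alpha * qps + (1 - self.alpha) * self.predicted_qps

    def should_federate(self):
        """True when predicted load threatens the private cluster's capacity."""
        return self.predicted_qps > self.headroom * self.capacity

ctl = FederationController(private_capacity_qps=5000)
for load in [1200, 2400, 4800, 6100]:   # rising query traffic
    ctl.observe(load)
print(ctl.should_federate())
```

Acting on a forecast rather than on current load gives the system time to stand up public-cloud resources before the hotspot actually arrives.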