Tyler A. Simon
University of Maryland, Baltimore County
Publications
Featured research published by Tyler A. Simon.
ACM Transactions on Storage | 2006
Sudharshan S. Vazhkudai; Xiaosong Ma; Vincent W. Freeh; Jonathan W. Strickland; Nandan Tammineedi; Tyler A. Simon; Stephen L. Scott
High-end computing is suffering a data deluge from experiments, simulations, and apparatus that creates overwhelming application dataset sizes. This has led to the proliferation of high-end mass storage systems, storage area clusters, and data centers. These storage facilities offer a large range of choices in terms of capacity and access rate, as well as strong data availability and consistency support. However, for most end-users, the “last mile” in their analysis pipeline often requires data processing and visualization at local computers, typically local desktop workstations. End-user workstations, despite having more processing power than ever before, are ill-equipped to cope with such data demands due to insufficient secondary storage space and I/O rates. Meanwhile, a large portion of desktop storage is unused. We propose the FreeLoader framework, which aggregates unused desktop storage space and I/O bandwidth into a shared cache/scratch space for hosting large, immutable datasets and exploiting data access locality. This article presents the FreeLoader architecture, component design, and performance results based on our proof-of-concept prototype. The architecture comprises contributing benefactor nodes, steered by a management layer, providing services such as data integrity, high performance, load balancing, and impact control. Our experiments show that FreeLoader is an appealing low-cost solution for storing massive datasets, delivering higher data access rates than traditional storage facilities, namely local or remote shared file systems, storage systems, and Internet data repositories. In particular, we present novel data striping techniques that allow FreeLoader to efficiently aggregate a workstation's network communication bandwidth and local I/O bandwidth. In addition, the performance impact on the native workload of donor machines is small and can be effectively controlled. Further, we show that security features such as data encryption and integrity checks can be easily added as filters for interested clients. Finally, we demonstrate how legacy applications can use the FreeLoader API to store and retrieve datasets.
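The striping idea described above can be illustrated with a minimal sketch: fixed-size chunks of a dataset are assigned round-robin to benefactor nodes, so that retrieval can overlap each node's network and local I/O bandwidth. The names and chunk size here are illustrative assumptions, not the actual FreeLoader API.

```python
CHUNK_SIZE = 1 << 20  # 1 MiB morsels (an assumed stripe unit)

def stripe(dataset: bytes, benefactors: list) -> dict:
    """Assign fixed-size chunks of `dataset` to benefactor nodes
    round-robin; returns {node: [chunk, ...]}."""
    placement = {b: [] for b in benefactors}
    for i in range(0, len(dataset), CHUNK_SIZE):
        node = benefactors[(i // CHUNK_SIZE) % len(benefactors)]
        placement[node].append(dataset[i:i + CHUNK_SIZE])
    return placement

# Four chunks (the last one partial) spread over two donor nodes.
layout = stripe(b"x" * (3 * CHUNK_SIZE + 10), ["nodeA", "nodeB"])
```

A client reading the dataset back would then fetch from all benefactors in parallel, which is what lets the cache exceed any single machine's I/O rate.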
utility and cloud computing | 2012
Phuong Nguyen; Tyler A. Simon; Milton Halem; David R. Chapman; Quang Le
The specific choice of workload task schedulers for Hadoop MapReduce applications can have a dramatic effect on job workload latency. The Hadoop Fair Scheduler (FairS) assigns resources to jobs such that all jobs get, on average, an equal share of resources over time. Thus, it addresses the problem with a FIFO scheduler when short jobs have to wait for long running jobs to complete. We show that even for the FairS, jobs are still forced to wait significantly when the MapReduce system assigns equal sharing of resources, due to dependencies between the Map, Shuffle, Sort, and Reduce phases. We propose a Hybrid Scheduler (HybS) algorithm based on dynamic priority in order to reduce the latency for variable length concurrent jobs, while maintaining data locality. The dynamic priorities can accommodate multiple task lengths, job sizes, and job waiting times by applying a greedy fractional knapsack algorithm for job task processor assignment. The estimated runtime of Map and Reduce tasks is provided to the HybS dynamic priorities from the historical Hadoop log files. In addition to dynamic priority, we implement a reordering of task processor assignment to account for data availability, automatically maintaining the benefits of data locality in this environment. We evaluate our approach by running concurrent workloads consisting of the Word-count and Terasort benchmarks and a satellite scientific data processing workload, as well as by developing a simulator. Our evaluation shows that the HybS system improves the average response time for these workloads by approximately 2.1x over the Hadoop FairS, with a standard deviation of 1.4x.
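The greedy fractional-knapsack step the abstract mentions can be sketched as follows: rank waiting tasks by priority per processor slot and fill the available slots greedily, possibly taking only a fraction of the last job's demand. This is an illustrative reconstruction under assumed inputs, not the authors' HybS code.

```python
def assign_tasks(tasks, slots):
    """tasks: list of (name, priority, slots_needed).
    Returns {name: fraction of its slot demand scheduled this round}."""
    # Greedy fractional knapsack: highest priority density first.
    order = sorted(tasks, key=lambda t: t[1] / t[2], reverse=True)
    assignment = {}
    for name, priority, need in order:
        take = min(need, slots)
        if take > 0:
            assignment[name] = take / need
            slots -= take
    return assignment

jobs = [("short", 9.0, 3), ("long", 8.0, 8), ("batch", 2.0, 4)]
# With 8 free slots: "short" is fully scheduled, "long" partially.
result = assign_tasks(jobs, 8)
```

In HybS the priorities themselves would come from estimated runtimes and waiting times mined from Hadoop logs; here they are fixed numbers for illustration.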
international conference on supercomputing | 2006
Xiaosong Ma; Vincent W. Freeh; Tao Yang; Sudharshan S. Vazhkudai; Tyler A. Simon; Stephen L. Scott
Scientific datasets are typically archived at mass storage systems or data centers close to supercomputers/instruments. End-users of these datasets, however, usually perform parts of their workflows at their local computers. In such cases, client-side caching can offer significant gains by reducing the cost of wide-area data movement. Scientific data caches, however, traditionally cache entire datasets, which may not be necessary. In this paper, we propose a novel combination of prefix caching and collective download. Prefix caching allows the bootstrapping of dataset downloads by caching only a prefix of the dataset, while collective download facilitates efficient parallel patching of the missing suffix from an external data source. To estimate the optimal prefix size, we further present an analytical model that considers both the initial download overhead and the downloading speed. We implemented our proposed approach in the FreeLoader distributed cache prototype. Experimental results (using multiple scientific data repositories and data transfer tools, as well as a real-world scientific dataset access trace) demonstrate that prefix caching and collective download can be implemented efficiently, our model can select an appropriate prefix size, and the cache hit rate can be improved significantly without hurting the local access rate of cached datasets.
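A hedged sketch of a prefix-size model in the spirit described above: cache just enough prefix that, while the client consumes it at the local access rate, the missing suffix can be patched in from the remote source in parallel. The paper's actual model may differ; this is the simplest form consistent with the description, with all rates and the overhead term assumed.

```python
def min_prefix_bytes(dataset_size, local_rate, wan_rate, startup_overhead):
    """Smallest prefix P such that consuming P at local_rate covers
    fetching the suffix (dataset_size - P) at wan_rate plus a fixed
    startup overhead (seconds). Rates are bytes/second."""
    # Solve P / local_rate = (dataset_size - P) / wan_rate + startup_overhead
    # for P, then clamp to [0, dataset_size].
    p = local_rate * (dataset_size + startup_overhead * wan_rate) / (wan_rate + local_rate)
    return min(dataset_size, max(0.0, p))

# 1000-unit dataset, local reads 4x faster than the WAN, no overhead:
# an 800-unit prefix takes 8 s to consume, exactly covering the 8 s
# needed to patch the remaining 200 units over the WAN.
p = min_prefix_bytes(1000.0, 100.0, 25.0, 0.0)
```

Any prefix shorter than this would stall the client waiting on the suffix; any longer prefix wastes cache space, which is the trade-off the model balances.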
ieee international conference on high performance computing data and analytics | 2007
Tyler A. Simon; Sam B. Cable; Mahin Mahmoodi
The US Army Engineer Research and Development Center Major Shared Resource Center has recently upgraded its Cray XT3 system from single-core to dual-core AMD Opteron processors and has procured a quad-core XT4 supercomputer for installation in early 2008. This paper provides performance analysis of several representative Department of Defense applications executed on single-core and dual-core AMD Opteron processors. The authors provide a detailed strong-scaling study that focuses on addressing some areas of contention that may lead to increased job run times on applications running on many thousands of processors. The authors intend to use the results of this study as a guide for determining application performance on the quad-core Cray XT4.
irregular applications: architectures and algorithms | 2016
Thomas B. Rolinger; Tyler A. Simon; Christopher D. Krieger
Tensor decomposition, the higher-order analogue to singular value decomposition, has emerged as a useful tool for finding relationships in large, sparse, multidimensional data sets. As this technique matures and is applied to increasingly larger data sets, the need for high performance implementations becomes critical. In this work, we perform an objective empirical evaluation of three popular parallel implementations of the Candecomp/Parafac Alternating Least Squares (CP-ALS) tensor decomposition algorithm, namely SPLATT, DFacTo, and ENSIGN. We conduct performance studies across a variety of data sets, comparing the total memory required, the runtime, and the parallel scalability of each implementation. We find that the approach taken by SPLATT results in the fastest runtimes across the data sets, performing 5–22.64 times faster than the other tools. Additionally, SPLATT consumes 1.16–8.62 times less memory than the other tools. When tested on up to 20 cores or nodes, SPLATT using distributed memory parallelism exhibits the best strong scaling.
Concurrency and Computation: Practice and Experience | 2012
Tyler A. Simon; William A. Ward; Alan P. Boss
This paper provides a performance evaluation and investigation of the astrophysics code FLASH for a variety of Intel multiprocessors. This work was performed at the NASA Center for Computational Sciences (NCCS) on behalf of the Carnegie Institution of Washington (CIW) as a study preliminary to the acquisition of a high-performance computing (HPC) system at the CIW, and for the NCCS itself to measure the relative performance of a recently acquired Intel Nehalem-based system against previously installed multicore HPC resources. A brief overview of computer performance evaluation is provided, followed by a description of the systems under test, a description of the FLASH test problem, and the test results. Additionally, the paper characterizes some of the effects of load imbalance imposed by adaptive mesh refinement.
ieee high performance extreme computing conference | 2017
Tom Henretty; Muthu Manikandan Baskaran; David Bruns-Smith; Tyler A. Simon
With the recent explosion of systems capable of generating and storing large quantities of GPS data, there is an opportunity to develop novel techniques for analyzing and gaining meaningful insights into this spatiotemporal data. In this paper we examine the application of tensor decompositions, a high-dimensional data analysis technique, to georeferenced data sets. Guidance is provided on fitting spatiotemporal data into the tensor model and analyzing the results. We find that tensor decompositions provide insight and that future research into spatiotemporal tensor decompositions for pattern detection, clustering, and anomaly detection is warranted.
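Fitting georeferenced records into the tensor model, as the abstract describes, typically means discretizing latitude, longitude, and time into bins and accumulating counts as sparse (i, j, k) entries, the usual input format for CP decomposition tools. The binning scheme below is an illustrative assumption, not the paper's exact formulation.

```python
from collections import defaultdict

def build_sparse_tensor(records, lat_bins, lon_bins, hour_bins=24):
    """records: iterable of (lat, lon, hour) with lat in [-90, 90],
    lon in [-180, 180], hour in [0, 24). Returns {(i, j, k): count},
    a sparse 3-way count tensor."""
    tensor = defaultdict(int)
    for lat, lon, hour in records:
        i = min(int((lat + 90) / 180 * lat_bins), lat_bins - 1)
        j = min(int((lon + 180) / 360 * lon_bins), lon_bins - 1)
        k = min(int(hour / 24 * hour_bins), hour_bins - 1)
        tensor[(i, j, k)] += 1
    return dict(tensor)

# Three hypothetical GPS pings: two near the same place in the morning,
# one elsewhere in the evening.
pings = [(39.3, -76.6, 8.5), (39.3, -76.6, 9.0), (38.9, -77.0, 17.2)]
t = build_sparse_tensor(pings, lat_bins=180, lon_bins=360)
```

A CP decomposition of such a tensor yields per-mode factor vectors, which is what makes spatial clusters and daily temporal patterns separable in the analysis.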
ieee high performance extreme computing conference | 2017
Thomas B. Rolinger; Tyler A. Simon; Christopher D. Krieger
Tensor decompositions, which are factorizations of multi-dimensional arrays, are becoming increasingly important in large-scale data analytics. A popular tensor decomposition algorithm is Canonical Decomposition/Parallel Factorization using alternating least squares fitting (CP-ALS). Tensors that model real-world applications are often very large and sparse, driving the need for high performance implementations of decomposition algorithms, such as CP-ALS, that can take advantage of many types of compute resources. In this work we present ReFacTo, a heterogeneous distributed tensor decomposition implementation based on DFacTo, an existing distributed memory approach to CP-ALS. DFacTo reduces the critical routine of CP-ALS to a series of sparse matrix-vector multiplications (SpMVs). ReFacTo leverages GPUs within a cluster via MPI to perform these SpMVs and uses OpenMP threads to parallelize other routines. We evaluate the performance of ReFacTo when using NVIDIA's GPU-based cuSPARSE library and compare it to an alternative implementation that uses Intel's CPU-based Math Kernel Library (MKL) for the SpMV. Furthermore, we provide a discussion of the performance challenges of heterogeneous distributed tensor decompositions based on the results we observed. We find that on up to 32 nodes, the SpMV of ReFacTo when using MKL is up to 6.8× faster than ReFacTo when using cuSPARSE.
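The critical kernel the abstract describes is an ordinary sparse matrix-vector product over a matrix in compressed sparse row (CSR) form; in ReFacTo this is the operation handed to cuSPARSE or MKL on each node. A minimal pure-Python sketch of that kernel, for illustration only:

```python
def csr_spmv(indptr, indices, data, x):
    """Compute y = A @ x for A stored in CSR form: indptr gives each
    row's slice into the parallel arrays indices (column ids) and
    data (nonzero values)."""
    y = []
    for row in range(len(indptr) - 1):
        acc = 0.0
        for p in range(indptr[row], indptr[row + 1]):
            acc += data[p] * x[indices[p]]
        y.append(acc)
    return y

# The 2x3 sparse matrix [[1, 0, 2], [0, 3, 0]] times [1, 1, 1].
y = csr_spmv([0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

Because CP-ALS repeats this product many times per iteration over very large sparse operands, the choice of SpMV backend (GPU vs. CPU) dominates end-to-end performance, which is exactly the comparison the paper makes.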
Concurrency and Computation: Practice and Experience | 2009
Tyler A. Simon; James W. McGalliard
high performance computing symposium | 2013
Tyler A. Simon; Phuong Nguyen; Milton Halem