Phillip M. Dickens
University of Maine
Publications
Featured research published by Phillip M. Dickens.
IEEE Transactions on Parallel and Distributed Systems | 1996
Phillip M. Dickens; Philip Heidelberger; David M. Nicol
As massively parallel computers proliferate, there is growing interest in finding ways by which the performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing compilers, parallel performance monitoring, and parallel algorithm development. In this paper, we describe one solution in which the application code is executed directly, while a discrete-event simulator models details of the presumed parallel machine, such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization, specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct-execution simulation of message-passing parallel programs, and report on the performance of such a system, LAPSE (Large Application Parallel Simulation Environment), which we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well, typically within 10% relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.
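As an illustration of the direct-execution idea, the minimal single-process sketch below times compute bursts natively and adds them to a virtual clock, while communication is charged according to a simple model of the target machine. This is only a sketch under assumed model parameters (latency, bandwidth), not the actual LAPSE implementation.

```c
/*
 * Minimal sketch of direct-execution simulation (assumption: simplified,
 * single-process illustration, not the actual LAPSE implementation).
 * Application code runs natively; time spent in compute bursts is measured
 * on the host and added to a virtual clock, while communication is charged
 * a modeled cost instead of being timed directly.
 */
#include <stdio.h>
#include <time.h>

static double virtual_clock = 0.0;        /* simulated time on the target machine */
static struct timespec burst_start;

/* hypothetical model parameters for the simulated network (assumed values) */
#define MODEL_LATENCY_US   25.0           /* per-message latency in microseconds */
#define MODEL_BANDWIDTH_MB 200.0          /* link bandwidth in MB/s              */

static void begin_compute_burst(void) {
    clock_gettime(CLOCK_MONOTONIC, &burst_start);
}

static void end_compute_burst(void) {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    double elapsed = (now.tv_sec - burst_start.tv_sec)
                   + (now.tv_nsec - burst_start.tv_nsec) * 1e-9;
    virtual_clock += elapsed;             /* charge the measured compute time */
}

/* communication is not timed; it is charged according to the machine model */
static void simulated_send(size_t bytes) {
    virtual_clock += MODEL_LATENCY_US * 1e-6
                   + (double)bytes / (MODEL_BANDWIDTH_MB * 1e6);
}

int main(void) {
    begin_compute_burst();
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; i++) x += i * 0.5;   /* native computation */
    end_compute_burst();

    simulated_send(64 * 1024);            /* 64 KB message on the modeled network */
    printf("predicted time on target machine: %f s\n", virtual_clock);
    return 0;
}
```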
Workshop on Parallel and Distributed Simulation | 1994
Phillip M. Dickens; Philip Heidelberger; David M. Nicol
This paper describes a tool, LAPSE (Large Application Parallel Simulation Environment), that allows one to use a small number of parallel processors to simulate the behavior of a message-passing code running on a large number of processors, for the purposes of scalability studies and performance tuning. LAPSE is implemented on the Intel Paragon, and has achieved small slowdowns (relative to native code) and high speed-ups on large problems.
International Parallel Processing Symposium | 1999
Phillip M. Dickens; Rajeev Thakur
Massively parallel computers are increasingly being used to solve large, I/O-intensive applications in many different fields. For such applications, the I/O requirements quite often present a significant obstacle to achieving good performance, and an important area of current research is the development of techniques by which these costs can be reduced. One such approach is collective I/O, where the processors cooperatively develop an I/O strategy that reduces the number, and increases the size, of I/O requests, making much better use of the I/O subsystem. Collective I/O has been shown to significantly reduce the cost of performing I/O in many large, parallel applications, and for this reason serves as an important base upon which we can explore other mechanisms that can further reduce these costs. One promising approach is to use threads to perform the collective I/O in the background while the main thread continues with other computation in the foreground. In this paper, we explore the issues associated with implementing collective I/O in the background using threads. The most natural approach is to simply spawn off an I/O thread to perform the collective I/O in the background while the main thread continues with other computation. However, our research demonstrates that this approach is frequently the worst implementation option, often performing much more poorly than executing the collective I/O entirely in the foreground. To improve the performance of thread-based collective I/O, we developed an alternate approach in which part of the collective I/O operation is performed in the background and part is performed in the foreground. We demonstrate that this new technique can significantly improve the performance of thread-based collective I/O, providing up to an 80% improvement over sequential collective I/O (where there is no attempt to overlap computation with I/O). We also discuss one very important application of this research: the implementation of the split-collective parallel I/O operations defined in MPI 2.0.
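The split-collective interface mentioned at the end of the abstract is standardized in MPI 2.0 as begin/end pairs such as MPI_File_write_all_begin and MPI_File_write_all_end. The sketch below shows how an application can overlap a collective write with foreground work using that interface; the file name, buffer sizes, and overlapped computation are illustrative assumptions, not details from the paper.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)   /* doubles per process (assumed) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf  = malloc(N * sizeof(double));   /* data being written       */
    double *work = malloc(N * sizeof(double));   /* data for foreground work */
    for (int i = 0; i < N; i++) { buf[i] = rank + i; work[i] = i * 0.5; }

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, (MPI_Offset)rank * N * sizeof(double),
                      MPI_DOUBLE, MPI_DOUBLE, "native", MPI_INFO_NULL);

    /* start the collective write, then overlap it with other computation */
    MPI_File_write_all_begin(fh, buf, N, MPI_DOUBLE);

    double sum = 0.0;
    for (int i = 0; i < N; i++) sum += work[i];  /* foreground computation */

    MPI_Status status;
    MPI_File_write_all_end(fh, buf, &status);    /* complete the collective write */

    MPI_File_close(&fh);
    if (rank == 0) printf("overlapped sum = %f\n", sum);
    free(buf); free(work);
    MPI_Finalize();
    return 0;
}
```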
European Conference on Parallel Processing | 2003
Phillip M. Dickens
In this paper, we discuss our work on developing an efficient, lightweight application-level communication protocol for the high-bandwidth, high-delay network environments typical of computational grids. The goal of this research is to provide congestion-control algorithms that allow the protocol to obtain a large percentage of the underlying bandwidth when it is available, and to be responsive (and eventually proactive) to developing contention for system resources. Towards this end, we develop and evaluate two application-level congestion-control algorithms, one of which incorporates historical knowledge and one that only uses current information. We compare the performance of these two algorithms with respect to each other and with respect to TCP.
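The sketch below is a rough illustration of the two flavors of congestion control described above, not the paper's actual algorithms: one policy reacts only to the loss rate measured in the current interval, while the other folds past intervals into an exponentially weighted history. All constants and the loss trace are assumptions chosen for illustration.

```c
#include <stdio.h>

#define LOSS_THRESHOLD 0.01   /* tolerate up to 1% loss per interval (assumed) */
#define INCREASE_STEP  5.0    /* Mbit/s added when the path looks uncongested  */
#define DECREASE_FACT  0.8    /* multiplicative backoff on congestion          */
#define HISTORY_WEIGHT 0.75   /* weight given to past intervals                */

/* policy 1: current information only */
double adjust_current(double rate, double loss_now) {
    return (loss_now > LOSS_THRESHOLD) ? rate * DECREASE_FACT
                                       : rate + INCREASE_STEP;
}

/* policy 2: incorporate historical knowledge via an EWMA of observed loss */
double adjust_history(double rate, double loss_now, double *loss_hist) {
    *loss_hist = HISTORY_WEIGHT * (*loss_hist)
               + (1.0 - HISTORY_WEIGHT) * loss_now;
    return (*loss_hist > LOSS_THRESHOLD) ? rate * DECREASE_FACT
                                         : rate + INCREASE_STEP;
}

int main(void) {
    double loss[] = {0.0, 0.0, 0.05, 0.02, 0.0, 0.0};   /* per-interval loss */
    double r1 = 100.0, r2 = 100.0, hist = 0.0;          /* rates in Mbit/s   */
    for (int i = 0; i < 6; i++) {
        r1 = adjust_current(r1, loss[i]);
        r2 = adjust_history(r2, loss[i], &hist);
        printf("interval %d: current-only %.1f Mbit/s, history %.1f Mbit/s\n",
               i, r1, r2);
    }
    return 0;
}
```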
ACM Transactions on Modeling and Computer Simulation | 1996
Phillip M. Dickens; David M. Nicol; Paul F. Reynolds; John Mark Duva
This article studies an analytic model of parallel discrete-event simulation, comparing the YAWNS conservative synchronization protocol with Bounded Time Warp. The assumed simulation problem is a heavily loaded queuing network where the probability of an idle server is close to zero. We model workload and job routing in standard ways, then develop and validate methods for computing approximate performance measures as a function of the degree of optimism allowed, the overhead costs of state-saving, rollback, and barrier synchronization, and workload aggregation. We find that Bounded Time Warp is superior when the number of servers per physical processor is low (i.e., sparse load), but that aggregating workload improves the relative performance of YAWNS.
High Performance Distributed Computing | 2009
Phillip M. Dickens; Jeremy Logan
It is widely known that MPI-IO performs poorly in a Lustre file system environment, although the reasons for such performance are currently not well understood. The research presented in this paper strongly supports our hypothesis that MPI-IO performs poorly in this environment because of the fundamental assumptions upon which most parallel I/O optimizations are based. In particular, it is almost universally believed that parallel I/O performance is optimized when aggregator processes perform large, contiguous I/O operations in parallel. Our research shows that this approach generally provides the worst performance in a Lustre environment, and that the best performance is often obtained when the aggregator processes perform a large number of small, non-contiguous I/O operations. In this paper, we first demonstrate and explain these non-intuitive results. We then present a user-level library, termed Y-lib, which redistributes data in a way that conforms much more closely to the Lustre storage architecture than does the data redistribution pattern employed by MPI-IO. We then provide experimental results showing that Y-lib can increase performance by between 300% and 1000%, depending on the number of aggregator processes and the file size. Finally, we have MPI-IO itself use our data redistribution scheme, and show that doing so yields a performance increase of similar magnitude compared to the current MPI-IO data redistribution algorithms.
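The following sketch illustrates the kind of stripe-aligned access pattern described above; it is not Y-lib's actual interface, and the stripe size, stripe count, and file name are assumptions. With stripe size S and stripe count C, file offset x maps to OST (x / S) mod C, so a process assigned to one OST issues many smaller, stripe-aligned, non-contiguous writes rather than one large contiguous write. In the library, data would first be exchanged among processes so that each one holds exactly the stripes belonging to its OST; the sketch is a single-process simplification.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define STRIPE_SIZE  (1 << 20)    /* 1 MB Lustre stripe (assumed)       */
#define STRIPE_COUNT 4            /* OSTs the file is striped over      */

/* write only those stripes of [0, file_len) that land on OST `my_ost` */
static void write_my_stripes(int fd, const char *data, size_t file_len, int my_ost) {
    for (size_t off = (size_t)my_ost * STRIPE_SIZE; off < file_len;
         off += (size_t)STRIPE_COUNT * STRIPE_SIZE) {
        size_t len = STRIPE_SIZE;
        if (off + len > file_len) len = file_len - off;
        pwrite(fd, data + off, len, (off_t)off);   /* small, stripe-aligned request */
    }
}

int main(void) {
    size_t file_len = 16UL * STRIPE_SIZE;
    char *data = malloc(file_len);
    memset(data, 'x', file_len);

    int fd = open("striped.dat", O_CREAT | O_WRONLY, 0644);
    /* in an MPI program each rank would be assigned one OST; here rank 0 of 4 */
    write_my_stripes(fd, data, file_len, 0);

    close(fd);
    free(data);
    return 0;
}
```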
Journal of Parallel and Distributed Computing | 2001
Phillip M. Dickens; Rajeev Thakur
In this paper, we evaluate the impact on performance of various implementation techniques for collective I/O operations, and we do so across four important parallel architectures. We show that a naive implementation of collective I/O does not result in significant performance gains for any of the architectures, but that an optimized implementation does provide excellent performance across all of the platforms under study. Furthermore, we demonstrate that there exists a single implementation strategy that provides the best performance for all four computational platforms. Next, we evaluate implementation techniques for thread-based collective I/O operations. We show that the most obvious implementation technique, which is to spawn a thread to execute the whole collective I/O operation in the background, frequently provides the worst performance, often performing much worse than executing the collective I/O routine entirely in the foreground. To improve performance, we explore an alternate approach in which part of the collective I/O operation is performed in the background and part is performed in the foreground. We demonstrate that this implementation technique can provide significant performance gains, offering up to a 50% improvement over implementations that do not attempt to overlap collective I/O and computation.
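A minimal sketch of the split foreground/background approach, under assumed file names and sizes: the communication/aggregation phase of the collective operation is assumed to have already completed in the foreground, and only the resulting large file write is handed to a background thread while the main thread continues computing.

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct io_task {
    int    fd;
    char  *buf;      /* data already aggregated during the foreground phase */
    size_t len;
    off_t  offset;
};

static void *background_write(void *arg) {
    struct io_task *t = arg;
    pwrite(t->fd, t->buf, t->len, t->offset);   /* file write in the background */
    return NULL;
}

int main(void) {
    size_t len = 8UL << 20;                      /* 8 MB aggregated buffer (assumed) */
    struct io_task task = {
        .fd = open("collective.dat", O_CREAT | O_WRONLY, 0644),
        .buf = calloc(len, 1), .len = len, .offset = 0
    };

    pthread_t tid;
    pthread_create(&tid, NULL, background_write, &task);

    double sum = 0.0;                            /* overlapped foreground computation */
    for (size_t i = 0; i < (1UL << 24); i++) sum += (double)i;

    pthread_join(tid, NULL);                     /* wait for the background write */
    close(task.fd);
    free(task.buf);
    printf("sum = %f\n", sum);
    return 0;
}
```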
Concurrency and Computation: Practice and Experience | 2000
George K. Thiruvathukal; Phillip M. Dickens; Shahzad Bhatti
Networks of workstations are a dominant force in the distributed computing arena, due primarily to the excellent price/performance ratio of such systems when compared to traditional massively parallel architectures. It is therefore critical to develop programming languages and environments that can harness the raw computational power available on these systems. In this article, we present JavaNOW (Java on Networks of Workstations), a Java-based framework for parallel programming on networks of workstations. It creates a virtual parallel machine similar to the MPI (Message Passing Interface) model, and provides distributed associative shared memory similar to the Linda memory model but with a flexible set of primitive operations. JavaNOW provides a simple yet powerful framework for performing computation on networks of workstations. In addition to the Linda memory model, it provides shared objects, implicit multithreading, implicit synchronization, object dataflow, and collective communications similar to those defined in MPI. JavaNOW is also a component of the Computational Neighborhood [63], a Java-enabled suite of services for desktop computational sharing. The intent of JavaNOW is to present an environment for parallel computing that is both expressive and reliable and that ultimately can deliver good to excellent performance. As JavaNOW is a work in progress, this article emphasizes the expressive potential of the JavaNOW environment and presents only preliminary performance results.
European Conference on Parallel Processing | 1998
Phillip M. Dickens
Massively parallel computers are increasingly being used to solve large, I/O-intensive applications in many different fields. For such applications, the I/O subsystem represents a significant obstacle in the way of achieving good performance. While massively parallel architectures do, in general, provide parallel I/O hardware, this alone is not sufficient to guarantee good performance. The problem is that in many applications each processor initiates many small I/O requests rather than fewer, larger requests, resulting in significant performance penalties due to the high latency associated with I/O. However, it is often the case that in the aggregate the I/O requests are significantly fewer and larger. Two-phase I/O is a technique that captures and exploits this aggregate information to recombine I/O requests such that fewer and larger requests are generated, reducing latency and improving performance. In this paper, we describe our efforts to obtain high performance using two-phase I/O. In particular, we describe our first implementation, which produced a sustained bandwidth of 78 MBytes per second, and discuss the steps taken to increase this bandwidth to 420 MBytes per second.
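A minimal two-phase I/O sketch under a simple assumed data layout (each process starts with blocks distributed cyclically over the file; the file name and block size are assumptions): phase one exchanges blocks so that each process ends up holding one contiguous file region, and phase two writes that region with a single large request.

```c
#include <mpi.h>
#include <stdlib.h>

#define BLOCK 4096   /* doubles per block (assumed) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* mine[j*BLOCK ...] is this process's contribution to file region j */
    double *mine   = malloc((size_t)nprocs * BLOCK * sizeof(double));
    double *region = malloc((size_t)nprocs * BLOCK * sizeof(double));
    for (int i = 0; i < nprocs * BLOCK; i++) mine[i] = rank;

    /* phase 1: redistribute so each process holds one contiguous file region */
    MPI_Alltoall(mine, BLOCK, MPI_DOUBLE, region, BLOCK, MPI_DOUBLE,
                 MPI_COMM_WORLD);

    /* phase 2: one large, contiguous write per process */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "twophase.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset off = (MPI_Offset)rank * nprocs * BLOCK * sizeof(double);
    MPI_File_write_at(fh, off, region, nprocs * BLOCK, MPI_DOUBLE,
                      MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(mine);
    free(region);
    MPI_Finalize();
    return 0;
}
```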
On the Move to Meaningful Internet Systems: OTM 2008 Confederated International Conferences (CoopIS, DOA, GADA, IS, and ODBASE), Part I | 2008
Phillip M. Dickens; Jeremy Logan
Lustre is becoming an increasingly important file system for large-scale computing clusters. The problem is that many data-intensive applications use MPI-IO for their I/O requirements, and it has been well documented that MPI-IO performs poorly in a Lustre file system environment. However, the reasons for such poor performance are not currently well understood. We believe that the primary reason for poor performance is that the assumptions underpinning most of the parallel I/O optimizations implemented in MPI-IO do not hold in a Lustre environment. Perhaps the most important assumption that appears to be incorrect is that optimal performance is obtained by performing large, contiguous I/O operations. Our research suggests that this is often the worst approach to take in a Lustre file system. In fact, we found that the best performance is sometimes achieved when each process performs a series of smaller, non-contiguous I/O requests. In this paper, we provide experimental results showing that such assumptions do not apply in Lustre, and explore new approaches that appear to provide significantly better performance.