
Publication


Featured research published by Alan L. Cox.


IEEE Computer | 1996

TreadMarks: shared memory computing on networks of workstations

Cristiana Amza; Alan L. Cox; Sandhya Dwarkadas; Peter J. Keleher; Honghui Lu; Ramakrishnan Rajamony; Weimin Yu; Willy Zwaenepoel

Shared memory facilitates the transition from sequential to parallel processing. Since most data structures can be retained, simply adding synchronization achieves correct, efficient programs for many applications. We discuss our experience with parallel computing on networks of workstations using the TreadMarks distributed shared memory system. DSM allows processes to assume a globally shared virtual memory even though they execute on nodes that do not physically share memory. We illustrate a DSM system consisting of N networked workstations, each with its own memory. The DSM software provides the abstraction of a globally shared memory, in which each processor can access any data item without the programmer having to worry about where the data is or how to obtain its value.
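
To make the programming model concrete, here is a minimal sketch of a TreadMarks-style parallel sum in C, assuming the Tmk_* API (Tmk_startup, Tmk_malloc, Tmk_distribute, Tmk_barrier, Tmk_proc_id, Tmk_nprocs) that the paper describes; the header name and exact signatures are approximations rather than a verbatim excerpt from the system. Apart from allocating data in shared memory and synchronizing with barriers, the code reads like an ordinary shared-memory program, which is the point of the DSM abstraction.

/* Hypothetical sketch of a TreadMarks-style parallel sum.
 * Assumes the Tmk_* API described in the paper; exact names and
 * signatures may differ from the real distribution.
 */
#include <stdio.h>
#include "Tmk.h"             /* TreadMarks header (name is an assumption) */

#define N 1024

int *shared_data;            /* lives in DSM: every process sees the same pages */
int *partial;                /* one slot per process, also shared */

int main(int argc, char **argv)
{
    Tmk_startup(argc, argv);                  /* join the DSM system */

    if (Tmk_proc_id == 0) {
        shared_data = (int *) Tmk_malloc(N * sizeof(int));
        partial     = (int *) Tmk_malloc(Tmk_nprocs * sizeof(int));
        for (int i = 0; i < N; i++)
            shared_data[i] = i;
        Tmk_distribute(&shared_data, sizeof(shared_data));  /* publish pointers */
        Tmk_distribute(&partial, sizeof(partial));
    }
    Tmk_barrier(0);                           /* everyone waits for initialization */

    /* Each process sums its slice, exactly as on a single shared-memory machine. */
    int chunk = N / Tmk_nprocs;
    int lo = Tmk_proc_id * chunk, hi = lo + chunk, sum = 0;
    for (int i = lo; i < hi; i++)
        sum += shared_data[i];
    partial[Tmk_proc_id] = sum;

    Tmk_barrier(1);
    if (Tmk_proc_id == 0) {
        int total = 0;
        for (int p = 0; p < Tmk_nprocs; p++)
            total += partial[p];
        printf("total = %d\n", total);
    }
    Tmk_exit(0);
    return 0;
}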


International Symposium on Computer Architecture | 1992

Lazy release consistency for software distributed shared memory

Peter J. Keleher; Alan L. Cox; Willy Zwaenepoel

Relaxed memory consistency models, such as release consistency, were introduced in order to reduce the impact of remote memory access latency in both software and hardware distributed shared memory (DSM). However, in a software DSM, it is also important to reduce the number of messages and the amount of data exchanged for remote memory access. Lazy release consistency is a new algorithm for implementing release consistency that lazily pulls modifications across the interconnect only when necessary. Trace-driven simulation using the SPLASH benchmarks indicates that lazy release consistency reduces both the number of messages and the amount of data transferred between processors. These reductions are especially significant for programs that exhibit false sharing and make extensive use of locks.
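
The difference between eager and lazy propagation is easiest to see as a message count. The toy C program below models a single lock handed round-robin among processors: eager release consistency pushes modifications to every other processor at each release, while lazy release consistency ships write notices only to the next acquirer. The per-hand-off message counts are a deliberately simplified model for illustration, not the protocol machinery evaluated in the paper.

/* Toy message-count comparison of eager vs. lazy release consistency for a
 * single lock passed round-robin among nprocs processors. Illustrative model
 * only: eager RC pushes modifications to all other processors at each release,
 * while lazy RC ships write notices only to the next acquirer.
 */
#include <stdio.h>

int main(void)
{
    int nprocs = 16;
    int acquisitions = 1000;      /* lock hand-offs during the run */

    /* Eager: each release broadcasts invalidations/updates to nprocs - 1 peers. */
    long eager_msgs = (long)acquisitions * (nprocs - 1);

    /* Lazy: each acquire pulls write notices from the previous releaser only
     * (one request plus one reply per hand-off in this simplified model). */
    long lazy_msgs = (long)acquisitions * 2;

    printf("eager release consistency: %ld messages\n", eager_msgs);
    printf("lazy  release consistency: %ld messages\n", lazy_msgs);
    return 0;
}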


Virtual Execution Environments | 2008

Scheduling I/O in virtual machine monitors

Diego Ongaro; Alan L. Cox; Scott Rixner

This paper explores the relationship between domain scheduling in a virtual machine monitor (VMM) and I/O performance. Traditionally, VMM schedulers have focused on fairly sharing the processor resources among domains while leaving the scheduling of I/O resources as a secondary concern. However, this can result in poor and/or unpredictable application performance, making virtualization less desirable for applications that require efficient and consistent I/O behavior. This paper is the first to study the impact of the VMM scheduler on performance using multiple guest domains concurrently running different types of applications. In particular, different combinations of processor-intensive, bandwidth-intensive, and latency-sensitive applications are run concurrently to quantify the impacts of different scheduler configurations on processor and I/O performance. These applications are evaluated on 11 different scheduler configurations within the Xen VMM. These configurations include a variety of scheduler extensions aimed at improving I/O performance. This cross product of scheduler configurations and application types offers insight into the key problems in VMM scheduling for I/O and motivates future innovation in this area.
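
A rough model shows why a purely fair-share CPU scheduler can hurt I/O latency: a latency-sensitive guest cannot answer a network request until the scheduler runs it again. The sketch below is a schematic calculation of that effect and of a hypothetical "boost the domain when its I/O event arrives" extension; the time-slice value and domain count are illustrative assumptions, and the code does not reproduce Xen's credit scheduler.

/* Schematic model of the scheduling/latency interaction studied in the paper:
 * a latency-sensitive guest cannot answer a network request until the VMM
 * scheduler runs it. Toy model only, not Xen's credit scheduler.
 */
#include <stdio.h>

int main(void)
{
    double timeslice_ms = 30.0;   /* per-domain slice in a fair-share round (assumed) */
    int    ndomains     = 4;      /* CPU-bound guests sharing the core (assumed) */

    /* Without an I/O-aware extension, a request that arrives just after the
     * latency-sensitive domain yields may wait for the other domains' slices. */
    double worst_wait_plain = (ndomains - 1) * timeslice_ms;

    /* With a "boost on I/O event" extension, the blocked domain preempts the
     * current one as soon as its event arrives (modeled as ~0 queueing wait). */
    double worst_wait_boost = 0.0;

    printf("worst-case wake-up wait, plain fair-share: %.1f ms\n", worst_wait_plain);
    printf("worst-case wake-up wait, with I/O boost:   %.1f ms\n", worst_wait_boost);
    return 0;
}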


International Symposium on Computer Architecture | 1993

Adaptive cache coherency for detecting migratory shared data

Alan L. Cox; Robert J. Fowler

Parallel programs exhibit a small number of distinct data-sharing patterns. A common data-sharing pattern, migratory access, is characterized by exclusive read and write access by one processor at a time to a shared datum. We describe a family of adaptive cache coherency protocols that dynamically identify migratory shared data in order to reduce the cost of moving them. The protocols use a standard memory model and processor-cache interface. They do not require any compile-time or run-time software support. We describe implementations for bus-based multiprocessors and for shared-memory multiprocessors that use directory-based caches. These implementations are simple and would not significantly increase hardware cost. We use trace- and execution-driven simulation to compare the performance of the adaptive protocols to standard write-invalidate protocols. These simulations indicate that, compared to conventional protocols, the use of the adaptive protocol can almost halve the number of inter-node messages on some applications. Since cache coherency traffic represents a larger part of the total communication as cache size increases, the relative benefit of using the adaptive protocol also increases.
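
The savings come from recognizing the migratory read-modify-write pattern. The toy C calculation below compares approximate message counts per ownership hand-off: a conventional write-invalidate protocol pays for a shared fetch and then a separate invalidation when the new processor writes, whereas a protocol that has identified the datum as migratory hands it over in exclusive state in a single exchange. The per-hand-off counts are a simplified illustration consistent with the "almost halve" result above, not the exact protocol of the paper.

/* Toy message count for a migratory sharing pattern (each processor in turn
 * reads then writes one shared datum), comparing a standard write-invalidate
 * protocol with an adaptive protocol that detects migratory data. The numbers
 * per hand-off are a simplified model, not the exact protocol in the paper.
 */
#include <stdio.h>

int main(void)
{
    long handoffs = 100000;   /* read-modify-write migrations between processors */

    /* Standard write-invalidate: the read miss fetches a shared copy (request +
     * data), then the write must invalidate the previous owner's copy
     * (invalidate + ack): about 4 messages per hand-off. */
    long standard_msgs = handoffs * 4;

    /* Adaptive migratory protocol: once the datum is identified as migratory, a
     * read miss transfers it directly in exclusive state and the old copy is
     * invalidated as part of the same exchange: about 2 messages per hand-off. */
    long adaptive_msgs = handoffs * 2;

    printf("standard write-invalidate: %ld messages\n", standard_msgs);
    printf("adaptive (migratory):      %ld messages\n", adaptive_msgs);
    return 0;
}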


International Symposium on Performance Analysis of Systems and Software | 2010

The Hadoop distributed filesystem: Balancing portability and performance

Jeffrey Shafer; Scott Rixner; Alan L. Cox

Hadoop is a popular open-source implementation of MapReduce for the analysis of large datasets. To manage storage resources across the cluster, Hadoop uses a distributed user-level filesystem. This filesystem - HDFS - is written in Java and designed for portability across heterogeneous hardware and software platforms. This paper analyzes the performance of HDFS and uncovers several performance issues. First, architectural bottlenecks exist in the Hadoop implementation that result in inefficient HDFS usage due to delays in scheduling new MapReduce tasks. Second, portability limitations prevent the Java implementation from exploiting features of the native platform. Third, HDFS implicitly makes portability assumptions about how the native platform manages storage resources, even though native filesystems and I/O schedulers vary widely in design and behavior. This paper investigates the root causes of these performance bottlenecks in order to evaluate tradeoffs between portability and performance in the Hadoop distributed filesystem.


Concurrency and Computation: Practice and Experience | 1997

Java/DSM: A platform for heterogeneous computing

Weimin Yu; Alan L. Cox

In this paper we describe a system for programming heterogeneous computing environments based upon Java and software distributed shared memory (DSM). Compared with existing approaches for heterogeneous computing, our system transparently handles both the hardware differences and the distributed nature of the system. It is therefore much easier to program. Java is a good candidate for heterogeneous programming because of its portability. Java provides the remote method invocation (RMI) mechanism for distributed computing. However, our early experience with RMI showed that the programmer must expend significant effort on such problems as data replication and the optimization of the remote interface to improve communication efficiency. Furthermore, the need for reference marshaling is not completely eliminated by RMI's effort to facilitate the sharing of linked structures between machines. A DSM system provides a shared memory abstraction over a group of physically distributed machines. It automatically handles data communication between machines and eliminates the need for the programmer to write message-passing code. A multithreaded Java program written for a single machine will require fewer changes to run on a Java/DSM combination than with RMI. We have been implementing a JDK-1.0.2 based parallel Java Virtual Machine on the TreadMarks DSM system. Our implementation includes a distributed garbage collector and supports the Java API with very few changes. In this paper we describe our motivation and the implementation, and report our early experience with programming under both RMI and DSM.


High-Performance Computer Architecture | 2007

Concurrent Direct Network Access for Virtual Machine Monitors

Paul Willmann; Jeffrey Shafer; David Carr; Aravind Menon; Scott Rixner; Alan L. Cox; Willy Zwaenepoel

This paper presents hardware and software mechanisms to enable concurrent direct network access (CDNA) by operating systems running within a virtual machine monitor. In a conventional virtual machine monitor, each operating system running within a virtual machine must access the network through a software-virtualized network interface. These virtual network interfaces are multiplexed in software onto a physical network interface, incurring significant performance overheads. The CDNA architecture improves networking efficiency and performance by dividing the tasks of traffic multiplexing, interrupt delivery, and memory protection between hardware and software in a novel way. The virtual machine monitor delivers interrupts and provides protection between virtual machines, while the network interface performs multiplexing of the network data. In effect, the CDNA architecture provides the abstraction that each virtual machine is connected directly to its own network interface. Through the use of CDNA, many of the bottlenecks imposed by software multiplexing can be eliminated without sacrificing protection, producing substantial efficiency improvements.


Operating Systems Design and Implementation | 2002

Practical, transparent operating system support for superpages

Sitaram Iyer; Peter Druschel; Alan L. Cox

Most general-purpose processors provide support for memory pages of large sizes, called superpages. Superpages enable each entry in the translation lookaside buffer (TLB) to map a large physical memory region into a virtual address space. This dramatically increases TLB coverage, reduces TLB misses, and promises performance improvements for many applications. However, supporting superpages poses several challenges to the operating system, in terms of superpage allocation and promotion tradeoffs, fragmentation control, etc. We analyze these issues, and propose the design of an effective superpage management system. We implement it in FreeBSD on the Alpha CPU, and evaluate it on real workloads and benchmarks. We obtain substantial performance benefits, often exceeding 30%; these benefits are sustained even under stressful workload scenarios.
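
The TLB-coverage argument can be made concrete with a back-of-the-envelope calculation: coverage is simply the number of TLB entries times the amount of memory each entry maps. The C snippet below uses an assumed 128-entry TLB and 4 KB versus 2 MB page sizes for illustration; these figures are not taken from the Alpha hardware evaluated in the paper.

/* Back-of-the-envelope TLB coverage: coverage = (TLB entries) x (page size
 * mapped per entry). Entry count and page sizes are illustrative assumptions,
 * not the parameters of the Alpha hardware used in the paper.
 */
#include <stdio.h>

int main(void)
{
    long entries    = 128;                 /* data-TLB entries (assumed) */
    long base_page  = 4L * 1024;           /* 4 KB base pages */
    long super_page = 2L * 1024 * 1024;    /* 2 MB superpages */

    long base_coverage  = entries * base_page;    /* memory reachable without a TLB miss */
    long super_coverage = entries * super_page;

    printf("coverage with 4 KB pages: %ld KB\n", base_coverage / 1024);
    printf("coverage with 2 MB pages: %ld MB\n", super_coverage / (1024 * 1024));
    return 0;
}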


Symposium on Operating Systems Principles | 1989

The implementation of a coherent memory abstraction on a NUMA multiprocessor: experiences with platinum

Alan L. Cox; Robert J. Fowler

PLATINUM is an operating system kernel with a novel memory management system for Non-Uniform Memory Access (NUMA) multiprocessor architectures. This memory management system implements a coherent memory abstraction. Coherent memory is uniformly accessible from all processors in the system. When used by applications coded with appropriate programming styles, it appears to be nearly as fast as local physical memory and it reduces memory contention. Coherent memory makes programming NUMA multiprocessors easier for the user while attaining a level of performance comparable with hand-tuned programs. This paper describes the design and implementation of the PLATINUM memory management system, emphasizing the coherent memory. We measure the cost of basic operations implementing the coherent memory. We also measure the performance of a set of application programs running on PLATINUM. Finally, we comment on the interaction between architecture and the coherent memory system. PLATINUM currently runs on the BBN Butterfly Plus Multiprocessor.


International Symposium on Computer Architecture | 1993

Evaluation of release consistent software distributed shared memory on emerging network technology

Sandhya Dwarkadas; Peter J. Keleher; Alan L. Cox; Willy Zwaenepoel

We evaluate the effect of processor speed, network characteristics, and software overhead on the performance of release-consistent software distributed shared memory. We examine five different protocols for implementing release consistency: eager update, eager invalidate, lazy update, lazy invalidate, and a new protocol called lazy hybrid. This lazy hybrid protocol combines the benefits of both lazy update and lazy invalidate. Our simulations indicate that with the processors and networks that are becoming available, coarse-grained applications such as Jacobi and TSP perform well, more or less independent of the protocol used. Medium-grained applications, such as Water, can achieve good performance, but the choice of protocol is critical. For sixteen processors, the best protocol, lazy hybrid, performed more than three times better than the worst, eager update. Fine-grained applications such as Cholesky achieve little speedup regardless of the protocol used because of the frequency of synchronization operations and the high latency involved. While the use of relaxed memory models, lazy implementations, and multiple-writer protocols has reduced the impact of false sharing, synchronization latency remains a serious problem for software distributed shared memory systems. These results suggest that future work on software DSMs should concentrate on reducing the amount of synchronization or its effect.

Collaboration


Dive into Alan L. Cox's collaborations.

Top Co-Authors


Willy Zwaenepoel

École Polytechnique Fédérale de Lausanne
