Michael R. Marty | Researchain

Publication

Featured researches published by Michael R. Marty.

ACM SIGARCH Computer Architecture News | 2005

Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset

Milo M. K. Martin; Daniel J. Sorin; Bradford M. Beckmann; Michael R. Marty; Min Xu; Alaa R. Alameldeen; Kevin E. Moore; Mark D. Hill; David A. Wood

The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers. We leverage an existing full-system functional simulation infrastructure (Simics [14]) as the basis around which to build a set of timing simulator modules for modeling the timing of the memory system and microprocessors. This simulator infrastructure enables us to run architectural experiments using a suite of scaled-down commercial workloads [3]. To enable other researchers to more easily perform such research, we have released these timing simulator modules as the Multifacet General Execution-driven Multiprocessor Simulator (GEMS) Toolset, release 1.0, under GNU GPL [9].

IEEE Computer | 2008

Amdahl's Law in the Multicore Era

Mark D. Hill; Michael R. Marty

Augmenting Amdahl's law with a corollary for multicore hardware makes it relevant to future generations of chips with multiple processor cores. Obtaining optimal multicore performance will require further research in both extracting more parallelism and making sequential cores faster.
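The abstract above does not state any formula, so the following is only a sketch of how such a corollary is commonly written down. It assumes a chip budget of `n` base-core units, cores built from `r` units each, and a sequential performance function `perf(r) = sqrt(r)`; the function names and the square-root assumption are illustrative choices, not taken from this page.

```python
import math


def amdahl_speedup(f: float, n: float) -> float:
    """Classic Amdahl's law: a fraction f of the work is sped up by a factor n."""
    return 1.0 / ((1.0 - f) + f / n)


def perf(r: float) -> float:
    """ASSUMPTION: a core built from r base-core units runs sqrt(r) times faster."""
    return math.sqrt(r)


def symmetric_multicore_speedup(f: float, n: int, r: int) -> float:
    """Speedup of a chip with n base-core units split into n/r identical cores.

    The serial fraction runs on one core at perf(r); the parallel fraction
    runs on all n/r cores, each at perf(r).
    """
    serial_time = (1.0 - f) / perf(r)
    parallel_time = f * r / (perf(r) * n)
    return 1.0 / (serial_time + parallel_time)


if __name__ == "__main__":
    # Sweep the core size for a hypothetical 256-unit chip and f = 0.975.
    for r in (1, 4, 16, 64, 256):
        print(r, round(symmetric_multicore_speedup(0.975, 256, r), 1))
```

With `r = 1` the model reduces to classic Amdahl's law, and with `r = n` (one large core) the speedup equals `perf(n)`, which is a quick sanity check on the sketch.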

International Symposium on Computer Architecture | 2010

Energy proportional datacenter networks

Dennis Abts; Michael R. Marty; Philip M. Wells; Peter Michael Klausler; Hong Liu

Numerous studies have shown that datacenter computers rarely operate at full utilization, leading to a number of proposals for creating servers that are energy proportional with respect to the computation that they are performing. In this paper, we show that as servers themselves become more energy proportional, the datacenter network can become a significant fraction (up to 50%) of cluster power. We propose several ways to design a high-performance datacenter network whose power consumption is more proportional to the amount of traffic it is moving; that is, we propose energy proportional datacenter networks. We first show that a flattened butterfly topology itself is inherently more power efficient than the other commonly proposed topology for high-performance datacenter networks. We then exploit the characteristics of modern plesiochronous links to adjust their power and performance envelopes dynamically. Using a network simulator, driven by both synthetic workloads and production datacenter traces, we characterize and understand design tradeoffs, and demonstrate an 85% reduction in power, which approaches the ideal energy proportionality of the network. Our results also demonstrate two challenges for the designers of future network switches: 1) there is a significant power advantage to having independent control of each unidirectional channel comprising a network link, since many traffic patterns show very asymmetric use, and 2) system designers should work to optimize the high-speed channel designs to be more energy efficient by choosing optimal data rate and equalization technology. Given these assumptions, we demonstrate that energy proportional datacenter communication is indeed possible.

International Symposium on Microarchitecture | 2006

ASR: Adaptive Selective Replication for CMP Caches

Bradford M. Beckmann; Michael R. Marty; David A. Wood

The large working sets of commercial and scientific workloads stress the L2 caches of chip multiprocessors (CMPs). Some CMPs use a shared L2 cache to maximize the on-chip cache capacity and minimize off-chip misses. Others use private L2 caches, replicating data to limit the delay due to global wires and minimize cache access time. Recent hybrid proposals use selective replication to balance latency and capacity, but their static replication rules result in performance degradation for some combinations of workloads and system configurations. This paper proposes adaptive selective replication (ASR), a mechanism that dynamically monitors workload behavior to control replication. ASR replicates cache blocks only when it estimates the benefit of replication (lower L2 hit latency) exceeds the cost (more L2 misses). Full-system simulations of 8-processor CMPs show that ASR provides robust performance: improving performance by as much as 29% versus shared caches, 19% versus private caches, and 12% versus CMP-NuRapid and Victim Replication. Furthermore, while ASR does not improve the performance of all workloads, it provides performance stability by always performing at least comparably to the best alternative, including cooperative caching.
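The benefit-versus-cost rule described in the abstract can be illustrated with a small sketch. Everything below is hypothetical: the field names, the cycle-based cost model, and the function `should_replicate` are assumptions made for illustration, and the abstract does not say how ASR actually forms its estimates.

```python
from dataclasses import dataclass


@dataclass
class ReplicationEstimate:
    """Hypothetical counters gathered over one monitoring interval."""
    extra_local_hits: int      # remote L2 hits that replication would turn into local hits
    cycles_saved_per_hit: int  # remote-hit latency minus local-hit latency
    extra_misses: int          # additional L2 misses caused by the lost capacity
    cycles_per_miss: int       # penalty of each additional miss


def should_replicate(e: ReplicationEstimate) -> bool:
    """Replicate only when the estimated benefit exceeds the estimated cost."""
    benefit = e.extra_local_hits * e.cycles_saved_per_hit  # lower L2 hit latency
    cost = e.extra_misses * e.cycles_per_miss              # more L2 misses
    return benefit > cost


if __name__ == "__main__":
    estimate = ReplicationEstimate(extra_local_hits=100, cycles_saved_per_hit=20,
                                   extra_misses=5, cycles_per_miss=300)
    print(should_replicate(estimate))
```

The point of the sketch is only the comparison itself: both sides are expressed in the same unit (cycles here) so that latency gains and capacity losses can be weighed directly.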

International Symposium on Computer Architecture | 2007

Virtual hierarchies to support server consolidation

Michael R. Marty; Mark D. Hill

Server consolidation is becoming an increasingly popular technique to manage and utilize systems. This paper develops CMP memory systems for server consolidation where most sharing occurs within Virtual Machines (VMs). Our memory systems maximize shared memory accesses serviced within a VM, minimize interference among separate VMs, facilitate dynamic reassignment of VMs to processors and memory, and support content-based page sharing among VMs. We begin with a tiled architecture where each of 64 tiles contains a processor, private L1 caches, and an L2 bank. First, we reveal why single-level directory designs fail to meet workload consolidation goals. Second, we develop the paper's central idea of imposing a two-level virtual (or logical) coherence hierarchy on a physically flat CMP that harmonizes with VM assignment. Third, we show that the best of our two virtual hierarchy (VH) variants performs 12-58% better than the best alternative flat directory protocol when consolidating Apache, OLTP, and Zeus commercial workloads on our simulated 64-core CMP.

High-Performance Computer Architecture | 2005

Improving multiple-CMP systems using token coherence

Michael R. Marty; Jesse D. Bingham; Mark D. Hill; Alan J. Hu; Milo M. K. Martin; David A. Wood

Improvements in semiconductor technology now enable chip multiprocessors (CMPs). As many future computer systems will use one or more CMPs and support shared memory, such systems will have caches that must be kept coherent. Coherence is a particular challenge for multiple-CMP (M-CMP) systems. One approach is to use a hierarchical protocol that explicitly separates the intra-CMP coherence protocol from the inter-CMP protocol, but couples them hierarchically to maintain coherence. However, hierarchical protocols are complex, leading to subtle, difficult-to-verify race conditions. Furthermore, most previous hierarchical protocols use directories at one or both levels, incurring indirections (and thus extra latency) for sharing misses, which are common in commercial workloads. In contrast, this paper exploits the separation of correctness substrate and performance policy in the recently proposed token coherence protocol to develop the first M-CMP coherence protocol that is flat for correctness, but hierarchical for performance. Via model checking studies, we show that flat correctness eases verification. Via simulation with micro-benchmarks, we make new protocol variants more robust under contention. Finally, via simulation with commercial workloads on a commercial operating system, we show that new protocol variants can be 10-50% faster than a hierarchical directory protocol.

International Symposium on Microarchitecture | 2006

Coherence Ordering for Ring-based Chip Multiprocessors

Michael R. Marty; Mark D. Hill

Ring interconnects may be an attractive solution for future chip multiprocessors because they can enable faster links than buses and simpler switches than arbitrary switched interconnects. Moreover, a ring naturally orders requests sufficiently to enable directory-less coherence, but not in the total order that buses provide for snooping coherence. Existing cache coherence protocols for rings either establish a (total) ordering point (ORDERING-POINT) or use a greedy order (GREEDY-ORDER) with unbounded retries. In this work, we propose a new class of ring protocols, RING-ORDER, in which requests complete in ring position order to achieve two benefits. First, RING-ORDER improves performance relative to ORDERING-POINT by activating requests immediately instead of waiting for them to reach the ordering point. Second, it improves performance stability relative to GREEDY-ORDER by not using retries. Thus, the new RING-ORDER combines the best of ORDERING-POINT (good performance stability) with the best of GREEDY-ORDER (good average performance).

International Conference on Parallel Architectures and Compilation Techniques | 2010

Approximating age-based arbitration in on-chip networks

Michael Mihn-Jong Lee; John Kim; Dennis Abts; Michael R. Marty; Jae W. Lee

The on-chip network of emerging many-core CMPs enables the sharing of numerous on-chip components. This on-chip network needs to ensure fairness when accessing the shared resources. In this work, we propose providing equality of service (EoS) in the on-chip networks of future many-core CMPs by leveraging distance, or hop count, to approximate the age of packets in the network. We propose probabilistic arbitration combined with distance-based weights to achieve EoS and overcome the limitation of conventional round-robin arbiters. We describe how nonlinear weights need to be used with probabilistic arbiters and propose three different arbitration weight metrics: fixed weight, constantly increasing weight, and variably increasing weight. By only modifying the arbitration of an on-chip router, we do not require any additional buffers or virtual channels and create a complexity-effective mechanism for achieving EoS.

International Symposium on Microarchitecture | 2010

Probabilistic Distance-Based Arbitration: Providing Equality of Service for Many-Core CMPs

Michael Mihn-Jong Lee; John Kim; Dennis Abts; Michael R. Marty; Jae W. Lee

Emerging many-core chip multiprocessors will integrate dozens of small processing cores with an on-chip interconnect consisting of point-to-point links. The interconnect enables the processing cores not only to communicate, but also to share common resources such as main memory and I/O controllers. In this work, we propose an arbitration scheme to enable equality of service (EoS) in access to a chip's shared resources. That is, we seek to remove any bias in a core's access to a shared resource based on its location in the CMP. We propose using probabilistic arbitration combined with distance-based weights to achieve EoS and overcome the limitation of conventional round-robin arbiters. We describe how nonlinear weights need to be used with probabilistic arbiters and propose three different arbitration weight metrics: fixed weight, constantly increasing weight, and variably increasing weight. By only modifying the arbitration of an on-chip router, we do not require any additional buffers or virtual channels and create a simple, low-cost mechanism for achieving EoS. We evaluate our arbitration scheme across a wide range of traffic patterns. In addition to providing EoS, the proposed arbitration has additional benefits, which include providing quality-of-service features (such as differentiated service) and providing fairness in terms of both throughput and latency that approaches the global fairness achieved with age-based arbitration, thus providing a more stable network by achieving high sustained throughput beyond saturation.
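As a rough illustration of probabilistic arbitration with distance-based weights, the sketch below picks a winning input port at random, with probability proportional to a nonlinear function of each packet's hop count. The exponential weight `base ** hops`, the port names, and all function names are assumptions made for this example; they are not the paper's three weight metrics, which the abstract names but does not define.

```python
import random
from collections import Counter
from typing import Dict


def distance_weight(hops: int, base: float = 2.0) -> float:
    """ASSUMPTION: a nonlinear (exponential) weight that grows with hop count."""
    return base ** hops


def arbitrate(requests: Dict[str, int], rng: random.Random) -> str:
    """Grant one requesting port, chosen with probability proportional to its weight.

    `requests` maps a port name to the hop count of the packet waiting there.
    """
    ports = list(requests)
    weights = [distance_weight(requests[p]) for p in ports]
    return rng.choices(ports, weights=weights, k=1)[0]


def grant_shares(requests: Dict[str, int], trials: int = 10000, seed: int = 0) -> Dict[str, float]:
    """Estimate each port's share of grants over many independent arbitrations."""
    rng = random.Random(seed)
    wins = Counter(arbitrate(requests, rng) for _ in range(trials))
    return {p: wins[p] / trials for p in requests}


if __name__ == "__main__":
    # A packet that has travelled 4 hops competes with one that has travelled 1 hop.
    print(grant_shares({"near": 1, "far": 4}))
```

Unlike a round-robin arbiter, which would split grants evenly between the two ports, this sketch favors the packet that has travelled farther, using hop count as a stand-in for packet age.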

IEEE Micro | 2008

Virtual Hierarchies

Michael R. Marty; Mark D. Hill

Abundant cores per chip will encourage a greater use of space sharing, where work stays on a group of cores for long time intervals. Virtual hierarchies can improve performance and performance isolation of space-shared workloads, while still supporting globally shared memory to facilitate dynamic partitioning and content-based page sharing.
