Ahmad Samih
North Carolina State University
Publications
Featured research published by Ahmad Samih.
international symposium on performance analysis of systems and software | 2012
Anil Krishna; Ahmad Samih; Yan Solihin
Analytical modeling is becoming an increasingly important technique used in the design of chip multiprocessors. Most such models assume multi-programmed workload mixes and either ignore or oversimplify the behavior of multi-threaded applications. In particular, data sharing observed in multi-threaded applications, and its impact on chip design decisions, has not been well characterized in prior analytical modeling work. In this work we describe why data sharing behavior is hard to capture in an analytical model, and study why, and by how much, past attempts have fallen short. We propose a new methodology to measure the impact of data sharing, which quantifies the reduction in on-chip cache miss rates attributable solely to the presence of data sharing. We then extend an existing analytical performance model for a many-core chip by incorporating into it the impact of data sharing in contemporary multi-threaded workloads. We use this analytical model to explore the chip design space for a hypothetical many-core chip of the future. We find that the optimal design point is substantially different when the impact of data sharing is modeled compared to when it is not. Data sharing can enable reassigning a significant fraction of the total chip area (up to 16%, per our model of a future many-core) from cache resources to core resources, which, in turn, improves the overall chip throughput (by up to 58%).
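The area-allocation tradeoff this abstract describes can be illustrated with a toy sweep. This is a sketch of the general idea, not the paper's model: every constant below (the power-law miss curve, the miss penalty, the area costs) is an illustrative assumption, and data sharing is approximated as a fraction of each thread's footprint collapsing into one shared copy.

```python
# Toy area-allocation sweep: modeling data sharing lowers effective cache
# pressure, which shifts the best split of chip area between cores and cache.
# All constants are illustrative assumptions, not values from the paper.

def miss_rate(cache_mb, threads, shared_fraction):
    """Power-law miss model; sharing shrinks the aggregate footprint."""
    footprint = threads * (1.0 - shared_fraction) + shared_fraction
    effective_mb = cache_mb / footprint
    return min(1.0, 0.05 * effective_mb ** -0.5)

def throughput(cores, cache_mb, shared_fraction):
    """Per-core IPC degrades with miss rate; throughput = cores * IPC."""
    mr = miss_rate(cache_mb, cores, shared_fraction)
    ipc = 1.0 / (1.0 + 50.0 * mr)   # assumed average miss penalty factor
    return cores * ipc

def best_design(total_area=100.0, core_area=2.0, mb_area=1.0, shared=0.0):
    """Sweep the core/cache split; return (cores, cache_mb, throughput)."""
    best = (0, 0.0, 0.0)
    for cores in range(1, int(total_area / core_area)):
        cache_mb = (total_area - cores * core_area) / mb_area
        t = throughput(cores, cache_mb, shared)
        if t > best[2]:
            best = (cores, cache_mb, t)
    return best

no_sharing = best_design(shared=0.0)
with_sharing = best_design(shared=0.5)
# Modeling sharing raises the achievable throughput at the optimum, and the
# optimal design allocates at least as much area to cores as before.
```

Even this crude model reproduces the qualitative finding: once sharing is accounted for, cache area buys fewer miss-rate reductions, so the optimizer favors spending area on cores.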
ACM Transactions on Architecture and Code Optimization | 2011
Ahmad Samih; Yan Solihin; Anil Krishna
Chip Multiprocessors (CMPs) with distributed L2 caches suffer from a cache fragmentation problem: some caches may be overutilized while others may be underutilized. To avoid such fragmentation, researchers have proposed capacity sharing mechanisms in which applications that need additional cache space can place their victim blocks in remote caches. However, we found that allowing only victim blocks to be placed in remote caches tends to cause a high number of remote cache hits relative to local cache hits. In this article, we show that many of these remote cache hits can be converted into local cache hits if we allow newly fetched blocks to be selectively placed directly in a remote cache, rather than in the local cache. To demonstrate this, we use future trace information to estimate the near-upper-bound performance that can be gained from combined placement and replacement decisions in capacity sharing. Motivated by encouraging experimental results, we design a simple predictor-based scheme called Adaptive Placement Policy (APP) that learns from past cache behavior to make a better decision on whether to place a newly fetched block in the local or a remote cache. We found that across 50 multiprogrammed workload mixes running on a 4-core CMP, APP's capacity sharing mechanism increases aggregate performance by 29% on average. At the same time, APP outperforms the state-of-the-art capacity sharing mechanism that uses only replacement-based decisions by up to 18.2%, with a maximum degradation of only 0.5% and an average improvement of 3%.
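The placement decision at the heart of this abstract can be sketched with a small predictor. This is a minimal illustration in the spirit of APP, not the paper's actual design: the table indexing, counter widths, and training rule below are all assumptions.

```python
# Sketch of a predictor-based fill-placement decision: a 2-bit saturating
# counter per address region learns whether blocks from that region tend to
# be reused while resident in the local cache, and steers new fills to the
# local cache or a remote one. Structure and parameters are illustrative.

class PlacementPredictor:
    def __init__(self, regions=256):
        self.counters = [2] * regions   # start weakly biased toward local
        self.regions = regions

    def _idx(self, block_addr):
        return (block_addr >> 6) % self.regions   # 64 B blocks -> region

    def place_locally(self, block_addr):
        """True -> fill into the local cache; False -> fill remotely."""
        return self.counters[self._idx(block_addr)] >= 2

    def update(self, block_addr, reused_while_local):
        """Train at eviction: was the block hit again before eviction?"""
        i = self._idx(block_addr)
        if reused_while_local:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

p = PlacementPredictor()
addr = 0x1F40
for _ in range(3):                  # region repeatedly shows no local reuse,
    p.update(addr, reused_while_local=False)
local = p.place_locally(addr)       # so new fills from it go remote (False)
```

The design choice mirrored here is the one the abstract argues for: deciding at fill time, from past behavior, converts would-be remote hits into local hits instead of only reacting at replacement time.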
Proceedings of the 1st Workshop on Architectures and Systems for Big Data | 2011
Ahmad Samih; Ren Wang; Christian Maciocco; Tsung-Yuan Charlie Tai; Yan Solihin
With the fast development of highly integrated distributed systems (cluster systems), especially those encapsulated within a single platform [28, 9], designers face interesting memory hierarchy design choices that attempt to avoid disk swapping, which slows application execution drastically. Leveraging remote free memory through Memory Collaboration has demonstrated its cost-effectiveness compared to overprovisioning for peak load requirements. Recent studies propose several ways of accessing under-utilized remote memory in static system configurations, without detailed exploration of dynamic memory collaboration. Dynamic collaboration is an important aspect given the run-time memory usage fluctuations in clustered systems. In this paper, we propose an Autonomous Collaborative Memory System (ACMS) that manages memory resources dynamically at run time to optimize performance and provide QoS measures for nodes engaging in the system. We implement a prototype realizing the proposed ACMS, experiment with a wide range of real-world applications, and show up to 3× performance speedup compared to a non-collaborative memory system, without perceivable performance impact on nodes that provide memory. Based on our experiments, we conduct a detailed analysis of the remote memory access overhead and provide insights for future optimizations.
cluster computing and the grid | 2012
Ahmad Samih; Ren Wang; Christian Maciocco; Tsung-Yuan Charlie Tai; Ronghui Duan; Jiangang Duan; Yan Solihin
With the fast development of highly integrated distributed systems (cluster systems), designers face interesting memory hierarchy design choices while attempting to avoid the notorious disk swapping. Swapping to free remote memory through Memory Collaboration has demonstrated its cost-effectiveness compared to overprovisioning the cluster for peak load requirements. Recent memory collaboration studies propose several ways of accessing under-utilized remote memory in static system configurations, without detailed exploration of dynamic memory collaboration. Dynamic collaboration is an important aspect given the run-time memory usage fluctuations in clustered systems. Further, as interest in memory collaboration grows, it is crucial to understand the existing performance bottlenecks, overheads, and potential optimizations. In this paper we address these two issues. First, we propose an Autonomous Collaborative Memory System (ACMS) that manages memory resources dynamically at run time to optimize performance. We implement a prototype realizing the proposed ACMS, experiment with a wide range of real-world applications, and show up to 3× performance speedup compared to a non-collaborative memory system, without perceivable performance impact on nodes that provide memory. Second, we analyze, in depth, the end-to-end memory collaboration overhead and pinpoint the corresponding bottlenecks.
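The dynamic policy such a collaborative memory system needs can be sketched in a few lines. This is an illustration of the general borrow/lend idea, not the ACMS implementation: the node model, the per-request granting loop, and the QoS reserve below are all assumptions.

```python
# Sketch of dynamic memory collaboration: an allocation is served locally
# first, then from remote donors' spare memory, and only the remainder
# falls back to disk swap. Donors never lend past a QoS reserve, modeling
# the "no perceivable impact on nodes that provide memory" goal.

class Node:
    def __init__(self, name, total_mb, reserve_mb):
        self.name = name
        self.total_mb = total_mb
        self.used_mb = 0
        self.lent_mb = 0
        self.reserve_mb = reserve_mb   # QoS floor a donor never lends past

    def free_mb(self):
        return self.total_mb - self.used_mb - self.lent_mb

    def spare_mb(self):
        return max(0, self.free_mb() - self.reserve_mb)

def allocate(node, mb, cluster):
    """Serve a request locally, then from remote donors, then 'swap'."""
    local = min(mb, node.free_mb())
    node.used_mb += local
    remaining = mb - local
    remote = 0
    for donor in cluster:
        if donor is node or remaining == 0:
            continue
        grant = min(remaining, donor.spare_mb())
        donor.lent_mb += grant
        remote += grant
        remaining -= grant
    return {"local": local, "remote": remote, "swapped": remaining}

a = Node("a", total_mb=1024, reserve_mb=128)
b = Node("b", total_mb=1024, reserve_mb=128)
res = allocate(a, 1500, [a, b])
# a fills its 1024 MB locally, borrows 476 MB from b's spare (896 MB),
# and nothing spills to swap for this request.
```

A real system layers run-time monitoring on top of this (watermarks for when to borrow and when to return memory), which is the "dynamic" aspect both ACMS papers emphasize; this sketch only shows a single allocation decision.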
Archive | 2012
Ren Wang; Ahmad Samih; Christian Maciocco; Tsung-Yuan Tai
Archive | 2013
Ren Wang; Ahmad Samih; Eric Delano; Pinkesh J. Shah; Zeshan Chishti; Christian Maciocco; Tsung-Yuan Charlie Tai
Archive | 2012
Ren Wang; Jr-Shian Tsai; Maziar H. Manesh; Tsung-Yuan C. Tai; Ahmad Samih
Archive | 2012
Ren Wang; Ahmad Samih; Christian Maciocco; Tsung-Yuan Charlie Tai; James Jimbo Alexander; Prashant R. Chandra
Archive | 2011
Ren Wang; Christian Maciocco; Tsung-Yuan C. Tai; Ahmad Samih; Mona Vij; Arun Raghunath; John Keys; Scott Hahn; Raj Yavatkar
Archive | 2011
Ahmad Samih; Ren Wang; Christian Maciocco; Tsung-Yuan C. Tai