Stavros Volos
École Polytechnique Fédérale de Lausanne
Publication
Featured research published by Stavros Volos.
architectural support for programming languages and operating systems | 2012
Michael Ferdman; Almutaz Adileh; Onur Kocberber; Stavros Volos; Mohammad Alisafaee; Djordje Jevdjic; Cansu Kaynak; Adrian Popescu; Anastasia Ailamaki; Babak Falsafi
Emerging scale-out workloads require extensive amounts of computational resources. However, data centers using modern server hardware face physical constraints in space and power, limiting further expansion and calling for improvements in the computational density per server and in the per-operation energy. Continuing to improve the computational resources of the cloud while staying within physical constraints mandates optimizing server efficiency to ensure that server hardware closely matches the needs of scale-out workloads. In this work, we introduce CloudSuite, a benchmark suite of emerging scale-out workloads. We use performance counters on modern servers to study scale-out workloads, finding that today's predominant processor microarchitecture is inefficient for running these workloads. We find that inefficiency comes from the mismatch between the workload needs and modern processors, particularly in the organization of instruction and data memory systems and the processor core microarchitecture. Moreover, while today's predominant microarchitecture is inefficient when executing scale-out workloads, we find that continuing the current trends will further exacerbate the inefficiency in the future. In this work, we identify the key microarchitectural needs of scale-out workloads, calling for a change in the trajectory of server processors that would lead to improved computational density and power efficiency in data centers.
international symposium on computer architecture | 2013
Djordje Jevdjic; Stavros Volos; Babak Falsafi
Recent research advocates using large die-stacked DRAM caches to break the memory bandwidth wall. Existing DRAM cache designs fall into one of two categories --- block-based and page-based. The former organize data in conventional blocks (e.g., 64B), ensuring low off-chip bandwidth utilization, but co-locate tags and data in the stacked DRAM, incurring high lookup latency. Furthermore, such designs suffer from low hit ratios due to poor temporal locality. In contrast, page-based caches, which manage data at larger granularity (e.g., 4KB pages), allow for reduced tag array overhead and fast lookup, and leverage high spatial locality at the cost of moving large amounts of data on and off the chip. This paper introduces Footprint Cache, an efficient die-stacked DRAM cache design for server processors. Footprint Cache allocates data at the granularity of pages, but identifies and fetches only those blocks within a page that will be touched during the page's residency in the cache --- i.e., the page's footprint. In doing so, Footprint Cache eliminates the excessive off-chip traffic associated with page-based designs, while preserving their high hit ratio, small tag array overhead, and low lookup latency. Cycle-accurate simulation results of a 16-core server with up to 512MB Footprint Cache indicate a 57% performance improvement over a baseline chip without a die-stacked cache. Compared to a state-of-the-art block-based design, our design improves performance by 13% while reducing dynamic energy of stacked DRAM by 24%.
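To make the footprint idea concrete, the sketch below shows one plausible way a footprint predictor could be organized: a history table keyed by the instruction that triggers a page miss and the offset of the first access, storing a bit vector of the blocks observed to be touched. This is a minimal illustration, not the paper's implementation; the names (FootprintPredictor, PAGE_BLOCKS) and the table organization are assumptions.

```python
# Hypothetical sketch of footprint prediction for a page-based DRAM cache.
# Keyed by (trigger PC, first block offset); values are block bit vectors.

PAGE_BLOCKS = 64  # 4 KB page / 64 B blocks

class FootprintPredictor:
    """Predicts which blocks of a page will be touched during its residency."""
    def __init__(self):
        self.history = {}

    def predict(self, pc, first_offset):
        # With no history, fall back to fetching only the triggering block.
        default = 1 << first_offset
        return self.history.get((pc, first_offset), default)

    def update(self, pc, first_offset, touched_bitvector):
        # Record the observed footprint when the page is evicted.
        self.history[(pc, first_offset)] = touched_bitvector

def blocks_to_fetch(bitvector):
    """Expand a footprint bit vector into the block indices to fetch off-chip."""
    return [i for i in range(PAGE_BLOCKS) if bitvector & (1 << i)]
```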
international symposium on computer architecture | 2012
Pejman Lotfi-Kamran; Boris Grot; Michael Ferdman; Stavros Volos; Onur Kocberber; Javier Picorel; Almutaz Adileh; Djordje Jevdjic; Sachin Satish Idgunji; Emre Özer; Babak Falsafi
Scale-out datacenters mandate high per-server throughput to get the maximum benefit from the large TCO investment. Emerging applications (e.g., data serving and web search) that run in these datacenters operate on vast datasets that are not accommodated by on-die caches of existing server chips. Large caches reduce the die area available for cores and lower performance through long access latency when instructions are fetched. Performance on scale-out workloads is maximized through a modestly-sized last-level cache that captures the instruction footprint at the lowest possible access latency. In this work, we introduce a methodology for designing scalable and efficient scale-out server processors. Based on a metric of performance-density, we facilitate the design of optimal multi-core configurations, called pods. Each pod is a complete server that tightly couples a number of cores to a small last-level cache using a fast interconnect. Replicating the pod to fill the die area yields processors which have optimal performance density, leading to maximum per-chip throughput. Moreover, as each pod is a stand-alone server, scale-out processors avoid the expense of global (i.e., interpod) interconnect and coherence. These features synergistically maximize throughput, lower design complexity, and improve technology scalability. In 20nm technology, scale-out chips improve throughput by 5x-6.5x over conventional organizations and by 1.6x-1.9x over emerging tiled organizations.
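The performance-density methodology lends itself to a simple selection loop: evaluate candidate core/cache configurations, pick the one with the highest throughput per unit area, and replicate it across the die. The sketch below illustrates that flow under assumed inputs; the candidate tuples and area figures are placeholders, not numbers from the paper.

```python
# Illustrative sketch of pod selection by performance density (throughput / area).

def performance_density(throughput, area_mm2):
    return throughput / area_mm2

def best_pod(candidates):
    """candidates: list of (cores, cache_mb, throughput, area_mm2) tuples."""
    return max(candidates, key=lambda c: performance_density(c[2], c[3]))

def pods_per_die(pod_area_mm2, die_area_mm2):
    # Replicate the chosen pod to fill the available die area.
    return int(die_area_mm2 // pod_area_mm2)

# Example with made-up candidates: (cores, LLC MB, throughput, area mm^2)
candidates = [(16, 2, 10.0, 40.0), (32, 4, 18.0, 85.0), (64, 8, 30.0, 180.0)]
pod = best_pod(candidates)
print(pod, "pods per 400 mm^2 die:", pods_per_die(pod[3], 400.0))
```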
networks on chips | 2012
Stavros Volos; Ciprian Seiculescu; Boris Grot; Naser Khosro Pour; Babak Falsafi; Giovanni De Micheli
Manycore chips are emerging as the architecture of choice to provide power efficiency and improve performance, while riding Moore's Law. In these architectures, on-chip interconnects play a pivotal role in ensuring power and performance scalability. As supply voltages begin to level off in future technologies, chip designs in general and interconnects in particular will require specialization to meet power and performance objectives. In this work, we make the observation that cache-coherent manycore server chips exhibit a duality in on-chip network traffic. Request traffic largely consists of simple control messages, while response traffic often carries cache-block-sized payloads. We present Cache-Coherence Network-on-Chip (CCNoC), a design that specializes the NoC to fit the demands of server workloads via a pair of asymmetric networks tuned to the type of traffic traversing them. The networks differ in their data path width, router microarchitecture, flow control strategy, and delay. The resulting heterogeneous CCNoC architecture enables significant gains in power efficiency over conventional NoC designs at similar performance levels. Our evaluation reveals that a 4×4 mesh-based chip multiprocessor with the proposed CCNoC organization running commercial server workloads is 15-28% more energy efficient than various state-of-the-art single- and dual-network organizations.
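The request/response duality can be pictured as a simple steering decision: short control messages go to a narrow, cheap network, while block-carrying responses go to a wide one. The following sketch illustrates that split under assumed link widths; the widths and the message classification are illustrative, not CCNoC's actual parameters.

```python
# Minimal sketch of the traffic split that motivates asymmetric on-chip networks.

REQUEST_WIDTH_BITS = 128    # narrow network: coherence requests / control
RESPONSE_WIDTH_BITS = 256   # wide network: cache-block-sized payloads

def flits_needed(message_bits, link_width_bits):
    # Ceiling division: how many flits a message occupies on a given network.
    return -(-message_bits // link_width_bits)

def route(message):
    """Steer a message to the network tuned for its payload class."""
    if message["has_data_payload"]:   # e.g., 64 B cache block plus header
        return "response_network", flits_needed(message["bits"], RESPONSE_WIDTH_BITS)
    return "request_network", flits_needed(message["bits"], REQUEST_WIDTH_BITS)

print(route({"has_data_payload": False, "bits": 96}))
print(route({"has_data_payload": True, "bits": 64 * 8 + 96}))
```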
international symposium on microarchitecture | 2014
Stavros Volos; Javier Picorel; Babak Falsafi; Boris Grot
With the end of Dennard scaling, server power has emerged as the limiting factor in the quest for more capable datacenters. Without the benefit of supply voltage scaling, it is essential to lower the energy per operation to improve server efficiency. As the industry moves to lean-core server processors, the energy bottleneck is shifting toward main memory as a chief source of server energy consumption in modern datacenters. Maximizing the energy efficiency of today's DRAM chips and interfaces requires amortizing the costly DRAM page activations over multiple row buffer accesses. This work introduces Bulk Memory Access Prediction and Streaming, or BuMP. We make the observation that a significant fraction (59-79%) of all memory accesses fall into DRAM pages with high access density, meaning that the majority of their cache blocks will be accessed within a modest time frame of the first access. Accesses to high-density DRAM pages include not only memory reads in response to load instructions, but also reads stemming from store instructions as well as memory writes upon a dirty LLC eviction. The remaining accesses go to low-density pages with virtually unpredictable reference patterns (e.g., hashed key lookups). BuMP employs a low-cost predictor to identify high-density pages and triggers bulk transfer operations upon the first read or write to the page. In doing so, BuMP enforces high row buffer locality where it is profitable, thereby reducing DRAM energy per access by 23%, and improves server throughput by 11% across a wide range of server applications.
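A rough sketch of the idea follows: a small predictor tracks how densely pages touched by a given instruction tend to be accessed, and the first access to a page predicted as high-density triggers a bulk transfer of the whole DRAM row. The table organization, the density threshold, and the memory_controller hooks are assumptions made for illustration, not BuMP's actual design.

```python
# Hedged sketch of a page-density predictor in the spirit of BuMP.

DENSITY_THRESHOLD = 0.75  # assumed fraction of blocks that marks a "dense" page

class PageDensityPredictor:
    def __init__(self):
        self.table = {}  # keyed by the PC of the instruction that first touches a page

    def is_high_density(self, pc):
        return self.table.get(pc, 0.0) >= DENSITY_THRESHOLD

    def record(self, pc, blocks_accessed, blocks_per_page=64):
        # Exponential moving average of the observed access density.
        observed = blocks_accessed / blocks_per_page
        old = self.table.get(pc, observed)
        self.table[pc] = 0.5 * old + 0.5 * observed

def on_first_access(pc, page_addr, predictor, memory_controller):
    """Fetch the whole row in bulk when density is predicted high, else one block."""
    if predictor.is_high_density(pc):
        memory_controller.bulk_transfer(page_addr)  # hypothetical controller hook
    else:
        memory_controller.fetch_block(page_addr)    # hypothetical controller hook
```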
ACM Transactions on Computer Systems | 2012
Michael Ferdman; Almutaz Adileh; Onur Kocberber; Stavros Volos; Mohammad Alisafaee; Djordje Jevdjic; Cansu Kaynak; Adrian Popescu; Anastasia Ailamaki; Babak Falsafi
Emerging scale-out workloads require extensive amounts of computational resources. However, data centers using modern server hardware face physical constraints in space and power, limiting further expansion and calling for improvements in the computational density per server and in the per-operation energy. Continuing to improve the computational resources of the cloud while staying within physical constraints mandates optimizing server efficiency to ensure that server hardware closely matches the needs of scale-out workloads. In this work, we introduce CloudSuite, a benchmark suite of emerging scale-out workloads. We use performance counters on modern servers to study scale-out workloads, finding that today’s predominant processor microarchitecture is inefficient for running these workloads. We find that inefficiency comes from the mismatch between the workload needs and modern processors, particularly in the organization of instruction and data memory systems and the processor core microarchitecture. Moreover, while today’s predominant microarchitecture is inefficient when executing scale-out workloads, we find that continuing the current trends will further exacerbate the inefficiency in the future. In this work, we identify the key microarchitectural needs of scale-out workloads, calling for a change in the trajectory of server processors that would lead to improved computational density and power efficiency in data centers.
IEEE Micro | 2014
Michael Ferdman; Almutaz Adileh; Onur Kocberber; Stavros Volos; Mohammad Alisafaee; Djordje Jevdjic; Cansu Kaynak; Adrian Popescu; Anastasia Ailamaki; Babak Falsafi
Emerging scale-out workloads need extensive amounts of computational resources. However, data centers using modern server hardware face physical constraints in space and power, limiting further expansion and requiring improvements in the computational density per server and in the per-operation energy. Continuing to improve the computational resources of the cloud while staying within physical constraints mandates optimizing server efficiency. In this work, we demonstrate that modern server processors are highly inefficient for running cloud workloads. To address this problem, we investigate the microarchitectural behavior of scale-out workloads and present opportunities to enable specialized processor designs that closely match the needs of the cloud.
IEEE Micro | 2017
Stavros Volos; Djordje Jevdjic; Babak Falsafi; Boris Grot
Emerging scale-out servers are characterized by massive memory footprints and bandwidth requirements. On-chip stacked DRAM caches have been proposed to provide the required bandwidth for manycore servers through caching of secondary data working sets. However, the disparity between provided capacity and working set sizes precludes their effective deployment in servers, calling for high-capacity cache architectures. High-capacity caches--enabled by the emergence of high-bandwidth memory technologies--exhibit high spatiotemporal locality due to coarse-grained access patterns and long cache residency periods stemming from skewed dataset access distributions. The observed spatiotemporal behavior favors a page-based organization that naturally exploits spatial locality while minimizing tag storage requirements and enabling a practical in-SRAM tag array architecture. By storing tags in SRAM, caches avoid the complexity of in-DRAM metadata found in state-of-the-art DRAM caches.
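The case for page-based, in-SRAM tags can be sanity-checked with back-of-the-envelope arithmetic: at block granularity the tag array for a multi-gigabyte DRAM cache is far too large for SRAM, whereas page granularity shrinks it by roughly the page-to-block size ratio. The tag-entry size below is a rough assumption, not a figure from the article.

```python
# Rough tag-storage comparison for block-grain vs. page-grain DRAM caches.

def tag_array_bytes(cache_bytes, granularity_bytes, tag_entry_bits=40):
    entries = cache_bytes // granularity_bytes
    return entries * tag_entry_bits // 8

cache = 8 * 1024**3  # an assumed 8 GB high-capacity DRAM cache
print(tag_array_bytes(cache, 64) / 1024**2, "MB of tags at 64 B blocks")    # ~640 MB
print(tag_array_bytes(cache, 4096) / 1024**2, "MB of tags at 4 KB pages")   # ~10 MB
```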
symposium on cloud computing | 2018
Ian A. Kash; Greg O'Shea; Stavros Volos
Public cloud datacenters implement a distributed computing environment built for economy at scale, with hundreds of thousands of compute and storage servers and a large population of predominantly small customers often densely packed to a compute server. Several recent contributions have investigated how equitable sharing and differentiated services can be achieved in this multi-resource environment, using the Extended Dominant Resource Fairness (EDRF) algorithm. However, we find that EDRF requires prohibitive execution time when employed at datacenter scale due to its iterative nature and polynomial time complexity; its closed-form expression does not alter its asymptotic complexity. In response, we propose Deadline-Constrained DRF, or DC-DRF, an adaptive approximation of EDRF designed to support centralized multi-resource allocation at datacenter scale in bounded time. The approximation introduces error which can be reduced using a high-performance implementation, drawing on parallelization techniques from the field of High-Performance Computing and vector arithmetic instructions available in modern server processors. We evaluate DC-DRF at scales that exceed those previously reported by several orders of magnitude, calculating resource allocations for one million predominantly small tenants and one hundred thousand resources, in seconds. Our parallel implementation preserves the properties of EDRF up to a small error, and empirical results show that the error introduced by approximation is insignificant for practical purposes.
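For readers unfamiliar with dominant resource fairness, the sketch below shows the basic progressive-filling loop (serve the tenant with the smallest dominant share) with a wall-clock budget, which is the flavor of deadline constraint the abstract describes. This is a simplified illustration, not the DC-DRF algorithm; the data structures, the saturation handling, and the deadline check are assumptions.

```python
# Illustrative dominant-resource-fairness allocation loop with a time budget.
import time

def drf_allocate(capacity, demands, deadline_s):
    """capacity: per-resource capacities; demands: {tenant: per-task demand vector}.
    Returns the number of tasks allocated to each tenant."""
    used = [0.0] * len(capacity)
    tasks = {t: 0 for t in demands}
    dominant_share = {t: 0.0 for t in demands}
    saturated = set()
    start = time.monotonic()

    while time.monotonic() - start < deadline_s and len(saturated) < len(demands):
        # Progressive filling: pick the unsaturated tenant with the lowest dominant share.
        tenant = min((t for t in demands if t not in saturated), key=dominant_share.get)
        demand = demands[tenant]
        if any(u + d > c for u, d, c in zip(used, demand, capacity)):
            saturated.add(tenant)  # this tenant's next task no longer fits
            continue
        used = [u + d for u, d in zip(used, demand)]
        tasks[tenant] += 1
        dominant_share[tenant] = tasks[tenant] * max(d / c for d, c in zip(demand, capacity))
    return tasks

# Example: two resources (CPU, RAM), two tenants, 10 ms budget.
print(drf_allocate([9.0, 18.0], {"A": [1.0, 4.0], "B": [3.0, 1.0]}, 0.01))
```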
Archive | 2012
Pejman Lotfi-Kamran; Boris Grot; Michael Ferdman; Stavros Volos; Onur Kocberber; Javier Picorel; Almutaz Adileh; Djordje Jevdjic; Sachin Satish Idgunji; Emre Özer; Babak Falsafi