Djordje Jevdjic
École Polytechnique Fédérale de Lausanne
Publications
Featured research published by Djordje Jevdjic.
Architectural Support for Programming Languages and Operating Systems | 2012
Michael Ferdman; Almutaz Adileh; Onur Kocberber; Stavros Volos; Mohammad Alisafaee; Djordje Jevdjic; Cansu Kaynak; Adrian Popescu; Anastasia Ailamaki; Babak Falsafi
Emerging scale-out workloads require extensive amounts of computational resources. However, data centers using modern server hardware face physical constraints in space and power, limiting further expansion and calling for improvements in the computational density per server and in the per-operation energy. Continuing to improve the computational resources of the cloud while staying within physical constraints mandates optimizing server efficiency to ensure that server hardware closely matches the needs of scale-out workloads. In this work, we introduce CloudSuite, a benchmark suite of emerging scale-out workloads. We use performance counters on modern servers to study scale-out workloads, finding that today's predominant processor micro-architecture is inefficient for running these workloads. We find that inefficiency comes from the mismatch between the workload needs and modern processors, particularly in the organization of instruction and data memory systems and the processor core micro-architecture. Moreover, while today's predominant micro-architecture is inefficient when executing scale-out workloads, we find that continuing the current trends will further exacerbate the inefficiency in the future. In this work, we identify the key micro-architectural needs of scale-out workloads, calling for a change in the trajectory of server processors that would lead to improved computational density and power efficiency in data centers.
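To make the counter-based methodology above concrete, the following sketch (our own illustration, not the paper's tooling; counter names and sample values are hypothetical) shows how raw performance-counter readings translate into the metrics such studies rely on, such as IPC and misses per kilo-instruction (MPKI):

```python
# Minimal sketch: deriving IPC and MPKI from raw hardware performance-counter
# readings. The counter names and sample values below are illustrative only.

def ipc(instructions: int, cycles: int) -> float:
    """Retired instructions per core cycle."""
    return instructions / cycles

def mpki(misses: int, instructions: int) -> float:
    """Misses per thousand retired instructions."""
    return 1000.0 * misses / instructions

# Hypothetical counter sample taken while a scale-out workload runs.
sample = {
    "instructions": 12_500_000_000,
    "cycles": 25_000_000_000,
    "l1i_misses": 375_000_000,
    "llc_misses": 95_000_000,
}

print(f"IPC       = {ipc(sample['instructions'], sample['cycles']):.2f}")
print(f"L1-I MPKI = {mpki(sample['l1i_misses'], sample['instructions']):.1f}")
print(f"LLC MPKI  = {mpki(sample['llc_misses'], sample['instructions']):.1f}")
```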
International Symposium on Computer Architecture | 2013
Djordje Jevdjic; Stavros Volos; Babak Falsafi
Recent research advocates using large die-stacked DRAM caches to break the memory bandwidth wall. Existing DRAM cache designs fall into one of two categories --- block-based and page-based. The former organize data in conventional blocks (e.g., 64B), ensuring low off-chip bandwidth utilization, but co-locate tags and data in the stacked DRAM, incurring high lookup latency. Furthermore, such designs suffer from low hit ratios due to poor temporal locality. In contrast, page-based caches, which manage data at larger granularity (e.g., 4KB pages), allow for reduced tag array overhead and fast lookup, and leverage high spatial locality at the cost of moving large amounts of data on and off the chip. This paper introduces Footprint Cache, an efficient die-stacked DRAM cache design for server processors. Footprint Cache allocates data at the granularity of pages, but identifies and fetches only those blocks within a page that will be touched during the page's residency in the cache --- i.e., the page's footprint. In doing so, Footprint Cache eliminates the excessive off-chip traffic associated with page-based designs, while preserving their high hit ratio, small tag array overhead, and low lookup latency. Cycle-accurate simulation results of a 16-core server with up to 512MB Footprint Cache indicate a 57% performance improvement over a baseline chip without a die-stacked cache. Compared to a state-of-the-art block-based design, our design improves performance by 13% while reducing dynamic energy of stacked DRAM by 24%.
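A minimal sketch of the footprint idea, assuming 4KB pages of 64-byte blocks (the class and its structure are our own illustration, not the paper's hardware design): track which blocks of a page are touched while it is resident, remember that bitmask on eviction, and on a later allocation of the same page fetch only the remembered footprint.

```python
# Illustrative footprint predictor: a 4KB page holds 64 blocks of 64B.
# First residency fetches everything; later residencies fetch only the
# blocks observed in the previous footprint.

BLOCKS_PER_PAGE = 64          # 4KB page / 64B blocks

class FootprintPredictor:
    def __init__(self):
        self.history = {}     # page tag -> bitmask of previously touched blocks
        self.live = {}        # resident page tag -> bitmask being built

    def allocate(self, page):
        """Return the block indices to fetch for a newly allocated page."""
        self.live[page] = 0
        mask = self.history.get(page, (1 << BLOCKS_PER_PAGE) - 1)  # default: all
        return [b for b in range(BLOCKS_PER_PAGE) if mask & (1 << b)]

    def touch(self, page, block):
        self.live[page] = self.live.get(page, 0) | (1 << block)

    def evict(self, page):
        self.history[page] = self.live.pop(page, 0)

pred = FootprintPredictor()
pred.allocate(0x42)           # first residency: fetch everything
for b in (0, 1, 7):           # only three blocks actually used
    pred.touch(0x42, b)
pred.evict(0x42)
print(pred.allocate(0x42))    # second residency: fetch only [0, 1, 7]
```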
International Symposium on Computer Architecture | 2012
Pejman Lotfi-Kamran; Boris Grot; Michael Ferdman; Stavros Volos; Onur Kocberber; Javier Picorel; Almutaz Adileh; Djordje Jevdjic; Sachin Satish Idgunji; Emre Özer; Babak Falsafi
Scale-out datacenters mandate high per-server throughput to get the maximum benefit from the large TCO investment. Emerging applications (e.g., data serving and web search) that run in these datacenters operate on vast datasets that are not accommodated by on-die caches of existing server chips. Large caches reduce the die area available for cores and lower performance through long access latency when instructions are fetched. Performance on scale-out workloads is maximized through a modestly-sized last-level cache that captures the instruction footprint at the lowest possible access latency. In this work, we introduce a methodology for designing scalable and efficient scale-out server processors. Based on a metric of performance-density, we facilitate the design of optimal multi-core configurations, called pods. Each pod is a complete server that tightly couples a number of cores to a small last-level cache using a fast interconnect. Replicating the pod to fill the die area yields processors which have optimal performance density, leading to maximum per-chip throughput. Moreover, as each pod is a stand-alone server, scale-out processors avoid the expense of global (i.e., interpod) interconnect and coherence. These features synergistically maximize throughput, lower design complexity, and improve technology scalability. In 20nm technology, scale-out chips improve throughput by 5x-6.5x over conventional organizations and by 1.6x-1.9x over emerging tiled organizations.
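The pod-sizing methodology can be illustrated as a small optimization over a performance-density metric (throughput per unit area). The throughput and area models below are invented placeholders, not figures from the paper:

```python
# Illustrative pod sizing: pick the core count that maximizes throughput per
# mm^2 for one pod, then replicate the pod to fill the die. All constants and
# the sub-linear throughput model are assumptions made for this sketch.

CORE_AREA_MM2 = 2.0
CACHE_AREA_MM2 = 8.0          # modest per-pod last-level cache
DIE_AREA_MM2 = 260.0

def pod_throughput(cores: int) -> float:
    # Placeholder model: throughput grows sub-linearly as the shared cache
    # and interconnect are contended by more cores.
    return cores / (1.0 + 0.02 * cores)

def pod_area(cores: int) -> float:
    return cores * CORE_AREA_MM2 + CACHE_AREA_MM2

def performance_density(cores: int) -> float:
    return pod_throughput(cores) / pod_area(cores)

best = max(range(1, 65), key=performance_density)
pods = int(DIE_AREA_MM2 // pod_area(best))
print(f"best pod: {best} cores; {pods} pods per die; "
      f"chip throughput ~{pods * pod_throughput(best):.1f} (arbitrary units)")
```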
International Symposium on Microarchitecture | 2014
Djordje Jevdjic; Gabriel H. Loh; Cansu Kaynak; Babak Falsafi
Recent research advocates large die-stacked DRAM caches in many-core servers to break the memory latency and bandwidth wall. To realize their full potential, die-stacked DRAM caches necessitate low lookup latencies, high hit rates, and efficient use of off-chip bandwidth. Today's stacked DRAM cache designs fall into two categories based on the granularity at which they manage data: block-based and page-based. The state-of-the-art block-based design, called Alloy Cache, collocates a tag with each data block (e.g., 64B) in the stacked DRAM to provide fast access to data in a single DRAM access. However, such a design suffers from low hit rates due to poor temporal locality in the DRAM cache. In contrast, the state-of-the-art page-based design, called Footprint Cache, organizes the DRAM cache at page granularity (e.g., 4KB), but fetches only the blocks that will likely be touched within a page. In doing so, the Footprint Cache achieves high hit rates with moderate on-chip tag storage and reasonable lookup latency. However, multi-gigabyte stacked DRAM caches will soon be practical and needed by server applications, thereby mandating tens of MBs of tag storage even for page-based DRAM caches. We introduce a novel stacked-DRAM cache design, Unison Cache. Similar to Alloy Cache's approach, Unison Cache incorporates the tag metadata directly into the stacked DRAM to enable scalability to arbitrary stacked-DRAM capacities. Then, leveraging the insights from the Footprint Cache design, Unison Cache employs large, page-sized cache allocation units to achieve high hit rates and reduced tag overheads, while predicting and fetching only the useful blocks within each page to minimize the off-chip traffic. Our evaluation using server workloads and caches of up to 8GB reveals that Unison Cache improves performance by 14% compared to Alloy Cache due to its high hit rate, while outperforming the state-of-the-art page-based designs that require impractical SRAM-based tags of around 50MB.
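A back-of-the-envelope calculation (our own arithmetic, with assumed per-entry sizes) shows why SRAM tag arrays stop scaling at multi-gigabyte capacities, consistent with the roughly 50MB figure cited above:

```python
# Tag-storage estimate for an 8GB stacked-DRAM cache, motivating in-DRAM tags.
# Per-entry byte counts are assumptions made for this sketch.

GB, KB, MB = 2**30, 2**10, 2**20
cache_capacity = 8 * GB

# Block-grain tags: one entry per 64B block.
block_entries = cache_capacity // 64
block_entry_bytes = 5                               # assumed tag + state per block
print(f"block-grain SRAM tags: {block_entries * block_entry_bytes / MB:.0f} MB")

# Page-grain tags: one entry per 4KB page, plus per-block valid and footprint bits.
page_entries = cache_capacity // (4 * KB)
page_entry_bytes = 4 + (64 // 8) + (64 // 8) + 4    # tag, valid bits, footprint bits, misc
print(f"page-grain SRAM tags:  {page_entries * page_entry_bytes / MB:.0f} MB")
```

Even the page-grain organization lands near 50MB of SRAM for 8GB of cache, which is why Unison Cache moves tags into the stacked DRAM itself.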
Extending Database Technology | 2013
Pinar Tözün; Ippokratis Pandis; Cansu Kaynak; Djordje Jevdjic; Anastasia Ailamaki
Introduced in 2007, TPC-E is the most recently standardized OLTP benchmark by TPC. Even though TPC-E has already been around for six years, it has not gained the popularity of its predecessor TPC-C: all the published results for TPC-E use a single database vendor's product. TPC-E is significantly different from its predecessors. Some of its distinguishing characteristics are non-uniform input creation, longer-running and more complicated transactions, and more difficult partitioning. These factors slow down the adoption of TPC-E. In turn, there is little knowledge in the community about how TPC-E behaves micro-architecturally and within the database engine. To shed light on TPC-E, we implement it on top of a scalable open-source database engine, Shore-MT, and perform a workload characterization study, comparing it with the previous, much better-known OLTP benchmarks of TPC: TPC-B and TPC-C. In parallel, we study the evolution of the OLTP benchmarks throughout the decades. Our results demonstrate that TPC-E exhibits similar micro-architectural behavior to TPC-B and TPC-C, even though it incurs less stall time and higher instructions per cycle. On the other hand, within the database engine it suffers more from logical lock contention. Therefore, we argue that, on the hardware side, TPC-E needs less aggressive processors, whereas on the software side it can benefit from designs based on intra-transaction parallelism, logical partitioning, and optimistic concurrency control to minimize the effects of lock contention without introducing distributed transactions.
ACM Transactions on Computer Systems | 2012
Michael Ferdman; Almutaz Adileh; Onur Kocberber; Stavros Volos; Mohammad Alisafaee; Djordje Jevdjic; Cansu Kaynak; Adrian Popescu; Anastasia Ailamaki; Babak Falsafi
Emerging scale-out workloads require extensive amounts of computational resources. However, data centers using modern server hardware face physical constraints in space and power, limiting further expansion and calling for improvements in the computational density per server and in the per-operation energy. Continuing to improve the computational resources of the cloud while staying within physical constraints mandates optimizing server efficiency to ensure that server hardware closely matches the needs of scale-out workloads. In this work, we introduce CloudSuite, a benchmark suite of emerging scale-out workloads. We use performance counters on modern servers to study scale-out workloads, finding that today’s predominant processor microarchitecture is inefficient for running these workloads. We find that inefficiency comes from the mismatch between the workload needs and modern processors, particularly in the organization of instruction and data memory systems and the processor core microarchitecture. Moreover, while today’s predominant microarchitecture is inefficient when executing scale-out workloads, we find that continuing the current trends will further exacerbate the inefficiency in the future. In this work, we identify the key microarchitectural needs of scale-out workloads, calling for a change in the trajectory of server processors that would lead to improved computational density and power efficiency in data centers.
International Conference on Computer Design | 2012
Dragomir Milojevic; Sachin Satish Idgunji; Djordje Jevdjic; Emre Özer; Pejman Lotfi-Kamran; Andreas Panteli; Andreas Prodromou; Chrysostomos Nicopoulos; Damien Hardy; Babak Falsafi; Yiannakis Sazeides
We propose a power-efficient many-core server-on-chip system with 3D-stacked Wide I/O DRAM targeting cloud workloads in datacenters. The integration of 3D-stacked Wide I/O DRAM on top of a logic die increases available memory bandwidth by using dense and fast Through-Silicon Vias (TSVs) instead of off-chip IOs, enabling faster data transfers at much lower energy per bit. We demonstrate a methodology that includes full-system microarchitectural modeling and rapid virtual physical prototyping with emphasis on the thermal analysis. Our findings show that while executing CPU-centric benchmarks (e.g. SPECInt and Dhrystone), the temperature in the server-on-chip (logic+DRAM) is in the range of 175-200°C at a power consumption of less than 20W, exceeding the reliable operating bounds without any cooling solutions, even with embedded cores. However, with real cloud workloads, the power density in the server-on-chip is much lower, owing to the lower power drawn by memory-intensive cloud workloads, and the temperature remains well below the levels reached with CPU-centric workloads. We show that such a server-on-chip system is feasible with a low-cost passive heat sink, eliminating the need for a high-cost active heat sink with an attached fan and creating an opportunity for overall cost and energy savings in datacenters.
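For intuition, a first-order steady-state thermal estimate (a simplification we add here; the paper's conclusions rest on detailed physical and thermal prototyping, and all numbers below are assumptions) illustrates why roughly 20W can overheat a bare 3D stack yet stay manageable under a passive heat sink:

```python
# First-order estimate: T_junction = T_ambient + P * R_ja, where R_ja is the
# junction-to-ambient thermal resistance. All values are placeholder assumptions.

T_AMBIENT_C = 45.0                                  # assumed datacenter inlet air

def junction_temp(power_w: float, r_ja_c_per_w: float) -> float:
    return T_AMBIENT_C + power_w * r_ja_c_per_w

scenarios = [
    ("CPU-centric workload, no heat sink",      18.0, 8.0),
    ("CPU-centric workload, passive heat sink", 18.0, 2.5),
    ("Cloud workload, passive heat sink",       10.0, 2.5),
]
for name, power, r_ja in scenarios:
    print(f"{name}: ~{junction_temp(power, r_ja):.0f} C")
```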
IEEE Micro | 2014
Michael Ferdman; Almutaz Adileh; Onur Kocberber; Stavros Volos; Mohammad Alisafaee; Djordje Jevdjic; Cansu Kaynak; Adrian Popescu; Anastasia Ailamaki; Babak Falsafi
Emerging scale-out workloads need extensive amounts of computational resources. However, data centers using modern server hardware face physical constraints in space and power, limiting further expansion and requiring improvements in the computational density per server and in the per-operation energy. Continuing to improve the computational resources of the cloud while staying within physical constraints mandates optimizing server efficiency. In this work, we demonstrate that modern server processors are highly inefficient for running cloud workloads. To address this problem, we investigate the microarchitectural behavior of scale-out workloads and present opportunities to enable specialized processor designs that closely match the needs of the cloud.
IEEE Micro | 2017
Stavros Volos; Djordje Jevdjic; Babak Falsafi; Boris Grot
Emerging scale-out servers are characterized by massive memory footprints and bandwidth requirements. On-chip stacked DRAM caches have been proposed to provide the required bandwidth for manycore servers through caching of secondary data working sets. However, the disparity between provided capacity and working set sizes precludes their effective deployment in servers, calling for high-capacity cache architectures. High-capacity caches--enabled by the emergence of high-bandwidth memory technologies--exhibit high spatiotemporal locality due to coarse-grained access patterns and long cache residency periods stemming from skewed dataset access distributions. The observed spatiotemporal behavior favors a page-based organization that naturally exploits spatial locality while minimizing tag storage requirements and enabling a practical in-SRAM tag array architecture. By storing tags in SRAM, caches avoid the complexity of in-DRAM metadata found in state-of-the-art DRAM caches.
Architectural Support for Programming Languages and Operating Systems | 2017
Djordje Jevdjic; Karin Strauss; Luis Ceze; Henrique S. Malvar
The popularization of video capture devices has created strong storage demand for encoded videos. Approximate storage can ease this demand by enabling denser storage at the expense of occasional errors. Unfortunately, even minor storage errors, such as bit flips, can result in major visual damage in encoded videos. Similarly, video encryption, widely employed for privacy and digital rights management, may create long dependencies between bits that show little or no tolerance to storage errors. In this paper we propose VideoApp, a novel and efficient methodology to compute bit-level reliability requirements for encoded videos by tracking visual and metadata dependencies within encoded bitstreams. We further show how VideoApp can be used to trade video quality for storage density in an optimal way. We integrate our methodology into a popular H.264 encoder to partition an encoded video stream into multiple streams that can receive different levels of error correction according to their reliability needs. When applied to a dense and highly error-prone multi-level cell storage substrate, our variable error correction mechanism reduces the error correction overhead by half under the most error-intolerant encoder settings, achieving quality/density points that neither compression nor approximation can achieve alone. Finally, we define the basic invariants needed to support encrypted approximate video storage. We present an analysis of block cipher modes of operation, showing that some are fully compatible with approximation, enabling approximate and secure video storage systems.
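The variable-strength error-correction idea can be sketched as follows; the bitstream classes and ECC overheads below are invented for illustration, whereas VideoApp derives the actual bit-level requirements by tracking dependencies in the encoded stream:

```python
# Illustrative sketch: assign stronger (more expensive) ECC to bitstream
# portions whose corruption propagates widely, and weaker ECC to portions
# whose errors stay local. Classes and overhead factors are made up here.

from dataclasses import dataclass

@dataclass
class StreamChunk:
    kind: str        # e.g. "header", "motion", "residual"
    bits: int

ECC_OVERHEAD = {
    "header":   0.30,   # sequence/picture headers: any flip can break decoding
    "motion":   0.15,   # motion vectors: errors propagate across frames
    "residual": 0.05,   # residual coefficients: errors stay local and small
}

def ecc_bits(chunks):
    return sum(int(c.bits * ECC_OVERHEAD[c.kind]) for c in chunks)

video = [StreamChunk("header", 10_000),
         StreamChunk("motion", 400_000),
         StreamChunk("residual", 3_600_000)]

uniform = int(sum(c.bits for c in video) * ECC_OVERHEAD["header"])
variable = ecc_bits(video)
print(f"uniform worst-case ECC: {uniform} bits")
print(f"reliability-aware ECC:  {variable} bits "
      f"({100 * (1 - variable / uniform):.0f}% less)")
```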