Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Svilen Kanev is active.

Publication


Featured research published by Svilen Kanev.


International Symposium on Computer Architecture | 2015

Profiling a warehouse-scale computer

Svilen Kanev; Juan Pablo Darago; Kim M. Hazelwood; Parthasarathy Ranganathan; Tipp Moseley; Gu-Yeon Wei; David M. Brooks

With the increasing prevalence of warehouse-scale (WSC) and cloud computing, understanding the interactions of server applications with the underlying microarchitecture becomes ever more important in order to extract maximum performance out of server hardware. To aid such understanding, this paper presents a detailed microarchitectural analysis of live datacenter jobs, measured on more than 20,000 Google machines over a three-year period, and comprising thousands of different applications. We first find that WSC workloads are extremely diverse, breeding the need for architectures that can tolerate application variability without performance loss. However, some patterns emerge, offering opportunities for co-optimization of hardware and software. For example, we identify common building blocks in the lower levels of the software stack. This “datacenter tax” can comprise nearly 30% of cycles across jobs running in the fleet, which makes its constituents prime candidates for hardware specialization in future server systems-on-chips. We also uncover opportunities for classic microarchitectural optimizations for server processors, especially in the cache hierarchy. Typical workloads place significant stress on instruction caches and prefer memory latency over bandwidth. They also stall cores often, but compute heavily in bursts. These observations motivate several interesting directions for future warehouse-scale computers.
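The "datacenter tax" figure comes from attributing profiled cycles to a handful of shared low-level building blocks. A minimal sketch of that bookkeeping, using made-up categories and cycle counts rather than the paper's fleet data:

```cpp
// Hypothetical sketch: aggregate profiled cycle samples into "datacenter tax"
// categories (common low-level building blocks such as allocation, RPC
// serialization, compression) and report their share of total cycles.
// Category names and counts are illustrative, not the paper's dataset.
#include <cstdint>
#include <iostream>
#include <set>
#include <string>
#include <vector>

struct CycleSample {
    std::string category;  // e.g. "allocation", "rpc_serialization", "app_logic"
    uint64_t cycles;
};

int main() {
    // Toy samples standing in for a fleet-wide profile aggregation.
    std::vector<CycleSample> samples = {
        {"allocation", 80}, {"rpc_serialization", 70}, {"compression", 50},
        {"memmove", 60},    {"hashing", 40},           {"app_logic", 700},
    };

    const std::set<std::string> tax_categories = {
        "allocation", "rpc_serialization", "compression", "memmove", "hashing"};

    uint64_t total = 0, tax = 0;
    for (const auto& s : samples) {
        total += s.cycles;
        if (tax_categories.count(s.category)) tax += s.cycles;
    }
    std::cout << "Datacenter tax: " << 100.0 * tax / total << "% of cycles\n";
    return 0;
}
```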


IEEE International Symposium on Workload Characterization | 2014

Tradeoffs between power management and tail latency in warehouse-scale applications

Svilen Kanev; Kim M. Hazelwood; Gu-Yeon Wei; David M. Brooks

The growth in datacenter computing has increased the importance of energy efficiency in servers. Techniques to reduce power have brought server designs close to achieving energy-proportional computing. However, they stress the inherent tradeoff between aggressive power management and quality of service (QoS) - the dominant metric of performance in datacenters. In this paper, we characterize this tradeoff for 15 benchmarks representing workloads from Google's datacenters. We show that 9 of these benchmarks often toggle their cores between short bursts of activity and sleep. In doing so, they stress sleep selection algorithms and can cause tail latency degradation or missed potential for power savings of up to 10% on important workloads like web search. However, improving sleep selection alone is not sufficient for large efficiency gains on current server hardware. To guide the direction needed for such large gains, we profile datacenter applications for susceptibility to dynamic voltage and frequency scaling (DVFS). We find the largest potential in DVFS that is cognizant of latency/power tradeoffs on a per-workload basis.
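The sleep-selection pressure described above comes down to picking an idle state whose transition cost is justified by how long the core will actually stay idle. A hedged sketch of such a heuristic, with illustrative c-state latencies rather than measured values or the paper's algorithm:

```cpp
// Hypothetical sketch of sleep-state (c-state) selection: choose the deepest
// idle state whose entry/exit cost is justified by the predicted idle interval
// and whose wake-up latency fits the tail-latency budget. State parameters are
// illustrative, not measurements from the paper.
#include <iostream>
#include <string>
#include <vector>

struct CState {
    std::string name;
    double exit_latency_us;      // wake-up cost paid by the next request
    double target_residency_us;  // minimum idle time for a net power win
};

const CState& SelectCState(const std::vector<CState>& states,
                           double predicted_idle_us, double latency_budget_us) {
    size_t best = 0;  // states[0] is assumed to be the shallowest state
    for (size_t i = 1; i < states.size(); ++i) {
        if (states[i].target_residency_us <= predicted_idle_us &&
            states[i].exit_latency_us <= latency_budget_us) {
            best = i;
        }
    }
    return states[best];
}

int main() {
    std::vector<CState> states = {{"C1", 2, 2}, {"C3", 30, 100}, {"C6", 100, 400}};
    std::cout << "50us idle, 80us budget -> "
              << SelectCState(states, 50, 80).name << "\n";    // C3
    std::cout << "1000us idle, 80us budget -> "
              << SelectCState(states, 1000, 80).name << "\n";  // still C3: C6 wakes too slowly
    return 0;
}
```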


International Symposium on Low Power Electronics and Design | 2013

Characterizing and evaluating voltage noise in multi-core near-threshold processors

Xuan Zhang; Tao Tong; Svilen Kanev; Sae Kyu Lee; Gu-Yeon Wei; David M. Brooks

Lowering the supply voltage to improve energy efficiency leads to higher load current and elevated supply sensitivity. In this paper, we provide the first quantitative analysis of voltage noise in multi-core near-threshold processors in a future 10nm technology across SPEC CPU2006 benchmarks. Our results reveal larger guardband requirements and significant energy efficiency loss due to power delivery nonidealities at near threshold, and highlight the importance of accurate voltage noise characterization for design exploration of energy-centric computing systems using near-threshold cores.
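An illustrative back-of-the-envelope argument (not the paper's model) for why a fixed voltage-noise guardband hurts more at near-threshold: with dynamic energy roughly proportional to the square of the supply voltage, the overhead of guardbanding grows as the guardband becomes a larger fraction of the nominal supply.

```latex
% Illustrative only: energy overhead of a fixed guardband V_{gb} on top of a
% nominal supply V_{nom}, assuming dynamic energy E \propto C V^2.
\[
\frac{\Delta E}{E} \approx \left(\frac{V_{nom} + V_{gb}}{V_{nom}}\right)^{2} - 1
\]
% With assumed numbers, a 50 mV guardband costs roughly 10% extra energy at
% V_{nom} = 1.0 V, but roughly 23% at a near-threshold V_{nom} = 0.45 V,
% because the same absolute droop is a larger fraction of the supply.
```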


International Symposium on Computer Architecture | 2014

HELIX-RC: an architecture-compiler co-design for automatic parallelization of irregular programs

Simone Campanoni; Kevin Brownell; Svilen Kanev; Timothy M. Jones; Gu-Yeon Wei; David M. Brooks

Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks.
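A minimal sketch of the decoupling idea in software terms: loop iterations are spread round-robin across cores, and only the short loop-carried "sequential segment" is serialized with signal/wait. This illustrates the concept only; it is not the HELIX-RC compiler or its ring-cache hardware.

```cpp
// Hypothetical sketch of HELIX-style loop parallelization: iterations are
// assigned round-robin to cores, the parallel part of each iteration runs
// independently, and the sequential segment carrying the cross-iteration
// dependence is ordered by an atomic "turn" counter (wait/signal).
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kIters = 16;
constexpr int kCores = 4;

std::atomic<int> turn{0};  // which iteration may run its sequential segment
long shared_sum = 0;       // the loop-carried dependence

void Worker(int core) {
    for (int i = core; i < kIters; i += kCores) {  // round-robin iterations
        long local = static_cast<long>(i) * i;      // parallel part of the body

        while (turn.load(std::memory_order_acquire) != i) {}  // wait()
        shared_sum += local;                                    // sequential segment
        turn.store(i + 1, std::memory_order_release);           // signal()
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int c = 0; c < kCores; ++c) threads.emplace_back(Worker, c);
    for (auto& t : threads) t.join();
    std::printf("sum of i*i for i < %d = %ld\n", kIters, shared_sum);
    return 0;
}
```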


IEEE Micro | 2011

Voltage Noise in Production Processors

Vijay Janapa Reddi; Svilen Kanev; Wonyoung Kim; Simone Campanoni; Michael D. Smith; Gu-Yeon Wei; David M. Brooks

Voltage variations are a major challenge in processor design. Here, researchers characterize the voltage noise of programs as they run to completion on a production Core 2 Duo processor. Furthermore, they examine the implications of resilient architecture design for voltage variation in future systems.


International Symposium on Low Power Electronics and Design | 2012

XIOSim: power-performance modeling of mobile x86 cores

Svilen Kanev; Gu-Yeon Wei; David M. Brooks

Simulation is one of the main vehicles of computer architecture research. In this paper, we present XIOSim - a highly detailed microarchitectural simulator targeted at mobile x86 microprocessors. The simulator execution model that we propose is a blend between traditional user-level simulation and full-system simulation. Our current implementation features detailed power and performance core models that allow microarchitectural exploration. Using a novel validation methodology, we show that XIOSim's performance models manage to stay well within 10% of real hardware for the whole SPEC CPU2006 suite. Furthermore, we validate power models against measured data to show a deviation of less than 5% in terms of average power consumption.
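The validation claim boils down to comparing per-benchmark simulator output against hardware measurements. A small sketch of that comparison with made-up numbers, not XIOSim's published data:

```cpp
// Hypothetical sketch of per-benchmark validation: compute the percentage
// error between simulated and measured CPI and check the worst case against a
// target bound. Benchmark values below are invented for illustration.
#include <cmath>
#include <cstdio>
#include <vector>

struct Benchmark {
    const char* name;
    double measured_cpi;   // from hardware performance counters
    double simulated_cpi;  // from the simulator
};

int main() {
    std::vector<Benchmark> runs = {
        {"perlbench", 0.95, 1.02}, {"bzip2", 1.20, 1.14}, {"gcc", 1.40, 1.33}};

    double worst = 0.0;
    for (const auto& b : runs) {
        double err = 100.0 * std::fabs(b.simulated_cpi - b.measured_cpi) / b.measured_cpi;
        std::printf("%-10s error = %.1f%%\n", b.name, err);
        if (err > worst) worst = err;
    }
    std::printf("worst-case error = %.1f%% (target: within 10%%)\n", worst);
    return 0;
}
```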


Architectural Support for Programming Languages and Operating Systems | 2017

Mallacc: Accelerating Memory Allocation

Svilen Kanev; Sam Likun Xi; Gu-Yeon Wei; David M. Brooks

Recent work shows that dynamic memory allocation consumes nearly 7% of all cycles in Google datacenters. With the trend towards increased specialization of hardware, we propose Mallacc, an in-core hardware accelerator designed for broad use across a number of high-performance, modern memory allocators. The design of Mallacc is quite different from traditional throughput-oriented hardware accelerators. Because memory allocation requests tend to be very frequent, fast, and interspersed inside other application code, accelerators must be optimized for latency rather than throughput, and area overheads must be kept to a bare minimum. Mallacc accelerates the three primary operations of a typical memory allocation request: size class computation, retrieval of a free memory block, and sampling of memory usage. Our results show that malloc latency can be reduced by up to 50% with a hardware cost of less than 1500 μm² of silicon area, less than 0.006% of a typical high-performance processor core.
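Of the three accelerated operations, size class computation is the easiest to show in isolation: the allocator rounds a request up to one of a few fixed sizes so free lists can be kept per class. A sketch with an assumed 8-byte class spacing, not tcmalloc's or Mallacc's actual size-class table:

```cpp
// Hypothetical sketch of "size class computation": map a requested byte count
// to a small size class index, as modern allocators do before consulting the
// per-class free list. The 8-byte spacing and 128-byte small-size cutoff are
// illustrative parameters.
#include <cstddef>
#include <cstdio>

constexpr size_t kMaxSmallSize = 128;
constexpr size_t kAlignment = 8;

// Returns a class index in 1..16, or 0 if the request is not a "small" size.
size_t SizeClass(size_t bytes) {
    if (bytes == 0 || bytes > kMaxSmallSize) return 0;
    return (bytes + kAlignment - 1) / kAlignment;
}

size_t ClassToBytes(size_t cls) { return cls * kAlignment; }

int main() {
    const size_t requests[] = {1, 8, 9, 24, 100, 128};
    for (size_t request : requests) {
        size_t cls = SizeClass(request);
        std::printf("request %3zu bytes -> class %2zu (%zu bytes)\n",
                    request, cls, ClassToBytes(cls));
    }
    return 0;
}
```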


International Symposium on Performance Analysis of Systems and Software | 2011

Portable trace compression through instruction interpretation

Svilen Kanev; Robert Cohn

Execution traces are a useful tool in studying processor and program behavior. However, the amount of information that needs to be stored makes them impractical in uncompressed form. This is especially true for full-state traces that can capture up to kilobytes of processor state for every instruction. In this paper we present Zcompr — a compression scheme that allows practical usage of full-state traces that are billions of instructions long. It allows complete state reproducibility, sufficient even for validation purposes, that is fully portable between different operating systems and host platforms. The compression scheme exploits the general similarity between compression and prediction. A simplified functional simulator is used to predict instruction effects in a repeatable manner. Its predictions can be used to reproduce those effects at decompression time, limiting the amount of information that needs to be stored per instruction. Final trace densities achieved by our scheme are on the order of two bits per instruction, with typical decompression speeds of 300 KIPS.
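A toy version of the predict-then-store idea described above: a simple functional model guesses each traced value, and the trace records a full value only when the guess is wrong. The predictor and encoding below are placeholders, not Zcompr's format.

```cpp
// Hypothetical sketch of prediction-based trace compression: a functional
// model predicts each value deterministically; predictable values cost ~1 bit
// in the trace, and only mispredictions store the full value. Decompression
// reruns the same model to reproduce the predicted values.
#include <cstdint>
#include <cstdio>
#include <vector>

struct TraceEntry {
    bool mispredicted;  // 1 bit in a real encoding
    uint64_t value;     // stored only when mispredicted
};

// Toy functional model: assume each value is the previous value plus a stride.
uint64_t Predict(uint64_t prev, uint64_t stride) { return prev + stride; }

int main() {
    std::vector<uint64_t> actual = {100, 104, 108, 112, 500, 504, 508};
    std::vector<TraceEntry> trace;

    uint64_t prev = 0, stride = 4;
    for (uint64_t v : actual) {
        if (Predict(prev, stride) == v) {
            trace.push_back({false, 0});  // prediction hit: ~1 bit on disk
        } else {
            trace.push_back({true, v});   // miss: 1 bit plus the full value
        }
        prev = v;
    }

    size_t misses = 0;
    for (const auto& e : trace) misses += e.mispredicted;
    std::printf("%zu of %zu values stored in full; the rest are reproduced by the model\n",
                misses, trace.size());
    return 0;
}
```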


IEEE Computer Architecture Letters | 2017

CARB: A C-State Power Management Arbiter for Latency-Critical Workloads

Xin Zhan; Reza Azimi; Svilen Kanev; David M. Brooks; Sherief Reda

Latency-critical workloads in datacenters have tight response time requirements to meet service-level agreements (SLAs). Sleep states (c-states) enable servers to reduce their power consumption during idle times; however, entering and exiting c-states is not instantaneous, leading to increased transaction latency. In this paper, we propose a c-state arbitration technique, CARB, that minimizes response time while simultaneously realizing the power savings that could be achieved from enabling c-states. CARB adapts to incoming request rates and processing times and activates the smallest number of cores for processing the current load. CARB reshapes the distribution of c-states and minimizes the latency cost of sleep by avoiding entering deep sleeps too often. We quantify the improvements from CARB with memcached running on an 8-core Haswell-based server.
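The "smallest number of cores for the current load" idea can be sketched with a simple sizing rule: offered load in core-equivalents divided by a target utilization, with the remaining cores parked in a deep sleep state. The rule and numbers below are illustrative, not CARB's actual arbitration logic.

```cpp
// Hypothetical sketch of core-pool sizing for a latency-critical service:
// keep just enough cores awake to absorb the current request rate at a target
// utilization, so the rest can stay in deep sleep instead of toggling through
// shallow sleeps. Parameters are invented for illustration.
#include <algorithm>
#include <cmath>
#include <cstdio>

int ActiveCores(double requests_per_sec, double avg_service_time_sec,
                double target_utilization, int total_cores) {
    double offered_load = requests_per_sec * avg_service_time_sec;  // in core-equivalents
    int needed = static_cast<int>(std::ceil(offered_load / target_utilization));
    return std::clamp(needed, 1, total_cores);
}

int main() {
    // e.g. a memcached-like service: 200us per request, keep cores under 70% busy.
    for (double rps : {5'000.0, 15'000.0, 30'000.0}) {
        int cores = ActiveCores(rps, 200e-6, 0.7, 8);
        std::printf("%8.0f req/s -> %d active cores, %d parked\n",
                    rps, cores, 8 - cores);
    }
    return 0;
}
```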


IEEE Micro | 2016

Profiling a Warehouse-Scale Computer

Svilen Kanev; Juan Pablo Darago; Kim M. Hazelwood; Parthasarathy Ranganathan; Tipp Moseley; Gu-Yeon Wei; David M. Brooks

Data centers are quickly becoming the platform of choice for modern applications. In order to understand how data center software utilizes the hardware and to improve future server processor performance, the authors profiled more than 20,000 Google machines over a three-year period, while serving the requests of billions of users.

Collaboration


Dive into Svilen Kanev's collaborations.
