Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Kevin Skadron is active.

Publication


Featured research published by Kevin Skadron.


IEEE International Symposium on Workload Characterization | 2009

Rodinia: A benchmark suite for heterogeneous computing

Shuai Che; Michael Boyer; Jiayuan Meng; David Tarjan; Jeremy W. Sheaffer; Sang-Ha Lee; Kevin Skadron

This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (graphics processing units), Rodinia includes applications and kernels that target multi-core CPU and GPU platforms. The choice of applications is inspired by Berkeley's dwarf taxonomy. Our characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques, and power consumption, and have led to some important architectural insights, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.


International Conference on Computer Graphics and Interactive Techniques | 2008

Scalable Parallel Programming with CUDA

John R. Nickolls; Ian Buck; Michael Garland; Kevin Skadron

Presents a collection of slides covering the following topics: CUDA parallel programming model; CUDA toolkit and libraries; performance optimization; and application development.
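The core of the CUDA model covered in these slides is that a kernel runs once per thread, and each thread derives its global index from its block and thread coordinates. Below is a minimal Python emulation of that idiom (real CUDA kernels are written in C/C++; the SAXPY example and launch helper here are illustrative only):

```python
def saxpy_kernel(block_idx, block_dim, thread_idx, a, x, y, out):
    """One 'thread' of out = a*x + y, using CUDA-style global indexing."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(x):                          # guard against array overrun
        out[i] = a * x[i] + y[i]

def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a <<<grid_dim, block_dim>>> launch by visiting every thread."""
    for b in range(grid_dim):
        for t in range(block_dim):
            kernel(b, block_dim, t, *args)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0]
out = [0.0] * len(x)
# Two blocks of four threads cover five elements; the guard skips extras.
launch(saxpy_kernel, 2, 4, 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0, 48.0, 60.0]
```

The guard on the global index is the standard pattern for handling array sizes that are not a multiple of the block size.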


IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2006

HotSpot: a compact thermal modeling methodology for early-stage VLSI design

Wei Huang; Shougata Ghosh; Sivakumar Velusamy; Karthik Sankaranarayanan; Kevin Skadron; Mircea R. Stan

This paper presents HotSpot, a modeling methodology for developing compact thermal models based on the popular stacked-layer packaging scheme in modern very large-scale integration systems. In addition to modeling silicon and packaging layers, HotSpot includes a high-level on-chip interconnect self-heating power and thermal model, so that the thermal impact on interconnects can also be considered during early design stages. The HotSpot compact thermal modeling approach is especially well suited for pre-register-transfer-level (RTL) and presynthesis thermal analysis: it provides detailed static and transient temperature information across the die and the package while remaining computationally efficient.
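A compact thermal model of this kind reduces each die or package region to a node with a thermal resistance and capacitance. As a drastically simplified sketch, the transient behavior of a single node obeys dT/dt = (P - (T - T_amb)/R_th) / C_th; HotSpot itself solves a full network of such nodes across die and package layers, and the parameter values below are invented for illustration:

```python
def transient_temps(power, r_th, c_th, t_amb, dt, steps):
    """Explicit-Euler integration of dT/dt = (P - (T - T_amb)/R_th) / C_th
    for one lumped thermal node (power in W, R in K/W, C in J/K, temps in C)."""
    temps = []
    temp = t_amb
    for _ in range(steps):
        d_temp = (power - (temp - t_amb) / r_th) / c_th
        temp += dt * d_temp
        temps.append(temp)
    return temps

# Steady state approaches T_amb + P * R_th = 45 + 20 * 2 = 85 C.
trace = transient_temps(power=20.0, r_th=2.0, c_th=0.5, t_amb=45.0,
                        dt=0.01, steps=5000)
print(round(trace[-1], 2))  # close to 85.0
```

The exponential rise toward the steady-state temperature, with time constant R_th * C_th, is the "transient temperature information" the abstract refers to, here for a single node only.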


Journal of Parallel and Distributed Computing | 2008

A performance study of general-purpose applications on graphics processors using CUDA

Shuai Che; Michael Boyer; Jiayuan Meng; David Tarjan; Jeremy W. Sheaffer; Kevin Skadron

Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of general-purpose applications compared to contemporary general-purpose processors (CPUs). This paper uses NVIDIA's C-like CUDA language and an engineering sample of their recently introduced GTX 260 GPU to explore the effectiveness of GPUs for a variety of application types, and describes some specific coding idioms that improve their performance on the GPU. GPU performance is compared to both single-core and multicore CPU performance, with multicore CPU implementations written using OpenMP. The paper also discusses advantages and inefficiencies of the CUDA programming model and some desirable features that might allow for greater ease of use and also more readily support a larger body of applications.
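One widely used GPU coding idiom of the kind the abstract alludes to (the paper itself does not enumerate its idioms here) is the tree-shaped parallel reduction: pairs of elements are combined over log2(n) rounds instead of a sequential scan, so each round's combinations can run in parallel. Sketched in plain Python, assuming a power-of-two input for simplicity:

```python
def tree_reduce(values):
    """Sum values by halving: round k adds element i+stride into element i.
    On a GPU, every iteration of the inner loop would be one thread."""
    vals = list(values)
    stride = len(vals) // 2
    while stride > 0:
        for i in range(stride):
            vals[i] += vals[i + stride]
        stride //= 2
    return vals[0]

print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```

Eight elements take three rounds (strides 4, 2, 1) rather than seven sequential additions.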


Design Automation Conference | 2004

Compact thermal modeling for temperature-aware design

Wei Huang; Mircea R. Stan; Kevin Skadron; Karthik Sankaranarayanan; Shougata Ghosh; Sivakumar Velusamy

Thermal design in sub-100nm technologies is one of the major challenges to the CAD community. In this paper, we first introduce the idea of temperature-aware design. We then propose a compact thermal model which can be integrated with modern CAD tools to achieve a temperature-aware design methodology. Finally, we use the compact thermal model in a case study of microprocessor design to show the importance of using temperature as a guideline for the design. Results from our thermal model show that a temperature-aware design approach can provide more accurate estimations, and therefore better decisions and faster design convergence.


International Symposium on Microarchitecture | 2011

Bubble-Up: increasing utilization in modern warehouse scale computers via sensible co-locations

Jason Mars; Lingjia Tang; Robert Hundt; Kevin Skadron; Mary Lou Soffa

As much of the world's computing continues to move into the cloud, the overprovisioning of computing resources to ensure the performance isolation of latency-sensitive tasks, such as web search, in modern datacenters is a major contributor to low machine utilization. Being unable to accurately predict performance degradation due to contention for shared resources on multicore systems has led to the heavy-handed approach of simply disallowing the co-location of high-priority, latency-sensitive tasks with other tasks. Performing this precise prediction has been a challenging and unsolved problem. In this paper, we present Bubble-Up, a characterization methodology that enables the accurate prediction of the performance degradation that results from contention for shared resources in the memory subsystem. By using a bubble to apply a tunable amount of "pressure" to the memory subsystem on processors in production datacenters, our methodology can predict the performance interference between co-located applications with an accuracy within 1% to 2% of the actual performance degradation. Using this methodology to arrive at "sensible" co-locations in Google's production datacenters with real-world large-scale applications, we can improve the utilization of a 500-machine cluster by 50% to 90% while guaranteeing a high quality of service for latency-sensitive applications.
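The Bubble-Up idea in miniature: profile the latency-sensitive task's "sensitivity curve" (QoS degradation versus memory-subsystem pressure) once, measure each candidate co-runner's pressure score once, then read the predicted degradation off the curve. The curve shape and all numbers below are invented for illustration, not taken from the paper:

```python
# Sensitivity curve: (pressure, percent degradation) pairs, profiled once
# for the latency-sensitive application using the tunable "bubble".
curve = [(0, 0.0), (5, 1.0), (10, 3.0), (20, 8.0), (40, 15.0)]

def predict_degradation(curve, pressure):
    """Linearly interpolate the profiled sensitivity curve at a co-runner's
    measured pressure score; clamp beyond the last measured point."""
    for (p0, d0), (p1, d1) in zip(curve, curve[1:]):
        if p0 <= pressure <= p1:
            frac = (pressure - p0) / (p1 - p0)
            return d0 + frac * (d1 - d0)
    return curve[-1][1]

co_runner_pressure = 15  # measured once per candidate batch application
print(predict_degradation(curve, co_runner_pressure))  # 5.5
```

A cluster scheduler can then permit any co-location whose predicted degradation stays under the task's QoS budget, instead of forbidding co-location outright.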


IEEE Transactions on Computers | 2007

Dynamic Voltage Scaling in Multitier Web Servers with End-to-End Delay Control

Tibor Horvath; Tarek F. Abdelzaher; Kevin Skadron; Xue Liu

The energy and cooling costs of Web server farms are among their main financial expenditures. This paper explores the benefits of dynamic voltage scaling (DVS) for power management in server farms. Unlike previous work, which addressed DVS on individual servers and on load-balanced server replicas, this paper addresses DVS in multistage service pipelines. Contemporary Web server installations typically adopt a three-tier architecture in which the first tier presents a Web interface, the second executes scripts that implement business logic, and the third serves database accesses. From a user's perspective, only the end-to-end response across the entire pipeline is relevant. This paper presents a rigorous optimization methodology and an algorithm for minimizing the total energy expenditure of the multistage pipeline subject to soft end-to-end response-time constraints. A distributed power management service is designed and evaluated on a real three-tier server prototype for coordinating DVS settings in a way that minimizes global energy consumption while meeting end-to-end delay constraints. The service is shown to consume as much as 30 percent less energy compared to the default (Linux) energy-saving policy.
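The optimization at the heart of the paper, shrunk to a toy: pick one operating point per tier to minimize total power while the summed per-tier delays stay under the end-to-end deadline. The paper derives an analytic solution; the exhaustive search and all operating-point numbers below are invented just to show the problem shape:

```python
from itertools import product

# Per tier: available (frequency, power_watts, delay_ms) operating points.
tiers = [
    [(1.0, 30, 40), (2.0, 55, 20)],   # web tier
    [(1.0, 35, 60), (2.0, 70, 30)],   # application tier
    [(1.0, 40, 80), (2.0, 90, 40)],   # database tier
]
deadline_ms = 140

best = None
for combo in product(*tiers):
    delay = sum(point[2] for point in combo)
    power = sum(point[1] for point in combo)
    if delay <= deadline_ms and (best is None or power < best[0]):
        best = (power, [point[0] for point in combo])

print(best)  # (155, [1.0, 1.0, 2.0]): speed up only the slowest tier
```

Note that the cheapest feasible assignment is not uniform: spending power on the bottleneck (database) tier lets the other tiers run slow, which is exactly why per-tier coordination beats scaling every server identically.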


International Symposium on Computer Architecture | 2010

Dynamic warp subdivision for integrated branch and memory divergence tolerance

Jiayuan Meng; David Tarjan; Kevin Skadron

SIMD organizations amortize the area and power of fetch, decode, and issue logic across multiple processing units in order to maximize throughput for a given area and power budget. However, throughput is reduced when a set of threads operating in lockstep (a warp) are stalled due to long latency memory accesses. The resulting idle cycles are extremely costly. Multi-threading can hide latencies by interleaving the execution of multiple warps, but deep multi-threading using many warps dramatically increases the cost of the register files (multi-threading depth x SIMD width), and cache contention can make performance worse. Instead, intra-warp latency hiding should first be exploited. This allows threads that are ready but stalled by SIMD restrictions to use these idle cycles and reduces the need for multi-threading among warps. This paper introduces dynamic warp subdivision (DWS), which allows a single warp to occupy more than one slot in the scheduler without requiring extra register file space. Independent scheduling entities allow divergent branch paths to interleave their execution, and allow threads that hit to run ahead. The result is improved latency hiding and memory level parallelism (MLP). We evaluate the technique on a coherent cache hierarchy with private L1 caches and a shared L2 cache. With an area overhead of less than 1%, experiments with eight data-parallel benchmarks show our technique improves performance on average by 1.7X.
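A toy cycle-count model of the intuition behind dynamic warp subdivision: when part of a warp misses in the cache, the lockstep baseline stalls the whole warp for the full miss latency, while DWS lets the hitting sub-warp keep issuing, overlapping its remaining work with the miss. This model and its numbers are invented for illustration and ignore issue-slot contention and everything else a real simulator captures:

```python
def baseline_cycles(instrs, miss_at, miss_latency):
    """Lockstep warp: the whole warp stalls at the miss, so miss_at is
    irrelevant and the full latency is exposed."""
    return instrs + miss_latency

def dws_cycles(instrs, miss_at, miss_latency):
    """With subdivision, the hitting sub-warp runs ahead during the miss;
    latency is hidden up to the amount of work left after the miss point."""
    ahead = instrs - miss_at           # work available to overlap
    hidden = min(ahead, miss_latency)  # latency actually hidden
    return instrs + miss_latency - hidden

print(baseline_cycles(100, 50, 40))  # 140
print(dws_cycles(100, 50, 40))       # 100: the 40-cycle miss is fully hidden
```

When the miss strikes near the end of the instruction stream there is less work to run ahead on, so only part of the latency is hidden, which matches the paper's framing of DWS as improving memory-level parallelism rather than eliminating stalls.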


Symposium on Application Specific Processors | 2008

Accelerating Compute-Intensive Applications with GPUs and FPGAs

Shuai Che; Jie Li; Jeremy W. Sheaffer; Kevin Skadron; John Lach

Accelerators are special-purpose processors designed to speed up compute-intensive sections of applications. Two extreme endpoints in the spectrum of possible accelerators are FPGAs and GPUs, which can often achieve better performance than CPUs on certain workloads. FPGAs are highly customizable, while GPUs provide massive parallel execution resources and high memory bandwidth. Applications typically exhibit vastly different performance characteristics depending on the accelerator. This is an inherent problem attributable to architectural design, middleware support, and programming style of the target platform. For the best application-to-accelerator mapping, factors such as programmability, performance, programming cost, and sources of overhead in the design flows must all be taken into consideration. In general, FPGAs provide the best expectation of performance, flexibility, and low overhead, while GPUs tend to be easier to program and require fewer hardware resources. We present a performance study of three diverse applications - Gaussian elimination, data encryption standard (DES), and Needleman-Wunsch - on an FPGA, a GPU, and a multicore CPU system. We perform a comparative study of application behavior on accelerators considering performance and code complexity. Based on our results, we present a mapping from application characteristics to accelerator platforms, which can aid developers in selecting an appropriate target architecture for their chosen application.
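Needleman-Wunsch, one of the three benchmarked applications, is a classic dynamic-programming global alignment: each cell of a score matrix depends on its north, west, and northwest neighbors, so cells along an anti-diagonal are independent, which is what makes the kernel interesting to parallelize. A minimal scoring-only sketch with unit match/mismatch/gap costs (the scoring parameters are a common textbook choice, not necessarily the paper's):

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global-alignment score of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):           # aligning a prefix against nothing
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag,                 # match or mismatch
                              score[i-1][j] + gap,  # gap in b
                              score[i][j-1] + gap)  # gap in a
    return score[-1][-1]

print(nw_score("GATTACA", "GCATGCU"))  # 0
```

On an accelerator, the inner max over three neighbors maps to one thread or one pipeline stage per cell, swept anti-diagonal by anti-diagonal.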


Real-Time Systems Symposium | 2003

Power-aware QoS management in Web servers

Vivek Sharma; Arun Thomas; Tarek F. Abdelzaher; Kevin Skadron; Zhijian Lu

Power management in data centers has become an increasingly important concern. Large server installations are designed to handle peak load, which may be significantly larger than in off-peak conditions. The increasing cost of energy consumption and cooling incurred in farms of high-performance Web servers makes low-power operation during off-peak hours desirable. This paper investigates adaptive algorithms for dynamic voltage scaling in QoS-enabled Web servers to minimize energy consumption subject to service delay constraints. We implement these algorithms inside the Linux kernel. The instrumented kernel supports multiple client classes with per-class deadlines. Energy consumption is minimized by using a feedback loop that regulates frequency and voltage levels to keep the synthetic utilization around the aperiodic schedulability bound derived in an earlier publication. Enforcing the bound ensures that deadlines are met. Our evaluation of an Apache server running on the modified Linux kernel shows that non-trivial off-peak energy savings are possible without sacrificing timeliness.
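The feedback loop described above can be sketched as a simple discrete controller: each period, measure synthetic utilization and nudge the frequency so utilization settles at the schedulability bound, keeping deadlines met while frequency (and hence energy) stays as low as possible. The controller gain, bound value, and workload below are invented for illustration and do not come from the paper:

```python
def control_loop(offered_load, bound=0.58, gain=0.5, steps=50):
    """Iterate freq += gain * (utilization - bound), with utilization
    modeled as offered_load / freq (work shrinks as frequency rises)."""
    freq = 1.0  # normalized frequency, clamped to [0.2, 1.0]
    for _ in range(steps):
        utilization = offered_load / freq
        error = utilization - bound
        freq = min(1.0, max(0.2, freq + gain * error))
    return freq, offered_load / freq

freq, util = control_loop(offered_load=0.3)
print(round(freq, 3), round(util, 3))  # settles near 0.517 0.58
```

The loop converges to the lowest frequency at which utilization sits on the bound; a load spike raises the measured utilization and drives the frequency back up before deadlines are endangered.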

Collaboration


Dive into Kevin Skadron's collaborations.

Top Co-Authors

Ke Wang, University of Virginia
John Lach, University of Virginia
Jiayuan Meng, Argonne National Laboratory
Zhijian Lu, University of Virginia