
Publication


Featured research published by Per Hammarlund.


International Symposium on Computer Architecture | 2010

Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

Victor W. Lee; Changkyu Kim; Jatin Chhugani; Michael E. Deisher; Daehyun Kim; Anthony D. Nguyen; Nadathur Satish; Mikhail Smelyanskiy; Srinivas Chennupaty; Per Hammarlund; Ronak Singhal; Pradeep Dubey

Recent advances in computing have led to an explosion in the amount of data being generated. Processing this ever-growing data in a timely manner has made throughput computing an important aspect of emerging applications. Our analysis of a set of important throughput computing kernels shows that there is ample parallelism in these kernels, which makes them suitable for today's multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such a large performance difference comes from, we perform a rigorous performance analysis and find that, after applying optimizations appropriate for both CPUs and GPUs, the performance gap between an Nvidia GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5X on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze which architectural features contributed to the performance differences between the two architectures, and recommend a set of architectural features that provide significant improvement in architectural efficiency for throughput kernels.
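The paper's headline result can be seen in miniature with a roofline-style estimate, where a kernel's attainable throughput is the lesser of peak compute and memory bandwidth times arithmetic intensity. The peak figures below are approximate published specifications for the two chips, assumed here for illustration rather than taken from the paper's measurements:

```python
# Roofline-style estimate: a kernel's attainable throughput (GFLOP/s) is
# min(peak compute, memory bandwidth * arithmetic intensity).

def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Attainable single-precision throughput for a kernel whose arithmetic
    intensity is flops_per_byte FLOPs per byte of DRAM traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# Approximate published peaks (assumptions, not the paper's measured data):
I7_960 = {"peak": 102.0, "bw": 32.0}    # ~102 SP GFLOP/s, ~32 GB/s
GTX280 = {"peak": 933.0, "bw": 141.0}   # ~933 SP GFLOP/s, ~141 GB/s

for name, intensity in [("bandwidth-bound (SAXPY-like)", 1 / 6),
                        ("compute-bound", 64.0)]:
    cpu = attainable_gflops(I7_960["peak"], I7_960["bw"], intensity)
    gpu = attainable_gflops(GTX280["peak"], GTX280["bw"], intensity)
    print(f"{name}: {gpu / cpu:.1f}x")
```

Even this simple model shows why headline speedup claims depend on which resource bounds the kernel; on real tuned kernels the paper finds the average gap is only about 2.5X, well below the naive compute-peak ratio.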


IEEE Micro | 2014

Haswell: The Fourth-Generation Intel Core Processor

Per Hammarlund; Alberto J. Martinez; Atiq Bajwa; David L. Hill; Erik G. Hallnor; Hong Jiang; Martin G. Dixon; Michael N. Derr; Mikal C. Hunsaker; Rajesh Kumar; Randy B. Osborne; Ravi Rajwar; Ronak Singhal; Reynold V. D'Sa; Robert S. Chappell; Shiv Kaushik; Srinivas Chennupaty; Stephan J. Jourdan; Steve H. Gunther; Thomas A. Piazza; Ted Burton

Haswell, Intel's fourth-generation Core processor architecture, delivers a range of client parts, a converged core for the client and server, and technologies used across many products. It uses an optimized version of Intel's 22-nm process technology. Haswell provides enhancements in power-performance efficiency, power management, form factor and cost, core and uncore microarchitecture, and the core's instruction set.


Field-Programmable Gate Arrays | 2009

Intel Nehalem processor core made FPGA synthesizable

Graham Schelle; Jamison D. Collins; Ethan Schuchman; Perry H. Wang; Xiang Zou; Gautham N. Chinya; Ralf Plate; Thorsten Mattner; Franz Olbrich; Per Hammarlund; Ronak Singhal; Jim Brayton; Sebastian Steibl; Hong Wang

We present an FPGA-synthesizable version of the Intel Atom processor core, synthesized to a Virtex-5 based FPGA emulation system. To make the production Atom design in SystemVerilog synthesizable through an industry-standard EDA tool flow, we transformed and mapped latches in the design, converted clock gating, and replaced nonsynthesizable constructs with FPGA-synthesizable counterparts. Additionally, as the target FPGA emulator is hosted on a PC platform with a Pentium-based CPU socket that supports a significantly different front side bus (FSB) protocol from that of the Atom processor, we replaced the existing bus control logic in the Atom core with an alternate FSB protocol to communicate with the rest of the PC platform. With these efforts, we succeeded in synthesizing the entire Atom processor core to fit within a single Virtex-5 LX330 FPGA. The synthesizable Atom core runs at 50 MHz on the Pentium PC motherboard with fully functional I/O peripherals. It is capable of booting off-the-shelf MS-DOS, Windows XP, and Linux operating systems, and of executing standard x86 workloads.


High-Performance Computer Architecture | 2004

Hardware Support for Prescient Instruction Prefetch

Tor M. Aamodt; Paul Chow; Per Hammarlund; Hong Wang; John Paul Shen

This paper proposes and evaluates hardware mechanisms for supporting prescient instruction prefetch, an approach to improving single-threaded application performance by using helper threads to perform instruction prefetch. We demonstrate the need for enabling store-to-load communication and selective instruction execution when directly pre-executing future regions of an application that suffer I-cache misses. Two novel hardware mechanisms, safe-store and YAT-bits, are introduced that help satisfy these requirements. This paper also proposes and evaluates finite state machine recall, a technique for limiting pre-execution to branches that are hard to predict by leveraging a counted I-prefetch mechanism. On a research Itanium® SMT processor with next-line and streaming I-prefetch mechanisms, and with latencies representative of next-generation processors, prescient instruction prefetch can improve performance by an average of 10.0% to 22% on a set of SPEC 2000 benchmarks that suffer significant I-cache misses. Prescient instruction prefetch is found to be competitive against even the most aggressive research hardware instruction prefetch technique: fetch-directed instruction prefetch.
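The safe-store idea can be pictured in software terms: a helper thread's stores land in a private buffer rather than in architectural memory, so the helper's own loads still observe them (store-to-load communication) while the main thread's state is never disturbed. This is a loose illustrative analogy, not the paper's hardware design; the class and names below are invented:

```python
# Software picture of "safe-store": speculative stores from a helper thread
# go to a private buffer instead of architectural memory, so helper loads
# can forward from them without corrupting the main thread's state.

class SafeStoreContext:
    def __init__(self, memory):
        self.memory = memory       # shared architectural state (main thread's view)
        self.buffer = {}           # helper-private speculative stores

    def store(self, addr, value):
        # Never written back: the helper's side effects are discarded.
        self.buffer[addr] = value

    def load(self, addr):
        # Forward from the speculative buffer first, else read real memory.
        return self.buffer.get(addr, self.memory[addr])

mem = {0x10: 5}
helper = SafeStoreContext(mem)
helper.store(0x10, 99)             # speculative store by the helper thread
print(helper.load(0x10))           # helper sees its own store: 99
print(mem[0x10])                   # architectural memory is unchanged: 5
```

In the paper the analogous buffering happens in hardware, alongside YAT-bits that gate which pre-executed instructions may run at all.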


Measurement and Modeling of Computer Systems | 2003

A framework for modeling and optimization of prescient instruction prefetch

Tor M. Aamodt; Pedro Marcuello; Paul Chow; Antonio González; Per Hammarlund; Hong Wang; John Paul Shen

This paper describes a framework for modeling macroscopic program behavior and applies it to optimizing prescient instruction prefetch, a novel technique that uses helper threads to improve single-threaded application performance by performing judicious and timely instruction prefetch. A helper thread is initiated when the main thread encounters a spawn point, and prefetches instructions starting at a distant target point. The target identifies a code region that tends to incur I-cache misses and that the main thread is likely to execute soon, even though intervening control flow may be unpredictable. The optimization of spawn-target pair selections is formulated by modeling program behavior as a Markov chain based on profile statistics. Execution paths are considered stochastic outcomes, and aspects of program behavior are summarized via path expression mappings. Mappings for computing reaching and posteriori probabilities; path length mean and variance; and expected path footprint are presented. These are used with Tarjan's fast path algorithm to efficiently estimate the benefit of spawn-target pair selections. Using this framework, we propose a spawn-target pair selection algorithm for prescient instruction prefetch. This algorithm has been implemented and evaluated for the Itanium Processor Family architecture. A limit study finds 4.8% to 17% speedups on an in-order simultaneous multithreading processor with eight contexts, over next-line and streaming I-prefetch, for a set of benchmarks with high I-cache miss rates. The framework in this paper is potentially applicable to other thread speculation techniques.
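One ingredient of the framework, the probability that execution starting at a spawn point reaches the target, can be sketched by treating profile-derived branch frequencies as Markov transition probabilities. The control-flow graph and the numbers below are invented, and a simple fixed-point sweep stands in for the paper's Tarjan path-expression formulation:

```python
# Reaching probability: given profile-derived branch probabilities on a
# control-flow graph, estimate the probability that execution starting at
# the spawn point reaches the target, modeling the program as a Markov chain.

def reaching_probability(edges, spawn, target, exit_node):
    """edges maps node -> [(successor, probability), ...].
    Solves p(n) = sum of P(n -> s) * p(s) over successors s, with the
    boundary conditions p(target) = 1 and p(exit) = 0, by fixed-point
    iteration (the paper instead uses Tarjan's fast path algorithm)."""
    p = {n: 0.0 for n in edges}
    p[target], p[exit_node] = 1.0, 0.0
    for _ in range(1000):  # sweep to a fixed point
        for n, succs in edges.items():
            if n not in (target, exit_node):
                p[n] = sum(prob * p[s] for s, prob in succs)
    return p[spawn]

# Hypothetical CFG: spawn branches to A (0.9) or exits (0.1); A loops on
# itself with probability 0.5 or falls through to the target with 0.5.
cfg = {
    "spawn": [("A", 0.9), ("exit", 0.1)],
    "A": [("A", 0.5), ("target", 0.5)],
    "target": [],
    "exit": [],
}
print(reaching_probability(cfg, "spawn", "target", "exit"))  # 0.9
```

With real profile statistics, the same machinery extends to the posteriori probabilities, path-length moments, and expected path footprints that the paper's spawn-target selection algorithm consumes.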


Archive | 2003

Speculative multi-threading for instruction prefetch and/or trace pre-build

Hong Wang; Tor M. Aamodt; Pedro Marcuello; Jared Stark; John Paul Shen; Antonio González; Per Hammarlund; Gerolf F. Hoflehner; Perry H. Wang; Steve Shih-wei Liao


Archive | 2005

Programmable event driven yield mechanism which may activate other threads

Hong Wang; Per Hammarlund; Xiang Zou; John Paul Shen; Xinmin Tian; Milind Girkar; Perry H. Wang; Piyush N. Desai


Archive | 2005

Primitives to enhance thread-level speculation

Quinn A. Jacobson; Hong Wang; John Paul Shen; Gautham N. Chinya; Per Hammarlund; Xiang Zou; Bryant Bigbee; Shivnandan D. Kaushik


Archive | 2002

Method and apparatus for variable length coding

Yen-Kuang Chen; Matthew J. Holliman; Herbert H. J. Hum; Per Hammarlund; Thomas Huff; William W. Macy


Archive | 2005

Mechanism to exploit synchronization overhead to improve multithreaded performance

Natalie D. Enright; Jamison D. Collins; Perry H. Wang; Hong Wang; Xinmin Tian; John Paul Shen; Gad Sheaffer; Per Hammarlund
