Jens Leenstra | Researchain


Publication

Featured research published by Jens Leenstra.

Very Large Data Bases | 2013

DB2 with BLU acceleration: so much more than just a column store

Vijayshankar Raman; Gopi K. Attaluri; Ronald J. Barber; Naresh K. Chainani; David Kalmuk; Vincent Kulandaisamy; Jens Leenstra; Sam Lightstone; Shaorong Liu; Guy M. Lohman; Tim R Malkemus; Rene Mueller; Ippokratis Pandis; Berni Schiefer; David C. Sharpe; Richard S. Sidle; Adam J. Storm; Liping Zhang

DB2 with BLU Acceleration deeply integrates innovative new techniques for defining and processing column-organized tables that speed read-mostly Business Intelligence queries by 10 to 50 times and improve compression by 3 to 10 times, compared to traditional row-organized tables, without the complexity of defining indexes or materialized views on those tables. But DB2 BLU is much more than just a column store. Exploiting frequency-based dictionary compression and main-memory query processing technology from the Blink project at IBM Research - Almaden, DB2 BLU performs most SQL operations - predicate application (even range predicates and IN-lists), joins, and grouping - on the compressed values, which can be packed bit-aligned so densely that multiple values fit in a register and can be processed simultaneously via SIMD (single-instruction, multiple-data) instructions. Designed and built from the ground up to exploit modern multi-core processors, DB2 BLU's hardware-conscious algorithms are carefully engineered to maximize parallelism by using novel data structures that need little latching, and to minimize data-cache and instruction-cache misses. Though DB2 BLU is optimized for in-memory processing, database size is not limited by the size of main memory. Fine-grained synopses, late materialization, and a new probabilistic buffer pool protocol for scans minimize disk I/Os, while aggressive prefetching reduces I/O stalls. Full integration with DB2 ensures that DB2 with BLU Acceleration benefits from the full functionality and robust utilities of a mature product, while still enjoying order-of-magnitude performance gains from revolutionary technology without even having to change the SQL, and can mix column-organized and row-organized tables in the same tablespace and even within the same query.
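
The idea of applying a predicate directly to bit-packed dictionary codes can be illustrated with a small sketch. This is not DB2 BLU's implementation: the 7-bit code width, the guard-bit layout, the plain sorted dictionary (rather than frequency-based partitioning), and every function name below are assumptions chosen for illustration. A 64-bit Python integer stands in for a register, and one subtraction tests eight codes for equality at once.

```python
# Illustrative sketch only: an equality predicate evaluated on packed
# dictionary codes, several codes per 64-bit word. All names and widths
# here are hypothetical, not taken from DB2 BLU.

CODE_BITS = 7                        # payload bits per code (assumption)
FIELD_BITS = CODE_BITS + 1           # one extra guard bit per field
FIELDS_PER_WORD = 64 // FIELD_BITS   # 8 codes per 64-bit word


def build_dictionary(values):
    """Map each distinct value to a small integer code (sorted order)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def broadcast(field_value):
    """Replicate one field value into every field of a word."""
    return sum(field_value << (s * FIELD_BITS) for s in range(FIELDS_PER_WORD))


GUARD = broadcast(1 << CODE_BITS)    # guard bit of every field
LOW = broadcast(1)                   # lowest bit of every field


def pack(codes):
    """Pack codes into words, FIELDS_PER_WORD codes per word."""
    words = []
    for start in range(0, len(codes), FIELDS_PER_WORD):
        word = 0
        for slot, code in enumerate(codes[start:start + FIELDS_PER_WORD]):
            word |= code << (slot * FIELD_BITS)
        words.append(word)
    return words


def equal_mask(word, code):
    """Return a word whose guard bit is set in each field equal to code."""
    diff = word ^ broadcast(code)    # a field becomes 0 where codes match
    # Subtracting 1 from (field | guard) clears the guard only if field == 0.
    # No borrow crosses fields because every field is at least the guard bit.
    survived = ((diff | GUARD) - LOW) & GUARD
    return survived ^ GUARD


def select_equal(values, target):
    """Return row indices where values[row] == target, via packed codes."""
    dictionary = build_dictionary(values)
    if target not in dictionary:
        return []
    codes = [dictionary[v] for v in values]
    assert max(codes) < (1 << CODE_BITS), "too many distinct values"
    hits = []
    for w_idx, word in enumerate(pack(codes)):
        mask = equal_mask(word, dictionary[target])
        for slot in range(FIELDS_PER_WORD):
            row = w_idx * FIELDS_PER_WORD + slot
            # Skip padding slots in the last word.
            if row < len(values) and (mask >> (slot * FIELD_BITS + CODE_BITS)) & 1:
                hits.append(row)
    return hits
```

For example, `select_equal(["DE", "US", "FR", "US"], "US")` compares all four codes in one word-level operation and then reads the matching rows out of the guard bits. Real SIMD hardware would do the same over wider registers, and a range predicate would need an order-preserving dictionary, which the sorted codes here happen to provide.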

International Solid-State Circuits Conference | 2005

A streaming processing unit for a CELL processor

Brian Flachs; Shigehiro Asano; Sang Hoo Dhong; P. Hofstee; Gilles Gervais; Roy Moonseuk Kim; T. Le; Peichun Liu; Jens Leenstra; John Samuel Liberty; Brad W. Michael; H. Oh; Silvia Melitta Mueller; Osamu Takahashi; A. Hatakeyama; Yukio Watanabe; Naoka Yano

The design of a 4-way SIMD streaming data processor emphasizes achievable performance in area and power. Software controls data movement and instruction flow, and improves data bandwidth and pipeline utilization. The micro-architecture minimizes instruction latency and provides fine-grain clock control to reduce power.

IBM Journal of Research and Development | 2011

IBM POWER7 multicore server processor

Balaram Sinharoy; Ronald Nick Kalla; William J. Starke; Hung Q. Le; R. Cargnoni; J. A. Van Norstrand; B. J. Ronchetti; Jeffrey A. Stuecheli; Jens Leenstra; G. L. Guthrie; D. Q. Nguyen; Bart Blaner; C. F. Marino; E. Retter; Peter Williams

The IBM POWER® processor is the dominant reduced instruction set computing microprocessor in the world today, with a rich history of implementation and innovation over the last 20 years. In this paper, we describe the key features of the POWER7® processor chip. On the chip is an eight-core processor, with each core capable of four-way simultaneous multithreaded operation. Fabricated in IBM's 45-nm silicon-on-insulator (SOI) technology with 11 levels of metal, the chip contains more than one billion transistors. The processor core and caches are significantly enhanced to boost the performance of both single-threaded response-time-oriented, as well as multithreaded, throughput-oriented applications. The memory subsystem contains three levels of on-chip cache, with SOI embedded dynamic random access memory (DRAM) devices used as the last level of cache. A new memory interface using buffered double-data-rate-three DRAM and improvements in reliability, availability, and serviceability are discussed.

IEEE Journal of Solid-State Circuits | 2006

The microarchitecture of the synergistic processor for a cell processor

Brian Flachs; Shigehiro Asano; Sang Hoo Dhong; Harm Peter Hofstee; Gilles Gervais; Roy Kim; T. Le; Peichun Liu; Jens Leenstra; John Samuel Liberty; Brad W. Michael; Hwa-Joon Oh; Silvia Melitta Mueller; Osamu Takahashi; A. Hatakeyama; Yukio Watanabe; Naoka Yano; Daniel Alan Brokenshire; Mohammad Peyravian; Vandung To; E. Iwata

This paper describes an 11 FO4 streaming data processor in the IBM 90-nm SOI-low-k process. The dual-issue, four-way SIMD processor emphasizes achievable performance per area and power. Software controls most aspects of data movement and instruction flow to improve memory system performance and core performance density. The design minimizes instruction latency while providing for fine grain clock control to reduce power.

IBM Journal of Research and Development | 2007

IBM POWER6 accelerators: VMX and DFU

Lee Evan Eisen; J. W. Ward III; H.-W. Tast; N. Mäding; Jens Leenstra; Stefan Mueller; Christian Jacobi; J. Preiss; Eric M. Schwarz; S. R. Carlough

The IBM POWER6™ microprocessor core includes two accelerators for increasing performance of specific workloads. The vector multimedia extension (VMX) provides a vector acceleration of graphic and scientific workloads. It provides single instructions that work on multiple data elements. The instructions separate a 128-bit vector into different components that are operated on concurrently. The decimal floating-point unit (DFU) provides acceleration of commercial workloads, more specifically, financial transactions. It provides a new number system that performs implicit rounding to decimal radix points, a feature essential to monetary transactions. The IBM POWER™ processor instruction set is substantially expanded with the addition of these two accelerators. The VMX architecture contains 176 instructions, while the DFU architecture adds 54 instructions to the base architecture. The IEEE 754R Binary Floating-Point Arithmetic Standard defines decimal floating-point formats, and the POWER6 processor--on which a substantial amount of area has been devoted to increasing performance of both scientific and commercial workloads--is the first commercial hardware implementation of this format.
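
The description of a 128-bit vector being split into components that are operated on concurrently can be pictured with a short sketch. It models only the lane-wise idea, not any specific VMX instruction: the helper names, the choice of four 32-bit lanes, and the wrap-around (modulo) addition are assumptions for illustration.

```python
# Illustrative sketch only: a 128-bit value treated as independent lanes,
# with a lane-wise modulo add. Function names are hypothetical.

VECTOR_BITS = 128


def split_lanes(vector, lane_bits=32):
    """Split a 128-bit integer into lanes, lowest lane first."""
    lane_mask = (1 << lane_bits) - 1
    return [(vector >> (i * lane_bits)) & lane_mask
            for i in range(VECTOR_BITS // lane_bits)]


def join_lanes(lanes, lane_bits=32):
    """Pack lanes (lowest lane first) back into one 128-bit integer."""
    lane_mask = (1 << lane_bits) - 1
    vector = 0
    for i, lane in enumerate(lanes):
        vector |= (lane & lane_mask) << (i * lane_bits)
    return vector


def vector_add(a, b, lane_bits=32):
    """Add two vectors lane by lane; each lane wraps around on overflow."""
    lane_mask = (1 << lane_bits) - 1
    sums = [(x + y) & lane_mask
            for x, y in zip(split_lanes(a, lane_bits), split_lanes(b, lane_bits))]
    return join_lanes(sums, lane_bits)
```

The point of the sketch is that a carry out of one lane never reaches its neighbor, so a single operation on the 128-bit value behaves like four (or, with `lane_bits=8`, sixteen) independent additions.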

IBM Journal of Research and Development | 2015

IBM POWER8 processor core microarchitecture

Balaram Sinharoy; J. A. Van Norstrand; R. J. Eickemeyer; Hung Q. Le; Jens Leenstra; D. Q. Nguyen; B. Konigsburg; K. Ward; M. D. Brown; J. E. Moreira; D. Levitan; S. Tung; D. Hrusecky; J. W. Bishop; M. Gschwind; M. Boersma; M. Kroener; M. Kaltenbach; T. Karkhanis; K. M. Fernsler

The POWER8™ processor is the latest RISC (Reduced Instruction Set Computer) microprocessor from IBM. It is fabricated using the company's 22-nm Silicon on Insulator (SOI) technology with 15 layers of metal, and it has been designed to significantly improve both single-thread performance and single-core throughput over its predecessor, the POWER7® processor. The rate of increase in processor frequency enabled by new silicon technology advancements has decreased dramatically in recent generations, as compared to the historic trend. This has caused many processor designs in the industry to show very little improvement in either single-thread or single-core performance, and, instead, larger numbers of cores are primarily pursued in each generation. Going against this industry trend, the POWER8 processor relies on a much improved core and nest microarchitecture to achieve approximately one-and-a-half times the single-thread performance and twice the single-core throughput of the POWER7 processor in several commercial applications. Combined with a 50% increase in the number of cores (from 8 in the POWER7 processor to 12 in the POWER8 processor), the result is a processor that leads the industry in performance for enterprise workloads. This paper describes the core microarchitecture innovations made in the POWER8 processor that resulted in these significant performance benefits.

IEEE Journal of Solid-state Circuits | 2011

POWER7™, a Highly Parallel, Scalable Multi-Core High End Server Processor

Dieter Wendel; R Kalla; James D. Warnock; R. Cargnoni; S G Chu; J G Clabes; Daniel M. Dreps; D. Hrusecky; Joshua Friedrich; Saiful Islam; J Kahle; Jens Leenstra; Gaurav Mittal; Jose Angel Paredes; Jürgen Pille; Phillip J. Restle; Balaram Sinharoy; G Smith; W J Starke; S Taylor; J. A. Van Norstrand; Stephen Douglas Weitzel; P G Williams; Victor Zyuban

This paper gives an overview of the latest member of the POWER™ processor family, POWER7™. Eight quad-threaded cores, operating at frequencies up to 4.14 GHz, are integrated together with two memory controllers and high speed system links on a 567 mm² die, employing 1.2B transistors in a 45 nm CMOS SOI technology with 11 layers of low-k copper wiring. The technology features deep trench capacitors which are used to build a 32 MB embedded DRAM L3 based on a 0.067 μm² DRAM cell. The functionally equivalent chip transistor count would have been over 2.7B if the L3 had been implemented with a conventional 6 transistor SRAM cell. (A detailed paper about the eDRAM implementation will be given in a separate paper of this Journal). Deep trench capacitors are also used to reduce on-chip voltage island supply noise. This paper describes the organization of the design and the features of the processor core, before moving on to discuss the circuits used for analog elements, clock generation and distribution, and I/O designs. The final section describes the details of the clocked storage elements, including special features for test, debug, and chip frequency tuning.

IBM Journal of Research and Development | 2000

Custom circuit design as a driver of microprocessor performance

D. H. Allen; Sang Hoo Dhong; Harm Peter Hofstee; Jens Leenstra; Kevin J. Nowka; D. L. Stasiak; Dieter Wendel

This paper presents a survey of some of the most aggressive custom designs for CMOS processor products and prototypes in IBM. We argue that microprocessor performance growth, which has traditionally been driven primarily by CMOS technology and microarchitectural improvements, can receive a substantial contribution from improvements in circuit design and physical organization. We predict that in future microprocessor designs the floorplan and wire plan will be as important as the microarchitecture, more control logic will be structured and become indistinguishable from dataflow elements, and more circuits will be designed and analyzed at the level of single transistors and wires.

Design Automation Conference | 2008

Scan chain clustering for test power reduction

Melanie Elm; Hans-Joachim Wunderlich; Michael E. Imhof; Christian G. Zoellin; Jens Leenstra; Nicolas Maeding

An effective technique to save power during scan based test is to switch off unused scan chains. The results obtained with this method strongly depend on the mapping of scan flip-flops into scan chains, which determines how many chains can be deactivated per pattern. In this paper, a new method to cluster flip-flops into scan chains is presented, which minimizes the power consumption during test. It is not dependent on a test set and can improve the performance of any test power reduction technique consequently. The approach does not specify any ordering inside the chains and fits seamlessly to any standard tool for scan chain integration. The application of known test power reduction techniques to the optimized scan chain configurations shows significant improvements for large industrial circuits.

Design Automation Conference | 2007

Scan test planning for power reduction

Michael E. Imhof; Christian G. Zoellin; Hans-Joachim Wunderlich; Nicolas Maeding; Jens Leenstra

Many STUMPS architectures found in current chip designs allow disabling of individual scan chains for debug and diagnosis. In a recent paper it has been shown that this feature can be used for reducing the power consumption during test. Here, we present an efficient algorithm for the automated generation of a test plan that keeps fault coverage as well as test time, while significantly reducing the amount of wasted energy. A fault isolation table, which is usually used for diagnosis and debug, is employed to accurately determine scan chains that can be disabled. The algorithm was successfully applied to large industrial circuits and identifies a very large amount of excess pattern shift activity.
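
How a fault isolation table can indicate which scan chains may be disabled is easier to see with a toy example. The sketch below is not the algorithm from the paper: it assumes a hypothetical table that maps each fault to the (pattern, chain) pairs in which its effect is observed, and it uses a plain greedy set-cover heuristic to keep a small set of pairs enabled while every observable fault stays covered. All names are illustrative.

```python
# Illustrative sketch only: greedy selection of (pattern, chain) pairs that
# must stay enabled so every fault is still observed somewhere. The table
# layout and function name are hypothetical, not the paper's method.
from collections import defaultdict


def plan_test(isolation_table, num_patterns):
    """Return {pattern: [chains to keep enabled]}; other chains can be off.

    isolation_table maps fault -> set of (pattern, chain) pairs in which
    the fault's effect is captured.
    """
    # Invert the table: which faults does each (pattern, chain) pair observe?
    pair_to_faults = defaultdict(set)
    for fault, pairs in isolation_table.items():
        for pair in pairs:
            pair_to_faults[pair].add(fault)

    uncovered = set(isolation_table)
    enabled = set()
    while uncovered and pair_to_faults:
        # Pick the pair observing the most still-uncovered faults;
        # sorting makes tie-breaking deterministic.
        best = max(sorted(pair_to_faults),
                   key=lambda pair: len(pair_to_faults[pair] & uncovered))
        gained = pair_to_faults[best] & uncovered
        if not gained:
            break  # remaining faults are not observed by any pair
        enabled.add(best)
        uncovered -= gained

    # All patterns are kept, so the number of patterns applied is unchanged.
    return {p: sorted(chain for (pattern, chain) in enabled if pattern == p)
            for p in range(num_patterns)}
```

With three patterns and three chains, a table such as `{"f1": {(0, 0), (1, 2)}, "f2": {(0, 0)}, "f3": {(1, 2), (1, 1)}, "f4": {(2, 1)}}` yields a plan that enables one chain per pattern, so six of the nine chain activations could be skipped in this toy case while all four faults remain observed.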
