Rolf Kassa
Intel
Publication
Featured research published by Rolf Kassa.
field programmable gate arrays | 2007
Shih-Lien L. Lu; Peter Yiannacouras; Rolf Kassa; Michael Konow; Taeweon Suh
Software simulation has been the predominant method for architects to evaluate microprocessor research proposals. There are three tenets in modeling new designs with software models: simulation speed, model accuracy and model completeness. The increasing complexity of the processor and the accelerating trend toward multiple processors on a chip are putting a burden on simulators to achieve all three tenets, including accurately capturing OS effects. In this work we perform preliminary experimentation/prototyping with an emulation system which overcomes the tension to satisfy all three requirements. The system is an original Socket-7 based desktop processor system with typical hardware peripherals running modern operating systems such as Fedora Core 4 and Windows XP; however, we have inserted a Xilinx Virtex-4 in place of the processor that should sit in the motherboard and have used the Virtex-4 to host a complete version of the Pentium® microprocessor (which consumes less than half its resources). We can therefore apply architectural changes to the processor and evaluate their effects on the complete desktop system. We use this FPGA-based emulation system to conduct preliminary architectural experiments including growing the branch target buffer and the level 1 caches. In addition, we experimented with interfacing hardware accelerators such as DES and AES engines which resulted in 27x speedups.
international symposium on microarchitecture | 2009
Gilles Pokam; Cristiano Pereira; Klaus Danne; Rolf Kassa; Ali-Reza Adl-Tabatabai
Prior work on HW support for memory race recording piggybacks time stamps on coherence messages and logs the outcome of memory races using point-to-point or chunk-based approaches. These memory race recorder (MRR) techniques are effective, but they require modifications to the cache coherence protocol that can hurt performance. In addition, prior work has mostly focused on directory coherence and considered only CMP systems with single-level cache hierarchies. Most modern CMP systems shipped today, however, implement snoop coherence and feature multilevel cache hierarchies. To be practical, an MRR must target CMPs with multilevel caches, mitigate the coherence overhead due to piggybacking, and emphasize replay speed to broaden the applicability of deterministic replay. This paper contributes three new solutions for making chunk-based MRR practical for modern CMPs. We show that MRR interactions with a cache hierarchy can degrade performance and present a novel mechanism that mitigates this degradation. We propose new mechanisms for snoop-based caches that eliminate coherence traffic overhead due to piggybacking. Finally, we propose new techniques for improving replay speed and introduce a novel framework for evaluating the replay speed potential of MRR designs.
international symposium on computer architecture | 2013
Gilles Pokam; Klaus Danne; Cristiano Pereira; Rolf Kassa; Tim Kranich; Shiliang Hu; Justin E. Gottschlich; Nima Honarmand; Nathan Dautenhahn; Samuel T. King; Josep Torrellas
There has been significant interest in hardware-assisted deterministic Record and Replay (RnR) systems for multithreaded programs on multiprocessors. However, no proposal has implemented this technique in a hardware prototype with full operating system support. Such an implementation is needed to assess RnR practicality. This paper presents QuickRec, the first multicore Intel Architecture (IA) prototype of RnR for multithreaded programs. QuickRec is based on QuickIA, an Intel emulation platform for rapid prototyping of new IA extensions. QuickRec is composed of a Xeon server platform with FPGA-emulated second-generation Pentium cores, and Capo3, a full software stack for managing the recording hardware from within a modified Linux kernel. This paper's focus is understanding and evaluating the implementation issues of RnR on a real platform. Our effort leads to some lessons learned, as well as to some pointers for future research. We demonstrate that RnR can be implemented efficiently on a real multicore IA system. In particular, we show that the rate of memory log generation is insignificant, and that the recording hardware has negligible performance overhead. However, the software stack incurs an average recording overhead of nearly 13%, which must be reduced to enable always-on use of RnR.
ACM Transactions on Reconfigurable Technology and Systems | 2008
Shih-Lien L. Lu; Peter Yiannacouras; Taeweon Suh; Rolf Kassa; Michael Konow
Advancements in reconfigurable technologies, specifically FPGAs, have yielded faster, more power-efficient reconfigurable devices with enormous capacities. In our work, we provide testament to the impressive capacity of recent FPGAs by hosting a complete Pentium® in a single FPGA chip. In addition, we demonstrate how FPGAs can be used for microprocessor design space exploration while overcoming the tension between simulation speed, model accuracy, and model completeness found in traditional software simulator environments. Specifically, we perform preliminary experimentation/prototyping with an original Socket 7 based desktop processor system with typical hardware peripherals running modern operating systems such as Fedora Core 4 and Windows XP; however, we have inserted a Xilinx Virtex-4 in place of the processor that should sit in the motherboard and have used the Virtex-4 to host a complete version of the Pentium® microprocessor (which consumes less than half its resources). We can therefore apply architectural changes to the processor and evaluate their effects on the complete desktop system. We use this FPGA-based emulation system to conduct preliminary architectural experiments including growing the branch target buffer and the level 1 caches. In addition, we experimented with interfacing hardware accelerators such as DES and AES engines which resulted in a 27x speedup.
field-programmable logic and applications | 2010
Qigang Wang; Rolf Kassa; Wenbo Shen; Nelson Ijih; Bhushan Chitlur; Michael Konow; Dong Liu; Arthur Sheiman; Prabhat Gupta
This paper introduces a flexible, hybrid emulation platform for processor-related emulation. It is based on a modern Xeon server and an FPGA. With Intel processors implemented in the FPGA and plugged into the Xeon server's processor socket, and with the Xeon BIOS modified to accommodate different cores, this platform is able to boot an OS and run applications while interacting with the Xeon server's native hardware components. It enables researchers to make architectural changes in the FPGA and quickly evaluate their effect on the whole platform. This platform is an ideal vehicle for research on processor technology, SoC architecture, reconfigurable computing, heterogeneous core architecture, etc.
Archive | 2012
Tim Kranich; Gilles Pokam; Justin E. Gottschlich; Klaus Danne; Rolf Kassa; Shiliang Hu; Cristiano Pereira
Archive | 2013
Justin E. Gottschlich; Klaus Danne; Cristiano Pereira; Gilles Pokam; Rolf Kassa; Shiliang Hu; Tim Kranich
Archive | 2013
Justin E. Gottschlich; Gilles Pokam; Cristiano Pereira; Klaus Danne; Shiliang Hu; Rolf Kassa
Archive | 2014
Rolf Kassa; Justin E. Gottschlich; Shiliang Hu; Gilles Pokam; Robert C. Knauerhase
Archive | 2012
Justin E. Gottschlich; Klaus Danne; Cristiano Pereira; Gilles Pokam; Rolf Kassa; Shiliang Hu; Tim Kranich