Mark P. Mendell
IBM
Publications
Featured research published by Mark P. Mendell.
IBM Journal of Research and Development | 2005
José E. Moreira; George S. Almasi; Charles J. Archer; Ralph Bellofatto; Peter Bergner; José R. Brunheroto; Michael Brutman; José G. Castaños; Paul G. Crumley; Manish Gupta; Todd Inglett; Derek Lieber; David Limpert; Patrick McCarthy; Mark Megerian; Mark P. Mendell; Michael Mundy; Don Reed; Ramendra K. Sahoo; Alda Sanomiya; Richard Shok; Brian E. Smith; Greg Stewart
With up to 65,536 compute nodes and a peak performance of more than 360 teraflops, the Blue Gene®/L (BG/L) supercomputer represents a new level of massively parallel systems. The system software stack for BG/L creates a programming and operating environment that harnesses the raw power of this architecture with great effectiveness. The design and implementation of this environment followed three major principles: simplicity, performance, and familiarity. By specializing the services provided by each component of the system architecture, we were able to keep each one simple and leverage the BG/L hardware features to deliver high performance to applications. We also implemented standard programming interfaces and programming languages that greatly simplified the job of porting applications to BG/L. The effectiveness of our approach has been demonstrated by the operational success of several prototype and production machines, which have already been scaled to 16,384 nodes.
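To illustrate the abstract's point about standard programming interfaces, the following is a minimal sketch (not taken from the paper) of a plain MPI program of the kind that ports to BG/L without modification; it assumes only the standard MPI C bindings.

```cpp
// Minimal sketch (assumption: standard MPI C bindings only) of the kind of
// portable program that runs unchanged on BG/L or any other MPI platform.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's rank
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // number of processes in the job

    // Each rank contributes its rank number; the sum is gathered on rank 0.
    int local = rank, sum = 0;
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        std::printf("sum of ranks across %d processes = %d\n", size, sum);
    }

    MPI_Finalize();
    return 0;
}
```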
IBM Journal of Research and Development | 2005
Siddhartha Chatterjee; L. R. Bachega; Peter Bergner; K. A. Dockser; John A. Gunnels; Manish Gupta; Fred G. Gustavson; Christopher A. Lapkowski; G. K. Liu; Mark P. Mendell; Ravi Nair; C. D. Wait; T. J. C. Ward; Philip T. Wu
We describe the design of a dual-issue single-instruction, multiple-data-like (SIMD-like) extension of the IBM PowerPC® 440 floating-point unit (FPU) core and the compiler and algorithmic techniques to exploit it. This extended FPU is targeted at both the IBM massively parallel Blue Gene®/L machine and the more pervasive embedded platforms. We discuss the hardware and software codesign that was essential in order to fully realize the performance benefits of the FPU when constrained by the memory bandwidth limitations and high penalties for misaligned data access imposed by the memory hierarchy on a Blue Gene/L node. Using both hand-optimized and compiled code for key linear algebraic kernels, we validate the architectural design choices, evaluate the success of the compiler, and quantify the effectiveness of the novel algorithm design techniques. Our measurements show that the combination of algorithm, compiler, and hardware delivers a significant fraction of peak floating-point performance for compute-bound kernels, such as matrix multiplication, and delivers a significant fraction of peak memory bandwidth for memory-bound kernels, such as DAXPY, while remaining largely insensitive to data alignment.
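A minimal sketch of the kind of kernel discussed, assuming nothing beyond portable C++: a DAXPY loop unrolled by two so a SIMD-aware compiler can pair the multiply-adds across the dual FPU pipes. The function name, the __restrict hints, and the unrolling scheme are illustrative, not the paper's actual code.

```cpp
// Illustrative DAXPY kernel (y = a*x + y) written so a SIMD-aware compiler
// can map each unrolled pair onto the two pipes of a dual-issue FPU.
// Portable C++; __restrict is a common compiler extension, and the unrolling
// is an illustration of the idea rather than the paper's implementation.
#include <cstddef>

void daxpy(std::size_t n, double a,
           const double* __restrict x, double* __restrict y) {
    std::size_t i = 0;
    // Two independent multiply-adds per iteration: one for the primary pipe,
    // one for the secondary pipe.
    for (; i + 1 < n; i += 2) {
        y[i]     = a * x[i]     + y[i];
        y[i + 1] = a * x[i + 1] + y[i + 1];
    }
    if (i < n) {              // remainder element when n is odd
        y[i] = a * x[i] + y[i];
    }
}
```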
International Conference on Parallel Architectures and Compilation Techniques | 2004
Leonardo R. Bachega; Siddhartha Chatterjee; Kenneth Dockser; John A. Gunnels; Manish Gupta; Fred G. Gustavson; Christopher A. Lapkowski; Gary K. Liu; Mark P. Mendell; Charles D. Wait; T. J. Christopher Ward
We describe the design, implementation, and evaluation of a dual-issue SIMD-like extension of the PowerPC 440 floating-point unit (FPU) core. This extended FPU is targeted at both IBM's massively parallel Blue Gene/L machine and more pervasive embedded platforms. It has several novel features, such as a computational crossbar and cross-load/store instructions, which enhance the performance of numerical codes. We further discuss the hardware-software co-design that was essential to fully realize the performance benefits of the FPU when constrained by the memory bandwidth limitations and high penalties for misaligned data access imposed by the memory hierarchy on a Blue Gene/L node. We describe several novel compiler and algorithmic techniques to take advantage of this architecture. Using both hand-optimized and compiled code for key linear algebraic kernels, we validate the architectural design choices, evaluate the success of the compiler, and quantify the effectiveness of the novel algorithm design techniques. Preliminary performance data shows that the algorithm-compiler-hardware combination delivers a significant fraction of peak floating-point performance for compute-bound kernels such as matrix multiplication, and delivers a significant fraction of peak memory bandwidth for memory-bound kernels such as daxpy, while being largely insensitive to data alignment.
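To make the role of the crossbar and cross operations concrete, here is a hedged sketch of a complex multiply-accumulate loop, the kind of kernel whose "crossed" real/imaginary products those features target. Names and structure are illustrative portable C++ only; this is not the implementation from the paper.

```cpp
// Sketch of a complex multiply-accumulate written element-wise: each product
// needs the parallel terms (re*re, im*im) and the "crossed" terms
// (re*im, im*re), which is what a computational crossbar and cross
// load/store instructions are designed to feed efficiently.
#include <cstddef>

struct Complex { double re, im; };

// acc[i] += a[i] * b[i] over arrays of complex values.
void zmadd(std::size_t n, const Complex* a, const Complex* b, Complex* acc) {
    for (std::size_t i = 0; i < n; ++i) {
        double re = a[i].re * b[i].re - a[i].im * b[i].im;  // parallel terms
        double im = a[i].re * b[i].im + a[i].im * b[i].re;  // crossed terms
        acc[i].re += re;
        acc[i].im += im;
    }
}
```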
IBM Journal of Research and Development | 2005
Robert F. Enenkel; Blake G. Fitch; Robert S. Germain; Fred G. Gustavson; Andrew K. Martin; Mark P. Mendell; Jed W. Pitera; Mike Pitman; Aleksandr Rayshubskiy; Frank Suits; William C. Swope; T. J. C. Ward
While developing the protein folding application for the IBM Blue Gene®/L supercomputer, some frequently executed computational kernels were encountered. These were significantly more complex than the linear algebra kernels that are normally provided as tuned libraries with modern machines. Using regular library functions for these would have resulted in an application that exploited only 5-10% of the potential floating-point throughput of the machine. This paper is a tour of the functions encountered; they have been expressed in C++ (and could be expressed in other languages such as Fortran or C). With the help of a good optimizing compiler, floating-point efficiency is much closer to 100%. The protein folding application was initially run by the life science researchers on IBM POWER3™ machines while the computer science researchers were designing and bringing up the Blue Gene/L hardware. Some of the work discussed resulted in enhanced compiler optimizations, which now improve the performance of floating-point-intensive applications compiled by the IBM VisualAge® series of compilers for POWER3, POWER4™, POWER4+™, and POWER5™. The implementations are offered in the hope that they may help in other implementations of molecular dynamics or in other fields of endeavor, and in the hope that others may adapt the ideas presented here to deliver additional mathematical functions at high throughput.
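A minimal sketch of the coding style the abstract describes, assuming plain C++ and an optimizing compiler: a bulk, dependence-free loop over arrays that leaves the compiler free to software-pipeline the floating-point work. All names are illustrative; the paper's actual kernels are more complex.

```cpp
// Sketch of a bulk, dependence-free kernel over plain arrays: reciprocal
// distances computed from coordinate differences. Writing it this way lets
// an optimizing compiler software-pipeline the loop and approach peak
// floating-point throughput. Names are illustrative, not the paper's.
#include <cmath>
#include <cstddef>

void bulk_inverse_distance(std::size_t n,
                           const double* __restrict dx,
                           const double* __restrict dy,
                           const double* __restrict dz,
                           double* __restrict inv_r) {
    for (std::size_t i = 0; i < n; ++i) {
        double r2 = dx[i] * dx[i] + dy[i] * dy[i] + dz[i] * dz[i];
        inv_r[i] = 1.0 / std::sqrt(r2);  // candidate for a tuned rsqrt routine
    }
}
```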
International Conference on Extending Database Technology | 2012
Mark P. Mendell; Howard Nasgaard; Eric Bouillet; Martin Hirzel; Bugra Gedik
General-purpose streaming systems support diverse application domains with powerful and user-defined stream operators. Most general-purpose streaming systems have their own, non-XML, internal data representation. However, streaming input is often either a sequence of small XML documents, or a scan of a huge document. Prior work on XML streaming focuses on filtering, not transforming, XML, and does not describe how to integrate with a general-purpose streaming system. This paper describes how to integrate an XML transformer with a streaming system by designing a specification syntax that is both consistent with the existing system and familiar to XML users. After type-checking the specification, we compile it to an efficient automaton driven by SAX events. Our approach extends the underlying streaming system with XML support without changing its core architecture, and the same technique could be used for other extensions beyond XML.
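The following is a small sketch of the underlying idea, a hand-written stand-in for the compiled automaton: SAX-style events drive a state machine that extracts one field per small XML document and emits a tuple. The event types, element names, and class are hypothetical illustrations, not the specification syntax or generated code described in the paper.

```cpp
// Hand-written stand-in for the compiled automaton: SAX-style events drive a
// small state machine that captures the text of <name> inside each <item> and
// emits one tuple per document. Event kinds, element names, and the emit step
// (a print) are hypothetical illustrations.
#include <initializer_list>
#include <iostream>
#include <string>

enum class EventKind { StartElement, Characters, EndElement };

struct SaxEvent {
    EventKind kind;
    std::string value;  // element name for Start/End, text for Characters
};

class ItemNameAutomaton {
    enum class State { Outside, InItem, InName } state_ = State::Outside;
    std::string buffer_;

public:
    void feed(const SaxEvent& e) {
        switch (e.kind) {
        case EventKind::StartElement:
            if (state_ == State::Outside && e.value == "item") {
                state_ = State::InItem;
            } else if (state_ == State::InItem && e.value == "name") {
                state_ = State::InName;
                buffer_.clear();
            }
            break;
        case EventKind::Characters:
            if (state_ == State::InName) buffer_ += e.value;
            break;
        case EventKind::EndElement:
            if (state_ == State::InName && e.value == "name") {
                std::cout << "tuple: name=" << buffer_ << "\n";  // emit downstream
                state_ = State::InItem;
            } else if (state_ == State::InItem && e.value == "item") {
                state_ = State::Outside;
            }
            break;
        }
    }
};

int main() {
    ItemNameAutomaton a;
    // Event stream for one small document: <item><name>widget</name></item>
    for (const SaxEvent& e : {SaxEvent{EventKind::StartElement, "item"},
                              SaxEvent{EventKind::StartElement, "name"},
                              SaxEvent{EventKind::Characters, "widget"},
                              SaxEvent{EventKind::EndElement, "name"},
                              SaxEvent{EventKind::EndElement, "item"}}) {
        a.feed(e);
    }
    return 0;
}
```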
IBM Journal of Research and Development | 2013
Martin Hirzel; Henrique Andrade; Bugra Gedik; Gabriela Jacques-Silva; Rohit Khandekar; Vibhore Kumar; Mark P. Mendell; Howard Nasgaard; Scott Schneider; Robert Soulé; Kun-Lung Wu
Archive | 2010
Roch Georges Archambault; Yaoqing Gao; Allan Russell Martin; Mark P. Mendell; Raul Esteban Silvera; Graham Yin
Archive | 2005
Liangxiao Hu; Mark P. Mendell; Raul Esteban Silvera
Archive | 2007
Gheorghe C. Cascaval; Yaoqing Gao; Allen Russell Martin; Mark P. Mendell
Archive | 2004
Ronald Ian McIntosh; Mark P. Mendell