Valentina Salapura | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Valentina Salapura is active.

Explore More

Publication

Featured researches published by Valentina Salapura.

ieee international conference on high performance computing data and analytics | 2011

The IBM Blue Gene/Q interconnection network and message unit

Dong Chen; Noel A. Eisley; Philip Heidelberger; Robert M. Senger; Yutaka Sugawara; Sameer Kumar; Valentina Salapura; David L. Satterfield; Burkhard Steinmacher-Burow; Jeffrey J. Parker

This is the first paper describing the IBM Blue Gene/Q interconnection network and message unit. The Blue Gene/Q system is the third generation in the IBM Blue Gene line of massively parallel supercomputers. The Blue Gene/Q architecture can be scaled to 20 PF/s and beyond. The network and the highly parallel message unit, which provides the functionality of a network interface, are integrated onto the same chip as the processors and cache memory, and consume 8% of the chips area. For better application scalability and performance, we describe new routing algorithms and new techniques to parallelize the injection and reception of packets in the network interface. Measured hardware performance results are also presented.

international symposium on microarchitecture | 2012

The IBM Blue Gene/Q Interconnection Fabric

Dong Chen; Noel A. Eisley; Philip Heidelberger; Robert M. Senger; Yutaka Sugawara; Sameer Kumar; Valentina Salapura; David L. Satterfield; Burkhard Steinmacher-Burow; Jeffrey J. Parker

This article describes the IBM Blue Gene/Q interconnection network and message unit. Blue Gene/Q is the third generation in the IBM Blue Gene line of massively parallel supercomputers and can be scaled to 20 petaflops and beyond. For better application scalability and performance, Blue Gene/Q has new routing algorithms and techniques to parallelize the injection and reception of packets in the network interface.

computing frontiers | 2015

SparkBench: a comprehensive benchmarking suite for in memory data analytic platform Spark

Min Li; Jian Tan; Yandong Wang; Li Zhang; Valentina Salapura

Spark has been increasingly adopted by industries in recent years for big data analysis by providing a fault tolerant, scalable and easy-to-use in memory abstraction. Moreover, the community has been actively developing a rich ecosystem around Spark, making it even more attractive. However, there is not yet a Spark specify benchmark existing in the literature to guide the development and cluster deployment of Spark to better fit resource demands of user applications. In this paper, we present SparkBench, a Spark specific benchmarking suite, which includes a comprehensive set of applications. SparkBench covers four main categories of applications, including machine learning, graph computation, SQL query and streaming applications. We also characterize the resource consumption, data flow and timing information of each application and evaluate the performance impact of a key configuration parameter to guide the design and optimization of Spark data analytic platform.

IEEE Transactions on Very Large Scale Integration Systems | 2001

FPGA prototyping of a RISC processor core for embedded applications

Michael Karl Gschwind; Valentina Salapura; Dietmar Maurer

Application-specific processors offer an attractive option in the design of embedded systems by providing high performance for a specific application domain. In this work, we describe the use of a reconfigurable processor core based on an RISC architecture as starting point for application-specific processor design. By using a common base instruction set, development cost can be reduced and design space exploration is focused on the application-specific aspects of performance. An important aspect of deploying any new architecture is verification which usually requires lengthy software simulation of a design model. We show how hardware emulation based on programmable logic can be integrated into the hardware/software codesign flow. While previously hardware emulation required massive investment in design effort and special purpose emulators, an emulation approach based on high-density field-programmable gate array (FPGA) devices now makes hardware emulation practical and cost effective for embedded processor designs. To reduce development cost and avoid duplication of design effort, FPGA prototypes and ASIC implementations are derived from a common source: We show how to perform targeted optimizations to fully exploit the capabilities of the target technology while maintaining a common source base.

Ibm Journal of Research and Development | 2005

Blue Gene/L advanced diagnostics environment

Mark E. Giampapa; Ralph Bellofatto; Matthias A. Blumrich; Dong Chen; Marc Boris Dombrowa; Alan Gara; Ruud A. Haring; Philip Heidelberger; Dirk Hoenicke; Gerard V. Kopcsay; Ben J. Nathanson; Burkhard Steinmacher-Burow; Martin Ohmacht; Valentina Salapura; Pavlos M. Vranas

This paper describes the Blue Gene®/L advanced diagnostics environment (ADE) used throughout all aspects of the Blue Gene/L project, including design, logic verification, bring-up, diagnostics, and manufacturing test. The Blue Gene/L ADE consists of a lightweight multithreaded coherence-managed kernel, runtime libraries, device drivers, system programming interfaces, compilers, and host-based development tools. It provides complete and flexible access to all features of the Blue Gene/L hardware. Prior to the existence of hardware, ADE was used on Very high-speed integrated circuit Hardware Description Language (VHDL) models, not only for logic verification, but also for performance measurements, code-path analysis, and evaluation of architectural tradeoffs. During early hardware bring-up, the ability to run in a cycle-reproducible manner on both hardware and VHDL proved invaluable in fault isolation and analysis. However, ADE is also capable of supporting high-performance applications and parallel test cases, thereby permitting us to stress the hardware to the limits of its capabilities. This paper also provides insights into system-level and device-level programming of Blue Gene/L to assist developers of high-performance applications o more fully exploit the performance of the machine.

high-performance computer architecture | 2008

Design and implementation of the blue gene/P snoop filter

Valentina Salapura; Matthias A. Blumrich; Alan Gara

As multi-core processors evolve, coherence traffic between cores is becoming problematic, both in terms of performance and power. The negative effects of coherence (snoop) traffic can be significantly mitigated through snoop filtering. Shielding each cache with a device that can squash snoop requests for addresses known not to be in cache improves performance significantly for caches that cannot perform normal load and snoop lookups simultaneously. In addition, reducing snoop lookups yields power savings. This paper describes the design of the Blue Gene/P snoop filters, and presents hardware measurements to demonstrate their effectiveness. The Blue Gene/P snoop filters combine stream registers and snoop caches to capture both the locality of snoop addresses and their streaming behavior. Simulations of SPLASH-2 benchmarks illustrate tradeoffs and strengths of these two techniques. Their combination is shown to be most effective, eliminating 94-99% of all snoop requests using very few stream registers and snoop cache lines. This translates into an average performance improvement of almost 20% for the NAS benchmarks running on an actual Blue Gene/P system.

International Journal of Parallel Programming | 2007

The blue gene/L supercomputer: a hardware and software story

José E. Moreira; Valentina Salapura; George S. Almasi; Charles J. Archer; Ralph Bellofatto; Peter Edward Bergner; Randy Bickford; Matthias A. Blumrich; José R. Brunheroto; Arthur A. Bright; Michael Brian Brutman; José G. Castaños; Dong Chen; Paul W. Coteus; Paul G. Crumley; Sam Ellis; Thomas Eugene Engelsiepen; Alan Gara; Mark E. Giampapa; Tom Gooding; Shawn A. Hall; Ruud A. Haring; Roger L. Haskin; Philip Heidelberger; Dirk Hoenicke; Todd A. Inglett; Gerard V. Kopcsay; Derek Lieber; David Roy Limpert; Patrick Joseph McCarthy

The Blue Gene/L system at the Department of Energy Lawrence Livermore National Laboratory in Livermore, California is the world’s most powerful supercomputer. It has achieved groundbreaking performance in both standard benchmarks as well as real scientific applications. In that process, it has enabled new science that simply could not be done before. Blue Gene/L was developed by a relatively small team of dedicated scientists and engineers. This article is both a description of the Blue Gene/L supercomputer as well as an account of how that system was designed, developed, and delivered. It reports on the technical characteristics of the system that made it possible to build such a powerful supercomputer. It also reports on how teams across the world worked around the clock to accomplish this milestone of high-performance computing.

IEEE Transactions on Fuzzy Systems | 2000

A fuzzy RISC processor

Valentina Salapura

We describe application-specific extensions for fuzzy processing to a general purpose processor. The application-specific instruction set extensions were defined and evaluated using hardware/software codesign techniques. Based on this approach, we have extended the MIPS instruction set architecture with only a few new instructions to significantly speed up fuzzy computation with no increase of the processor cycle time and with only minor increase in chip area. The processor is implemented using a reconfigurable processor core which was designed as a starting point for application-specific processor designs to be used in embedded applications. Performance is presented for three representative applications of varying complexity.

computing frontiers | 2005

Power and performance optimization at the system level

Valentina Salapura; Randy Bickford; Matthias A. Blumrich; Arthur A. Bright; Dong Chen; Paul W. Coteus; Alan Gara; Mark E. Giampapa; Michael Karl Gschwind; Manish Gupta; Shawn A. Hall; Ruud A. Haring; Philip Heidelberger; Dirk Hoenicke; Gerard V. Kopcsay; Martin Ohmacht; Rick A. Rand; Todd E. Takken; Pavlos M. Vranas

The BlueGene/L supercomputer has been designed with a focus on power/performance efficiency to achieve high application performance under the thermal constraints of common data centers. To achieve this goal, emphasis was put on system solutions to engineer a power-efficient system. To exploit thread level parallelism, the BlueGene/L system can scale to 64 racks with a total of 65536 computer nodes consisting of a single compute ASIC integrating all system functions with two industry-standard PowerPC microprocessor cores in a chip multiprocessor configuration. Each PowerPC processor exploits data-level parallelism with a high-performance SIMD oating point unitTo support good application scaling on such a massive system, special emphasis was put on efficient communication primitives by including five highly optimized communification networks. After an initial introduction of the Blue-Gene/L system architecture, we analyze power/performance efficiency for the BlueGene system using performance and power characteristics for the overall system performance (as exemplified by peak performance numbers.To understand application scaling behavior, and its impact on performance and power/performance efficiency, we analyze the NAMD molecular dynamics package using the ApoA1 benchmark. We find that even for strong scaling problems, BlueGene/L systems can deliver superior performance scaling and deliver significant power/performance efficiency. Application benchmark power/performance scaling for the voltage-invariant energy delay 2 power/performance metric demonstrates that choosing a power-efficient 700MHz embedded PowerPC processor core and relying on application parallelism was the right decision to build a powerful, and power/performance efficient system

Ibm Journal of Research and Development | 2005

Blue Gene/L compute chip: memory and Ethernet subsystem

Martin Ohmacht; Reinaldo A. Bergamaschi; Subhrajit Bhattacharya; Alan Gara; Mark E. Giampapa; Balaji Gopalsamy; Ruud A. Haring; Dirk Hoenicke; David John Krolak; James A. Marcella; Ben J. Nathanson; Valentina Salapura; Michael E. Wazlowski

The Blue Gene®/L compute chip is a dual-processor system-on-a-chip capable of delivering an arithmetic peak performance of 5.6 gigaflops. To match the memory speed to the high compute performance, the system implements an aggressive three-level on-chip cache hierarchy. The implemented hierarchy offers high bandwidth and integrated prefetching on cache hierarchy levels 2 and 3 (L2 and L3) to reduce memory access time. A Gigabit Ethernet interface driven by direct memory access (DMA) is integrated in the cache hierarchy, requiring only an external physical link layer chip to connect to the media. The integrated L3 cache stores a total of 4 MB of data, using multibank embedded dynamic random access memory (DRAM). The 1,024-bit-wide data port of the embedded DRAM provides 22.4 GB/s bandwidth to serve the speculative prefetching demands of the two processor cores and the Gigabit Ethernet DMA engine. To reduce hardware overhead due to cache coherence intervention requests, memory coherence is maintained by software. This is particularly efficient for regular highly parallel applications with partitionable working sets. The system further integrates an on-chip double-data-rate (DDR) DRAM controller for direct attachment of main memory modules to optimize overall memory performance and cost. For booting the system and low-latency interprocessor communication and synchronization, a 16-KB static random access memory (SRAM) and hardware locks have been added to the design.

Explore More