Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where José Carlos Sancho is active.

Publication


Featured research published by José Carlos Sancho.


IEEE International Conference on High Performance Computing, Data, and Analytics | 2008

Entering the petaflop era: the architecture and performance of Roadrunner

Kevin J. Barker; Kei Davis; Adolfy Hoisie; Darren J. Kerbyson; Michael Lang; Scott Pakin; José Carlos Sancho

Roadrunner is a 1.38 Pflop/s-peak (double precision) hybrid-architecture supercomputer developed by LANL and IBM. It contains 12,240 IBM PowerXCell 8i processors and 12,240 AMD Opteron cores in 3,060 compute nodes. Roadrunner is the first supercomputer to run Linpack at a sustained speed in excess of 1 Pflop/s. In this paper we present a detailed architectural description of Roadrunner and a detailed performance analysis of the system. A case study of optimizing the MPI-based application Sweep3D to exploit Roadrunner's hybrid architecture is also included. The performance of Sweep3D is compared to that of the code on a previous implementation of the Cell Broadband Engine architecture (the Cell BE) and on multi-core processors. Using validated performance models combined with Roadrunner-specific microbenchmarks, we identify performance issues in the early pre-delivery system and infer how well the final Roadrunner configuration will perform once the system software stack has matured.
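The quoted 1.38 Pflop/s peak can be roughly reconstructed from the component counts above. The per-device double-precision peak rates used below are commonly cited figures and are assumptions for this back-of-the-envelope sketch, not numbers taken from the paper:

```c
/* Rough sanity check of Roadrunner's quoted 1.38 Pflop/s peak.
 * The per-device peak figures below are commonly cited values and are
 * assumptions, not numbers taken from the abstract above. */
#include <stdio.h>

int main(void) {
    const double cell_chips    = 12240.0;  /* PowerXCell 8i processors */
    const double opteron_cores = 12240.0;  /* AMD Opteron cores        */

    /* Assumed double-precision peaks (GFlop/s):
     * PowerXCell 8i: 8 SPEs x 12.8 + ~6.4 for the PPE ~= 108.8
     * Opteron core:  1.8 GHz x 2 flops/cycle           =   3.6 */
    const double cell_peak_gf    = 108.8;
    const double opteron_peak_gf = 3.6;

    double total_pf = (cell_chips * cell_peak_gf +
                       opteron_cores * opteron_peak_gf) / 1.0e6;

    printf("Estimated peak: %.2f Pflop/s\n", total_pf);  /* ~1.38 */
    return 0;
}
```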


Conference on High Performance Computing (Supercomputing) | 2005

Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers

Roberto Gioiosa; José Carlos Sancho; Song Jiang; Fabrizio Petrini; Kei Davis

We describe the software architecture, technical features, and performance of TICK (Transparent Incremental Checkpointer at Kernel level), a system-level checkpointer implemented as a kernel thread, specifically designed to provide fault tolerance in Linux clusters. This implementation, based on the 2.6.11 Linux kernel, provides the essential functionality for transparent, highly responsive, and efficient fault tolerance based on full or incremental checkpointing at system level. TICK is completely user-transparent and does not require any changes to user code or system libraries; it is highly responsive: an interrupt, such as a timer interrupt, can trigger a checkpoint in as little as 2.5µs; and it supports incremental and full checkpoints with minimal overhead (less than 6% with full checkpointing to disk performed as frequently as once per minute).
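TICK itself operates inside the kernel; as a rough illustration of the dirty-page tracking that incremental checkpointing relies on, the following user-level sketch write-protects a memory region and records which pages are touched between checkpoints. It is a minimal demonstration under POSIX assumptions, not the TICK implementation:

```c
/* Illustrative user-level sketch of dirty-page tracking, the core idea
 * behind incremental checkpointing. TICK itself runs inside the kernel;
 * this is NOT its implementation, only a minimal demonstration of the
 * technique using mprotect() and a SIGSEGV handler. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_PAGES 16

static char *region;
static long page_size;
static int dirty[REGION_PAGES];   /* pages touched since the last checkpoint */

static void segv_handler(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    long idx = ((char *)si->si_addr - region) / page_size;
    if (idx >= 0 && idx < REGION_PAGES) {
        dirty[idx] = 1;                               /* remember the dirty page */
        mprotect(region + idx * page_size, page_size, /* re-enable writes to it  */
                 PROT_READ | PROT_WRITE);
    } else {
        _exit(1);                                     /* a real fault elsewhere  */
    }
}

static void begin_checkpoint_interval(void) {
    memset(dirty, 0, sizeof dirty);
    /* Write-protect the region so the first write to each page faults. */
    mprotect(region, REGION_PAGES * page_size, PROT_READ);
}

int main(void) {
    page_size = sysconf(_SC_PAGESIZE);
    region = mmap(NULL, REGION_PAGES * page_size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = {0};
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = segv_handler;
    sigaction(SIGSEGV, &sa, NULL);

    begin_checkpoint_interval();
    region[0] = 1;                         /* dirties page 0 */
    region[5 * page_size] = 1;             /* dirties page 5 */

    /* An incremental checkpoint would now save only the dirty pages. */
    for (int i = 0; i < REGION_PAGES; i++)
        if (dirty[i]) printf("page %d is dirty\n", i);
    return 0;
}
```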


Conference on High Performance Computing (Supercomputing) | 2006

Quantifying the potential benefit of overlapping communication and computation in large-scale scientific applications

José Carlos Sancho; Kevin J. Barker; Darren J. Kerbyson; Kei Davis

The design and implementation of a high-performance communication network are critical factors in determining the performance and cost-effectiveness of a large-scale computing system. The major issues center on the trade-off between the network cost and the impact of latency and bandwidth on application performance. One promising technique for extracting maximum application performance given limited network resources is based on overlapping computation with communication, which partially or entirely hides communication delays. While this approach is not new, there are few studies that quantify the potential benefit of such overlapping for large-scale production scientific codes. We address this with an empirical method combined with a network model to quantify the potential overlap in several codes and examine the possible performance benefit. Our results demonstrate, for the codes examined, that a high potential tolerance to network latency and bandwidth exists because of a high degree of potential overlap. Moreover, our results indicate that there is often no need to use fine-grained communication mechanisms to achieve this benefit, since the major source of potential overlap is found in independent work, that is, computation that does not depend on pending messages. This allows for a potentially significant relaxation of network requirements without a consequent degradation of application performance.
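The overlap pattern being quantified can be illustrated with a minimal nonblocking MPI sketch: post the sends and receives, perform the independent work that does not depend on the in-flight messages, and only then wait. This is a generic example; the buffer sizes and the work routines are placeholders, not code from the study:

```c
/* A minimal sketch, not the paper's code, of communication/computation
 * overlap: post nonblocking MPI operations, do the independent work that
 * does not depend on the in-flight messages, then wait before touching
 * the received data. independent_work() and dependent_work() stand in
 * for application-specific computation. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

static void independent_work(double *u, int n) {       /* needs no remote data */
    for (int i = 0; i < n; i++) u[i] = u[i] * 0.5 + 1.0;
}

static void dependent_work(double *u, const double *halo, int n) {
    for (int i = 0; i < n; i++) u[i] += halo[i];        /* needs received halo  */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size, left = (rank - 1 + size) % size;
    double *u    = calloc(N, sizeof *u);
    double *sbuf = calloc(N, sizeof *sbuf);
    double *rbuf = calloc(N, sizeof *rbuf);

    MPI_Request reqs[2];
    MPI_Irecv(rbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    independent_work(u, N);                  /* overlapped with the transfer */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    dependent_work(u, rbuf, N);              /* safe: messages have arrived  */

    if (rank == 0) printf("done\n");
    free(u); free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}
```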


International Parallel and Distributed Processing Symposium | 2004

On the feasibility of incremental checkpointing for scientific computing

José Carlos Sancho; Fabrizio Petrini; Gregory Johnson; Eitan Frachtenberg

In the near future, large-scale parallel computers will feature hundreds of thousands of processing nodes. In such systems, fault tolerance is critical, as failures will occur very often. Checkpointing and rollback recovery have been extensively studied as an attempt to provide fault tolerance. However, current implementations do not provide the total transparency and full flexibility that are necessary to support the new paradigm of autonomic computing: systems able to self-heal and self-repair. We provide an in-depth evaluation of incremental checkpointing for scientific computing. The experimental results, obtained on a state-of-the-art cluster running several scientific applications, show that efficient, scalable, automatic, and user-transparent incremental checkpointing is within reach with current technology.
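As a hypothetical illustration of why incremental checkpointing is attractive, the sketch below compares the data volume of a full checkpoint with an incremental one; the memory footprint and dirty-page fraction are invented for the example, not measurements from the paper:

```c
/* A toy back-of-the-envelope comparison of full vs. incremental
 * checkpoint volume. The memory footprint and dirty-page fraction are
 * illustrative assumptions, not measurements from the paper. */
#include <stdio.h>

int main(void) {
    const double footprint_gb   = 2.0;   /* assumed per-process memory image   */
    const double dirty_fraction = 0.10;  /* assumed pages touched per interval */

    double full_gb        = footprint_gb;
    double incremental_gb = footprint_gb * dirty_fraction;

    printf("full checkpoint:        %.2f GB per process\n", full_gb);
    printf("incremental checkpoint: %.2f GB per process (%.0f%% of full)\n",
           incremental_gb, 100.0 * dirty_fraction);
    return 0;
}
```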


IEEE Computer | 2009

Using Performance Modeling to Design Large-Scale Systems

Kevin J. Barker; Kei Davis; Adolfy Hoisie; Darren J. Kerbyson; Michael Lang; Scott Pakin; José Carlos Sancho

A methodology for accurately modeling large applications explores the performance of ultrascale systems at different stages in their life cycle, from early design through production use.
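Analytic models of this kind typically combine a computation term with a latency/bandwidth communication term. The sketch below shows that general form with assumed parameter values; it is a generic example, not the authors' model:

```c
/* A generic latency/bandwidth performance-model sketch in the style used
 * for large-scale system modeling. The parameter values are assumptions
 * for illustration, not data or the model from the article. */
#include <stdio.h>

int main(void) {
    const double t_compute     = 0.80;    /* seconds of computation per step */
    const double latency       = 2.0e-6;  /* network latency (s)             */
    const double bandwidth     = 2.0e9;   /* network bandwidth (bytes/s)     */
    const double msg_bytes     = 1.0e6;   /* bytes per message               */
    const double msgs_per_step = 10.0;    /* messages per iteration          */

    double t_comm = msgs_per_step * (latency + msg_bytes / bandwidth);
    double t_step = t_compute + t_comm;   /* simple no-overlap model         */

    printf("predicted time per step: %.4f s (%.1f%% communication)\n",
           t_step, 100.0 * t_comm / t_step);
    return 0;
}
```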


International Parallel and Distributed Processing Symposium | 2005

Current practice and a direction forward in checkpoint/restart implementations for fault tolerance

José Carlos Sancho; Fabrizio Petrini; Kei Davis; Roberto Gioiosa; Song Jiang

Checkpoint/restart is a general idea for which particular implementations enable various functionalities in computer systems, including process migration, gang scheduling, hibernation, and fault tolerance. For fault tolerance, in current practice, implementations can be at user level or system level. User-level implementations are relatively easy to implement and portable, but suffer from a lack of transparency, flexibility, and efficiency, and in particular are unsuitable for the autonomic (self-managing) computing systems envisioned as the next revolutionary development in system management. In contrast, a system-level implementation can exhibit all of these desirable features, at the cost of a more sophisticated implementation, and is seen as an essential mechanism for the next generation of fault-tolerant, and ultimately autonomic, large-scale computing systems. Linux is becoming the operating system of choice for the largest-scale machines, but development of system-level checkpoint/restart mechanisms for Linux is still in its infancy, with all extant implementations exhibiting serious deficiencies for achieving transparent fault tolerance. This paper provides a survey of extant implementations in a natural taxonomy, highlighting their strengths and inherent weaknesses.


Parallel Processing Letters | 2008

A Performance Evaluation of the Nehalem Quad-Core Processor for Scientific Computing

Kevin J. Barker; Kei Davis; Adolfy Hoisie; Darren J. Kerbyson; Michael Lang; Scott Pakin; José Carlos Sancho

In this work we present an initial performance evaluation of Intel's latest, second-generation quad-core processor, Nehalem, and provide a comparison to the first-generation AMD and Intel quad-core processors, Barcelona and Tigerton. Nehalem is the first Intel processor to implement a NUMA architecture incorporating QuickPath Interconnect for interconnecting processors within a node, and the first to incorporate an integrated memory controller. We evaluate the suitability of these processors in quad-socket compute nodes as building blocks for large-scale scientific computing clusters. Our analysis of intra-processor and intra-node scalability of microbenchmarks, and a range of large-scale scientific applications, indicates that quad-core processors can deliver an improvement in performance of up to 4x over a single core, depending on the workload being processed. However, scalability can be less when considering a full node. We show that Nehalem outperforms Barcelona on memory-intensive codes by a factor of two for a Nehalem node with 8 cores and a Barcelona node containing 16 cores. Further optimizations are possible with Nehalem, including the use of Simultaneous Multithreading, which improves the performance of some applications by up to 50%.
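Node-level evaluations of this sort commonly rely on memory-bandwidth microbenchmarks. The following STREAM-triad-style loop is a generic sketch of that kind of test, not the benchmark suite used in the paper; the array size and scalar are arbitrary choices:

```c
/* A minimal STREAM-triad-style memory-bandwidth microbenchmark of the
 * kind used in node-level evaluations. It is a generic sketch, not the
 * authors' benchmarks. Compile with, e.g., -O2 -fopenmp. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 26)            /* ~64M doubles per array (~512 MB each) */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double scalar = 3.0;

    #pragma omp parallel for
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; a[i] = 0.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];            /* triad: 2 loads + 1 store */
    double t1 = omp_get_wtime();

    /* Three arrays of 8-byte doubles move through memory per iteration. */
    double gbytes = 3.0 * N * sizeof(double) / 1.0e9;
    printf("triad bandwidth: %.1f GB/s (threads: %d)\n",
           gbytes / (t1 - t0), omp_get_max_threads());

    free(a); free(b); free(c);
    return 0;
}
```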


International Parallel and Distributed Processing Symposium | 2008

Analysis of double buffering on two different multicore architectures: Quad-core Opteron and the Cell-BE

José Carlos Sancho; Darren J. Kerbyson

In order to take full advantage of multi-core processors, careful attention must be given to the way in which each core interacts with main memory. In data-rich parallel applications, multiple transfers between the main memory and the local memory (cache or other) of each core will be required. It will be increasingly important to overlap these data transfers with useful computation in order to achieve high performance. One approach to exploiting this compute-transfer overlap is to use double-buffering techniques that require minimal resources in the local memory available to the cores. In this paper, we present optimized buffering techniques and evaluate them for two state-of-the-art multi-core architectures: the quad-core Opteron and the Cell-BE. Experimental results show that using double buffering can deliver substantially higher performance for codes with data-parallel loop structures. Performance improvements of 1.4x and 2.2x can be achieved for the quad-core Opteron and Cell-BE respectively. Moreover, this study also provides insight into the application characteristics required for achieving improved performance when using double buffering, and also the tuning that is required in order to achieve optimal performance.
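The double-buffering idea can be sketched generically: while one buffer is being processed, the next chunk of data is fetched into the other buffer so that transfer and computation overlap. The example below uses POSIX asynchronous I/O as a stand-in for the DMA or prefetch mechanisms discussed in the paper; the input file, chunk size, and process() routine are placeholders:

```c
/* A generic double-buffering sketch, not the paper's implementation:
 * while one buffer is processed, the next chunk is fetched into the
 * other buffer with POSIX asynchronous I/O, overlapping transfer with
 * computation. Link with -lrt on Linux. */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1 << 20)                    /* 1 MiB per buffer */

static void process(const char *buf, ssize_t n) {
    long sum = 0;                          /* placeholder computation */
    for (ssize_t i = 0; i < n; i++) sum += buf[i];
    (void)sum;
}

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    static char buf[2][CHUNK];
    struct aiocb cb;
    const struct aiocb *const list[] = { &cb };
    off_t offset = 0;
    int cur = 0;

    /* Prime the pipeline: start reading chunk 0 into buffer 0. */
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd; cb.aio_buf = buf[cur];
    cb.aio_nbytes = CHUNK; cb.aio_offset = offset;
    aio_read(&cb);

    for (;;) {
        /* Wait for the chunk that was fetched in the background. */
        while (aio_error(&cb) == EINPROGRESS) aio_suspend(list, 1, NULL);
        ssize_t got = aio_return(&cb);
        if (got <= 0) break;

        int ready = cur;
        cur = 1 - cur;
        offset += got;

        /* Start fetching the next chunk into the other buffer... */
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd; cb.aio_buf = buf[cur];
        cb.aio_nbytes = CHUNK; cb.aio_offset = offset;
        aio_read(&cb);

        /* ...while processing the one that just arrived. */
        process(buf[ready], got);
    }

    close(fd);
    return 0;
}
```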


International Parallel and Distributed Processing Symposium | 2004

System-level fault-tolerance in large-scale parallel machines with buffered coscheduling

Fabrizio Petrini; Kei Davis; José Carlos Sancho

As the number of processors for multi-teraflop systems grows to tens of thousands, with proposed petaflop systems likely to contain hundreds of thousands of processors, the assumption of fully reliable hardware has been abandoned. Although the mean time between failures for the individual components can be very high, the large total component count will inevitably lead to frequent failures. It is therefore of paramount importance to develop new software solutions to deal with the unavoidable reality of hardware faults. We will first describe the nature of the failures of current large-scale machines, and extrapolate these results to future machines. Based on this preliminary analysis we will present a new technology that we are currently developing, buffered coscheduling, which seeks to implement fault tolerance at the operating system level. Major design goals include dynamic reallocation of resources to allow continuing execution in the presence of hardware failures, very high scalability, high efficiency (low overhead), and transparency, requiring no changes to user applications. Preliminary results show that this is attainable with current hardware.


High Performance Interconnects | 2010

Impact of Inter-application Contention in Current and Future HPC Systems

Ana Jokanovic; German Rodriguez; José Carlos Sancho; Jesús Labarta

Fat-tree networks are the most popular topology among indirect networks in today's supercomputers. Current supercomputers are generally operated in a shared environment under the control of a job scheduler, executing many parallel applications simultaneously. The competition between these applications to use the same network resources causes a degradation in the applications' performance. An application that has to wait for network resources occupied by another application's messages is said to be experiencing inter-application contention. The extent of degradation caused by inter-application contention is known to depend on multiple factors: the network topology, the routing scheme, the task placement, etc. Note that these factors also affect intra-application contention. Our work evaluates the impact of inter-application contention for actual competing HPC workloads under different routing schemes in slimmed fat trees. In contrast with previous works, which focus mostly on an individual application's performance, we take a more system-centric view. Our work estimates the amount of system performance loss that inter-application contention contributes in current HPC systems, which we have measured to be around 10%. We also present a projection of the impact of inter-application contention in near- and mid-term future HPC systems, scaling the node computational power and network link speeds to foreseeable values. Our results suggest that the increase in network speed does not need to keep the same fast pace as the increase in computational power, but it still needs to be scaled up. Our projection for future HPC systems shows that inter-application contention can cause a 15% throughput loss even with link speeds of 40 Gb/s for some application mixes. The difference in impact for a chosen application when running with different mixes leads to the performance variability described in previous works, but our work sets a better bound on this variability than studies performed with an injection of network noise.
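A toy model, not the paper's methodology, helps convey how sharing a fat-tree uplink inflates communication time: if k bandwidth-bound flows share one link fairly, each sees 1/k of its bandwidth. The link speed, message size, and communication fraction below are assumptions:

```c
/* A toy contention model, not the paper's methodology: flows that share
 * a link fairly each get 1/k of its bandwidth, stretching only the
 * communication part of the runtime. All parameters are assumptions. */
#include <stdio.h>

int main(void) {
    const double link_gbps     = 40.0;   /* assumed link speed (Gb/s)         */
    const double msg_bits      = 8.0e6;  /* assumed 1 MB message              */
    const double comm_fraction = 0.20;   /* assumed share of runtime in comm. */
    const int    sharing_flows = 2;      /* flows from competing applications */

    double t_alone  = msg_bits / (link_gbps * 1.0e9);
    double t_shared = msg_bits / (link_gbps * 1.0e9 / sharing_flows);

    /* Application-level slowdown if only the communication part stretches. */
    double slowdown = (1.0 - comm_fraction) + comm_fraction * (t_shared / t_alone);

    printf("message time: %.1f us alone, %.1f us shared\n",
           t_alone * 1e6, t_shared * 1e6);
    printf("estimated application slowdown: %.0f%%\n", (slowdown - 1.0) * 100.0);
    return 0;
}
```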

Collaboration


Dive into José Carlos Sancho's collaborations.

Top Co-Authors

Darren J. Kerbyson (Pacific Northwest National Laboratory)
Kei Davis (Los Alamos National Laboratory)
Kevin J. Barker (Los Alamos National Laboratory)
Michael Lang (Los Alamos National Laboratory)
Adolfy Hoisie (Pacific Northwest National Laboratory)
Scott Pakin (Los Alamos National Laboratory)
Jesús Labarta (Barcelona Supercomputing Center)
Mateo Valero (Polytechnic University of Catalonia)