Paul M. Carpenter | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Paul M. Carpenter is active.

Explore More

Publication

Featured researches published by Paul M. Carpenter.

ieee international conference on high performance computing data and analytics | 2013

Supercomputing with commodity CPUs: are mobile SoCs ready for HPC?

Nikola Rajovic; Paul M. Carpenter; Isaac Gelado; Nikola Puzovic; Alex Ramirez; Mateo Valero

In the late 1990s, powerful economic forces led to the adoption of commodity desktop processors in high-performance computing. This transformation has been so effective that the June 2013 TOP500 list is still dominated by x86. In 2013, the largest commodity market in computing is not PCs or servers, but mobile computing, comprising smartphones and tablets, most of which are built with ARM-based SoCs. This leads to the suggestion that once mobile SoCs deliver sufficient performance, mobile SoCs can help reduce the cost of HPC. This paper addresses this question in detail. We analyze the trend in mobile SoC performance, comparing it with the similar trend in the 1990s. We also present our experience evaluating performance and efficiency of mobile SoCs, deploying a cluster and evaluating the network and scalability of production applications. In summary, we give a first answer as to whether mobile SoCs are ready for HPC.

compilers, architecture, and synthesis for embedded systems | 2009

Mapping stream programs onto heterogeneous multiprocessor systems

Paul M. Carpenter; Alex Ramirez; Eduard Ayguadé

This paper presents a partitioning and allocation algorithm for an iterative stream compiler, targeting heterogeneous multiprocessors with constrained distributed memory and any communications topology. We introduce a novel definition of connectedness that enables the algorithm to model the capabilities of the compiler. The algorithm uses convexity and connectedness constraints to produce partitions that are easier to compile and require short pipelines. Software pipelining is an effective transformation, but it increases memory footprint and latency, and has a startup overhead. Our algorithm takes account of these downstream costs. We show results for the StreamIt 2.1.1 benchmarks for an SMP, 2*2 mesh, SMP plus accelerator, and IBM QS20 blade, which has two Cell processors. Our results show that the average performance is within 5% of the unrestricted optimum found using a brute force search, while seldom requiring software pipelining. The heuristic is robust, and fast enough to be inside the feedback loop of an iterative compiler.

International Journal of Parallel Programming | 2011

ACOTES project: Advanced compiler technologies for embedded streaming

Eduard Ayguadé; Cédric Bastoul; Paul M. Carpenter; Zbigniew Chamski; Albert Cohen; Marco Cornero; Philippe Dumont; Marc Duranton; Mohammed Fellahi; Roger Ferrer; Razya Ladelsky; Menno Lindwer; Xavier Martorell; Cupertino Miranda; Dorit Nuzman; Andrea Ornstein; Antoniu Pop; Sebastian Pop; Louis-Noël Pouchet; Alex Ramirez; David Ródenas; Erven Rohou; Ira Rosen; Uzi Shvadron; Konrad Trifunovic; Ayal Zaks

Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully partition the computation and map it to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer’s productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on the quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of Advanced Compiler Technologies that we developed to support Embedded Streaming.

international conference on embedded computer systems architectures modeling and simulation | 2007

A streaming machine description and programming model

Paul M. Carpenter; David Ródenas; Xavier Martorell; Alex Ramirez; Eduard Ayguadé

In this paper we present the initial development of a streaming environment based on a programming model and machine description. The stream programming model consists of an extension to the C language and its translation towards a streaming machine. The extensions will be a set of OpenMP-like directives. We show how a serial application can be converted into a streaming parallel application using the proposed annotations. We also show how the machine description can be used to parametrize a cost-model simulator to predict the performance of the stream program. The cost model allows the compiler to determine the best task partitioning and scheduling for each architecture.

brazilian conference on intelligent systems | 2014

Software-Managed Power Reduction in Infiniband Links

Branimir Dickov; Miquel Pericàs; Paul M. Carpenter; Nacho Navarro; Eduard Ayguadé

The backbone of a large-scale supercomputer is the interconnection network. As compute nodes become more energy-efficient, the interconnect is accounting for an increasing proportion of the total system energy consumption. The interconnects energy consumption is, however, only starting to receive serious attention. Some hardware-based schemes have been proposed that exploit idle periods or low utilisation, either by turning off the links or by lowering the frequency and voltage. Although these schemes are effective in certain cases, they do not have enough global information about the applications communication behaviour to efficiently manage the network power consumption. This paper proposes an alternative approach: moving the intelligence into the PMPI layer of the MPI library, and using prediction to discover repetitive patterns in the applications communication behaviour. The core of the prediction algorithm is an n-gram extraction technique, which can accurately predict not only when a link will become unused but also when it will become active again, allowing lanes to be switched off during the idle periods and switched back on again in time to avoid incurring a significant performance degradation. Many HPC applications benefit from prediction, since they have repetitive computation and communication phases. By implementing the energy-saving mechanism inside the MPI library, existing MPI programs do not need to be modified. Using an event-driven simulator, driven by representative HPC workloads, we demonstrate average energy savings in Infiniband switches up to around 33%, while the average execution time increase is only up to 1%.

european conference on parallel processing | 2010

Starsscheck: a tool to find errors in task-based parallel programs

Paul M. Carpenter; Alex Ramirez; Eduard Ayguadé

Star Superscalar is a task-based programming model. The programmer starts with an ordinary C program, and adds pragmas to mark functions as tasks, identifying their inputs and outputs. When the main thread reaches a task, an instance of the task is added to a run-time dependency graph, and later scheduled to run on a processor. Variants of Star Superscalar exist for the Cell Broadband Engine and SMPs. Star Superscalar relies on the annotations provided by the programmer. If these are incorrect, the program may exhibit race conditions or exceptions deep inside the run-time system. This paper introduces Starsscheck, a tool based on Valgrind, which helps debug Star Superscalar programs. Starsscheck verifies that the pragma annotations are correct, producing a warning if a task or the main thread performs an invalid access. The tool can be adapted to support similar programming models such as TPC. For most benchmarks, Starsscheck is faster than memcheck, the default Valgrind tool.

high performance embedded architectures and compilers | 2010

Buffer sizing for self-timed stream programs on heterogeneous distributed memory multiprocessors

Paul M. Carpenter; Alex Ramirez; Eduard Ayguadé

Stream programming is a promising way to expose concurrency to the compiler. A stream program is built from kernels that communicate only via point-to-point streams. The stream compiler statically allocates these kernels to processors, applying blocking, fission and fusion transformations. The compiler determines the sizes of the communication buffers, which affects performance since local memories can be small. In this paper, we propose a feedback-directed algorithm that determines the size of each communication buffer, based on i) the stream program that has been mapped onto processors, ii) feedback from an earlier execution, and iii) the memory constraints. The algorithm exposes a trade-off between throughput and latency. It is general, in that it applies to stream programs with unstructured stream graphs, and it supports variable execution times and communication rates. We show results for the StreamIt benchmarks and random graphs. For the StreamIt benchmarks, throughput is optimal after the first iteration. For random graphs with stochastic computation times, throughput is within 3% of optimal after four iterations. Compared with the previous general algorithm, by Basten and Hoogerbrugge, our algorithm has significantly better performance and latency.

design, automation, and test in europe | 2016

EUROSERVER: Share-anything scale-out micro-server design

Manolis Marazakis; John Goodacre; Didier Fuin; Paul M. Carpenter; John Thomson; Emil Matus; Antimo Bruno; Per Stenström; Jérôme Martin; Yves Durand; Isabelle Dor

This paper provides a snapshot summary of the trends in the area of micro-server development and their application in the broader enterprise and cloud markets. Focusing on the technology aspects, we provide an understanding of these trends and specifically the differentiation and uniqueness of the approach being adopted by the EUROSERVER FP7 project. The unique technical contributions of EUROSERVER range from the fundamental system compute unit design architecture, through to the implementation approach both at the chiplet nanotechnological integration, and the everything-close physical form factor. Furthermore, we offer optimizations at the virtualisation layer to exploit the unique hardware features, and other framework optimizations, including exploiting the hardware capabilities at the run-time system and application layers.

local computer networks | 2015

Exploring interconnect energy savings under east-west traffic pattern of mapreduce clusters

Renan Fischer e Silva; Paul M. Carpenter

An important challenge of modern data centers is to reduce energy consumption, of which a substantial proportion is due to the network. Energy Efficient Ethernet (EEE) is a recent standard that aims to reduce network power consumption, but current practice is to disable it in production use, since it has a poorly understood impact on real world application performance. An important application framework commonly used in modern data centers is Apache Hadoop, which implements the MapReduce programming model. This paper is the first to analyse the impact of EEE on MapReduce workloads, in terms of performance overheads and energy savings. We find that optimum energy savings are possible if the links use packet coalescing. Packet coalescing must, however, be carefully configured in order to avoid excessive performance degradation.

local computer networks | 2016

Controlling Network Latency in Mixed Hadoop Clusters: Do We Need Active Queue Management?

Renan Fischer e Silva; Paul M. Carpenter

With the advent of big data, data center applications are processing vast amounts of unstructured and semi-structured data, in parallel on large clusters, across hundreds to thousands of nodes. The highest performance for these batch big data workloads is achieved using expensive network equipment with large buffers, which accommodate bursts in network traffic and allocate bandwidth fairly even when the network is congested. Throughput-sensitive big data applications are, however, often executed in the same data center as latency-sensitive workloads. For both workloads to be supported well, the network must provide both maximum throughput and low latency. Progress has been made in this direction, as modern network switches support Active Queue Management (AQM) and Explicit Congestion Notifications (ECN), both mechanisms to control the level of queue occupancy, reducing the total network latency. This paper is the first study of the effect of Active Queue Management on both throughput and latency, in the context of Hadoop and the MapReduce programming model. We give a quantitative comparison of four different approaches for controlling buffer occupancy and latency: RED and CoDel, both standalone and also combined with ECN and DCTCP network protocol, and identify the AQM configurations that maintain Hadoop execution time gains from larger buffers within 5%, while reducing network packet latency caused by bufferbloat by up to 85%. Finally, we provide recommendations to administrators of Hadoop clusters as to how to improve latency without degrading the throughput of batch big data workloads.

Explore More