Carsten Clauss
RWTH Aachen University
Publications
Featured research published by Carsten Clauss.
International Conference on High Performance Computing and Simulation | 2011
Carsten Clauss; Stefan Lankes; Pablo Reble; Thomas Bemmerl
Since the beginning of the multicore era, parallel processing has become prevalent across the board. On a traditional multicore system, a single operating system manages all cores and schedules threads and processes among them, inherently supported by hardware-implemented cache coherence protocols. However, a further growth of the number of cores per system implies an increasing chip complexity, especially with respect to the cache coherence protocols. Therefore, a very attractive alternative for future many-core systems is to waive the hardware-based cache coherency and to introduce a software-oriented, message-passing based architecture instead: a so-called Cluster-on-Chip architecture. Intel's Single-chip Cloud Computer (SCC), a many-core research processor with 48 non-coherent memory-coupled cores, is a very recent example of such a Cluster-on-Chip architecture. The SCC can be configured to run one operating system instance per core by partitioning the shared main memory in a strict manner. However, it is also possible to access the shared main memory in an unsplit and concurrent manner, provided that the cache coherency is then ensured by software. In this paper, we detail our first experiences gained while developing low-level software for message-passing and shared-memory programming on the SCC. In doing so, we evaluate the potential of both programming models and we show how these models can be improved, especially with respect to the SCC's many-core architecture.
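To make the message-passing side of this concrete, the following minimal sketch exchanges one message between two SCC cores using the SCC-native RCCE library, the low-level layer this line of work builds on. The RCCE_APP entry point and the RCCE_init/RCCE_ue/RCCE_send/RCCE_recv signatures follow the publicly released RCCE library as we recall it and are assumptions, not details taken from the paper itself.

/* Minimal RCCE-style message exchange between SCC cores 0 and 1.
 * Signatures assumed from the public RCCE releases. */
#include <stdio.h>
#include <string.h>
#include "RCCE.h"

int RCCE_APP(int argc, char **argv)
{
    RCCE_init(&argc, &argv);

    int me   = RCCE_ue();        /* ID of this core ("unit of execution") */
    int npes = RCCE_num_ues();   /* number of participating cores */

    char buf[32];

    if (me == 0) {
        strcpy(buf, "hello from core 0");
        /* blocking send through the on-die message-passing buffer (MPB) */
        RCCE_send(buf, sizeof(buf), 1);
    } else if (me == 1) {
        RCCE_recv(buf, sizeof(buf), 0);
        printf("core %d of %d received: %s\n", me, npes, buf);
    }

    RCCE_finalize();
    return 0;
}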
EuroMPI'11: Proceedings of the 18th European MPI Users' Group Conference on Recent Advances in the Message Passing Interface | 2011
Carsten Clauss; Stefan Lankes; Thomas Bemmerl
The Single-Chip Cloud Computer (SCC) experimental processor is a 48-core concept vehicle created by Intel Labs as a platform for many-core software research. Intel provides a customized programming library for the SCC, called RCCE, that allows for fast message-passing between the cores. For that purpose, RCCE offers an application programming interface (API) whose semantics are derived from the well-established MPI standard. However, while the MPI standard offers a very broad range of functions, the RCCE API is deliberately kept small and far from implementing all the features of the MPI standard. For this reason, we have implemented an SCC-customized MPI library, called SCC-MPICH, which in turn is based upon an extension to the SCC-native RCCE communication library. In this contribution, we present SCC-MPICH and we show how performance analysis as well as performance tuning for this library can be conducted by means of a prototype of the proposed MPI-3.0 tool information interface.
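Because the abstract points to the proposed MPI-3.0 tool information interface (MPI_T) as the vehicle for performance analysis and tuning, the following generic sketch shows how that interface can be used to enumerate the performance variables an MPI library exposes. It uses only standard MPI_T calls from the MPI-3.0 specification and is not specific to SCC-MPICH.

/* List the performance variables (pvars) an MPI library exposes through the
 * MPI-3.0 tool information interface.  Generic MPI_T code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, num_pvar;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_pvar_get_num(&num_pvar);

    for (int i = 0; i < num_pvar; i++) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, var_class, bind, readonly, continuous, atomic;
        MPI_Datatype datatype;
        MPI_T_enum enumtype;

        MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                            &datatype, &enumtype, desc, &desc_len,
                            &bind, &readonly, &continuous, &atomic);
        printf("pvar %3d: %s (class %d) - %s\n", i, name, var_class, desc);
    }

    MPI_T_finalize();
    return 0;
}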
Lecture Notes in Computer Science | 2006
Boris Bierbaum; Carsten Clauss; Thomas Eickermann; Lidia Kirtchakova; Arnold Krechel; Stephan Springstubbe; Oliver Wäldrich; Wolfgang Ziegler
Large MPI applications whose resource demands exceed the local site's cluster capacity could instead be distributed across a number of clusters in a Grid to satisfy the demand. However, there are a number of drawbacks limiting the applicability of this approach: communication paths between compute nodes of different clusters usually provide lower bandwidth and higher latency than the cluster-internal ones, MPI libraries use dedicated I/O nodes for inter-cluster communication which become a bottleneck, and the lack of tools for coordinating the availability of the different clusters across different administrative domains is another issue. To make the Grid approach efficient, several prerequisites must be in place: an implementation of MPI providing high-performance communication mechanisms across the borders of clusters, a network connection with high bandwidth and low latency dedicated to the application, compute nodes made available to the application exclusively, and finally a Grid middleware gluing everything together. In this paper we present work recently completed in the VIOLA project: MetaMPICH, user-controlled QoS of clusters and the interconnecting network, a MetaScheduling Service, and the UNICORE integration.
Programming Models and Applications for Multicores and Manycores | 2012
Stefan Lankes; Pablo Reble; Oliver Sinnen; Carsten Clauss
The growing number of cores per chip implies an increasing chip complexity, especially with respect to hardware-implemented cache coherence protocols. An attractive alternative for future many-core systems is to waive the hardware-based cache coherency and to introduce a software-oriented approach instead: a so-called Cluster-on-Chip architecture. The Single-chip Cloud Computer (SCC) is a recent research processor exhibiting such an architecture. This paper presents an approach to deal with the missing cache coherence protocol by using a software-managed cache coherence system, which is based on the well-established concept of a shared virtual memory (SVM) management system. The SCC's unique features, such as a new memory type that is integrated directly on the processor die, open up new and capable options for realizing an SVM system. The convincing performance results presented in this paper show that nearly forgotten concepts will become attractive again for future many-core systems.
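As a generic illustration of the SVM principle described here (not of MetalSVM or the paper's actual system), the sketch below shows the classic page-fault-driven mechanism on which many SVM systems are built: pages of a shared region start out inaccessible, and the first access traps into software that would establish consistency before granting access. The POSIX mmap/mprotect/SIGSEGV combination is an assumption chosen for portability; on the SCC the analogous steps would involve the on-die memory and explicit cache handling.

/* Page-fault-driven software coherence in the spirit of an SVM system:
 * pages start out inaccessible, and the first access faults into a handler
 * that would fetch/validate the page before mapping it accessible.
 * This is NOT MetalSVM; the "fetch" step is only a placeholder comment. */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char  *region;
static size_t page_size;

static void fault_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    char *page = (char *)((uintptr_t)info->si_addr & ~(page_size - 1));

    /* An SVM system would fetch the current page contents from the owning
     * core and/or invalidate stale cache lines here, before granting access. */
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

int main(void)
{
    page_size = (size_t)sysconf(_SC_PAGESIZE);

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = fault_handler;
    sa.sa_flags     = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* "Shared" region: mapped, but initially inaccessible. */
    region = mmap(NULL, 4 * page_size, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    region[0] = 42;                 /* faults, handler grants access */
    printf("first access handled, value = %d\n", region[0]);
    return 0;
}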
Lecture Notes in Computer Science | 2006
Boris Bierbaum; Carsten Clauss; Martin Pöppe; Stefan Lankes; Thomas Bemmerl
MetaMPICH is an MPI implementation which allows the coupling of different computing resources to form a heterogeneous computing system called a meta computer. Such a coupled system may consist of multiple compute clusters, MPPs, and SMP servers, using different network technologies like Ethernet, SCI, and Myrinet. There are several other MPI libraries with similar goals available. We present the three most important of them and contrast their features and abilities to one another and to MetaMPICH. We especially highlight the recent advances made to MetaMPICH, namely the development of the new multidevice architecture for building a meta computer.
International Conference on High Performance Computing and Simulation | 2016
Simon Pickartz; Stefan Lankes; Antonello Monti; Carsten Clauss; Jens Breitbart
Application migration is valuable for modern computing centers. Apart from facilitating the maintenance process, it enables dynamic load balancing to improve the system's efficiency. Although the concept is already widespread in cloud computing environments, it has not found wide adoption in HPC yet. As major challenges of future exascale systems are resiliency, concurrency, and locality, we expect the migration of applications to be one means to cope with these challenges. In this paper we investigate its viability for HPC by deriving the respective requirements for this specific field of application. In doing so, we sketch example scenarios demonstrating its potential benefits. Furthermore, we discuss challenges that result from the migration of OS-bypass networks and present a prototype migration mechanism enabling the seamless migration of MPI processes in HPC systems.
International Parallel and Distributed Processing Symposium | 2016
Simon Pickartz; Carsten Clauss; Stefan Lankes; Stephan Krempel; Thomas Moschny; Antonello Monti
Load balancing, maintenance, and energy efficiency are key challenges for upcoming supercomputers. An indispensable tool for the accomplishment of these tasks is the ability to migrate applications during runtime. Especially in HPC, where any performance hit is frowned upon, such migration mechanisms have to come with minimal overhead. This constraint is usually not met by current practice, which adds further abstraction layers to the software stack. In this paper, we propose a concept for the migration of MPI processes communicating over OS-bypass networks such as InfiniBand. While being transparent to the application, our solution minimizes the runtime overhead by introducing a protocol for the shutdown of individual connections prior to the migration. It is implemented on the basis of an MPI library and evaluated using virtual machines based on KVM. Our evaluation reveals that the runtime overhead is negligibly small. The migration time itself is mainly determined by the particular migration mechanism, whereas the additional execution time of the presented protocol converges to 2 ms per connection when more than a few dozen connections are shut down at a time.
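To illustrate the general shape of such a pre-migration shutdown sequence (a hypothetical sketch, not the protocol implemented in the paper), the stub program below models the steps an MPI library would run internally: suspend new sends, drain outstanding operations, exchange a FIN/ACK-style handshake with the peer, tear down the endpoint, migrate, and re-establish the connections afterwards. All helper functions are illustrative stand-ins, implemented here only as prints.

/* Hypothetical sketch of a pre-migration connection-shutdown handshake for
 * OS-bypass networks.  The helpers are stand-ins for what an MPI library
 * would do internally with, e.g., InfiniBand queue pairs. */
#include <stdio.h>

#define NUM_CONNECTIONS 4

/* --- hypothetical stand-ins for MPI-library-internal operations --------- */
static void stop_posting_sends(int conn)   { printf("conn %d: sends suspended\n", conn); }
static void drain_outstanding(int conn)    { printf("conn %d: completion queue drained\n", conn); }
static void exchange_fin_ack(int conn)     { printf("conn %d: FIN/ACK exchanged with peer\n", conn); }
static void destroy_endpoint(int conn)     { printf("conn %d: queue pair destroyed\n", conn); }
static void reestablish_endpoint(int conn) { printf("conn %d: queue pair re-created\n", conn); }

int main(void)
{
    /* Phase 1: shut down every connection before the process image moves,
     * so no hardware state is left behind on the source node. */
    for (int c = 0; c < NUM_CONNECTIONS; c++) {
        stop_posting_sends(c);
        drain_outstanding(c);
        exchange_fin_ack(c);
        destroy_endpoint(c);
    }

    printf("-- process image migrated (e.g. by KVM live migration) --\n");

    /* Phase 2: rebuild the connections on the destination node. */
    for (int c = 0; c < NUM_CONNECTIONS; c++)
        reestablish_endpoint(c);

    return 0;
}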
International Symposium on Parallel and Distributed Computing | 2012
Carsten Clauss; Simon Pickartz; Stefan Lankes; Thomas Bemmerl
In this paper, we present a prototype implementation of the Multicore Communications API (MCAPI) for the Intel Single-Chip Cloud Computer (SCC). The SCC is a 48-core concept vehicle for future many-core systems that exhibit message-passing-oriented architectures. The MCAPI specification, recently developed by the Multicore Association, provides a lightweight interface for message-passing in today's multicore systems. The presented prototype implementation is intended to evaluate MCAPI's capability and its feasibility for employment in future many-core systems as well.
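For orientation, the sketch below shows what a minimal MCAPI-style message exchange looks like from the application's point of view. The function names and signatures follow the MCAPI 1.x connectionless-message API as we recall it (mcapi_initialize, mcapi_create_endpoint, mcapi_get_endpoint, mcapi_msg_send); they vary between specification versions and implementations, so treat them as assumptions rather than as the interface of the presented prototype.

/* Minimal MCAPI-style message send, assuming MCAPI 1.x-style signatures.
 * Node and port numbers are hypothetical. */
#include <stdio.h>
#include <mcapi.h>

#define THIS_NODE  0    /* hypothetical node id of this core           */
#define PEER_NODE  1    /* hypothetical node id of the receiving core  */
#define PORT       42   /* hypothetical port number used on both nodes */

int main(void)
{
    mcapi_status_t  status;
    mcapi_version_t version;

    mcapi_initialize(THIS_NODE, &version, &status);

    /* Local endpoint for sending; remote endpoint looked up by (node, port). */
    mcapi_endpoint_t local  = mcapi_create_endpoint(PORT, &status);
    mcapi_endpoint_t remote = mcapi_get_endpoint(PEER_NODE, PORT, &status);

    char msg[] = "hello via MCAPI";
    mcapi_msg_send(local, remote, msg, sizeof(msg), 0 /* priority */, &status);

    if (status != MCAPI_SUCCESS)
        fprintf(stderr, "send failed\n");

    mcapi_finalize(&status);
    return 0;
}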
Concurrency and Computation: Practice and Experience | 2015
Carsten Clauss; Stefan Lankes; Pablo Reble; Thomas Bemmerl
Since the beginning of the multicore era, parallel processing has become prevalent across the board. On a traditional multicore system, a single operating system manages all cores and schedules threads and processes among them, inherently supported by hardware-implemented cache coherence protocols. However, a further growth of the number of cores per system implies an increasing chip complexity, especially with respect to the cache coherence protocols. Therefore, a very attractive alternative for future many-core systems is to waive the hardware-based cache coherency and to introduce a software-oriented, message-passing based architecture instead: a so-called Cluster-on-Chip architecture. Intel's Single-chip Cloud Computer (SCC), a many-core research processor with 48 non-coherent memory-coupled cores, is a very recent example of such a Cluster-on-Chip architecture. The SCC can be configured to run one operating system instance per core by partitioning the shared main memory in a strict manner. However, it is also possible to access the shared main memory in an unsplit and concurrent manner, provided that either the caches are disabled or the cache coherency is then ensured by software. In this article, we detail our experiences gained while developing low-level software for message-passing and shared-memory programming on the SCC. We present an SCC-customized MPI library (called SCC-MPICH) as well as a shared virtual memory system (called MetalSVM) for the SCC. In doing so, we evaluate the potential of both programming models and we show how these models can be improved, especially with respect to the SCC's many-core architecture.
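Since SCC-MPICH exposes the standard MPI interface, an ordinary MPI ping-pong can serve both as a usage example and as a rough latency benchmark between two SCC cores. The code below is plain MPI and makes no SCC-specific assumptions.

/* Plain MPI ping-pong between ranks 0 and 1; usable with any MPI library,
 * including an SCC-customized one such as SCC-MPICH. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define ITERATIONS 1000
#define MSG_SIZE   1024

int main(int argc, char **argv)
{
    int rank;
    char buf[MSG_SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, 0, MSG_SIZE);

    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERATIONS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("round-trip latency: %.2f us\n", (t1 - t0) / ITERATIONS * 1e6);

    MPI_Finalize();
    return 0;
}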
Parallel Computing in Electrical Engineering | 2004
Carsten Clauss; Martin Pöppe; Thomas Bemmerl
Cluster systems built mainly from commodity hardware components have become more and more usable for high performance computing tasks in the past few years. To increase the parallelism available to applications, it is often desirable to combine such clusters at a higher level into what is commonly called a metacomputer. This class of high performance computing platforms can be understood as a cluster of clusters, where each cluster provides different processors, memory performance, cluster interconnects and external networking facilities. Our project, called MetaMPICH, provides a transparent MPI-1 implementation for such inhomogeneous cluster systems. While this transparency makes porting existing MPI applications to metacomputers quite simple, the slowest network connection and the slowest processor will dominate performance and scalability, because the heterogeneity remains hidden from the application. This paper describes our approaches to adapting iterative, grid-based simulation algorithms to the structures of such heterogeneously coupled clusters.
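One generic way to adapt an iterative, grid-based solver to such heterogeneity is a weighted domain decomposition that assigns larger partitions to faster resources. The sketch below computes such a split; the weights and the purely proportional strategy are illustrative assumptions, not the exact scheme of the paper.

/* Weighted 1-D domain decomposition: distribute grid rows in proportion to
 * per-node performance weights so that faster nodes get larger partitions.
 * The weights are illustrative; a real setup would derive them from the
 * measured per-cluster compute and network performance. */
#include <stdio.h>

int main(void)
{
    const int    total_rows = 1000;
    const double weight[]   = { 1.0, 1.0, 0.5, 0.5 };  /* fast, fast, slow, slow */
    const int    nodes      = sizeof(weight) / sizeof(weight[0]);

    double wsum = 0.0;
    for (int i = 0; i < nodes; i++)
        wsum += weight[i];

    int assigned = 0;
    for (int i = 0; i < nodes; i++) {
        /* Last node takes the remainder so all rows are covered exactly. */
        int rows = (i == nodes - 1)
                 ? total_rows - assigned
                 : (int)(total_rows * weight[i] / wsum);
        printf("node %d: rows %d..%d (%d rows)\n",
               i, assigned, assigned + rows - 1, rows);
        assigned += rows;
    }
    return 0;
}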