
Publications

Featured research published by Matthias Diener.


IEEE International Conference on Cloud Computing Technology and Science | 2012

High Performance Computing in the cloud: Deployment, performance and cost efficiency

Eduardo Roloff; Matthias Diener; Alexandre Carissimi; Philippe Olivier Alexandre Navaux

High-Performance Computing (HPC) in the cloud has reached the mainstream and is currently a hot topic in the research community and the industry. The attractiveness of the cloud for HPC is the capability to run large applications on powerful, scalable hardware without needing to actually own or maintain this hardware. In this paper, we conduct a detailed comparison of HPC applications running on three cloud providers: Amazon EC2, Microsoft Azure, and Rackspace. We analyze three important characteristics of HPC: deployment facilities, performance, and cost efficiency, and compare them to a cluster of machines. For the experiments, we used the well-known NAS Parallel Benchmarks as an example of general scientific HPC applications to examine the computational and communication performance. Our results show that HPC applications can run efficiently on the cloud. However, care must be taken when choosing the provider, as the differences between them are large. The best cloud provider depends on the type and behavior of the application, as well as the intended usage scenario. Furthermore, our results show that HPC in the cloud can have a higher performance and cost efficiency than a traditional cluster, by up to 27% and 41%, respectively.


High Performance Computing and Communications | 2010

Evaluating Thread Placement Based on Memory Access Patterns for Multi-core Processors

Matthias Diener; Felipe Lopes Madruga; Eduardo Rocha Rodrigues; Marco Antonio Zanata Alves; Jörg Schneider; Philippe Olivier Alexandre Navaux; Hans-Ulrich Heiss

Process placement is a technique widely used on parallel machines with heterogeneous interconnects to reduce the overall communication time. For instance, two processes that communicate frequently are mapped close to each other. Finding the optimal mapping between threads and cores in a shared-memory environment (for example, OpenMP and Pthreads) is an even more complex task due to implicit communication. In this work, we examine data sharing patterns between threads in different workloads and use those patterns in the same way that messages are used to map processes on cluster computers. We evaluated our technique on a state-of-the-art multicore processor and achieved moderate improvements in the common case and considerable improvements in some cases, reducing execution time by up to 45%.
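The central idea, treating data sharing between threads like message traffic between processes, can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the access samples and thread count are hypothetical placeholders, and the paper derives sharing patterns from real memory traces.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical access samples: (thread_id, memory_block) pairs.
# The paper obtains these from actual memory access traces.
accesses = [(0, 0xA), (1, 0xA), (0, 0xB), (2, 0xB), (3, 0xC), (1, 0xA)]

# Record which threads touched each memory block.
block_sharers = defaultdict(set)
for tid, block in accesses:
    block_sharers[block].add(tid)

# Sharing matrix: number of blocks each pair of threads has in common.
n_threads = 4
share = [[0] * n_threads for _ in range(n_threads)]
for sharers in block_sharers.values():
    for a, b in combinations(sorted(sharers), 2):
        share[a][b] += 1
        share[b][a] += 1

# Pairs with the highest sharing are candidates for neighboring cores,
# just as heavily communicating processes are placed on nearby nodes.
pairs = sorted(((share[a][b], a, b) for a in range(n_threads)
                for b in range(a + 1, n_threads)), reverse=True)
print(pairs)
```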


International Conference on Parallel Architectures and Compilation Techniques | 2014

kMAF: Automatic Kernel-Level Management of Thread and Data Affinity

Matthias Diener; Eduardo Henrique Molina da Cruz; Philippe Olivier Alexandre Navaux; Anselm Busse; Hans-Ulrich Heiß

One of the main challenges for parallel architectures is the increasing complexity of the memory hierarchy, which consists of several levels of private and shared caches, as well as interconnections between separate memories in NUMA machines. To make full use of this hierarchy, it is necessary to improve the locality of memory accesses by reducing accesses to remote caches and memories and using local ones instead. Two techniques can be used to increase memory access locality: executing threads and processes that access shared data close to each other in the memory hierarchy (thread affinity), and placing the memory pages they access on the NUMA node they are executing on (data affinity). Most related work in this area focuses on either thread or data affinity, but not both, which limits the achievable improvements. Other mechanisms depend on expensive operations such as memory access traces or binary analysis, require changes to the hardware, or work only with specific parallel APIs. In this paper, we introduce kMAF, a mechanism that automatically manages thread and data affinity at the kernel level. The memory access behavior of the running application is determined during its execution by analyzing its page faults. This information is used by kMAF to migrate threads and memory pages such that the overall memory access locality is optimized. An extensive evaluation with 27 benchmarks from 4 benchmark suites shows substantial performance improvements, with results close to an oracle mechanism. Execution time was reduced by up to 35.7% (13.8% on average), while energy efficiency was improved by up to 34.6% (9.3% on average).
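kMAF's core loop, observing page faults and migrating memory toward the NUMA node that accesses it most, can be illustrated in user space. A minimal sketch, assuming a hypothetical fault log and thread placement; the real mechanism runs inside the kernel's page-fault handler and also migrates threads:

```python
from collections import defaultdict

# Hypothetical sampled page faults: (thread_id, page_number) pairs.
# kMAF gathers this information in the kernel's page-fault handler.
faults = [(0, 100), (0, 100), (1, 100), (2, 200), (3, 200), (2, 200)]

# Assumed thread placement: two NUMA nodes with two threads each.
node_of_thread = {0: 0, 1: 0, 2: 1, 3: 1}

# Count how often each NUMA node faults on each page.
page_node_hits = defaultdict(lambda: defaultdict(int))
for tid, page in faults:
    page_node_hits[page][node_of_thread[tid]] += 1

# Data affinity: migrate each page to the node that accesses it most.
for page, hits in sorted(page_node_hits.items()):
    target = max(hits, key=hits.get)
    print(f"page {page} -> NUMA node {target} (hits: {dict(hits)})")
```

Thread affinity works analogously: faults on shared pages reveal which threads communicate, and those threads are migrated toward cores with shared caches.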


International Parallel and Distributed Processing Symposium | 2012

Using the Translation Lookaside Buffer to Map Threads in Parallel Applications Based on Shared Memory

Eduardo Henrique Molina da Cruz; Matthias Diener; Philippe Olivier Alexandre Navaux

The communication latency between the cores in multiprocessor architectures differs depending on the memory hierarchy and the interconnections. With the increase in the number of cores per chip and the number of threads per core, this difference between the communication latencies is increasing. Therefore, it is important to map the threads of parallel applications taking into account the communication between them. In parallel applications based on the shared memory paradigm, the communication is implicit and occurs through accesses to shared variables. For this reason, it is difficult to detect the communication pattern between the threads. Traditional approaches use simulation to monitor the memory accesses performed by the application, requiring modifications to the source code and drastically increasing the overhead. In this paper, we introduce a new lightweight mechanism to detect the communication pattern of threads using the Translation Lookaside Buffer (TLB). Our mechanism relies entirely on hardware features, which makes the thread mapping transparent to the programmer and allows it to be performed dynamically by the operating system. Moreover, no time-consuming tasks, such as simulation, are required. We evaluated our mechanism with the NAS Parallel Benchmarks (NPB) and achieved an accurate representation of the communication patterns. Using the detected communication patterns, we generated thread mappings using a heuristic method based on the Edmonds graph matching algorithm. Running the applications with these mappings resulted in performance improvements of up to 15.3%, reducing the number of cache misses by up to 31.1%.
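The detection relies on the observation that threads which communicate have overlapping entries in their TLBs. A minimal sketch with hypothetical TLB snapshots; the paper reads real TLB state in hardware, and its mapping heuristic builds on Edmonds' graph matching algorithm, for which the greedy pairing below is only a simple stand-in:

```python
# Hypothetical TLB snapshots: set of virtual page numbers per thread.
tlb = {
    0: {10, 11, 12},
    1: {11, 12, 13},
    2: {20, 21},
    3: {21, 22},
}

# Communication estimate: size of the TLB intersection per thread pair.
threads = sorted(tlb)
comm = {(a, b): len(tlb[a] & tlb[b])
        for i, a in enumerate(threads) for b in threads[i + 1:]}

# Greedy stand-in for the matching step: repeatedly pair the two
# most-communicating threads that are still unmatched.
free = set(threads)
pairs = []
for (a, b), weight in sorted(comm.items(), key=lambda kv: -kv[1]):
    if a in free and b in free and weight > 0:
        pairs.append((a, b))
        free -= {a, b}
print(pairs)  # place each pair on cores that share a cache
```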


International Parallel and Distributed Processing Symposium | 2013

Communication-Based Mapping Using Shared Pages

Matthias Diener; Eduardo Henrique Molina da Cruz; Philippe Olivier Alexandre Navaux

In current shared memory architectures, the complexity of the cache and memory hierarchies is increasing. Therefore, it is becoming more important to analyze the communication behavior of parallel applications when mapping threads to cores, to improve performance and energy efficiency. However, communication is implicit in most programming models for shared memory, which makes it difficult to detect the communication pattern between the threads in an accurate and low-overhead way. We propose a new mechanism to detect the communication pattern of shared memory applications by monitoring page table accesses. Combining this mechanism with a dynamic migration algorithm allows the mapping to be performed dynamically by the operating system. We implemented our mechanism in the Linux kernel and performed experiments with applications from the NAS Parallel Benchmarks. Results show reductions of up to 16.7% in execution time and 63% in cache misses, compared to the original scheduler of the operating system. Furthermore, we decrease total processor and DRAM energy consumption by up to 14.7% and 28.5%, respectively.
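Once a communication pattern has been detected, the migration algorithm has to judge whether moving threads pays off. A hedged sketch of such a decision, using an illustrative cost model (the communication values, core distances, and threshold are assumptions, not the paper's kernel implementation):

```python
# Hypothetical communication matrix between 4 threads (symmetric).
comm = [
    [0, 8, 1, 0],
    [8, 0, 0, 1],
    [1, 0, 0, 9],
    [0, 1, 9, 0],
]

# Illustrative core distances: cores 0/1 and 2/3 share a cache.
dist = [
    [0, 1, 2, 2],
    [1, 0, 2, 2],
    [2, 2, 0, 1],
    [2, 2, 1, 0],
]

def mapping_cost(mapping):
    """mapping[thread] = core; lower cost means better locality."""
    n = len(mapping)
    return sum(comm[a][b] * dist[mapping[a]][mapping[b]]
               for a in range(n) for b in range(a + 1, n))

current = [0, 2, 1, 3]   # communicating pairs split across caches
proposed = [0, 1, 2, 3]  # communicating pairs share a cache
print(mapping_cost(current), mapping_cost(proposed))  # 36 vs. 21
# Migrate only if the cost reduction outweighs the migration overhead.
```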


International Conference on Cloud Computing | 2012

Evaluating High Performance Computing on the Windows Azure Platform

Eduardo Roloff; Francis Birck; Matthias Diener; Alexandre Carissimi; Philippe Olivier Alexandre Navaux

Using the Cloud Computing paradigm for High-Performance Computing (HPC) is currently a hot topic in the research community and the industry. The attractiveness of Cloud Computing for HPC is the capability to run large applications on powerful, scalable hardware without needing to actually own or maintain this hardware. Most current research focuses on running HPC applications on the Amazon Cloud Computing platform, which is relatively easy because it supports environments that are similar to existing HPC solutions, such as clusters and supercomputers. In this paper, we evaluate the possibility of using Microsoft Windows Azure as a platform for HPC applications. Since most HPC applications are based on the Unix programming model, their source code has to be ported to the Windows programming model in addition to porting it to the Azure platform. We outline the challenges we encountered while porting applications and how we resolved them. Furthermore, we introduce a metric to measure the efficiency of Cloud Computing platforms in terms of performance and price. We compared the performance and efficiency of running these applications on a real machine, an Amazon EC2 instance, and a Windows Azure instance. Results show that the performance of Azure is close to the performance of running on real machines, and that it is a viable alternative for running HPC applications when compared to other Cloud Computing solutions.
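The paper's exact metric is defined in the text; a plausible form, shown here purely as an illustration, relates the achieved performance to the price of the run. All platform names, times, and prices below are hypothetical placeholders:

```python
# Illustrative price/performance efficiency: performance relative to a
# reference machine, divided by what the run costs. Not the paper's
# exact definition.

def efficiency(time_s, price_per_hour, ref_time_s):
    speedup = ref_time_s / time_s               # performance vs. reference
    run_cost = price_per_hour * time_s / 3600.0
    return speedup / run_cost                   # performance per dollar

REF_TIME = 1000.0  # hypothetical runtime on the real machine, seconds
platforms = {
    "real machine":   (1000.0, 1.50),
    "ec2 instance":   (1100.0, 0.80),
    "azure instance": (1050.0, 0.70),
}
for name, (time_s, price) in platforms.items():
    print(f"{name}: efficiency = {efficiency(time_s, price, REF_TIME):.2f}")
```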


Parallel, Distributed and Network-Based Processing | 2015

An Efficient Algorithm for Communication-Based Task Mapping

Eduardo Henrique Molina da Cruz; Matthias Diener; Laércio Lima Pilla; Philippe Olivier Alexandre Navaux

The communication between tasks of a parallel application is an important characteristic to consider when mapping tasks to computing cores, due to possible differences in communication performance. Within a machine, performance differences are introduced by the memory hierarchy, in which cache memories can be shared by groups of cores and intra-chip interconnections are faster than inter-chip interconnections. In cluster and grid systems, the network imposes an additional communication latency. By mapping tasks that communicate to cores that are nearby in the memory hierarchy, or to the same nodes in clusters or grids, the communication of parallel applications is optimized, leading to increased performance and energy efficiency. In the task mapping context, one of the most important aspects to be considered is the mapping algorithm, as it determines the improvements that can be achieved. Since the problem of finding the best mapping is NP-hard, heuristics must be employed to find an approximate solution in feasible time. In this paper, we present EagerMap, a new algorithm to perform communication-based mapping that is based on a greedy grouping strategy applied hierarchically. Experimental evaluation indicates that our algorithm runs 10 times faster than the state of the art while achieving higher performance improvements. Due to its low execution time and high stability, EagerMap is also suitable for online task mapping, where tasks are migrated during execution.
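The greedy grouping strategy can be sketched as follows: at each level of the memory hierarchy, tasks that communicate the most are grouped together, each group is collapsed into a single vertex, and the process repeats on the next level. This is a simplified reconstruction of the idea, not the published EagerMap algorithm; the matrix and group size are hypothetical.

```python
def group_level(comm, group_size):
    """Greedily group tasks by communication volume; one hierarchy level."""
    ungrouped = set(range(len(comm)))
    groups = []
    while ungrouped:
        # Seed with the ungrouped task that communicates the most overall.
        seed = max(ungrouped, key=lambda t: sum(comm[t]))
        group = [seed]
        ungrouped.remove(seed)
        while len(group) < group_size and ungrouped:
            # Add the task that communicates most with the current group.
            nxt = max(ungrouped, key=lambda t: sum(comm[t][m] for m in group))
            group.append(nxt)
            ungrouped.remove(nxt)
        groups.append(group)
    return groups

def collapse(comm, groups):
    """Sum communication between groups to build the next-level matrix."""
    k = len(groups)
    return [[sum(comm[a][b] for a in groups[i] for b in groups[j])
             if i != j else 0 for j in range(k)] for i in range(k)]

# 4 tasks; pairs (0,1) and (2,3) communicate heavily (hypothetical).
comm = [[0, 9, 1, 0], [9, 0, 0, 1], [1, 0, 0, 8], [0, 1, 8, 0]]
groups = group_level(comm, 2)          # e.g. 2 cores per shared cache
print(groups, collapse(comm, groups))  # [[0, 1], [2, 3]] ...
```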


Performance Evaluation | 2015

Characterizing Communication and Page Usage of Parallel Applications for Thread and Data Mapping

Matthias Diener; Eduardo Henrique Molina da Cruz; Laércio Lima Pilla; Fabrice Dupros; Philippe Olivier Alexandre Navaux

The parallelism in shared-memory systems has increased significantly with the advent and evolution of multicore processors. Current systems include several multicore and multithreaded processors with Non-Uniform Memory Access (NUMA) characteristics. These architectures require the adoption of two strategies for the efficient execution of parallel applications: (i) threads that share data should be placed in the memory hierarchy such that they execute on cores with shared caches; and (ii) a thread should have the data that it accesses placed on the NUMA node where it is executing. We refer to these techniques as thread and data mapping, respectively. Both strategies require knowledge of the application's memory access behavior to identify the communication between threads and processes as well as their usage of memory pages. In this paper, we introduce a profiling method to establish the suitability of parallel applications for improved mappings that take the memory hierarchy into account, based on a mathematical description of their memory access behavior. Experiments with a large set of parallel workloads that are based on a variety of parallel APIs (MPI, OpenMP, Pthreads, and MPI+OpenMP) show that most applications can benefit from improved mappings. We provide a mechanism to compute optimized thread and data mappings. Experimental results with this mechanism showed performance improvements of up to 54% (20% on average), as well as reductions in energy consumption of up to 37% (11% on average), compared to the default mapping of the operating system. Furthermore, our results show that thread and data mapping have to be performed jointly in order to achieve optimal improvements.
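The suitability analysis can be illustrated with a page-usage style metric: if most accesses to each page come from a single thread, data mapping can place pages next to their dominant threads. The metric below is a simplified stand-in for the paper's mathematical description, computed from hypothetical access counts:

```python
# Hypothetical access counts: page -> accesses per thread.
page_accesses = {
    0x1000: [90, 5, 5, 0],     # almost exclusive to thread 0
    0x2000: [25, 25, 25, 25],  # uniformly shared
    0x3000: [0, 70, 30, 0],
}

def exclusivity(counts):
    """Fraction of a page's accesses made by its dominant thread."""
    total = sum(counts)
    return max(counts) / total if total else 0.0

per_page = {hex(p): round(exclusivity(c), 2)
            for p, c in page_accesses.items()}
average = sum(exclusivity(c) for c in page_accesses.values()) / len(page_accesses)
print(per_page)
print(f"average exclusivity: {average:.2f}")
# Higher exclusivity suggests more to gain from data mapping; heavily
# shared pages instead indicate communication, relevant for thread mapping.
```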


Journal of Parallel and Distributed Computing | 2014

Dynamic thread mapping of shared memory applications by exploiting cache coherence protocols

Eduardo Henrique Molina da Cruz; Matthias Diener; Marco Antonio Zanata Alves; Philippe Olivier Alexandre Navaux

In current computer architectures, the communication performance between threads varies depending on the memory hierarchy. This performance difference must be considered when mapping parallel applications to processor cores. In parallel applications based on the shared memory paradigm, the communication is difficult to detect because it is implicit. Furthermore, dynamic mapping introduces several challenges, since it needs to find a suitable mapping and migrate the threads with a low overhead during the execution of the application. We propose a mechanism to detect the communication pattern of shared memory applications by monitoring cache coherence protocols. We also propose heuristics that, combined with our communication detection mechanism, allow the mapping to be performed dynamically by the operating system. Experiments with the NAS Parallel Benchmarks showed reductions of up to 13.9% in execution time, 30.5% in cache misses, and 39.4% in the number of invalidation messages.
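The detection idea can be illustrated with a toy coherence directory: whenever a thread reads a cache line last written by another thread, the coherence protocol moves data between their caches, which counts as communication. The event trace below is a hypothetical stand-in for real coherence traffic:

```python
from collections import defaultdict

# Hypothetical coherence-relevant events: (op, thread_id, cache_line).
events = [("w", 0, 0xA), ("r", 1, 0xA), ("w", 2, 0xB),
          ("r", 3, 0xB), ("r", 1, 0xA), ("w", 1, 0xA), ("r", 0, 0xA)]

last_writer = {}         # cache line -> thread that wrote it last
comm = defaultdict(int)  # (thread, thread) -> detected communication

for op, tid, line in events:
    if op == "w":
        last_writer[line] = tid
    elif op == "r":
        writer = last_writer.get(line)
        if writer is not None and writer != tid:
            comm[tuple(sorted((writer, tid)))] += 1

print(dict(comm))  # threads 0 and 1 communicate the most
```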


High Performance Computing and Communications | 2015

SiNUCA: A Validated Micro-Architecture Simulator

Marco Antonio Zanata Alves; Carlos Villavieja; Matthias Diener; Francis Birck Moreira; Philippe Olivier Alexandre Navaux

In order to observe and understand the architectural behavior of applications and evaluate new techniques, computer architects often use simulation tools. Several cycle-accurate simulators have been proposed to simulate the operation of the processor on the micro-architectural level. However, an important step before adopting a simulator is its validation, in order to determine how accurate the simulator is compared to a real machine. This validation step is often neglected with the argument that only the industry possesses the implementation details of the architectural components. The lack of publicly available micro-benchmarks that are capable of providing insights into the processor implementation is another barrier. In this paper, we present the validation of a new cycle-accurate, trace-driven simulator, SiNUCA. To perform the validation, we introduce a new set of micro-benchmarks to evaluate the performance of architectural components. SiNUCA provides a controlled environment to simulate the micro-architecture inside the cores, the cache memory subsystem with multi-banked caches, a NoC interconnection, and a detailed memory controller. Using our micro-benchmarks, we present a simulation validation comparing against the performance of real Core 2 Duo and Sandy Bridge processors, achieving an average performance error of less than 9%.
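The validation itself reduces to comparing simulated against measured execution times per micro-benchmark. A minimal sketch of the error computation, with hypothetical numbers in place of the paper's measurements:

```python
# Hypothetical (real, simulated) execution times per micro-benchmark.
results = {
    "dependency_chain": (1.00, 1.06),
    "cache_stride":     (2.50, 2.31),
    "branch_heavy":     (0.80, 0.85),
}

errors = {name: abs(sim - real) / real
          for name, (real, sim) in results.items()}
average_error = sum(errors.values()) / len(errors)

for name, err in errors.items():
    print(f"{name}: {err:.1%}")
print(f"average performance error: {average_error:.1%}")
```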

Collaboration

Top co-authors of Matthias Diener:

Philippe Olivier Alexandre Navaux (Universidade Federal do Rio Grande do Sul)
Eduardo Henrique Molina da Cruz (Universidade Federal do Rio Grande do Sul)
Marco Antonio Zanata Alves (Universidade Federal do Rio Grande do Sul)
Eduardo Roloff (Universidade Federal do Rio Grande do Sul)
Luciano Paschoal Gaspary (Universidade Federal do Rio Grande do Sul)
Anselm Busse (Technical University of Berlin)
Luigi Carro (Universidade Federal do Rio Grande do Sul)
Paulo C. Santos (Universidade Federal do Rio Grande do Sul)
Israel Koren (University of Massachusetts Amherst)
Francis B. Moreira (Universidade Federal do Rio Grande do Sul)