Is this you? Create Your Porfile

Andrés Goens

Dresden University of Technology

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Andrés Goens is active.

Explore More

Publication

Featured researches published by Andrés Goens.

international embedded systems symposium | 2015

Analysis of Process Traces for Mapping Dynamic KPN Applications to MPSoCs

Andrés Goens; Jeronimo Castrillon

Current approaches for mapping Kahn Process Networks (KPN) and Dynamic Data Flow (DDF) applications rely on assumptions on the program behavior specific to an execution. Thus, a near-optimal mapping, computed for a given input data set, may become sub-optimal at run-time. This happens when a different data set induces a significantly different behavior. We address this problem by leveraging inherent mathematical structures of the dataflow models and the hardware architectures. On the side of the dataflow models, we rely on the monoid structure of histories and traces. This structure help us formalize the behavior of multiple executions of a given dynamic application. By defining metrics we have a formal framework for comparing the executions. On the side of the hardware, we take advantage of symmetries in the architecture to reduce the search space for the mapping problem. We evaluate our implementation on execution variations of a randomly-generated KPN application and on a low-variation JPEG encoder benchmark. Using the described methods we show that trace differences are not sufficient for characterizing performance losses. Additionally, using platform symmetries we manage to reduce the design space in the experiments by two orders of magnitude.

design, automation, and test in europe | 2014

Optimized buffer allocation in multicore platforms

Maximilian Odendahl; Andrés Goens; Rainer Leupers; Gerd Ascheid; Benjamin Ries; Berthold Vöcking; Tomas Henriksson

With the availability of advanced MPSoC and emerging Dynamic RAM (DRAM) interface technologies, an optimal allocation of logical data buffers to physical memory cannot be handled manually anymore due to the huge design space. An allocation does not only need to decide between an on-or off-chip memory, but also needs to take an increasing number of available memory channels, different bandwidth capacities and several routing possibilities into account. We formalize this problem and introduce a Mixed Integer Linear Programming (MILP) model based on two different optimization criteria. We implement the MILP model into a retargetable tool and present a case study with representative data of the Long-Term-Evolution (LTE) standard to show the real-life applicability of our approach.

2016 IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSOC) | 2016

Why Comparing System-Level MPSoC Mapping Approaches is Difficult: A Case Study

Andrés Goens; Robert Khasanov; Jeronimo Castrillon; Simon Polstra; Andy D. Pimentel

Software abstractions are crucial to effectively program heterogeneous Multi-Processor Systems on Chip (MPSoCs). Prime examples of such abstractions are Kahn Process Networks (KPNs) and execution traces. When modeling computation as a KPN, one of the key challenges is to obtain a good mapping, i.e., an assignment of logical computation and communication to physical resources. In this paper we compare two system-level frameworks for solving the mapping problem: Sesame and MAPS. These frameworks, while superficially similar, embody different approaches. Sesame, motivated by modeling and design-space exploration, uses evolutionary algorithms for mapping. MAPS, being a compiler framework, uses simple and fast heuristics instead. In this work we highlight the value of common abstractions, such as KPNs and traces, as a vehicle to enable comparisons between large independent frameworks. These types of comparisons are fundamental for advancing research in the area. At the same time, we illustrate how the lack of formalized models at the hardware level are an obstacle to achieving fair comparisons. Additionally, using a set of applications from the embedded systems domain, we observe that genetic algorithms tend to outperform heuristics by a factor between 1× and 5×, with notable exceptions. This performance comes at the cost of a longer computation time, between 0 and 2 orders of magnitude in our experiments.

IEEE Transactions on Multi-Scale Computing Systems | 2018

A Hardware/Software Stack for Heterogeneous Systems

Jeronimo Castrillon; Matthias Lieber; Sascha Klüppelholz; Marcus Völp; Nils Asmussen; Uwe Aßmann; Franz Baader; Christel Baier; Gerhard P. Fettweis; Jochen Fröhlich; Andrés Goens; Sebastian Haas; Dirk Habich; Hermann Härtig; Mattis Hasler; Immo Huismann; Tomas Karnagel; Sven Karol; Akash Kumar; Wolfgang Lehner; Linda Leuschner; Siqi Ling; Steffen Märcker; Christian Menard; Johannes Mey; Wolfgang E. Nagel; Benedikt Nöthen; Rafael Peñaloza; Michael Raitza; Jörg Stiller

Plenty of novel emerging technologies are being proposed and evaluated today, mostly at the device and circuit levels. It is unclear what the impact of different new technologies at the system level will be. What is clear, however, is that new technologies will make their way into systems and will increase the already high complexity of heterogeneous parallel computing platforms, making it ever so difficult to program them. This paper discusses a programming stack for heterogeneous systems that combines and adapts well-understood principles from different areas, including capability-based operating systems, adaptive application runtimes, dataflow programming models, and model checking. We argue why we think that these principles built into the stack and the interfaces among the layers will also be applicable to future systems that integrate heterogeneous technologies. The programming stack is evaluated on a tiled heterogeneous multicore.

software and compilers for embedded systems | 2017

Robust Mapping of Process Networks to Many-Core Systems using Bio-Inspired Design Centering

Gerald Hempel; Andrés Goens; Jeronimo Castrillon; Josefine Asmus; Ivo F. Sbalzarini

Embedded systems are often designed as complex architectures with numerous processing elements. Effectively programming such systems requires parallel programming models e.g. task-based or dataflow-based models. With these types of models, the mapping of the abstract application model to the existing hardware architecture plays a decisive role and is usually optimized to achieve an ideal resource footprint or a near-minimal execution time. However, when mapping several independent programs to the same platform, resource conflicts can arise. This can be circumvented by remapping some of the tasks of an application, which in turn affect its timing behavior, possibly leading to constraint violations. In this work we present a novel method to compute mappings that are robust against local task remapping. The underlying method is based on the bio-inspired design centering algorithm of Lp-Adaptation. We evaluate this with several benchmarks on different platforms and show that mappings obtained with our algorithm are indeed robust. In all experiments, our robust mappings tolerated significantly more run-time perturbations without violating constraints than mappings devised with optimization heuristics

ACM Transactions on Architecture and Code Optimization | 2017

Symmetry in Software Synthesis

Andrés Goens; Sergio Siccha; Jeronimo Castrillon

With the surge of multi- and many-core systems, much research has focused on algorithms for mapping and scheduling on these complex platforms. Large classes of these algorithms face scalability problems. This is why diverse methods are commonly used for reducing the search space. While most such approaches leverage the inherent symmetry of architectures and applications, they do it in a problem-specific and intuitive way. However, intuitive approaches become impractical with growing hardware complexity, like Network-on-Chip interconnect or heterogeneous cores. In this article, we present a formal framework that can determine the inherent local and global symmetry of architectures and applications algorithmically and leverage these for problems in software synthesis. Our approach is based on the mathematical theory of groups and a generalization called inverse semigroups. We evaluate our approach in two state-of-the-art mapping frameworks. Even for the platforms with a handful of cores of today and moderate-sized benchmarks, our approach consistently yields reductions of the overall execution time of algorithms. We obtain a speedup of more than 10 × for one use-case and saved 10% of time in another.

Journal of Systems Architecture | 2016

An optimal allocation of memory buffers for complex multicore platforms

Andrés Goens; Jeronimo Castrillon; Maximilian Odendahl; Rainer Leupers

In deeply embedded heterogeneous multicores the allocation of data to memories is crucial for application performance. For applications with stringent throughput constraints, the allocation is often done manually by carefully assigning static memory locations to the logical buffers of the application. Today, designers are confronted with applications with thousands of buffers and architectures with hundreds of memories, rendering manual approaches impractical. In this paper we present an automatic approach for statically allocating logical buffers to physical memories, assuming a fixed task-to-processor mapping and respecting multiple throughput constraints.In our approach, we model the application in a data-centric way, by explicitly defining buffers and associating computational tasks that access the buffers within well-specified time intervals. Besides, we use an architecture model that allows to perform an allocation that is aware of the topology of the multicore and the physical bandwidth constraints of the interconnect. We present a layered approach to describe and solve the buffer-allocation problem as well as related subproblems, using mixed-integer linear programming. We show that the buffer-allocation problem is NP-complete, and present a more scalable formulation as a semi-definite programming problem. We evaluate the proposed LP methods by allocating around 1000 buffers corresponding to processing one frame in the Long-Term Evolution (LTE) standard, onto a multicore with 80 processing elements. We introduce a solution approach that allowed to find an optimal allocation in around 2 hours, which is at least two orders of magnitude faster than a straightforward formulation.

international parallel and distributed processing symposium | 2015

Buffer Allocation Based On-Chip Memory Optimization for Many-Core Platforms

Maximilian Odendahl; Andrés Goens; Rainer Leupers; Gerd Ascheid; Tomas Henriksson

The problem of finding an optimal allocation of logical data buffers to memory has emerged as a new research challenge due to the increased complexity of applications and new emerging Dynamic RAM (DRAM) interface technologies. This new opportunity of a large off-chip memory accessible by an ample bandwidth allows to reduce the on-chip Static RAM (SRAM) significantly and save production cost of future manycore platforms. We thus propose changes to an existing work that allows to uniformly reduce the on-chip memory size for a given application. We additionally introduce a novel linear programming model to automatically generate all necessary on chip memory sizes for a given application based on an optimal allocation of data buffers. An extension allows to further reduce the required on-chip memory in multi-application scenarios. We conduct a case study to validate all our models and show the applicability of our approach.

high performance embedded architectures and compilers | 2018

Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap

Robert Khasanov; Andrés Goens; Jeronimo Castrillon

Modern embedded systems are rapidly increasing their complexity, both in terms of numbers of cores, as well as heterogeneity. To generate efficient code for these systems, it is common to leverage formal models of computation. Among these, the dataflow model of Kahn Process Networks (KPN) is widespread because it is expressive but guarantees a deterministic execution. However, the KPN model is ill-suited to expose data-level parallelism, since this has to be made explicit in the process network. This is aggravated by the fact that its most common execution model, Kahn-MacQueen, poses restrictive conditions on the scheduling of data-parallel processes, leading to an inefficient execution. In this paper we present a novel extension to the KPN model and a relaxed execution strategy that addresses this problem, while keeping the deterministic KPN semantics. It improves run-time adaptivity in malleable way and provides implicit parallelism. We evaluate our approach on two architectures, improving the performance of a benchmark by up to 25.6% on an Intel chip with hyper-threading, and by up to 78.0 % on a heterogeneous embedded ARM big.LITTLE architecture.

compiler construction | 2018

Compiling for concise code and efficient I/O

Sebastian Ertel; Andrés Goens; Justus Adam; Jeronimo Castrillon

Large infrastructures of Internet companies, such as Facebook and Twitter, are composed of several layers of micro-services. While this modularity provides scalability to the system, the I/O associated with each service request strongly impacts its performance. In this context, writing concise programs which execute I/O efficiently is especially challenging. In this paper, we introduce Ÿauhau, a novel compile-time solution. Ÿauhau reduces the number of I/O calls through rewrites on a simple expression language. To execute I/O concurrently, it lowers the expression language to a dataflow representation. Our approach can be used alongside an existing programming language, permitting the use of legacy code. We describe an implementation in the JVM and use it to evaluate our approach. Experiments show that Ÿauhau can significantly improve I/O, both in terms of the number of I/O calls and concurrent execution. Ÿauhau outperforms state-of-the-art approaches with similar goals.

Explore More