Juan Fernando Eusse
RWTH Aachen University
Publication
Featured research published by Juan Fernando Eusse.
ACM Transactions on Reconfigurable Technology and Systems | 2015
Juan Fernando Eusse; Christopher Williams; Rainer Leupers
Application Specific Instruction Set Processor (ASIP) design methodologies have not been significantly altered during the past decade, and are still based on a highly manual and iterative process. Profiling has been established as a first step to prune the design space, and gain a deep understanding of the algorithms that underpin the application for which an ASIP is to be tailored. Independently of the profiling strategy, none of the existing ASIP-oriented profiling technologies enables on-the-loop application optimization or algorithmic exploration, which are mandatory steps throughout ASIP design. An innovative multi-grained approach that enables multiple levels of profiling detail according to the ASIP design stage (i.e. hot spot identification, application optimization, algorithmic exploration and architectural design) is presented. To validate our multi-grained profiling approach, the design of an ASIP for Marker-Based Augmented Reality was undertaken, achieving a 6x speedup in application execution in two days of design time.
Design Automation Conference | 2012
Luis Gabriel Murillo; Juan Fernando Eusse; Jovana Jovic; Sergey Yakoushkin; Rainer Leupers; Gerd Ascheid
Full-system simulators are essential to enable early software development and increase MPSoC programming productivity; however, their speed is limited by the speed of processor models. Although hybrid processor simulators provide native execution speed and target architecture visibility, their use for modern multi-core OSs and parallel software is restricted due to dynamic temporal and state decoupling side effects. This work analyzes the decoupling effects caused by hybridization and presents a novel synchronization technique which enables full-system hybrid simulation for modern MPSoC software. Experimental results show speed-ups from 2× to 45× over instruction-accurate simulation while still attaining functional correctness.
International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2015
Miguel Angel Aguilar; Juan Fernando Eusse; Projjol Ray; Rainer Leupers; Gerd Ascheid; Weihua Sheng; Prashant Sharma
In recent years, the presence of embedded devices in everyday life has grown exponentially. The market for these devices imposes conflicting requirements on cost, performance and energy. The use of Multiprocessor Systems on Chip (MPSoCs) is a widely accepted solution to provide a trade-off between these demands. However, programming MPSoCs is still a cumbersome task. Several research efforts have addressed this challenge in two complementary directions: paradigms for parallel programming and tools for parallelism extraction. However, most of these efforts focus on the high performance domain and do not consider the characteristics of the underlying platform. In this paper, we present an approach to extract multiple forms of parallelism from sequential C code, which is applied to widespread Android mobile devices. We show the effectiveness of our work by parallelizing relevant embedded benchmarks on a quad-core Nexus 7 tablet.
International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2014
Juan Fernando Eusse; Christopher Williams; Luis Gabriel Murillo; Rainer Leupers; Gerd Ascheid
Application Specific Instruction Set Processors (ASIPs) seek an optimal performance/area/energy trade-off for a given algorithm. In all current design methodologies, an architectural model must first be created manually based on the designer's experience. These models are progressively refined until the design constraints are met, through several time-consuming algorithmic/architecture co-exploration iterations. This paper presents a novel performance estimation approach that shortens the design cycle of existing methodologies by providing an early assessment of the impact of customizations on the achievable performance. The approach does so by eliminating the need for a completely specified architecture, without limiting the designer's freedom and without simulating the application repeatedly. Overall, our approach reduces the number of necessary co-exploration iterations, thus increasing design productivity. We validate our approach via two different case studies: a butterfly-enabled ASIP for Fast Fourier Transform computation and a Connected Components Labeling ASIP for computer vision.
Reconfigurable Communication-centric Systems on Chip | 2013
Juan Fernando Eusse; Christopher Williams; Rainer Leupers
Application Specific Instruction Set Processor (ASIP) design methodologies have not been significantly altered during the past decade, and are still based on a highly manual and iterative process. Profiling has been established as a first step to prune the design space, and gain a deep understanding of the algorithms that underpin the application for which an ASIP is to be tailored. Independently of the profiling strategy, none of the existing ASIP-oriented profiling technologies enables on-the-loop application optimization or algorithmic exploration, which are mandatory steps throughout ASIP design. An innovative multi-grained approach that enables multiple levels of profiling detail according to the ASIP design stage (i.e. hot spot identification, application optimization, algorithmic exploration and architectural design) is presented. To validate our multi-grained profiling approach, the design of an ASIP for Marker-Based Augmented Reality was undertaken, achieving a 6x speedup in application execution in two days of design time.
Design, Automation, and Test in Europe | 2014
Juan Fernando Eusse; Rainer Leupers; Gerd Ascheid; Patrick Sudowe; Bastian Leibe; Tamon Sadasue
Real-time identification of connected regions of pixels in large (e.g. FullHD) frames is a mandatory and expensive step in many computer vision applications that are becoming increasingly popular in embedded mobile devices such as smartphones, tablets and head-mounted devices. Standard off-the-shelf embedded processors are not yet able to cope with the performance/flexibility trade-offs required by such applications. Therefore, in this work we present an Application Specific Instruction Set Processor (ASIP) tailored to concurrently execute thresholding, connected components labeling and basic feature extraction of image frames. The proposed architecture is capable of coping with frame complexities ranging from QCIF to FullHD frames with 1 to 4 bytes-per-pixel formats, while achieving an average frame rate of 30 frames per second (fps). Synthesis was performed for a standard 65 nm CMOS library, obtaining an operating frequency of 350 MHz and an area of 2.1 mm². Moreover, evaluations were conducted on both typical and synthetic data sets, in order to thoroughly assess the achievable performance. Finally, an entire planar-marker-based augmented reality application was developed and simulated for the ASIP.
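For readers unfamiliar with the three steps named in this abstract, the following is a minimal software sketch of thresholding, connected components labeling and basic feature extraction. It is not the ASIP's algorithm or instruction set: it uses the textbook two-pass labeling scheme with union-find, and the function name `label_components`, the 4-connectivity choice, the "pixel >= threshold is foreground" rule and the area/bounding-box features are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def label_components(
    frame: List[List[int]], threshold: int
) -> Tuple[List[List[int]], Dict[int, dict]]:
    """Threshold a grayscale frame and label its 4-connected foreground regions.

    Assumption: a pixel is foreground when its value is >= threshold.
    Returns the label image (0 = background) and per-label features.
    """
    height = len(frame)
    width = len(frame[0]) if height else 0
    labels = [[0] * width for _ in range(height)]
    parent = [0]  # union-find forest; index 0 is reserved for background

    def find(x: int) -> int:
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # First pass: assign provisional labels and record label equivalences.
    for y in range(height):
        for x in range(width):
            if frame[y][x] < threshold:
                continue  # background pixel after thresholding
            up = labels[y - 1][x] if y > 0 else 0
            left = labels[y][x - 1] if x > 0 else 0
            if up == 0 and left == 0:
                parent.append(len(parent))  # new provisional label
                labels[y][x] = len(parent) - 1
            else:
                labels[y][x] = min(l for l in (up, left) if l)
                if up and left:
                    union(up, left)

    # Second pass: resolve equivalences and accumulate simple features.
    remap: Dict[int, int] = {}
    features: Dict[int, dict] = {}
    for y in range(height):
        for x in range(width):
            if labels[y][x] == 0:
                continue
            final = remap.setdefault(find(labels[y][x]), len(remap) + 1)
            labels[y][x] = final
            feat = features.setdefault(final, {"area": 0, "bbox": [x, y, x, y]})
            feat["area"] += 1
            box = feat["bbox"]  # [min_x, min_y, max_x, max_y]
            box[0], box[1] = min(box[0], x), min(box[1], y)
            box[2], box[3] = max(box[2], x), max(box[3], y)
    return labels, features
```

Calling `label_components(frame, 128)` on a small nested list of pixel values returns the label image together with a dictionary of per-region area and bounding box. A hardware implementation would typically fuse these passes into a streaming pipeline, which this sketch does not attempt.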
Design, Automation, and Test in Europe | 2012
Jovana Jovic; Sergey Yakoushkin; Luis Gabriel Murillo; Juan Fernando Eusse; Rainer Leupers; Gerd Ascheid
Due to their good flexibility-performance trade-off, Application Specific Instruction-set Processors (ASIPs), especially extensible ones, have been identified as a valuable component in modern embedded systems, achieving good cost-efficiency trade-offs. Since the generation of the described hardware is usually automated to a high extent, developers who must deliver an ASIP-based design in due time are limited by the performance of the underlying simulation techniques for software development. On the other hand, the Hybrid Processor simulation technology (HySim), which enables dynamic run-time switching between native and instruction-accurate simulation, has reported high speed-up values for some fixed architectures. This paper presents enhanced HySim technology for extensible cores, based on a layered simulation infrastructure. This technology has shown a speed-up on a per-function basis of two orders of magnitude for a realistic MIMO OFDM benchmark on a multi-core platform with customized Xtensa cores by Tensilica.
High Performance Computing and Communications | 2015
Miguel Angel Aguilar; Juan Fernando Eusse; Rainer Leupers; Gerd Ascheid; Maximilian Odendahl
Many embedded applications such as multimedia, signal processing and wireless communications exhibit streaming processing behavior. In order to take full advantage of modern multi- and many-core embedded platforms, these applications have to be parallelized by describing them in a given parallel Model of Computation (MoC). One of the most prominent MoCs is the Kahn Process Network (KPN), as it can express multiple forms of parallelism and is suitable for efficient mapping and scheduling onto parallel embedded platforms. However, describing streaming applications manually as a KPN is a challenging task, especially since they spend most of their execution time in loops with an unbounded number of iterations. These loops are in several cases implemented as while loops, which are difficult to analyze. In this paper, we present an approach to guide the derivation of KPNs from embedded streaming applications dominated by multiple types of while loops. We evaluate the applicability of our approach on a commercial embedded platform with eight DSP cores using realistic benchmarks. Results measured on the platform show that we are able to speed up sequential benchmarks by a factor of up to 4.3x on average and up to 7.7x in the best case. Additionally, to evaluate the effectiveness of our approach, we compared it against a state-of-the-art parallelization framework.
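As a rough illustration of the model of computation named in this abstract, the sketch below builds a three-process KPN-style pipeline in which each process is a while loop with no fixed iteration count, communicating only through FIFO channels with blocking reads. It does not reproduce the paper's derivation approach. The process names, the `END` end-of-stream token and the use of Python threads with `queue.Queue` are assumptions chosen for illustration.

```python
import queue
import threading
from typing import List

END = object()  # hypothetical end-of-stream token, not part of the KPN model itself


def source(out_ch: queue.Queue, samples: List[int]) -> None:
    """Producer process: writes every sample to its output channel."""
    for sample in samples:
        out_ch.put(sample)
    out_ch.put(END)


def scale(in_ch: queue.Queue, out_ch: queue.Queue, gain: int) -> None:
    """Worker process: a while loop whose trip count depends on the stream."""
    while True:
        token = in_ch.get()  # blocking read, as KPN semantics require
        if token is END:
            out_ch.put(END)  # forward termination downstream
            break
        out_ch.put(token * gain)


def sink(in_ch: queue.Queue, results: List[int]) -> None:
    """Consumer process: collects tokens until the stream ends."""
    while True:
        token = in_ch.get()
        if token is END:
            break
        results.append(token)


def run_pipeline(samples: List[int], gain: int = 2) -> List[int]:
    """Wire source -> scale -> sink with unbounded FIFOs and run to completion."""
    ch_a: queue.Queue = queue.Queue()
    ch_b: queue.Queue = queue.Queue()
    results: List[int] = []
    processes = [
        threading.Thread(target=source, args=(ch_a, samples)),
        threading.Thread(target=scale, args=(ch_a, ch_b, gain)),
        threading.Thread(target=sink, args=(ch_b, results)),
    ]
    for proc in processes:
        proc.start()
    for proc in processes:
        proc.join()
    return results
```

Because each process only blocks on reads from a single FIFO and never polls, the output order does not depend on thread scheduling, which is the determinism property that makes KPNs attractive for mapping onto parallel platforms.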
International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2016
Juan Fernando Eusse; Francisco Fernandez; Rainer Leupers; Gerd Ascheid
The design of an adequate memory subsystem is critical for achieving high performance in application specific and data plane processors. Applications running on such processors must fully exploit the memory hierarchy, so that the gains achieved through hardware optimization are not invalidated by software inefficiency. This paper presents a framework that vertically integrates the process of memory subsystem design with application optimization and algorithmic exploration. Based on an abstract model, the framework is able to predict the impact that a customization in the memory hierarchy will have on performance, while applying optimization techniques to utilize it efficiently. To perform the aforementioned processes, the tool flow relies on source-level information and a novel data model, avoiding the need for expensive cycle-accurate simulations. Throughout the evaluation, we show that the framework is capable of predicting memory-related performance metrics with an accuracy of ±10% when compared to simulation. Furthermore, we show that the approach is significantly more efficient than simulation, and can lead to gains in designer productivity of up to a factor of 40x.
Software and Compilers for Embedded Systems | 2015
Juan Fernando Eusse; Luis Gabriel Murillo; Christopher McGirr; Rainer Leupers; Gerd Ascheid
Early design decisions such as architectural class and instruction set selection largely determine the performance and energy consumption of application specific processors (ASIPs). However, making decisions that translate into high performance requires a careful analysis of the target application by an experienced designer. Such a process is extremely time consuming, and confirmation that the processor meets the application requirements can only be obtained after costly architectural implementation, synthesis and simulation. To shorten design times, this work couples High-Level Synthesis (HLS) with pre-architectural performance estimation. We do so with the aim of providing designers with an initial architectural seed together with quantitative feedback about its performance. This enables a lightweight refinement process based on the obtained feedback, such that time-consuming microarchitectural implementation is done only once, at the end of the refinement steps. We employed our flow to generate four potential ASIPs for a 1024-point FFT. Validation of the estimates and evaluation of the gains are performed on actual ASIP implementations, which achieve performance gains of up to 8.42x and energy gains of up to 1.32x over an existing VLIW processor.