Vlad Mihai Sima
Delft University of Technology
Publication
Featured research published by Vlad Mihai Sima.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2016
Razvan Nane; Vlad Mihai Sima; Christian Pilato; Jongsok Choi; Blair Fort; Andrew Canis; Yu Ting Chen; Hsuan Hsiao; Stephen Dean Brown; Fabrizio Ferrandi; Jason Helge Anderson; Koen Bertels
High-level synthesis (HLS) is increasingly popular for the design of high-performance and energy-efficient heterogeneous systems, shortening time-to-market and addressing today's system complexity. HLS allows designers to work at a higher level of abstraction by using a software program to specify the hardware functionality. Additionally, HLS is particularly interesting for designing field-programmable gate array circuits, where hardware implementations can be easily refined and replaced in the target device. Recent years have seen much activity in the HLS research community, with a plethora of HLS tool offerings, from both industry and academia. All these tools may have different input languages, perform different internal optimizations, and produce results of different quality, even for the very same input description. Hence, it is challenging to compare their performance and understand which is the best for the hardware to be implemented. We present a comprehensive analysis of recent HLS tools, as well as an overview of the areas of active interest in the HLS research community. We also present a first-published methodology to evaluate different HLS tools. We use our methodology to compare one commercial and three academic tools on a common set of C benchmarks, aiming at performing an in-depth evaluation in terms of performance and the use of resources.
field-programmable logic and applications | 2012
Razvan Nane; Vlad Mihai Sima; Bryan Olivier; Roel Meeuws; Yana Yankova; Koen Bertels
In the last decade, a considerable amount of effort was spent on raising the implementation level of hardware systems by automatically extracting the parallelism from input applications and using tools to generate hardware/software co-design solutions. However, the tools developed thus far either focus on particular application domains or impose severe restrictions on the input language. In this paper, we present the DWARV 2.0 compiler, which accepts general C code as input and generates synthesizable VHDL for unrestricted application domains. Unlike previous hardware compilers, this implementation is based on the CoSy compiler framework. This allowed us to build a highly modular compiler in which standard or custom optimizations can be easily integrated. Validation experiments showed speed-ups of up to 4.41× when comparing against another state-of-the-art hardware compiler.
design, automation, and test in europe | 2012
Giovanni Mariani; Vlad Mihai Sima; Gianluca Palermo; Vittorio Zaccaria; Cristina Silvano; Koen Bertels
Resource run-time managers have been shown to be particularly effective for coordinating the usage of the hardware resources by multiple applications, eliminating the necessity of a full-blown operating system. For this reason, we expect that this technology will be increasingly adopted in emerging multi-application reconfigurable systems. This paper introduces a fully automated design flow that exploits multi-objective design space exploration to enable run-time resource management for the Molen reconfigurable architecture. The entry point of the design flow is the application source code; our flow is able to heuristically determine a set of candidate hardware/software configurations of the application (i.e., operating points) that trade off the occupation of the reconfigurable fabric (in this case, an FPGA), the load of the master processor and the performance of the application itself. This information enables a run-time manager to exploit more efficiently the available system resources in the context of multiple applications. We present the results of an experimental campaign where we applied the proposed design flow to two reference audio applications mapped on the Molen architecture. The analysis showed that the overhead of the design space exploration and operating points extraction with respect to the original Molen flow is within reasonable bounds, since the final synthesis time still represents the major contribution. Besides, we have found that there is a high variance in terms of execution time speedup associated with the operating points of the application (characterized by a different usage of the FPGA), which can be exploited by the run-time manager to increase or decrease the quality of service of the application depending on the available resources.
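The idea of operating points that trade off FPGA occupation, processor load and execution time can be illustrated with a small Pareto filter. This is only a sketch of the final selection step under stated assumptions: the abstract describes a heuristic design space exploration whose details are not given here, and the `OperatingPoint` type, its field names and the sample values are hypothetical.

```python
from typing import List, NamedTuple


class OperatingPoint(NamedTuple):
    """Hypothetical operating point; every metric is 'lower is better'."""
    fpga_area: float   # fraction of the reconfigurable fabric occupied
    cpu_load: float    # fraction of master-processor time used
    exec_time: float   # application execution time (arbitrary units)


def dominates(a: OperatingPoint, b: OperatingPoint) -> bool:
    """True if a is no worse than b on every metric and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points: List[OperatingPoint]) -> List[OperatingPoint]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative values only, not measurements from the paper.
candidates = [
    OperatingPoint(0.0, 1.0, 10.0),  # software-only
    OperatingPoint(0.5, 0.5, 6.0),
    OperatingPoint(0.6, 0.6, 7.0),   # worse than the previous point on all metrics
    OperatingPoint(1.0, 0.2, 3.0),   # full FPGA usage
]
front = pareto_front(candidates)
```

A run-time manager could then pick among the points in `front` according to how much of the FPGA is currently free.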
international conference on embedded computer systems architectures modeling and simulation | 2015
Ernst Joachim Houtgast; Vlad Mihai Sima; Koen Bertels; Zaid Al-Ars
We present the first accelerated implementation of BWA-MEM, a popular genome sequence alignment algorithm widely used in next generation sequencing genomics pipelines. The Smith-Waterman-like sequence alignment kernel accounts for a significant portion of overall execution time. We propose and evaluate a number of FPGA-based systolic array architectures, presenting optimizations generally applicable to variable length Smith-Waterman execution. Our kernel implementation is up to 3× faster, compared to software-only execution. This translates into an overall application speedup of up to 45%, which is 96% of the theoretical maximum achievable speedup when accelerating only this kernel.
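The link between a kernel speedup and the overall application speedup is commonly reasoned about with Amdahl's law. The sketch below shows that relationship in its simplest form. It is an illustration, not the paper's model: the abstract does not report the kernel's share of execution time, so the `kernel_fraction` value is hypothetical, and the simple formula assumes kernel and remaining code run strictly one after the other. The results are therefore not expected to reproduce the figures quoted above.

```python
def overall_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: speedup of the whole application when only one kernel,
    taking `kernel_fraction` of the original run time, is accelerated."""
    if not 0.0 <= kernel_fraction < 1.0:
        raise ValueError("kernel_fraction must be in [0, 1)")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


def max_speedup(kernel_fraction: float) -> float:
    """Upper bound reached when the kernel takes no time at all."""
    if not 0.0 <= kernel_fraction < 1.0:
        raise ValueError("kernel_fraction must be in [0, 1)")
    return 1.0 / (1.0 - kernel_fraction)


# Hypothetical share of run time spent in the kernel (not from the paper).
fraction = 0.3
achieved = overall_speedup(fraction, kernel_speedup=3.0)
bound = max_speedup(fraction)
share_of_bound = achieved / bound
```

The bound shows why accelerating a single kernel has limited effect on the whole application: the unaccelerated part of the run time caps the achievable gain.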
international parallel and distributed processing symposium | 2009
Vlad Mihai Sima; Koen Bertels
In this paper, we present a runtime optimization targeting the speedup of applications running on a reconfigurable platform supporting the MOLEN programming paradigm. More specifically, for functions that have an execution time dependent on parameters, we propose an online adaptive decision algorithm to determine if the gain of running that function in hardware outweighs the overhead of transferring the parameters, managing the start and stop of the execution and obtaining the result. Our approach is dynamic in the sense that it does not rely on compile-time information. The algorithm is applied to a real video codec for which a function is implemented in hardware, and we show that improvements as large as 24% can be obtained for the specific kernel. We also determine the overhead and execution time ranges in which this optimisation is useful and what other factors can influence it.
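One way to picture an online hardware-versus-software decision is a dispatcher that keeps running time estimates per parameter value and picks the cheaper target. The sketch below is an assumption-laden illustration, not the algorithm from the paper: the `AdaptiveDispatcher` class, the exponential moving average and the cost functions passed to `run_call` are all hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple


class AdaptiveDispatcher:
    """Chooses 'sw' or 'hw' per parameter value from measured times only."""

    def __init__(self, alpha: float = 0.5) -> None:
        self.alpha = alpha  # weight of the newest measurement
        self.estimates: Dict[Tuple[int, str], float] = {}

    def choose(self, param: int) -> str:
        sw: Optional[float] = self.estimates.get((param, "sw"))
        hw: Optional[float] = self.estimates.get((param, "hw"))
        if sw is None:      # never measured in software: try it once
            return "sw"
        if hw is None:      # never measured in hardware: try it once
            return "hw"
        return "hw" if hw < sw else "sw"

    def record(self, param: int, target: str, elapsed: float) -> None:
        key = (param, target)
        old = self.estimates.get(key)
        if old is None:
            self.estimates[key] = elapsed
        else:
            self.estimates[key] = (1.0 - self.alpha) * old + self.alpha * elapsed


def run_call(dispatcher: AdaptiveDispatcher, param: int,
             sw_time: Callable[[int], float],
             hw_time: Callable[[int], float]) -> str:
    """Simulate one call: decide, 'measure' with the given cost model, record."""
    target = dispatcher.choose(param)
    elapsed = sw_time(param) if target == "sw" else hw_time(param)
    dispatcher.record(param, target, elapsed)
    return target
```

With a cost model where hardware has a fixed transfer overhead (for example `hw_time = lambda n: 100 + n` against `sw_time = lambda n: 5 * n`), the dispatcher settles on software for small inputs and hardware for large ones. This simple version never re-measures the losing target, which a production scheme would likely need to do.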
field-programmable logic and applications | 2009
Mojtaba Sabeghi; Vlad Mihai Sima; Koen Bertels
Multitasking reconfigurable computers with one or more reconfigurable processors have been used increasingly over the past few years. One of the major challenges in such systems is the scheduling and allocation of tasks on the reconfigurable fabric. In this paper, we present a two-level scheduling mechanism for tightly coupled reconfigurable architecture machines. To overcome the complexity of identifying kernels at runtime, we use compiler support. The compiler provides the runtime system with a configuration call graph, which is used as a source of information for the scheduling algorithm. We combine the configuration call graphs from all running applications and extract, for each kernel, the distance to the next call and the frequency of future calls from this graph. We base our scheduling decisions on these two parameters. Evaluation results show that the proposed method is very promising and has the potential to be considered for future research.
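A decision based on the two parameters named above, distance to the next call and frequency of future calls, might look like the following sketch. The abstract does not say how the two values are combined, so the rule used here (evict the kernel whose next call is farthest away, breaking ties in favour of keeping frequently called kernels) is an assumption, as are the `KernelInfo` type and the sample values.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class KernelInfo:
    """Hypothetical per-kernel data extracted from the merged call graphs."""
    name: str
    distance_to_next_call: int   # e.g. number of graph edges until the next call
    future_call_frequency: int   # how many future calls the graph predicts


def pick_victim(configured: List[KernelInfo]) -> KernelInfo:
    """Choose which configured kernel to remove from the fabric.

    Assumed rule: largest distance first, lowest frequency as tie-breaker.
    """
    if not configured:
        raise ValueError("no kernels are configured")
    return max(
        configured,
        key=lambda k: (k.distance_to_next_call, -k.future_call_frequency),
    )


# Illustrative values only.
on_fabric = [
    KernelInfo("fft", 2, 10),
    KernelInfo("dct", 8, 3),
    KernelInfo("huffman", 8, 1),
]
victim = pick_victim(on_fabric)
```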
Reconfigurable Computing-From FPGAs to Hardware/Software Codesign. Ed.: J. M. P. Cardoso | 2011
João M. P. Cardoso; Pedro C. Diniz; Zlatko Petrov; Koen Bertels; Michael Hübner; Hans van Someren; Fernando M. Gonçalves; José Gabriel F. Coutinho; George A. Constantinides; Bryan Olivier; Wayne Luk; Juergen Becker; Georgi Kuzmanov; Florian Thoma; Lars Braun; Matthias Kühnle; Razvan Nane; Vlad Mihai Sima; Kamil Krátký; José Carlos Alves; João Canas Ferreira
The relentless increase in capacity of Field-Programmable Gate-Arrays (FPGAs) has made them vehicles of choice for both prototypes and final products requiring on-chip multi-core, heterogeneous and reconfigurable systems. Multiple cores can be embedded as hard- or soft-macros, have customizable instruction sets, multiple distributed RAMs and/or configurable interconnections. Their flexibility allows them to achieve orders of magnitude better performance than conventional computing systems via customization. Programming these systems, however, is extremely cumbersome and error-prone, and as a result their true potential is very often achieved only at unreasonably high design effort. This project covers developing, implementing and evaluating a novel compilation and synthesis system approach for FPGA-based platforms. We rely on Aspect-Oriented Specifications to convey critical domain knowledge to a mapping engine while preserving the advantages of a high-level imperative programming paradigm in early software development as well as program and application portability. We leverage Aspect-Oriented Specifications and a set of transformations to generate an intermediate representation suitable for hardware mapping. A programming language, LARA, will allow the exploration of alternative architectures and design patterns, enabling the generation of flexible hardware cores that can be easily incorporated into larger multi-core designs. We will evaluate the effectiveness of the proposed approach using partner-provided codes from the domain of audio processing and real-time avionics. We expect the technology developed in REFLECT to be integrated by our industrial partners, in particular by ACE, a leading compilation tool supplier for embedded systems, and by Honeywell, a worldwide solution supplier of embedded high-performance systems.
international conference on computer aided design | 2015
Nauman Ahmed; Vlad Mihai Sima; Ernst Joachim Houtgast; Koen Bertels; Zaid Al-Ars
The rapid decrease in the cost of DNA sequencing has resulted in an enormous growth in available genome data, and hence led to an increasing demand for fast DNA analysis algorithms used for diagnostics of genetic disorders, such as cancer. One of the most computationally intensive steps in the analysis is DNA read alignment. In this paper, we present an accelerated version of BWA-MEM, one of the most popular read alignment algorithms, by implementing a heterogeneous hardware/software optimized version on the Convey HC2ex platform. A challenging factor of the BWA-MEM algorithm is that it consists of not one, but three computationally intensive kernels: SMEM generation, suffix array lookup and local Smith-Waterman. Obtaining substantial speedup is hence contingent on accelerating all three kernels at once. The paper shows an architecture containing two hardware-accelerated kernels and one kernel optimized in software. The two hardware kernels of suffix array lookup and local Smith-Waterman are able to reach speedups of 2.8x and 5.7x, respectively. The software optimization of the SMEM generation kernel is able to achieve a speedup of 1.7x. This enables a total application acceleration of 2.6x compared to the original software version.
bioinformatics and biomedicine | 2015
Shanshan Ren; Vlad Mihai Sima; Zaid Al-Ars
Many DNA sequence analysis tools have been developed to turn the massive raw DNA sequencing data generated by NGS (Next Generation Sequencing) platforms into biologically meaningful information. The pair-HMMs forward algorithm is widely used to calculate the overall alignment probability needed by a number of DNA analysis tools. In this paper, we propose a novel systolic array design to accelerate the pair-HMMs forward algorithm on FPGAs. A number of architectural features have been implemented to improve the performance of the design, such as early exit points to increase the utilization of the array for small sequence sizes, as well as on-chip buffering to enable the processing of long sequences effectively. We present an implementation of the design on the Convey supercomputing platform. Experimental results show that the FPGA implementation of the pair-HMMs forward algorithm is up to 67x faster, compared to software-only execution.
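For readers unfamiliar with the algorithm being accelerated, the sketch below shows a textbook three-state pair-HMM forward recurrence (match, insert, delete) in plain software. It is not the paper's design or the exact parameterization used by any particular DNA analysis tool: the transition parameters `delta` and `epsilon`, the emission values and the absence of an explicit end state are simplifying assumptions made for illustration.

```python
from typing import List


def pair_hmm_forward(x: str, y: str,
                     delta: float = 0.1,    # gap-open probability (assumed)
                     epsilon: float = 0.3,  # gap-extend probability (assumed)
                     p_equal: float = 0.22,  # emission for an identical pair
                     p_diff: float = 0.01,   # emission for a differing pair
                     q_gap: float = 0.25) -> float:
    """Sum the probabilities of all alignments of x and y."""
    n, m = len(x), len(y)
    # f_m / f_x / f_y: forward values for the match / gap-in-y / gap-in-x states.
    f_m: List[List[float]] = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_x: List[List[float]] = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_y: List[List[float]] = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_m[0][0] = 1.0  # start in the match state before any symbol is read

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                emit = p_equal if x[i - 1] == y[j - 1] else p_diff
                f_m[i][j] = emit * (
                    (1.0 - 2.0 * delta) * f_m[i - 1][j - 1]
                    + (1.0 - epsilon) * (f_x[i - 1][j - 1] + f_y[i - 1][j - 1])
                )
            if i > 0:  # x[i-1] aligned to a gap
                f_x[i][j] = q_gap * (delta * f_m[i - 1][j] + epsilon * f_x[i - 1][j])
            if j > 0:  # y[j-1] aligned to a gap
                f_y[i][j] = q_gap * (delta * f_m[i][j - 1] + epsilon * f_y[i][j - 1])

    return f_m[n][m] + f_x[n][m] + f_y[n][m]
```

Each cell depends only on its left, upper and upper-left neighbours, so all cells on one anti-diagonal can be computed in parallel. That dependency pattern is what makes the computation a natural fit for a systolic array.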
field-programmable logic and applications | 2007
Koen Bertels; Georgi Kuzmanov; Elena Moscu Panainte; Georgi Gaydadjiev; Yana Yankova; Vlad Mihai Sima; Kamana Sigdel; Roel Meeuws; Stamatis Vassiliadis
The aim of the hartes project is to facilitate and automate the rapid design and development of heterogeneous embedded systems, targeting a combination of a general purpose embedded processor, digital signal processing and reconfigurable hardware. In this paper, we evaluate three tools from the hartes toolchain supporting profiling, compilation, and HDL generation. These tools facilitate the HW/SW partitioning, co-design, co-verification, and co-execution of demanding embedded applications. The described tools are provided by the Delft Work Bench framework. Experimental results on MJPEG and G721 encoder application case studies suggest overall performance improvements of 228% and 36%, respectively.