Fernando Martinez Vallina

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Fernando Martinez Vallina is active.

Explore More

Publication

Featured researches published by Fernando Martinez Vallina.

Iet Computers and Digital Techniques | 2012

Image and video processing platform for field programmable gate arrays using a high-level synthesis

Christophe Desmouliers; Erdal Oruklu; Semih Aslan; Jafar Saniie; Fernando Martinez Vallina

In this study, an image and video processing platform (IVPP) based on field programmable gate array (FPGAs) is presented. This hardware/software co-design platform has been implemented on a Xilinx Virtex-5 FPGA using a high-level synthesis and can be used to realise and test complex algorithms for real-time image and video processing applications. The video interface blocks are done in Register Transfer Languages and can be configured using the MicroBlaze processor allowing the support of multiple video resolutions. The IVPP provides the required logic to easily plug-in the generated processing blocks without modifying the front-end (capturing video data) and the back-end (displaying processed output data). The IVPP can be a complete hardware solution for a broad range of real-time image/video processing applications including video encoding/decoding, surveillance, detection and recognition.

field-programmable technology | 2015

OpenCL library of stream memory components targeting FPGAs

Jasmina Vasiljevic; Ralph D. Wittig; Paul R. Schumacher; Jeff Fifield; Fernando Martinez Vallina; Henry E. Styles; Paul Chow

In recent years, high-level languages and compilers, such as OpenCL have improved both productivity and FPGA adoption on a wider scale. One of the challenges in the design of high-performance stream FPGA applications is iterative manual optimization of the numerous application buffers (e.g., arrays, FIFOs and scratch-pads). First, to achieve the desired throughput, the programmer faces the burden of analyzing the memory accesses of each application buffer, and based on observed data locality determines the optimal on-chip buffering, and off-chip read/write data access strategy. Second, to minimize throughput bottlenecks, the programmer has to carefully partition the limited on-chip memory resources among many application buffers. In this work we present an FPGA OpenCL library of pre-optimized stream memory components (SMCs). The library contains three types of SMCs, which implement frequently applied data transformations: 1) stencil, 2) transpose and 3) tiling. The library generates SMCs that are optimized both for the specific data transformation they perform as well as the user specified data set size. Further, to ease the partitioning of on-chip memory resources among many application memories, the library automatically maps application buffers to on-chip and off-chip memory resources. This is achieved by enabling the programmer to specify an on-chip memory budget for each component. In terms of on-chip memory, the SMCs perform data buffering to exploit data locality and maximize reuse. In terms of off-chip memory accesses, the SMCs optimize read/write memory operations by performing data coalescing, bursting and prefetching. We show that using the SMC library, the programmer can quickly generate scalable, pre-optimized stream application memory components, thus reaching throughput targets without time consuming manual memory optimization.

electro information technology | 2016

Implementation of elementary functions for FPGA compute accelerators

Spenser Gilliland; Jafar Saniie; Fernando Martinez Vallina

Field programmable gate arrays (FPGA) are growing from the role of glue logic into the area of application acceleration and compute. This is fostered by advances in silicon technologies as well as standards based methodologies for interacting with heterogeneous compute resources. As these standards generally require the implementation of elementary functions, this work outlines the implementation and evaluation of the elementary functions required by the heterogeneous programming standard OpenCL. It outlines the implementation of the math “builtin” functions using CORDIC methods and details the processes that will be taken to benchmark the resource usage, maximum frequency, and latency of each function on Xilinx 7 Series FPGAs. Because of the applicability and standardization of the OpenCL math functions, this benchmarking effort provides a basis for understanding and analysing future implementations.

electro information technology | 2017

Analysis of library functions for FPGA compute accelerators

Spenser Gilliland; Jafar Saniie; Fernando Martinez Vallina

As FPGAs have grown ever larger, there has been a shift in the manner in which they are programmed. Early on, it was typical for designers to design all FPGA firmware in house using VHDL and Verilog. This gradually shifted towards design reuse at the IP Core Level. However in modern times, even designs at the IP level are having trouble adapting quickly enough to customer demands. This has resulted in a change in focus towards higher level languages such as OpenCL. A key aspect of OpenCL is its standard library and specifically the math builtins of the standard library. This paper performs an in-depth analysis of the math functions in the OpenCL standard library and develops a framework to perform further analysis of library functions being implemented on FPGAs.

international workshop on opencl | 2015

Performance optimization for a SHA-1 cryptographic workload expressed in OpenCL for FPGA execution

Fernando Martinez Vallina; Spenser Gilliland

The introduction of Field Programmable Gate Array (FPGA) based devices for OpenCL applications provides an opportunity to develop kernels which are executed on application specific compute units which can be optimized for specific workloads such as encryption. This work examines the optimization of the SHA-1 hashing algorithm developed in OpenCL for and FPGA based implementation. The implementation starts from the freely available SHA-1 implementation in OpenSWAN; ports the implementation to OpenCL; and optimizes the kernel for FPGA implementation using the Xilinx SDAcccel development environment for OpenCL applications. Through each stage, the implementation is benchmarked in order to examine latency, throughput, and power usage on FPGA, Graphics Processing Unit (GPU), and Central Processing Unit (CPU) systems. While the programming model of OpenCL on FPGAs is identical to GPU and CPU, in order to optimize an application it is necessary to understand how the OpenCL concepts are implemented on FPGAs. In the platform model, one FPGA is considered a device. Inside the FPGA, a portion of the resources are dedicated to the fixed platform region. This region includes the memory interface, PCI Express interface, Direct Memory Access (DMA) controller, flash programming interface, and FPGA reconfiguration interface. The rest of the FPGA resources are dedicated to one or more OpenCL regions. By calling clCreateProgramWith-Binary, an OpenCL Region will be reprogrammed with the users chosen FPGA binary file. The FPGA binary file contains the configuration information to program one or more compute into the FPGA fabric for a kernel in the application. As a result of the flexibility of the FPGA fabric, these compute units can all work on the same kernel function or can be specialized to work on a set of kernel functions. All compute units, which target the same kernel, may be used by clEnqueueNDRange or out of order queues to implement concurrency. In addition to customizing the kernel compute units, FPGAs enable the creation of application specific memory architectures to minimize latency. Within the memory architectures supported by an FPGA, external and on-chip memories can be used within the OpenCL memory model. The external memory is mapped to the global and constant memory spaces while the on chip memory is mapped to the local, private and program scope global memory spaces. The on chip memory is implemented via BRAM and register resources inside the FPGA which are automatically inferred by the compiler. In addition to the levels of memory hierarchy available with FPGAs, the Xilinx SDAccel compiler analyzes the kernel data movement operations to automatically coalesce transactions and infer burst transactions to maximize usage of the available memory bandwidth. The SHA-1 is an algorithm used for the hashing of messages. It has a block iterative structure which utilizes one way compression to achieve strong cryptographic integrity. In the algorithm, a message is broken into 512 bit blocks. For each block, an initialization hash from the previous block is combined with a portion of the message to determine a new hash. Inside the blocks, there is an 80 iteration loop used for scrambling the input data. The implementation of this scrambling loop is the main computational activity in the algorithm. When optimizing the SHA-1 algorithm, it is important to consider the use model that will be employed by the application. In many cases, a continuous stream of messages will need to be hashed using the SHA-1 algorithm. In this scenario, the primary goal should be to focus on overall throughput of the algorithm. Assuming this use case, a dataflow implementation of the SHA-1 with a pipelined inner scrambling loop is chosen. Furthermore, the FPGA implementation of program scope global memory is utilized to provide an efficient dataflow paradigm. After optimizing the SHA-1 computation kernel by refactoring the OpenCL kernel code and using compiler options in SDAccel, it is important characterize the operation of the kernel within the context of the FPGA accelerator card. For this reason, the implementation will be compared to other CPU and GPU based targets. For each target, latency, throughput, and power usage are measured with optimized kernels for each platform. To test latency, a single message will be queued for hashing on the accelerator card. For throughput, many messages will be queued for execution on the accelerator card. Finally, power usage will be collected in both scenarios to determine the overall throughput per watt of the system in high load and low load scenarios. Executing the SHA-1 workload in an FPGA expressed as an OpenCL application achieves a 6x improvement when compared to the execution of the reference OpenSWAN implementation. This workload which is both computationally and memory intensive provides a good reference example to explain how OpenCL applications can be profiled and optimized for implementation on FPGA systems with a 25W power envelope.

field programmable gate arrays | 2015

Unlocking FPGAs Using High Level Synthesis Compiler Technologies

Fernando Martinez Vallina; Henry E. Styles

FPGA devices have long been the standard for massively parallel computing fabrics with a low power footprint. Unfortunately, the complexity associated with an FPGA design has limited the rate of adoption by software application programmers. Recent advances in compiler and FPGA fabric capabilities are reversing this trend and there is a growing adoption of FPGAs for algorithmic workloads such as data analytics, feature detection in images, adaptive beam forming, etc. One of the pillars of this shift is the Vivado HLS compiler, which enables the compilation of algorithms captured in C and C++into efficient FPGA implementations. This talk focuses on how the HLS compiler creates algorithm specific compute architectures and how these elements are then used in an OpenCL based system level design abstraction. The evolution of these hardware design abstractions into software centric specifications enable application developers to leverage the flexibility of the FPGA fabric without the constraints typically found in fixed parallel architectures such as multi-core CPUs/GPUs.

Archive | 2014

Heterogeneous multiprocessor platform targeting programmable integrated circuits

Henry E. Styles; Jeffrey M. Fifield; Ralph D. Wittig; Philip B. James-Roxby; Sonal Santan; Devadas Varma; Fernando Martinez Vallina; Sheng Zhou; Charles Kwok-Wah Lo

Archive | 2014

Heterogeneous multiprocessor program compilation targeting programmable integrated circuits

Henry E. Styles; Jeffrey M. Fifield; Ralph D. Wittig; Philip B. James-Roxby; Sonal Santan; Devadas Varma; Fernando Martinez Vallina; Sheng Zhou; Charles Kwah-Wah Lo

Archive | 2018

Hardware acceleration device handoff for using programmable integrated circuits as hardware accelerators

Susheel Kumar Puthana; Stephen P. Rozum; Sudipto Chakraborty; David A. Knol; Yong Li; Fernando Martinez Vallina; Sonal Santan; Nabeel Shirazi; Salil Ravindra Raje; Ethan T. Parker; Suman Kumar Timmireddy; Heera Nand

Archive | 2017

Software development-based compilation flow for hardware implementation

Bennet An; Henry E. Styles; Sonal Santan; Fernando Martinez Vallina; Pradip K. Jha; David A. Knol; Sudipto Chakraborty; Jeffrey M. Fifield; Stephen P. Rozum

Explore More