Marcos Sanchez-Elez
Complutense University of Madrid
Publication
Featured research published by Marcos Sanchez-Elez.
Design, Automation, and Test in Europe | 2003
Haitao Du; Marcos Sanchez-Elez; Nozar Tabrizi; Nader Bagherzadeh; M.L. Anido; Milagros Fernández
MorphoSys is a reconfigurable SIMD architecture. In this paper, a BSP-based ray tracing algorithm is gracefully mapped onto MorphoSys. The mapping highly exploits ray-tracing parallelism. A straightforward mechanism is used to handle irregularity among parallel rays in BSP. To support this mechanism, a special data structure is established, in which no intermediate data has to be saved. Moreover, optimizations such as object reordering and merging are facilitated. Data starvation is avoided by overlapping data transfer with intensive computation so that applications of differing complexity can be managed efficiently. Since MorphoSys is small in size and power efficient, we demonstrate that MorphoSys is an economical platform for 3D animation applications on portable devices.
Computers & Graphics | 2003
Marcos Sanchez-Elez; Haitao Du; Nozar Tabrizi; Yun Long; Nader Bagherzadeh; Milagros Fernández
This paper presents a mapping scheme of an optimized octree-based ray tracing algorithm and its implementation on a SIMD reconfigurable architecture, MorphoSys, with appropriate hardware incorporated. A two-level SIMD mapping scheme for ray tracing is chosen to get a better trade-off between coherence exploitation efficiency and bandwidth requirements. We apply a SIMD octree traversal algorithm that supports ray traversals of any origins and directions. Moreover, we have applied the bottom-up traversal order for shadow and reflection rays to avoid unnecessary testing. The memory overhead of the parallel execution of ray tracing in SIMD systems is analyzed to direct memory optimization. Pre-fetching is utilized to hide data fetch latency behind the computation. A Spatial Partitioning Tree buffer reduces the latency due to the interleaved accesses to the shared memory. It also dynamically exploits ray coherence to save memory bandwidth. A Pointer Update Unit and a Pointer Buffer are combined to remove the overhead resulting from pointer calculations and stack pushes during the parallel depth-first traversal process. The associated hardware cost is less than 2% of the whole system. In order to include diffuse effects in the output, we apply spherical harmonics. Post-synthesis simulation shows that the target chip is estimated to be 33 mm² and consumes less than 1 W in the worst case. Cycle-accurate simulation demonstrates that interactive ray tracing for medium-sized scenes is achieved on MorphoSys.
Design, Automation, and Test in Europe | 2002
Marcos Sanchez-Elez; Milagros Fernández; Rafael Maestre; Fadi J. Kurdahi; Román Hermida; Nader Bagherzadeh
A new technique is presented in this paper to improve the efficiency of data scheduling for multi-context reconfigurable architectures targeting multimedia and DSP applications. The main goal is to improve application execution time by minimizing external memory transfers. Some amount of on-chip data storage is assumed to be available in the reconfigurable architecture. The Complete Data Scheduler therefore tries to optimally exploit this storage, saving data and result transfers between on-chip and external memories. In order to do this, specific algorithms for data placement and replacement have been designed. We also show that a suitable data scheduling could decrease the number of transfers required to implement the dynamic reconfiguration of the system.
Design, Automation, and Test in Europe | 2003
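As a rough illustration of why reusing on-chip storage saves external transfers, the sketch below counts how many data blocks must be fetched for a sequence of kernels. It is not the Complete Data Scheduler from the paper: the function name `count_external_loads`, the block-based storage model, and the furthest-next-use replacement rule are all assumptions made for this example.

```python
from math import inf


def count_external_loads(kernel_inputs, capacity):
    """Count blocks fetched from external memory for a kernel sequence.

    kernel_inputs: list of lists, the data blocks each kernel reads.
    capacity: number of blocks the on-chip storage can hold (assumed model).
    """
    def next_use(block, step):
        # Index of the next kernel after `step` that reads `block`.
        for later in range(step + 1, len(kernel_inputs)):
            if block in kernel_inputs[later]:
                return later
        return inf  # never read again

    on_chip = set()
    loads = 0
    for step, needed in enumerate(kernel_inputs):
        if len(set(needed)) > capacity:
            raise ValueError("a kernel needs more blocks than fit on chip")
        for block in needed:
            if block in on_chip:
                continue  # reuse: no external transfer
            if len(on_chip) >= capacity:
                # Replacement (assumed rule): evict the resident block, not
                # needed by the current kernel, whose next use is furthest away.
                candidates = sorted(on_chip - set(needed))
                victim = max(candidates, key=lambda b: next_use(b, step))
                on_chip.remove(victim)
            on_chip.add(block)
            loads += 1
    return loads
```

For three kernels reading `["a", "b"]`, `["b", "c"]` and `["a", "c"]`, this model needs 4 loads with room for two blocks and 3 loads with room for three, compared with 6 if every kernel reloaded all of its inputs.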
Marcos Sanchez-Elez; Milagros Fernández; M.L. Anido; Haitao Du; Nader Bagherzadeh; Román Hermida
This paper presents a new technique to improve the efficiency of data scheduling for multi-context reconfigurable architectures targeting multimedia and DSP applications. The main goal is to improve application energy consumption. Two levels of on-chip data storage are assumed in the reconfigurable architecture. The data scheduler attempts to optimally exploit this storage, by deciding in which on-chip memory the data have to be stored in order to reduce energy consumption. We also show that a suitable data scheduling could decrease the energy required to implement the dynamic reconfiguration of the system.
International Symposium on Systems Synthesis | 2001
Marcos Sanchez-Elez; Milagros Fernández; Román Hermida; Rafael Maestre; Fadi J. Kurdahi; Nader Bagherzadeh
We present an approach to the problem of data scheduling for multi-context reconfigurable architectures targeting DSP applications. The main goal is to improve application execution time through the integration of the data scheduler within a compilation framework specifically conceived for these architectures. Some on-chip data storage is assumed to be available in the reconfigurable architecture. The data scheduler therefore tries to optimally exploit this storage, saving data transfers between on-chip and external memories. In order to do this, specific algorithms for data placement and replacement have been designed. We also show that a suitable data scheduling can decrease the number of operations required to implement the dynamic reconfiguration of the system.
arXiv: Hardware Architecture | 2012
Marcos Sanchez-Elez; Sara Roman
Intensive use of reconfigurable hardware is expected in future embedded systems. This means that the system has to decide which tasks are more suitable for hardware execution. To use the FPGA efficiently, it is convenient to choose one that allows hardware multitasking, which is implemented by using partial dynamic reconfiguration. One of the challenges for hardware multitasking in embedded systems is the online management of the single reconfiguration port of present FPGA devices. This paper presents different online reconfiguration scheduling strategies which assign the reconfiguration interface resource using different criteria: workload distribution or task deadline. The online scheduling strategies presented make fast, efficient decisions based on the information available at each moment. Experiments have been carried out to analyze the performance and convenience of these reconfiguration strategies.
IET Computers and Digital Techniques | 2008
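The deadline criterion mentioned above can be pictured with a minimal sketch: a single reconfiguration port serves, whenever it becomes free, the already-arrived request with the earliest deadline. This is an illustrative non-preemptive earliest-deadline-first policy, not the strategies evaluated in the paper; `ReconfigRequest`, its fields, and `schedule_by_deadline` are hypothetical names.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class ReconfigRequest:
    task: str
    arrival: int    # time at which the request becomes known
    load_time: int  # time the bitstream occupies the reconfiguration port
    deadline: int   # latest acceptable completion time of the load


def schedule_by_deadline(requests):
    """Serve one port online, earliest deadline first, without preemption.

    Returns a list of (task, start, finish, met_deadline) tuples.
    """
    pending = sorted(requests, key=lambda r: r.arrival)
    ready = []      # heap of (deadline, sequence number, request)
    timeline = []
    now = 0
    i = 0
    while i < len(pending) or ready:
        # If the port is idle and nothing has arrived yet, wait for the next request.
        if not ready and pending[i].arrival > now:
            now = pending[i].arrival
        # Only requests that have already arrived are visible to the scheduler.
        while i < len(pending) and pending[i].arrival <= now:
            heapq.heappush(ready, (pending[i].deadline, i, pending[i]))
            i += 1
        _, _, req = heapq.heappop(ready)
        start, finish = now, now + req.load_time
        timeline.append((req.task, start, finish, finish <= req.deadline))
        now = finish
    return timeline
```

Because decisions use only requests that have arrived, a long load started early can still delay an urgent request that shows up later; the `met_deadline` flag makes such misses visible.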
Fredy Rivera; Marcos Sanchez-Elez; Román Hermida; Nader Bagherzadeh
The authors present a scheduling methodology for conditional execution of kernels on single instruction stream/multiple data stream multicontext reconfigurable architectures. Data flow graphs are used to describe the target applications in which some kernels are conditionally executed depending on runtime conditions. Immediately after testing a condition the next kernel to be processed is known and its configurations and input data can be loaded, producing a computation stall while these transfers are performed. A compilation-time kernel scheduling is proposed to handle conditional branches (CBs) by determining a kernel sequence that minimises these computation stalls, reducing the application latency. Target applications are first partitioned taking into account the presence of CBs, and then kernels are ordered for execution and mapped onto the reconfigurable system. Experimental results obtained for interactive and synthetic applications demonstrate the effectiveness of the proposal.
Digital Systems Design | 2005
Fredy Rivera; Marcos Sanchez-Elez; Milagros Fernández; Nader Bagherzadeh
Reconfigurable architectures have become very relevant in recent years. In this paper we propose a methodology for analyzing interactive applications in order to execute them on a SIMD reconfigurable architecture, taking into account power/performance trade-offs. This methodology starts from a kernel description of the interactive application. Kernels are conditionally executed depending on dynamic conditions such as user input data manipulation. The volume of data involved in this kind of application, combined with user actions occurring at unexpected times, strongly impacts performance. We define an execution model to deal with conditional branches, accompanied by a data prefetch scheme, in order to avoid reconfigurable processing unit stalls due to operand unavailability. Experimental results satisfy the time constraints of interactive applications and show a power-effective solution for them.
Archive | 2005
Marcos Sanchez-Elez; Milagros Fernández; Román Hermida; Nader Bagherzadeh
This paper presents a new technique to improve the efficiency of data scheduling for multi-context reconfigurable architectures targeting multimedia and DSP applications. The main goal of this technique is to diminish application energy consumption. Two levels of on-chip data storage are assumed in the reconfigurable architecture. The Data Scheduler attempts to optimally exploit this storage, by deciding in which on-chip memory the data have to be stored in order to reduce energy consumption. We also show that a suitable data scheduling could decrease the energy required to implement the dynamic reconfiguration of the system.
Journal of Parallel and Distributed Computing | 2017
Enrique De Lucas; Marcos Sanchez-Elez; Inmaculada Pardines
This work proposes a methodology to synthesize arithmetic operations maximizing the reuse of the DSP48E1 blocks present in new reconfigurable architectures. The input for DSPONE48 is VHDL code without any reference to the FPGA hardware resources. This input code is modified so that the synthesis tool is able to implement it with DSP slices. In order to achieve this objective we use DSP block instantiation templates and we encourage the use of SIMD mode within the DSP block. This methodology automatically replaces the most common arithmetic operations with their equivalents on DSP slices. The methodology guarantees that the new code preserves the functionality and the number of execution cycles of the original design. Experimental results, on a Virtex 7 FPGA, show that the designs obtained by DSPONE48 use fewer DSPs than those obtained automatically by Xilinx ISE or Vivado. Moreover, these designs have lower area and higher frequency.