Publication


Featured research published by Olivier Serres.


ACM Transactions on Reconfigurable Technology and Systems | 2010

Reconfiguration and Communication-Aware Task Scheduling for High-Performance Reconfigurable Computing

Miaoqing Huang; Vikram K. Narayana; Harald Simmler; Olivier Serres; Tarek A. El-Ghazawi

High-performance reconfigurable computing involves accelerating significant portions of an application using reconfigurable hardware. When the hardware tasks of an application cannot simultaneously fit in an FPGA, the task graph needs to be partitioned and scheduled into multiple FPGA configurations in a way that minimizes the total execution time. This article proposes the Reduced Data Movement Scheduling (RDMS) algorithm, which aims to improve the overall performance of hardware tasks by taking into account the reconfiguration time, data dependency between tasks, inter-task communication, and task resource utilization. The proposed algorithm uses dynamic programming. A mathematical analysis shows that, in the worst case, the execution time exceeds the optimal solution by a factor of roughly 1.6. Simulations on randomly generated task graphs indicate that the RDMS algorithm can reduce inter-configuration communication time by 11% and 44%, respectively, compared with two other approaches that consider data dependency and hardware resource utilization only. The practicality and efficiency of the proposed algorithm over other approaches are demonstrated by simulating a task graph from a real-life application, N-body simulation, along with bandwidth constraints and FPGA parameters from existing high-performance reconfigurable computers. Experiments on the SRC-6 are carried out to validate the approach.
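
A note on the combinatorial core: packing ready hardware tasks into one FPGA configuration under an area budget is, in its simplest form, a knapsack-style dynamic program. The plain-C sketch below illustrates only that simplified step, not the RDMS algorithm itself; the task areas, the area budget, and the "reuse" values (data that stays on the FPGA, i.e., communication avoided) are invented for the example.

    #include <stdio.h>
    #include <string.h>

    /* Illustrative only: pick tasks for one FPGA configuration so that the
     * on-chip data reuse (communication avoided) is maximized under an area
     * budget.  All numbers below are hypothetical. */
    #define NTASKS 5
    #define AREA_BUDGET 100

    int main(void) {
        int area[NTASKS]  = {30, 45, 25, 50, 20};  /* FPGA slices per task */
        int reuse[NTASKS] = {12, 20, 10, 24,  8};  /* MB kept on chip if scheduled now */

        int best[AREA_BUDGET + 1];
        memset(best, 0, sizeof best);

        /* Classic 0/1 knapsack: best[a] = max on-chip reuse achievable with area a. */
        for (int t = 0; t < NTASKS; t++)
            for (int a = AREA_BUDGET; a >= area[t]; a--)
                if (best[a - area[t]] + reuse[t] > best[a])
                    best[a] = best[a - area[t]] + reuse[t];

        printf("max data reuse in this configuration: %d MB\n", best[AREA_BUDGET]);
        return 0;
    }

RDMS additionally respects data dependencies across configurations and accounts for reconfiguration time, which this toy sketch omits.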


IEEE Aerospace Conference | 2011

Experiences with UPC on TILE-64 processor

Olivier Serres; Ahmad Anbar; Saumil G. Merchant; Tarek A. El-Ghazawi

The partitioned global address space (PGAS) programming model presents programmers with a globally shared address space, locality awareness, and one-sided communication constructs. The shared address space and the one-sided communication constructs enhance the ease of use of PGAS-based languages, and the locality awareness enables programmers and runtime systems to achieve higher performance. Thus, the PGAS programming model may help address the escalating software complexity issues resulting from the proliferation of many-core processor architectures in aerospace and computing systems in general. This paper presents our experiences with Unified Parallel C (UPC), a PGAS language, on the TILE64™ processor, a 64-core processor from Tilera Corporation. We ported the Berkeley UPC compiler and runtime system to the Tilera architecture and evaluated two separate runtime implementation conduits of the underlying GASNet communication library: a pThreads-based conduit and an MPI-based conduit. Each conduit uses different on-chip, inter-core communication networks, providing different latencies and bandwidths for inter-process communication. The paper presents the implementation details and empirical analyses of both approaches by comparing and evaluating results from the NAS Parallel Benchmark suite. The analyses reveal various optimization opportunities based on specific many-core architectural features, which are also discussed in the paper.
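
As a rough, shared-memory analogue of the global view with affinity that the abstract describes, the plain-C/pthreads sketch below gives every thread its own block of one globally visible array while still allowing direct reads of any element (one-sided style). It is not UPC code and does not use the Berkeley UPC runtime or GASNet; the array size, thread count, and workload are arbitrary.

    #include <pthread.h>
    #include <stdio.h>

    /* Conceptual analogue of the PGAS view: one globally visible array, each
     * thread has affinity to its own block and mostly touches that block, but
     * any element remains directly readable. */
    #define NTHREADS 4
    #define N 1024

    static double data[N];                 /* "shared" array, global view */

    static void *work(void *arg) {
        long me = (long)arg;
        long blk = N / NTHREADS;
        long lo = me * blk, hi = lo + blk;

        for (long i = lo; i < hi; i++)     /* local-affinity work */
            data[i] = (double)i * 0.5;
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, work, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);

        /* "Remote"-style read: any thread may read another thread's block directly. */
        printf("data[N-1] = %.1f\n", data[N - 1]);
        return 0;
    }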


International Parallel and Distributed Processing Symposium | 2009

RDMS: A hardware task scheduling algorithm for Reconfigurable Computing

Miaoqing Huang; Harald Simmler; Olivier Serres; Tarek A. El-Ghazawi

Reconfigurable Computers (RCs) can provide significant performance improvements for domain applications. However, wide acceptance of today's RCs among domain scientists is hindered by the complexity of design tools and the required hardware design experience. Recent developments in HW/SW co-design methodologies for these systems provide ease of use, but they are not comparable in performance to manual co-design. This paper aims at improving the overall performance of hardware tasks assigned to FPGA devices by minimizing both the communication overhead and the configuration overhead introduced by using FPGA devices. The proposed Reduced Data Movement Scheduling (RDMS) algorithm takes data dependency among tasks, hardware task resource utilization, and inter-task communication into account during the scheduling process, and adopts a dynamic programming approach to reduce the communication between the μP and the FPGA co-processor and the number of FPGA configurations to a minimum. Compared to two other approaches that consider data dependency and hardware resource utilization only, the RDMS algorithm can reduce inter-configuration communication time by 11% and 44%, respectively, based on simulations using randomly generated data flow graphs. The implementation of RDMS on a real-life application, N-body simulation, verifies the efficiency of the RDMS algorithm against other approaches.


Computing Frontiers | 2010

Efficient cache design for solid-state drives

Miaoqing Huang; Olivier Serres; Vikram K. Narayana; Tarek A. El-Ghazawi; Gregory B. Newby

Solid-State Drives (SSDs) are data storage devices that use solid-state memory to store persistent data. Flash memory is the de facto nonvolatile technology used in most SSDs. It is well known that the write performance of flash-based SSDs is much lower than the read performance, because a flash page can be written only after it is erased. In this work, we present an SSD cache architecture designed to provide balanced read/write performance for flash memory. An efficient automatic updating technique is proposed to provide a more responsive SSD architecture by writing back stable but dirty flash pages, according to a predetermined set of policies, during SSD idle time. These automatic updating policies are also tested and compared. Simulation results demonstrate that both read and write performance are improved significantly by incorporating the proposed cache with the automatic updating feature into SSDs.
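
The automatic updating idea, writing back stable but dirty pages while the device is idle so that later evictions do not stall on slow flash writes, can be sketched in a few lines. The cache geometry, the policy (flush everything dirty when idle), and the flash_write placeholder below are hypothetical and much simpler than the design evaluated in the paper.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical, simplified write-back cache in front of flash.  The real
     * geometry, replacement policy, and updating policies differ. */
    #define LINES 8

    struct line { bool valid, dirty; unsigned tag; unsigned char data; };
    static struct line cache[LINES];

    static void flash_write(unsigned addr, unsigned char v) {     /* slow path */
        printf("flash write  addr=%u val=%d\n", addr, v);
    }

    static void cache_write(unsigned addr, unsigned char v) {
        struct line *l = &cache[addr % LINES];
        if (l->valid && l->dirty && l->tag != addr)   /* evicting a dirty line */
            flash_write(l->tag, l->data);             /* would stall without updating */
        l->valid = true; l->dirty = true; l->tag = addr; l->data = v;
    }

    /* Automatic updating: during idle time, clean dirty lines so that future
     * evictions hit clean lines and writes can be absorbed by the cache. */
    static void idle_flush(void) {
        for (int i = 0; i < LINES; i++)
            if (cache[i].valid && cache[i].dirty) {
                flash_write(cache[i].tag, cache[i].data);
                cache[i].dirty = false;
            }
    }

    int main(void) {
        cache_write(3, 10);
        cache_write(11, 20);   /* maps to the same line as 3: dirty eviction */
        idle_flush();          /* proactively cleans the remaining dirty line */
        return 0;
    }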


IEEE Transactions on Energy Conversion | 2010

Modeling and Simulation of PEM Fuel Cell Thermal Behavior on Parallel Computers

Abdelkrim Salah; Jaafar Gaber; Rachid Outbib; Olivier Serres; Hoda El-Sayed

Proton exchange membrane fuel cells (PEMFCs) have attracted great interest in recent years, in particular for transportation applications. However, fuel cell (FC) technology is not yet ready for large-scale commercial use, as it requires more understanding and intensive development, in particular in the area of thermal behavior. Such understanding of the FC requires many large-scale simulations that can take unacceptably long execution times. This is especially true when using traditional models that are governed by heat equations and based on computational tools that derive approximate solutions to partial differential equations. Such multimodel systems also require synchronization, which results in overhead. Instead, in this paper, a new fully integrated modeling approach that lends itself to parallelism is introduced. This approach can benefit from advances in parallel computing, and thus dramatically reduce time and enable multiple large simulations. It is called the global nodal method and is intended to analyze and simulate the thermal behavior of PEMFCs. Parallel simulations are implemented with the Message Passing Interface (MPI) and with the Unified Parallel C (UPC) language on parallel systems. It is shown that the computation time for large-scale thermal-behavior simulations using MPI and UPC is significantly reduced compared to sequential simulations, while the obtained data remain highly precise and accurate.
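
For a rough flavor of this kind of parallel thermal simulation, the MPI sketch below advances a 1-D explicit heat-diffusion stencil with halo exchange between neighboring ranks. It is a generic illustration only: the paper's global nodal method, geometry, and boundary conditions are not reproduced, and the grid size, diffusion coefficient, and step count are arbitrary.

    #include <mpi.h>
    #include <stdio.h>

    /* Generic 1-D explicit heat diffusion with halo exchange; all parameters
     * are arbitrary and unrelated to the PEMFC model in the paper. */
    #define NLOC   100      /* grid points per rank */
    #define STEPS  500
    #define ALPHA  0.25     /* explicit scheme stable for alpha <= 0.5 */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double u[NLOC + 2] = {0}, un[NLOC + 2];   /* +2 ghost cells */
        if (rank == 0) u[1] = 100.0;              /* hot spot at one end */

        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        for (int s = 0; s < STEPS; s++) {
            /* exchange ghost cells with both neighbors */
            MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                         &u[NLOC + 1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[NLOC], 1, MPI_DOUBLE, right, 1,
                         &u[0], 1, MPI_DOUBLE, left, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            for (int i = 1; i <= NLOC; i++)
                un[i] = u[i] + ALPHA * (u[i - 1] - 2.0 * u[i] + u[i + 1]);
            for (int i = 1; i <= NLOC; i++)
                u[i] = un[i];
        }
        if (rank == 0) printf("u[1] after %d steps: %f\n", STEPS, u[1]);
        MPI_Finalize();
        return 0;
    }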


Reconfigurable Computing and FPGAs | 2011

An Architecture for Reconfigurable Multi-core Explorations

Olivier Serres; Vikram K. Narayana; Tarek A. El-Ghazawi

Multi-core systems are now the norm, and reconfigurable systems have shown substantial benefits over general-purpose ones. This paper presents a combination of the two: a fully featured reconfigurable multi-core processor based on the Leon3 processor. The platform provides important features such as cache coherency and a fully running modern OS (GNU/Linux), and each core has a tightly coupled reconfigurable coprocessor unit attached. This allows the SPARC instruction set to be extended for the running application. The multi-core reconfigurable processor architecture, including the coprocessor interface, the ICAP controller, and the Linux kernel driver, is presented. The experimental results characterize the platform, including the area costs, the memory contention, and the reprogramming cost. Speedups of up to 100x are demonstrated on a cryptography test.
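
From the software side, a tightly coupled coprocessor is typically driven either through extended instructions or through a small register interface (write operands, start, poll, read back). The plain-C sketch below simulates the latter with an invented register layout and a fake accelerator that merely XORs its operands; it does not reflect the actual Leon3 coprocessor interface, its instruction-set extension, or the Linux kernel driver.

    #include <stdint.h>
    #include <stdio.h>

    /* Host-side view of a coprocessor as a small register file.  Everything
     * here is hypothetical: on real hardware this struct would be a fixed
     * memory-mapped region and the "hardware" would run in the FPGA. */
    struct cop_regs {
        volatile uint32_t opa, opb, ctrl, status, result;
    };

    static struct cop_regs cop;              /* stands in for the MMIO region */

    /* Simulated hardware: this toy accelerator just XORs its operands. */
    static void fake_hardware_step(void) {
        if (cop.ctrl & 1u) {
            cop.result = cop.opa ^ cop.opb;
            cop.status |= 1u;                /* DONE */
            cop.ctrl &= ~1u;
        }
    }

    /* Typical offload sequence: write operands, start, poll, read the result. */
    static uint32_t cop_exec(uint32_t a, uint32_t b) {
        cop.opa = a; cop.opb = b;
        cop.status = 0;
        cop.ctrl |= 1u;                      /* START */
        while (!(cop.status & 1u))
            fake_hardware_step();            /* on real hardware: just poll */
        return cop.result;
    }

    int main(void) {
        printf("result = 0x%x\n", cop_exec(0x1234u, 0xff00u));
        return 0;
    }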


Southern Conference on Programmable Logic | 2008

An Image Processing Architecture to Exploit I/O Bandwidth on Reconfigurable Computers

Miaoqing Huang; Olivier Serres; Sergio Lopez-Buedo; Tarek A. El-Ghazawi; Greg Newby

FPGA devices in reconfigurable computers (RCs) allow the datapath, memory, and processing elements (PEs) to be customized in order to achieve very efficient algorithm implementations. However, the maximum speedup on RCs is bounded by the bandwidth available between the μPs and the FPGA hardware accelerators. In this paper, an image processing architecture is presented that fully exploits this bandwidth to achieve the maximum possible speedup. This architecture can be used to implement any convolution operation between an image and a kernel, and comprises four fully pipelined components: a line buffer, a data window, an array of PEs, and a data concatenating block. Multiple image processing algorithms have been successfully implemented using this architecture, such as digital filters, edge detectors, and image transforms. In all cases, the maximum throughput is upper-bounded by the μP-FPGA I/O bandwidth, regardless of the complexity of the algorithm. This end-to-end throughput has been measured to be 1.2 GB/s on the Cray XD1 and 2.1 GB/s on the SGI RC100.
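
The line-buffer/data-window structure can be mimicked in software: rows are streamed in once, only the last few rows are kept, and a small window slides along each buffered row feeding a fixed multiply-accumulate array. The C sketch below computes a 3x3 convolution this way on synthetic pixel data; it illustrates the dataflow, not the FPGA implementation or its measured throughput.

    #include <stdio.h>

    /* Streaming-style 3x3 convolution: only the last three rows are buffered,
     * mirroring the line buffer + data window + PE array structure described
     * above.  Image contents and kernel are arbitrary. */
    #define W 8
    #define H 6

    int main(void) {
        int kernel[3][3] = { {0, -1, 0}, {-1, 4, -1}, {0, -1, 0} };  /* Laplacian */
        int lines[3][W];                        /* line buffer: last 3 rows */
        int out[H][W] = {{0}};

        for (int y = 0; y < H; y++) {
            for (int x = 0; x < W; x++)         /* stream one new row in */
                lines[y % 3][x] = (x + y) % 5;  /* synthetic pixel data */

            if (y < 2) continue;                /* window not yet full */
            for (int x = 1; x < W - 1; x++) {   /* slide the 3x3 window */
                int acc = 0;
                for (int ky = 0; ky < 3; ky++)
                    for (int kx = 0; kx < 3; kx++)
                        acc += kernel[ky][kx] * lines[(y - 2 + ky) % 3][x - 1 + kx];
                out[y - 1][x] = acc;            /* output row lags input by one */
            }
        }
        printf("out[2][3] = %d\n", out[2][3]);
        return 0;
    }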


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2011

Address Translation Optimization for Unified Parallel C Multi-dimensional Arrays

Olivier Serres; Ahmad Anbar; Saumil G. Merchant; Abdullah Kayi; Tarek A. El-Ghazawi

Partitioned Global Address Space (PGAS) languages offer significant programmability advantages with their global memory view abstraction, one-sided communication constructs, and data locality awareness. These attributes place PGAS languages at the forefront of possible solutions to the exploding programming complexity of many-core architectures. To enable the shared address space abstraction, PGAS languages use an address translation mechanism when accessing shared memory to convert shared addresses to physical addresses. This mechanism is already expensive in terms of performance in distributed memory environments, but it becomes a major bottleneck in machines with shared memory support, where access latencies are significantly lower. Multi- and many-core processors exhibit even lower latencies for shared data due to on-chip cache utilization. Thus, efficient handling of address translation becomes even more crucial, as this overhead may easily become the dominant factor in the overall data access time for such architectures. To alleviate the address translation overhead, this paper introduces a new mechanism targeting the multi-dimensional arrays used in most scientific and image processing applications. Relative costs and the implementation details for UPC are evaluated with different workloads (matrix multiplication, the Random Access benchmark, and Sobel edge detection) on two different platforms: a many-core system, the TILE64 (a 64-core processor), and a dual-socket, quad-core Intel Nehalem system (up to 16 threads). Our optimization provides substantial performance improvements, up to 40x. In addition, the proposed mechanism can easily be integrated into compilers, abstracting it from programmers. Accordingly, this improves UPC productivity, as it reduces the manual optimization effort required to minimize the address translation overhead.
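
To see why per-element translation is costly, consider how an access to a UPC-style shared array laid out block-cyclically over T threads with block size B must be resolved. The plain-C sketch below computes the (thread, local offset) pair for a linear index and hints at the kind of per-row precomputation that hoists the divisions out of the inner loop; B, T, and the matrix shape are example values, and the paper's actual optimization is not reproduced here.

    #include <stdio.h>

    /* Block-cyclic layout used by UPC-style shared arrays: element i lives on
     * thread (i / B) % T at local offset (i / (B*T)) * B + i % B.
     * B, T, and the matrix shape below are example values only. */
    #define B 4          /* block size */
    #define T 8          /* number of threads */
    #define ROWS 8
    #define COLS 32

    struct where { int thread; int offset; };

    static struct where translate(int i) {            /* per-access translation */
        struct where w;
        w.thread = (i / B) % T;
        w.offset = (i / (B * T)) * B + i % B;
        return w;
    }

    int main(void) {
        /* Naive: translate every accessed element of row r of a ROWS x COLS matrix. */
        int r = 5;
        for (int c = 0; c < 3; c++) {
            struct where w = translate(r * COLS + c);
            printf("A[%d][%d] -> thread %d, offset %d\n", r, c, w.thread, w.offset);
        }

        /* Row-wise precomputation: because COLS is a multiple of B*T here, each
         * row starts at the same local offset on every thread, so the divisions
         * can be hoisted out of the inner loop and replaced by cheap increments. */
        struct where base = translate(r * COLS);
        printf("row %d starts on thread %d at offset %d\n", r, base.thread, base.offset);
        return 0;
    }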


International Parallel and Distributed Processing Symposium | 2016

PGAS Access Overhead Characterization in Chapel

Engin Kayraklioglu; Olivier Serres; Ahmad Anbar; Hashem Elezabi; Tarek A. El-Ghazawi

The Partitioned Global Address Space (PGAS) model increases programmer productivity by presenting a flat memory space with locality awareness. However, the abstract representation of memory incurs overheads, especially when global data is accessed. As a PGAS programming language, Chapel provides language structures to alleviate such overheads. In this work, we examine such optimizations on a set of benchmarks using multiple locales and analyze their impact on programmer productivity quantitatively. The optimization methods that we study achieve improvements over non-optimized versions ranging from 1.1 to 68.1 times, depending on the benchmark characteristics.


ACM Transactions on Architecture and Code Optimization | 2016

Exploiting Hierarchical Locality in Deep Parallel Architectures

Ahmad Anbar; Olivier Serres; Engin Kayraklioglu; Abdel-Hameed A. Badawy; Tarek A. El-Ghazawi

Parallel computers are becoming deeply hierarchical. Locality-aware programming models allow programmers to control locality at one level by establishing affinity between data and executing activities. This, however, does not enable locality exploitation at other levels. Therefore, we must conceive an efficient abstraction of hierarchical locality and develop techniques to exploit it. Techniques applied directly by programmers beyond the first level burden the programmer and hinder productivity. In this article, we propose the Parallel Hierarchical Locality Abstraction Model for Execution (PHLAME). PHLAME is an execution model for abstracting and exploiting a machine's hierarchical properties through locality-aware programming and a runtime that takes into account machine characteristics as well as the data sharing and communication profile of the underlying application. This article presents and experiments with concepts and techniques that can drive such a runtime system in support of PHLAME. Our experiments show that our techniques scale up and achieve performance gains of up to 88%.

Collaboration


Dive into Olivier Serres's collaborations.

Top Co-Authors

Tarek A. El-Ghazawi (George Washington University)
Ahmad Anbar (George Washington University)
Abdullah Kayi (George Washington University)
Engin Kayraklioglu (George Washington University)
Vikram K. Narayana (George Washington University)
David K. Newsom (George Washington University)
Greg Newby (University of Alaska Fairbanks)
Gregory B. Newby (University of Alaska Fairbanks)