Sami Yehia
Intel
Publication
Featured research published by Sami Yehia.
high performance embedded architectures and compilers | 2012
Petar Radojković; Sylvain Girbal; Arnaud Grasset; Eduardo Quiñones; Sami Yehia; Francisco J. Cazorla
Commercial Off-The-Shelf (COTS) processors are now commonly used in real-time embedded systems. The characteristics of these processors fulfill system requirements in terms of time-to-market, low cost, and high performance-per-watt ratio. However, multithreaded (MT) processors are still not widely used in real-time systems because the timing analysis is too complex. In MT processors, simultaneously-running tasks share and compete for processor resources, so the timing analysis has to estimate the possible impact that the inter-task interferences have on the execution time of the applications. In this paper, we propose a method that quantifies the slowdown that simultaneously-running tasks may experience due to collision in shared processor resources. To that end, we designed benchmarks that stress specific processor resources and we used them to (1) estimate the upper limit of a slowdown that simultaneously-running tasks may experience because of collision in different shared processor resources, and (2) quantify the sensitivity of time-critical applications to collision in these resources. We used the presented method to determine if a given MT processor is a good candidate for systems with timing requirements. We also present a case study in which the method is used to analyze three multithreaded architectures exhibiting different configurations of resource sharing. Finally, we show that measuring the slowdown that real applications experience when simultaneously-running with resource-stressing benchmarks is an important step in measurement-based timing analysis. This information is a base for incremental verification of MT COTS architectures.
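The slowdown measurement described in the abstract above can be illustrated with a minimal sketch. It is not the authors' tooling: the function names, the resource names, and the cycle counts are all hypothetical. It assumes only that slowdown is the ratio of a task's execution time next to a resource-stressing benchmark to its execution time in isolation.

```python
def slowdown(t_isolated, t_contended):
    """Ratio of contended to isolated execution time (1.0 means no slowdown)."""
    if t_isolated <= 0:
        raise ValueError("isolated time must be positive")
    return t_contended / t_isolated


def worst_case_slowdown(t_isolated, contended_times):
    """Largest slowdown observed across all resource-stressing co-runners."""
    return max(slowdown(t_isolated, t) for t in contended_times.values())


# Hypothetical measurements in cycles; these numbers are not from the paper.
isolated = 1000
with_stressor = {"l1_cache": 1300, "fp_unit": 1050, "memory_bus": 2100}

# Sensitivity of the application to collision in each shared resource.
per_resource = {r: slowdown(isolated, t) for r, t in with_stressor.items()}

# The resource whose contention hurts this application the most.
most_sensitive = max(per_resource, key=per_resource.get)
```

With measurements of this shape, `worst_case_slowdown` gives an observed upper limit per application, and `per_resource` ranks the shared resources by how much contention in each one costs.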
international symposium on computer architecture | 2004
Sami Yehia; Olivier Temam
In this article, we present an approach for improving the performance of sequences of dependent instructions. We observe that many sequences of instructions can be interpreted as functions. Unlike sequences of instructions, functions can be translated into very fast but exponentially costly two-level combinational circuits. We present an approach that exploits this principle, speeds up programs thanks to circuit-level parallelism/redundancy, but avoids the exponential costs. We analyze the potential of this approach, and then we propose an implementation that consists of a superscalar processor with a large specific functional unit associated with specific back-end transformations. The performance of the SpecInt2000 benchmarks and selected programs from the Olden and MiBench benchmark suites improves on average from 2.4% to 12% depending on the latency of the functional units, and up to 39.6%; more precisely, the performance of optimized code sections improves on average from 3.5% to 19%, and up to 49%.
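The observation that a sequence of dependent instructions can be read as a function, and a function as a two-level circuit, can be sketched on a toy example. This is an illustration only, not the paper's implementation: the three-instruction sequence and the helper names are hypothetical, and the inputs are restricted to single bits so that the truth table stays small.

```python
from itertools import product


def dependent_sequence(a, b, c):
    """Three dependent 1-bit operations: each one needs the previous result."""
    t1 = a & b
    t2 = t1 ^ c
    t3 = t2 | a
    return t3


def to_sum_of_products(fn, n_inputs):
    """Enumerate the truth table and keep the input rows (minterms) where fn is 1."""
    return [bits for bits in product((0, 1), repeat=n_inputs) if fn(*bits)]


def eval_two_level(minterms, bits):
    """OR of ANDs: the output is 1 if the inputs match any stored minterm."""
    return int(tuple(bits) in set(minterms))


minterms = to_sum_of_products(dependent_sequence, 3)

# The two-level form computes the same function in a fixed depth,
# whatever the length of the original dependence chain.
matches = all(
    eval_two_level(minterms, bits) == dependent_sequence(*bits)
    for bits in product((0, 1), repeat=3)
)
```

The enumeration visits `2 ** n_inputs` rows, which is the exponential cost the abstract refers to and the reason the approach has to avoid building the full two-level form for wide inputs.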
high-performance computer architecture | 2009
Sami Yehia; Sylvain Girbal; Hugues Berry; Olivier Temam
While parallelism and multi-cores are receiving much attention as a major scalability path, customization is another, orthogonal and complementary, scalability path which can target not easily parallelizable programs or program sections. The key assets of customization are cost and power efficiency. The key limitation of customization is flexibility. However, we argue that there is no perfect balance between efficiency and flexibility: each system vendor may want to strike a different balance. In this article, we present a method for achieving any desired balance between flexibility and efficiency by automatically combining any set of individual customization circuits into a larger compound circuit. This circuit is significantly more cost efficient than the simple union of all target circuits, and is configurable to behave as any of the target circuits, while avoiding the routing and configuration cost overhead of FPGAs. The more individual circuits are included, the larger the number of applications which can potentially benefit from this compound customization circuit, realizing flexibility at a minimal cost. Moreover, we observe that the compound circuit cost does not increase in proportion to the number of target applications, due to the wide range of common data-flow and control-flow patterns in programs. Currently, the target individual circuits correspond to loops, like most accelerators in embedded systems, but the aggregation method can accommodate circuits of any size. Using the UTDSP benchmarks and accelerators coupled with an embedded PowerPC405 processor, we show that this approach can yield an average performance improvement of 2.97, while the corresponding synthesized aggregate accelerator is 3 times smaller than the sum of individual accelerators for each target benchmark.
international on line testing symposium | 2011
Jaume Abella; Francisco J. Cazorla; Eduardo Quiñones; Arnaud Grasset; Sami Yehia; Philippe Bonnot; Dimitris Gizopoulos; Riccardo Mariani; Guillem Bernat
The performance demand of Critical Real-Time Embedded (CRTE) systems implementing safety-related system features grows at an exponential rate. Only modern semiconductor technologies can satisfy the performance needs of CRTE systems efficiently. However, those technologies lead to high failure rates, thus lowering the survivability of chips to levels that are unacceptable for CRTE systems. This paper presents the SESACS (Surviving Errors in SAfety-Critical Systems) architecture, a paradigm shift in the design of CRTE systems. SESACS is a new system design methodology consisting of three main components: (i) a multicore hardware/firmware platform capable of detecting and diagnosing hardware faults of any type with minimal impact on the worst-case execution time (WCET), recovering quickly from errors, and properly reconfiguring the system so that the resulting system exhibits a predictable and analyzable degradation in WCET; (ii) a set of analysis methods and tools to prove the timing correctness of the reconfigured system; and (iii) a white-box methodology and tools to prove the functional safety of the system and compliance with industry standards. This new design paradigm will deliver huge benefits to the embedded systems industry for several decades by enabling the use of more cost-effective multicore hardware platforms built on top of modern semiconductor technologies, thereby enabling higher performance and reducing weight and power dissipation. This new paradigm will further extend the life of embedded systems, thereby reducing warranty and early replacement costs.
symposium on application specific processors | 2010
Dominik Auras; Sylvain Girbal; Hugues Berry; Olivier Temam; Sami Yehia
Custom acceleration has been a standard choice in embedded systems thanks to the power density and performance efficiency it provides. Parallelism is another orthogonal scalability path that efficiently overcomes the increasing limitation of frequency scaling in current general-purpose architectures. In this paper we propose a multi-accelerator architecture that combines the best of both worlds, parallelism and custom acceleration, while addressing the programmability inconvenience of heterogeneous multiprocessing systems. A Chip Multi-Accelerator (CMA) is a regular parallel architecture where each core is complemented with a custom accelerator to speed up specific functions. Furthermore, by using techniques to efficiently merge more than one custom accelerator together, we are able to cram as many accelerators as needed by the application or a domain of applications. We demonstrate our approach on a Software Defined Radio (SDR) case study. We show that starting from a baseline description of several SDR waveforms and candidate tasks for acceleration, we are able to map the different waveforms on the heterogeneous multi-accelerator architecture while keeping a logical view of a regular multi-core architecture, thus simplifying the mapping of the waveforms onto the multi-accelerator.
application specific systems architectures and processors | 2011
Alessandro Strano; Davide Bertozzi; Arnaud Grasset; Sami Yehia
Process scaling has given designers billions of transistors to work with. As feature sizes near the atomic scale, extensive variation and wear-out inevitably make margining uneconomical or impossible. In this context, new design approaches are required. The inherent regularity and redundancy of SIMD architectures make them suitable to address the challenges posed by new semiconductor technologies at the architecture level. This paper proposes a built-in self-test/self-diagnosis procedure for a SIMD processor. Concurrent BIST operations are carried out after reset at each PE, so that test application time scales with processor size. The key principle consists of exploiting the inherent structural redundancy of the SIMD architecture in a cooperative way, thus strongly reducing the latency and area overhead of the testing framework. Once faults are detected, a reconfiguration technique is proposed in order to preserve correct operation. Testing of single stuck-at faults is performed at-speed in 240 cycles regardless of the accelerator size, with a hardware overhead of less than 10%. Finally, the fault-tolerant tile integrating BIST, reconfiguration logic, and a spare PE requires a total area overhead of 25%.
International Journal of Parallel Programming | 2011
Arnaud Grasset; Philippe Millet; Philippe Bonnot; Sami Yehia; Wolfram Putzke-Roeming; Fabio Campi; Alberto Rosti; Michael Huebner; Nikolaos S. Voros; Davide Rossi; Henning Sahlbach; Rolf Ernst
Reconfigurable computing offers a wide range of low-cost and efficient solutions for embedded systems. The proper choice of the reconfigurable device, the granularity of its processing elements, and its memory architecture depends heavily on the type of application and its data flow. Existing solutions either offer fine-grain FPGAs, which rely on a hardware synthesis flow and offer the maximum degree of flexibility, or coarser-grain solutions, which are usually more suitable for a particular type of data flow and application. In this paper, we present the MORPHEUS architecture, a versatile reconfigurable heterogeneous System-on-Chip targeting streaming applications. The presented architecture exploits different reconfigurable technologies at several computation granularities that efficiently address the needs of different applications. In order to efficiently exploit the presented architecture, we implemented a complete software solution to map C applications to the reconfigurable architecture. In this paper, we describe the complete toolset and provide concrete use cases of the architecture.
design automation conference | 2013
Sylvain Girbal; Miquel Moretó; Arnaud Grasset; Jaume Abella; Eduardo Quiñones; Francisco J. Cazorla; Sami Yehia
The computing market has been dominated during the last two decades by the well-known convergence of the high-performance computing market and the mobile market. In this paper we witness a new type of convergence between the mission-critical market (such as avionics or automotive) and the mainstream consumer electronics market. Such convergence is fuelled by the common needs of both markets for more reliability, support for mission-critical functionalities, and the challenge of harnessing the unsustainable increases in safety margins to guarantee either correctness or timing. In this position paper, we present a description of this new convergence, as well as the main challenges and opportunities that it brings to the computing industry.
compilers, architecture, and synthesis for embedded systems | 2010
Sylvain Girbal; Olivier Temam; Sami Yehia; Hugues Berry; Zheng Li
Power and programming challenges make heterogeneous multi-cores composed of cores and ASICs an attractive alternative to homogeneous multi-cores. Recently, multi-purpose loop-based generated accelerators have emerged as an especially attractive accelerator option. They have several assets: short design time (automatic generation), flexibility (multi-purpose) but low configuration and routing overhead (unlike FPGAs), computational performance (operations are directly mapped to hardware), and a focus on memory throughput by leveraging loop constructs. However, with multiple streams, the memory behavior of such accelerators can become at least as complex as that of superscalar processors, while they still need to retain the memory ordering predictability and throughput efficiency of DMAs. In this article, we show how to design a memory interface for multi-purpose accelerators which combines the ordering predictability of DMAs, retains key efficiency features of memory systems for complex processors, and requires only a fraction of their cost by leveraging the properties of stream references. We evaluate the approach with a synthesizable version of the memory interface for an example 9-task generated loop-based accelerator.
acm sigplan symposium on principles and practice of parallel programming | 2005
Jean-Francois Collard; Norman P. Jouppi; Sami Yehia
Inspired by recent advances in microprocessor performance monitors, this paper shows how a shared-memory multiprocessor chipset and interconnect can be equipped with performance monitors that associate performance events with the PCs of the individual instructions causing these events. Such monitors greatly simplify performance debugging of shared-memory programs; for example, they make finding pairs of instructions in false sharing straightforward. These monitors also enable precise feedback-directed compiler optimizations and, as a second contribution, we show how they can guide the code generator to use the version of the load instruction that makes the best use of the coherence protocol. Experiments show up to almost 10% coherence traffic reduction on SPLASH2 applications.
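To illustrate why PC-attributed events make false sharing easy to find, the sketch below post-processes a hypothetical event log in software. It is not the hardware monitor the paper describes: the event fields, the 64-byte line size, and the rule used to flag a pair are all assumptions, and accesses are simplified to single addresses.

```python
from collections import defaultdict
from itertools import combinations

LINE_SIZE = 64  # bytes; assumed cache-line size


def false_sharing_pairs(events):
    """Return PC pairs from different cores that touch different addresses
    in the same cache line, where at least one of the two accesses is a write."""
    by_line = defaultdict(list)
    for ev in events:
        by_line[ev["addr"] // LINE_SIZE].append(ev)

    pairs = set()
    for accesses in by_line.values():
        for x, y in combinations(accesses, 2):
            if (x["core"] != y["core"]
                    and x["addr"] != y["addr"]
                    and (x["write"] or y["write"])):
                pairs.add(tuple(sorted((x["pc"], y["pc"]))))
    return pairs


# Hypothetical event log; each record ties a memory access to the PC that issued it.
events = [
    {"pc": 0x400100, "core": 0, "addr": 0x1000, "write": True},
    {"pc": 0x400200, "core": 1, "addr": 0x1008, "write": True},
    {"pc": 0x400300, "core": 1, "addr": 0x2000, "write": False},
]

suspects = false_sharing_pairs(events)
```

Because every event already carries the PC of the instruction that caused it, the result points directly at the instruction pairs to fix, for instance by padding the data they touch onto separate lines.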