Mohammadsadegh Sadri
University of Bologna
Publication
Featured research published by Mohammadsadegh Sadri.
Proceedings of the 10th FPGAworld Conference on | 2013
Mohammadsadegh Sadri; Christian Weis; Norbert Wehn; Luca Benini
Cooperation between a CPU and a hardware accelerator on computationally intensive tasks provides significant advantages in run-time speed and energy. Efficient management of data sharing among multiple computational kernels can rapidly turn into a complicated problem. The Accelerator Coherency Port (ACP) emerges as a possible solution by enabling hardware accelerators to issue coherent accesses to the memory space. In this paper, we quantify the advantages of using ACP over the traditional method of sharing data on the DRAM. We select the Xilinx ZYNQ as target and develop an infrastructure to stress the ACP and high-performance (HP) AXI interfaces of the ZYNQ device. Hardware accelerators on both the HP and ACP AXI interfaces reach a full-duplex data processing bandwidth of over 1.6 GBytes/s running at 125 MHz on a XC7Z020-1C device. The effect of background DRAM and cache traffic on the performance of accelerators is analyzed. For a sample image filtering task, the cooperative operation of CPU and ACP accelerator (CPU-ACP) gains a speed-up of 1.2X over CPU and HP acceleration (CPU-HP). In terms of energy efficiency, an improvement of 2.5 nJ (> 20%) is shown for each byte of processed data. This is the first work that presents detailed practical comparisons of the speed and energy efficiency of various processor-accelerator memory sharing techniques in a configurable heterogeneous platform.
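The figures quoted in this abstract can be put in context with a little arithmetic. The sketch below is only a back-of-the-envelope check, not part of the paper: it assumes a 64-bit AXI data path that can move one read beat and one write beat per clock cycle (the port width is not stated in the abstract), and all names are illustrative.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# ASSUMPTION (not stated in the abstract): each AXI port is 64 bits wide
# and can transfer one read beat and one write beat per clock cycle.

CLOCK_HZ = 125e6          # accelerator clock from the abstract
PORT_WIDTH_BYTES = 8      # assumed 64-bit AXI data path
MEASURED_BW = 1.6e9       # "over 1.6 GBytes/s" full duplex, used as a lower bound


def peak_full_duplex_bw(clock_hz, width_bytes):
    """Ideal read + write bandwidth in bytes/s for a single port."""
    return 2 * clock_hz * width_bytes


peak = peak_full_duplex_bw(CLOCK_HZ, PORT_WIDTH_BYTES)
utilisation = MEASURED_BW / peak

# Energy: a saving of 2.5 nJ/byte that amounts to more than 20% of the
# baseline implies an upper bound on the baseline energy per byte.
SAVING_NJ_PER_BYTE = 2.5
max_baseline_nj = SAVING_NJ_PER_BYTE / 0.20

print(f"ideal peak: {peak / 1e9:.1f} GB/s, utilisation >= {utilisation:.0%}")
print(f"implied CPU-HP energy < {max_baseline_nj:.1f} nJ/byte")
```

Under these assumptions the reported bandwidth corresponds to at least 80% of the ideal full-duplex peak, and the CPU-HP baseline would consume less than 12.5 nJ per processed byte.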
Design, Automation, and Test in Europe | 2014
Mohammadsadegh Sadri; Matthias Jung; Christian Weis; Norbert Wehn; Luca Benini
Heterogeneous 3D integrated systems with Wide-I/O DRAMs are a promising solution to squeeze more functionality and storage bits into an ever-decreasing volume. Unfortunately, with 3D stacking, the challenges of high power densities and thermal dissipation are exacerbated. We improve DRAM refresh power by considering the lateral and vertical temperature variations in the 3D structure and adapting the per-DRAM-bank refresh period accordingly. To prove our concepts, we develop an advanced virtual platform which models the performance, power, and thermal behavior of a 3D-integrated MPSoC with Wide-I/O DRAMs in detail. On this platform we run the Android OS with real-world benchmarks to quantify the advantages of our ideas. We show improvements of 16% in DRAM refresh power due to temperature-variation-aware bank-wise refresh. Furthermore, two solutions are investigated to speed up system simulations: (1) Adaptive tuning of sampling intervals based on the estimated chip thermal profile, which results in speedups of 2X. (2) Hardware acceleration of thermal simulations using the Maxeler engine, which shows possible speedups of 12X.
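The idea of adapting the refresh period per bank can be illustrated with a toy model. The sketch below is not the paper's model: it assumes a simple step rule (64 ms refresh period up to 85 °C, 32 ms above), assumes refresh power scales with refresh rate, and uses hypothetical bank temperatures and function names.

```python
# Toy model of temperature-aware bank-wise refresh.
# ASSUMPTIONS (not from the abstract): a step rule of 64 ms up to 85 C and
# 32 ms above it, and refresh power proportional to refresh rate (1/period).

def refresh_period_ms(temp_c):
    """Assumed refresh period for a bank at the given temperature."""
    return 64.0 if temp_c <= 85.0 else 32.0


def refresh_rate_sum(periods_ms):
    """Total refresh rate, used here as a proxy for refresh power."""
    return sum(1.0 / p for p in periods_ms)


def refresh_saving(bank_temps_c):
    """Relative refresh-power saving of per-bank periods versus refreshing
    every bank at the period required by the hottest bank."""
    worst = refresh_period_ms(max(bank_temps_c))
    uniform = refresh_rate_sum([worst] * len(bank_temps_c))
    per_bank = refresh_rate_sum([refresh_period_ms(t) for t in bank_temps_c])
    return 1.0 - per_bank / uniform


bank_temps = [78.0, 82.0, 88.0, 91.0]  # hypothetical per-bank temperatures
print(f"refresh-power saving: {refresh_saving(bank_temps):.0%}")
```

With two of four banks below the threshold, the toy model gives a 25% saving; the number depends entirely on the assumed rule and temperatures and should not be compared with the 16% reported in the abstract.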
2014 22nd International Conference on Very Large Scale Integration (VLSI-SoC) | 2014
Matthias Jung; Christian Weis; Norbert Wehn; Mohammadsadegh Sadri; Luca Benini
3D stacked systems with Wide-I/O DRAMs are the future density-optimized mobile computing platforms. Unfortunately, with 3D integration, the power densities and thermal dissipation are increased dramatically. In this paper, we investigate the effectiveness of power-down mode policies (using precharge power-down, active power-down and self-refresh) and bank-wise refresh in active mode. We run real-life benchmarks to quantify the impact of each power-down mode setting. We derive a power-down mode policy which shows up to 10% energy reduction in high activity periods and up to 13% in idle phases. Further, we improve DRAM refresh power by considering the lateral and vertical temperature variations in the 3D structure and adapting the per-DRAM-bank refresh period accordingly. To achieve this, a per-DRAM-array hotspot detector, designed with DRAM cells and circuits, is used to acquire temperature and refresh information directly from the DRAM array. We show 16% improvements in DRAM refresh power due to hotspot detectors inside the DRAM enabling temperature-variation-aware bank-wise refresh. For all the above-mentioned investigations, a detailed DRAM controller model with accurate functionality, timing, and power estimation in SystemC TLM-2.0 (Transaction Level Modeling) and a highly sophisticated virtual hardware platform are mandatory to achieve a thorough analysis.
Design, Automation, and Test in Europe | 2012
Andrea Bartolini; Mohammadsadegh Sadri; John-Nicholas Furst; Ayse Kivilcim Coskun; Luca Benini
Dynamic frequency and voltage scaling (DVFS) techniques have been widely used for meeting energy constraints. Single-chip many-core systems bring new challenges owing to the large number of operating points and the shift to message passing interface (MPI) from shared memory communication. DVFS, however, has been mostly studied on single-chip systems with one or few cores, without considering the impact of the communication among cores. This paper evaluates the impact of frequency scaling on the performance and power of many-core systems with MPI. We conduct experiments on the Single-Chip Cloud Computer (SCC), an experimental many-core processor developed by Intel. The paper first introduces the run-time monitoring infrastructure and the application suite we have designed for an in-depth evaluation of the SCC. We provide an extensive analysis quantifying the effects of frequency perturbations on performance and energy efficiency. Experimental results show that run-time communication patterns lead to significant differences in power/performance tradeoffs in many-core systems with MPI.
Design, Automation, and Test in Europe | 2017
Menbere Tekleyohannes; Mohammadsadegh Sadri; Christian Weis; Norbert Wehn; Martin Klein; Michael Siegrist
In recent years, connected component analysis (CCA) has become one of the vital image/video processing algorithms due to its wide-ranging applicability in the field of computer vision. Numerous applications such as pattern recognition, object detection and image segmentation involve connected component analysis. In the context of camera-based inspection systems, CCA plays an important role in quality assurance. State-of-the-art hardware architectures offer high performance implementations of CCA using field programmable gate arrays (FPGAs). However, due to their high memory demand, most of these implementations exhibit a large resource utilization. In this paper, we propose a hybrid software-hardware architecture of CCA for an industrial application using the Xilinx Zynq-7000 All Programmable System on Chip (SoC). By offloading the most resource-consuming part of the algorithm to the embedded CPU, we achieved high performance while reducing the required resources on the FPGA. Our proposed architecture saves more than 30% of on-chip memory (Block RAMs) compared to state-of-the-art hardware architectures without affecting the throughput. Furthermore, due to the embedded CPU, our system provides versatile and highly flexible feature extraction at run-time without the need to reconfigure the FPGA.
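For readers unfamiliar with CCA, the following is a minimal software sketch of the classic two-pass labeling scheme with a union-find table, using 4-connectivity. It only illustrates the general algorithm family; it is not the hybrid software-hardware architecture proposed in the paper, and the function and variable names are illustrative.

```python
# Classic two-pass connected component labeling (4-connectivity).
# Illustrative only; not the architecture described in the abstract.

def label_components(image):
    """Label 4-connected non-zero pixels of a 2D list.

    Returns (labels, count), where labels has the same shape as image,
    background pixels are 0 and components are numbered 1..count.
    """
    rows = len(image)
    cols = len(image[0]) if image else 0
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # union-find table; index 0 is reserved for background

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Pass 1: assign provisional labels and record equivalences.
    for r in range(rows):
        for c in range(cols):
            if not image[r][c]:
                continue
            up = labels[r - 1][c] if r > 0 else 0
            left = labels[r][c - 1] if c > 0 else 0
            if not up and not left:
                parent.append(len(parent))      # new provisional label
                labels[r][c] = len(parent) - 1
            elif up and left:
                root_up, root_left = find(up), find(left)
                low, high = min(root_up, root_left), max(root_up, root_left)
                parent[high] = low              # merge the two label sets
                labels[r][c] = low
            else:
                labels[r][c] = up or left

    # Pass 2: replace provisional labels by consecutive final labels.
    remap = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                labels[r][c] = remap.setdefault(root, len(remap) + 1)
    return labels, len(remap)


image = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
]
labels, count = label_components(image)
print(count)
for row in labels:
    print(row)
```

The equivalence table and the row of previous labels are the data structures that grow with image size, which gives a rough intuition for why memory demand matters in hardware implementations of CCA.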
Archive | 2015
Mohammadsadegh Sadri; Christian de Schryver; Norbert Wehn
The need for high performance computing dictates constraints on the acceptable bandwidth of data transfer between processing units and the memory. Consequently, it is crucial to build high performance, scalable, and energy efficient architectures capable of completing data transfer requests at satisfactory rates. Thanks to increased transfer rates obtained by exploiting high-speed serial data transfer links instead of traditional parallel ones, PCI Express provides a promising solution to the problem of connectivity for today's complex heterogeneous architectures. In this chapter, we first cover the principles of interfacing using PCI Express. To illustrate a practical situation, we select the Xilinx Zynq device and develop an example architecture which allows the x86 CPU cores of the host system, the ARM cores of the Zynq device, and the hardware accelerators directly realized on the FPGA fabric of the Zynq to share the available DRAM memory for efficient data sharing. We provide estimates on possible data transfer bandwidths in our architecture.
Integration | 2015
Mohammadsadegh Sadri; Andrea Bartolini; Luca Benini
Thermal effects are rapidly gaining importance in nanometer CMOS technologies. Increased power density, coupled with spatio-temporal variability of chip workloads, causes on-die temperature non-uniformities. The assumption of a uniform temperature for the delay and power analysis of a large CMOS circuit produces inaccurate results. For this reason, significant design margins are taken to ensure safe operation. To improve design quality, we need precise localization of hotspots at detailed spatial resolution, which is very computationally intensive. Consequently, thermal analysis needs to be done at multiple levels of granularity using a versatile thermal floorplan. We propose MiMAPT, an approach for analyzing delay, power and temperature in digital circuits. MiMAPT integrates seamlessly into major industrial Front-end and Back-end chip design flows. It accounts for temperature non-uniformities and self-heating while performing analysis. Thermal analysis is done at register-transfer (RT) and then gate-level considering non-regular shapes of on-die units with multiple scales of resolution and speed. To demonstrate the capability of MiMAPT in temperature-variation-aware delay/power estimation, a widely used IP block is chosen and four different chips are implemented using 65nm and 40nm (LVT, HVT) technology nodes. Different temperature patterns are then applied to the design. Accuracy improvements of up to 28% for static power and 16% for minimum clock period are reported in comparison with a uniform averaged temperature assumption. Evaluating the ability of MiMAPT in multi-scale thermal analysis, a speed-up of 98× is reported compared to the fine-grained method, while keeping false negatives at zero and the error of temperature estimation below 0.05 °C.
Highlights:
We propose MiMAPT, an approach for analyzing delay, power and temperature in digital circuits.
MiMAPT integrates seamlessly into major industrial Front-end and Back-end chip design flows.
We showed that the sensitivity of delay and power to temperature grows with scaling down the fabrication technology.
We quantified the estimation error that has to be expected when assuming a uniform temperature across the die.
MiMAPT is implemented in Python and is available as a completely stand-alone and fully functional software.
2015 IEEE Symposium on Computer Applications & Industrial Electronics (ISCAIE) | 2015
Christian Brugger; Lorenzo Dal'Aqua; Javier Alejandro Varela; Christian de Schryver; Mohammadsadegh Sadri; Norbert Wehn; Martin Klein; Michael Siegrist
The rapidly growing number of applications based on morphological operations in image processing and computer vision makes efficient implementations of these key blocks an important topic of research. Nevertheless, a detailed comparison of the energy efficiency and performance of these implementations that covers all available major hardware platforms is still missing. In this paper we evaluate the performance and power consumption of the most efficient available morphological image processing algorithms for CPU, GPU, and FPGA platforms in detail. In addition, we study the suitability of available morphological library units for high-level synthesis and compare the results with an optimized hand-coded FPGA implementation. We demonstrate that even high-end GPUs fall far short of the throughputs of modern CPUs and FPGAs. Our experimental results show that an FPGA implementation is 8-10 times more energy efficient for this application, while being comparable in speed to CPUs for large kernels.
International Workshop on Thermal Investigations of ICs and Systems | 2011
Mohammadsadegh Sadri; Andrea Bartolini; Luca Benini
International Workshop on Thermal Investigations of ICs and Systems | 2012
Mohammadsadegh Sadri; Andrea Bartolini; Luca Benini