Nasim Farahini
Royal Institute of Technology
Publication
Featured research published by Nasim Farahini.
international symposium on quality electronic design | 2013
Syed Mohammad Asad Hassan Jafri; Ozan Bag; Ahmed Hemani; Nasim Farahini; Kolin Paul; Juha Plosila; Hannu Tenhunen
This paper presents a self-adaptive architecture to enhance the energy efficiency of coarse-grained reconfigurable architectures (CGRAs). Today, platforms host multiple applications with arbitrary inter-application communication and concurrency patterns. Each application can itself have multiple versions (implementations with different degrees of parallelism), and the optimal version can only be determined at runtime. For such scenarios, traditional worst-case designs and compile-time mapping decisions are neither optimal nor desirable. Existing solutions to this problem employ costly dedicated hardware to configure the operating point at runtime using dynamic voltage and frequency scaling (DVFS). As an alternative to dedicated hardware, we propose exploiting the reconfiguration features of modern CGRAs. Our solution relies on dynamically reconfigurable isolation cells (DRICs) and an autonomous parallelism, voltage, and frequency selection algorithm (APVFS). The DRICs reduce the overheads of DVFS circuitry by configuring the existing resources as isolation cells. APVFS ensures high efficiency by dynamically selecting the parallelism, voltage, and frequency trio that consumes minimum power while meeting the deadlines on the available resources. Simulation results using representative applications (matrix multiplication, FIR, and FFT) showed up to 23% and 51% reduction in power and energy, respectively, compared to traditional DVFS designs. Synthesis results confirmed a significant reduction in area overheads compared to state-of-the-art DVFS methods.
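The selection step described in the abstract can be pictured as a small search over candidate operating points. The sketch below is only an illustration under stated assumptions, not the APVFS algorithm from the paper: the function name, the ideal-speedup timing model, and the dynamic power model P = p · C · V² · f are all hypothetical choices made for the example.

```python
from itertools import product


def select_operating_point(work_cycles, deadline_s, versions, vf_points,
                           free_units, c_eff=1.0):
    """Pick the (parallelism, voltage, frequency) trio with minimum power.

    Hypothetical model, not taken from the paper:
      - execution time = work_cycles / (parallelism * frequency)
      - power          = parallelism * c_eff * voltage**2 * frequency
    Returns None when no candidate meets the deadline.
    """
    best = None
    for p, (v, f) in product(versions, vf_points):
        if p > free_units:
            continue  # this version needs more resources than are free
        exec_time = work_cycles / (p * f)
        if exec_time > deadline_s:
            continue  # deadline would be missed
        power = p * c_eff * v ** 2 * f
        if best is None or power < best[0]:
            best = (power, p, v, f)
    return None if best is None else best[1:]


# Illustrative inputs: three versions and three voltage/frequency pairs.
VERSIONS = [1, 2, 4]
VF_POINTS = [(0.8, 100e6), (1.0, 200e6), (1.2, 400e6)]
print(select_operating_point(3e6, 0.02, VERSIONS, VF_POINTS, free_units=4))
```

With these made-up inputs, the two-way parallel version at the lowest voltage wins, because running more units slowly costs less power under the assumed V² · f model than running one unit fast.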
international conference on hardware/software codesign and system synthesis | 2013
Shuo Li; Nasim Farahini; Ahmed Hemani; Kathrin Rosvall; Ingo Sander
SYLVA is a system-level synthesis framework that transforms DSP sub-systems modeled as synchronous data flow into hardware implementations in ASICs, FPGAs, or CGRAs. SYLVA synthesizes in terms of pre-characterized function implementations (FIMPs). It explores the design space in three dimensions: the number of FIMPs, the type of FIMPs, and the pipeline parallelism between producing and consuming FIMPs. We introduce a timing and interface model of FIMPs to enable reuse and the automatic generation of the Global Interconnect and Control (GLIC) that glues the FIMPs together into a working system. SYLVA has been evaluated on five realistic DSP applications. The results are analyzed for design space exploration, for efficacy in generating the GLIC (by comparison with manually generated GLIC), and for accuracy of design space exploration (by comparing the area and energy costs considered during exploration, which are based on pre-characterized FIMPs, with the final results).
international symposium on circuits and systems | 2013
Nasim Farahini; Shuo Li; Muhammad Adeel Tajammul; Muhammad Ali Shami; Guo Chen; Ahmed Hemani; Wei Ye
This paper presents an industrial case study of using a Coarse Grain Reconfigurable Architecture (CGRA) as a multi-mode accelerator for two kernels, FFT for the LTE standard and the Correlation Pool for the UMTS standard, which are executed in a mutually exclusive manner. The CGRA multi-mode accelerator achieved a computational efficiency of 39.94 GOPS/watt (where an OP is a multiply-add) and a silicon efficiency of 56.20 GOPS/mm². By analyzing the code and inferring the unused features of the fully programmable solution, an in-house developed tool automatically customized the design to run just the two kernels, and the two efficiency metrics improved to 49.05 GOPS/watt and 107.57 GOPS/mm². The corresponding numbers for the ASIC implementation are 63.84 GOPS/watt and 90.91 GOPS/mm². Though the ASIC's silicon and computational efficiency numbers are slightly better, the engineering efficiency of the pre-verified, pre-characterized CGRA solution is at least 10X better than that of the ASIC solution.
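To put the reported figures side by side, the short sketch below computes efficiency ratios between the three design points. Only the six numbers come from the abstract; the dictionary keys and the helper name are hypothetical, and the ratios are simple arithmetic rather than results stated in the paper.

```python
# (computational efficiency in GOPS/watt, silicon efficiency in GOPS/mm^2),
# copied from the abstract; the labels are illustrative names only.
EFFICIENCY = {
    "cgra_programmable": (39.94, 56.20),
    "cgra_customized": (49.05, 107.57),
    "asic": (63.84, 90.91),
}


def efficiency_ratio(design, reference):
    """Return (computational, silicon) efficiency of design relative to reference."""
    comp, sil = EFFICIENCY[design]
    ref_comp, ref_sil = EFFICIENCY[reference]
    return round(comp / ref_comp, 2), round(sil / ref_sil, 2)


# Gain from customizing the fully programmable CGRA.
print(efficiency_ratio("cgra_customized", "cgra_programmable"))
# Customized CGRA relative to the ASIC implementation.
print(efficiency_ratio("cgra_customized", "asic"))
```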
asia and south pacific design automation conference | 2014
Nasim Farahini; Ahmed Hemani; Anders Lansner; Fabian Clermidy; Christer Svensson
A multi-chip custom digital supercomputer called eBrain is proposed for simulating the Bayesian Confidence Propagation Neural Network (BCPNN) model of the human brain. It stores synaptic weights in Hybrid Memory Cubes (HMCs), 3D-stacked DRAM memories that are integrated with a custom-designed logic chip implementing the BCPNN model. In a 22 nm node, eBrain executes BCPNN in real time at 740 TFlop/s while accessing 30 TB of synaptic weights with a bandwidth of 112 TB/s and consuming less than 6 kW of power in the typical case. This efficiency is three orders of magnitude better than that of general-purpose supercomputers in the same technology node.
Microprocessors and Microsystems | 2014
Nasim Farahini; Ahmed Hemani; Hassan Sohofi; Syed Mohammad Asad Hassan Jafri; Muhammad Adeel Tajammul; Kolin Paul
This paper presents a hardware-based solution for a scalable runtime address generation scheme for DSP applications mapped to a parallel, distributed coarse grain reconfigurable computation and storage fabric. The scheme can also deal with non-affine functions of multiple variables, which typically correspond to multiple nested loops. The key innovation is the judicious use of two categories of address generation resources. The first category is the low-cost address generation unit (AGU), which generates addresses within given address bounds for affine functions of up to two variables. Such low-cost AGUs are distributed and associated with every read/write port in the distributed memory architecture. The second category is more complex; it is also distributed, but shared among a few storage units. It handles more complex address generation requirements, such as dynamically computing the address bounds that are then used to configure the AGUs, and transforming non-affine functions into affine functions by computing the affine factor outside the loop. The runtime computation of the address constraints incurs a negligibly small overhead in latency, area, and energy, while it substantially improves program storage, reconfiguration agility, and energy compared to the prevalent pre-computation of address constraints. The efficacy of the proposed method has been validated against the prevalent address generation schemes for a set of six realistic DSP functions. Compared to the pre-computation method, the proposed solution achieved 75% average code compaction; compared to the centralized runtime address generation scheme, it achieved a 32.7% average performance improvement.
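The split between the two resource categories can be illustrated in software. The sketch below is a hypothetical model, not the hardware described in the paper: `agu_addresses` stands in for a low-cost AGU that evaluates an affine function of up to two loop variables, and `packed_lower_triangle_addresses` plays the role of the shared resource that computes a non-affine row base and a runtime bound outside the inner loop before configuring the AGU. The packed lower-triangular layout is an example chosen here, not one taken from the paper.

```python
def agu_addresses(base, count_i, stride_i, count_j=1, stride_j=0):
    """Model of a low-cost AGU: affine address function of two loop variables.

    Generates base + i * stride_i + j * stride_j for all i, j within bounds.
    """
    return [base + i * stride_i + j * stride_j
            for i in range(count_i)
            for j in range(count_j)]


def packed_lower_triangle_addresses(n):
    """Walk a packed lower-triangular matrix with n rows, row by row.

    The start of row k is k * (k + 1) // 2, which is non-affine in k, and
    the row length k + 1 is a bound known only per outer iteration. Both
    are computed outside the inner loop, so the AGU itself only ever sees
    an affine function with fixed bounds.
    """
    addresses = []
    for k in range(n):
        row_base = k * (k + 1) // 2  # non-affine term, hoisted out of the loop
        row_len = k + 1              # address bound computed at runtime
        addresses.extend(agu_addresses(row_base, row_len, 1))
    return addresses


print(agu_addresses(100, 2, 8, 3, 1))
print(packed_lower_triangle_addresses(4))
```

The design point this mirrors is that the expensive computation runs once per outer iteration, while the per-address work stays within what a simple AGU can do.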
digital systems design | 2013
Nasim Farahini; Ahmed Hemani; Kolin Paul
This paper presents a hardware solution for the runtime computation of loop constraints and synchronizing delays for multiple inner loops in parallel, distributed implementations of digital signal processing sub-systems. Methods to map and generate the runtime computation code for loop constraints and synchronizing delays are also presented. Compared to traditional methods, the proposed solution achieves 55% average code compaction and a 32.7% average performance improvement. The solution has a modest hardware cost that increases linearly with the dimension of the architecture, and it has no performance penalty. Results from multiple realistic examples are presented, analyzed, and compared to the traditional methods.
international conference on embedded computer systems architectures modeling and simulation | 2015
Nasim Farahini; Ahmed Hemani; Hassan Sohofi; Shuo Li
In spite of decades of research, only a small percentage of hardware is designed using high-level synthesis because of the large gap between the abstraction levels of standard cells and the algorithmic level. We propose a grid-based regular physical design platform composed of large-grain hardened building blocks called SiLago blocks. This platform is divided into regions that are specialized for different functionalities such as computation, storage, and system control. The characterized micro-architectural operations of the SiLago platform serve as the interface to a meet-in-the-middle high-level and system-level synthesis framework. This framework was used to generate three hardware macro instances, derived from the SiLago platform, for three applications from the signal processing domain. Results show two orders of magnitude improvement in the efficiency of system-level design space exploration and in synthesis time, with an average loss in design quality of 18% for energy and 54% for area compared to the commercial SOC flow.
asia and south pacific design automation conference | 2014
Anders Lansner; Ahmed Hemani; Nasim Farahini
We estimate the computational capacity required to simulate in real time the neural information processing in the human brain. We show that the computational demands of a detailed implementation are beyond the reach of current technology, but that some biologically plausible reductions of problem complexity can give performance gains of between two and six orders of magnitude, which puts implementations within reach of tomorrow's technology.
field-programmable custom computing machines | 2013
Shuo Li; Nasim Farahini; Ahmed Hemani
SYLVA is a System Level Architectural Synthesis Framework that translates Synchronous Data Flow (SDF) models of DSP sub-systems, such as modems and codecs, into hardware implementations in ASICs/standard cells, FPGAs, or CGRAs (Coarse Grain Reconfigurable Architectures).
application-specific systems, architectures, and processors | 2015
Nasim Farahini; Ahmed Hemani
The increasing demand for higher image resolution and communication bandwidth requires streaming applications to deal with ever-increasing dataset sizes. Further, with technology scaling, the cost of moving data is falling more slowly than the cost of computing. These trends have motivated the proposed micro-architectural reorganization of stream processors, which divides the stream computation into functional computation, address constraints computation, and address generation, and deploys independent, distributed micro-threads to implement them. This scheme is an alternative to parallelizing them at the instruction level. The proposed scheme has two benefits: more efficient sequencer logic, and energy savings in address generation and transportation. These benefits are quantified for a set of streaming applications and show average improvements of 39% in the silicon efficiency of the sequencer logic and 23% in total computational efficiency.