Mahdi Nazm Bojnordi
University of Rochester
Publications
Featured research published by Mahdi Nazm Bojnordi.
design, automation, and test in europe | 2006
Mohammad Hosseinabady; Abbas Banaiyan; Mahdi Nazm Bojnordi; Zainalabedin Navabi
This paper proposes reusing on-chip networks for testing switches in networks on chip (NoCs). The proposed algorithm broadcasts test vectors to the switches through the on-chip network and detects faults by comparing the output responses of the switches with each other. This algorithm alleviates the need for (1) external comparison of the output response of the circuit under test with the response of a fault-free circuit stored on a tester, (2) on-chip signature analysis, and (3) a dedicated test bus to deliver test vectors and collect their responses. Experimental results on a few benchmark circuits compare the proposed algorithm with traditional system-on-chip (SoC) test methods.
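The mutual-comparison idea can be sketched in a few lines: if the same test vector is broadcast to every switch, a faulty switch is the one whose response disagrees with its peers. The Python sketch below uses a simple majority vote, which is an illustrative simplification of the paper's comparison scheme (function and data names are hypothetical):

```python
# Illustrative sketch (not the paper's exact algorithm): every switch
# receives the same broadcast test vector; a switch whose output response
# disagrees with the majority of its peers is flagged as faulty.
from collections import Counter

def detect_faulty_switches(responses):
    """responses: dict mapping switch id -> output response bits."""
    # The majority response serves as the fault-free reference, so no
    # golden response needs to be stored on an external tester.
    majority, _ = Counter(responses.values()).most_common(1)[0]
    return sorted(sid for sid, r in responses.items() if r != majority)

# Example: switch 2 produces a corrupted response to the broadcast vector.
responses = {0: "1011", 1: "1011", 2: "1111", 3: "1011"}
print(detect_faulty_switches(responses))  # [2]
```

This captures why the scheme needs neither an external golden response nor on-chip signature analysis: the fault-free switches vouch for each other.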
international symposium on computer architecture | 2012
Mahdi Nazm Bojnordi; Engin Ipek
Modern memory controllers employ sophisticated address mapping, command scheduling, and power management optimizations to alleviate the adverse effects of DRAM timing and resource constraints on system performance. A promising way of improving the versatility and efficiency of these controllers is to make them programmable---a proven technique that has seen wide use in other control tasks ranging from DMA scheduling to NAND Flash and directory control. Unfortunately, the stringent latency and throughput requirements of modern DDRx devices have rendered such programmability largely impractical, confining DDRx controllers to fixed-function hardware. This paper presents the instruction set architecture (ISA) and hardware implementation of PARDIS, a programmable memory controller that can meet the performance requirements of a high-speed DDRx interface. The proposed controller is evaluated by mapping previously proposed DRAM scheduling, address mapping, refresh scheduling, and power management algorithms onto PARDIS. Simulation results show that the average performance of PARDIS comes within 8% of fixed-function hardware for each of these techniques; moreover, by enabling application-specific optimizations, PARDIS improves system performance by 6--17% and reduces DRAM energy by 9--22% over four existing memory controllers.
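As a concrete example of the kind of policy a programmable controller could express, the sketch below implements first-ready, first-come-first-served (FR-FCFS) scheduling, a standard DRAM command scheduling policy. The request format and data structures are illustrative only and do not reflect the PARDIS ISA:

```python
# A minimal sketch of one scheduling policy a programmable memory
# controller could express: FR-FCFS, which prioritizes requests that
# hit a currently open DRAM row over older requests that would miss.

def frfcfs_pick(queue, open_rows):
    """queue: list of (arrival, bank, row) requests in arrival order.
    open_rows: dict mapping bank -> currently open row in that bank."""
    # First preference: oldest request whose row is already open (row hit).
    for req in queue:
        _, bank, row = req
        if open_rows.get(bank) == row:
            return req
    # Otherwise fall back to plain FCFS: serve the oldest request.
    return queue[0]

queue = [(0, 0, 7), (1, 1, 3), (2, 0, 5)]
open_rows = {0: 5, 1: 4}          # bank 0 currently has row 5 open
print(frfcfs_pick(queue, open_rows))  # (2, 0, 5): a row hit beats older misses
```

The point of programmability is that such a loop, or an application-specific variant of it, becomes firmware rather than fixed-function hardware.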
high-performance computer architecture | 2016
Mahdi Nazm Bojnordi; Engin Ipek
The Boltzmann machine is a massively parallel computational model capable of solving a broad class of combinatorial optimization problems. In recent years, it has been successfully applied to training deep machine learning models on massive datasets. High performance implementations of the Boltzmann machine using GPUs, MPI-based HPC clusters, and FPGAs have been proposed in the literature. Regrettably, the required all-to-all communication among the processing units limits the performance of these efforts. This paper examines a new class of hardware accelerators for large-scale combinatorial optimization and deep learning based on memristive Boltzmann machines. A massively parallel, memory-centric hardware accelerator is proposed based on recently developed resistive RAM (RRAM) technology. The proposed accelerator exploits the electrical properties of RRAM to realize in situ, fine-grained parallel computation within memory arrays, thereby eliminating the need for exchanging data between the memory cells and the computational units. Two classical optimization problems, graph partitioning and Boolean satisfiability, and a deep belief network application are mapped onto the proposed hardware. As compared to a multicore system, the proposed accelerator achieves 57x higher performance and 25x lower energy with virtually no loss in the quality of the solution to the optimization problems. The memristive accelerator is also compared against an RRAM based processing-in-memory (PIM) system, with respective performance and energy improvements of 6.89x and 5.2x.
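The computational model (though not the memristive hardware) can be illustrated with a toy Boltzmann-style annealer for max-cut, a classic combinatorial optimization problem closely related to graph partitioning. All parameters below (step count, cooling schedule, seed) are illustrative:

```python
import math, random

def anneal_maxcut(edges, n, steps=5000, t0=2.0, seed=0):
    """Toy Boltzmann-style annealer: find a +/-1 assignment of n nodes
    that maximizes the number of cut edges."""
    rng = random.Random(seed)
    spin = [rng.choice([-1, 1]) for _ in range(n)]
    cut = lambda: sum(spin[u] != spin[v] for u, v in edges)
    for step in range(steps):
        t = t0 * (1.0 - step / steps) + 1e-9   # linear cooling schedule
        i = rng.randrange(n)
        old = cut()
        spin[i] = -spin[i]                     # propose flipping one node
        delta = cut() - old
        # Boltzmann acceptance: keep improvements; keep setbacks only
        # with probability exp(delta / t), which vanishes as t cools.
        if delta < 0 and rng.random() >= math.exp(delta / t):
            spin[i] = -spin[i]                 # reject: undo the flip
    return spin, cut()

# Two triangles joined by one edge; the optimum cuts 5 of the 7 edges.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
spins, best = anneal_maxcut(edges, n=6)
print(best)
```

The accelerator's claim is that these flip-evaluate-accept updates can run in parallel inside the memory arrays instead of serially on a CPU.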
international symposium on microarchitecture | 2013
Mahdi Nazm Bojnordi; Engin Ipek
Increasing cache sizes in modern microprocessors require long wires to connect cache arrays to processor cores. As a result, the last-level cache (LLC) has become a major contributor to processor energy, necessitating techniques to increase the energy efficiency of data exchange over LLC interconnects. This paper presents an energy-efficient data exchange mechanism using synchronized counters. The key idea is to represent information by the delay between two consecutive pulses on a set of wires, which makes the number of state transitions on the interconnect independent of the data patterns, and significantly lowers the activity factor. Simulation results show that the proposed technique reduces overall processor energy by 7%, and the L2 cache energy by 1.81x, on a set of sixteen parallel applications. This efficiency gain is attained at a cost of less than 1% area overhead to the L2 cache and a 2% delay overhead to execution time.
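The encoding idea can be sketched directly: a k-bit symbol is conveyed as the time gap between two pulses, so every symbol costs the same number of transitions regardless of the data pattern. The timing model below (one idle sample per unit of delay) is a simplification for illustration, not the paper's circuit:

```python
# Sketch of delay-based signaling: a k-bit symbol is the gap between
# two pulses, making switching activity independent of data patterns.

def encode(value, k):
    """Return the waveform (list of 0/1 samples) for a k-bit value:
    a start pulse, 'value' idle cycles, then a stop pulse."""
    assert 0 <= value < 2 ** k
    return [1] + [0] * value + [1]

def decode(waveform):
    """Recover the value as the gap between the two pulses."""
    first = waveform.index(1)
    second = waveform.index(1, first + 1)
    return second - first - 1

for v in range(8):
    assert decode(encode(v, 3)) == v
# Every symbol produces exactly two pulses on the wire, so the
# transition count (and hence dynamic energy) is data-independent.
```

The trade-off visible even in this sketch is latency: larger values take more cycles to transmit, which is why the paper reports a small execution-time overhead.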
international conference on acoustics, speech, and signal processing | 2006
Mahdi Nazm Bojnordi; Omid Fatemi; Mahmoud Reza Hashemi
One of the main reasons behind the superior efficiency of the H.264/AVC video coding standard is the use of an in-loop deblocking filter. Since the deblocking filter is computation and data intensive, it has a profound impact on the speed degradation of both encoding and decoding processes. In this paper, we propose an efficient deblocking filter architecture that can be used as an IP core in either dedicated or platform-based H.264/AVC codec systems. A novel self-transposing memory unit is used to alleviate switching between the horizontal and vertical filtering modes. Moreover, to reduce processing latency, a two-stage pipelined architecture is designed for the 1-D filter that produces output data after 2 clock cycles. With a 100 MHz clock, the proposed design is able to process 1280×1024 (4:2:0) video at 25 frames per second. The proposed architecture offers 33% to 56% performance improvement compared to existing state-of-the-art architectures.
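The core of a 1-D deblocking step can be illustrated with a simplified filter. This is not the exact H.264/AVC filter (which adds clipping and boundary-strength logic); the thresholds alpha and beta are illustrative:

```python
# Simplified 1-D deblocking across a block edge: samples p1, p0 | q0, q1
# straddle the edge, and the boundary samples are smoothed only when the
# discontinuity is small enough to be a blocking artifact rather than a
# real image edge.  Clipping and boundary-strength logic are omitted.

def deblock_1d(p1, p0, q0, q1, alpha=20, beta=10):
    # Filter only weak discontinuities; strong ones are true edges.
    if abs(p0 - q0) >= alpha or abs(p1 - p0) >= beta or abs(q1 - q0) >= beta:
        return p0, q0
    # H.264-style correction term (without the standard's clipping).
    delta = ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3
    return p0 + delta, q0 - delta

print(deblock_1d(80, 82, 90, 91))    # artifact: boundary samples move closer
print(deblock_1d(80, 82, 150, 152))  # true edge: samples returned unchanged
```

Because this 1-D kernel must be applied both horizontally and vertically, the self-transposing memory described in the abstract avoids paying for two separate datapaths.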
canadian conference on electrical and computer engineering | 2006
Mahdi Nazm Bojnordi; Mahmoud Reza Hashemi; Omid Fatemi
H.264/AVC as the most recent video coding standard delivers significantly better performance compared to previous standards, supporting higher video quality over lower bit rate channels. The H.264 in-loop deblocking filter is one of the several complex techniques that have realized this superior coding quality. The deblocking filter is a computationally and data intensive tool resulting in increased execution time of both the encoding and decoding processes. In this paper, in order to reduce the deblocking complexity, we propose a new 2D deblocking filtering algorithm based on the existing 1D method of the H.264/AVC standard. Simulation results indicate that the proposed technique achieves a 40% speed improvement compared to the existing 1D H.264/AVC deblocking filter, while affecting the SNR by only 0.15% on average.
asia pacific conference on circuits and systems | 2006
Mahdi Nazm Bojnordi; Naser Sedaghati-Mokhtari; Omid Fatemi; Mahmoud Reza Hashemi
Many 2D data processing applications can be simplified by expressing them as 1D operations. Such tools, however, require applying both vertical and horizontal operations to the data blocks, so designers prefer a data-transposing unit over implementing separate datapaths for the horizontal and vertical directions. Hence, designing a cost-efficient and extensible transposing memory is a key issue for these applications. This paper proposes an efficient management strategy for SRAM modules that yields a self-transposing memory architecture (STMA). In addition to its lower cost compared to flip-flop-based buffers, the proposed architecture is more than 29% faster than conventional SRAM-based memory units. Simulations indicate that using the STMA in the H.264/AVC deblocking filter results in a 60% speed improvement.
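The benefit of a transposing memory comes from separability: a 2-D operation built from 1-D passes can reuse a single horizontal datapath if the block is transposed between passes. A minimal sketch, using an illustrative 3-tap averaging filter rather than any filter from the paper:

```python
# Why a transposing memory helps: filter rows, transpose, filter rows
# again, transpose back -- one 1-D datapath performs a full 2-D pass.

def filter_row(row):
    """1-D 3-tap averaging filter with edge replication."""
    padded = [row[0]] + row + [row[-1]]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) // 3
            for i in range(len(row))]

def transpose(block):
    return [list(col) for col in zip(*block)]

def filter_2d(block):
    # Horizontal pass, transpose, "vertical" pass (again as rows), undo.
    rows = [filter_row(r) for r in block]
    cols = [filter_row(r) for r in transpose(rows)]
    return transpose(cols)

block = [[10, 10, 90, 90],
         [10, 10, 90, 90],
         [90, 90, 10, 10],
         [90, 90, 10, 10]]
print(filter_2d(block))
```

In hardware, the `transpose` step is exactly what the STMA provides in place, instead of a second vertical filter or an explicit copy through flip-flop buffers.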
IEEE Transactions on Very Large Scale Integration Systems | 2016
Yuxin Bai; Yanwei Song; Mahdi Nazm Bojnordi; Alexander E. Shapiro; Eby G. Friedman; Engin Ipek
This paper explores the use of MOS current-mode logic (MCML) as a fast and low noise alternative to static CMOS circuits in microprocessors, thereby improving the performance, energy efficiency, and signal integrity of future computer systems. The power and ground noise generated by an MCML circuit is typically 10-100× smaller than the noise generated by a static CMOS circuit. Unlike static CMOS, whose dominant dynamic power is proportional to the frequency, MCML circuits dissipate a constant power independent of clock frequency. Although these traits make MCML highly energy efficient when operating at high speeds, the constant static power of MCML poses a challenge for a microarchitecture that operates at a modest clock rate with a low activity factor. To address this challenge, a single-core microarchitecture for MCML is explored that exploits the C-slow retiming technique, and operates at a high frequency with low complexity to save energy. This design principle contrasts with the contemporary multicore design paradigm for static CMOS that relies on a large number of gates operating in parallel at modest speeds. The proposed architecture generates 10-40× lower power and ground noise, and operates within 13% of the performance (i.e., 1/ExecutionTime) of a conventional, eight-core static CMOS processor while exhibiting 1.6× lower energy and 9% less area. Moreover, the operation of an MCML processor is robust under both systematic and random variations in transistor threshold voltage and effective channel length.
international conference on computer design | 2015
Yuxin Bai; Yanwei Song; Mahdi Nazm Bojnordi; Alexander E. Shapiro; Engin Ipek; Eby G. Friedman
Near-threshold computing (NTC) is an effective technique for improving the energy efficiency of a CMOS microprocessor, but suffers from a significant performance loss and an increased sensitivity to voltage noise. MOS current-mode logic (MCML), a differential logic family, maintains a low voltage swing and a constant current, making it inherently fast and low-noise. These traits make MCML a natural choice to implement an NTC processor; however, MCML suffers from a high static power regardless of the clock frequency or the level of switching activity, which would result in an inordinate energy consumption in a large scale IC. To address this challenge, this paper explores a single-core microarchitecture for MCML that takes advantage of the C-slow retiming technique, and runs at a high frequency with low complexity to save energy. This design principle is opposite to the contemporary multicore design paradigm for static CMOS that relies on a large number of gates running in parallel at modest speeds. When compared to an eight-core static CMOS processor operating in the near-threshold regime, the proposed processor exhibits 3x higher performance, 2x lower energy, and 10x lower voltage noise, while maintaining a similar level of power dissipation.
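C-slow retiming can be modeled functionally: each pipeline register is replicated C times, so C independent instruction streams are interleaved round-robin through one datapath, trading single-thread latency for clock rate and aggregate throughput. The toy model below simulates only this interleaving behavior (the stage functions and C = 3 are illustrative, not circuit-level retiming):

```python
# Functional toy model of C-slow retiming: C independent streams share
# one pipeline, each advancing on every C-th cycle.
from collections import deque

def c_slow_run(streams, stages):
    """streams: one input list per interleaved 'thread'.
    stages: list of per-stage functions forming the pipeline."""
    C = len(streams)
    inputs = [deque(s) for s in streams]
    outputs = [[] for _ in range(C)]
    # Pipeline registers, replicated C times (one slot per thread).
    pipe = [[None] * C for _ in stages]
    total_cycles = (max(map(len, streams)) + len(stages)) * C
    for cycle in range(total_cycles):
        t = cycle % C                        # whose turn this cycle is
        # Drain the last stage, then shift this thread's state forward.
        if pipe[-1][t] is not None:
            outputs[t].append(pipe[-1][t])
        for s in range(len(stages) - 1, 0, -1):
            v = pipe[s - 1][t]
            pipe[s][t] = stages[s](v) if v is not None else None
        x = inputs[t].popleft() if inputs[t] else None
        pipe[0][t] = stages[0](x) if x is not None else None
    return outputs

# Three streams share a two-stage pipeline computing (x + 1) * 2.
out = c_slow_run([[1, 2], [10, 20], [100, 200]],
                 [lambda x: x + 1, lambda x: x * 2])
print(out)  # [[4, 6], [22, 42], [202, 402]]
```

Each thread still waits C cycles between its own steps, which is why C-slowing raises throughput and clock rate rather than single-thread latency, matching the paper's high-frequency, low-complexity design point.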
ACM Transactions on Computer Systems | 2013
Mahdi Nazm Bojnordi; Engin Ipek
Modern memory controllers employ sophisticated address mapping, command scheduling, and power management optimizations to alleviate the adverse effects of DRAM timing and resource constraints on system performance. A promising way of improving the versatility and efficiency of these controllers is to make them programmable - a proven technique that has seen wide use in other control tasks ranging from DMA scheduling to NAND Flash and directory control. Unfortunately, the stringent latency and throughput requirements of modern DDRx devices have rendered such programmability largely impractical, confining DDRx controllers to fixed-function hardware. This paper presents the instruction set architecture (ISA) and hardware implementation of PARDIS, a programmable memory controller that can meet the performance requirements of a high-speed DDRx interface. The proposed controller is evaluated by mapping previously proposed DRAM scheduling, address mapping, refresh scheduling, and power management algorithms onto PARDIS. Simulation results show that the average performance of PARDIS comes within 8% of fixed-function hardware for each of these techniques; moreover, by enabling application-specific optimizations, PARDIS improves system performance by 6-17% and reduces DRAM energy by 9-22% over four existing memory controllers.