Mahdi Nazm Bojnordi
University of Rochester
Publications
Featured research published by Mahdi Nazm Bojnordi.
design, automation, and test in europe | 2006
Mohammad Hosseinabady; Abbas Banaiyan; Mahdi Nazm Bojnordi; Zainalabedin Navabi
This paper proposes reusing on-chip networks for testing switches in networks on chip (NoCs). The proposed algorithm broadcasts test vectors to the switches through the on-chip network and detects faults by comparing the output responses of the switches with each other. This algorithm alleviates the need for (1) external comparison of the output response of the circuit under test with the response of a fault-free circuit stored on a tester, (2) on-chip signature analysis, and (3) a dedicated test bus to deliver test vectors and collect their responses. Experimental results on a few benchmark circuits compare the proposed algorithm with traditional system-on-chip (SoC) test methods.
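The mutual-comparison idea can be sketched in a few lines: if the same test vector is broadcast to every switch, a faulty switch is the one whose response disagrees with its peers. The Python sketch below uses a simple majority vote, which is an illustrative simplification of the paper's comparison scheme (function and data names are hypothetical):

```python
# Illustrative sketch (not the paper's exact algorithm): every switch
# receives the same broadcast test vector; a switch whose output response
# disagrees with the majority of its peers is flagged as faulty.
from collections import Counter

def detect_faulty_switches(responses):
    """responses: dict mapping switch id -> output response bits."""
    # The majority response serves as the fault-free reference, so no
    # golden response needs to be stored on an external tester.
    majority, _ = Counter(responses.values()).most_common(1)[0]
    return sorted(sid for sid, r in responses.items() if r != majority)

# Example: switch 2 produces a corrupted response to the broadcast vector.
responses = {0: "1011", 1: "1011", 2: "1111", 3: "1011"}
print(detect_faulty_switches(responses))  # [2]
```

This captures why the scheme needs neither an external golden response nor on-chip signature analysis: the fault-free switches vouch for each other.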
international symposium on computer architecture | 2012
Mahdi Nazm Bojnordi; Engin Ipek
Modern memory controllers employ sophisticated address mapping, command scheduling, and power management optimizations to alleviate the adverse effects of DRAM timing and resource constraints on system performance. A promising way of improving the versatility and efficiency of these controllers is to make them programmable---a proven technique that has seen wide use in other control tasks ranging from DMA scheduling to NAND Flash and directory control. Unfortunately, the stringent latency and throughput requirements of modern DDRx devices have rendered such programmability largely impractical, confining DDRx controllers to fixed-function hardware. This paper presents the instruction set architecture (ISA) and hardware implementation of PARDIS, a programmable memory controller that can meet the performance requirements of a high-speed DDRx interface. The proposed controller is evaluated by mapping previously proposed DRAM scheduling, address mapping, refresh scheduling, and power management algorithms onto PARDIS. Simulation results show that the average performance of PARDIS comes within 8% of fixed-function hardware for each of these techniques; moreover, by enabling application-specific optimizations, PARDIS improves system performance by 6--17% and reduces DRAM energy by 9--22% over four existing memory controllers.
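As a concrete example of the kind of policy a programmable controller could express, the sketch below implements first-ready, first-come-first-served (FR-FCFS) scheduling, a standard DRAM command scheduling policy. The request format and data structures are illustrative only and do not reflect the PARDIS ISA:

```python
# A minimal sketch of one scheduling policy a programmable memory
# controller could express: FR-FCFS, which prioritizes requests that
# hit a currently open DRAM row over older requests that would miss.

def frfcfs_pick(queue, open_rows):
    """queue: list of (arrival, bank, row) requests in arrival order.
    open_rows: dict mapping bank -> currently open row in that bank."""
    # First preference: oldest request whose row is already open (row hit).
    for req in queue:
        _, bank, row = req
        if open_rows.get(bank) == row:
            return req
    # Otherwise fall back to plain FCFS: serve the oldest request.
    return queue[0]

queue = [(0, 0, 7), (1, 1, 3), (2, 0, 5)]
open_rows = {0: 5, 1: 4}          # bank 0 currently has row 5 open
print(frfcfs_pick(queue, open_rows))  # (2, 0, 5): a row hit beats older misses
```

The point of programmability is that such a loop, or an application-specific variant of it, becomes firmware rather than fixed-function hardware.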
high-performance computer architecture | 2016
Mahdi Nazm Bojnordi; Engin Ipek
The Boltzmann machine is a massively parallel computational model capable of solving a broad class of combinatorial optimization problems. In recent years, it has been successfully applied to training deep machine learning models on massive datasets. High performance implementations of the Boltzmann machine using GPUs, MPI-based HPC clusters, and FPGAs have been proposed in the literature. Regrettably, the required all-to-all communication among the processing units limits the performance of these efforts. This paper examines a new class of hardware accelerators for large-scale combinatorial optimization and deep learning based on memristive Boltzmann machines. A massively parallel, memory-centric hardware accelerator is proposed based on recently developed resistive RAM (RRAM) technology. The proposed accelerator exploits the electrical properties of RRAM to realize in situ, fine-grained parallel computation within memory arrays, thereby eliminating the need for exchanging data between the memory cells and the computational units. Two classical optimization problems, graph partitioning and Boolean satisfiability, and a deep belief network application are mapped onto the proposed hardware. As compared to a multicore system, the proposed accelerator achieves 57x higher performance and 25x lower energy with virtually no loss in the quality of the solution to the optimization problems. The memristive accelerator is also compared against an RRAM based processing-in-memory (PIM) system, with respective performance and energy improvements of 6.89x and 5.2x.
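The computational model (though not the memristive hardware) can be illustrated with a toy Boltzmann-style annealer for max-cut, a classic combinatorial optimization problem closely related to graph partitioning. All parameters below (step count, cooling schedule, seed) are illustrative:

```python
import math, random

def anneal_maxcut(edges, n, steps=5000, t0=2.0, seed=0):
    """Toy Boltzmann-style annealer: find a +/-1 assignment of n nodes
    that maximizes the number of cut edges."""
    rng = random.Random(seed)
    spin = [rng.choice([-1, 1]) for _ in range(n)]
    cut = lambda: sum(spin[u] != spin[v] for u, v in edges)
    for step in range(steps):
        t = t0 * (1.0 - step / steps) + 1e-9   # linear cooling schedule
        i = rng.randrange(n)
        old = cut()
        spin[i] = -spin[i]                     # propose flipping one node
        delta = cut() - old
        # Boltzmann acceptance: keep improvements; keep setbacks only
        # with probability exp(delta / t), which vanishes as t cools.
        if delta < 0 and rng.random() >= math.exp(delta / t):
            spin[i] = -spin[i]                 # reject: undo the flip
    return spin, cut()

# Two triangles joined by one edge; the optimum cuts 5 of the 7 edges.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
spins, best = anneal_maxcut(edges, n=6)
print(best)
```

The accelerator's claim is that these flip-evaluate-accept updates can run in parallel inside the memory arrays instead of serially on a CPU.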
international symposium on microarchitecture | 2013
Mahdi Nazm Bojnordi; Engin Ipek
Increasing cache sizes in modern microprocessors require long wires to connect cache arrays to processor cores. As a result, the last-level cache (LLC) has become a major contributor to processor energy, necessitating techniques to increase the energy efficiency of data exchange over LLC interconnects. This paper presents an energy-efficient data exchange mechanism using synchronized counters. The key idea is to represent information by the delay between two consecutive pulses on a set of wires, which makes the number of state transitions on the interconnect independent of the data patterns, and significantly lowers the activity factor. Simulation results show that the proposed technique reduces overall processor energy by 7%, and the L2 cache energy by 1.81x, on a set of sixteen parallel applications. This efficiency gain is attained at a cost of less than 1% area overhead to the L2 cache and a 2% delay overhead to execution time.
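The encoding idea can be sketched directly: a k-bit symbol is conveyed as the time gap between two pulses, so every symbol costs the same number of transitions regardless of the data pattern. The timing model below (one idle sample per unit of delay) is a simplification for illustration, not the paper's circuit:

```python
# Sketch of delay-based signaling: a k-bit symbol is the gap between
# two pulses, making switching activity independent of data patterns.

def encode(value, k):
    """Return the waveform (list of 0/1 samples) for a k-bit value:
    a start pulse, 'value' idle cycles, then a stop pulse."""
    assert 0 <= value < 2 ** k
    return [1] + [0] * value + [1]

def decode(waveform):
    """Recover the value as the gap between the two pulses."""
    first = waveform.index(1)
    second = waveform.index(1, first + 1)
    return second - first - 1

for v in range(8):
    assert decode(encode(v, 3)) == v
# Every symbol produces exactly two pulses on the wire, so the
# transition count (and hence dynamic energy) is data-independent.
```

The trade-off visible even in this sketch is latency: larger values take more cycles to transmit, which is why the paper reports a small execution-time overhead.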
international conference on acoustics, speech, and signal processing | 2006
Mahdi Nazm Bojnordi; Omid Fatemi; Mahmoud Reza Hashemi
One of the main reasons behind the superior efficiency of the H.264/AVC video coding standard is the use of an in-loop deblocking filter. Since the deblocking filter is computation and data intensive, it has a profound impact on the speed degradation of both encoding and decoding processes. In this paper, we propose an efficient deblocking filter architecture that can be used as an IP core in either dedicated or platform-based H.264/AVC codec systems. A novel self-transposing memory unit is used to alleviate switching between the horizontal and vertical filtering modes. Moreover, to reduce processing latency, a two-stage pipelined architecture is designed for the 1-D filter that produces output data after 2 clock cycles. With a 100 MHz clock, the proposed design is able to process 1280×1024 (4:2:0) video at 25 frames per second. The proposed architecture offers 33% to 56% performance improvement compared to existing state-of-the-art architectures.
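The core of a 1-D deblocking step can be illustrated with a simplified filter. This is not the exact H.264/AVC filter (which adds clipping and boundary-strength logic); the thresholds alpha and beta are illustrative:

```python
# Simplified 1-D deblocking across a block edge: samples p1, p0 | q0, q1
# straddle the edge, and the boundary samples are smoothed only when the
# discontinuity is small enough to be a blocking artifact rather than a
# real image edge.  Clipping and boundary-strength logic are omitted.

def deblock_1d(p1, p0, q0, q1, alpha=20, beta=10):
    # Filter only weak discontinuities; strong ones are true edges.
    if abs(p0 - q0) >= alpha or abs(p1 - p0) >= beta or abs(q1 - q0) >= beta:
        return p0, q0
    # H.264-style correction term (without the standard's clipping).
    delta = ((q0 - p0) * 4 + (p1 - q1) + 4) >> 3
    return p0 + delta, q0 - delta

print(deblock_1d(80, 82, 90, 91))    # artifact: boundary samples move closer
print(deblock_1d(80, 82, 150, 152))  # true edge: samples returned unchanged
```

Because this 1-D kernel must be applied both horizontally and vertically, the self-transposing memory described in the abstract avoids paying for two separate datapaths.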
canadian conference on electrical and computer engineering | 2006
Mahdi Nazm Bojnordi; Mahmoud Reza Hashemi; Omid Fatemi
H.264/AVC as the most recent video coding standard delivers significantly better performance compared to previous standards, supporting higher video quality over lower bit rate channels. The H.264 in-loop deblocking filter is one of the several complex techniques that have realized this superior coding quality. The deblocking filter is a computationally and data intensive tool resulting in increased execution time of both the encoding and decoding processes. In this paper, in order to reduce the deblocking complexity, we propose a new 2D deblocking filtering algorithm based on the existing 1D method of the H.264/AVC standard. Simulation results indicate that the proposed technique achieves a 40% speed improvement compared to the existing 1D H.264/AVC deblocking filter, while affecting the SNR by only 0.15% on average.
asia pacific conference on circuits and systems | 2006
Mahdi Nazm Bojnordi; Naser Sedaghati-Mokhtari; Omid Fatemi; Mahmoud Reza Hashemi
Many 2D data processing applications can be simplified by expressing them as 1D operations. Such tools, however, require applying both vertical and horizontal operations to the data blocks, so designers prefer a data-transposing unit over implementing separate datapaths for the horizontal and vertical directions. Hence, designing a cost-efficient and extensible transposing memory is a key issue for these applications. This paper proposes an efficient management strategy for SRAM modules that yields a self-transposing memory architecture (STMA). In addition to its lower cost compared to flip-flop-based buffers, the proposed architecture is more than 29% faster than conventional SRAM-based memory units. Simulations indicate that using the STMA in the H.264/AVC deblocking filter results in a 60% speed improvement.
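The benefit of a transposing memory comes from separability: a 2-D operation built from 1-D passes can reuse a single horizontal datapath if the block is transposed between passes. A minimal sketch, using an illustrative 3-tap averaging filter rather than any filter from the paper:

```python
# Why a transposing memory helps: filter rows, transpose, filter rows
# again, transpose back -- one 1-D datapath performs a full 2-D pass.

def filter_row(row):
    """1-D 3-tap averaging filter with edge replication."""
    padded = [row[0]] + row + [row[-1]]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) // 3
            for i in range(len(row))]

def transpose(block):
    return [list(col) for col in zip(*block)]

def filter_2d(block):
    # Horizontal pass, transpose, "vertical" pass (again as rows), undo.
    rows = [filter_row(r) for r in block]
    cols = [filter_row(r) for r in transpose(rows)]
    return transpose(cols)

block = [[10, 10, 90, 90],
         [10, 10, 90, 90],
         [90, 90, 10, 10],
         [90, 90, 10, 10]]
print(filter_2d(block))
```

In hardware, the `transpose` step is exactly what the STMA provides in place, instead of a second vertical filter or an explicit copy through flip-flop buffers.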
IEEE Transactions on Very Large Scale Integration Systems | 2016
Yuxin Bai; Yanwei Song; Mahdi Nazm Bojnordi; Alexander E. Shapiro; Eby G. Friedman; Engin Ipek
This paper explores the use of MOS current-mode logic (MCML) as a fast and low noise alternative to static CMOS circuits in microprocessors, thereby improving the performance, energy efficiency, and signal integrity of future computer systems. The power and ground noise generated by an MCML circuit is typically 10-100× smaller than the noise generated by a static CMOS circuit. Unlike static CMOS, whose dominant dynamic power is proportional to the frequency, MCML circuits dissipate a constant power independent of clock frequency. Although these traits make MCML highly energy efficient when operating at high speeds, the constant static power of MCML poses a challenge for a microarchitecture that operates at a modest clock rate with a low activity factor. To address this challenge, a single-core microarchitecture for MCML is explored that exploits the C-slow retiming technique, and operates at a high frequency with low complexity to save energy. This design principle contrasts with the contemporary multicore design paradigm for static CMOS that relies on a large number of gates operating in parallel at modest speeds. The proposed architecture generates 10-40× lower power and ground noise, and operates within 13% of the performance (i.e., 1/ExecutionTime) of a conventional, eight-core static CMOS processor while exhibiting 1.6× lower energy and 9% less area. Moreover, the operation of an MCML processor is robust under both systematic and random variations in transistor threshold voltage and effective channel length.
international conference on computer design | 2015
Yuxin Bai; Yanwei Song; Mahdi Nazm Bojnordi; Alexander E. Shapiro; Engin Ipek; Eby G. Friedman
Near-threshold computing (NTC) is an effective technique for improving the energy efficiency of a CMOS microprocessor, but suffers from a significant performance loss and an increased sensitivity to voltage noise. MOS current-mode logic (MCML), a differential logic family, maintains a low voltage swing and a constant current, making it inherently fast and low-noise. These traits make MCML a natural choice to implement an NTC processor; however, MCML suffers from a high static power regardless of the clock frequency or the level of switching activity, which would result in an inordinate energy consumption in a large scale IC. To address this challenge, this paper explores a single-core microarchitecture for MCML that takes advantage of the C-slow retiming technique, and runs at a high frequency with low complexity to save energy. This design principle is opposite to the contemporary multicore design paradigm for static CMOS that relies on a large number of gates running in parallel at modest speeds. When compared to an eight-core static CMOS processor operating in the near-threshold regime, the proposed processor exhibits 3x higher performance, 2x lower energy, and 10x lower voltage noise, while maintaining a similar level of power dissipation.
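C-slow retiming can be modeled functionally: each pipeline register is replicated C times, so C independent instruction streams are interleaved round-robin through one datapath, trading single-thread latency for clock rate and aggregate throughput. The toy model below simulates only this interleaving behavior (the stage functions and C = 3 are illustrative, not circuit-level retiming):

```python
# Functional toy model of C-slow retiming: C independent streams share
# one pipeline, each advancing on every C-th cycle.
from collections import deque

def c_slow_run(streams, stages):
    """streams: one input list per interleaved 'thread'.
    stages: list of per-stage functions forming the pipeline."""
    C = len(streams)
    inputs = [deque(s) for s in streams]
    outputs = [[] for _ in range(C)]
    # Pipeline registers, replicated C times (one slot per thread).
    pipe = [[None] * C for _ in stages]
    total_cycles = (max(map(len, streams)) + len(stages)) * C
    for cycle in range(total_cycles):
        t = cycle % C                        # whose turn this cycle is
        # Drain the last stage, then shift this thread's state forward.
        if pipe[-1][t] is not None:
            outputs[t].append(pipe[-1][t])
        for s in range(len(stages) - 1, 0, -1):
            v = pipe[s - 1][t]
            pipe[s][t] = stages[s](v) if v is not None else None
        x = inputs[t].popleft() if inputs[t] else None
        pipe[0][t] = stages[0](x) if x is not None else None
    return outputs

# Three streams share a two-stage pipeline computing (x + 1) * 2.
out = c_slow_run([[1, 2], [10, 20], [100, 200]],
                 [lambda x: x + 1, lambda x: x * 2])
print(out)  # [[4, 6], [22, 42], [202, 402]]
```

Each thread still waits C cycles between its own steps, which is why C-slowing raises throughput and clock rate rather than single-thread latency, matching the paper's high-frequency, low-complexity design point.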
ACM Transactions on Computer Systems | 2013
Mahdi Nazm Bojnordi; Engin Ipek
Modern memory controllers employ sophisticated address mapping, command scheduling, and power management optimizations to alleviate the adverse effects of DRAM timing and resource constraints on system performance. A promising way of improving the versatility and efficiency of these controllers is to make them programmable - a proven technique that has seen wide use in other control tasks ranging from DMA scheduling to NAND Flash and directory control. Unfortunately, the stringent latency and throughput requirements of modern DDRx devices have rendered such programmability largely impractical, confining DDRx controllers to fixed-function hardware. This paper presents the instruction set architecture (ISA) and hardware implementation of PARDIS, a programmable memory controller that can meet the performance requirements of a high-speed DDRx interface. The proposed controller is evaluated by mapping previously proposed DRAM scheduling, address mapping, refresh scheduling, and power management algorithms onto PARDIS. Simulation results show that the average performance of PARDIS comes within 8% of fixed-function hardware for each of these techniques; moreover, by enabling application-specific optimizations, PARDIS improves system performance by 6-17% and reduces DRAM energy by 9-22% over four existing memory controllers.