Publication


Featured research published by Aamir Zia.


IEEE Design & Test of Computers | 2005

Predicting the performance of a 3D processor-memory chip stack

Philip Jacob; Okan Erdogan; Aamir Zia; Paul M. Belemjian; Russell P. Kraft; John F. McDonald

We are exploring a 3D processor-memory stack for use with the message passing interface (MPI). Communication among processors in large servers wastes several thousand cycles. Most of these wasted cycles come not from the communication link between processors across the system but from handling the message packets. A processor that could handle this message packing and communication much faster could significantly increase the efficiency of this task and thus the utilization of such supercomputers, which is currently a very low 1%. At such high clock rates, however, the memory wall would become a significant problem. Tackling it requires innovative technologies, such as 3D memories, which alleviate some of the problems with long on-chip interconnects. Interconnection wires strongly affect circuit performance on a chip, and the need for shorter interconnection delays calls for shorter interconnection wires. Shorter interconnections are more likely in 3D architectures than in equivalent 2D systems. This article explores the advantages of 3D integration in a processor-memory stack system. We conducted simulations using simple tools such as Dinero IV and the cache access and cycle time information (Cacti) tool to evaluate the performance of various memory architectures.
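Trace-driven cache simulation, the kind of evaluation Dinero IV performs, replays a sequence of memory addresses through a cache model and counts hits and misses. The sketch below is a minimal set-associative LRU cache in that spirit; it is not Dinero IV's implementation, and the class name `LRUCache`, the cache geometry and the example trace are all hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Toy set-associative cache with LRU replacement (hypothetical model).

    Written in the spirit of trace-driven simulators such as Dinero IV,
    but it is not Dinero IV's code; all parameters are illustrative.
    """

    def __init__(self, size_bytes: int, block_bytes: int, ways: int) -> None:
        self.block_bytes = block_bytes
        self.ways = ways
        self.num_sets = size_bytes // (block_bytes * ways)
        # One OrderedDict per set: keys are tags, insertion order tracks recency
        # (first entry = least recently used, last entry = most recently used).
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        """Simulate one access; return True on a hit, False on a miss."""
        block = address // self.block_bytes
        index = block % self.num_sets
        tag = block // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used tag
        lines[tag] = None
        return False

    @property
    def miss_rate(self) -> float:
        total = self.hits + self.misses
        return self.misses / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical trace: a sequential 4-byte-stride walk over 4 KiB, done twice.
    trace = [a for _ in range(2) for a in range(0, 4096, 4)]
    cache = LRUCache(size_bytes=8192, block_bytes=64, ways=2)
    for addr in trace:
        cache.access(addr)
    print(f"hits={cache.hits} misses={cache.misses} miss rate={cache.miss_rate:.4f}")
```

Because the 4 KiB working set fits in the 8 KiB cache, only the first touch of each 64-byte block misses; a real study would instead replay address traces captured from the target workload.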


Proceedings of the IEEE | 2009

Mitigating Memory Wall Effects in High-Clock-Rate and Multicore CMOS 3-D Processor Memory Stacks

Philip Jacob; Aamir Zia; Okan Erdogan; Paul M. Belemjian; Jin Woo Kim; Michael Chu; Russell P. Kraft; John F. McDonald; Kerry Bernstein

Three-dimensional (3-D) chip stacking technology provides a new approach to the so-called memory wall problem. Stacking memory and processor chips mitigates the memory wall, permitting faster clock rates (with suitable processor logic) or multicore access to shared memory through a large number of vertical vias between tiers in the stack, which provide ultrawide bit paths for transferring data and address information to and from the various levels of cache. Although conventional two-dimensional (2-D) memory-processor approaches allow a limited amount of parallel access, 3-D memory-processor stacking greatly extends this to much larger memories. We evaluate high-clock-rate processors as well as shared-memory processors with a large number of cores. Various architectural design options for reducing the impact of the memory wall on processor performance are explored and validated through simulations. Certain architectural features can be implemented in a 3-D chip, such as an ultrawide, ultrashort vertical bus with low parasitic resistance, and the elimination of the conventional electrostatic discharge (ESD) protection and packaging parasitics required in multiple-package 2-D solutions. The objective is to reduce the clocks-per-instruction (CPI) figure of merit at high clock speeds in order to deliver significant performance levels. High-clock-rate processors can be designed with SiGe heterojunction bipolar transistors to obtain processors operating on the order of 16 or 32 GHz.
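Why the memory wall worsens as the clock rises can be seen with the standard textbook CPI model: a cache miss costs a roughly fixed wall-clock latency, so its cost in cycles grows linearly with frequency. The sketch below assumes that simple model; it is not the simulation methodology of the paper, and every number in it (base CPI of 1.0, 0.3 memory references per instruction, 2% miss rate, 20 ns versus a hypothetical 2 ns shorter 3-D path) is illustrative only.

```python
def effective_cpi(cpi_core: float, refs_per_instr: float, miss_rate: float,
                  miss_latency_ns: float, clock_ghz: float) -> float:
    """Textbook CPI with memory stalls (hypothetical parameters).

    A miss costs a fixed latency in nanoseconds, so its cost in cycles is
    latency_ns * clock_ghz and grows with the clock frequency.
    """
    penalty_cycles = miss_latency_ns * clock_ghz
    return cpi_core + refs_per_instr * miss_rate * penalty_cycles


def time_per_instr_ns(cpi: float, clock_ghz: float) -> float:
    """Average time per instruction in ns (cycle time is 1 / f)."""
    return cpi / clock_ghz


if __name__ == "__main__":
    # Hypothetical workload: 0.3 memory refs/instr, 2% miss rate, base CPI 1.0.
    for latency_ns in (20.0, 2.0):  # long path vs. a hypothetical short 3-D path
        for f in (16.0, 32.0):
            cpi = effective_cpi(1.0, 0.3, 0.02, latency_ns, f)
            print(f"latency={latency_ns:5.1f} ns  f={f:4.1f} GHz  "
                  f"CPI={cpi:5.2f}  ns/instr={time_per_instr_ns(cpi, f):.4f}")
```

With the 20 ns latency, doubling the clock from 16 to 32 GHz raises CPI from 2.92 to 4.84, so time per instruction improves only about 1.2x; with the shorter latency the same clock doubling buys considerably more, which is the qualitative argument for shortening the processor-memory path.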


IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2010

A 3-D Cache With Ultra-Wide Data Bus for 3-D Processor-Memory Integration

Aamir Zia; Philip Jacob; Jin Woo Kim; Michael Chu; Russell P. Kraft; John F. McDonald

Slow cache memory systems and low memory bandwidth are a major performance bottleneck in modern microprocessors. 3-D integration of processor and memory subsystems makes it possible to realize a wide data bus that can provide a high-bandwidth, low-latency on-chip cache. This paper presents a three-tier, 3-D 192-kB cache for a 3-D processor-memory stack. The chip is designed and fabricated in a 0.18-µm fully depleted SOI CMOS process. An ultra-wide data bus connecting the 3-D cache to the microprocessor is implemented using dense vertical vias between the stacked wafers. The fabricated cache operates at 500 MHz and achieves up to 96 GB/s aggregate bandwidth at the output.
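The reported figures imply how wide the output path must be: bytes per cycle equals bandwidth divided by clock frequency. The helper below is a back-of-the-envelope sketch that assumes 1 GB = 10^9 bytes and one transfer per clock cycle; the abstract does not state either convention or how the width is split across tiers, so treat the result as an estimate.

```python
def bytes_per_cycle(bandwidth_gb_s: float, clock_mhz: float) -> float:
    """Bytes delivered per clock cycle to sustain a given bandwidth.

    Assumptions (not stated in the source): 1 GB = 1e9 bytes and one
    transfer per clock cycle (no double-data-rate signaling).
    """
    return (bandwidth_gb_s * 1e9) / (clock_mhz * 1e6)


if __name__ == "__main__":
    width = bytes_per_cycle(96.0, 500.0)
    print(f"{width:.0f} bytes/cycle = {width * 8:.0f} bits/cycle")
```

Under these assumptions, 96 GB/s at 500 MHz corresponds to 192 bytes (1,536 bits) per cycle, a width that is far easier to provide with dense vertical vias than through a conventional package interface.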


Design and Diagnostics of Electronic Circuits and Systems | 2012

A three-dimensional DRAM using floating body cell in FDSOI devices

Xuelian Liu; Aamir Zia; Mitchell R. LeRoy; Srikumar Raman; Ryan Clark; Russell P. Kraft; John F. McDonald

This paper describes a capacitorless one-transistor (1T) DRAM that exploits the floating-body (FB) effect of fully depleted (FD) SOI devices, in which the transistor body is used as the charge storage node. A novel three-tier, 3D, 1T embedded DRAM is presented that can be vertically integrated with the microprocessor to achieve low-cost, high-density on-chip main memory. A 394-kbit test chip is designed and fabricated in a 0.15-µm fully depleted SOI CMOS process. The measured retention time under holding conditions is longer than 10 ms. In continuous read mode, every read should be followed by a refresh. The test chip is designed for an access time of 50 ns and operates at 10 MHz.
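The access policy described above (every read in continuous read mode is followed by a refresh, and data must be refreshed within the retention time) can be illustrated with a purely behavioral model. The sketch below is hypothetical: the class `FloatingBodyDRAM` does not simulate body charge at all, it simply writes back each value after reading it and treats a cell as lost if it has gone longer than 10 ms (the measured lower bound, used conservatively) without a write or refresh.

```python
RETENTION_S = 10e-3  # measured retention was > 10 ms; used here as a conservative limit


class FloatingBodyDRAM:
    """Behavioral (not circuit-level) model of a 1T floating-body DRAM array.

    Hypothetical simplification: no charge is modeled. Each read is followed
    by a write-back (refresh), and a cell not written or refreshed within
    RETENTION_S seconds is treated as having lost its data.
    """

    def __init__(self, num_cells: int) -> None:
        self.data = [0] * num_cells            # all cells assumed written with 0 at t = 0
        self.last_refresh = [0.0] * num_cells  # time of last write/refresh, in seconds

    def write(self, addr: int, bit: int, now: float) -> None:
        self.data[addr] = bit
        self.last_refresh[addr] = now

    def read(self, addr: int, now: float) -> int:
        if now - self.last_refresh[addr] > RETENTION_S:
            raise RuntimeError(f"cell {addr} exceeded the retention window")
        bit = self.data[addr]
        # Continuous read mode: follow the read with a refresh (write-back),
        # which also restarts the cell's retention window.
        self.write(addr, bit, now)
        return bit


if __name__ == "__main__":
    mem = FloatingBodyDRAM(4)
    mem.write(1, 1, now=0.0)
    print(mem.read(1, now=5e-3))   # within 10 ms of the write
    print(mem.read(1, now=14e-3))  # 9 ms after the previous read's refresh
```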


IET Circuits, Devices & Systems | 2011

Reconfigurable 40 GHz BiCMOS uniform delay crossbar switch for broadband and wide tuning range narrowband applications

Jin Woo Kim; Michael Chu; Philip Jacob; Aamir Zia; Russell P. Kraft; John F. McDonald

A wide-band crossbar switch configured as a non-blocking signal router can be used in various applications that need reconfigurable digital or analog cross connections, such as network switches, CPU-memory connecting modules and wide-tuning-range radar switches. Current mode logic using IBM 8HP SiGe heterojunction bipolar transistors with an fT of 210 GHz, together with a symmetrical signal path design, is employed to build a 40 GHz crossbar switch capable of 80 Gb/s transmission with a fast reconfiguration time of 160 ps. A unique feature of this crossbar is that the delay through any path in the switch is constant. The fT of IBM's 8XP SiGe model is 350 GHz, which allows faster circuits than the 8HP technology. The crossbar switch is simulated with IBM's 8XP kit to predict a further performance improvement to 50 GHz (100 Gb/s for binary signals). To demonstrate the maximum operating speed, the crossbar switch is tested as a 40 GHz phase router for a phased array antenna system. The measured output of the crossbar switch is a 38.8 GHz sine wave with the selected phase delay. The phase noise of the output signal is -88.3 dBc/Hz for an input whose phase noise is -98 dBc/Hz at 1 MHz offset. Using a 2.5 V supply, the 8HP crossbar switch consumes 2.2-5.7 W depending on the number of active channels. Using the 8XP kit, the power dissipation of the crossbar switch can be reduced by about 70% with the same performance.
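Functionally, a non-blocking crossbar lets every output select any input independently, so reconfiguring one connection never disturbs another. The sketch below models only that routing behavior; the class name `Crossbar`, the choice to allow one input to fan out to several outputs, and the `PATH_DELAY_PS` value are all hypothetical and not taken from the paper. Returning the same delay for every path mirrors the constant-delay property, which is presumably what makes the switch usable as a phase router: the chosen path adds no path-dependent phase shift.

```python
from typing import Dict, List, Optional

PATH_DELAY_PS = 50.0  # hypothetical placeholder, not a figure from the paper


class Crossbar:
    """Hypothetical functional model of an N x M non-blocking crossbar router."""

    def __init__(self, n_inputs: int, n_outputs: int) -> None:
        self.n_inputs = n_inputs
        self.n_outputs = n_outputs
        # For each output: the index of the input it is connected to, or None.
        self.select: Dict[int, Optional[int]] = {o: None for o in range(n_outputs)}

    def _check(self, inp: int, out: int) -> None:
        if not (0 <= inp < self.n_inputs and 0 <= out < self.n_outputs):
            raise IndexError("port out of range")

    def connect(self, inp: int, out: int) -> None:
        """Route input `inp` to output `out`; other outputs are unaffected."""
        self._check(inp, out)
        self.select[out] = inp

    def route(self, inputs: List[str]) -> List[Optional[str]]:
        """Return the signal seen on each output (None if unconnected)."""
        return [None if self.select[o] is None else inputs[self.select[o]]
                for o in range(self.n_outputs)]

    def delay_ps(self, inp: int, out: int) -> float:
        """Uniform-delay model: every input/output path has the same delay."""
        self._check(inp, out)
        return PATH_DELAY_PS


if __name__ == "__main__":
    phases = ["0deg", "90deg", "180deg", "270deg"]  # hypothetical phase-shifted inputs
    xbar = Crossbar(4, 4)
    for out in range(4):
        xbar.connect(out, out)
    print(xbar.route(phases))
    xbar.connect(3, 0)  # steer antenna element 0 to a different phase
    print(xbar.route(phases))
```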


International Conference on Computer Design | 2007

Amdahl’s figure of merit, SiGe HBT BiCMOS, and 3D chip stacking

Phil Jacobs; Aamir Zia; Okan Erdogan; Paul M. Belemjian; Peng Jin; Jin Woo Kim; Mike Chu; Russell P. Kraft; John F. McDonald

Forty years ago, Gene Amdahl published a figure of merit for parallel computation that proved extremely controversial. The controversy still rages today, although those who have looked closely at this figure of merit conclude that it is correct, but perhaps misinterpreted. In this paper we look at a small variation on that law which suggests that computer designers should take a closer look at two emerging technologies, SiGe HBT BiCMOS and 3D chip stacking. We may be overlooking a way to continue the clock race and, in so doing, achieve better parallelism.
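For reference, the classic form of Amdahl's law gives the speedup on N processors as S(N) = 1 / (s + (1 - s) / N), where s is the serial (non-parallelizable) fraction of the work; as N grows, S(N) approaches 1/s. The abstract does not spell out the "small variation" the paper studies, so the sketch below shows only the classic law.

```python
def amdahl_speedup(serial_fraction: float, n_processors: int) -> float:
    """Classic Amdahl's law: speedup on n processors for serial fraction s."""
    s = serial_fraction
    return 1.0 / (s + (1.0 - s) / n_processors)


if __name__ == "__main__":
    for n in (16, 256, 4096):
        print(f"s=0.04, N={n:5d}: speedup = {amdahl_speedup(0.04, n):6.2f}")
    # The ceiling as N grows without bound is 1 / s.
    print(f"limit for s=0.04: {1 / 0.04:.1f}")
```

Even with thousands of processors, a 4% serial fraction caps the speedup below 25x, which is why the size of the serial fraction dominates the debate.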


2009 IEEE International Conference on 3D System Integration | 2009

Thermal analysis for a SiGe HBT 40 watt 32 GHz clock 3D memory processor chip stack using diamond heat spreader layers

John F. McDonald; Okan Erdogan; Philip Jacobs; Paul M. Belemjian; Alexey Gutin; Aamir Zia; Michael Chu; Jin Woo Kim; Ryan Clarke; Nate DeSimone; Sherry Liu; Russell P. Kraft

CMOS evolution by lateral lithographic shrinkage has encountered an impediment: wires do not scale well. As a result, it would appear that the clock race is over and that the future of computing lies in multicore or parallel processing. In a prior paper [1] we explored the implications of Amdahl's figure of merit (FOM), which suggests that for algorithms to successfully demonstrate a large throughput improvement from parallelization, the fraction of non-parallelizable code (also called serial code) must be less than about 4%. We observed that memory latency, synchronization, and inter-processor communication latency can masquerade as non-parallelizable code. While there are no doubt certain applications where a non-parallelizable fraction below 4% exists, and others where the large memory needed just to hold the data justifies the use of parallel processors in any case, the implications for the broad class of other code are at best in doubt. The Amdahl figure of merit suggests something worse: that the prospects for a favorable impact are poor. In this paper we continue the dialogue begun in the earlier paper by examining a small demonstration processor that achieves high performance by pursuing the traditional higher clock rate through improvements in device and interconnection technology. Because the processor uses a BiCMOS process and requires a 3D memory for Memory Wall mitigation, it is important to address thermal issues. Preliminary analyses are, perhaps unexpectedly, somewhat favorable.
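How latency and synchronization "masquerade" as serial code can be made concrete with the Karp-Flatt metric, which inverts Amdahl's law to infer the serial fraction that would explain a measured speedup. The sketch below assumes a simple model (times normalized to the single-processor run, plus a fixed `overhead` term standing in for memory latency, synchronization and communication that does not shrink with N); the parameter values are hypothetical and not from the paper.

```python
def speedup_with_overhead(serial: float, overhead: float, n: int) -> float:
    """Amdahl-style speedup with an extra overhead that does not shrink with n.

    Times are normalized so the single-processor run takes 1.0
    (serial + parallel work). `overhead` is a modeling assumption standing in
    for memory latency, synchronization and communication time.
    """
    return 1.0 / (serial + (1.0 - serial) / n + overhead)


def karp_flatt(speedup: float, n: int) -> float:
    """Karp-Flatt metric: serial fraction implied by a measured speedup on n processors."""
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)


if __name__ == "__main__":
    n = 64
    true_serial = 0.02  # hypothetical: genuinely serial code
    for overhead in (0.0, 0.03):
        s_meas = speedup_with_overhead(true_serial, overhead, n)
        print(f"overhead={overhead:.2f}: speedup={s_meas:5.2f}, "
              f"apparent serial fraction={karp_flatt(s_meas, n):.4f}")
```

In this toy model, 2% truly serial code plus a 3% non-shrinking overhead looks like roughly 5% serial code, pushing the workload past the roughly 4% threshold quoted above.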


International Conference on IC Design and Technology | 2008

A 3-tier, 3-D FD-SOI SRAM macro

Aamir Zia; Philip Jacob; Russell P. Kraft; John F. McDonald

Three-dimensional memory systems have been argued to be a potential pathway to closing the ever-growing gap between CPU and memory speeds. In this paper, we describe a three-tier, three-dimensional SRAM macro that has been designed and fabricated in a 0.18-µm FD-SOI CMOS technology. 3D stacking is found to reduce wire latency compared with a planar memory structure, although the reduction is not enough to have a significant effect on the access time of the memory. We argue that the major performance benefit of 3D integration is a very wide data bus, which can be realized much more easily with 3D structures than with 2D memories.
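A first-order estimate shows why shorter wires need not translate into much faster access. Assuming the same array area is split evenly over k tiers, each tier's edge shrinks by sqrt(k), so lateral wire length scales as 1/sqrt(k) and the distributed RC delay of an unrepeated wire, which grows with length squared, scales as 1/k. If wires account for only part of the planar access time (the rest being decoders, sense amplifiers and so on), the overall gain is diluted. This is a hypothetical textbook-style model, not the analysis in the paper; the 10% wire fraction below is an illustrative assumption, and via delay is ignored.

```python
import math


def lateral_length_ratio(tiers: int) -> float:
    """First-order lateral wire-length ratio, k-tier stack vs. planar array.

    Assumes equal total area split evenly across tiers (edge shrinks by
    sqrt(k)) and ignores the length of the vertical vias.
    """
    return 1.0 / math.sqrt(tiers)


def unrepeated_rc_delay_ratio(tiers: int) -> float:
    """Distributed RC delay of an unrepeated wire scales with length squared."""
    return lateral_length_ratio(tiers) ** 2


def access_time_ratio(tiers: int, wire_fraction: float) -> float:
    """Access-time ratio when `wire_fraction` of the planar access time is wire delay.

    The non-wire part (decoders, sense amplifiers, ...) is assumed unchanged.
    """
    return (1.0 - wire_fraction) + wire_fraction * unrepeated_rc_delay_ratio(tiers)


if __name__ == "__main__":
    k = 3  # three tiers, as in the SRAM macro described above
    print(f"lateral length ratio: {lateral_length_ratio(k):.3f}")
    print(f"wire RC delay ratio:  {unrepeated_rc_delay_ratio(k):.3f}")
    print(f"access time ratio (10% wire, hypothetical): {access_time_ratio(k, 0.10):.3f}")
```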


IET Circuits, Devices & Systems | 2014

Design of BiCMOS SRAMs for high-speed SiGe applications

Xuelian Liu; Mitchell R. LeRoy; Ryan Clarke; Michael Chu; Hadrian Olayvar Aquino; Srikumar Raman; Aamir Zia; Russell P. Kraft; John F. McDonald


Archive | 2013

A Three-Dimensional DRAM Using Floating Body Capacitance Cells in an FD-SOI Process

Xuelian Liu; Aamir Zia

Collaboration



Top Co-Authors


John F. McDonald

Rensselaer Polytechnic Institute


Russell P. Kraft

Rensselaer Polytechnic Institute


Jin Woo Kim

Rensselaer Polytechnic Institute


Michael Chu

Rensselaer Polytechnic Institute


Philip Jacob

Rensselaer Polytechnic Institute


Okan Erdogan

Rensselaer Polytechnic Institute


Paul M. Belemjian

Rensselaer Polytechnic Institute


Xuelian Liu

Rensselaer Polytechnic Institute


Mitchell R. LeRoy

Rensselaer Polytechnic Institute


Ryan Clarke

Rensselaer Polytechnic Institute
