Muhammad M. Khellah | Researchain

Publication

Featured research published by Muhammad M. Khellah.

IEEE Journal of Solid-state Circuits | 2011

A 45 nm Resilient Microprocessor Core for Dynamic Variation Tolerance

Keith A. Bowman; James W. Tschanz; Shih-Lien Lu; Paolo A. Aseron; Muhammad M. Khellah; Arijit Raychowdhury; Bibiche M. Geuskens; Chris Wilkerson; Tanay Karnik; Vivek De

A 45 nm microprocessor core integrates resilient error-detection and recovery circuits to mitigate the clock frequency (FCLK) guardbands for dynamic parameter variations to improve throughput and energy efficiency. The core supports two distinct error-detection designs, allowing a direct comparison of the relative trade-offs. The first design embeds error-detection sequential (EDS) circuits in critical paths to detect late timing transitions. In addition to reducing the FCLK guardbands for dynamic variations, the embedded EDS design can exploit path-activation rates to operate the microprocessor faster than infrequently-activated critical paths. The second error-detection design offers a less-intrusive approach for dynamic timing-error detection by placing a tunable replica circuit (TRC) per pipeline stage to monitor worst-case delays. Although the TRCs require a delay guardband to ensure the TRC delay is always slower than critical-path delays, the TRC design captures most of the benefits from the embedded EDS design with less implementation overhead. Furthermore, while core min-delay constraints limit the potential benefits of the embedded EDS design, a salient advantage of the TRC design is the ability to detect a wider range of dynamic delay variation, as demonstrated through low supply voltage (VCC) measurements. Both error-detection designs interface with error-recovery techniques, enabling the detection and correction of timing errors from fast-changing variations such as high-frequency VCC droops. The microprocessor core also supports two separate error-recovery techniques to guarantee correct execution even if dynamic variations persist. The first technique requires clock control to replay errant instructions at 1/2 FCLK. In comparison, the second technique is a new multiple-issue instruction replay design that corrects errant instructions with a lower performance penalty and without requiring clock control. Silicon measurements demonstrate that resilient circuits enable a 41% throughput gain at equal energy or a 22% energy reduction at equal throughput, as compared to a conventional design when executing a benchmark program with a 10% VCC droop. In addition, the microprocessor includes a new adaptive clock control circuit that interfaces with the resilient circuits and a phase-locked loop (PLL) to track recovery cycles and adapt to persistent errors by dynamically changing FCLK for maximum efficiency.

international solid state circuits conference | 2007

A 256-Kb Dual-VCC SRAM Building Block in 65-nm CMOS Process With Actively Clamped Sleep Transistor

Muhammad M. Khellah; Dinesh Somasekhar; Yibin Ye; Nam Sung Kim; Jason Howard; Greg Ruhl; Murad Sunna; James W. Tschanz; Nitin Borkar; Fatih Hamzaoglu; Gunjan Pandya; Ali Farhang; Kevin Zhang; Vivek De

This paper addresses the stability problem of SRAM cells used in dense last level caches (LLCs). In order for the LLC not to limit the minimum voltage at which a processor core can run, a dual-VCC 256-Kb SRAM building block is proposed. A fixed high-voltage supply powers the cache, which allows the use of the smallest SRAM cell for maximum density, while a separate variable supply is used by the core for ultra-low-voltage operation using dynamic voltage and frequency (DVF). Implemented in a 65-nm bulk CMOS process, the block features low-overhead embedded level shifters and an actively clamped sleep transistor for maximum cache leakage power reduction during standby. Measured results show that the proposed block runs at 4.2 GHz while consuming 30 mW at 85 °C and a 1.2 V supply. Furthermore, measurements across a wide range of process, voltage, temperature, and aging conditions indicate virtual ground clamping accuracy within a few millivolts of the required cache standby VMIN. Extrapolating the 256-Kb block measurement results to a large 64-Mb LLC used in a dual-VCC processor gives a 35% reduction in total processor power as compared with a single-VCC processor design running at a high supply voltage.

symposium on vlsi circuits | 2006

Wordline & Bitline Pulsing Schemes for Improving SRAM Cell Stability in Low-Vcc 65nm CMOS Designs

Muhammad M. Khellah; Yibin Ye; Nam Sung Kim; Dinesh Somasekhar; Gunjan Pandya; Ali Farhang; Kevin Zhang; Clair Webb; Vivek De

Pulsed wordline (PWL) & pulsed bitline (PBL) techniques to improve SRAM cell stability in single-Vcc microprocessor designs are evaluated in 65 nm CMOS. At 0.7 V Vcc, PWL improves the cell failure rate by 15× while incurring <1% area overhead. Both PBL & PWL with read-modify-write (PWL-RMW) provide the best improvements (26×) in cell stability, with significant area overheads (4-8%).

IEEE Transactions on Very Large Scale Integration Systems | 2008

Accurate Estimation of SRAM Dynamic Stability

DiaaEldin Khalil; Muhammad M. Khellah; Nam Sung Kim; Yehea I. Ismail; Tanay Karnik; Vivek De

In this paper, an accurate approach for estimating SRAM dynamic stability is proposed. The conventional methods of SRAM stability estimation suffer from two major drawbacks: 1) using static failure criteria, such as static noise margin (SNM), which does not capture the transient and dynamic behavior of SRAM operation and 2) using quasi-Monte Carlo simulation, which approximates the failure distribution, resulting in large errors at the tails where the desired failure probabilities exist. These drawbacks are eliminated by employing a new distribution-independent, most-probable-failure-point search technique for accurate probability calculation along with accurate simulation-based dynamic failure criteria. Compared to previously published techniques, the proposed technique offers orders of magnitude improvement in accuracy. Furthermore, the proposed technique enables the correct evaluation of stability in real operation conditions and for different dynamic circuit techniques, such as dynamic write-back, where the conventional methods are not applicable.

international solid-state circuits conference | 2010

A 45nm resilient and adaptive microprocessor core for dynamic variation tolerance

James W. Tschanz; Keith A. Bowman; Shih-Lien Lu; Paolo A. Aseron; Muhammad M. Khellah; Arijit Raychowdhury; Bibiche M. Geuskens; Chris Wilkerson; Tanay Karnik; Vivek De

Microprocessors experience a wide range of dynamic variations, including voltage droops, temperature changes, and device aging, which vary across applications and systems. The necessity of ensuring correct operation even under infrequent worst-case conditions results in clock frequency (FCLK) or supply voltage (VCC) guardbands that degrade performance and increase energy consumption. In this paper, a research microprocessor core is described with resilient and adaptive circuits to mitigate dynamic variation guardbands for maximizing throughput or minimizing energy. The resiliency features consist of embedded error-detection sequentials (EDS) [1-4] and tunable replica circuits (TRC) [5] in conjunction with error-recovery circuits to detect and correct timing errors. A new instruction-replay error-recovery technique is introduced to correct errant instructions with low performance cost and implementation overhead. In addition, the microprocessor contains an adaptive clock controller based on error statistics to operate at maximum efficiency across a range of dynamic variations.

international solid-state circuits conference | 2010

PVT-and-aging adaptive wordline boosting for 8T SRAM power reduction

Arijit Raychowdhury; Bibiche M. Geuskens; Jaydeep P. Kulkarni; James W. Tschanz; Keith A. Bowman; Tanay Karnik; Shih-Lien Lu; Vivek De; Muhammad M. Khellah

The 8T SRAM cell (Fig. 19.6.1) is commonly used in single-VCC microprocessor cores for performance-critical low-level caches and multi-ported register-file arrays [1]. The 8T cell offers fast read (RD) and write (WR), dual-port capability, and generally lower minimum VCC (VMIN) than the 6T cell. By using a decoupled single-ended RD port with a domino-style hierarchical RD bit-line, the 8T cell features a fast RD evaluation path without causing the access disturbance that limits RD VMIN in the 6T cell. Using the 8T cell in a half-select-free architecture eliminates pseudo-reads during partial writes, hence enabling WR VMIN optimization independent of RD.

IEEE Journal of Solid-state Circuits | 2010

Multi-Phase 1 GHz Voltage Doubler Charge Pump in 32 nm Logic Process

Dinesh Somasekhar; Balaji Srinivasan; Gunjan Pandya; Fatih Hamzaoglu; Muhammad M. Khellah; Tanay Karnik; Kevin Zhang

A multi-phase 1 GHz charge pump in a 32 nm logic process demonstrates a compact area (159 × 42 µm²) for boosting supply voltage from twice the threshold voltage (2 Vth) to 3-4 Vth. Self-contained clocking with metal-finger flying capacitors enables embedding voltage boost functionality in close proximity to digital logic for supplying the low-current Vmin requirement of state elements in logic blocks. Multi-phase operation with phase separation of the order of buffer delays avoids the need for a large storage reservoir capacitor. A special configuration of the pump stages to work in parallel enables a fast (5 ns) output transition from the disable to the enable state. The multi-phase pump operated as a 1 V to 2 V doubler with >5 mA output capability addresses the need for a gated power delivery solution for logic blocks having state-preservation Vmin requirements.

international symposium on quality electronic design | 2008

Analytical Model for the Propagation Delay of Through Silicon Vias

DiaaEldin Khalil; Yehea I. Ismail; Muhammad M. Khellah; Tanay Karnik; Vivek De

This paper explores the modeling of the propagation delay of through silicon vias (TSVs) in 3D integrated circuits. The electrical characteristics and models of TSVs are crucial in enabling analysis and CAD for 3D integrated circuits. In this paper, an analytical model for the propagation delay of the TSV as a function of its physical dimensions is proposed. The presented analytical model is in close agreement with simulations using an electromagnetic field solver and a lossy transmission line circuit model. Compared to earlier interconnect models, the presented analytical model provides higher accuracy and fidelity in addition to its simplicity. Hence, the presented analytical model is very useful in the analysis of 3D integrated circuits.

IEEE Journal of Solid-state Circuits | 2014

A 0.45-1 V Fully-Integrated Distributed Switched Capacitor DC-DC Converter With High Density MIM Capacitor in 22 nm Tri-Gate CMOS

Rinkle Jain; Bibiche M. Geuskens; Stephen T. Kim; Muhammad M. Khellah; Jaydeep P. Kulkarni; James W. Tschanz; Vivek De

A fully integrated switched capacitor voltage regulator (SCVR) with on-die high-density MIM capacitor, distributed across a 14 KB register file (RF) load, is demonstrated in 22 nm tri-gate CMOS. The multi-conversion-ratio SCVR provides a wide output voltage range of 0.45-1 V from a fixed input voltage of 1.225 V. It achieves 63-84% conversion efficiency and supports a maximum load current density of 0.88 A/mm². The area overhead of the dedicated SCVR on the load is 3.6%. Measured data is presented on various performance indices in detail. Subsequent learning on tradeoffs between various factors like capacitance characteristics, conversion efficiency and current density is delineated and correlated with theoretical estimates. Performance of the RF array shows comparable results when powered with the SCVR and the external rail. The all-digital, modular design allows efficient spatial distribution across the load and hence robust power delivery. The extremely fast response times, on the order of a few nanoseconds, are targeted to benefit agile power management. This work evinces voltage regulator technology as a standard homogenous CMOS component, which can proliferate DVFS domains for maximum energy and area benefits.
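
As a rough illustration of why a multi-conversion-ratio design helps across a wide output range, the sketch below computes the textbook ideal efficiency bound of a switched-capacitor converter, Vout / (ratio × Vin), and picks the smallest ratio that can still reach the target. Only the 1.225 V input and the 0.45-1 V output range come from the abstract. The set of ratios and the function names are hypothetical, and the bound ignores switching and parasitic losses, so it is not the paper's measured 63-84% figure.

```python
from fractions import Fraction

V_IN = 1.225  # fixed input voltage quoted in the abstract (volts)

# Hypothetical conversion ratios, chosen only for illustration.
RATIOS = (Fraction(1, 2), Fraction(2, 3), Fraction(1, 1))


def best_ratio(v_out, v_in=V_IN, ratios=RATIOS):
    """Pick the smallest ratio whose ideal output (ratio * v_in) reaches v_out."""
    feasible = [r for r in ratios if float(r) * v_in >= v_out]
    if not feasible:
        raise ValueError("no ratio can reach the requested output voltage")
    return min(feasible)


def ideal_efficiency(v_out, v_in=V_IN, ratios=RATIOS):
    """Ideal upper bound on efficiency: v_out / (ratio * v_in)."""
    ratio = best_ratio(v_out, v_in, ratios)
    return v_out / (float(ratio) * v_in)


if __name__ == "__main__":
    for v_out in (0.45, 0.60, 0.80, 1.00):
        r = best_ratio(v_out)
        print(f"Vout={v_out:.2f} V  ratio={r}  ideal bound={ideal_efficiency(v_out):.1%}")
```

The bound drops whenever the target voltage sits well below the ideal output of the selected ratio, which is the usual motivation for offering several ratios.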

international conference on computer aided design | 2005

Serial-link bus: a low-power on-chip bus architecture

Maged Ghoneima; Yehea I. Ismail; Muhammad M. Khellah; James W. Tschanz; Vivek De

As technology scales, the shrinking wire width increases the interconnect resistivity, while the decreasing interconnect spacing significantly increases the coupling capacitance. This paper proposes reducing the number of bus lines of the conventional parallel-line bus (CB) architecture by multiplexing each m bits onto a single line. This bus architecture, the serial-link bus (SLB), transforms an n-bit conventional parallel-line bus into an n/m-line (serial-link) bus. The advantage of serial-link buses is that they have fewer lines, and if the bus width is kept the same, serial-link buses will have larger line width and spacing. Increasing the line width has a twofold reduction effect on the line resistance, as the resistivity of sub-100 nm wires significantly drops as the line width increases. Also, increasing the line width and spacing reduces the coupling capacitance between adjacent lines, but increases the line-to-ground capacitance. Thus, an optimum degree of multiplexing m exists that minimizes the bus energy dissipation and maximizes the bus throughput per unit area. The optimum degree of multiplexing for maximum throughput-per-unit-area and for minimum energy dissipation for the 25-130 nm technologies was determined in this paper. HSPICE simulations show that, for the same throughput-per-unit-area as conventional parallel-line buses, the serial-link bus architecture reduces the energy dissipation by up to 31.42% for a 64-bit bus implemented in an intermediate metal layer of a 50 nm technology, and a reduction of 52.7% is projected for the 25 nm technology.
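
The geometric part of the serial-link argument can be sketched in a few lines: multiplexing m bits per line turns an n-bit bus into n/m lines, so for a fixed total bus width each line gets m times the pitch (width plus spacing). The function name and the 16 µm bus width below are hypothetical values chosen for illustration. The sketch covers only line count and pitch and does not model the resistance, capacitance, or energy results reported in the abstract.

```python
def serial_link_geometry(n_bits, m, bus_width_um):
    """Return (line_count, pitch_um) for an n-bit bus multiplexed m-to-1.

    The total bus width is held constant, so the pitch available to each
    line (wire width plus spacing) grows as the line count shrinks.
    """
    if m < 1 or n_bits % m != 0:
        raise ValueError("m must be a positive divisor of n_bits")
    line_count = n_bits // m
    pitch_um = bus_width_um / line_count
    return line_count, pitch_um


if __name__ == "__main__":
    # Hypothetical 64-bit bus occupying 16 um of routing width.
    for m in (1, 2, 4, 8):
        lines, pitch = serial_link_geometry(64, m, bus_width_um=16.0)
        print(f"m={m}: {lines} lines, pitch {pitch} um, "
              f"each line carries {m} bits per parallel-bus cycle")
```

To match the throughput of the parallel bus, each serial line must carry m bits in the time the parallel bus carries one bit per line, which is why the abstract frames the choice of m as an optimization.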
