Publication


Featured research published by Farhana Sheikh.


IEEE Journal of Solid-state Circuits | 2011

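53 Gbps Native GF(2^4)^2 Composite-Field AES-Encrypt/Decrypt Accelerator for Content-Protection in 45 nm High-Performance Microprocessors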

Sanu K. Mathew; Farhana Sheikh; Michael E. Kounavis; Shay Gueron; Amit Agarwal; Steven K. Hsu; Himanshu Kaul; Mark A. Anders; Ram K. Krishnamurthy

This paper describes an on-die, reconfigurable AES encrypt/decrypt hardware accelerator fabricated in 45 nm CMOS, targeted for content-protection in high-performance microprocessors. 100% round computation in native GF(2^4)^2 composite-field arithmetic, unified reconfigurable datapath for encrypt/decrypt, optimized ground & composite-field polynomials, integrated affine/bypass multiplexer circuits, fused Mix/InvMixColumn circuits and a folded ShiftRow datapath enable peak 2.2 Tbps/Watt AES-128 energy efficiency with a dense 2-round layout occupying 0.052 mm², while achieving: (i) 53/44/38 Gbps AES-128/192/256 performance, 125 mW, measured at 1.1 V, 50 °C, (ii) scalable AES-128 performance up to 66 Gbps, measured at 1.35 V, 50 °C, (iii) wide operating supply voltage range with robust subthreshold voltage performance of 800 Mbps, 409 μW, measured at 320 mV, 50 °C, (iv) 37% Sbox delay reduction and 25% area reduction with a compact Sbox layout occupying 759 μm², (v) 67% reduction in worst-case interconnect length and 33% reduction in ShiftRow wiring tracks, and (vi) 43% reduction in Mix/InvMixColumn area with no performance penalty.
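
The composite-field arithmetic named in this abstract can be illustrated in software. The following is a minimal Python sketch, not the paper's circuit: the ground polynomial x^4 + x + 1, the extension polynomial x^2 + x + λ, the search used to pick λ, and all function names are assumptions chosen for illustration, since the abstract does not state which optimized polynomials were used. The sketch also omits the mapping between the standard AES field and the composite field.

```python
# Illustrative GF((2^4)^2) arithmetic; polynomials and names are assumptions.

GROUND_POLY = 0x13  # x^4 + x + 1, an irreducible polynomial over GF(2)


def gf16_mul(a: int, b: int) -> int:
    """Multiply two GF(2^4) elements (4-bit ints) modulo GROUND_POLY."""
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:  # a degree-4 term appeared, so reduce it
            a ^= GROUND_POLY
    return result


def find_lambda() -> int:
    """Return the first lam for which x^2 + x + lam has no root in GF(2^4)."""
    for lam in range(1, 16):
        if all(gf16_mul(y, y) ^ y ^ lam for y in range(16)):
            return lam
    raise ValueError("no irreducible extension polynomial found")


LAMBDA = find_lambda()


def gf256_mul(a: int, b: int) -> int:
    """Multiply bytes viewed as a1*x + a0 over GF(2^4), using x^2 = x + LAMBDA."""
    a1, a0 = a >> 4, a & 0xF
    b1, b0 = b >> 4, b & 0xF
    high_product = gf16_mul(a1, b1)
    hi = high_product ^ gf16_mul(a1, b0) ^ gf16_mul(a0, b1)
    lo = gf16_mul(a0, b0) ^ gf16_mul(LAMBDA, high_product)
    return (hi << 4) | lo


def gf256_inv(a: int) -> int:
    """Brute-force multiplicative inverse; adequate for a 256-element field."""
    for b in range(1, 256):
        if gf256_mul(a, b) == 1:
            return b
    raise ZeroDivisionError("zero has no inverse")
```

Every nonzero byte has an inverse under `gf256_mul`, which is the property an Sbox inversion relies on; a hardware design would replace the brute-force search with dedicated GF(2^4) logic.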


IEEE Journal of Solid-state Circuits | 2012

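2.4 Gbps, 7 mW All-Digital PVT-Variation Tolerant True Random Number Generator for 45 nm CMOS High-Performance Microprocessors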

Sanu K. Mathew; Suresh Srinivasan; Mark A. Anders; Himanshu Kaul; Steven K. Hsu; Farhana Sheikh; Amit Agarwal; Sudhir K. Satpathy; Ram K. Krishnamurthy

This paper describes an all-digital PVT-variation tolerant true-random number generator (TRNG), fabricated in 45 nm high-k/metal-gate CMOS, targeted for on-die entropy generation in high-performance microprocessors. The TRNG harvests differential thermal-noise at the diffusion nodes of a pre-charged cross-coupled inverter pair to resolve out of metastability, generating one random bit/cycle. A self-calibrating 2-step tuning mechanism using coarse-grained configurable inverters and fine-grained programmable clock delay generators, along with an entropy-tracking feedback loop provide tolerance to 20% PVT variation-induced device mismatches, enabling lowest-reported energy-consumption of 2.9 pJ/bit with a dense layout occupying 4004 μm², while achieving: (i) 2.4 Gbps random bit throughput, 7 mW total power consumption with 0.7 mW leakage power component, measured at 1.1 V, 50°C, (ii) random bitstreams that pass all NIST RNG tests with raw entropy/bit measured up to 0.9999999993, (iii) good distribution of 1s with 4-bit entropy of 3.97996 and high-entropy pattern probability of 0.066, (iv) wide operating supply voltage range with robust sub-threshold voltage performance of 14 Mbps, 5.6 μW, measured at 280 mV, 50°C, (v) 12 fine-grained high-entropy settings for the TRNG to dither in during steady-state operation, (vi) <3% error while using an analytical ergodic Markov chain model for predicting pattern probabilities and (vii) 200x higher throughput and 9x higher energy-efficiency than previously reported implementations. Design modifications for robust operation in 22 nm high-volume manufacturing in the presence of 3σ process variations demonstrate scalability of the all-digital design to future technologies.
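
The energy-per-bit figure in this abstract follows from the reported power and throughput, and the entropy figures are per-bit Shannon entropies. The Python sketch below is only an illustrative check under those readings; the function names are hypothetical and it does not reproduce the paper's measurement method.

```python
import math


def energy_per_bit_pj(power_w: float, throughput_bps: float) -> float:
    """Energy per generated bit in picojoules: power divided by bit rate."""
    return power_w / throughput_bps * 1e12


def binary_entropy(p_one: float) -> float:
    """Shannon entropy (in bits) of one bit that is 1 with probability p_one."""
    if p_one in (0.0, 1.0):
        return 0.0
    return -(p_one * math.log2(p_one) + (1 - p_one) * math.log2(1 - p_one))


# Reported operating point: 7 mW total power at 2.4 Gbps.
nominal = energy_per_bit_pj(7e-3, 2.4e9)
print(f"nominal energy: {nominal:.2f} pJ/bit")
print(f"entropy of an unbiased bit: {binary_entropy(0.5):.3f}")
```

The nominal result is roughly 2.9 pJ/bit, which agrees with the figure quoted in the abstract.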


International Solid-State Circuits Conference | 2010

A 320mV-to-1.2V on-die fine-grained reconfigurable fabric for DSP/media accelerators in 32nm CMOS

Amit Agarwal; Sanu K. Mathew; Steven K. Hsu; Mark A. Anders; Himanshu Kaul; Farhana Sheikh; Rajaraman Ramanarayanan; Suresh Srinivasan; Ram K. Krishnamurthy; Shekhar Borkar

Computationally intensive DSP/media processing applications require specialized hardware accelerators to enable higher energy-efficiency on microprocessor platforms. On-die reconfigurable arrays enable flexible accelerators with dynamic on-the-fly programmability while amortizing die area and time-to-market costs across a wide range of workloads. An ultra-low-voltage fine-grained reconfigurable fabric consisting of a hybrid configurable logic block (CLB) array with process/voltage/temperature (PVT) variation-tolerant register file (Fig. 18.2.1), targeted for on-die acceleration of DSP/media algorithms on power-constrained mobile microprocessors, is fabricated in 32nm high-k/metal-gate CMOS [1]. The CLB combines self-decoded look-up tables (LUTs) for random logic with reconfigurable arithmetic building blocks, hybrid 3∶2 compressors with integrated partial product generation, configurable adder/multiplier carry propagation and optimized CLB input/output multiplexers to achieve peak energy-efficiency of 2.6TOPS/W measured at 340mV, 50°C. The register file includes programmable stacked shared keepers and interruptible operation of both write memory cells and set-dominant latches (SDLs), improving Vcc-min by 300mV across PVT variations with a wide dynamic operating range of 320mV–1.2V, enabling simultaneous dynamic supply/frequency optimization across target workloads and power budgets. These features also achieve: (i) nominal CLB performance of 2.4GHz, 5.3mW measured at 1.0V, (ii) robust CLB functionality measured at 260mV, 27MHz (sub-threshold) consuming 12µW, (iii) scalable register file performance up to 8.2GHz, 125mW measured at 1.2V, 50°C with low-voltage near-threshold operation at 320mV, 252MHz consuming 430µW, (iv) 4-tap FIR filter, radix-2 FFT butterfly and 16b string-match algorithms with peak throughput of 2.1GSamples/s, 2.4GSamples/s and 100Gbps respectively, and (v) application-dependent dual-supply power savings up to 34%.


International Solid-State Circuits Conference | 2012

A 280mV-to-1.1V 256b reconfigurable SIMD vector permutation engine with 2-dimensional shuffle in 22nm CMOS

Steven K. Hsu; Amit Agarwal; Mark A. Anders; Sanu K. Mathew; Himanshu Kaul; Farhana Sheikh; Ram K. Krishnamurthy

Energy-efficient SIMD permutation operations are key for maximizing high-performance microprocessor vector datapath utilization in multimedia, graphics, and signal processing workloads [1-3]. A wide SIMD vector permutation engine is required to achieve high-throughput data rearrangement operations on large data sets, with scaled supply voltages to deliver high energy efficiency. An ultra-low-voltage reconfigurable 4-way to 32-way SIMD vector permutation engine consisting of a 32-entry × 256b 3-read/1-write ported register file with a 256b byte-wise any-to-any permute crossbar for 2-dimensional shuffle is fabricated in 22nm CMOS. The register file integrates a vertical shuffle across multiple entries into read/write operations, and includes clockless static reads with shared P/N dual-ended transmission gate (DETG) writes, improving register file VMIN by 250mV across PVT variations with a wide dynamic operating range of 280mV-1.1V. The permute crossbar implements an interleaved folded byte-wise multiplexer layout forming an any-to-any fully-connected tree to perform a horizontal shuffle with permute accumulate circuits, and includes vector flip-flops, stacked min-delay buffers, shared gates to average min-sized transistor variation, and ultra-low-voltage split-output (ULVS) level shifters improving logic VMIN by 150mV, while enabling peak energy efficiency of 585GOPS/W measured at 260mV, 50°C. The permutation engine occupies a dense layout of 0.048mm² (Fig. 10.1.7) while achieving: (i) nominal register file performance of 1.8GHz, 106mW measured at 0.9V, 50°C; (ii) robust register file functionality measured down to 280mV (subthreshold) with peak energy efficiency of 154GOPS/W; (iii) scalable permute crossbar performance of 2.9GHz, 69mW measured at 1.1V, 50°C with deep sub-threshold operation at 240mV, 10MHz consuming 19μW; and (iv) a 64b 4×4 matrix transpose algorithm with 53% energy savings and 42% improved peak throughput of 263Gbps measured at 1.8GHz, 0.9V.


Symposium on VLSI Circuits | 2010

2.4GHz 7mW all-digital PVT-variation tolerant True Random Number Generator in 45nm CMOS

Suresh Srinivasan; Sanu K. Mathew; Rajaraman Ramanarayanan; Farhana Sheikh; Mark A. Anders; Himanshu Kaul; Vasantha Erraguntla; Ram K. Krishnamurthy; Greg Taylor

An all-digital True Random Number Generator is fabricated in 45nm CMOS with 2.4Gbps random bit throughput and total power consumption of 7mW. Two-step coarse/fine-grained tuning with a self-calibrating feedback loop enables robust operation in the presence of 20% process variation while providing immunity to run-time voltage and temperature fluctuations. The 100% digital design enables a compact layout occupying 4004µm² with measured entropy of 0.999965, and scalable operation down to 280mV, while passing all NIST RNG tests.


Symposium on VLSI Circuits | 2010

53Gbps native GF(2^4)^2 composite-field AES-encrypt/decrypt accelerator for content-protection in 45nm high-performance microprocessors

Sanu K. Mathew; Farhana Sheikh; Amit Agarwal; Mike Kounavis; Steven K. Hsu; Himanshu Kaul; Mark A. Anders; Ram K. Krishnamurthy

An on-die, reconfigurable AES encrypt/decrypt hardware accelerator is fabricated in 45nm CMOS, targeted for content-protection in high-performance microprocessors. Compared to conventional AES implementations, this design computes the entire AES round in native GF(2^4)^2 composite-field with one-time GF(2^8)-to-GF(2^4)^2 mapping cost amortized over multiple AES iterations. This approach along with a fused Mix/InvMixColumns circuit and folded ShiftRow datapath results in 20% area savings and 67% reduction in worst-case interconnect length, enabling AES-128/192/256 ECB block throughput of 53/44/38Gbps, 125mW power measured at 1.1V, 50°C.


IEEE Journal of Solid-state Circuits | 2013

A 280 mV-to-1.1 V 256b Reconfigurable SIMD Vector Permutation Engine With 2-Dimensional Shuffle in 22 nm Tri-Gate CMOS

Steven K. Hsu; Amit Agarwal; Mark A. Anders; Sanu K. Mathew; Himanshu Kaul; Farhana Sheikh; Ram K. Krishnamurthy

An ultra-low voltage reconfigurable 4-way to 32-way SIMD vector permutation engine is fabricated in 22 nm tri-gate bulk CMOS, consisting of a 32-entry × 256b 3-read/1-write ported register file with a 256b byte-wise any-to-any permute crossbar for 2-dimensional shuffle. The register file integrates a vertical shuffle across multiple entries into read/write operations, and includes clock-less static reads with shared P/N dual-ended transmission gate (DETG) writes, improving register file VMIN by 250 mV across PVT variations with a wide dynamic operating range of 280 mV-1.1 V. The permute crossbar implements an interleaved folded byte-wise multiplexer layout forming an any-to-any fully connected tree to perform a horizontal shuffle with permute accumulate circuits, and includes vector flip-flops, stacked min-delay buffers, shared gates, and ultra-low voltage split-output (ULVS) level shifters improving logic VMIN by 150 mV, while enabling peak energy efficiency of 585 GOPS/W measured at 260 mV, 50 °C. The permutation engine achieves: (i) nominal register file performance of 1.8 GHz, 106 mW measured at 0.9 V, 50 °C, (ii) robust register file functionality measured down to 280 mV with peak energy efficiency of 154 GOPS/W, (iii) scalable permute crossbar performance of 2.9 GHz, 69 mW measured at 1.1 V, 50 °C with sub-threshold operation at 240 mV, 10 MHz consuming 19 μW, and (iv) a 64b 4 × 4 matrix transpose algorithm and AoS to SoA conversion with 40%-53% energy savings and 25%-42% improved peak throughput measured at 1.8 GHz, 0.9 V.
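
The matrix transpose and AoS-to-SoA conversion mentioned in this abstract are both instances of an any-to-any permute driven by an index vector. The Python sketch below illustrates that idea only; the lane granularity, the function names, and the row-major layout are assumptions and do not describe the engine's actual instruction interface.

```python
from typing import List, Sequence


def permute(src: Sequence[int], index: Sequence[int]) -> List[int]:
    """Any-to-any permute: output lane i takes the value of source lane index[i]."""
    return [src[i] for i in index]


def transpose_index(rows: int, cols: int) -> List[int]:
    """Index vector that transposes a row-major rows x cols matrix."""
    return [r * cols + c for c in range(cols) for r in range(rows)]


matrix = list(range(16))  # row-major 4 x 4 matrix holding 0..15
transposed = permute(matrix, transpose_index(4, 4))
print(transposed)
```

The same index pattern covers AoS-to-SoA conversion: an array of n structures with f fields each is a row-major n x f matrix, so `transpose_index(n, f)` gathers each field into its own contiguous run.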


International Solid-State Circuits Conference | 2010

A 4.1Tb/s bisection-bandwidth 560Gb/s/W streaming circuit-switched 8×8 mesh network-on-chip in 45nm CMOS

Mark A. Anders; Himanshu Kaul; Steven K. Hsu; Amit Agarwal; Sanu K. Mathew; Farhana Sheikh; Ram K. Krishnamurthy; Shekhar Borkar

Interconnect networks, for high-bandwidth energy-efficient core-to-core communication, are key to enabling future tera-scale multi-core processors. Packet-switched 2D mesh networks provide efficient interconnect utilization, low latencies and high throughputs, but suffer from low energy efficiency due to data storage during routing [1–2]. Circuit-switched data transfer achieves both high bandwidth and energy efficiency by eliminating intra-route data storage [3]. An 8×8 mesh circuit-switched network-on-chip, consisting of arbitration logic for 512b data width with 1b data interconnect, is fabricated in 45nm high-k metal-gate CMOS [4]. Scaling data width measurements to 512b, the circuits achieve 560Gb/s/W energy efficiency, 4.1Tb/s bisection bandwidth, and 11ns diagonal corner-to-corner fall-through latency. Reconfigurable router circuits allow dynamic optimization of both circuit-switched channel-queue depth and the ratio of arbitration vs. data transfer rates based on traffic patterns. Pipelined arbitration phases with packet-switched channel allocation circuits, dual-supply optimization of data transfer power and proximity-based streaming circuits enable: i) 2.64Tb/s maximum throughput for random 512b transmissions measured at 1.1V, 50°C, ii) 87% increased throughput from channel queuing, iii) 6.43Tb/s scalable performance with streaming traffic at energy efficiency of 0.91Tb/s/W, iv) 4.73W total network power at 74mW per router with <17% arbitration overhead, v) traffic-dependent network power consumption scalable down to 1.35W at 21mW per router, vi) 28% power savings through dual-supply optimization at iso-throughput, and vii) low-voltage energy efficiency of 1.51Tb/s/W measured at 550mV, 50°C.
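
Several figures in this abstract can be cross-checked with simple arithmetic. The Python sketch below does so under two assumptions that the abstract does not state outright: that the network power is averaged over all 64 routers of the 8×8 mesh, and that the 2.64 Tb/s random-traffic throughput pairs with the 4.73 W total network power. The function names are hypothetical.

```python
ROUTERS = 8 * 8  # routers in an 8x8 mesh


def per_router_mw(network_power_w: float, routers: int = ROUTERS) -> float:
    """Average power per router in milliwatts."""
    return network_power_w / routers * 1e3


def efficiency_gbps_per_w(throughput_tbps: float, power_w: float) -> float:
    """Energy efficiency in Gb/s per watt."""
    return throughput_tbps * 1e3 / power_w


print(f"{per_router_mw(4.73):.1f} mW per router at 4.73 W")
print(f"{per_router_mw(1.35):.1f} mW per router at 1.35 W")
print(f"{efficiency_gbps_per_w(2.64, 4.73):.0f} Gb/s/W at 2.64 Tb/s and 4.73 W")
```

Under these assumptions the results land close to the quoted 74 mW and 21 mW per router and to the 560 Gb/s/W energy efficiency.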


European Solid-State Circuits Conference | 2011

A 280 mV-to-1.1 V 256b Reconfigurable SIMD Vector Permutation Engine With 2-Dimensional Shuffle in 22 nm Tri-Gate CMOS

Amit Agarwal; Steven K. Hsu; Sanu K. Mathew; Mark A. Anders; Himanshu Kaul; Farhana Sheikh; Ram K. Krishnamurthy

A 128-entry × 128b content addressable memory (CAM) design enables 145ps search operation in 1.0V, 32nm high-k metal-gate CMOS technology. Simulations of a high-speed 16b wide dynamic AND match-line, combined with a fully static search-line and swapped XOR CAM cell, show a 49% reduction of search energy at iso-search delay of 145ps over an optimized high-performance conventional NOR-type CAM design, enabling 1.07fJ/bit/search operation. Scaling the supply voltage of the proposed CAM enables 0.3fJ/bit/search with 1.07ns search delay at 0.5V.


International Solid-State Circuits Conference | 2012

A 4.1Tb/s bisection-bandwidth 560Gb/s/W streaming circuit-switched 8×8 mesh network-on-chip in 45nm CMOS

Himanshu Kaul; Mark A. Anders; Sanu K. Mathew; Steven K. Hsu; Amit Agarwal; Farhana Sheikh; Ram K. Krishnamurthy; Shekhar Borkar

High-throughput floating-point computations are key building blocks of 3D graphics, signal processing and high-performance computing workloads [1,2]. Higher floating-point precisions offer improved accuracy at the expense of performance and energy efficiency, with variable-precision floating-point circuits providing run-time precision selection [3]. Real-time certainty tracking enables variable-precision circuits not only to operate at the higher energy efficiency of low-precision datapaths, but also to preserve high-precision accuracy. A variable-precision floating-point unit that performs fused multiply-adds (FMA) with single-cycle throughput while supporting operation in either 1-way single-precision (24b mantissa), 2-way 12b precision or 4-way 6b precision modes is fabricated in 32nm High-k/Metal-gate CMOS [4]. Simultaneous floating-point certainty tracking, preshifted addends, a combined rounding and negation incrementer, efficient reuse of mantissa datapath for multiple parallel lower precision calculations, robust ultra-low voltage circuits, and fine-grained clock gating enable nominal energy efficiency of 52GFLOPS/W (IEEE 32b single-precision, measured at 1.45GHz, 1.05V, 25°C) with a dense layout occupying 0.045mm² (Fig. 10.3.7) while achieving: (i) scalable performance up to 3.6GFLOPS (single-precision), 96mW measured at 1.2V; (ii) up to 4× higher throughput of 14.4GFLOPS with variable-precision, while maintaining single-precision accuracy; (iii) fast single-cycle precision reconfigurability; (iv) precision mode-dependent power consumption for up to 40% clock power reduction; (v) near-threshold single-precision operation measured at 300mV, 1.75MHz, 11μW; and, (vi) peak energy efficiency of 321GFLOPS/W (single-precision) and 1.2TFLOPS/W (6b precision) at 325mV, 25°C.
