Farhana Sheikh
Intel
Publication
Featured research published by Farhana Sheikh.
IEEE Journal of Solid-State Circuits | 2011
Sanu K. Mathew; Farhana Sheikh; Michael E. Kounavis; Shay Gueron; Amit Agarwal; Steven K. Hsu; Himanshu Kaul; Mark A. Anders; Ram K. Krishnamurthy
This paper describes an on-die, reconfigurable AES encrypt/decrypt hardware accelerator fabricated in 45 nm CMOS, targeted for content-protection in high-performance microprocessors. 100% round computation in native GF(2⁴)² composite-field arithmetic, unified reconfigurable datapath for encrypt/decrypt, optimized ground & composite-field polynomials, integrated affine/bypass multiplexer circuits, fused Mix/InvMixColumn circuits and a folded ShiftRow datapath enable peak 2.2 Tbps/Watt AES-128 energy efficiency with a dense 2-round layout occupying 0.052 mm², while achieving: (i) 53/44/38 Gbps AES-128/192/256 performance, 125 mW, measured at 1.1 V, 50 °C, (ii) scalable AES-128 performance up to 66 Gbps, measured at 1.35 V, 50 °C, (iii) wide operating supply voltage range with robust subthreshold voltage performance of 800 Mbps, 409 μW, measured at 320 mV, 50 °C, (iv) 37% Sbox delay reduction and 25% area reduction with a compact Sbox layout occupying 759 μm², (v) 67% reduction in worst-case interconnect length and 33% reduction in ShiftRow wiring tracks, and (vi) 43% reduction in Mix/InvMixColumn area with no performance penalty.
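As a quick sanity check on the reported figures, the 53/44/38 Gbps rates track the AES round counts (10, 12 and 14 rounds for 128-, 192- and 256-bit keys). The sketch below assumes throughput is inversely proportional to the number of rounds, which is an inference from the numbers and not a statement made in the abstract; all names are illustrative.

```python
# AES round counts per key size (defined by the AES standard).
ROUNDS = {128: 10, 192: 12, 256: 14}

# Throughput reported in the abstract at 1.1 V, 50 °C (Gbps).
REPORTED_GBPS = {128: 53, 192: 44, 256: 38}

def scaled_throughput(key_bits, base_bits=128):
    """Estimate throughput, assuming it is inversely proportional to round count."""
    base = REPORTED_GBPS[base_bits]
    return base * ROUNDS[base_bits] / ROUNDS[key_bits]

for bits in (128, 192, 256):
    est = scaled_throughput(bits)
    print(f"AES-{bits}: estimated {est:.1f} Gbps, reported {REPORTED_GBPS[bits]} Gbps")

# Energy efficiency at the nominal operating point: 53 Gbps at 125 mW.
nominal_tbps_per_watt = 53e9 / 125e-3 / 1e12
print(f"Nominal AES-128 efficiency: {nominal_tbps_per_watt:.3f} Tbps/W")
```

The nominal point works out to roughly 0.42 Tbps/W; the abstract does not state the operating point at which the 2.2 Tbps/W peak was measured.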
IEEE Journal of Solid-State Circuits | 2012
Sanu K. Mathew; Suresh Srinivasan; Mark A. Anders; Himanshu Kaul; Steven K. Hsu; Farhana Sheikh; Amit Agarwal; Sudhir K. Satpathy; Ram K. Krishnamurthy
This paper describes an all-digital PVT-variation tolerant true-random number generator (TRNG), fabricated in 45 nm high-k/metal-gate CMOS, targeted for on-die entropy generation in high-performance microprocessors. The TRNG harvests differential thermal-noise at the diffusion nodes of a pre-charged cross-coupled inverter pair to resolve out of metastability, generating one random bit/cycle. A self-calibrating 2-step tuning mechanism using coarse-grained configurable inverters and fine-grained programmable clock delay generators, along with an entropy-tracking feedback loop, provides tolerance to 20% PVT variation-induced device mismatches, enabling lowest-reported energy-consumption of 2.9 pJ/bit with a dense layout occupying 4004 μm², while achieving: (i) 2.4 Gbps random bit throughput, 7 mW total power consumption with 0.7 mW leakage power component, measured at 1.1 V, 50°C, (ii) random bitstreams that pass all NIST RNG tests with raw entropy/bit measured up to 0.9999999993, (iii) good distribution of 1s with 4-bit entropy of 3.97996 and high-entropy pattern probability of 0.066, (iv) wide operating supply voltage range with robust sub-threshold voltage performance of 14 Mbps, 5.6 μW, measured at 280 mV, 50°C, (v) 12 fine-grained high-entropy settings for the TRNG to dither in during steady-state operation, (vi) <3% error while using an analytical ergodic Markov chain model for predicting pattern probabilities and (vii) 200x higher throughput and 9x higher energy-efficiency than previously reported implementations. Design modifications for robust operation in 22 nm high-volume manufacturing in the presence of 3σ process variations demonstrate scalability of the all-digital design to future technologies.
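To make the "4-bit entropy" and "pJ/bit" figures concrete, here is a minimal sketch of how a k-bit pattern entropy could be computed from a bitstream, together with the energy-per-bit arithmetic. It assumes Shannon entropy over non-overlapping windows; the abstract does not say which windowing the authors used, and the function name is hypothetical.

```python
import math
from collections import Counter

def pattern_entropy(bits, k=4):
    """Shannon entropy (in bits) of non-overlapping k-bit patterns in a bit sequence."""
    n_patterns = len(bits) // k
    if n_patterns == 0:
        raise ValueError("need at least k bits")
    counts = Counter(tuple(bits[i * k:(i + 1) * k]) for i in range(n_patterns))
    return sum((c / n_patterns) * math.log2(n_patterns / c) for c in counts.values())

# A stream containing every 4-bit pattern exactly once reaches the 4-bit maximum.
uniform = [(v >> s) & 1 for v in range(16) for s in (3, 2, 1, 0)]
print(pattern_entropy(uniform))      # 4.0

# A stuck-at-1 stream carries no entropy.
print(pattern_entropy([1] * 64))     # 0.0

# Energy per bit implied by the reported 7 mW at 2.4 Gbps.
energy_pj_per_bit = 7e-3 / 2.4e9 * 1e12
print(f"{energy_pj_per_bit:.2f} pJ/bit")
```

The last line reproduces the reported figure of about 2.9 pJ/bit, and an ideal source would give each 4-bit pattern a probability of 1/16 = 0.0625.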
International Solid-State Circuits Conference | 2010
Amit Agarwal; Sanu K. Mathew; Steven K. Hsu; Mark A. Anders; Himanshu Kaul; Farhana Sheikh; Rajaraman Ramanarayanan; Suresh Srinivasan; Ram K. Krishnamurthy; Shekhar Borkar
Computationally intensive DSP/media processing applications require specialized hardware accelerators to enable higher energy-efficiency on microprocessor platforms. On-die reconfigurable arrays enable flexible accelerators with dynamic on-the-fly programmability while amortizing die area and time-to-market costs across a wide range of workloads. An ultra-low-voltage fine-grained reconfigurable fabric consisting of a hybrid configurable logic block (CLB) array with process/voltage/temperature (PVT) variation-tolerant register file (Fig. 18.2.1), targeted for on-die acceleration of DSP/media algorithms on power-constrained mobile microprocessors, is fabricated in 32nm high-k/metal-gate CMOS [1]. The CLB combines self-decoded look-up tables (LUTs) for random logic with reconfigurable arithmetic building blocks, hybrid 3∶2 compressors with integrated partial product generation, configurable adder/multiplier carry propagation and optimized CLB input/output multiplexers to achieve peak energy-efficiency of 2.6TOPS/W measured at 340mV, 50°C. The register file includes programmable stacked shared keepers and interruptible operation of both write memory cells and set-dominant latches (SDLs), improving Vcc-min by 300mV across PVT variations with a wide dynamic operating range of 320mV–1.2V, enabling simultaneous dynamic supply/frequency optimization across target workloads and power budgets. These features also achieve: (i) nominal CLB performance of 2.4GHz, 5.3mW measured at 1.0V, (ii) robust CLB functionality measured at 260mV, 27MHz (sub-threshold) consuming 12µW, (iii) scalable register file performance up to 8.2GHz, 125mW measured at 1.2V, 50°C with low-voltage near-threshold operation at 320mV, 252MHz consuming 430µW, (iv) 4-tap FIR filter, radix-2 FFT butterfly and 16b string-match algorithms with peak throughput of 2.1GSamples/s, 2.4GSamples/s and 100Gbps respectively, and (v) application-dependent dual-supply power savings up to 34%.
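For readers unfamiliar with the benchmark kernels named above, the following is a plain software sketch of a 4-tap FIR filter and a radix-2 FFT butterfly. It only illustrates the arithmetic these kernels perform; it says nothing about how they are mapped onto the CLB array, and the function names are illustrative.

```python
def fir4(samples, taps):
    """Direct-form 4-tap FIR: y[n] = sum(taps[k] * x[n-k]), with zero initial state."""
    assert len(taps) == 4
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * samples[n - k]
        out.append(acc)
    return out

def radix2_butterfly(a, b, w):
    """Radix-2 decimation-in-time butterfly: returns (a + w*b, a - w*b)."""
    t = w * b
    return a + t, a - t

# An impulse input reproduces the tap values.
print(fir4([1, 0, 0, 0, 0], [1, 2, 3, 4]))   # [1, 2, 3, 4, 0]
print(radix2_butterfly(1 + 2j, 3 - 1j, 1j))
```

Both kernels are dominated by multiply-accumulate operations, which matches the abstract's emphasis on partial-product generation, compressors, and carry propagation.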
International Solid-State Circuits Conference | 2012
Steven K. Hsu; Amit Agarwal; Mark A. Anders; Sanu K. Mathew; Himanshu Kaul; Farhana Sheikh; Ram K. Krishnamurthy
Energy-efficient SIMD permutation operations are key for maximizing high-performance microprocessor vector datapath utilization in multimedia, graphics, and signal processing workloads [1-3]. A wide SIMD vector permutation engine is required to achieve high-throughput data rearrangement operations on large data sets, with scaled supply voltages to deliver high energy efficiency. An ultra-low-voltage reconfigurable 4-way to 32-way SIMD vector permutation engine consisting of a 32-entry × 256b 3-read/1-write ported register file with a 256b byte-wise any-to-any permute crossbar for 2-dimensional shuffle is fabricated in 22nm CMOS. The register file integrates a vertical shuffle across multiple entries into read/write operations, and includes clockless static reads with shared P/N dual-ended transmission gate (DETG) writes, improving register file VMIN by 250mV across PVT variations with a wide dynamic operating range of 280mV-1.1V. The permute crossbar implements an interleaved folded byte-wise multiplexer layout forming an any-to-any fully-connected tree to perform a horizontal shuffle with permute accumulate circuits, and includes vector flip-flops, stacked min-delay buffers, shared gates to average min-sized transistor variation, and ultra-low-voltage split-output (ULVS) level shifters improving logic VMIN by 150mV, while enabling peak energy efficiency of 585GOPS/W measured at 260mV, 50°C. The permutation engine occupies a dense layout of 0.048mm² (Fig. 10.1.7) while achieving: (i) nominal register file performance of 1.8GHz, 106mW measured at 0.9V, 50°C; (ii) robust register file functionality measured down to 280mV (subthreshold) with peak energy efficiency of 154GOPS/W; (iii) scalable permute crossbar performance of 2.9GHz, 69mW measured at 1.1V, 50°C with deep sub-threshold operation at 240mV, 10MHz consuming 19μW; and (iv) a 64b 4×4 matrix transpose algorithm with 53% energy savings and 42% improved peak throughput of 263Gbps measured at 1.8GHz, 0.9V.
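The 2-dimensional shuffle can be pictured with a small behavioral model: a horizontal byte-wise permute within a 256b word, followed by a vertical selection across register-file entries. The sketch below transposes a 4×4 matrix of 64b elements this way. It is a software illustration under assumed data layout (one matrix row per entry, four 8-byte lanes), not a description of the actual circuit sequencing, and all names are hypothetical.

```python
LANE_BYTES = 8    # one 64b element
ROW_BYTES = 32    # one 256b register-file entry (four lanes)

def permute_bytes(src, idx):
    """Any-to-any byte permute: output byte i takes source byte idx[i]."""
    return bytes(src[j] for j in idx)

def lane_rotate_idx(shift):
    """Byte indices that make output lane L read source lane (L + shift) mod 4."""
    return [((i // LANE_BYTES + shift) % 4) * LANE_BYTES + i % LANE_BYTES
            for i in range(ROW_BYTES)]

def transpose_4x4(rows):
    """Transpose a 4x4 matrix of 64b elements held in four 256b rows."""
    out = []
    for r in range(4):
        word = bytearray(ROW_BYTES)
        for c in range(4):
            # Horizontal shuffle: bring lane r of source row c into lane c.
            shifted = permute_bytes(rows[c], lane_rotate_idx((r - c) % 4))
            # Vertical shuffle: lane c of the output word comes from source row c.
            lane = slice(c * LANE_BYTES, (c + 1) * LANE_BYTES)
            word[lane] = shifted[lane]
        out.append(bytes(word))
    return out

def make_row(r):
    """Test data: element (r, c) is the byte value 16*r + c repeated 8 times."""
    return b"".join(bytes([16 * r + c]) * LANE_BYTES for c in range(4))

rows = [make_row(r) for r in range(4)]
transposed = transpose_4x4(rows)
print([row[::LANE_BYTES].hex() for row in transposed])
```

Each printed string shows one byte per lane, so the row and column digits of every element appear swapped relative to the input.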
Symposium on VLSI Circuits | 2010
Suresh Srinivasan; Sanu K. Mathew; Rajaraman Ramanarayanan; Farhana Sheikh; Mark A. Anders; Himanshu Kaul; Vasantha Erraguntla; Ram K. Krishnamurthy; Greg Taylor
An all-digital True Random Number Generator is fabricated in 45nm CMOS with 2.4Gbps random bit throughput and total power consumption of 7mW. Two-step coarse/fine-grained tuning with a self-calibrating feedback loop enables robust operation in the presence of 20% process variation while providing immunity to run-time voltage and temperature fluctuations. The 100% digital design enables a compact layout occupying 4004µm² with measured entropy of 0.999965, and scalable operation down to 280mV, while passing all NIST RNG tests.
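To give a feel for what an entropy of 0.999965 means, the sketch below inverts the binary entropy function to find the bit bias that would produce it. It assumes the reported value is the Shannon entropy per bit of an independent, identically distributed source, which the abstract does not state explicitly; the function names are illustrative.

```python
import math

def binary_entropy(p):
    """Shannon entropy in bits of a source that emits 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bias_for_entropy(h, tol=1e-12):
    """Find p >= 0.5 with binary_entropy(p) == h by bisection (entropy falls as p rises)."""
    lo, hi = 0.5, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_entropy(mid) > h:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

p = bias_for_entropy(0.999965)
print(f"P(1) is about {p:.4f} (or, symmetrically, {1 - p:.4f})")
```

Under that assumption the bias is a fraction of a percent away from an ideal 50/50 split.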
Symposium on VLSI Circuits | 2010
Sanu K. Mathew; Farhana Sheikh; Amit Agarwal; Mike Kounavis; Steven K. Hsu; Himanshu Kaul; Mark A. Anders; Ram K. Krishnamurthy
An on-die, reconfigurable AES encrypt/decrypt hardware accelerator is fabricated in 45nm CMOS, targeted for content-protection in high-performance microprocessors. Compared to conventional AES implementations, this design computes the entire AES round in native GF(2⁴)² composite-field with one-time GF(2⁸)-to-GF(2⁴)² mapping cost amortized over multiple AES iterations. This approach along with a fused Mix/InvMixColumns circuit and folded ShiftRow datapath results in 20% area savings and 67% reduction in worst-case interconnect length, enabling AES-128/192/256 ECB block throughput of 53/44/38Gbps, 125mW power measured at 1.1V, 50°C.
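The appeal of composite-field arithmetic is that an 8-bit field inversion (the core of the AES Sbox) reduces to a handful of 4-bit operations. The sketch below shows multiplication and inversion in a composite field built over GF(2^4). The ground-field polynomial x^4 + x + 1 and the extension polynomial y^2 + y + lambda are assumptions chosen for illustration; the paper's optimized polynomials and its GF(2^8)-to-composite basis mapping are not reproduced here.

```python
def gf16_mul(a, b, poly=0b10011):
    """Multiply in GF(2^4) modulo x^4 + x + 1 (an assumed ground-field polynomial)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:
            a ^= poly
    return r

def gf16_inv(a):
    """Invert in GF(2^4) as a^14, since a^15 = 1 for every nonzero a."""
    r = 1
    for _ in range(14):
        r = gf16_mul(r, a)
    return r

# Pick lambda so that y^2 + y + lambda has no root in GF(2^4), i.e. is irreducible.
LAMBDA = next(lam for lam in range(1, 16)
              if all(gf16_mul(t, t) ^ t ^ lam for t in range(16)))

def comp_mul(a, b):
    """Multiply a = a1*y + a0 and b = b1*y + b0, using y^2 = y + lambda."""
    a1, a0 = a >> 4, a & 0xF
    b1, b0 = b >> 4, b & 0xF
    hh = gf16_mul(a1, b1)
    hi = hh ^ gf16_mul(a1, b0) ^ gf16_mul(a0, b1)
    lo = gf16_mul(a0, b0) ^ gf16_mul(LAMBDA, hh)
    return (hi << 4) | lo

def comp_inv(a):
    """Invert in the composite field with a single GF(2^4) inversion of the norm."""
    a1, a0 = a >> 4, a & 0xF
    norm = gf16_mul(LAMBDA, gf16_mul(a1, a1)) ^ gf16_mul(a1, a0) ^ gf16_mul(a0, a0)
    n_inv = gf16_inv(norm)
    return (gf16_mul(a1, n_inv) << 4) | gf16_mul(a0 ^ a1, n_inv)

# Every nonzero element times its inverse must equal 1.
assert all(comp_mul(a, comp_inv(a)) == 1 for a in range(1, 256))
```

Staying in this representation for the whole round, as the abstract describes, means the basis conversion is paid once per block and not once per Sbox lookup.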
IEEE Journal of Solid-State Circuits | 2013
Steven K. Hsu; Amit Agarwal; Mark A. Anders; Sanu K. Mathew; Himanshu Kaul; Farhana Sheikh; Ram K. Krishnamurthy
An ultra-low voltage reconfigurable 4-way to 32-way SIMD vector permutation engine is fabricated in 22 nm tri-gate bulk CMOS, consisting of a 32-entry × 256b 3-read/1-write ported register file with a 256b byte-wise any-to-any permute crossbar for 2-dimensional shuffle. The register file integrates a vertical shuffle across multiple entries into read/write operations, and includes clock-less static reads with shared P/N dual-ended transmission gate (DETG) writes, improving register file VMIN by 250 mV across PVT variations with a wide dynamic operating range of 280 mV-1.1 V. The permute crossbar implements an interleaved folded byte-wise multiplexer layout forming an any-to-any fully connected tree to perform a horizontal shuffle with permute accumulate circuits, and includes vector flip-flops, stacked min-delay buffers, shared gates, and ultra-low voltage split-output (ULVS) level shifters improving logic VMIN by 150 mV, while enabling peak energy efficiency of 585 GOPS/W measured at 260 mV, 50 °C. The permutation engine achieves: (i) nominal register file performance of 1.8 GHz, 106 mW measured at 0.9 V, 50 °C, (ii) robust register file functionality measured down to 280 mV with peak energy efficiency of 154 GOPS/W, (iii) scalable permute crossbar performance of 2.9 GHz, 69 mW measured at 1.1 V, 50 °C with sub-threshold operation at 240 mV, 10 MHz consuming 19 μW, and (iv) a 64b 4 × 4 matrix transpose algorithm and AoS to SoA conversion with 40%-53% energy savings and 25%-42% improved peak throughput measured at 1.8 GHz, 0.9 V.
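The AoS-to-SoA conversion mentioned in item (iv) is, in software terms, a fixed index permutation. Here is a minimal sketch assuming a flat array of structs with four fields each; the layout, field names, and function name are hypothetical and chosen only to show the access pattern.

```python
def aos_to_soa(aos, fields):
    """Reorder [s0f0, s0f1, ..., s1f0, ...] into [all f0, all f1, ...] via an index permute."""
    n = len(aos) // fields
    assert n * fields == len(aos)
    # Output position (f, s) reads input position (s, f).
    idx = [s * fields + f for f in range(fields) for s in range(n)]
    return [aos[i] for i in idx], idx

# Eight structs with fields x, y, z, w stored one struct after another.
points = [f"{f}{s}" for s in range(8) for f in "xyzw"]
soa, idx = aos_to_soa(points, fields=4)
print(soa[:8])    # all x components
print(soa[8:16])  # all y components
```

With four structs of four fields the index pattern is identical to a 4×4 matrix transpose, which is why the two workloads appear together in the abstract.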
International Solid-State Circuits Conference | 2010
Mark A. Anders; Himanshu Kaul; Steven K. Hsu; Amit Agarwal; Sanu K. Mathew; Farhana Sheikh; Ram K. Krishnamurthy; Shekhar Borkar
Interconnect networks, for high-bandwidth energy-efficient core-to-core communication, are key to enabling future tera-scale multi-core processors. Packet-switched 2D mesh networks provide efficient interconnect utilization, low latencies and high throughputs, but suffer from low energy efficiency due to data storage during routing [1–2]. Circuit-switched data transfer achieves both high bandwidth and energy efficiency by eliminating intra-route data storage [3]. An 8×8 mesh circuit-switched network-on-chip, consisting of arbitration logic for 512b data width with 1b data interconnect, is fabricated in 45nm high-k metal-gate CMOS [4]. Scaling data width measurements to 512b, the circuits achieve 560Gb/s/W energy efficiency, 4.1Tb/s bisection bandwidth, and 11ns diagonal corner-to-corner fall-through latency. Reconfigurable router circuits allow dynamic optimization of both circuit-switched channel-queue depth and the ratio of arbitration vs. data transfer rates based on traffic patterns. Pipelined arbitration phases with packet-switched channel allocation circuits, dual-supply optimization of data transfer power and proximity-based streaming circuits enable: i) 2.64Tb/s maximum throughput for random 512b transmissions measured at 1.1V, 50°C, ii) 87% increased throughput from channel queuing, iii) 6.43Tb/s scalable performance with streaming traffic at energy efficiency of 0.91Tb/s/W, iv) 4.73W total network power at 74mW per router with <17% arbitration overhead, v) traffic-dependent network power consumption scalable down to 1.35W at 21mW per router, vi) 28% power savings through dual-supply optimization at iso-throughput, and vii) low-voltage energy efficiency of 1.51Tb/s/W measured at 550mV, 50°C.
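Several of the reported network figures can be cross-checked with simple arithmetic on the 8×8 mesh. The sketch below assumes minimal (Manhattan-distance) routes and an even split of the 11ns latency across hops; neither assumption is stated in the abstract, and the names are illustrative.

```python
MESH = 8                       # 8x8 mesh from the abstract
ROUTERS = MESH * MESH

def hops(src, dst):
    """Minimal hop count between two mesh nodes (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

# Diagonal corner-to-corner route and the implied average latency per hop.
corner_hops = hops((0, 0), (MESH - 1, MESH - 1))
print(corner_hops)                                   # 14
print(f"{11 / corner_hops:.2f} ns per hop")

# Per-router power times router count should reproduce the network totals.
print(f"{ROUTERS * 74e-3:.2f} W")    # reported: 4.73 W
print(f"{ROUTERS * 21e-3:.2f} W")    # reported: 1.35 W

# Random-traffic throughput over total power gives the energy efficiency.
efficiency_gbps_per_watt = 2.64e12 / 4.73 / 1e9
print(f"{efficiency_gbps_per_watt:.0f} Gb/s/W")      # reported: 560 Gb/s/W
```

The computed values agree with the reported totals to within rounding.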
European Solid-State Circuits Conference | 2011
Amit Agarwal; Steven K. Hsu; Sanu K. Mathew; Mark A. Anders; Himanshu Kaul; Farhana Sheikh; Ram K. Krishnamurthy
A 128-entry × 128b content addressable memory (CAM) design enables 145ps search operation in 1.0V, 32nm high-k metal-gate CMOS technology. Simulations of a high-speed 16b-wide dynamic AND match-line, combined with a fully static search-line and a swapped XOR CAM cell, show a 49% reduction in search energy at an iso-search delay of 145ps compared with an optimized high-performance conventional NOR-type CAM design, enabling 1.07fJ/bit/search operation. Scaling the supply voltage of the proposed CAM enables 0.3fJ/bit/search with 1.07ns search delay at 0.5V.
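A behavioral model helps show what a CAM search does: every stored word is compared against the key, and matching entries are reported. The sketch below models a 128 × 128b array with the comparison broken into 16b segments evaluated in series, so a mismatching segment stops further evaluation. The serial chaining of segments and the per-stored-bit reading of the energy figure are assumptions, and the names are hypothetical.

```python
ENTRIES, WIDTH, SEGMENT = 128, 128, 16

def cam_search(table, key):
    """Return indices of entries equal to key, checking 16b segments one after another."""
    seg_mask = (1 << SEGMENT) - 1
    matches = []
    for i, word in enumerate(table):
        hit = True
        for s in range(WIDTH // SEGMENT):
            # XOR flags differing bits; stop at the first mismatching segment.
            if ((word ^ key) >> (s * SEGMENT)) & seg_mask:
                hit = False
                break
        if hit:
            matches.append(i)
    return matches

# Distinct test contents: multiples of an odd 64b constant.
table = [(i * 0x9E3779B97F4A7C15) % (1 << WIDTH) for i in range(ENTRIES)]
print(cam_search(table, table[42]))          # [42]
print(cam_search(table, table[42] ^ 1))      # []

# Energy per search if 1.07 fJ/bit/search applies to every stored bit (assumption).
energy_pj_per_search = 1.07e-15 * ENTRIES * WIDTH * 1e12
print(f"{energy_pj_per_search:.1f} pJ per search")
```

In this model most entries are rejected by their first segment, which gives an intuition for why a segmented AND-type match-line can save energy relative to evaluating every bit of every entry.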
International Solid-State Circuits Conference | 2012
Himanshu Kaul; Mark A. Anders; Sanu K. Mathew; Steven K. Hsu; Amit Agarwal; Farhana Sheikh; Ram K. Krishnamurthy; Shekhar Borkar
High-throughput floating-point computations are key building blocks of 3D graphics, signal processing and high-performance computing workloads [1,2]. Higher floating-point precisions offer improved accuracy at the expense of performance and energy efficiency, with variable-precision floating-point circuits providing run-time precision selection [3]. Real-time certainty tracking enables variable-precision circuits not only to operate at the higher energy efficiency of low-precision datapaths, but also to preserve high-precision accuracy. A variable-precision floating-point unit that performs fused multiply-adds (FMA) with single-cycle throughput while supporting operation in either 1-way single-precision (24b mantissa), 2-way 12b precision or 4-way 6b precision modes is fabricated in 32nm high-k/metal-gate CMOS [4]. Simultaneous floating-point certainty tracking, preshifted addends, a combined rounding and negation incrementer, efficient reuse of mantissa datapath for multiple parallel lower precision calculations, robust ultra-low voltage circuits, and fine-grained clock gating enable nominal energy efficiency of 52GFLOPS/W (IEEE 32b single-precision, measured at 1.45GHz, 1.05V, 25°C) with a dense layout occupying 0.045mm² (Fig. 10.3.7) while achieving: (i) scalable performance up to 3.6GFLOPS (single-precision), 96mW measured at 1.2V; (ii) up to 4× higher throughput of 14.4GFLOPS with variable-precision, while maintaining single-precision accuracy; (iii) fast single-cycle precision reconfigurability; (iv) precision mode-dependent power consumption for up to 40% clock power reduction; (v) near-threshold single-precision operation measured at 300mV, 1.75MHz, 11μW; and, (vi) peak energy efficiency of 321GFLOPS/W (single-precision) and 1.2TFLOPS/W (6b precision) at 325mV, 25°C.
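The accuracy cost of the 12b and 6b modes can be illustrated by rounding a value to a given mantissa width. The sketch below is a software approximation only: it counts the leading 1 as part of the mantissa, leaves the exponent range unbounded, and does not model the certainty-tracking logic described in the abstract. The function names are hypothetical.

```python
import math
import struct

def quantize(x, mantissa_bits):
    """Round x to the nearest value with the given mantissa width (leading 1 included)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                 # x = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << mantissa_bits
    return math.ldexp(round(m * scale) / scale, e)

def to_float32(x):
    """Round-trip through IEEE 754 single precision for comparison."""
    return struct.unpack("f", struct.pack("f", x))[0]

# The three precision modes named in the abstract, applied to pi.
for bits in (24, 12, 6):
    print(bits, quantize(math.pi, bits))

# Peak throughput scales with the number of parallel lanes: 3.6 GFLOPS x 4 ways.
print(3.6 * 4)
```

For values inside the single-precision exponent range, the 24b result matches a round-trip through `to_float32`, while the 6b mode keeps only about two significant decimal digits.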