Publication


Featured research published by Sanu K. Mathew.


symposium on computer arithmetic | 2005

An improved unified scalable radix-2 Montgomery multiplier

David Money Harris; Ram K. Krishnamurthy; Mark A. Anders; Sanu K. Mathew; Steven K. Hsu

This paper describes an improved version of the Tenca-Koc unified scalable radix-2 Montgomery multiplier with half the latency for small and moderate precision operands and half the queue memory requirement. Like the Tenca-Koc multiplier, this design is reconfigurable to accept any input precision in either GF(p) or GF(2^n) up to the size of the on-chip memory. An FPGA implementation can perform 1024-bit modular exponentiation in 16 ms using 5598 4-input lookup tables, making it the fastest unified scalable design yet reported.
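For readers unfamiliar with the underlying arithmetic, the following is a minimal software sketch of bit-serial (radix-2) Montgomery multiplication in GF(p). It is not the scalable hardware architecture from the paper; the function name and parameters are illustrative assumptions, and it only shows the add, conditionally-add-modulus, shift-right loop that such multipliers are built around.

```python
def montgomery_multiply_radix2(x, y, modulus, n_bits):
    """Return x * y * 2^(-n_bits) mod modulus.

    Assumptions (hypothetical interface, not from the paper):
    modulus is odd, modulus < 2**n_bits, and 0 <= x, y < modulus.
    """
    if modulus % 2 == 0:
        raise ValueError("modulus must be odd")
    acc = 0
    for i in range(n_bits):
        # Add y when bit i of x is set.
        if (x >> i) & 1:
            acc += y
        # Make the accumulator even so the right shift is exact mod modulus.
        if acc & 1:
            acc += modulus
        acc >>= 1
    # The loop keeps acc below 2 * modulus, so one subtraction suffices.
    if acc >= modulus:
        acc -= modulus
    return acc


if __name__ == "__main__":
    p, n = 97, 7  # toy parameters: 2**7 = 128 > 97
    r = pow(2, n, p)
    product = montgomery_multiply_radix2(45, 76, p, n)
    # Multiplying by R = 2^n undoes the Montgomery factor.
    print((product * r) % p == (45 * 76) % p)
```

The result carries an extra factor of 2^(-n_bits), which is why operands are normally converted into Montgomery form before a long chain of multiplications such as modular exponentiation.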


IEEE Journal of Solid-state Circuits | 2003

A 4-GHz 130-nm address generation unit with 32-bit sparse-tree adder core

Sanu K. Mathew; Mark A. Anders; Ram K. Krishnamurthy; Shekhar Borkar

This paper describes a 32-bit Address Generation Unit (AGU) designed for 4 GHz operation in 1.2 V, 130 nm technology. The AGU utilizes a 152 ps dual-Vt sparse-tree adder core to achieve 20% delay reduction, 80% lower interconnect density and a low (1%) active energy leakage component. The semidynamic implementation enables an average energy profile similar to static CMOS, with good sub-130 nm scaling trend.


international solid-state circuits conference | 2008

A 320mV 56μW 411GOPS/Watt Ultra-Low Voltage Motion Estimation Accelerator in 65nm CMOS

Himanshu Kaul; Mark A. Anders; Sanu K. Mathew; Steven K. Hsu; Amit Agarwal; Ram K. Krishnamurthy; Shekhar Borkar

Motion estimation for compressing inter-frame redundancies is the most performance- and power-critical operation in video encoding applications, where a wide range of throughput and power constraints must be met to handle a variety of video resolutions, frame rates and application specifications. A motion estimation engine targeted for special-purpose on-die acceleration of sum of absolute difference (SAD) computation in real-time video encoding workloads on power-constrained mobile microprocessors is fabricated in 65nm CMOS.
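As a rough illustration of the SAD computation mentioned in the abstract, the sketch below scores a block against every candidate position in a reference frame and keeps the lowest-cost match. This is a plain software model under assumed names (`sad`, `extract_block`, `full_search`) and an assumed exhaustive search; the abstract does not describe the accelerator's search strategy or datapath.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def extract_block(frame, top, left, size):
    """Copy a size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def full_search(current_block, reference_frame, size):
    """Return (cost, top, left) of the lowest-SAD candidate (first one on ties)."""
    height = len(reference_frame)
    width = len(reference_frame[0])
    best = None
    for top in range(height - size + 1):
        for left in range(width - size + 1):
            candidate = extract_block(reference_frame, top, left, size)
            cost = sad(current_block, candidate)
            if best is None or cost < best[0]:
                best = (cost, top, left)
    return best


if __name__ == "__main__":
    # Toy 4x4 frame with distinct pixel values 0..15.
    frame = [[row * 4 + col for col in range(4)] for row in range(4)]
    block = extract_block(frame, 1, 2, 2)
    print(full_search(block, frame, 2))
```

Each candidate costs one subtraction, absolute value and accumulation per pixel, which is why SAD dominates the arithmetic in block-matching encoders.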


international solid-state circuits conference | 2004

A 4-GHz 300-mW 64-bit integer execution ALU with dual supply voltages in 90-nm CMOS

Sanu K. Mathew; Mark A. Anders; Brad Bloechel; Trang Nguyen; Ram K. Krishnamurthy; Shekhar Borkar

This paper describes a single-cycle 64-bit integer execution ALU fabricated in 90-nm dual-Vt CMOS technology, operating at 4 GHz in the 64-bit mode with a 32-bit mode frequency of 7 GHz (measured at 1.3 V, 25 °C). The lower- and upper-order 32-bit domains operate on separate off-chip supply voltages, enabling conditional turn-on/off of the 64-bit ALU mode operation and efficient power-performance optimization. High-speed single-rail dynamic circuit techniques and a sparse-tree semi-dynamic adder architecture enable a dense layout occupying 280 × 260 μm² while simultaneously achieving: (i) low carry-merge fan-outs and inter-stage wiring complexity; (ii) low active leakage and dynamic power consumption; (iii) high DC noise robustness with maximum low-Vt usage; (iv) single-rail dynamic-compatible ALU write-back bus; (v) simple 2Φ 50% duty-cycle timing plan with seamless time-borrowing across phases; (vi) scalable 64-bit ALU performance up to 7 GHz measured at 2.1 V, 25 °C; and (vii) scalable 32-bit ALU performance up to 9 GHz measured at 1.68 V, 25 °C.


international solid-state circuits conference | 2014

16.2 A 0.19pJ/b PVT-variation-tolerant hybrid physically unclonable function circuit for 100% stable secure key generation in 22nm CMOS

Sanu K. Mathew; Sudhir K. Satpathy; Mark A. Anders; Himanshu Kaul; Steven K. Hsu; Amit Agarwal; Gregory K. Chen; Rachael J. Parker; Ram K. Krishnamurthy; Vivek De

Physically unclonable function (PUF) circuits are low-cost cryptographic primitives used for generation of unique, stable and secure keys or chip IDs for device authentication and data security in high-performance microprocessors [1][2][3][7]. The volatile nature of PUFs provides a high level of security and tamper resistance against invasive probing attacks compared to conventional fuse-based key storage technologies [4]. A process-voltage-temperature (PVT) variation-tolerant all-digital PUF array targeted for on-die generation of 100% stable, device-specific, high-entropy keys is fabricated in 22nm tri-gate high-κ metal-gate CMOS technology [5], featuring: i) a hybrid delay/cross-coupled PUF circuit where interaction of 16 minimum-sized, variation-impacted transistors determines resolution dynamics, ii) a temporal majority voting (TMV) circuit to stabilize occasionally unstable bits, resulting in 53% reduction in instability, iii) burn-in hardening to reinforce manufacturing-time PUF bias, resulting in 22% reduction in bit-errors, iv) soft dark bits for run-time identification and sequestration of highly unstable bits during field operation, resulting in 78% lower bit-errors, v) 19× separation between inter- and intra-PUF Hamming distance, enabling die-specific keys, vi) autocorrelation factor ≈ 0 and entropy = 0.9997, while passing NIST randomness tests, vii) high tolerance to voltage and temperature variation with 82% reduction in average Hamming distance using a 100-cycle dark bit window, viii) in-situ PUF hardening by leveraging directed NBTI aging to improve stability during field operation, and ix) ultra-low energy consumption of 0.19pJ/b with compact bitcell layout of 4.66 μm² (Fig. 16.2.7a).
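Two of the ideas named in the abstract, temporal majority voting and Hamming distance between responses, can be modeled in a few lines of software. The sketch below is an assumption-laden illustration only: the function names, the list-of-bits representation and the three-sample voting window are hypothetical and are not taken from the paper's circuit.

```python
def temporal_majority_vote(samples):
    """Combine repeated reads of the same PUF cells into one response.

    samples is a list of equal-length bit lists, one per evaluation.
    An odd number of evaluations is assumed so that no vote can tie.
    """
    if len(samples) % 2 == 0:
        raise ValueError("use an odd number of evaluations to avoid ties")
    count = len(samples)
    # zip(*samples) groups the reads of each cell across all evaluations.
    return [1 if 2 * sum(cell_reads) > count else 0 for cell_reads in zip(*samples)]


def hamming_distance(bits_a, bits_b):
    """Number of positions at which two equal-length responses differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


if __name__ == "__main__":
    reads = [
        [1, 0, 1, 1],
        [1, 0, 0, 1],  # third cell flips in this evaluation
        [1, 1, 1, 1],  # second cell flips in this evaluation
    ]
    voted = temporal_majority_vote(reads)
    print(voted)
    print(hamming_distance(voted, [1, 1, 1, 0]))
```

Intra-PUF distance compares repeated responses from the same die and should be small, while inter-PUF distance compares different dies and should be near half the response length; the gap between the two is what makes a key die-specific.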


IEEE Journal of Solid-state Circuits | 2011

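53 Gbps Native GF(2^4)^2 Composite-Field AES-Encrypt/Decrypt Accelerator for Content-Protection in 45 nm High-Performance Microprocessors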

Sanu K. Mathew; Farhana Sheikh; Michael E. Kounavis; Shay Gueron; Amit Agarwal; Steven K. Hsu; Himanshu Kaul; Mark A. Anders; Ram K. Krishnamurthy

This paper describes an on-die, reconfigurable AES encrypt/decrypt hardware accelerator fabricated in 45 nm CMOS, targeted for content-protection in high-performance microprocessors. 100% round computation in native GF(2^4)^2 composite-field arithmetic, unified reconfigurable datapath for encrypt/decrypt, optimized ground & composite-field polynomials, integrated affine/bypass multiplexer circuits, fused Mix/InvMixColumn circuits and a folded ShiftRow datapath enable peak 2.2 Tbps/Watt AES-128 energy efficiency with a dense 2-round layout occupying 0.052 mm², while achieving: (i) 53/44/38 Gbps AES-128/192/256 performance, 125 mW, measured at 1.1 V, 50 °C, (ii) scalable AES-128 performance up to 66 Gbps, measured at 1.35 V, 50 °C, (iii) wide operating supply voltage range with robust subthreshold voltage performance of 800 Mbps, 409 μW, measured at 320 mV, 50 °C, (iv) 37% Sbox delay reduction and 25% area reduction with a compact Sbox layout occupying 759 μm², (v) 67% reduction in worst-case interconnect length and 33% reduction in ShiftRow wiring tracks and (vi) 43% reduction in Mix/InvMixColumn area with no performance penalty.


IEEE Transactions on Very Large Scale Integration Systems | 2005

Comparison of high-performance VLSI adders in the energy-delay space

Vojin G. Oklobdzija; Bart R. Zeydel; Hoang Q. Dao; Sanu K. Mathew; Ram K. Krishnamurthy

In this paper, we motivate the concept of comparing very large scale integration adders based on their energy-delay characteristics and present results of our estimation technique. This stems from a need to make appropriate selection at the beginning of the design process. The estimation is quick, not requiring extensive simulation or use of computer-aided design tools, yet sufficiently accurate to provide guidance through various choices in the design process. We demonstrate the accuracy of the method by applying it to examples of high-performance 32- and 64-b adders in 100- and 130-nm CMOS technologies.


symposium on vlsi circuits | 2002

A 4 GHz 130 nm address generation unit with 32-bit sparse-tree adder core

Sanu K. Mathew; Mark A. Anders; Ram K. Krishnamurthy; Shekhar Borkar

This paper describes a 32-bit Address Generation Unit (AGU) designed for 4 GHz operation in 1.2 V, 130 nm technology. The AGU utilizes a 152 ps dual-Vt sparse-tree adder core to achieve 20% delay reduction, 80% lower interconnect density and a low (1%) active energy leakage component. The semidynamic implementation enables an average energy profile similar to static CMOS, with good sub-130 nm scaling trend.


international solid-state circuits conference | 2001

Sub-500 ps 64 b ALUs in 0.18 μm SOI/bulk CMOS: Design and scaling trends

Sanu K. Mathew; Ram K. Krishnamurthy; Mark A. Anders; Rafael Rios; K. Mistry; Krishnamurthy Soumyanath

The requirements of high-throughput Internet servers necessitate the use of multiple ALUs in high-performance 64 b execution cores. Consequently, each ALU demands a compact, energy-efficient 64 b adder core with single-cycle latency. The resultant critical path, which is a balanced mix of interconnect, diffusion and gate loads, forms a representative test bed for evaluating competing circuit techniques and process technologies (bulk CMOS/SOI). This paper presents: (i) the design of an energy-efficient 64 b ALU in 0.18 μm bulk CMOS technology; (ii) a direct port of this design to a comparable SOI technology and (iii) an SOI-optimal redesign of the adder core. Further, it describes design margining required for the SOI implementations and reports the results of shrinking the two architectures to 0.13 μm Bulk/SOI. In both cases, a sophisticated SOI compact model that incorporates features to effectively model the SOI floating body effect is used.


international solid-state circuits conference | 2009

A 300mV 494GOPS/W reconfigurable dual-supply 4-way SIMD vector processing accelerator in 45nm CMOS

Himanshu Kaul; Mark A. Anders; Sanu K. Mathew; Steven K. Hsu; Amit Agarwal; Ram K. Krishnamurthy; Shekhar Borkar

This paper describes a reconfigurable 4-way SIMD engine fabricated in 45 nm high-k/metal-gate CMOS, targeted for on-die acceleration of vector processing in power-constrained mobile microprocessors. The SIMD accelerator is reconfigured to perform 4-way 16b × 16b multiplies, 32b × 32b multiply, 4-way 16b additions, 2-way 32b additions or 72b addition with single-cycle throughput and wide supply voltage range of operation (1.3 V-230 mV). A reconfigurable 2 × 2 tile of signed 2s complement 16b multipliers, with conditional carry gating in the 72b sparse tree adder, dual-supplies for voltage hopping, and fine-grained power-gating enables peak energy efficiency of 494GOPS/W (measured at 300 mV, 50 °C) with a dense layout occupying 0.081 mm² while achieving: (i) scalable performance up to 2.8 GHz, 278 mW measured at 1.3 V; (ii) fast single-cycle switching between any operating/idle mode; (iii) configuration-dependent power reduction of up to 41% in total power and 6.5× in active leakage power; (iv) 10× standby leakage reduction during idle mode; (v) deep subthreshold operation measured at 230 mV, 8.8 MHz, 87 μW; and (vi) compensation for up to 3× performance variation in ultra-low voltage mode.
