Paolo A. Aseron
Intel
Publication
Featured research published by Paolo A. Aseron.
IEEE Journal of Solid-State Circuits | 2011
Keith A. Bowman; James W. Tschanz; Shih-Lien Lu; Paolo A. Aseron; Muhammad M. Khellah; Arijit Raychowdhury; Bibiche M. Geuskens; Chris Wilkerson; Tanay Karnik; Vivek De
A 45 nm microprocessor core integrates resilient error-detection and recovery circuits to mitigate the clock frequency (FCLK) guardbands for dynamic parameter variations to improve throughput and energy efficiency. The core supports two distinct error-detection designs, allowing a direct comparison of the relative trade-offs. The first design embeds error-detection sequential (EDS) circuits in critical paths to detect late timing transitions. In addition to reducing the FCLK guardbands for dynamic variations, the embedded EDS design can exploit path-activation rates to operate the microprocessor faster than infrequently activated critical paths would otherwise allow. The second error-detection design offers a less-intrusive approach for dynamic timing-error detection by placing a tunable replica circuit (TRC) per pipeline stage to monitor worst-case delays. Although the TRCs require a delay guardband to ensure the TRC delay is always slower than critical-path delays, the TRC design captures most of the benefits from the embedded EDS design with less implementation overhead. Furthermore, while core min-delay constraints limit the potential benefits of the embedded EDS design, a salient advantage of the TRC design is the ability to detect a wider range of dynamic delay variation, as demonstrated through low supply voltage (VCC) measurements. Both error-detection designs interface with error-recovery techniques, enabling the detection and correction of timing errors from fast-changing variations such as high-frequency VCC droops. The microprocessor core also supports two separate error-recovery techniques to guarantee correct execution even if dynamic variations persist. The first technique requires clock control to replay errant instructions at 1/2 FCLK. In comparison, the second technique is a new multiple-issue instruction replay design that corrects errant instructions with a lower performance penalty and without requiring clock control.
Silicon measurements demonstrate that resilient circuits enable a 41% throughput gain at equal energy or a 22% energy reduction at equal throughput, as compared to a conventional design when executing a benchmark program with a 10% VCC droop. In addition, the microprocessor includes a new adaptive clock control circuit that interfaces with the resilient circuits and a phase-locked loop (PLL) to track recovery cycles and adapt to persistent errors by dynamically changing FCLK for maximum efficiency.
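The adaptive clock control idea in this abstract (count recovery cycles, then move the clock frequency) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `next_pll_multiplier`, the observation window, the two rate thresholds, the step size of one, and the modelling of the clock frequency as a reference clock times an integer PLL multiplier.

```python
def next_pll_multiplier(multiplier, recovery_cycles, window_cycles,
                        high_rate=0.01, low_rate=0.001,
                        min_multiplier=8, max_multiplier=32):
    """Pick the PLL multiplier for the next observation window.

    Hypothetical policy (not the paper's circuit): the clock frequency is
    modelled as f_ref * multiplier. If the fraction of cycles spent on error
    recovery is above `high_rate`, errors are treated as persistent and the
    clock is slowed by one step. If it is below `low_rate`, the clock is
    sped up by one step. Otherwise the multiplier is left unchanged.
    """
    if window_cycles <= 0:
        raise ValueError("window_cycles must be positive")
    rate = recovery_cycles / window_cycles
    if rate > high_rate:
        return max(min_multiplier, multiplier - 1)
    if rate < low_rate:
        return min(max_multiplier, multiplier + 1)
    return multiplier
```

With these illustrative thresholds, a window with 50 recovery cycles out of 1000 steps the multiplier down, a window with none steps it up, and a window with 5 leaves it alone. The thresholds trade recovery overhead against the frequency gained, which is the balance the abstract describes as operating at maximum efficiency.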
International Solid-State Circuits Conference | 2012
Shailendra Jain; Surhud Khare; Satish Yada; V Ambili; Praveen Salihundam; Shiva Ramani; Sriram Muthukumar; Manali R Srinivasan; Arun Kumar; Shasi Kumar Gb; Rajaraman Ramanarayanan; Vasantha Erraguntla; Jason Howard; Sriram R. Vangal; Saurabh Dighe; Greg Ruhl; Paolo A. Aseron; Howard Wilson; Nitin Borkar; Vivek De; Shekhar Borkar
Near-threshold computing brings the promise of an order of magnitude improvement in energy efficiency over the current generation of microprocessors [1]. However, frequency degradation due to aggressive voltage scaling may not be acceptable across all single-threaded or performance-constrained applications. Enabling the processor to operate over a wide voltage range helps to achieve best possible energy efficiency while satisfying varying performance demands of the applications. This paper describes an IA-32 processor fabricated in 32nm CMOS technology [2], demonstrating a reliable ultra-low voltage operation and energy efficient performance across the wide voltage range from 280mV to 1.2V.
IEEE Journal of Solid-State Circuits | 2011
Saurabh Dighe; Sriram R. Vangal; Paolo A. Aseron; Shasi Kumar; Tiju Jacob; Keith A. Bowman; Jason Howard; James W. Tschanz; Vasantha Erraguntla; Nitin Borkar; Vivek De; Shekhar Borkar
In this paper, we present measured within-die core-to-core Fmax and leakage variation data for an 80-core processor in 65 nm CMOS and 1) populate a parameterized energy/performance model to determine the most energy-efficient operating point for a workload; 2) examine impacts of per-core clock and power gating on optimal dynamic voltage-frequency-core scaling (DVFCS) operating points; and 3) compare improvements in energy efficiency achievable by variation-aware DVFCS and core mapping on Single-Voltage/Multiple-Frequency (SVMF), Multiple-Voltage/Single-Frequency (MVSF) and Multiple-Voltage/Multiple-Frequency (MVMF) designs. Variation-aware DVFS with optimal core mapping is shown to improve energy efficiency 6%-35% across a range of compute/communication activity workloads. A new dynamic thread hopping scheme boosts performance by 5%-10% or energy efficiency by 20%-60%.
International Solid-State Circuits Conference | 2010
James W. Tschanz; Keith A. Bowman; Shih-Lien Lu; Paolo A. Aseron; Muhammad M. Khellah; Arijit Raychowdhury; Bibiche M. Geuskens; Chris Wilkerson; Tanay Karnik; Vivek De
Microprocessors experience a wide range of dynamic variations, including voltage droops, temperature changes, and device aging, which vary across applications and systems. The necessity of ensuring correct operation even under infrequent worst-case conditions results in clock frequency (FCLK) or supply voltage (VCC) guardbands that degrade performance and increase energy consumption. In this paper, a research microprocessor core is described with resilient and adaptive circuits to mitigate dynamic variation guardbands for maximizing throughput or minimizing energy. The resiliency features consist of embedded error-detection sequentials (EDS) [1-4] and tunable replica circuits (TRC) [5] in conjunction with error-recovery circuits to detect and correct timing errors. A new instruction-replay error-recovery technique is introduced to correct errant instructions with low performance cost and implementation overhead. In addition, the microprocessor contains an adaptive clock controller based on error statistics to operate at maximum efficiency across a range of dynamic variations.
International Solid-State Circuits Conference | 2010
Saurabh Dighe; Sriram R. Vangal; Paolo A. Aseron; Shasi Kumar; Tiju Jacob; Keith A. Bowman; Jason Howard; James W. Tschanz; Vasantha Erraguntla; Nitin Borkar; Vivek De; Shekhar Borkar
Many-core processors with on-die network-on-chip (NoC) interconnects have emerged as viable architectures for Single-Instruction/Multiple-Data (SIMD) vector applications and parallel workloads, and have been implemented in 65nm CMOS with Dynamic Voltage-Frequency Scaling (DVFS). Chips with Single-Voltage/Single-Frequency (SVSF) for all cores running homogeneous threads as well as Multiple-Voltage/Multiple-Frequency (MVMF), running heterogeneous applications and using independent V/F control for each core, have been reported. Combination of DVFS with dynamic core-count scaling (or DVFCS) has been proposed to further improve performance & energy efficiency across varying workloads. With technology scaling, both leakage power and core-to-core variations in frequency (Fmax) & leakage due to within-die device parameter variations have become significant, thus creating the need for per-core power gating and variation-aware DVFCS. Recently, variation-aware core mapping has been investigated using high level architectural simulations and statistical variation models.
International Solid-State Circuits Conference | 2007
Jianping Xu; Peter Hazucha; Mingwei Huang; Paolo A. Aseron; Fabrice Paillet; Gerhard Schrom; James W. Tschanz; Cangsang Zhao; Vivek De; Tanay Karnik; Greg Taylor
The impedance of a microprocessor power-delivery network peaks at ~140MHz, resulting in power-grid resonance, which lowers operating frequency and compromises reliability. A suppression circuit uses an active-damping technique with a maximum of 12.7dB peak-to-peak noise reduction from 70 to 250MHz in a 90nm CMOS process.
IEEE Transactions on Circuits and Systems | 2011
Keith A. Bowman; James W. Tschanz; Arijit Raychowdhury; Muhammad M. Khellah; Bibiche M. Geuskens; Shih-Lien Lu; Paolo A. Aseron; Tanay Karnik; Vivek De
A 45 nm microprocessor integrates an all-digital dynamic variation monitor (DVM) to continuously measure the impact of dynamic parameter variations on circuit-level performance to enhance silicon debug and adaptive clock control. The DVM consists of a tunable replica circuit, a time-to-digital converter, and multiplexers to measure circuit delay or frequency changes with less than a 1% measured resolution error while capturing clock-to-data correlations. In validating the DVM with microprocessor maximum clock frequency (FMAX) measurements, an on-die noise injector circuit induces a supply voltage (VCC) droop at a particular cycle in the test program. The FMAX measurement is then repeated for over a thousand iterations while shifting the droop placement to a different cycle per iteration. Silicon measurements demonstrate the DVM capability of tracking the worst case FMAX reduction to within 1% for a wide range of VCC droop profiles. Furthermore, silicon measurements reveal that FMAX is highly sensitive to the placement and magnitude of a high-frequency VCC droop during program execution, thus highlighting the value of the DVM for silicon debug. In addition, the DVM interfaces with an adaptive clock control circuit to dynamically adjust the clock frequency by changing the divide ratio in the phase-locked loop in response to persistent variations, enabling the microprocessor to adapt to the operating environment for maximum efficiency.
International Solid-State Circuits Conference | 2008
Dinesh Somasekhar; Yibin Ye; Paolo A. Aseron; Shih-Lien Lu; Muhammad M. Khellah; Jason Howard; Gregory Ruhl; Tanay Karnik; Shekhar Borkar; Vivek De; Ali Keshavarzi
As silicon technology scales, the possibility of fabricating dense memories is of great interest, particularly if the solution has low to no additional process cost. We demonstrate a 2Mb 2T gain-cell macro with 128GB/s bandwidth, fast 2ns cycle time, operating at 2GHz in a native 65nm logic process. Fast read access and cycle times are critical in lookup applications such as tag RAMs in microprocessors where read queries are abundant. In such a scenario replacing SRAM with a denser fast memory is desired. The 2T fully pipelined gain-cell macro features non-destructive read, partial-write support and sustains 8-cycle successive access to the same memory bank. The macro is fabricated in a high-performance 65nm process featuring 1.2nm nitrided gate oxide, 35nm gate length, enhanced channel strain, NiSi silicide, 8 layers of Cu metal interconnect, and low-K ILD enabling data-buses from the memory banks to run as high as 2GHz. This technology has a 0.57 µm² 6T SRAM cell. All internal circuits of the macro operate with a logic-compatible nominal 1V supply with the exception of wordline drivers.
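As a quick check on how the headline figures in this abstract relate, the short calculation below derives the data moved per clock from the quoted bandwidth and clock rate. It assumes 1 GB = 10^9 bytes, which the abstract does not state, and the variable names are illustrative only.

```python
# Figures quoted in the abstract.
bandwidth_bytes_per_s = 128e9   # 128 GB/s, assuming 1 GB = 10**9 bytes
clock_hz = 2e9                  # 2 GHz
cycle_time_ns = 2.0             # quoted 2 ns cycle time

# Data moved per clock edge at the quoted bandwidth.
bytes_per_clock = bandwidth_bytes_per_s / clock_hz
bits_per_clock = bytes_per_clock * 8

# Number of 2 GHz clock periods that fit in one 2 ns cycle time.
clock_period_ns = 1e9 / clock_hz
clocks_per_cycle_time = cycle_time_ns / clock_period_ns

print(bytes_per_clock, bits_per_clock, clocks_per_cycle_time)
```

Under that assumption the macro moves 64 bytes (512 bits) per 2 GHz clock, and the 2 ns cycle time spans four clock periods, which is consistent with the macro being described as fully pipelined.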
Custom Integrated Circuits Conference | 2010
Keith A. Bowman; James W. Tschanz; Arijit Raychowdhury; Muhammad M. Khellah; Bibiche M. Geuskens; Shih-Lien Lu; Paolo A. Aseron; Tanay Karnik; Vivek De
A 45nm microprocessor integrates an all-digital dynamic variation monitor (DVM), consisting of a tunable replica circuit with a time-to-digital converter, to measure the impact of dynamic variations on path-level delay or frequency. Measurements reveal the high sensitivity of the microprocessor maximum clock frequency (FMAX) to the placement and magnitude of a high-frequency supply voltage (VCC) droop and demonstrate the DVM capability of tracking FMAX changes to within 1% for a wide range of VCC droop profiles. Furthermore, the DVM interfaces with an adaptive clock control circuit to dynamically change the clock frequency in response to dynamic variations, enabling the microprocessor to operate at maximum efficiency.
International Solid-State Circuits Conference | 2015
Jaydeep P. Kulkarni; Paolo A. Aseron; Trang Nguyen; Charles Augustine; James W. Tschanz; Vivek De
8-Transistor (8T) cell 1-read/1-write (1R1W) register files (RF) with domino read and static differential write are critical performance-limiting building blocks in high-performance microprocessor datapaths. The RF operating voltage (V) and frequency (F) are limited by the delay of the precharge-evaluate read critical path. Traditionally, the operating V/F is set to ensure no read timing error across all data access patterns in the RF array in the presence of within-die (WID) parameter (P) variations, and worst-case voltage droops, temperature (T) changes and transistor-aging-induced delay degradations. However, many of these worst-case conditions and events are rare during normal operation. Therefore, these V/F guardbands can severely limit the best-achievable performance and energy efficiency in scaled CMOS process.