Tung Thanh Hoang | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Tung Thanh Hoang is active.

Explore More

Publication

Featured researches published by Tung Thanh Hoang.

system on chip conference | 2010

A High-Speed, Energy-Efficient Two-Cycle Multiply-Accumulate (MAC) Architecture and Its Application to a Double-Throughput MAC Unit

Tung Thanh Hoang; Magnus Själander; Per Larsson-Edefors

We propose a high-speed and energy-efficient two-cycle multiply-accumulate (MAC) architecture that supports twos complement numbers, and includes accumulation guard bits and saturation circuitry. The first MAC pipeline stage contains only partial-product generation circuitry and a reduction tree, while the second stage, thanks to a special sign-extension solution, implements all other functionality. Place-and-route evaluations using a 65-nm 1.1-V cell library show that the proposed architecture offers a 31% improvement in speed and a 32% reduction in energy per operation, averaged across operand sizes of 16, 32, 48, and 64 bits, over a reference two-cycle MAC architecture that employs a multiplier in the first stage and an accumulator in the second. When operating the proposed architecture at the lower frequency of the reference architecture the available timing slack can be used to downsize gates, resulting in a 52% reduction in energy compared to the reference. We extend the new architecture to create a versatile double-throughput MAC (DTMAC) unit that efficiently performs either multiply-accumulate or multiply operations for N-bit, 1 × N/2-bit, or 2 × N/2-bit operands. In comparison to a fixed-function 32-bit MAC unit, 16-bit multiply-accumulate operations can be executed with 67% higher energy efficiency on a 32-bit DTMAC unit.

application specific systems architectures and processors | 2010

Design space exploration for an embedded processor with flexible datapath interconnect

Tung Thanh Hoang; Ulf Jälmbrant; Erik der Hagopian; Kasyab Parmesh Subramaniyan; Magnus Själander; Per Larsson-Edefors

The design of an embedded processor is dependent on the application domain. Traditionally, design solutions specific to an application domain have been available in three forms: VLIW-based DSP processors, ASICs and FPGAs; each respectively offering generality of application domain, energy efficiency and flexibility. However, while matching the application domain to the resources needed, the design space becomes huge. We present FlexTools, a tool framework built around the FlexCore architecture to evaluate performance and energy efficiency for different applications. Here we demonstrate FlexTools for design space exploration with a focus on the data-routing flexibility of the FlexCore processor, in search of energy-efficient interconnect configurations that are both cycle-count and hardware efficient. Evaluation results suggest that a well-optimized instance of a 65-nm multiplier-extended FlexCore processor datapath, obtained using FlexTools, executes nine integer EEMBC benchmarks with a 15% cycle count reduction and dissipates 17% less energy than a reference MIPS datapath.

international parallel and distributed processing symposium | 2009

Double Throughput Multiply-Accumulate unit for FlexCore processor enhancements

Tung Thanh Hoang; Magnus Själander; Per Larsson-Edefors

As a simple five-stage General-Purpose Processor (GPP), the baseline FlexCore processor has a limited set of datapath units. By utilizing a flexible datapath interconnect and a wide control word, a FlexCore processor is explicitly designed to support integration of special units that, on demand, can accelerate certain data-intensive applications. In this paper, we propose the integration of a novel Double Throughput Multiply-Accumulate (DTMAC) unit, whose different operating modes allow for on-thefly optimization of computational precision. For the two EEMBC benchmarks considered, the FlexCore processor performance is significantly enhanced when one DTMAC accelerator is included, translating into reduced execution time and energy dissipation. In comparison to the 32-bit GPP reference, the accelerated 32-bit FlexCore processor shows a 4.37× improvement in execution time and a 3.92× reduction in energy dissipation, for a benchmark with many consecutive 16-bit MAC operations.

conference on ph.d. research in microelectronics and electronics | 2011

Power gating multiplier of embedded processor datapath

Tung Thanh Hoang; Vineeth Saseendran; Donatas Siaudinis; Per Larsson-Edefors

Leakage power is an important concern in modern electronic designs. To efficiently employ power gating for leakage reduction in embedded processors, the architecture must provide a clear-cut software support for power gating and the power-gated unit must have significant idle times during the execution of the applications. We introduce power gating of individual datapath units for the embedded architecture of FlexCore, to evaluate if leakage reductions in temporarily idle units can reduce the overall power dissipation of compute-intensive applications. Post-layout multi-corner simulations for a 65-nm FlexCore datapath implementation demonstrate that power gating of the multiplier unit yields overall datapath energy savings, up to 14%, for two EEMBC benchmarks.

symposium on cloud computing | 2009

High-speed, energy-efficient 2-cycle Multiply-Accumulate architecture

Tung Thanh Hoang; Magnus Själander; Per Larsson-Edefors

We propose a high-speed and energy-efficient 2-cycle multiply-accumulate (MAC) architecture. Our architecture is based on twos complement representation, it uses guarding bits to efficiently support longer MAC loops, and it includes output saturation. By performing carry propagation only in the second stage of the MAC pipeline, multiplication and accumulation have similar delays. But in contrast to previous MAC architectures that propose to only use one carry-propagation stage, our architecture requires no extra cycles to produce the final result. Instead it correctly produces the sum of the accumulated value and the product in each cycle. Our place-and-route evaluation shows that the proposed architecture, averaged across several operand sizes, offers a 33% improvement in speed and a 37% reduction of energy over a conventional 2-cycle MAC architecture.

application specific systems architectures and processors | 2012

Viterbi Accelerator for Embedded Processor Datapaths

Muhammad Waqar Azhar; Magnus Själander; Hasan Ali; Akshay Vijayashekar; Tung Thanh Hoang; Kashan Khurshid Ansari; Per Larsson-Edefors

We present a novel architecture for a lightweight Viterbi accelerator that can be tightly integrated inside an embedded processor datapath. We investigate the accelerators impact on processor performance by using the EEMBC Viterbi benchmark and the in-house Viterbi Branch Metric kernel. Our evaluation based on the EEMBC benchmark shows that an accelerated 65-nm 2.7-ns processor datapath is 20% larger but 90% more cycle efficient than a datapath lacking the Viterbi accelerator, leading to an 87% overall energy reduction and a data throughput of 3.52 Mbit/s.

ieee computer society annual symposium on vlsi | 2012

Data-Width-Driven Power Gating of Integer Arithmetic Circuits

Tung Thanh Hoang; Per Larsson-Edefors

When performing narrow-width computations, power gating of unused arithmetic circuit portions can significantly reduce leakage power. We deploy coarse-grain power gating in 32-bit integer arithmetic circuits that frequently will operate on narrow-width data. Our contributions include a design framework that automatically implements coarse-grain power-gated arithmetic circuits considering a narrow-width input data mode, and an analysis of the impact of circuit architecture on the efficiency of this data-width-driven power gating scheme. As an example, with a performance penalty of 6.7%, coarse-grain power gating of a 45-nm 32-bit multiplier is demonstrated to yield an 11.6× static leakage energy reduction per 8×8-bit operation.

digital systems design | 2010

Cyclic Redundancy Checking (CRC) Accelerator for the FlexCore Processor

Muhammad Waqar Azhar; Tung Thanh Hoang; Per Larsson-Edefors

A proven approach to increase performance of general-purpose processors is to add hardware accelerators. In its basic configuration, the FlexCore processor has a limited set of datapath units. But thanks to a flexible datapath interconnect and a wide control word, the FlexCore datapath is explicitly designed to support integration of special units that, on demand, can accelerate certain data-intensive applications. We present the integration of a versatile accelerator for several Cyclic Redundancy Checking (CRC) keys. Furthermore, we investigate the accelerator’s impact on processor execution time and energy efficiency, using the Power Stone CRC benchmark. Our evaluation shows that the accelerated 65-nm 2.7-ns FlexCore datapath is, for example, 86% more energy and cycle efficient than a datapath lacking the CRC accelerator.

conference on ph.d. research in microelectronics and electronics | 2011

FlexDEF: Development framework for processor architecture implementation and evaluation

Kasyab Parmesh Subramaniyan; Erik J Ryman; Magnus Själander; Tung Thanh Hoang; Mafijul Md. Islam; Per Larsson-Edefors

Designing a processor is a complex task that uses multiple and varied tools. The complete development cycle spans software as well as hardware design and verification. More often than not, in spite of the close dependencies between hardware and software, there is no common platform for quick and accurate testing of these dependencies. Though such systems are often employed in industry, it is not common for end-to-end frameworks to be deployed in educational and research settings.We present the FlexCore Design Exploration Framework (FlexDEF), an end-to-end tool-chain used to develop the FlexCore processor and its accompanying cache system. The tool-chain is a hierarchically linked system that spans the various development phases involved in design and verification. The processor system is intended to be a model, for use in research-oriented projects where both the software and hardware are in a constant state of flux. We discuss the complete framework and the advantages in each context. Finally, we summarize the developments and discuss the future of the FlexDEF tool-chain.

Archive | 2013