Jason Luu | Researchain

Publication

Featured research published by Jason Luu.

field programmable gate arrays | 2009

VPR 5.0: FPGA CAD and architecture exploration tools with single-driver routing, heterogeneity and process scaling

Jason Luu; Ian Kuon; Peter Jamieson; Ted Campbell; Andy Ye; Wei Mark Fang; Jonathan Rose

The VPR toolset [6, 7] has been widely used to perform FPGA architecture and CAD research, but has not evolved over the past decade to include many architectural features now present in modern FPGAs. This paper describes a new version of the toolset that includes four significant features: first, it now supports a broad range of single-driver routing architectures [29, 4, 16]. Single-driver routing has significantly different architectural and electrical properties from the multi-driver approach previously modelled, and is now employed in the majority of FPGAs sold. Second, the new release can now model a heterogeneous selection of hard logic blocks, which could include the hard memory and multipliers that are now ubiquitous in FPGAs. Third, we provide optimized electrical models of a wide range of architectures in different process technologies, including a range of area-delay tradeoffs for each single architecture. Prior releases of VPR did not publish even one architecture file with accurate resistance and capacitance parameters. Finally, to maintain robustness and to support future development the release includes a set of regression tests to check functionality and quality of result of the output of the tools. To illustrate the use of the new features, we present a new look at the FPGA area vs. logic block LUT size question that shows that small LUT sizes, with the use of carefully optimized electrical design and single-driver architectures, have better area (relative to 4-LUTs) than previously thought. Another experiment shows that several of the previous architectural results are invariant in moving from multi-driver to single-driver routing architecture and across a range of process technologies.

ACM Transactions on Reconfigurable Technology and Systems | 2014

VTR 7.0: Next Generation Architecture and CAD System for FPGAs

Jason Luu; Jeffrey B. Goeders; Michael Wainberg; Andrew Somerville; Thien Yu; Konstantin Nasartschuk; Miad Nasr; Sen Wang; Tim X. Liu; Nooruddin Ahmed; Kenneth B. Kent; Jason Helge Anderson; Jonathan Rose; Vaughn Betz

Exploring architectures for large, modern FPGAs requires sophisticated software that can model and target hypothetical devices. Furthermore, research into new CAD algorithms often requires a complete and open source baseline CAD flow. This article describes recent advances in the open source Verilog-to-Routing (VTR) CAD flow that enable further research in these areas. VTR now supports designs with multiple clocks in both timing analysis and optimization. Hard adder/carry logic can be included in an architecture in various ways and significantly improves the performance of arithmetic circuits. The flow now models energy consumption, an increasingly important concern. The speed and quality of the packing algorithms have been significantly improved. VTR can now generate a netlist of the final post-routed circuit which enables detailed simulation of a design for a variety of purposes. We also release new FPGA architecture files and models that are much closer to modern commercial architectures, enabling more realistic experiments. Finally, we show that while this version of VTR supports new and complex features, it has a 1.5× compile time speed-up for simple architectures and a 6× speed-up for complex architectures compared to the previous release, with no degradation to timing or wire-length quality.

field programmable gate arrays | 2011

Architecture description and packing for logic blocks with hierarchy, modes and complex interconnect

Jason Luu; Jason Helge Anderson; Jonathan Rose

The development of future FPGA fabrics with more sophisticated and complex logic blocks requires a new CAD flow that permits the expression of that complexity and the ability to synthesize to it. In this paper, we present a new logic block description language that can depict complex intra-block interconnect, hierarchy and modes of operation. These features are necessary to support modern and future FPGA complex soft logic blocks, memory and hard blocks. The key part of the CAD flow associated with this complexity is the packer, which takes the logical atomic pieces of the complex blocks and groups them into whole physical entities. We present an area-driven generic packing tool that can pack the logical atoms into any heterogeneous FPGA described in the new language, including many different kinds of soft and hard logic blocks. We gauge its area quality by comparing the results achieved with a lower bound on the number of blocks required, and then illustrate its explorative capability in two ways: on fracturable LUT soft logic architectures, and on hard block memory architectures. The new infrastructure attaches to a flow that begins with a Verilog front-end, permitting the use of benchmarks that are significantly larger than the usual ones, and can target heterogeneous FPGAs.
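The abstract does not say how the lower bound on block count is derived. As a rough illustration only, the sketch below assumes the simplest capacity-based bound: each block type needs at least ceil(atoms / capacity) physical blocks. The function name, the atom counts, the capacities and the "packed" results are all hypothetical and are not taken from the paper.

```python
import math

def block_lower_bound(atom_counts, block_capacity):
    """Capacity-based lower bound (an assumption, not the paper's method):
    each block type needs at least ceil(atoms / capacity) physical blocks."""
    return {
        kind: math.ceil(count / block_capacity[kind])
        for kind, count in atom_counts.items()
    }

# Hypothetical netlist and architecture: atoms per type, atoms that fit in one block.
atoms = {"lut": 1000, "memory": 7, "multiplier": 12}
capacity = {"lut": 10, "memory": 1, "multiplier": 2}

bound = block_lower_bound(atoms, capacity)

# Hypothetical packer output, compared against the bound as a ratio (1.0 is ideal).
packed = {"lut": 112, "memory": 7, "multiplier": 6}
overhead = {kind: packed[kind] / bound[kind] for kind in bound}
```

A ratio close to 1.0 for every block type would suggest the packer wastes little area relative to this simple bound; a tighter bound would also have to account for pin and interconnect limits, which this sketch ignores.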

field-programmable logic and applications | 2013

Titan: Enabling large and complex benchmarks in academic CAD

Kevin E. Murray; Scott Whitty; Suya Liu; Jason Luu; Vaughn Betz

Benchmarks play a key role in FPGA architecture and CAD research, enabling the quantitative comparison of tools and architectures. It is important that these benchmarks reflect modern designs, which are large-scale systems that make use of heterogeneous resources; however, most current FPGA benchmarks are both small and simple. In this paper we present Titan, a hybrid CAD flow that addresses these issues. The flow uses Altera's Quartus II FPGA CAD software to perform HDL synthesis and a conversion tool to translate the result into the academic BLIF format. Using this flow we created the Titan23 benchmark set, which consists of 23 large (90K-1.8M block) benchmark circuits covering a wide range of application domains. Using the Titan23 benchmarks and a detailed model of Altera's Stratix IV architecture we compared the performance and quality of VPR and Quartus II targeting the same architecture. We found that VPR is at least 2.7× slower, uses 5.1× more memory and 2.6× more wire compared to Quartus II. Finally, we identified that VPR's focus on achieving a dense packing is responsible for a large portion of the wire length gap.

Journal of Biomedical Optics | 2009

Hardware acceleration of a Monte Carlo simulation for photodynamic therapy treatment planning

William Lo; Keith Redmond; Jason Luu; Paul Chow; Jonathan Rose; Lothar Lilge

Monte Carlo (MC) simulations are being used extensively in the field of medical biophysics, particularly for modeling light propagation in tissues. The high computation time for MC limits its use to solving only the forward solutions for a given source geometry, emission profile, and optical interaction coefficients of the tissue. However, applications such as photodynamic therapy treatment planning or image reconstruction in diffuse optical tomography require solving the inverse problem given a desired dose distribution or absorber distribution, respectively. A faster means for performing MC simulations would enable the use of MC-based models for accomplishing such tasks. To explore this possibility, a digital hardware implementation of an MC simulation based on the Monte Carlo for Multi-Layered media (MCML) software was implemented on a development platform with multiple field-programmable gate arrays (FPGAs). The hardware performed the MC simulation on average 80 times faster and was 45 times more energy efficient than the MCML software executed on a 3-GHz Intel Xeon processor. The resulting isofluence lines closely matched those produced by MCML in software, diverging by less than 0.1 mm for fluence levels as low as 0.00001 cm⁻² in a skin model.
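To give a feel for the kind of computation being accelerated, here is a minimal software sketch of weighted photon-packet Monte Carlo. It is not the MCML code and not the authors' hardware design. It makes several simplifying assumptions: a single infinite homogeneous medium instead of layers, isotropic scattering instead of an anisotropic phase function, and only the depth coordinate is tracked. All names and parameter values are illustrative.

```python
import math
import random

def simulate_absorption(n_packets, mu_a, mu_s, n_bins=50, bin_width=0.01, seed=0):
    """Return absorbed weight per depth bin for packets launched at z = 0 along +z.

    mu_a and mu_s are absorption and scattering coefficients (per unit length).
    The last bin collects everything deeper than n_bins * bin_width.
    """
    rng = random.Random(seed)
    mu_t = mu_a + mu_s                      # total interaction coefficient
    absorbed = [0.0] * n_bins
    for _ in range(n_packets):
        z, weight, cos_theta = 0.0, 1.0, 1.0
        while weight > 0.0:
            # Sample the free path length from an exponential distribution.
            step = -math.log(1.0 - rng.random()) / mu_t
            z += step * cos_theta
            # Deposit the absorbed fraction of the packet weight at this site.
            deposit = weight * mu_a / mu_t
            weight -= deposit
            index = min(int(abs(z) / bin_width), n_bins - 1)
            absorbed[index] += deposit
            # Isotropic scattering: the z-direction cosine is uniform in [-1, 1].
            cos_theta = 2.0 * rng.random() - 1.0
            # Russian roulette ends low-weight packets without biasing the mean.
            if weight < 1e-4:
                weight = weight / 0.1 if rng.random() < 0.1 else 0.0
    return absorbed

absorbed = simulate_absorption(n_packets=1000, mu_a=1.0, mu_s=10.0)
```

Each packet is independent of the others, which is what makes this style of simulation a natural fit for parallel hardware.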

ACM Transactions on Reconfigurable Technology and Systems | 2011

VPR 5.0: FPGA CAD and architecture exploration tools with single-driver routing, heterogeneity and process scaling

Jason Luu; Ian Kuon; Peter Jamieson; Ted Campbell; Andy Ye; Wei Mark Fang; Kenneth B. Kent; Jonathan Rose

The VPR toolset has been widely used in FPGA architecture and CAD research, but has not evolved over the past decade. This article describes and illustrates the use of a new version of the toolset that includes four new features: first, it supports a broad range of single-driver routing architectures, which have superior architectural and electrical properties over the prior multidriver approach (and which are now employed in the majority of FPGAs sold). Second, it can now model, for placement and routing, a heterogeneous selection of hard logic blocks. This is a key (but not final) step toward the inclusion of blocks such as memory and multipliers. Third, we provide optimized electrical models for a wide range of architectures in different process technologies, including a range of area-delay trade-offs for each single architecture. Finally, to maintain robustness and support future development the release includes a set of regression tests for the software. To illustrate the use of the new features, we explore several architectural issues: the FPGA area efficiency versus logic block granularity, the effect of single-driver routing, and a simple use of the heterogeneity to explore the impact of hard multipliers on wiring track count.

field-programmable custom computing machines | 2009

FPGA-based Monte Carlo Computation of Light Absorption for Photodynamic Cancer Therapy

Jason Luu; Keith Redmond; William Lo; Paul Chow; Lothar Lilge; Jonathan Rose

Photodynamic therapy (PDT) is a method of treating cancer that combines light and light-sensitive drugs to selectively destroy cancerous tumours without harming the healthy tissue. The success of PDT depends on the accurate computation of light dose distribution. Monte Carlo (MC) simulations can provide an accurate solution for light dose distribution, but have high computation time that prevents them from being used in treatment planning. To alleviate this problem, a hardware design of an MC simulation based on the gold standard software in biophotonics was implemented on a large modern FPGA. This implementation achieved a 28-fold speedup and 716-fold lower power-delay product compared to the gold standard software executed on a 3 GHz Intel Xeon 5160 processor. The accuracy of the hardware was compared to the gold standard using a realistic skin model. An experiment using 100 million photon packets yielded a light dose distribution that diverged by less than 0.1 mm. We also describe our development methodology, which employs an intermediate hardware description in SystemC prior to Verilog coding that led to significant design effort efficiency.

ACM Transactions on Reconfigurable Technology and Systems | 2015

Timing-Driven Titan: Enabling Large Benchmarks and Exploring the Gap between Academic and Commercial CAD

Kevin E. Murray; Scott Whitty; Suya Liu; Jason Luu; Vaughn Betz

Benchmarks play a key role in Field-Programmable Gate Array (FPGA) architecture and CAD research, enabling the quantitative comparison of tools and architectures. It is important that these benchmarks reflect modern large-scale systems that make use of heterogeneous resources; however, most current FPGA benchmarks are both small and simple. In this article, we present Titan, a hybrid CAD flow that addresses these issues. The flow uses Altera’s Quartus II FPGA CAD software to perform HDL synthesis and a conversion tool to translate the result into the academic Berkeley Logic Interchange Format (BLIF). Using this flow, we created the Titan23 benchmark set, which consists of 23 large (90K-1.8M block) benchmark circuits covering a wide range of application domains. Using the Titan23 benchmarks and an enhanced model of Altera’s Stratix IV architecture, including a detailed timing model, we compare the performance and quality of VPR and Quartus II targeting the same architecture. We found that VPR is at least 2.8× slower, uses 6.2× more memory, 2.2× more wire, and produces critical paths 1.5× slower compared to Quartus II. Finally, we identified that VPR’s focus on achieving a dense packing and an inability to take apart clusters is responsible for a large portion of the wire length and critical path delay gap.

field programmable gate arrays | 2014

Towards interconnect-adaptive packing for FPGAs

Jason Luu; Jonathan Rose; Jason Helge Anderson

In order to investigate new FPGA logic blocks, FPGA architects have traditionally needed to customize CAD tools to make use of the new features and characteristics of those blocks. The software development effort necessary to create such CAD tools can be a time-consuming process that can significantly limit the number and variety of architectures explored. Thus, architects want flexible CAD tools that can, with few or no software modifications, explore a diverse space. Existing flexible CAD tools suffer from impractically long runtimes and/or fail to efficiently make use of the important new features of the logic blocks being investigated. This work is a step towards addressing these concerns by enhancing the packing stage of the open-source VTR CAD flow [17] to efficiently deal with common interconnect structures that are used to create many kinds of useful novel blocks. These structures include crossbars, carry chains, dedicated signals, and others. To accomplish this, we employ three techniques in this work: speculative packing, pre-packing, and interconnect-aware pin counting. We show that these techniques, along with three minor modifications, result in improvements to runtime and quality of results across a spectrum of architectures, while simultaneously expanding the scope of architectures that can be explored. Compared with VTR 1.0 [17], we show an average 12-fold speedup in packing for fracturable LUT architectures with 20% lower minimum channel width and 6% lower critical path delay. We obtain a 6 to 7-fold speedup for architectures with non-fracturable LUTs and architectures with depopulated crossbars. In addition, we demonstrate packing support for logic blocks with carry chains.

field-programmable custom computing machines | 2014

On Hard Adders and Carry Chains in FPGAs

Jason Luu; Conor McCullough; Sen Wang; Safeen Huda; Bo Yan; Charles Chiasson; Kenneth B. Kent; Jason Helge Anderson; Jonathan Rose; Vaughn Betz

Hardened adder and carry logic is widely used in commercial FPGAs to improve the efficiency of arithmetic functions. There are many design choices and complexities associated with such hardening, including circuit design, FPGA architectural choices, and the CAD flow. There has been very little study, however, on these choices and hence we explore a number of possibilities for hard adder design. We also highlight optimizations during front-end elaboration that help ameliorate the restrictions placed on logic synthesis by hardened arithmetic. We show that hard adders and carry chains, when used for simple adders, increase performance by a factor of four or more, but on larger benchmark designs that contain arithmetic, improve overall performance by roughly 15%. We measure an average area increase of 5% for architectures with carry chains but believe that better logic synthesis should reduce this penalty. Interestingly, we show that adding dedicated inter-logic-block carry links or fast carry look-ahead hardened adders result in only minor delay improvements for complete designs.
