Publication


Featured research published by Guy Lemieux.


Field-Programmable Technology | 2004

Directional and single-driver wires in FPGA interconnect

Guy Lemieux; Edmund Lee; Marvin Tom; Anthony J. Yu

Modern FPGA architectures from Altera and Xilinx have shifted away from allowing multiple drivers to connect to each interconnect wire. This work advocates the need for this shift to single-driver wiring by investigating the necessary architectural and circuit design changes. When single-driver wiring is used, area improves by 25%, delay improves by 9%, and area-delay improves by 32% compared to bidirectional wiring. Wiring capacitance is reduced by 37% due to reduced switch loading and physical wire length shrinkage. Furthermore, it is shown that larger circuits tend to realize larger savings. No significant CAD tool changes are needed.
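As a quick sanity check on the figures quoted in this abstract, the sketch below combines the 25% area and 9% delay improvements into an area-delay product. It assumes each figure is a fractional reduction and that the two reductions multiply; the function name and that assumption are illustrative and not taken from the paper. Under that assumption the product comes out at roughly 32%, in line with the abstract, although reported averages over a benchmark suite need not combine exactly this way.

```python
def combined_improvement(area_gain: float, delay_gain: float) -> float:
    """Fractional improvement in the area-delay product.

    Assumes each gain is a fractional reduction (0.25 means 25% smaller)
    and that the two reductions multiply. This is an illustrative
    assumption, not a formula taken from the paper.
    """
    return 1.0 - (1.0 - area_gain) * (1.0 - delay_gain)


# Figures quoted in the abstract: 25% area, 9% delay.
combined = combined_improvement(0.25, 0.09)
print(f"area-delay improvement: about {combined * 100:.0f}%")
```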


Proceedings of the IEEE | 2006

System-on-Chip: Reuse and Integration

Resve A. Saleh; Steven J. E. Wilton; Shahriar Mirabbasi; Alan J. Hu; Mark R. Greenstreet; Guy Lemieux; Partha Pratim Pande; Cristian Grecu; André Ivanov

Over the past ten years, as integrated circuits became increasingly more complex and expensive, the industry began to embrace new design and reuse methodologies that are collectively referred to as system-on-chip (SoC) design. In this paper, we focus on the reuse and integration issues encountered in this paradigm shift. The reusable components, called intellectual property (IP) blocks or cores, are typically synthesizable register-transfer level (RTL) designs (often called soft cores) or layout level designs (often called hard cores). The concept of reuse can be carried out at the block, platform, or chip levels, and involves making the IP sufficiently general, configurable, or programmable, for use in a wide range of applications. The IP integration issues include connecting the computational units to the communication medium, which is moving from ad hoc bus-based approaches toward structured network-on-chip (NoC) architectures. Design-for-test methodologies are also described, along with verification issues that must be addressed when integrating reusable components.


IEEE Design & Test of Computers | 2007

A Survey and Taxonomy of GALS Design Styles

Paul Teehan; Mark R. Greenstreet; Guy Lemieux

Single-clocked digital systems are largely a thing of the past. Although most digital circuits remain synchronous, many designs feature multiple clock domains, often running at different frequencies. Using an asynchronous interconnect decouples the timing issues for the separate blocks. Systems employing such schemes are called globally asynchronous, locally synchronous (GALS). To minimize time to market, large SoC designs must integrate many functional blocks with minimal design effort. These blocks are usually designed using standard synchronous methods and often have different clocking requirements. A GALS approach can facilitate fast block reuse by providing wrapper circuits to handle interblock communication across clock domain boundaries. SoCs may also achieve power savings by clocking different blocks at their minimum speeds. For example, Scott et al. describe the advantages of GALS design for an embedded-processor peripheral bus.


Archive | 2004

Design of Interconnection Networks for Programmable Logic

Guy Lemieux; David Lewis

1. Introduction.- 2. Interconnection Networks.- 3. Models, Methodology and CAD Tools.- 4. Sparse Crossbar Design.- 5. Sparse Cluster Design.- 6. Routing Switch Circuit Design.- 7. Switch Block Design.- 8. Conclusions.- Appendices.- A Switch Blocks: Reduced Flexibility.- A.1 Introduction.- A.4 Results.- A.5 Summary.- B Switch Blocks: Diverse Design Instances.- C VPRx: VPR Extensions.- C.1 Determination of Router Effort.- C.2 Routing Graph and Netlist Changes (Sparse Clusters).- C.3 Area and Delay Calculation Improvements.- C.4 Runtime Improvements.- C.5 Experimental Noise Reduction.- C.6 Correctness Changes.- References.


Design Automation Conference | 2005

Logic block clustering of large designs for channel-width constrained FPGAs

Marvin Tom; Guy Lemieux

In this paper we present a system-level technique for mapping large, multiple-IP-block designs to channel-width constrained FPGAs. Most FPGA clustering tools (Betz, 1999; Bozorgzadeh, 2004; Singh, 2002) aim to reduce the number of intercluster connections, hence reducing channel width needs. However, if the required channel width still exceeds the FPGA's channel width (a hard constraint), then the circuit cannot be routed. Previous work by Singh (2002) and Tessier (2000) depopulates logic clusters (CLBs) to reduce channel width. By depopulating non-uniformly, i.e., depopulating more in hard-to-route regions, we show a graceful trade-off between channel width and CLB count. This makes it possible to target specific channel-width constraints during clustering with minimal CLB inflation. Results show channel width decreases of up to 20% with a 5% increase in area. Further decreases of nearly 50% are possible at 3.3 times the original area. Despite the area increase, this technique creates routable solutions from otherwise unroutable circuits.


Field Programmable Gate Arrays | 2002

Circuit design of routing switches

Guy Lemieux; David Lewis

This paper examines circuit design of buffered routing switches in symmetrical, island-style FPGAs. The effects of switch size, tile length, level-restoring, and slow input slew rates are examined. Two new fanin-based switch designs are used to eliminate nearly all of the increase in delay that arises from fanout with a previous switch design. Alternating between buffers and pass transistors is shown to improve connection delay without fanout by 25%. To take advantage of this, we propose schemes to replace some buffers with pass transistors to simultaneously reduce area and delay. Routing a suite of MCNC benchmark circuits shows that 14% in area-delay, or 7% in delay can be saved using the new switch schemes. Alternatively, approximately 13% in area can be saved with no degradation to delay.


ACM Transactions on Reconfigurable Technology and Systems | 2009

Vector Processing as a Soft Processor Accelerator

Jason Yu; Christopher Eagleston; Christopher Han-Yu Chou; Maxime Perreault; Guy Lemieux

Current FPGA soft processor systems use dedicated hardware modules or accelerators to speed up data-parallel applications. This work explores an alternative approach of using a soft vector processor as a general-purpose accelerator. The approach has the benefits of a purely software-oriented development model, a fixed ISA allowing parallel software and hardware development, a single accelerator that can accelerate multiple applications, and scalable performance from the same source code. With no hardware design experience needed, a software programmer can make area-versus-performance trade-offs by scaling the number of functional units and register file bandwidth with a single parameter. A soft vector processor can be further customized by a number of secondary parameters to add or remove features for a specific application to optimize resource utilization. This article introduces VIPERS, a soft vector processor architecture that maps efficiently into an FPGA and provides a scalable amount of performance for a reasonable amount of area. Compared to a Nios II/s processor, instances of VIPERS with 32 processing lanes achieve up to 44× speedup using up to 26× the area.
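To put the closing figures of this abstract in perspective, the sketch below divides the quoted speedup by the quoted area ratio. The function name is hypothetical, and the calculation assumes the "up to 44×" speedup and "up to 26×" area refer to comparable configurations, which the abstract does not state. A result above 1.0 would mean the vector processor delivers more performance per unit area than the Nios II/s baseline.

```python
def speedup_per_area(speedup: float, area_ratio: float) -> float:
    """Speedup obtained per unit of relative area (1.0 is break-even)."""
    if area_ratio <= 0:
        raise ValueError("area_ratio must be positive")
    return speedup / area_ratio


# Figures quoted in the abstract: up to 44x speedup, up to 26x area.
# Pairing the two maxima is an assumption made for illustration only.
ratio = speedup_per_area(44.0, 26.0)
print(f"{ratio:.2f}x speedup per unit area relative to the baseline")
```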


International Symposium on Physical Design | 1997

On two-step routing for FPGAs

Guy Lemieux; Stephen Dean Brown; Daniel Vranesic

We present results which show that a separate global and detailed routing strategy can be competitive with a combined routing process. Under restricted architectural assumptions, we compute a new lower bound for detailed routing and show that our detailed router typically requires no more than two extra routing tracks above this computed limit. Also, experimental results show that the Mapping Anomaly presented in [20], which suggests that separated routing may yield arbitrarily poor results in certain instances, is a concern only if nets are restricted to a single track domain. Finally, to motivate future work, we show the latest two-step routing results that we have achieved with the VPR global router and SEGA detailed router tools on the largest CBL benchmark circuits.


Field-Programmable Logic and Applications | 2005

Defect-tolerant FPGA switch block and connection block with fine-grain redundancy for yield enhancement

Anthony J. Yu; Guy Lemieux

Future process nodes have such small feature sizes that there will be an increase in the number of manufacturing defects per die. For large FPGAs, it will be critical to tolerate multiple defects (Campregher et al., 2005). We propose a number of changes to the detailed routing architecture of island-style FPGAs to tolerate multiple random, distributed interconnect defects without re-routing and with minimal impact on signal timing. Our scheme is a user option prebuilt into an architecture, requiring +11% area for additional multiplexers. Unused (spare) wiring tracks are also needed, bringing total overhead to 24% to tolerate stuck-at or open faults, or 34% to include bridging. User circuits that do not fully stress the routing network already have these tracks freely available. The delay penalty is programmable: 5-10% if defect rates are expected to be sufficiently low, but can be as high as 25% if defect rates are high. Our schemes can tolerate more than 10 interconnect defects for large array sizes of 128 × 128. Unlike row/column redundancy schemes, our schemes are scalable: they naturally tolerate more defects as the FPGA array size increases. This work is the first detailed analysis of fine-grained defect-tolerant schemes in FPGAs.


Field-Programmable Custom Computing Machines | 2012

VENICE: A Compact Vector Processor for FPGA Applications

Aaron Severance; Guy Lemieux

VENICE is a new soft vector processor (SVP) for FPGA applications that is designed for maximum throughput with a small number (1 to 4) of ALUs. By increasing clock speed and eliminating bottlenecks in ALU utilization, VENICE achieves over 2× better performance per logic block than VEGAS, the previous best SVP. VENICE is also simpler to program, as its instructions use standard C pointers into a scratchpad memory rather than vector registers.

Collaboration


Dive into Guy Lemieux's collaborations.

Top Co-Authors

Shahriar Mirabbasi

University of British Columbia

David Lewis

University of Adelaide

Aaron Severance

University of British Columbia

Ameer M. S. Abdelhadi

University of British Columbia

Mehdi Alimadadi

University of British Columbia

Samad Sheikhaei

University of British Columbia

Steven J. E. Wilton

University of British Columbia

P.R. Palmer

University of Cambridge

David Grant

University of British Columbia

Mark R. Greenstreet

University of British Columbia
