Paul Berube
University of Alberta
Publications
Featured research published by Paul Berube.
IEEE Transactions on Education | 2005
José Nelson Amaral; Paul Berube; Paras Mehta
How should digital design be taught to computing science students in a single one-semester course? This work advocates the use of state-of-the-art design tools and programmable devices and presents a series of laboratory exercises to help students learn digital logic. Each exercise introduces new concepts and produces the complete design of a stand-alone apparatus that is fun and interesting to use. These exercises lead to the most challenging capstone designs for a single-semester course of which the authors are aware. Fast progress is made possible by providing students with predesigned input/output modules. Student feedback demonstrates that the students approve of this methodology. An extensive set of slides, supporting teaching material, and laboratory exercises are freely available for downloading.
international symposium on performance analysis of systems and software | 2006
Paul Berube; José Nelson Amaral
Published studies that use feedback-directed optimization (FDO) techniques use either a single input for both training and performance evaluation, or a single input for training and a single input for evaluation. Thus an important question is whether the FDO results published in the literature are sensitive to the selection of training and testing inputs. Aestimo is a new evaluation tool that uses a workload of inputs to evaluate the sensitivity of specific code transformations to the choice of inputs in the training and testing phases. Aestimo uses optimization logs to isolate the effects of individual code transformations. It incorporates metrics to determine the effect of training input selection on individual compiler decisions. Besides describing the structure of Aestimo, this paper presents a case study that uses SPEC CINT2000 benchmark programs with the Open Research Compiler (ORC) to investigate the effect of training/testing input selection on in-lining and if-conversion. The experimental results indicate that: (1) training input selection affects the compiler decisions made for these code transformations; and (2) the choice of training/testing inputs can have a significant impact on measured performance.
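The core idea of comparing optimization logs can be illustrated with a small sketch. The log format, function names, and agreement metric below are illustrative assumptions for exposition only, not Aestimo's actual implementation.

```python
# Hedged sketch: comparing per-site compiler decisions recorded in two
# hypothetical optimization logs, in the spirit of Aestimo's log-based
# analysis. The log format here is an assumption, not Aestimo's format.

def parse_log(lines):
    """Parse lines like 'inline main->parse yes' into {(pass, site): decision}."""
    decisions = {}
    for line in lines:
        opt, site, decision = line.split()
        decisions[(opt, site)] = decision
    return decisions

def decision_agreement(log_a, log_b):
    """Fraction of optimization sites where two training inputs led the
    compiler to the same decision."""
    a, b = parse_log(log_a), parse_log(log_b)
    common = set(a) & set(b)
    if not common:
        return 1.0
    same = sum(1 for k in common if a[k] == b[k])
    return same / len(common)

# Two training inputs agree on the if-conversion decision but not on inlining.
train_small = ["inline main->parse yes", "ifconv loop1 no"]
train_large = ["inline main->parse no", "ifconv loop1 no"]
print(decision_agreement(train_small, train_large))  # 0.5
```

A low agreement score would flag a transformation as input-sensitive, which is the kind of signal the paper's metrics are designed to surface.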
international symposium on circuits and systems | 2005
Soraya Kasnavi; Vincent C. Gaudet; Paul Berube; José Nelson Amaral
Ternary content addressable memory (TCAM) is a popular device for hardware-based lookup table solutions due to its high speed. However, TCAM devices suffer from slow updates, high power consumption, and low density. In this paper, we present a novel hardware-based longest prefix matching (HLPM) technique for pipelined TCAMs to increase TCAM efficiency. HLPM provides very simple and fast table updates, with no TCAM management requirements, while potentially decreasing the power consumption and area requirements of a TCAM. Up to 30% power savings for matching entries, compared to previously designed TCAMs, is reported.
Computer Networks | 2008
Soraya Kasnavi; Paul Berube; Vincent C. Gaudet; José Nelson Amaral
This paper proposes a novel Internet Protocol (IP) packet forwarding architecture for IP routers. This architecture comprises a non-blocking Multizone Pipelined Cache (MPC) and a hardware-supported IP routing lookup method. The paper also describes a method for expansion-free software lookups. The MPC achieves lower miss rates than those reported in the literature. The MPC uses a two-stage pipeline for a half-prefix/half-full address IP cache that results in lower activity than conventional caches. The MPC's updating technique allows the IP routing lookup mechanism to freely decide when and how to issue update requests. The effective miss penalty of the MPC is reduced by using a small non-blocking buffer. This design caches prefixes but requires significantly less expansion of the routing table than conventional prefix caches. The hardware-based IP lookup mechanism uses a Ternary Content Addressable Memory (TCAM) with a novel Hardware-based Longest Prefix Matching (HLPM) method. HLPM has lower signaling activity when processing short matching prefixes than alternative designs. HLPM has a simple solution to determine the longest matching prefix and requires a single write for table updates.
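For readers unfamiliar with the operation HLPM accelerates, the following is a plain software reference for longest-prefix matching. It is only a functional sketch of what the hardware computes; the actual design uses a pipelined TCAM, not a linear scan.

```python
# Hedged sketch: a software longest-prefix-match lookup, shown as a functional
# reference for the operation the HLPM hardware performs in a TCAM.

def longest_prefix_match(table, addr, width=32):
    """table maps (prefix, length) -> next hop; return the next hop of the
    longest prefix matching addr, or None if nothing matches."""
    best_len, best_hop = -1, None
    for (prefix, length), hop in table.items():
        shift = width - length
        if addr >> shift == prefix >> shift and length > best_len:
            best_len, best_hop = length, hop
    return best_hop

# Routing table: 10.0.0.0/8 -> hop 1, 10.1.0.0/16 -> hop 2
table = {(0x0A000000, 8): 1, (0x0A010000, 16): 2}
print(longest_prefix_match(table, 0x0A010203))  # 2 (the /16 wins over the /8)
print(longest_prefix_match(table, 0x0A020304))  # 1 (only the /8 matches)
```

In a TCAM, all entries are compared in parallel; the contribution of HLPM is selecting the longest of the matching prefixes cheaply in hardware, with single-write updates.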
field-programmable logic and applications | 2003
Paul Berube; Ashley Zinyk; José Nelson Amaral; Mike H. MacGregor
In this paper we describe a method to implement a large, high-density, fully associative cache in the Xilinx VirtexE FPGA architecture. The cache is based on a content addressable memory (CAM), with an associated memory to store information for each entry, and a replacement policy for victim selection. This implementation method is motivated by the need to improve the speed of routing IP packets through Internet routers. To test our methodology, we designed a prototype cache with a 32-bit cache tag for the IP address and 4 bits of associated data for the forwarding information. The number of cache entries and the sizes of the data fields are limited by the area available in the FPGA. However, these sizes are specified as high-level design parameters, which makes modifying the design for different cache configurations or larger devices trivial.
international conference on networking | 2005
Soraya Kasnavi; Paul Berube; Vincent C. Gaudet; José Nelson Amaral
Caching recently referenced IP addresses and their forwarding information is an effective strategy to increase routing lookup speed. This paper proposes a multizone non-blocking pipelined cache for IP routing lookup that achieves lower miss rates than previously reported IP caches. The two-stage pipeline design provides a half-prefix/half-full address cache and reduces the cache's power consumption. By adopting a very small non-blocking buffer, the cache reduces the effective miss penalty. This cache design takes advantage of storing prefixes but requires smaller table expansions (up to 50% less) than prefix caches. Simulation results on real traffic show a lower cache miss rate and up to a 30% reduction in power consumption.
international symposium on performance analysis of systems and software | 2012
Paul Berube; José Nelson Amaral
This paper introduces combined profiling (CP): a new practical methodology to produce statistically sound combined profiles from multiple runs of a program. Combining profiles is often necessary to properly characterize the behavior of a program to support Feedback-Directed Optimization (FDO). CP models program behaviors over multiple runs by estimating their empirical distributions, providing the inferential power of probability distributions to code transformations. These distributions are built from traditional single-run point profiles; no new profiling infrastructure is required. The small fixed size of this data representation keeps profile sizes, and the computational costs of profile queries, independent of the number of profiles combined. Moreover, even when built from a single program run, a CP retains the information available in the point profile, allowing CP to be used as a drop-in replacement for existing techniques. The quality of the information generated by the CP methodology is evaluated in LLVM using SPEC CPU 2006 benchmarks.
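The fixed-size distribution idea can be sketched as a small histogram per profile point. The bucket edges and query interface below are illustrative assumptions; the paper's CP representation may differ in detail.

```python
# Hedged sketch: combining per-run point profiles into a fixed-size empirical
# distribution for one profile point. Bucket edges are an assumed example.

import bisect

class CombinedProfile:
    """Fixed-size histogram of a normalized profile value across runs."""
    def __init__(self, edges=(0.1, 0.25, 0.5, 0.75, 0.9)):
        self.edges = list(edges)
        self.counts = [0] * (len(edges) + 1)
        self.runs = 0

    def add_run(self, value):
        """Record one run's normalized value (e.g. branch-taken frequency)."""
        self.counts[bisect.bisect_right(self.edges, value)] += 1
        self.runs += 1

    def prob_above(self, threshold):
        """Estimated probability that the value falls in a bucket above the
        threshold: the kind of distribution query a transformation can issue."""
        i = bisect.bisect_right(self.edges, threshold)
        return sum(self.counts[i:]) / self.runs if self.runs else 0.0

cp = CombinedProfile()
for taken_freq in [0.92, 0.88, 0.95, 0.40]:  # four runs of the program
    cp.add_run(taken_freq)
print(cp.prob_above(0.9))  # 0.5 (two of the four runs exceeded 0.9)
```

Note that the histogram's size is fixed regardless of how many runs are added, which matches the paper's claim that profile size is independent of the number of profiles combined.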
international conference on performance engineering | 2011
Paul Berube; Adam Preuss; José Nelson Amaral
Feedback-directed optimization (FDO) depends on profiling information that is representative of a typical execution of a given application. For most applications of interest, multiple data inputs need to be used to characterize the typical behavior of the program. Thus, profiling information from multiple runs of the program needs to be combined. We are working on a new methodology to produce statistically sound combined profiles from multiple runs of a program. This paper presents the motivation for combined profiling (CP), the requirements for a practical and useful methodology to combine profiles, and introduces the principal ideas under development for the creation of this methodology. We are currently working on implementations of CP in both the LLVM compiler and the IBM XL suite of compilers.
Microprocessors and Microsystems | 2004
Paul Berube; Mike H. MacGregor; José Nelson Amaral
Network routers rely on content addressable memories (CAMs) to accelerate the process of looking up the next hop of a packet. The input for this lookup is the destination address of the packet. This article describes our implementation and evaluation of a versatile prototype for a CAM. This prototype allows the empirical evaluation of the idea of caching lookup results in a multizone cache organized according to the length of the network prefix portion of the addresses. Implementing the cache efficiently in an FPGA required the design of a new cache replacement policy, the Bank Nth Chance policy. In this article we present results from a functional simulator that allows the comparison of this new policy with existing ones such as Least Recently Used, First-In-First-Out, and Second Chance. With a complete and functional prototype in a Xilinx Virtex 2000E device, we also report frequency of operation and occupation of the device. We present programmable logic design techniques that enable the implementation of ternary logic CAM cells, the efficient implementation of the new policy's reference fields, and the pipelining of lookups.
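As background for the policy comparison, here is a software simulation of the classic Second Chance replacement policy that the article compares against. The Bank Nth Chance policy itself is the paper's contribution and is not reproduced here.

```python
# Hedged sketch: the classic Second Chance cache replacement policy, one of
# the baselines the article's functional simulator compares against.

from collections import deque

class SecondChanceCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()   # FIFO order of cached tags
        self.ref = {}          # tag -> reference bit

    def access(self, tag):
        """Return True on a hit, False on a miss (inserting tag on a miss)."""
        if tag in self.ref:
            self.ref[tag] = True        # mark as recently referenced
            return True
        if len(self.queue) >= self.capacity:
            while True:                 # evict the first entry with a clear bit
                victim = self.queue.popleft()
                if self.ref[victim]:
                    self.ref[victim] = False
                    self.queue.append(victim)   # give it a second chance
                else:
                    del self.ref[victim]
                    break
        self.queue.append(tag)
        self.ref[tag] = False
        return False

cache = SecondChanceCache(2)
hits = [cache.access(t) for t in ["a", "b", "a", "c", "a"]]
print(hits)  # [False, False, True, False, True]: "a" survives eviction of "b"
```

The reference bit here is what the FPGA design must store per entry; the article's "reference fields" are the hardware analogue, and Bank Nth Chance generalizes the idea to suit banked CAM storage.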
symposium on code generation and optimization | 2009
Paul Berube; José Nelson Amaral; Rayson Ho; Raul Esteban Silvera
Feedback-directed optimization is an effective technique to improve program performance, but it may result in program performance and compiler behavior that is sensitive to both the selection of inputs used for training and the actual input in each run of the program. Cross-validation over a workload of inputs can address the input-sensitivity problem, but introduces the need to select a representative workload of minimal size from the population of available inputs. We present a compiler-centric clustering methodology to group similar inputs so that redundant inputs can be eliminated from the training workload. Input similarity is determined based on the compile-time code transformations made by the compiler after training separately on each input. Differences between inputs are weighted by a performance metric based on cross-validation in order to account for code transformation differences that have little impact on performance. We introduce the CrossError metric that allows the exploration of correlations between transformations based on the results of clustering. The methodology is applied to several SPEC benchmark programs, and illustrated using selected case studies.
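The clustering idea can be sketched in a few lines: inputs whose compile-time decisions mostly agree, weighted by how much each transformation matters to performance, fall into the same cluster. The distance function, weights, and threshold below are illustrative assumptions, not the paper's exact CrossError metric.

```python
# Hedged sketch: grouping training inputs by weighted disagreement between the
# per-site compiler decisions each input induces. Names and the threshold are
# hypothetical; the paper's metric is performance-weighted cross-validation.

def weighted_distance(dec_a, dec_b, weights):
    """Sum of weights of transformation sites where two inputs caused
    different compiler decisions."""
    sites = set(dec_a) | set(dec_b)
    return sum(weights.get(s, 1.0)
               for s in sites if dec_a.get(s) != dec_b.get(s))

def cluster_inputs(profiles, weights, threshold=1.0):
    """Single-pass clustering: an input joins the first cluster whose
    representative is within the distance threshold."""
    clusters = []
    for name, dec in profiles.items():
        for rep_name, rep_dec, members in clusters:
            if weighted_distance(dec, rep_dec, weights) <= threshold:
                members.append(name)
                break
        else:
            clusters.append((name, dec, [name]))
    return [members for _, _, members in clusters]

profiles = {
    "in1": {"inline:foo": "yes", "ifconv:L1": "no"},
    "in2": {"inline:foo": "yes", "ifconv:L1": "no"},
    "in3": {"inline:foo": "no",  "ifconv:L1": "yes"},
}
weights = {"inline:foo": 2.0, "ifconv:L1": 0.5}
print(cluster_inputs(profiles, weights))  # [['in1', 'in2'], ['in3']]
```

Inputs landing in the same cluster are redundant for training purposes, so one representative per cluster suffices in the workload, which is the size reduction the paper is after.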