Martin Labrecque
University of Toronto
Publication
Featured research published by Martin Labrecque.
very large data bases | 2010
Mohammad Sadoghi; Martin Labrecque; Harsh Singh; Warren Shum; Hans-Arno Jacobsen
In this demo, we present fpga-ToPSS (Toronto Publish/Subscribe System Family), an efficient event processing platform for high-frequency and low-latency algorithmic trading. Our event processing platform is built over reconfigurable hardware (FPGAs) to achieve line-rate processing. Furthermore, our event processing engine supports Boolean expression matching with an expressive predicate language that models complex financial strategies to autonomously buy and sell stocks based on real-time financial data.
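As a rough illustration of the Boolean expression matching mentioned in the abstract above, the following Python sketch checks a conjunction of predicates against a single market event. The field names, the operator table, and the `matches` helper are assumptions made for this example; they are not the fpga-ToPSS predicate language, and the sketch says nothing about how the matching is implemented in hardware.

```python
import operator

# Hypothetical operator table; the real predicate language is not described here.
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


def matches(subscription, event):
    """Return True if every (field, op, value) predicate holds for the event."""
    for field, op, value in subscription:
        # A predicate on a field the event does not carry cannot be satisfied.
        if field not in event or not OPS[op](event[field], value):
            return False
    return True


# Hypothetical strategy: react when ABC trades below 50 with volume of at least 1000.
subscription = [("symbol", "==", "ABC"), ("price", "<", 50.0), ("volume", ">=", 1000)]
event = {"symbol": "ABC", "price": 49.5, "volume": 1200}
result = matches(subscription, event)
```

Here `result` is true because all three predicates hold. Raising the price in the event to 51.0 would make the second predicate fail and the match return false.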
field-programmable logic and applications | 2007
Martin Labrecque; J.G. Steffan
Designers of FPGA-based systems are increasingly including soft processors (processors implemented in programmable logic) in their designs. Any combination of area, clock frequency, performance, and power may be of importance in the choice of a soft processor design to use, motivating area efficiency as the best metric with which to compare potential designs. In this paper we demonstrate that 3-, 5-, and 7-stage pipelined multithreaded soft processors are 33%, 77%, and 106% more area efficient than their single-threaded counterparts, the result of careful tuning of the architecture, ISA, and number of threads.
field programmable custom computing machines | 2008
Martin Labrecque; Peter Yiannacouras; J. G. Steffan
As FPGA-based systems including soft processors become increasingly common, we are motivated to better understand the best way to scale the performance of such systems. In this paper we explore the organization of processors and caches connected to a single off-chip memory channel, for workloads composed of many independent threads. In particular, we design and evaluate real FPGA-based processor, multithreaded processor, and multiprocessor systems on EEMBC benchmarks, investigating different approaches to scaling caches, processors, and thread contexts to maximize throughput while minimizing area. Our main finding is that while a single multithreaded processor offers improved performance over a single-threaded processor, multiprocessors composed of single-threaded processors scale better than those composed of multithreaded processors.
field-programmable logic and applications | 2009
Martin Labrecque; J. Gregory Steffan
As FPGA-based systems including soft processors become increasingly common, we are motivated to better understand the architectural trade-offs and improve the efficiency of these systems. Previous work has demonstrated that support for multithreading in soft processors can tolerate pipeline and I/O latencies as well as improve overall system throughput; however, earlier work assumes an abundance of completely independent threads to execute. In this work we show that for real workloads, in particular packet processing applications, there is a large fraction of processor cycles wasted while awaiting the synchronization of shared data structures, limiting the benefits of a multithreaded design. We address this challenge by proposing a method of scheduling threads in hardware that allows the multithreaded pipeline to be more fully utilized without significant costs in area or frequency. We evaluate our technique relative to conventional multithreading using both simulation and a real implementation on a NetFPGA board, evaluating three deep-packet inspection applications that are threaded, synchronize, and share data structures, and show that overall packet throughput can be increased by 63%, 31%, and 41% for our three applications.
ACM Sigarch Computer Architecture News | 2007
Martin Labrecque; Peter Yiannacouras; J. Gregory Steffan
Embedded systems designers who use FPGAs are increasingly including soft processors in their designs (configurable processors built in the programmable logic of the FPGA). While there has been a significant amount of research on adding custom instructions and accelerators to soft processors, these are typically used to extend an unmodified base ISA targeted by generic compilation such as with unmodified gcc. In this paper we explore several opportunities for the compiler to optimize the code generated for soft processors through application-specific customization of the base ISA, techniques that are orthogonal to adding custom instructions. In particular we explore: (i) low-level software-hardware trade-offs between basic instructions; (ii) the utility of ISA-specific features, in particular the delay slots and Hi/Lo registers in the MIPS ISA; and (iii) application-specific register management. We find that through these techniques, which have no hardware cost, we can improve the area efficiency of soft processors by 12% on average across a suite of benchmarks, and by up to 47% in the best case.
ACM Transactions on Reconfigurable Technology and Systems | 2011
Martin Labrecque; Mark C. Jeffrey; J. Gregory Steffan
As reconfigurable computing hardware and in particular FPGA-based systems-on-chip comprise an increasing number of processor and accelerator cores, supporting sharing and synchronization in a way that is scalable and easy to program becomes a challenge. Transactional Memory (TM) is a potential solution to this problem, and an FPGA-based system provides the opportunity to support TM in hardware (HTM). Although there are many proposed approaches to HTM support for ASICs, these do not necessarily map well to FPGAs. In particular, in this work we demonstrate that while signature-based conflict detection schemes (essentially bit-vectors) should intuitively be a good match to the bit parallelism of FPGAs, previous approaches result in unacceptable multicycle stalls, operating frequencies, or false-conflict rates. Capitalizing on the reconfigurable nature of FPGA-based systems, we propose an application-specific signature mechanism for HTM conflict detection. Our evaluation uses real and projected FPGA-based soft multiprocessor systems that support HTM and implement threaded, shared-memory network packet processing applications. We find that our application-specific approach: (i) maintains a reasonable operating frequency of 125 MHz, (ii) achieves a 9% to 71% increase in packet throughput relative to signatures with bit selection on a 2-thread architecture, and (iii) allows our HTM to achieve 6%, 54%, and 57% increases in packet throughput on an 8-thread architecture versus a baseline lock-based synchronization for three of four packet processing applications studied, due to reduced false synchronization.
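To make the idea of signature-based conflict detection in the abstract above more concrete, here is a minimal Python sketch of a bit-vector signature that indexes addresses by their low-order bits, one simple form of bit selection. The 16-bit width, the `bit_select` function, and the `Signature` class are assumptions for illustration only. This is not the application-specific mechanism the paper proposes, only a way to see how aliasing can produce false conflicts.

```python
SIG_BITS = 16  # assumed signature width, chosen only for illustration


def bit_select(addr, bits=SIG_BITS):
    """Map an address to a signature bit index using its low-order bits."""
    return addr % bits


class Signature:
    """Bit-vector summary of the addresses a transaction has touched."""

    def __init__(self, addrs=()):
        self.bits = 0
        for addr in addrs:
            self.add(addr)

    def add(self, addr):
        self.bits |= 1 << bit_select(addr)

    def intersects(self, other):
        # A non-zero AND means the two address sets may overlap.
        return (self.bits & other.bits) != 0


writes_a = Signature([3])

true_conflict = writes_a.intersects(Signature([3]))    # same address
false_conflict = writes_a.intersects(Signature([19]))  # 19 % 16 == 3, aliases with 3
no_conflict = writes_a.intersects(Signature([4]))      # maps to a different bit
```

Addresses 3 and 19 are distinct but map to the same signature bit, so the check reports a conflict that does not exist. The abstract's references to false-conflict rates and false synchronization point at this kind of aliasing.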
architectures for networking and communications systems | 2010
Martin Labrecque; J. Gregory Steffan
Software packet processing is becoming more important to enable differentiated and rapidly evolving network services. With increasing numbers of programmable processor and accelerator cores per network node, it is a challenge to support sharing and synchronization across them in a way that is scalable and easy to program. In this paper, we focus on parallel/threaded applications that have irregular control flow and frequently updated shared state that must be synchronized across threads. However, conventional lock-based synchronization is difficult to use and often results in frequent conservative serialization of critical sections. Alternatively, we propose that transactional memory (TM) is a good match to software packet processing: it both (i) can allow the system to optimistically exploit parallelism between the processing of packets whenever it is safe to do so, and (ii) is easy to use for a programmer. With the NetFPGA platform and four network packet processing applications that are threaded and share memory, we evaluate hardware support for TM (HTM) using the reconfigurable FPGA fabric. Relative to NetThreads, our two-processor four-way-multithreaded system with conventional lock-based synchronization, we find that adding HTM achieves 6%, 54% and 57% increases in packet throughput for three of four packet processing applications studied, due to reduced conservative serialization.
high performance interconnects | 2012
Monia Ghobadi; Geoffrey Salmon; Yashar Ganjali; Martin Labrecque; J. Gregory Steffan
This paper presents Caliper, a highly accurate packet injection tool that generates precise and responsive traffic. Caliper takes live packets generated on a host computer and transmits them onto a gigabit Ethernet network with precise inter-transmission times. Existing software traffic generators rely on generic Network Interface Cards which, as we demonstrate, do not provide high-precision timing guarantees. Hence, performing valid and convincing experiments becomes difficult or impossible in the context of time-sensitive network experiments. Our evaluations show that Caliper is able to reproduce packet inter-transmission times from a given arbitrary distribution while capturing the closed-loop feedback of TCP sources. Specifically, we demonstrate that Caliper provides three orders of magnitude better precision than a commodity NIC: with requested traffic rates up to the line rate, Caliper incurs an error of 8 ns or less in packet transmission times. Furthermore, we explore Caliper's ability to integrate with existing network simulators to project simulated traffic characteristics into a real network environment. Caliper is freely available online.
field programmable gate arrays | 2011
Martin Labrecque; J. Gregory Steffan
We propose NetTM: support for hardware transactional memory (HTM) in an FPGA-based soft multithreaded multicore that matches the strengths of FPGAs. We evaluate our system using the NetFPGA [6] platform and four network packet processing applications that are threaded and share memory. Relative to NetThreads [5], an existing two-processor four-way-multithreaded system with conventional lock-based synchronization, we find that adding HTM support (i) maintains a reasonable operating frequency of 125 MHz with an area overhead of 20%, (ii) can transactionally execute lock-based critical sections with no software modification, and (iii) achieves 6%, 55% and 57% increases in packet throughput for three of four packet processing applications studied, due to reduced false synchronization.
Archive | 2016
Martin Labrecque; J. Gregory Steffan; Geoffrey Salmon; Monia Ghobadi; Yashar Ganjali