Michaela Blott
Xilinx
Publications
Featured research published by Michaela Blott.
field programmable gate arrays | 2017
Yaman Umuroglu; Nicholas J. Fraser; Giulio Gambardella; Michaela Blott; Philip Heng Wai Leong; Magnus Jahre; Kees A. Vissers
Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 μs latency on the MNIST dataset with 95.8% accuracy, and 21906 image classifications per second with 283 μs latency on the CIFAR-10 and SVHN datasets with respectively 80.1% and 94.9% accuracy. To the best of our knowledge, ours are the fastest classification rates reported to date on these benchmarks.
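The core compute trick behind this class of accelerator is that a dot product over {-1, +1} vectors reduces to an XNOR followed by a popcount. A minimal software sketch (function name and bit encoding are illustrative, not taken from the FINN codebase):

```python
# Software sketch of the XNOR-popcount dot product used by binarized
# neural networks: weights and activations in {-1, +1} are encoded as
# bits (0 -> -1, 1 -> +1), so a dot product becomes XNOR + popcount.

def binary_dot(w_bits: int, a_bits: int, n: int) -> int:
    """Dot product of two n-element {-1,+1} vectors packed as bit masks."""
    # XNOR marks the positions where the two signs agree.
    agree = ~(w_bits ^ a_bits) & ((1 << n) - 1)
    matches = bin(agree).count("1")
    # Each agreement contributes +1, each disagreement -1.
    return 2 * matches - n

# Example: w = [+1, -1, +1, +1] -> 0b1101, a = [+1, +1, -1, +1] -> 0b1011
print(binary_dot(0b1101, 0b1011, 4))  # signs agree in 2 of 4 positions -> 0
```

On an FPGA the XNOR and popcount map directly onto LUTs, which is why per-layer compute can be scaled so finely.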
field-programmable logic and applications | 2013
Zsolt István; Gustavo Alonso; Michaela Blott; Kees A. Vissers
Common web infrastructure relies on distributed main memory key-value stores to reduce access load on databases, thereby improving both performance and scalability of web sites. As standard cloud servers provide sub-linear scalability and reduced power efficiency for these kinds of scale-out workloads, we have investigated a novel dataflow architecture for key-value stores with the aid of FPGAs which can deliver consistent 10Gbps throughput. In this paper, we present the design of a novel hash table which forms the centrepiece of this dataflow architecture. The fully pipelined design can sustain consistent 10Gbps line-rate performance by deploying a concurrent mechanism to handle hash collisions. We address problems such as support for a broad range of key sizes without stalling the pipeline through careful matching of lookup time with packet reception time. Finally, the design is based on a scalable architecture that can be easily parametrized to work with different memory types operating at different access speeds and latencies. We deployed this hash table in a memcached prototype to index 2 million entries in 24GBytes of external DDR3 DRAM while sustaining 13 million requests per second for UDP binary encoded memcached packets, which is the maximum packet rate that can be achieved with memcached on a 10Gbps link.
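A pipelined hardware hash table typically avoids pointer chasing by giving each bucket a fixed number of slots that are fetched in one memory access and compared concurrently. The following software model illustrates that idea only; the constants, names, and overflow behaviour are hypothetical and not taken from the paper's design:

```python
# Illustrative model of a hash table with fixed-capacity buckets, as in
# pipelined hardware designs where every slot of a bucket is fetched in
# a single access and compared in parallel (modelled sequentially here).

import zlib

NUM_BUCKETS = 1024
SLOTS_PER_BUCKET = 8  # collisions resolved inside the bucket, no chaining

table = [[None] * SLOTS_PER_BUCKET for _ in range(NUM_BUCKETS)]

def bucket_of(key: bytes) -> int:
    return zlib.crc32(key) % NUM_BUCKETS

def insert(key: bytes, value: bytes) -> bool:
    slots = table[bucket_of(key)]
    for i, entry in enumerate(slots):
        if entry is not None and entry[0] == key:  # update in place
            slots[i] = (key, value)
            return True
    for i, entry in enumerate(slots):
        if entry is None:  # take the first free slot
            slots[i] = (key, value)
            return True
    return False  # bucket full: hardware would signal an overflow

def lookup(key: bytes):
    for entry in table[bucket_of(key)]:
        if entry is not None and entry[0] == key:
            return entry[1]
    return None

insert(b"user:42", b"alice")
print(lookup(b"user:42"))  # b'alice'
```

Because a lookup touches exactly one bucket, its latency is fixed, which is what lets the hardware pipeline match lookup time to packet reception time.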
ieee hot chips symposium | 2013
Michaela Blott; Kees A. Vissers
Presents a collection of slides covering the following topics: key-value stores; TCP-IP stack; synchronization overhead; L3 cache; FPGA-based dataflow architecture; hash table architecture; memcached evaluation; and code complexity.
IEEE Network | 2014
Gianni Antichi; Muhammad Shahbaz; Yilong Geng; Noa Zilberman; Adam Covington; Marc Bruyere; Nick McKeown; Nick Feamster; Bob Felderman; Michaela Blott; Andrew W. Moore; Philippe Owezarski
Despite network monitoring and testing being critical for computer networks, current solutions are both extremely expensive and inflexible. Into this lacuna we launch the Open Source Network Tester, a fully open source traffic generator and capture system. Our prototype implementation on the NetFPGA-10G supports 4 × 10 Gb/s traffic generation across all packet sizes, and traffic capture is supported up to 2 × 10 Gb/s with naïve host software. Our system implementation provides methods for scaling and coordinating multiple generator/capture systems, and supports 6.25 ns timestamp resolution with clock drift and phase coordination maintained by GPS input. Additionally, our approach has demonstrated lower cost than comparable commercial systems while achieving comparable levels of precision and accuracy, all within an open-source framework extensible with new features to support new applications, while permitting validation and review of the implementation.
field-programmable custom computing machines | 2015
David Sidler; Gustavo Alonso; Michaela Blott; Kimon Karras; Kees A. Vissers; Raymond Carley
TCP/IP is the predominant communication protocol in modern networks but also one of the most demanding. Consequently, TCP/IP offload is becoming increasingly popular with standard network interface cards. TCP/IP Offload Engines have also emerged for FPGAs, and are being offered by vendors such as Intilop, Fraunhofer HHI, PLDA and Dini Group. With the target application being high-frequency trading, these implementations focus on low latency and support a limited session count. However, many more applications beyond high-frequency trading can potentially be accelerated inside an FPGA once TCP with high session count is available inside the fabric. This way, a network-attached FPGA on ingress and egress to a CPU can accelerate functions such as encryption, compression, memcached and many others in addition to running the complete network stack. This paper introduces a novel architecture for a 10Gbps line-rate TCP/IP stack for FPGAs that can scale with the number of sessions and thereby addresses these new applications. We prototyped the design on a VC709 development board, demonstrating compatibility with existing network infrastructure, operating at full 10Gbps throughput full-duplex while supporting 10,000 sessions. Finally, the design has been described primarily using high-level synthesis, which accelerates development time and improves maintainability.
international parallel and distributed processing symposium | 2016
Giulia Guidi; Enrico Reggiani; Lorenzo Di Tucci; Gianluca Durelli; Michaela Blott; Marco D. Santambrogio
Custom hardware accelerators are widely used to improve the execution times of software applications and to reduce energy consumption. However, realizing a hardware accelerator and integrating it into the final system is a difficult and error-prone task. For this reason, both industry and academia continuously develop Computer Aided Design (CAD) tools to assist the designer in the development process. Even though many of these steps are now automated, system integration, SW/HW interface definition, and driver generation remain almost entirely manual tasks. The latest tool released by Xilinx, however, aims to improve the hardware design experience by leveraging the OpenCL standard to enhance overall productivity and to enable code portability. This paper provides an overview of SDAccel's potential, comparing its design flow with other methodologies using two case studies from the bioinformatics field: brain network and protein folding analysis.
high performance embedded architectures and compilers | 2017
Nicholas J. Fraser; Yaman Umuroglu; Giulio Gambardella; Michaela Blott; Philip Heng Wai Leong; Magnus Jahre; Kees A. Vissers
Binarized neural networks (BNNs) are gaining interest in the deep learning community due to their significantly lower computational and memory cost. They are particularly well suited to reconfigurable logic devices, which contain an abundance of fine-grained compute resources and can result in smaller, lower power implementations, or conversely in higher classification rates. Towards this end, the FINN framework was recently proposed for building fast and flexible field programmable gate array (FPGA) accelerators for BNNs. FINN utilized a novel set of optimizations that enable efficient mapping of BNNs to hardware and implemented fully connected, non-padded convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. However, FINN was not evaluated on larger topologies due to the size of the chosen FPGA, and exhibited decreased accuracy due to lack of padding. In this paper, we improve upon FINN to show how padding can be employed on BNNs while still maintaining a 1-bit datapath and high accuracy. Based on this technique, we demonstrate numerous experiments to illustrate flexibility and scalability of the approach. In particular, we show that a large BNN requiring 1.2 billion operations per frame running on an ADM-PCIE-8K5 platform can classify images at 12 kFPS with 671 μs latency while drawing less than 41 W board power and classifying CIFAR-10 images at 88.7% accuracy. Our implementation of this network achieves 14.8 trillion operations per second. We believe this is the fastest classification rate reported to date on this benchmark at this level of accuracy.
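The padding idea can be sketched in a few lines: if the pad value is -1, one of the two binary levels, padded pixels remain representable in a single bit and the 1-bit datapath is preserved. This is a hedged illustration only; the function name and the exact padding scheme are assumptions, and the paper's construction may differ:

```python
# Hedged sketch of padding a binarized feature map while keeping a
# 1-bit datapath: the pad value -1 is one of the two binary levels,
# so padded pixels need no extra encoding. Illustrative only.

def pad_binary(fmap, pad=1):
    """Same-pad a {-1,+1} feature map with -1 borders on all sides."""
    width = len(fmap[0]) + 2 * pad
    border = [[-1] * width for _ in range(pad)]
    core = [[-1] * pad + list(row) + [-1] * pad for row in fmap]
    return border + core + border

padded = pad_binary([[1, -1], [-1, 1]])
for row in padded:
    print(row)
```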
ACM Transactions on Reconfigurable Technology and Systems | 2015
Zsolt István; Gustavo Alonso; Michaela Blott; Kees A. Vissers
FPGA-based data processing is becoming increasingly relevant in data centers, as the transformation of existing applications into dataflow architectures can bring significant throughput and power benefits. Furthermore, a tighter integration of computing and network is appealing, as it overcomes traditional bottlenecks between CPUs and network interfaces, and dramatically reduces latency. In this article, we present the design of a novel hash table, a fundamental building block used in many applications, to enable data processing on FPGAs close to the network. We present a fully pipelined design capable of sustaining consistent 10Gbps line-rate processing by deploying a concurrent mechanism to handle hash collisions. We address additional design challenges such as support for a broad range of key sizes without stalling the pipeline through careful matching of lookup time with packet reception time. Finally, the design is based on a scalable architecture that can be easily parameterized to work with different memory types operating at different access speeds and latencies. We have tested the proposed hash table in an FPGA-based memcached appliance implementing a main-memory key-value store in hardware. The hash table is used to index 2 million entries in 24GB of external DDR3 DRAM while sustaining 13 million requests per second, the maximum packet rate that can be achieved with UDP packets on a 10Gbps link for this application.
languages, compilers, and tools for embedded systems | 2009
Paul Edward McKechnie; Michaela Blott; Wim Vanderbauwhede
The fine-grained parallelism inherent in FPGAs has encouraged their use in packet processing systems. Debugging and performance evaluation of such complex designs can be significantly improved through debug information that provides a system-level perspective and hides the complexity of signal-level debugging. In this paper we present a debugging system that permits transaction-based communication-centric monitoring of packet processing systems. We demonstrate, using two different examples, how this system can improve the debugging information and abstract lower level detail. Furthermore, we demonstrate that transaction monitoring systems require fewer resources than conventional RTL debugging systems and can provide a system-level perspective not permitted by traditional tools.
design, automation, and test in europe | 2017
Lorenzo Di Tucci; Kenneth O'Brien; Michaela Blott; Marco D. Santambrogio
Smith-Waterman is a dynamic programming algorithm that plays a key role in the modern genomics pipeline, as it is guaranteed to find the optimal local alignment between two strings of data. The state of the art presents many hardware acceleration solutions that exploit the high degree of parallelism available in this algorithm. The majority of these implementations use heuristics to increase the performance of the system at the expense of the accuracy of the result. In this work, we present an implementation of the pure version of the algorithm. We include the key architectural optimizations to achieve the highest possible performance for a given platform, and leverage the Berkeley roofline model to track performance and guide the optimizations. To achieve scalability, our custom design comprises systolic arrays, data compression features and shift registers, while a custom port mapping strategy aims to maximize performance. Our designs are built leveraging an OpenCL-based design entry, namely Xilinx SDAccel, in conjunction with Xilinx Virtex 7 and Kintex UltraScale platforms. Our final design achieves a performance of 42.47 GCUPS (giga cell updates per second) with an energy efficiency of 1.6988 GCUPS/W. This represents a 1.72x improvement in performance and energy efficiency over previously published FPGA implementations, and an 8.49x improvement in energy efficiency over comparable GPU implementations.
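The "pure version of the algorithm" refers to the exact Smith-Waterman recurrence, with no heuristic pruning. A plain-Python reference of that recurrence, with illustrative scoring parameters (the paper's exact scoring scheme is not stated here):

```python
# Exact (non-heuristic) Smith-Waterman local alignment, the recurrence
# the hardware accelerates. Scoring parameters are illustrative.

def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    """Return the optimal local-alignment score of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            # Local alignment: cell scores never drop below zero.
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("GGTTGACTA", "TGTTACGG"))
```

Each anti-diagonal of H depends only on the previous two, which is what a systolic array exploits: one processing element per column, all anti-diagonal cells updated in the same cycle.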