FPGAs-as-a-Service Toolkit (FaaST)
Dylan Rankin, Jeffrey Krupa, Philip Harris
Massachusetts Institute of Technology
Cambridge, MA 02139, USA
Maria Acosta Flechas, Burt Holzman, Thomas Klijnsma, Kevin Pedro, Nhan Tran
Fermi National Accelerator Laboratory
Batavia, IL 60510, USA
Scott Hauck, Shih-Chieh Hsu, Matthew Trahms, Kelvin Lin, Yu Lou
University of Washington
Seattle, WA 98195, USA
Ta-Wei Ho
National Tsing Hua University
Hsinchu, Taiwan 300044, R.O.C.
Javier Duarte
University of California San Diego
La Jolla, CA 92093, USA
Mia Liu
Purdue University
West Lafayette, IN 47907, USA
Abstract—Computing needs for high energy physics are already intensive and are expected to increase drastically in the coming years. In this context, heterogeneous computing, specifically as-a-service computing, has the potential for significant gains over traditional computing models. Although previous studies and packages in the field of heterogeneous computing have focused on GPUs as accelerators, FPGAs are an extremely promising option as well. A series of workflows are developed to establish the performance capabilities of FPGAs as a service. Multiple different devices and a range of algorithms for use in high energy physics are studied. For a small, dense network, the throughput can be improved by an order of magnitude with respect to GPUs as a service. For large convolutional networks, the throughput is found to be comparable to GPUs as a service. This work represents the first open-source FPGAs-as-a-service toolkit.
Index Terms—FPGAs, machine learning, as a service, high energy physics
I. INTRODUCTION
The breakdown of Dennard scaling [1] in the last decade has changed the landscape of modern computing [2]. Without the promise of ever-faster central processing units (CPUs) at a fixed power consumption, users have been forced to search elsewhere for solutions to their ever-growing computing needs [3, 4]. Some improvements in processor performance have come from the advent of multi-core processors. However, there is growing interest in alternative computing architectures, such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs). All of these architectures have been used in the past for various specialized tasks that make explicit use of their specific strengths, but a broader range of use cases has been encouraged in recent years by heterogeneous computing. Heterogeneous computing denotes systems which make use of more than one type of computing architecture, typically a CPU and one of the alternative architectures above. The alternative architecture is typically referred to as the "coprocessor" or "accelerator." The advantage of this computing paradigm is that each algorithm can be run on the best-suited architecture. Tools and strategies to simplify the use of heterogeneous computing solutions have enabled a growing list of applications to take advantage of alternative architectures.

Implementations of heterogeneous computing can take various forms. The simplest design is typically to connect each accelerator to a CPU, and then have each CPU offload some portion of its work to the accelerator. However, this is not necessarily the most effective design for a given system. One alternative paradigm, called computing "as a service", consists of separate server and client CPUs [5-7]. Server CPUs are directly connected to the accelerators, and are responsible only for managing requests to communicate with the accelerator. Client CPUs have network connections to the servers and are responsible for all other parts of the computing workflow; to use the accelerator they must send requests and receive replies from the servers. This design separates the management of the accelerator from the rest of the workflow, and simplifies the integration of the accelerator. Replacing the accelerated application with a request to and reply from the server allows the client to remain insensitive to specifics of the accelerator such as the architecture, physical connections, transfer protocols, and other details of handling the data.

Heterogeneous workflows involving GPUs have been used for machine learning (ML) with great success [8], but workflows involving FPGAs have been slower to develop. Traditionally, algorithm development for FPGAs has been restricted to experts well-versed in hardware description languages (HDLs), greatly limiting the pool of possible developers. Conversely, high-level synthesis (HLS) compilers are capable of transforming untimed C into applications written in HDL, reducing the barrier to entry for FPGA algorithm development [9]. For certain tasks, modern HLS tools are able to achieve performance comparable to that of handwritten HDL [10].

Despite their relative immaturity as accelerators in heterogeneous workflows, FPGAs have many appealing features from a computing standpoint. Fast algorithms can be run in nanoseconds, allowing large speedups in comparison to the same algorithms on CPUs. FPGAs are also capable of running many smaller operations in parallel, thus allowing further improvements in speed.
Although ASICs are capable of providing similar or better factors of improvement in terms of speed, the ability to customize FPGAs allows them to be adapted to many different tasks or updated as algorithms and needs change. FPGAs are also capable of providing this performance with reduced power consumption when compared to CPUs or GPUs.

As with GPU-based heterogeneous computing tools, many tools focused on FPGAs are designed with ML algorithms in mind, specifically deep neural networks (DNNs). The characteristics of most ML algorithms, specifically a small number of inputs and a large number of operations, are well suited to as-a-service computing models. The algorithms considered in our work are all ML algorithms of different sizes, meant to span a wide range of possible requirements and design parameters. All of the algorithms explored here are contenders for integration into heterogeneous workflows involving FPGAs.

We use a combination of custom and existing tools intended to target the specific needs of each algorithm. These are packaged into a cohesive set of implementations that contain both the server and client code required to deploy both small and large DNN models, with different NN architectures, on multiple different hardware platforms. We refer to this as the FPGAs-as-a-Service Toolkit (FaaST) [11, 12]. The framework we employ and the server design are capable of supporting both ML and non-ML algorithms.

The rest of this paper is structured as follows. In Section II, we review related work. Section III describes the set of tools and ML models used. In Section IV, we give results for the FaaST approach and compare it to other approaches with GPUs and CPUs. Finally, Sections V and VI provide discussion and outlook.

II. RELATED WORK
As-a-service computing for ML algorithms is a growing area of development at the intersection of the fields of ML and on-demand cloud computing [13, 14]. The bulk of the tools available focus mainly on accelerating inference for large convolutional neural networks (CNNs) using GPUs. Our work builds directly on some of these existing platforms.

High energy physics (HEP) workflows typically process an event using distinct modules, each responsible for executing a specific algorithm or computing a particular property of the event. These modules can depend on the output of other modules, and therefore must be scheduled and in some cases run in a particular order to process an event successfully [15]. The Services for Optimized Network Inference on Coprocessors (SONIC) [16] approach is designed with HEP workflows in mind. With this approach, the client API for a given server is integrated into an experiment's C++-based software framework, specifically that of the Compact Muon Solenoid (CMS) experiment at the CERN Large Hadron Collider. Notably, SONIC supports accelerating generic algorithms using asynchronous, non-blocking methods. This allows event processing on the CPU to proceed simultaneously with the accelerated algorithm, making maximal use of the computing resources.

The feasibility of the as-a-service computing model for HEP workflows has been previously demonstrated using SONIC to interact with a GPU-based server for inference [17]. The server/client design employed within this paper is similar to previous work, allowing for a direct comparison of the performance. In addition, the design similarity showcases the versatility of the SONIC framework to handle both GPU- and FPGA-based coprocessor servers.

Our results utilize multiple DNNs that are all relevant for HEP. These networks span a range of sizes, use cases, and constraints. For small networks, the server is implemented using hls4ml [18, 19] and Vitis Accel (previously SDAccel) [20]. For large networks, the server is implemented using Xilinx ML Suite [21].

The hls4ml package translates neural network models into FPGA firmware. The firmware description is generated in an HLS language and is then compiled into a firmware description in VHDL/Verilog. hls4ml contains various tunable parameters to control the resource usage and performance, which are very useful to maximize the performance of the design. Vitis Accel is a tool designed by Xilinx to allow for implementation and acceleration of generic FPGA kernels and their management from a host CPU. In some of our work, the server and its communication with the FPGA are written using Vitis Accel, while the FPGA kernels that perform the inference are created with hls4ml.

Xilinx ML Suite [21] is a library developed by Xilinx that can deploy CNNs to Xilinx FPGAs. It contains a utility to quantize models, a compiler that converts TensorFlow [22] or Caffe [23] models to an internal format, a CNN processing unit implementation on FPGA, and a Python interface.

Azure Machine Learning [24] is a cloud-based environment developed by Microsoft to train, deploy, and manage ML models. It provides, among other things, a Python software development kit to interact with the Microsoft Azure Stack Edge (ASE) [25], a physical on-premises network appliance capable of providing several ML models as a service.

III. TOOL DESCRIPTION
We use the SONIC framework to implement the client. The client employs asynchronous non-blocking gRPC calls to send requests to the server [26].

In order to perform inference on FPGAs, we use a combination of commercial and self-developed tools. We design services for two benchmark networks: FACILE and ResNet-50. These networks differ dramatically in size and design constraints, and therefore we use separate methodologies to construct a service for each. For FACILE, we use hls4ml and Vitis Accel, while for ResNet-50, we use either Xilinx ML Suite or Azure Machine Learning Studio.

For both networks, we use the same formatting for client-server messages and requests as the Nvidia Triton Inference Server [27]. This allows the exact same client to be used with either GPUs or FPGAs as a service with no modifications (an illustrative client call is sketched below). In the case of ResNet-50, we also investigate an alternative server design (still using gRPC) that runs on the Microsoft Azure Stack Edge.

In many cases, we find that the server performance is limited first by the server itself. Specifically, even without performing any acceleration or explicit processing in the server, the throughput is limited by the gRPC server's ability to accept requests and return replies. In order to remove this limitation, we employ proxy servers to allow multiple server instances to run simultaneously while remaining visible to the client as a single entity.
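For illustration, a minimal client in this style might look as follows. This sketch uses the current tritonclient Python package, which may differ from the Triton v1 gRPC protocol used at the time of the paper; the model name "facile", the tensor names, and the shapes are placeholders rather than the actual FaaST configuration.

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Connect to a (hypothetical) FaaST or Triton endpoint.
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Placeholder tensor name and shape; FACILE is served with a batch of
# 16,000 calorimeter channels in the paper.
inputs = grpcclient.InferInput("input", [16000, 15], "FP32")
inputs.set_data_from_numpy(np.zeros((16000, 15), dtype=np.float32))

def callback(result, error):
    # Invoked from a gRPC worker thread once the reply arrives, so the
    # caller can keep processing the event in the meantime.
    if error is None:
        print(result.as_numpy("output").shape)

# Non-blocking call, mirroring SONIC's asynchronous request pattern.
client.async_infer("facile", [inputs], callback=callback)
```

Because the request format matches Triton, the same snippet can point at a GPU-backed Triton server or a FaaST server without modification.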
A. FACILE
FACILE is a small fully-connected neural network trained to regress the energy of a particle based on time-sequence data read out from a calorimeter, an experimental apparatus that measures the energy a particle loses as it passes through it [28]. The network takes multiple measurements of the energy deposited in a region of the CMS hadron calorimeter as input and outputs the incident particle's energy. This type of regression task is very common, with many algorithms of various sizes and different architectures employed across HEP experiments [29, 30]. FACILE is quite compact, consisting of three hidden layers with widths 31, 11, and 3 and rectified linear unit (ReLU) activation functions [31], three batch normalization layers [32], and 1,001 trainable parameters, and therefore serves as a useful benchmark for ultrafast accelerated algorithms. The synthesized FPGA kernel accepts all inputs simultaneously and produces the output in 34 clock cycles, with a clock frequency of 300 MHz. This means that the inference result is available in 104 ns. This application can also be run with a batched input consisting of all 16,000 channels of the calorimeter. In order to provide inputs to and receive outputs from the kernel running on an FPGA, we use Vitis Accel.
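Before the kernel is integrated with Vitis Accel, the network itself is translated to HLS with hls4ml. The following sketch builds a FACILE-like model in Keras and converts it; the input width, output structure, and FPGA part are assumptions, and the hls4ml Python API shown here is the current one, which differs from the v0.3.0 release [19] used for this work.

```python
import tensorflow as tf
import hls4ml

# FACILE-like model: three hidden layers of widths 31, 11, and 3 with
# ReLU activations and batch normalization, as described in the text.
# The input width (15) and single regression output are assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(15,)),
    tf.keras.layers.Dense(31, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(11, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(3, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1),  # regressed particle energy
])

# Translate to an HLS project; precision and reuse factor are among the
# tunable knobs mentioned above. The part number targets an Alveo U250.
config = hls4ml.utils.config_from_keras_model(model, granularity="name")
hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir="facile_hls",
    part="xcu250-figd2104-2L-e",
)
hls_model.compile()  # C simulation; hls_model.build() would run HLS synthesis
```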
Vitis Accel provides a software framework to manage signals between an FPGA and a CPU. The core of this framework is a "shell" on the FPGA that connects the programmable logic to external memories, such as double data rate synchronous dynamic random-access memory (DDR SDRAM) banks, and to the CPU via a PCI Express (PCIe) connection. The necessity of implementing the inference kernel inside this framework places various restrictions on its design. The shell's design requires that the FPGA kernels are connected to the CPU only through the external DDR memory banks. As a result, the inputs and outputs are not sent directly from the CPU to the FPGA kernel, but instead are streamed from the CPU to the DDR SDRAM via PCIe, and then from the DDR SDRAM to the FPGA kernel. This design means that the kernel must be capable of buffering inputs and outputs until they are all present on the chip for a given inference. Although some computations necessary for the inference result can proceed without the full set of inputs available, the overall latency will still be dictated by the last-arriving input. Further, for small networks especially, the resources and performance cannot be improved meaningfully by adapting the kernel design specifically for streaming inputs.

Vitis Accel also provides a framework for managing the data transfers (between the CPU and DDR memory) and the inference execution. This is performed through the use of execution queues, where the dependence of a given task on previous tasks can be fully specified. This means that the queue may contain tasks that are blocking or non-blocking; the call to place a task in the queue is itself non-blocking. The three main tasks to queue are, in order: memory migration of the inputs from the CPU to the FPGA DDR SDRAM, kernel execution, and memory migration of the outputs from the FPGA DDR SDRAM to the CPU. The simplest command flow to execute successfully is to make each of these three calls blocking. The result is that they execute sequentially, and a new inference may only begin once the result of the previous inference call is received.

There are two main improvements that can be made to this basic command flow. The first is to fully utilize the FPGA resources. The two chips used in this work are the Xilinx Virtex UltraScale+ VU9P FPGA, via Amazon Web Services (AWS) Elastic Compute Cloud (EC2) F1 instances, and the Xilinx Alveo U250 Data Center Accelerator Card. Both of these chips are large modern FPGAs constructed from multiple super logic regions (SLRs) with limited connections between SLRs. As a result, it is significantly simpler to design algorithms that can be placed on a single SLR. The VU9P comprises 3 SLRs, while the Alveo U250 comprises 4 SLRs. Since the hls4ml kernel above can be placed on a single SLR, it is straightforward to place 3 (4) copies of the kernel on a VU9P (Alveo U250). The copies of the kernel are referred to as "compute units" (CUs) in the language of Vitis Accel. In order to avoid the need for crossing SLR boundaries to access the DDR memory, we must also restrict each CU to access only the DDR memory connected directly to the SLR on which it is placed. By creating multiple CUs, this design can provide a proportional improvement in the throughput that can be achieved.

The other improvement that can be made is to make more effective use of the task queue. This can be done by using a buffer in each DDR memory bank. Since the total time for the kernel execution with a large batch exceeds the total time for memory migration from the CPU to the FPGA DDR SDRAM, we define a region in the DDR memory with size equal to an integer multiple of the size of a single input batch. Then for each CU, instead of requiring that the three main tasks are executed sequentially, we copy inputs such that the DDR buffer is always full. This allows each CU to execute continuously by iterating sequentially through the DDR buffer. Some tracking of the CU completion is still necessary for the memory migration of outputs from the FPGA DDR SDRAM to the CPU, as well as at startup when the DDR buffer is not yet full. Figure 1 shows the schedule for a buffer size of four inputs once the design is running stably. The optimal scheduling for this design requires that the input buffer always contains an input when the inference kernel is available. If the ratio of the total transfer time to the kernel execution time is R, we expect that the buffer must at least be large enough for R inputs to ensure that optimal scheduling is possible (a toy model of this constraint is sketched at the end of this subsection).

Fig. 1: Schematic of the task schedule for a DDR buffer size equal to 4 times a single batched input. The scheduling is shown after the buffers have stabilized after startup.

Finally, we find that the gRPC server itself cannot handle more than approximately 2,000 requests per second. In order to increase this limit, we spawn 8 threads to handle inference requests, each listening on a distinct address but sharing the same task queue for the FPGA. We then use a HAProxy server [33] to accept requests on a single address and forward requests in a round-robin fashion to the 8 addresses that correspond to the threads above. This configuration allows the FaaST server to fully utilize the FPGA.
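Returning to the DDR-buffer scheduling above, the buffer-size condition can be made concrete with a toy discrete-event model of our own (the timings are illustrative, not measured). Transfers for different buffer slots are assumed to overlap freely, one CU processes slots strictly in order, and a slot is reused only once the batch that last occupied it has drained back to the CPU.

```python
# Toy discrete-event model of the buffered schedule in Fig. 1.
def throughput(t_in, t_kernel, t_out, buf_slots, n=10000):
    kernel_free = 0.0
    done = [0.0] * n
    for i in range(n):
        # A slot is reusable only once the batch that last occupied it
        # has finished its output migration back to the CPU.
        slot_free = done[i - buf_slots] if i >= buf_slots else 0.0
        input_ready = slot_free + t_in         # input migration
        start = max(kernel_free, input_ready)  # kernel waits for its input
        kernel_free = start + t_kernel         # kernel execution
        done[i] = kernel_free + t_out          # output migration
    return n / done[-1]

# With R = (t_in + t_out) / t_kernel = 3 in this example, the
# kernel-limited plateau of ~1,000 batches/s appears once the buffer
# holds about R batches.
for slots in (1, 2, 4, 8):
    print(slots, round(throughput(1.5e-3, 1.0e-3, 1.5e-3, slots)))
```

With buf_slots = 1 the model reduces to the naive fully-blocking flow described earlier; larger buffers hide the transfers behind kernel execution, mirroring the buffer scan presented in Section IV.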
B. ResNet-50

ResNets belong to a class of neural network architectures that use the residual learning technique [34], with ResNet-50 denoting a particular version with 50 layers. While ResNet-50 was initially designed for natural image classification, it has been adapted to many other types of problems. In this work, we use a ResNet-50 model trained to classify collimated showers of particles, or jets, generated from proton-proton collisions [35, 36]. Specifically, the model is trained to distinguish jets originating from a top quark from other jets. Similar image-based algorithms have been shown to be very effective at this particular classification task [37]. Large networks like ResNet-50 are a useful benchmark in contrast to FACILE, since they require much longer latencies and therefore represent a different class of possible as-a-service use cases. To construct the image used as input, we map the detector's surface to a two-dimensional grid and assign each pixel's value to be the total transverse momentum detected at the corresponding position. For this task, after the primary ResNet-50 feature extractor, which produces 2,048 features, a custom classifier is added, comprising one fully connected layer of width 1,024 with ReLU activation and another fully connected layer of width 2 with softmax activation, whose output represents the probability of the jet arising from a top quark or not.
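A Keras sketch of this architecture is shown below; the layer widths of the classifier head come from the text, while the input grid size and single-channel image format are assumptions.

```python
import tensorflow as tf

# ResNet-50 backbone with global average pooling -> 2,048 features.
# The 224x224 single-channel input grid is an assumption; the paper
# maps the detector surface to a 2D grid of transverse momenta.
backbone = tf.keras.applications.ResNet50(
    include_top=False, weights=None, pooling="avg",
    input_shape=(224, 224, 1))

# Custom classifier head described in the text.
head = tf.keras.Sequential([
    tf.keras.layers.Dense(1024, activation="relu"),  # 2,048 -> 1,024
    tf.keras.layers.Dense(2, activation="softmax"),  # top quark vs. other
])

model = tf.keras.Model(backbone.input, head(backbone.output))
```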
1) Xilinx ML Suite:
To provide ResNet-50 as a service, we first used Xilinx ML Suite to quantize and load the model. We considered Vitis AI, but, at the time of writing, Xilinx did not officially support Vitis AI on AWS. Although we did not convert our ResNet-50 model for top quark tagging to the format used by Xilinx ML Suite, the default ResNet-50 model has a similar number of parameters and operations. Therefore, we expect that the performance in terms of latency and throughput should be similar.

ML Suite is used with an asynchronous inference call. Each inference request to the FPGA is assigned a job ID to identify the request. We restrict the server so that at most 8 jobs are in process simultaneously; other requests are queued. To utilize the asynchronous feature, we create two threads that communicate with ML Suite. The first thread, called the FPGA worker, fetches new data from a pending job's queue and passes it to the ML Suite runtime as soon as there is an available job ID. Another thread, called the FPGA waiter, waits for a job's completion signal and then fetches the inference result when it becomes available. Figure 2 shows the workflow inside the server process; a schematic of this thread pattern is sketched below.

Fig. 2: Server structure. gRPC workers communicate with the Internet. The FPGA worker sends the data to the FPGA. The FPGA waiter waits for the job completion signal.

As with FACILE, the public gRPC interface is the same as the Nvidia Triton server. Thus, existing SONIC clients can connect to this server without any modification.
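The worker/waiter pattern can be sketched in plain Python as follows. The StubRuntime class stands in for the ML Suite runtime, whose actual API is not reproduced here; all names in the stub are illustrative.

```python
import itertools
import queue
import threading
import time

MAX_IN_FLIGHT = 8                      # at most 8 jobs in process
pending = queue.Queue()                # filled by the gRPC workers
completed = queue.Queue()              # job IDs signalled as done
job_slots = threading.Semaphore(MAX_IN_FLIGHT)
in_flight = {}                         # job ID -> reply callback
next_id = itertools.count().__next__

class StubRuntime:
    """Stand-in for the ML Suite runtime; method names are illustrative."""
    def submit_job(self, job_id, data):
        # Pretend the FPGA finishes ~1 ms after submission.
        threading.Timer(0.001, completed.put, args=(job_id,)).start()
    def fetch_result(self, job_id):
        return b"softmax scores"

def fpga_worker(runtime):
    # Feeds pending requests to the runtime whenever a job slot frees up.
    while True:
        data, reply_cb = pending.get()
        job_slots.acquire()
        job_id = next_id()
        in_flight[job_id] = reply_cb   # register before submitting
        runtime.submit_job(job_id, data)

def fpga_waiter(runtime):
    # Waits for completion signals, fetches results, answers the client.
    while True:
        job_id = completed.get()
        result = runtime.fetch_result(job_id)
        job_slots.release()
        in_flight.pop(job_id)(result)

runtime = StubRuntime()
threading.Thread(target=fpga_worker, args=(runtime,), daemon=True).start()
threading.Thread(target=fpga_waiter, args=(runtime,), daemon=True).start()

pending.put((b"jet image", lambda result: print("reply:", result)))
time.sleep(0.1)  # let the stub "inference" complete
```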
2) Azure Stack Edge:
A second method to provide ResNet-50 as a service is tested via an ASE. Its main accelerator component is an Intel Arria 10 FPGA, to which several ML models may be deployed via the Azure Machine Learning Studio. No HLS or HDL is necessary, at the cost of not being able to run arbitrary ML models. Additionally, it includes a dual-core CPU, 12 TB storage, four 10/25 GbE network interfaces, and 128 GB of RAM. The ASE was installed in the Feynman Computing Center at Fermi National Accelerator Laboratory and connected to the local network with a 10 GbE connection. The ASE has a built-in network interface that accepts requests using the gRPC protocol. Inference requests were sent to the ASE using the gRPC client implemented in SONIC. To reduce any effects the networking might have on the latency, we exclusively used locally-connected CPU nodes for any inference requests. We deployed the quantized version of the ResNet-50 top quark tagging model as provided in the Azure Machine Learning Studio software.

IV. RESULTS
A. FACILE
In order to evaluate the maximal theoretical throughput for a FaaST server running FACILE, we built a custom application combining the server and client. The values of the inputs for the test were determined during initialization and left unchanged throughout the test. This design removes both the transfer and input preprocessing steps, and ensures that the throughput is limited only by FPGA inference capabilities. We then use this application to scan a range of values for the size of the DDR buffer and number of CUs. The results are shown in Fig. 3 for both an Alveo U250 and an AWS EC2 F1 instance. We confirm that using more CUs allows higher throughput, and observe that throughput is maximal for DDR buffer sizes larger than 4 input batches. This is expected because the ratio of the total transfer time to the kernel execution time for this design is roughly 3. We also confirm the expectation that using buffer sizes larger than this optimal size does not affect the server performance. These settings motivate our ultimate FaaST server design, which uses a DDR buffer size of 8 inputs (in units of the batch) and one CU per SLR on the device (3 for the AWS EC2 F1, 4 for the Alveo U250). With these settings we are able to achieve a throughput of approximately 10,000 events per second using the Alveo U250 and 6,700 events per second using the AWS EC2 F1.
Fig. 3: Throughput achieved locally for different numbers of CUs and sizes of the DDR buffer for the Alveo U250 (a) and the AWS EC2 F1 instance (b).

We use these settings to perform two tests of the FaaST server performance. For the first test, we run a workflow involving only the SONIC client module, and use a FaaST server running on an Alveo U250. The clients and server are both located at Fermi National Accelerator Laboratory. The performance of the server is measured for varying numbers of simultaneous clients, and the results are shown in Fig. 4. We find that the server is capable of running at a throughput of over 5,000 events per second, or 80 million inferences per second. This is significantly below the maximal throughput possible for the Alveo U250 alone. Despite the optimizations included in the server design, we find that the server CPU still limits the overall throughput. This is largely a consequence of the small size (low latency) and large batch of FACILE; we expect that for most algorithms the CPU should be able to process requests fast enough to saturate the FPGA kernel.
Fig. 4: Throughput achieved by a FaaST server with an Alveo U250 for different numbers of simultaneous clients.

This first test is useful for understanding the maximum throughput possible with a FaaST server running FACILE, since the workflow involves no tasks that can be performed asynchronously to the accelerated module. However, this is not representative of most workflows, in which there are many tasks that either cannot be accelerated or are simply better performed on the CPU. In this case, the CPU is able to schedule other tasks while the server processes the requests, thereby masking some of the latency of the acceleration. Therefore, the second test examines the feasibility of using FaaST in a realistic HEP setting, namely the CMS high-level trigger (HLT), which is the second tier of the trigger system, implemented in software that currently runs entirely on CPUs. It is responsible for performing a reconstruction of the full detector, but this must be done quickly, with latency on the order of 100 ms. This is therefore a good candidate for usage with FaaST. This second test is completed by running the full CMS HLT workflow and comparing the default HLT configuration to one with the hadron calorimeter reconstruction performed using the SONIC client and FaaST server described above. For reproducibility in the HLT tests, we use the HEPCloud framework, which allows various experiments to run analysis workloads on demand in the public cloud as well as at some allocation-based high-performance computing (HPC) sites [38]. We deploy the HLT client jobs in the form of AWS EC2 r4.4xlarge instances. These client virtual machines are provisioned with 16 High Frequency Intel Xeon E5-2686 v4 (Broadwell) processors, 122 GiB DDR4 memory, and support for enhanced networking.

We find that a single FaaST server running FACILE is capable of serving 1,500 simultaneous clients without any increase in processing time. Above 1,500 simultaneous clients, we find that the 25 Gbps network bandwidth limit of an AWS f1.16xlarge introduces delays in processing from the as-a-service model.
Fig. 5: Total processing time required for running a realistic HLT workflow using the FACILE FaaST server as a function of the number of simultaneous clients. The black dashed line represents the total processing time required for running the HLT workflow with no as-a-service component. The blue dotted line displays a piecewise linear fit to the measurements.

Based on the results achieved using an Alveo U250 with a 100 Gbps network bandwidth limit, we estimate that one FaaST server could serve approximately 3,600 simultaneous HLT processes with no reduction in performance.
B. ResNet-50

1) Xilinx ML Suite:
To maximize the throughput achievable with Xilinx ML Suite, we investigated an alternative to the nominal gRPC request interface, called "StreamInfer." In this type of connection, the server receives a stream of images and returns a stream of inference results. We found that this type of connection is more efficient than the standard "Infer" requests because it avoids the overhead of reconnecting to the server for every request. On an AWS f1.16xlarge instance, the streaming connection's throughput is 17% higher (487 inferences/second using 8 FPGAs) compared to the standard connections when multiple clients are connected.

Although the Xilinx ML Suite runtime supports connecting to multiple FPGAs, we found that the server performance does not increase proportionally with the number of FPGAs. We found no significant performance gain when connecting to more than 2 FPGAs simultaneously. Figure 6 shows results with "StreamInfer" and "Infer" requests when connecting to various numbers of FPGAs.

Fig. 6: FPGA scaling test using a single ML Suite runtime instance and a single client. "StreamInfer" denotes a streaming connection, while "Infer" indicates a standard connection. The server is run on an AWS f1.16xlarge instance.

We suspect that this poor scalability is caused by the Python Global Interpreter Lock (GIL), which limits the server to using only one CPU core at a time. To bypass this limit, we started 4 server processes on the same machine, each connecting to only 2 FPGA cards. An Nginx load balancer, also running on the same machine, is then used to distribute the requests to each process [39]. Since each inference request is much larger than that of common use cases for gRPC, we need to increase Nginx's buffer size to achieve optimal performance.
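A sketch of this process-level fan-out is shown below; the port numbers and FPGA assignment are illustrative, and registration of the actual inference servicer (which would wrap the ML Suite calls) is elided.

```python
from concurrent import futures
import multiprocessing

import grpc

# One server process per pair of FPGA cards, each on its own port, so
# that four independent Python interpreters (and GILs) share the work.
def serve(port, fpga_ids):
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=8))
    # add_InferenceServicer_to_server(Servicer(fpga_ids), server)  # elided
    server.add_insecure_port(f"[::]:{port}")
    server.start()
    server.wait_for_termination()

if __name__ == "__main__":
    for i in range(4):
        multiprocessing.Process(
            target=serve, args=(5000 + i, (2 * i, 2 * i + 1))).start()
    # An Nginx front end then round-robins client requests
    # across ports 5000-5003.
```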
Fig. 7: FPGA scaling test result. "Simultaneous processes" denotes the number of client instances running at the same time. Each event contains a batch of 10 images.

To verify our design's scalability, we ran a server on an AWS f1.16xlarge instance in the us-west region. This type of instance is connected to 8 FPGAs. We used a cluster at Fermi National Accelerator Laboratory to issue requests to the server from multiple instances of SONIC clients. Results from these tests are shown in Fig. 7. Our design is able to achieve a 550% improvement in throughput when using 8 FPGAs (1,350 inferences/second) compared to a single FPGA (220 inferences/second).
2) Azure Stack Edge:
In order to measure the achievable throughput of the ASE providing the ResNet-50 model as a service, we used 200 CPUs concurrently sending inference requests via the local network at Fermi National Accelerator Laboratory. We find that, using SONIC to send the inference requests of our benchmark ResNet-50 model, the average throughput of the ASE is … ± … inferences/second, with a maximum achieved throughput of 460.1 inferences/second. Using fewer than 200 cores reduces the throughput slightly: with 50 cores, the average throughput is … ± … inferences/second. As expected for a fully utilized FPGA, the latency, measured as the time difference between the start of the inference request and the time a response is received, depends approximately linearly on the number of simultaneous processes sending inference requests. For 50 (200) cores, we find an average latency of … ± … ms (… ± … ms). Sending requests with a single CPU severely underutilizes the FPGA, but yields a picture of the minimum achievable latency. We find a mean latency of … ± … ms when using a single core, noting that the latency is not normally distributed but strongly influenced by networking effects. In a minority of inference requests, the latency jumped to 100 ms or larger, which is solely attributable to network effects and unrelated to the inference time of the ASE. The median of the distribution, which is less affected by these high-latency outliers, is 12.7 ms. The throughput and latency as a function of the number of simultaneous processes are shown in Fig. 8.

Fig. 8: Throughput (red, left axis) and latency (blue, right axis) as a function of the number of simultaneous processes sending inference requests of the ResNet-50 model to the Azure Stack Edge.

Finally, the ASE's own CPU can be used as a client to perform the inference on the internal FPGA. By sending inference requests from the internal CPU to the internal FPGA of the ASE, we find an average throughput of 70 inferences per second, or 14 ms per inference. It should be noted that we used the Azure Machine Learning Python SDK rather than SONIC for this test, as it was technically less complex to deploy on the ASE CPU. The throughput in this test is largely driven by the extent to which the CPU manages to utilize the full FPGA. It is nevertheless a helpful comparison for the large-scale test described above.

V. DISCUSSION
We present the FPGAs-as-a-Service Toolkit (FaaST) for integrating FPGA-based machine learning (ML) inference as a service into scientific workflows. We have shown examples of how FaaST can be used for a broad range of applications and hardware. A summary of the results for all implementations is shown in Table I. For large networks, we find that the throughput of FaaST servers is comparable to or better than similar GPU-as-a-service designs. In the case of small dense networks, such as FACILE, a FaaST server outperforms GPU-as-a-service implementations by over an order of magnitude. These results are not contingent on the precise details of the networks we use as benchmarks. Indeed, we expect similar performance from FaaST for other network inference applications. FaaST represents the first open-source toolkit intended to make high-performance FPGAs as a service available generically.

TABLE I: Summary of the performance of FaaST servers in terms of events and inferences per second, and bandwidth. Results for performance on GPUs are taken from Ref. [17].

Algorithm    Platform      Number of Devices   Batch Size   Inf./s [Hz]   Bandwidth [Gbps]
FACILE       AWS EC2 F1    1                   16,000       36 M          23
FACILE       Alveo U250    1                   16,000       86 M          55
FACILE       T4 GPU        1                   16,000       8 M           5.1
ResNet-50    AWS EC2 F1    8                   10           1,400         6.7
ResNet-50    V100 GPU      8                   10           1,700         8.1
ResNet-50    ASE           1                   1            460           2.2
ResNet-50    T4 GPU        1                   10           250           1.2
For inference on GPUs, performance gains with respect to CPUs typically occur in tasks that can be run with large batch sizes. This is due to the ability of the GPU to run many parallel operations. FPGAs, on the other hand, do not gain exclusively by using large batches. Rather, FPGAs are able to achieve low inference latency as a result of their ability to perform computations significantly faster than CPUs and GPUs. As a result, for ResNet-50, our FaaST server running on the ASE with batch 1 almost doubles the throughput when compared to a T4 GPU running with batch 10. This is especially noteworthy given that many tasks in high energy physics (HEP) workflows that require complex algorithms are naturally run with low batch size. For example, in the case of the top quark tagging ResNet-50 model used in this work, a batch size of 2 may be sufficient for most HEP events.

One caveat to the performance of FPGAs with small batches is that transfers to and from the device are typically more efficient for large batches. This is because the overhead for transfers can be quite significant. For a network similar to FACILE, inference at batch size 1 was found to be only 15 times faster than inference at batch size 16,000 [40]. However, not every ML algorithm should be run at maximum batch; this latency improvement must be weighed against the additional resources and infrastructure needed to handle a large number of concurrent inputs on the FPGA.

We have exclusively used ML applications in this work because of their widespread and growing use in HEP workflows, as well as their ability to be parallelized. This makes them very useful target applications for acceleration. However, the FaaST server design is highly generic. Provided that an algorithm can be successfully executed on an FPGA, the FaaST model is capable of enabling as-a-service acceleration. Any functional FPGA kernel can be accelerated using Vitis Accel in a similar manner to FACILE.

VI. OUTLOOK
FPGAs have traditionally been used for various specialized tasks. Their low power consumption and extremely fast processing make them particularly suited for applications across industry and high energy physics. Their advantages, however, are not exclusive to these domains and can be leveraged for many other high-performance computing tasks. The FPGAs-as-a-Service Toolkit we present can assist in the implementation of FPGAs as a service in a variety of computing workflows across science.

APPENDIX
A. Artifact Description
We ran tests of the FACILE hardware kernel throughput at Fermi National Accelerator Laboratory (FNAL) on a Xilinx Alveo U250 running XRT 2.3.1301 and Vitis 2019.2, with the hardware installed locally in a machine with an Intel Xeon Silver 4210 CPU @ 2.20 GHz running Scientific Linux release 7.8. Tests of the FaaST server for FACILE v1.0.0 were run using this same machine for the server and the batch submission nodes at the FNAL LHC Physics Center (LPC) Computing Cluster for the clients. Tests of FACILE in a realistic workflow were run using HEPCloud, with an AWS f1.16xlarge instance for the server and r4.4xlarge instances for the clients. Tests of ResNet-50 in Xilinx ML Suite were run using our FaaST interface v0.5.0, with an f1.16xlarge instance for the FaaST server and the batch submission nodes at the FNAL LPC Computing Cluster for the clients. Tests of ResNet-50 with the Azure Stack Edge were run locally at the FNAL Feynman Computing Center, using the batch submission nodes at the FNAL LPC Computing Cluster for the clients.

Our author-created artifacts are given in Ref. [11] and Ref. [12].
B. Artifact Evaluation
In all cases we ensure that behavior in critical regions (i.e., high throughput) can be reproduced with slightly different test settings, thus verifying that the results are both stable and reliable. We run a large number of events for all tests to ensure the accuracy of our results and to reduce their statistical uncertainties. All our results are expected to generalize to other networks and applications with similar performance. They are cross-checked on multiple similar devices whenever possible to ensure stability with respect to machine specifications and device conditions. We use monitoring tools for cloud tests to ensure that no significant issues are occurring that could affect our results. For tests run using FNAL resources we have good control of the machines and devices in use and can ensure that there are no transient sources impacting the results. We also collect results over the course of hours or days, such that any intermittent issues should not persist across data points.

ACKNOWLEDGEMENTS
We acknowledge the Fast Machine Learning collective as an open community of multi-domain experts and collaborators. This community was important for the development of this project. We would like to thank Steven Timm for his support of our work with HEPCloud.

M. A. F., B. H., T. K., K. P., and N. T. are supported by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy (DOE), Office of Science, Office of High Energy Physics. N. T. is partially supported by the DOE Early Career Award. K. P. is partially supported by the High Velocity Artificial Intelligence grant as part of the DOE High Energy Physics Computational HEP sessions program. P. H. and D. R. are supported by NSF grants …
REFERENCES

[1] R. H. Dennard, F. H. Gaensslen, H. Yu, V. L. Rideout, E. Bassous, and A. R. LeBlanc, "Design of ion-implanted MOSFET's with very small physical dimensions," IEEE J. Solid-State Circuits, vol. 9, p. 256, 1974. doi:10.1109/JSSC.1974.1050511
[2] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, "Dark silicon and the end of multicore scaling," in Proceedings of the 38th Annual International Symposium on Computer Architecture, ser. ISCA '11. New York, NY, USA: ACM, 2011, p. 365. doi:10.1145/2000064.2000108
[3] CMS Collaboration, "CMS offline and computing public results," 2020. [Online]. Available: https://twiki.cern.ch/twiki/bin/view/CMSPublic/CMSOfflineComputingResults
[4] ATLAS Collaboration, "Computing and software — public results," 2020. [Online]. Available: https://twiki.cern.ch/twiki/bin/view/AtlasPublic/ComputingandSoftwarePublicResults
[5] P. Banerjee, R. Friedrich, C. Bash, P. Goldsack, B. Huberman, J. Manley, C. Patel, P. Ranganathan, and A. Veitch, "Everything as a service: Powering the new information economy," Computer, vol. 44, p. 36, 2011.
[6] F. M. Aymerich, G. Fenu, and S. Surcis, "An approach to a cloud computing network," 2008, p. 113.
[7] K. Bennett, P. Layzell, D. Budgen, P. Brereton, L. Macaulay, and M. Munro, "Service-based software: the future for flexible software," in Proceedings Seventh Asia-Pacific Software Engineering Conference (APSEC 2000), 2000, p. 214.
[8] S. Mittal and J. S. Vetter, "A survey of CPU-GPU heterogeneous computing techniques," ACM Comput. Surv., vol. 47, 2015. doi:10.1145/2788396
[9] R. Nane et al., "A survey and evaluation of FPGA high-level synthesis tools," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 35, p. 1591, 2016.
[10] N. Ghanathe et al., "Software and firmware co-development using high-level synthesis," J. Instrum., vol. 12, p. C01083, 2017. doi:10.1088/1748-0221/12/01/C01083
[11] D. Rankin, J. Duarte, K. Pedro, and B. Holzman, "FaaST: FACILE," [software], 8 2020. doi:10.5281/zenodo.3992377, v1.0.0 (accessed 2020-08-19). [Online]. Available: https://github.com/hls-fpga-machine-learning/FaaST
[12] Y. Lou, "ML Suite gRPC Interface Implementation," [software], 2020, v0.5.0 (accessed 2020-08-19). [Online]. Available: https://github.com/LouYu2015/ml-suite/tree/master/examples/gRPC
[13] M. Armbrust et al., "A view of cloud computing," Commun. ACM, vol. 53, p. 50, 2010. doi:10.1145/1721654.1721672
[14] A. Bouguettaya et al., "A service computing manifesto: The next 10 years," Commun. ACM.
[15] …
[16] …
[17] J. Krupa et al., "GPU coprocessors as a service for deep learning inference in high energy physics," 2020, arXiv:2007.10359, submitted to Mach. Learn.: Sci. Technol.
[18] J. Duarte et al., "Fast inference of deep neural networks in FPGAs for particle physics," J. Instrum., vol. 13, p. P07027, 2018. doi:10.1088/1748-0221/13/07/P07027. arXiv:1804.06913
[19] V. Loncar et al., "hls-fpga-machine-learning/hls4ml: v0.3.0," [software], 6 2020. doi:10.5281/zenodo.3969548, v0.3.0 (accessed 2020-08-19).
[20] V. Kathail, "Xilinx Vitis unified software platform," in The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2020. doi:10.1145/3373087.3375887
[21] Xilinx, Inc., "Xilinx ML Suite," [software], 2020, v1.5 (accessed 2020-07-31). [Online]. Available: https://github.com/Xilinx/ml-suite
[22] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," 2015. [Online]. Available: http://download.tensorflow.org/paper/whitepaper2015.pdf
[23] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM International Conference on Multimedia, ser. MM '14. New York, NY, USA: ACM, 2014, p. 675. doi:10.1145/2647868.2654889. arXiv:1408.5093
[24] Microsoft Corporation, "Microsoft AI platform whitepaper," 2017, accessed: 2020-08-17. [Online]. Available: https://azure.microsoft.com/en-us/resources/microsoft-ai-platform-whitepaper/
[25] Microsoft Corporation, "Azure Stack Edge Datasheet," 2020, accessed: 2020-08-03. [Online]. Available: https://azure.microsoft.com/en-us/resources/azure-stack-edge-datasheet/
[26] Google LLC, "gRPC," [software], 2018, v1.19.0 (accessed 2020-08-14). [Online]. Available: https://grpc.io/
[27] Nvidia, "Triton Inference Server," [software], 2019, v1.8.0 (accessed 2020-08-14). [Online]. Available: https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-guide/docs/index.html
[28] C. Fabjan and F. Gianotti, "Calorimetry for particle physics," Rev. Mod. Phys., vol. 75, p. 1243, 2003. doi:10.1103/RevModPhys.75.1243
[29] M. Rovere, Z. Chen, A. Di Pilato, F. Pantaleo, and C. Seez, "CLUE: A fast parallel clustering algorithm for high granularity calorimeters in high energy physics," Front. Big Data, 2020. doi:10.3389/fdata.2020.591315. arXiv:2001.09761
[30] A. Massironi, V. Khristenko, and M. D'Alfonso, "Heterogeneous computing for the local reconstruction algorithms of the CMS calorimeters," J. Phys. Conf. Ser., vol. 1525, p. 012040, 2020. doi:10.1088/1742-6596/1525/1/012040
[31] A. F. Agarap, "Deep learning using rectified linear units (ReLU)," 2018, arXiv:1803.08375.
[32] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on Machine Learning, ser. ICML '15. JMLR.org, 2015, p. 448, arXiv:1502.03167. [Online]. Available: http://proceedings.mlr.press/v37/ioffe15
[33] W. Tarreau, "HAProxy," [software], 2020, v2.0.14 (accessed 2020-08-14). [Online]. Available: https://haproxy.org
[34] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, p. 770. doi:10.1109/CVPR.2016.90. arXiv:1512.03385
[35] G. Kasieczka, T. Plehn, J. Thompson, and M. Russel, "Top quark tagging reference dataset," Mar. 2019. doi:10.5281/zenodo.2603256
[36] J. Duarte et al., "FPGA-accelerated machine learning inference as a service for particle physics computing," Comput. Softw. Big Sci., vol. 3, p. 13, 2019. doi:10.1007/s41781-019-0027-2. arXiv:1904.08986
[37] A. Butter et al., "The machine learning landscape of top taggers," SciPost Phys., vol. 7, p. 014, 2019. doi:10.21468/SciPostPhys.7.1.014. arXiv:1902.09914
[38] B. Holzman et al., "HEPCloud, a new paradigm for HEP facilities: CMS Amazon Web Services investigation," Comput. Softw. Big Sci., vol. 1, 2017. doi:10.1007/s41781-017-0001-9. arXiv:1710.00100
[39] W. Reese, "Nginx: The high-performance web server and reverse proxy," Linux J.
[40] …