Cain: Automatic Code Generation for Simultaneous Convolutional Kernels on Focal-plane Sensor-processors
Edward Stow, Riku Murai, Sajad Saeedi, and Paul H. J. Kelly
Dept. of Computing, Imperial College London, 180 Queens Gate, London, SW7 2AZ, United Kingdom
{edward.stow16, riku.murai15, p.kelly}@imperial.ac.uk
Dept. of Mechanical and Industrial Engineering, Ryerson University, 350 Victoria Street, Toronto, Ontario, M5B 2K3, Canada
[email protected]
Abstract.
Focal-plane Sensor-processors (FPSPs) are a camera technology that enables low-power, high-frame-rate computation, making them suitable for edge computation. Unfortunately, these devices' limited instruction sets and registers make developing complex algorithms difficult. In this work, we present Cain – a compiler that targets SCAMP-5, a general-purpose FPSP – which generates code from multiple convolutional kernels. As an example, given the convolutional kernels for an MNIST digit recognition neural network, Cain produces code that is half as long, when compared to the other available compilers for SCAMP-5.

Keywords:
Convolution · SIMD · Image sensor · Analogue computing · Edge inference
1 Introduction

Real-time computer vision applications are currently bound to traditional camera sensors that transfer each pixel of each frame to a host where it is processed. This requires high-performance buses between the sensors and hosts, especially where high frame rates are required. A self-driving car may need to receive new information for every 1 cm travelled to be vigilant of unexpected scenarios, so at 80 km/hr a frame rate of 2222 Hz would be required. A 2 mega-pixel camera, with 10-bit pixel depth, running at such a frame rate, requires a bus capable of 45.6 Gbit/s — which is currently only possible with devices such as a PCI-e x8 Gen3 interface [21]. For many applications, however, streaming data at such volumes is too demanding – both in power and computation time – hence requiring an alternative solution.

Codesign of hardware and software for computer vision applications is an emerging research field to address the limitations of conventional systems [17]. Focal-plane Sensor-processors (FPSPs) are a promising avenue for reducing the data transfer between the camera and the processing unit. FPSPs, often synonymous with Cellular Processor Arrays (CPAs) and Pixel Processor Arrays (PPAs), perform processing on the sensor chip itself and are often designed for tasks which require high frame rates or low latency [22]. The principle behind them is that a small processor is embedded directly with each pixel of the sensor. While FPSPs come in various forms for specific applications, in this paper we explore a general-purpose fine-grain architecture, SCAMP-5 [5], but one can imagine alternatives designed for various use cases.

Cain is available at https://github.com/ed741/cain.

One of the most widely used methods for image analysis is convolutional kernels.
From edge detection using Sobel filters to document recognition using Convolutional Neural Networks [13], convolutional kernels are the foundation for many complex computer vision applications. Traditionally, application of convolutional kernels to image data occurs on a CPU, but more recently GPUs and FPGAs are used to accelerate the computations in parallel [1], [9]. Several systems have been designed to optimise the processing of convolutional kernels on GPUs and FPGAs, leading to a vast array of techniques to reduce the number of operational cycles needed to apply kernels to input data. While this significantly increased throughput, these methods are still bounded in latency, as the image must make its way from the camera through to the host system. As for FPSPs, the ability to process the data on the focal plane enables the kernels to be applied to the image data at very low latency. Furthermore, the unique ability to select the data which is transferred from the device to the host reduces the data volume, which allows for high frame rates. However, the technology is comparatively new. By design, FPSPs offer novel ways to interact with the data, and while work has been done to provide a Domain-Specific Language and associated tools to program such hardware [14], there has been less work done so far to produce code generation systems that make efficient use of their architectural features when applying convolutional kernels in particular.

One such system that does exist, however, is AUKE [11]. Given an N × N convolutional kernel, AUKE's reverse-split algorithm generates code for SCAMP-5 which applies the kernel efficiently to the captured image on the focal plane using analogue computation.
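For readers less familiar with the operation being compiled, a minimal host-side sketch (our own illustration, not part of Cain or AUKE) of what applying a 3 × 3 convolutional kernel means, using the unflipped cross-correlation convention common in CNNs:

```python
# Minimal sketch: applying a 3x3 convolutional kernel to an image, computed
# naively on the host (cross-correlation convention, no kernel flip).
def convolve2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            acc = 0.0
            for dy in range(kh):
                for dx in range(kw):
                    acc += kernel[dy][dx] * image[y + dy][x + dx]
            out[y][x] = acc
    return out

# A Sobel-like edge filter applied to a tiny image.
sobel = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(convolve2d(img, sobel))  # [[8.0]]
```

On an FPSP this same arithmetic is carried out in-place by every Processing Element simultaneously, which is what makes the code generation problem interesting.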
AUKE is, however, limited to compiling just a single convolutional kernel at a time, using a reduced instruction set that omits the more powerful instructions available in SCAMP-5.

In this work, we present an improved alternative to AUKE, with the ability to produce code for applying multiple convolutional kernels at a time. The problem is presented as a dynamic graph search problem in which we must efficiently generate and traverse possible processor states to find a path that describes the relevant convolutional computation. By incorporating instruction selection and instruction scheduling into the core of the search process, we enable the use of more novel features of CPA architectures than AUKE is able to use. By optimising the code for multiple kernels simultaneously, common sub-expressions between kernels can be exploited and produced only once rather than for each kernel. This reduces the computational expense of applying the kernels, enabling applications to run at a faster frame rate.
[Figure 1 diagram: Input Kernels (matrices of coefficients) → Goal Approximation (Final-Goals, the root node) → Configurable Traversal System (explore node; generate next Goal pair; apply instruction in reverse; node culled unless Initial-Goal found) → Register Allocation → Code Generation.]
Fig. 1.
Cain System Overview.
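The pipeline in Fig. 1 can be read as three stages: approximate the input kernels as integer-coefficient goals, search backwards for a plan, then allocate registers and emit code. A hypothetical skeleton, where every name is our own invention rather than Cain's actual API:

```python
# Hypothetical sketch of the Fig. 1 pipeline (names are ours, not Cain's API):
# approximate kernels as Goals, search backwards from the Final-Goals, then
# allocate registers and emit code for the path that was found.

def approximate(kernels, d):
    """Scale each coefficient by 2**d and round: kernels -> Final-Goals."""
    return [[[round(c * 2 ** d) for c in row] for row in k] for k in kernels]

def compile_filter(kernels, d, search, allocate, emit):
    final_goals = approximate(kernels, d)  # root node of the search
    path = search(final_goals)             # reverse instructions to the Initial-Goal
    return emit(allocate(path))            # register allocation + code generation

# e.g. approximate([[[0.5, -0.25, 0]]], d=2) -> [[[2, -1, 0]]]
print(approximate([[[0.5, -0.25, 0]]], 2))
```

The `search`, `allocate`, and `emit` stages are deliberately left abstract here; Sections 2 and 3 describe the hardware constraints and search strategy that fill them in.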
The primary objective of this work is to push the boundary of code generation for FPSP devices through simultaneous kernel optimisation. We offer the following contributions:

– Cain: a code generation algorithm which effectively makes use of common sub-expressions across filters consisting of multiple convolutional kernels. Our graph search strategy – which enables Cain to efficiently search large graphs – combines instruction scheduling, instruction selection and register-allocation constraints into the core of the search to make better use of specific hardware capabilities in SIMD processors.
– We show how this search can be made tractable for problems of interest through a problem formulation based on AUKE's multi-set–of–Atoms problem representation, combined with a ranking heuristic and a hybrid graph-generator–graph-search exploration strategy.
– We show how this approach allows flexible exploitation of hardware capabilities (such as three-operand adds and multi-step shifts), and generates very efficient use of additions to avoid multiplies.
– Evaluation of the effectiveness of Cain on the SCAMP-5 Focal-plane Sensor-processor. We compare against AUKE and test the effectiveness of simultaneous kernel optimisation. We conclude by exploring how our simultaneous kernel optimisation extends to future devices with more registers per pixel.

The remainder of the paper is organised as follows. Section 2 describes SCAMP-5 and its instruction sets, Section 3 explains our proposed code generation algorithm Cain, and in Section 4 a detailed comparison is made between Cain and AUKE, together with an evaluation of the effectiveness of simultaneous kernel optimisation. Section 5 reviews the related work AUKE in detail. Finally, Section 6 concludes our work, with a discussion about potential future research.
2 SCAMP-5

In this section, we discuss the capabilities of the next-generation camera technology SCAMP-5, and give an overview of the functionality used by Cain. SCAMP-5 has been demonstrated in many different computer vision applications, ranging from Visual Odometry systems [16], [3], [10], and an end-to-end neural sensor which performs learnt pixel exposures [15], to Convolutional Neural Networks [20], [4]. Its distinctive ability to perform computation on the focal plane reduces power consumption and data transfers, making the device promising for edge computation.

[Figure 2 diagram: a search tree of Goal-Bag nodes explored in 7 numbered steps; the generated code shown alongside is: mov(B,A,south); add(A,A,B); mov(B,A,north); add(A,A,B).]

Fig. 2. Graph showing how Cain might search a simplified 1-dimensional problem using CGDS. Numbered steps show the order that the paths are explored, with child nodes generated the first time a search step starts at a parent node. Nodes are checked for being the Initial-Goal when pointed to. The red node, and edge, correspond to a dead-end where a duplicate node has been found at a higher cost than previously seen, and so the node is not traversed further. We see a path to the Initial-Goal is found after 7 steps, and the code produced by this path is presented on the right. The mov() instruction in step 5 exploits a common sub-expression such that the two Goals in its output Goal-Bag are produced together, thus shortening the code.

The SCAMP-5 architecture is a general-purpose fine-grain SIMD FPSP [6]. It has a 256 × 256 pixel array, and along with each pixel is a small Processing Element (PE). All 65,536 processors execute the same instruction at one time. In addition to 14 binary registers, each PE has analogue registers A through to F, as well as a NEWS register. Each PE can also address an XN, XE, XS, and XW register that is actually that PE's respective neighbours' NEWS register. Each PE uses an analogue bus to link its available analogue registers, and because values are stored as charge, analogue arithmetic is done directly on the bus that connects the registers rather than in a separate arithmetic unit.

Instructions in the architecture control how register values are let into and out of the bus, with the caveat that values are inverted due to the nature of the analogue electronics. Macro instructions like add, sub, and mov are made up of multiple bus instructions that create the desired behaviour, where the busn(w1, .., wn, r1, .., rk) instruction has the general rule that the values of registers r1, .., rk are summed up, negated, and divided equally between the n receiving registers w1, .., wn. Since a bus operation directly controls which registers are opened to the PE's common analogue bus, a register may only appear once in each bus instruction. Each bus instruction also incurs significant noise and error factors, especially for bus2 and bus3 [8].

Macro instruction arguments are written as if they are assignment statements. For example, the macro instruction add(A, B, C) means A := B + C and is made up of two bus instructions: bus(NEWS, B, C), meaning the NEWS register now contains the value of −(B + C); and then bus(A, NEWS), so that register A contains B + C. We can see here that the add instruction has additional constraints, such that the two operands cannot be the same register, and that the NEWS register is overwritten, left containing −(B + C) as a side effect. When using macro instructions, we restrict the registers to A to F, and allow the macros themselves to make use of the NEWS and neighbouring
NEWS registers for us by means of a direction value. We use subscripts to denote the registers of neighbouring PEs. For example, mov2x(A, B, north, east) computes A := B_{north,east} in two bus instructions: bus(XS, B); bus(A, XE). The first means that XS_{north,east} := B_{north,east}, which is equivalent to NEWS_{east} := B_{north,east}; the second instruction then means A := XE = NEWS_{east}, and so A = B_{north,east}.

While interesting uses of the bus instructions exist, allowing adding and subtracting from neighbouring PEs, individual macro instructions are still highly restricted in comparison to most modern instruction sets. Only primitive analogue operations are available to each PE, such as: Move, Add, Subtract, Divide by two, and acquiring the value from the sensor [8]. The lack of a multiplication instruction means the problem of generating convolutional filter code for SCAMP-5 builds on the theory of multiplier-free FIR filters [7].

The chip has been shown to be capable of operating at 100,000 FPS, largely because it is not limited by the speed of an output bus to transfer all the pixel data [5]. Instead of only offering an analogue or digitally encoded output of all pixels at a time, like traditional camera sensors, the SCAMP-5 architecture allows binary outputs per pixel, and even event-driven outputs. This allows each PE to come to a judgement on its input pixel data and fire its own event that sends the coordinates of the PE to the host, allowing information transfer without divulging the actual image.

The architecture uses an off-chip controller to manage the fetch-decode-execute cycle, with every pixel's processor receiving the same instruction, making it a single-instruction-multiple-data (SIMD) design. This has benefits in terms of simplicity and efficiency, as none of the Processing Elements need to be able to fetch instructions for themselves.
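The lockstep, neighbour-reading execution model can be sketched in a few lines. The following is our own simplified 1-D model (not the real SCAMP-5 semantics: no charge inversion, no noise, zero-padded edges, and we assume the direction argument names the neighbour being read), used to check what kernel the Fig. 2 code computes:

```python
# Illustrative sketch (not the real SCAMP-5 ISA model): simulate mov/add
# macro instructions over a 1-D array of PEs, each holding registers A and B.
# mov(dst, src, direction) reads src from the neighbouring PE in that direction.

def shift(vals, direction):
    # The value PE i sees from its 'south' neighbour is vals[i+1] (and the
    # reverse for 'north'); edges are padded with 0 for simplicity.
    if direction == "south":
        return vals[1:] + [0]
    if direction == "north":
        return [0] + vals[:-1]
    return vals[:]

def run(program, a):
    regs = {"A": a[:], "B": [0] * len(a)}
    for op, dst, src, *rest in program:
        if op == "mov":
            regs[dst] = shift(regs[src], rest[0])
        elif op == "add":  # add(dst, x, y): dst := x + y, element-wise
            regs[dst] = [x + y for x, y in zip(regs[src], regs[rest[0]])]
    return regs

# The code found in Fig. 2: mov(B,A,south); add(A,A,B); mov(B,A,north); add(A,A,B)
prog = [("mov", "B", "A", "south"), ("add", "A", "A", "B"),
        ("mov", "B", "A", "north"), ("add", "A", "A", "B")]
out = run(prog, [0, 0, 1, 0, 0])["A"]
print(out)  # an impulse becomes [0, 1, 2, 1, 0]: the 1-D kernel [1 2 1]
```

Feeding an impulse through the program recovers its impulse response, confirming that the four instructions from Fig. 2 apply the 1-D kernel [1 2 1] without any multiplication.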
There is also provision for masking pixels such that only selected PEs execute instructions.

One important consideration to be made when using and designing algorithms related to the SCAMP-5 chip is the noise introduced by the nature of the analogue computation. Every use of the 7 analogue registers introduces noise to the values stored. This makes finding optimal code to perform the convolutions ever more vital for accurate results.

3 Cain

Cain is a framework for compiling convolutional filters, designed to search through a configurable Cellular Processor Array (CPA) instruction set to find efficient code. A fundamental concept Cain uses is to only consider a single arbitrary PE in the CPA, and perform everything relative to it. This works for a SIMD architecture like SCAMP-5 because every PE will be executing the same steps synchronously in parallel. The assumption we make when producing code is that
E. Stow et al. the neighbours of our arbitrary PE will exist and so will have done the samework but at a relative offset in the input image. The aim is to search throughthe graph of possible Processing Element states in such a way that commonsub-expressions in the given kernels are exploited and used to reduce the costof any path from initial to final PE states. To do this Cain searches backwards,starting with a set of final kernels, these are the convolutional filter, and apply-ing instructions in reverse to simplify the kernels until only the identity kernel is left. Fig. 1 shows a high level overview of this process. Searching backwards isa design choice that makes the search more effective because it means the aimat each step is to make what needs to be solved simpler than before. This meansheuristics can be produced to always direct the search towards the identity kernelrather than a system of heuristics trying to accurately predict the path towardsan arbitrary set of final Goals. We present this as a dynamic graph search prob-lem because the size of the graph is intractable. Given the AnalogNet2 filter inEquation 1, Cain identifies 37163 potential children nodes in the first step alone.This can be reduced to 239 if we are willing to accept a less than exhaustivesearch of the solution space. This restriction is applied when the computationalcost of computing the full set of children nodes is too high. This section provides an overview of notation and definition used in this paper.Cain is designed such that different definitions could be used without changingthe fundamental search algorithm but the definitions we use here for SCAMP-5are based largely on AUKE’s, which provides an elegant way to conceptualisethe convolutional kernels without multiplication.
Example 1.
We will look at a simple example of how a convolutional kernel is represented in Cain. Here we use AnalogNet2 [20], [12], which is a CNN designed for SCAMP-5.
AnalogNet2 = { K1, K2, K3 }, a set of three 3 × 3 kernels whose coefficients are integer multiples of 1/4 (1)

Since SCAMP-5 does not have multiplication we must approximate the kernel, and because it does have division-by-two instructions the natural approximation to make is to find the nearest integer multiple of 1/2^d for each coefficient in the kernel, given some number of divisions d. In our example we have already extracted the common denominator such that d = 2, and this perfectly represents the kernel. The larger d is, the larger the search space and the complexity of the problem, so d can be limited to allow an acceptable amount of approximation error, such that the resulting program is shorter and the computational expense of compiling it is reduced.

Definition 1.
Let an Atom, denoted as (x, y, z, sign), be a representation of 1/2^d of a pixel value at coordinate x, y, on the z-th channel. x, y are coordinates relative to the arbitrary PE, and so also to the centre of the kernel, and z refers to an image input channel. The sign is used to negate the value if necessary.

¹ Single-entry matrix; not to be confused with the identity matrix.
Definition 2.
Let a Goal, denoted as {atom1, atom2, ...}, be a multi-set of Atoms. The Goal represents an arbitrary kernel, scaled by 2^d. The aggregate of the values represented by each of the Atoms yields the same result as applying the scaled kernel.

Representing a convolutional kernel as a Goal is a convenient way to support a multiply-free instruction set, such as SCAMP-5's. One can simply view this as unrolling the multiply instruction into additions. Using Goals simply re-frames the problem by scaling everything by 2^d, and approximating coefficients to the nearest whole number of Atoms.

Definition 3.
Let a Goal-Bag, denoted as {goal1, goal2, ...}, be a multi-set of Goals. The Goal-Bag is used to capture the state of our arbitrary PE. This includes defining the Final-Goals, the set of convolution kernels we wish to compute; and the Initial-Goals, the set of Goals which the computation will start from.

Using these definitions of Goals and Atoms, we see that the first kernel from Example 1, K1 = (1/4)·M1 for an integer matrix M1, can be represented by a Goal G containing one Atom per unit of each integer coefficient of M1: a position with coefficient m contributes |m| Atoms at that offset, negatively signed when m < 0 (the explicit listing of the nine Atoms of G is omitted here). As our Goal notation is verbose, we provide a compact version that disambiguates Goals from kernels:

G = ⟨M1⟩ =⇒ M1 ⋆ Image Input, where the ⋆ operator applies the left-hand convolutional kernel to the right-hand array (2)

By repeating this process for the rest of the convolutional kernels in the AnalogNet2 filter, the Final-Goals Goal-Bag FG is produced:

FG = { ⟨4·K1⟩, ⟨4·K2⟩, ⟨4·K3⟩ } (3)

Since, in our example, d = 2, the Goal representation of the identity kernel (G_ID) that makes up the Initial-Goals is based on the approximation of the Final-Goals:

K_ID = (1/4)[1] =⇒ G_ID = ⟨4⟩ (4)

Moving a value around the processor array is expressed by translating every Atom of a Goal. Addition and subtraction can be expressed by combining two Goals into one, making sure to cancel out positive and negative Atoms with the same coordinates. Since Cain searches backwards, we apply these operations in reverse. For 2-operand addition this means we take a Goal, G, that we wish to generate code for, then produce 2 new Goals that when added together produce G.
Defining Goals as multi-sets of Atoms makes this process intuitive, as we can simply split the Atoms between two Goals in every possible permutation (or fewer, if we are willing to assume some splits are non-optimal, or willing to miss potentially better code for the sake of more efficient code generation). This definition also restricts the reverse search process, since when splitting a Goal we cannot split an Atom. To compute the red Atoms in G naively, PEs must sum them and read this value from the west, thus translating the Atoms eastward.
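The reverse of a 2-operand add is therefore a multi-set split. A sketch of our own showing how such splits can be enumerated (Atoms are never divided, and the two sides must both be non-empty):

```python
# Sketch (our illustration): the reverse of add(G, Ga, Gb) splits Goal G's
# multi-set of Atoms into two Goals Ga, Gb with Ga + Gb = G. Atoms are
# (x, y, z, sign) tuples; a Goal is a sorted tuple of Atoms.
from itertools import product

def splits(goal):
    seen = set()
    # Each Atom independently goes left or right; Atoms are never split.
    for choice in product((0, 1), repeat=len(goal)):
        left = tuple(sorted(a for a, c in zip(goal, choice) if c == 0))
        right = tuple(sorted(a for a, c in zip(goal, choice) if c == 1))
        if left and right and (left, right) not in seen:
            seen.add((left, right))
            yield left, right

# A 1-D Goal for the kernel [1 2 1]: one Atom at x=-1, two at x=0, one at x=1.
g = ((-1, 0, 0, '+'), (0, 0, 0, '+'), (0, 0, 0, '+'), (1, 0, 0, '+'))
pairs = list(splits(g))
print(len(pairs))  # 10 distinct ordered splits into two non-empty Goals
```

Even this four-Atom Goal admits ten distinct splits, which illustrates why the number of children per node grows so quickly and why the pruning discussed above matters.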
Cain's reverse search algorithm works iteratively, taking the state of an arbitrary PE, defined as a Goal-Bag:

F := { G1, G2, G3, G4, ... } (5)

This is a node in our search graph and represents the state we aim to achieve by executing the instructions that form a path from the Initial-Goals to this node. In the search graph, nodes are generated dynamically as the graph is explored. Fig. 2 shows a simplified view of how a graph might look as it is generated and searched. We simplify the exploration such that in each iteration of the search algorithm we produce a Goal-Bag Pair of an Uppers Goal-Bag and a Lowers Goal-Bag, as well as an instruction, with the following constraints:

(U, L), inst = nextPair(F) where U ⊆ F, U = inst(L) (6)

This is in contrast to AUKE's method, shown later in Equation 16. The new child node, C, is then produced by applying the instruction in reverse using the following rule, with the instruction becoming an edge in the graph:

C = (F \ U) ∪ L (7)

Following our AnalogNet2 example from Equation 3, the first iteration of the search algorithm will start with FG = { G1, G2, G3 }, and the Pair of Goal-Bags Cain produces is as follows (the matrix entries of the three Lowers Goals are omitted here; they sum to the Goal in U):

U = { G3 }, L = { L1, L2, L3 } (8)

inst = U ← add(L1, L2, L3) (9)

C = { G1, G2, L1, L2, L3 } (10)

The multi-set semantics here mean that if the Goals in L are all already part of F, then the number of Goals to solve is reduced, and so by applying more pairs (U, L) we traverse the graph of Goal-Bags until we reach the initial state, where the only Goal in the Goal-Bag is the identity Goal. In our example (Equation 10), a sub-expression of 3 negative Atoms is shared between the new Goals, since applying a mov2x next could eliminate one of them using another; there is also further potential for reuse in how the remaining Goals are split. Once the initial Goal-Bag is found, the path from the initial Goal-Bag back to the Final-Goals becomes the list of instructions that form our generated program.

After this point Cain continues searching for shorter paths, and can cull any nodes with longer paths. During the search the same Goal-Bags may be reproduced in different ways; we cull the current node any time a Goal-Bag is produced that has already been seen at a lower or equal cost, or if the Goal-Bag has more Goals than available registers.

The second part of the search strategy defines the search order. Each invocation of the reverse search algorithm produces one new node, C, and the input node is incremented to record how many of its children have been produced so far. Cain uses this simple definition to allow several graph traversal algorithms to be implemented. Using Depth-First-Search (DFS), Cain can simply maintain a stack of the nodes. On each cycle the top node is popped off the stack and given to the reverse search algorithm. Then the incremented parent node is put back on the stack, followed by the new child node.

Algorithm 1:
CGDS GraphSearch

    Input: s
    deque ← [(s, null)]
    while deque ≠ [] do
        n, g ← deque[0]
        deque ← deque[1..]
        if g = null then
            do node computation on n
            g ← childGenerator(n)
        end
        c ← g.yield()
        if c ≠ null then
            deque ← [(c, null)] + deque + [(n, g)]
        end
    end

While DFS performs well in AUKE, it struggles in Cain because the number of child nodes at every level is far greater, since each edge is only one instruction and there are multiple kernels to consider. This means the size of the graph we would like to search is much larger, and we are unable to search even a small fraction of it. To overcome this we use a graph-traversal algorithm that, for our purposes, we call Child-Generator-Deque-Search (CGDS). The aim of this algorithm is to ensure that the search does not end up 'trapped' in one small part of the graph, but can effectively traverse many children of many of the nodes that are found, where DFS will search all of the children of nodes at the extent of the paths it searches before searching the second children of nodes earlier in the graph. Algorithm 1 shows a pseudo-code implementation of CGDS. In each cycle the front of the queue is polled; if the node has not been seen before, Cain checks to see if it can be directly transformed from the initial-state Goal-Bag: this is the 'node computation'. The node is then passed to the reverse search algorithm to attempt to produce the next new child node and to increment the parent node – this is implicit in calling 'yield()' on g. The child node, if it exists, is put on the front of the queue, and the incremented parent node is put on the back. We do not claim that CGDS is novel, but we have found it superior to obvious alternatives, and to the strategy used in [2]; for details see [18].

In the reverse search algorithm we see that the pairs of
Uppers and Lowers are produced one at a time. While this simplification allows us to produce more generic graph traversal implementations, what allows Cain to efficiently find solutions are the heuristics that order the pairs produced for a node from the most promising to the least. This type of heuristic provides the order of siblings to search, so we call it a 'local heuristic'. It doesn't compare nodes in different parts of the graph, which we would call a 'global heuristic'. We found that we were unable to find effective global heuristics, because traversal algorithms that take advantage of such heuristics end up producing huge frontier sets of nodes, making the memory requirements too large. The use of local heuristics drives the SCAMP-5 code generation in Cain instead, though support for best-first search with global heuristics is available in Cain. The local heuristics used for SCAMP-5 are based on generating every child node of the parent and then ordering them based on a cost function. There are 3 main components considered for the cost: Atom distance, repeated Goals, and divisions. A simplified formula is shown in Equation 11.

cost(C) = dists(C) + reps(C) + divs(C) (11)

dists(C) = Σ_{G ∈ C} ( |G| + Σ_{a ∈ G} (|a.x| + |a.y|) × [ ∄ B ∈ C . G ⊂ B ] ) (12)

reps(C) = Σ_{G ∈ C : G is unique wrt any translations} { |G|² if ∃ a, b ∈ G . a ≠ b } (13)

divs(C) = 2^d / min( multiplicity(a) : a ∈ G, G ∈ C ) (14)

The Atom distance part counts up how many Atoms every Goal in C has, and how far from the centre they are, with some relief if the Goal is a sub-Goal of another Goal in C. The repeated Goals portion of the cost penalises C by the square of the number of Atoms in each Goal, unless that Goal is equal to a translation of another Goal in C.
The divisions component penalises C for the number of division operations that would be required to produce the Goals from the identity-kernel Goal, G_ID.

4 Evaluation

All performance evaluation is conducted on an Intel Core i7-7700HQ CPU (4 cores, 8 threads) with a base frequency of 2.80 GHz. The computer has 16 GB of RAM and runs Ubuntu 18, as well as Java 1.8 (Oracle) and Python 3.6 to run Cain and AUKE respectively. The implementation of AUKE used, as developed by Debrunner, can be found on GitHub at github.com/najiji/auto_code_cpa/tree/75c017e5ad28c0f3f040fb9f84d7f8727d035baa. Cain source code can be found at github.com/ed741/cain, and the specific version and sources for the experimental setups presented in this evaluation can be found at [19].

Comparison of our work Cain against AUKE is performed by comparing the resulting code generated by the respective compilers, given the same input filters. Both compilers are given 60 seconds to find a solution using all 6 registers. Note that as Cain supports multi-threading, it spawns 4 worker threads to perform the search. As shown in Table 1, Cain significantly outperforms AUKE. Cain supports a wider set of instructions in contrast to AUKE, enabling generation of more efficient code. Not only this, the search strategy used by Cain is better than AUKE's. Although, in further testing, AUKE is able to produce less inefficient code for this kernel given fewer registers. When given multiple kernels, Cain is able to perform simultaneous kernel optimisation, showing almost 4 times speedup in one example combining 3 × 3 kernels.

[Table 1 body largely lost in extraction; surviving per-filter instruction-count totals include (50 + 12) and (13 + 21 + 15).]

Table 1.
Kernels tested in AUKE and Cain. Values on the right-hand side of the table refer to the number of SCAMP-5 macro instructions in the programs generated by AUKE and Cain for each filter. AUKE can only use the 'basic' macro instructions, so Cain is run twice, to compare its effectiveness under the same restrictions as AUKE. Since AUKE does not offer a way to compile multiple kernels at once, values for each kernel are given separately.
AUKE:
Kernel 2 (21 instructions): mov(B,A); divq(B,B); divq(B,B); movx(C,B,north); neg(C,C); neg(D,C); movx(E,D,west); neg(E,E); add(F,B,E); movx(B,D,east); add(B,B,E); movx(D,E,south); movx(D,D,south); sub(B,B,D); add(B,B,F); add(B,C,B); movx(C,C,west); add(B,B,C); movx(C,F,south); add(B,C,B); add(B,B,F);
Kernel 3 (22 instructions): mov(C,A); divq(C,C); divq(C,C); movx(D,C,south); neg(D,D); movx(E,C,east); sub(D,D,E); movx(E,C,north); add(E,E,D); add(D,D,D); add(D,E,D); movx(E,C,west); sub(C,C,E); add(D,D,C); movx(C,C,north); add(C,D,C);
Kernel 1: divq(A,A); divq(A,A); movx(D,A,west); neg(D,D); movx(E,D,south); add(D,D,E); add(E,A,D); movx(A,A,south); movx(A,A,east); add(A,D,A); add(A,A,A); add(A,E,A);

Cain:
diva(A,D,E); div(D,E,C,A); movx(E,D,west); movx(C,E,north); neg(F,E); subx(B,F,east,A); addx(E,E,D,south); add2x(D,F,D,north,north); sub2x(F,D,south,south,C); add2x(D,C,D,east,south); add(E,E,D); movx(D,A,north); add2x(A,C,A,east,east); movx(C,B,east); add(D,F,D); add2x(F,F,E,east,south); movx(E,B,south); addx(A,B,A,south); addx(A,B,A,west); add2x(B,F,B,north,west); add(C,D,C,E);

Table 2.
Comparison of code for the AnalogNet2 filter generated by AUKE and Cain. The input register is 'A' and the output registers for the 3 kernels are 'A', 'B', 'C' respectively. For AUKE, kernel 2 is run first, since testing showed it was longest, so this gives AUKE more registers to use.
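Generated programs like those in Table 2 can be sanity-checked off-device by interpreting them over a grid of values. A sketch of our own, covering only a few of the 'basic' macros with deliberately simplified semantics (no analogue inversion or noise, zero-padded borders, and an assumed neighbour-read convention for movx):

```python
# Sketch: interpret a few 'basic' SCAMP-5-style macros (divq, movx, add) over
# a 2-D grid, one value per PE, to check what kernel a program computes.
# Semantics here are simplified assumptions, not the exact hardware behaviour.

def shifted(grid, direction):
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    dy, dx = {"north": (-1, 0), "south": (1, 0),
              "east": (0, 1), "west": (0, -1)}[direction]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx       # read from the neighbour
            if 0 <= ny < h and 0 <= nx < w:
                out[y][x] = grid[ny][nx]
    return out

def run(program, a):
    regs = {"A": [row[:] for row in a]}
    for line in program:
        op, args = line[0], line[1:]
        if op == "divq":                   # divq(dst, src): dst := src / 2
            regs[args[0]] = [[v / 2 for v in row] for row in regs[args[1]]]
        elif op == "movx":                 # movx(dst, src, dir)
            regs[args[0]] = shifted(regs[args[1]], args[2])
        elif op == "add":                  # add(dst, x, y): dst := x + y
            regs[args[0]] = [[p + q for p, q in zip(r1, r2)]
                             for r1, r2 in zip(regs[args[1]], regs[args[2]])]
    return regs

# B := A/4; C := A/4 read from the west; B := (A + A_west)/4 per PE.
prog = [("divq", "B", "A"), ("divq", "B", "B"),
        ("movx", "C", "B", "west"), ("add", "B", "B", "C")]
imp = [[0.0] * 3 for _ in range(3)]
imp[1][1] = 4.0                            # impulse of height 4 at the centre
print(run(prog, imp)["B"])
```

Pushing an impulse through the program recovers its impulse response, here the two-tap kernel [1/4, 1/4]; the same trick applies to longer generated programs.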
If Cain has an effective heuristic, we will quickly see a point of diminishing returns in code length as Cain continues to search new nodes and takes more time. We can track the number of nodes that are explored before finding any plan in Cain, and so use this as a measure of the search strategy and heuristics that is more independent of physical compute performance. With this in mind we test the effectiveness of our heuristic by constructing 100 samples of randomly generated single-kernel filters as in Equation 15. Running Cain with the following configuration – Maximum Nodes to Explore: 20000, Maximum Search Time: 60 s, Worker Threads: 1 – allows us to collect as many plans as can be found in the given time limit. We then ran Cain again, but with Cain's SCAMP-5 heuristic disabled and replaced with a random sort. This allows us to compare Cain's heuristics against an unaided benchmark.

[ u1 u2 u3; u4 u5 u6; u7 u8 u9 ], given u1, .., u9 are integers sampled uniformly from a fixed non-negative range (15)

We found that Cain was unable to find any plan for any of the 100 sample filters without its heuristics, principally demonstrating that effective heuristics are required in Cain for any tangible progress to be made. We plot the lengths of the best plans found against the number of nodes expanded before the plan is found in Fig. 3. We can see that improvements are fewer and further between after the first 2500 nodes are explored. After this we see that we can expect at most a reduction equal to the reduction seen at 2500 for the rest of the nodes explored. This clearly demonstrates a point of diminishing returns for these filters. If the heuristic is effective we expect it to direct the search towards short plans first, and try instructions less likely to be optimal later. This model fits the data well, as we see short plans are found quickly, and while improvements can be made, it is clear that they are found less often as the search continues.
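The Equation-15 sampling can be sketched in a few lines. The sampling range below ([0, 7]) is an assumption for illustration only, since the exact bound is not reproduced here:

```python
# Sketch of the Equation-15 setup: random 3x3 single-kernel filters with
# uniform integer coefficients. The range [0, 7] is an assumed bound for
# illustration; the paper's exact bound is not reproduced here.
import random

def random_filter(rng, lo=0, hi=7):
    return [[rng.randint(lo, hi) for _ in range(3)] for _ in range(3)]

rng = random.Random(0)            # seeded for repeatability
samples = [random_filter(rng) for _ in range(100)]
assert len(samples) == 100
assert all(0 <= c <= 7 for k in samples for row in k for c in row)
```

Seeding the generator makes the benchmark repeatable across the with-heuristic and random-sort runs.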
Fig. 3. Left: graph showing the median number of instructions in the best plans found before n nodes have been explored by Cain ('Lengths of the Shortest Plans Found Given the Nodes Explored So Far', with the median and percentile bands), over 100 samples of randomly generated single 3 × 3 kernels. Right: 'Shortest Plans Found for Kernels Processed Simultaneously', plotting the smallest plan length found against the number of kernels compiled (1 to 4).

One of the significant features of Cain is its ability to efficiently generate code for filters with multiple kernels, doing so simultaneously such that shared common sub-expressions can be reused. As it is possible for Cain to perform exhaustive searches for plans, given sufficient time it will either find a solution that simply computes the individual kernels independently, or find a solution with lower cost that utilises the common sub-expressions.

First, we wish to test whether the length of generated code is sub-linear in the number of input kernels. To test this, we again generate kernels using the method in Equation 15. For kernel counts from 1 to 4 we generated 25 filters each and tested them all using the same configuration as before, except that we remove the maximum-nodes-explored constraint and allow 4 worker threads. We plot the results in Fig. 3 and see that they appear worse than linear, suggesting that common sub-expressions are not being exploited effectively.

We hypothesise that the limited number of registers in the SCAMP-5 architecture is the major limiting factor in producing efficient code. To test this, we increase the number of available registers to 18. For filters with 1 kernel up to 10 kernels we generate 10 samples each; every kernel in the 100 filters is produced as in Equation 15. For each sample, Cain compiles the kernels individually, given the appropriate number of registers such that other kernels in the filter would not be overwritten. Then we compile the kernels simultaneously using Cain. All compilations are given 60 s to run, with 4 worker threads.

Fig. 4 shows the results of this test. We see clearly that when register limitations are not a restricting factor, Cain is able to consistently improve the performance of filter implementations by compiling them simultaneously.
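A toy illustration of the common sub-expression reuse that simultaneous compilation enables: represent each kernel's work as a multiset of (dx, dy) atoms to be summed, and count atoms as unit cost. Both the encoding and the cost model here are ours, for illustration only; Cain's actual goal representation and instruction costs are richer.

```python
from collections import Counter

# Two kernels as multisets of (dx, dy) atoms (a toy stand-in for goals).
kernel_a = Counter({(0, 0): 2, (1, 0): 1, (0, 1): 1})
kernel_b = Counter({(0, 0): 2, (1, 0): 1, (0, -1): 1})

shared = kernel_a & kernel_b   # common sub-goal: the multiset intersection
only_a = kernel_a - shared     # remainder specific to kernel A
only_b = kernel_b - shared     # remainder specific to kernel B

# Independent compilation pays for the shared part twice; simultaneous
# compilation computes it once and reuses the result for both kernels.
independent_cost = sum(kernel_a.values()) + sum(kernel_b.values())
simultaneous_cost = (sum(shared.values())
                     + sum(only_a.values()) + sum(only_b.values()))
assert simultaneous_cost < independent_cost
```

With more kernels drawn from the same distribution, the expected overlap grows, which is one intuition for the sub-linear growth measured below.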
We see that improvements grow with more kernels, and it appears that the length of code generated for simultaneously compiled kernels increases sub-linearly. This supports the idea that with more kernels, ever more common sub-expressions can be exploited.

Fig. 4. Graph comparing the sum of the shortest SCAMP-5 code lengths found for kernels compiled individually ('Sum of Individual Kernels'), against the same kernels compiled simultaneously as one filter ('Simultaneous Kernels'). For each filter a total of 18 registers were made available (more than in SCAMP-5) to reduce register availability as a limiting factor. In total 100 filters are produced, 10 for each number of kernels per filter. Each kernel is a randomly generated 3 × 3 kernel.

In this section we look at how AUKE operates, to provide extra context and contrast for Cain. Automatic Kernel Code Generation for Analogue SIMD (AUKE) is an algorithm for generating code given a single convolutional kernel, created by T. Debrunner [11]. It can be characterised by 4 main steps: kernel approximation; the reverse-split algorithm; graph relaxation; and finally register allocation. First, AUKE approximates the input kernel into the Goal representation. In this process Cain is similar to AUKE, and the reasoning and mechanics have been discussed in Section 3.1.

Unlike in Cain, multiple instructions are represented by a single elemental transformation of Goals. These elemental transformations form edges of a graph that describe the translation, addition, subtraction and division of Goals to produce the desired convolution filter.
This abstraction allows AUKE to reduce the effective size of the search space, at the cost of granularity in instruction selection and of extensibility to hardware features such as 3-operand addition. Debrunner called this the 'Reverse-Split Algorithm'.

The graph of elemental transformations is dynamically generated via a recursive depth-first search that tries to split a Goal G, that needs to be produced, into 3 sub-Goals:

G = U ∪ L ∪ R  where  U = elementalTransformation(L)    (16)

This recursive algorithm then means that if the search can find solutions for L and R (two smaller problems), it can trivially create U and therefore the desired Goal. In the ideal case R = ∅, so only L needs to be produced and we save one addition. In the worst case L = U = ∅ and R is a transformation of G, so less useful work is done in that step. If two Goals are equal they are merged so that they are not calculated twice, exploiting common sub-expressions in the Goals. This process is repeated until a single Goal, the initial-Goal, is left. This algorithm is able to entirely search the relevant problem space, given a couple of assumptions; most notably, the assumption that every sub-Goal generated is a subset of the Final-Goal. This reduces the search space significantly to the most promising, but not necessarily the best, solutions, allowing AUKE to find generally effective solutions.

The algorithm is made efficient and useful by intelligently selecting the order in which Us, Ls, and Rs are generated at every recursive step. By selecting pairs of U and L that are likely to lead to efficient code, the algorithm can quickly find some path to the initial-Goal.
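One reverse-split step of Equation 16 can be sketched by treating Goals as sets of pixel offsets and using a single translation as the only elemental transformation. This is a much-simplified stand-in for AUKE's transformation set, and the particular Goal and split below are ours, chosen to show the ideal case R = ∅.

```python
def shift(goal, dx, dy):
    """A toy elemental transformation: translate every atom in a Goal."""
    return frozenset((x + dx, y + dy) for (x, y) in goal)

# Desired Goal: sum of a 2x2 neighbourhood of pixel offsets.
G = frozenset({(0, 0), (1, 0), (0, 1), (1, 1)})

# One reverse-split step: choose L, derive U = shift(L), and let R pick up
# whatever remains, so that G = U | L | R.
L = frozenset({(0, 0), (1, 0)})   # the bottom row of the neighbourhood
U = shift(L, 0, 1)                # elemental transformation of L (the top row)
R = G - U - L                     # remainder: empty in this ideal case

assert G == U | L | R             # the split reconstructs the desired Goal
```

The recursion then only needs to produce L (a smaller problem); U comes for free from one transformation, saving an addition exactly as described above.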
From then on the recursive search can stop early if a lower cost solution has already been found.

The Graph Relaxation step uses a 'retiming' algorithm from integrated circuit design to mitigate the risk of missing optimal solutions caused by the assumption that sub-Goals are always subsets of the Final-Goal. This is not needed in Cain, since Cain searches instruction by instruction, so any optimisations found via graph relaxation are already part of its search space.

The final step is to perform register allocation on the graph, to generate usable code. A maximum bound on registers is already accounted for in the search algorithm, since spilling is not an option on the SCAMP-5 architecture. For this task, variable liveness is considered for each node of the graph representation, and a graph colouring algorithm is used to find a solution.

We have presented Cain, a compiler which produces SCAMP-5 instructions from a set of convolutional kernels. Although the effectiveness of simultaneous kernel optimisation is limited on the current iteration of SCAMP-5, we demonstrate that, with an increased number of registers, the length of Cain's output is sub-linear in the number of kernels given. We have conducted an extensive comparison against AUKE, and we demonstrate that the code generated by Cain is more efficient, exhibiting an almost 4x speed-up when the generated kernel is executed on the SCAMP-5 device. We believe that SCAMP-5 is a strong candidate for edge computation, and by providing an easy-to-use yet efficient code-generation toolkit, we hope to accelerate the relevant research in this field.
Acknowledgements
We would like to thank Piotr Dudek, Stephen J. Carey, and Jianing Chen at the University of Manchester for kindly providing access to SCAMP-5, and for their support in our work. This work was partially supported by the EPSRC, grant reference EP/P010040/1.
References
1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: TensorFlow: A system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). pp. 265–283 (2016)
2. Barthels, H., Psarras, C., Bientinesi, P.: Linnea: Automatic generation of efficient linear algebra programs (2019), https://arxiv.org/pdf/1912.12924.pdf
3. Bose, L., Chen, J., Carey, S.J., Dudek, P., Mayol-Cuevas, W.: Visual Odometry for Pixel Processor Arrays. In: 2017 IEEE International Conference on Computer Vision (ICCV). pp. 4614–4622 (Oct 2017)
4. Bose, L., Chen, J., Carey, S.J., Dudek, P., Mayol-Cuevas, W.: A camera that CNNs: Towards embedded neural networks on pixel processor arrays. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 1335–1344 (2019)
5. Carey, S.J., Barr, D.R.W., Wang, B., Lopich, A., Dudek, P.: Locating high speed multiple objects using a SCAMP-5 vision-chip. In: 2012 13th International Workshop on Cellular Nanoscale Networks and their Applications. pp. 1–2 (Aug 2012)
6. Carey, S.J., Lopich, A., Barr, D.R.W., Wang, B., Dudek, P.: A 100,000 fps vision sensor with embedded 535 GOPS/W 256 × 256 SIMD processor array. In: 2013 Symposium on VLSI Circuits. pp. C182–C183 (2013)
7. Chandra, A., Chattopadhyay, S.: Design of hardware efficient FIR filter: A review of the state-of-the-art approaches. Engineering Science and Technology, an International Journal (1), 127–138 (2016)
10. Debrunner, T., Saeedi, S., Bose, L., Davison, A.J., Kelly, P.H.J.: Camera Tracking on Focal-Plane Sensor-Processor Arrays (2019)
11. Debrunner, T., Saeedi, S., Kelly, P.H.J.: AUKE: Automatic kernel code generation for an Analogue SIMD Focal-Plane Sensor-Processor Array. ACM Trans. Archit. Code Optim. (4) (Jan 2019)
12. Guillard, B.: CNNs-on-FPSPs (May 2019), https://github.com/brouwa/CNNs-on-FPSPs/tree/c6b5c51839e9e3c453681e5b0a3e3ef541ba3cce
13. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE (11), 2278–2324 (1998)
14. Martel, J.: Unconventional Processing with Unconventional Visual Sensing. Ph.D. thesis, Institut National des Sciences Appliquées de Lyon (2019)
15. Martel, J.N.P., Müller, L.K., Carey, S.J., Dudek, P., Wetzstein, G.: Neural sensors: Learning pixel exposures for HDR imaging and video compressive sensing with programmable sensors. IEEE Trans. Pattern Anal. Mach. Intell. (7), 1642–1653 (2020)
16. Murai, R., Saeedi, S., Kelly, P.H.J.: BIT-VO: Visual Odometry at 300 FPS using Binary Features from the Focal Plane. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2020)
17. Saeedi, S., Bodin, B., Wagstaff, H., et al.: Navigating the landscape for real-time localization and mapping for robotics and virtual and augmented reality. Proceedings of the IEEE 106