Real-Time Dense Stereo Matching With ELAS on FPGA Accelerated Embedded Devices
Oscar Rahnama, Duncan Frost, Ondrej Miksik, Philip H.S. Torr
Abstract—For many applications in low-power real-time robotics, stereo cameras are the sensors of choice for depth perception, as they are typically cheaper and more versatile than their active counterparts. Their biggest drawback, however, is that they do not directly sense depth maps; instead, these must be estimated through data-intensive processes. Therefore, appropriate algorithm selection plays an important role in achieving the desired performance characteristics. Motivated by applications in space and mobile robotics, we implement and evaluate an FPGA-accelerated adaptation of the ELAS algorithm. Despite offering one of the best trade-offs between efficiency and accuracy, ELAS has only been shown to run at a few frames per second on a high-end CPU. Our system preserves all intriguing properties of the original algorithm, such as the slanted plane priors, but can achieve a frame rate of 47 fps whilst consuming under 4 W of power. Unlike previous FPGA-based designs, we take advantage of both components of the CPU/FPGA System-on-Chip to showcase the strategy necessary to accelerate more complex and computationally diverse algorithms for such low-power, real-time systems.
Index Terms—Range Sensing, RGB-D Perception
I. INTRODUCTION

In many areas of robotics, such as autonomous navigation [1], [2], [3] and manipulation/grasping [4], not only is the ability to perceive depth critical, but it needs to be obtained very accurately and in real-time. On mobile or embedded platforms, power consumption, cost, size and weight also become important factors to consider. For instance, assistive augmented glasses should be mobile, lightweight and ergonomic whilst retaining the ability to operate for long periods on limited battery power [5], [6].

Active methods of measuring depth, which are commonly used due to their high accuracy, carry certain disadvantages. LIDAR systems are often bulky, heavy and costly. Infrared systems, on the other hand, are limited in their range, susceptible to interference and, more importantly, constrained by ambient lighting. Passive methods may not be limited by these factors; however, they are computationally very expensive and their accuracy/latency depends heavily on the techniques used.

Stereo matching algorithms can be broadly split into global energy minimization methods and local matching techniques. The former are often more accurate, but the generally large/irregular memory requirements and sequential/iterative nature of their underlying algorithms make them dependent
(This work was supported by the People Programme (Marie Curie Actions, "Initial Training Networks") of the EU FP7 under REA grant No. 317497 (EDISON), Technicolor, ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1, and EPSRC/MURI grant EP/N019474/1. The authors are with the Department of Engineering, University of Oxford, UK; OxSight Ltd; and Emotech Labs.)

on powerful processors for speed-up, and even then, their frame rate is limited. Conversely, local methods struggle with textureless and occluded regions, but the uniformity of their computations and the absence of dependencies between pixels make them very suitable for parallel acceleration.

Benefiting in part from the greater accessibility provided by CUDA/OpenCL, such acceleration has predominantly been done with Graphics Processing Units (GPUs). However, Field Programmable Gate Arrays (FPGAs) are becoming increasingly competitive alternatives, especially in power-limited systems, with their capacity for in-stream processing, adherence to strict timings and supremacy at sliding-window operations [7]. Their effectiveness for stereo image processing has been previously demonstrated [8], [9], [10], with the most accurate implementations usually relying on Semi-Global Matching (SGM) [11]. However, as SGM is highly recursive, memory-intense and, in its entirety, ill-suited to acceleration, existing designs either only partially implement it or sacrifice latency and throughput by relying heavily on external memory.

In this paper, we investigate the adoption and acceleration of a competing algorithm for low-power embedded systems. The algorithm, Efficient Large-Scale Stereo (ELAS) [12], is the fastest CPU algorithm w.r.t. resolution on the Middlebury dataset [13] and one of the most accurate non-global methods. ELAS is attractive since it very efficiently implements a slanted plane prior while its dense depth estimation is fully decomposable over all pixels and, hence, suitable for parallel processing. Unfortunately, the intermediate step, i.e.,
estimation of coarse scene geometry through the triangulation of support points, is a very iterative, sequential and conditional process with an unpredictable memory access pattern, making it difficult to accelerate on an FPGA.

To overcome this challenge, we propose the first stereo implementation which collaboratively utilizes both components of an embedded CPU-FPGA System-on-Chip (SoC) for the purpose of algorithm acceleration. Other published low-power systems achieve good frame rates by limiting the algorithms they implement to those that can be fully processed by the FPGA, even when closely coupled processors are available, e.g., [2], [1]. We, instead, seek to take advantage of both available components to efficiently accelerate the more complicated/accurate ELAS algorithm and demonstrate its feasibility for low-power systems. Accomplishing this involves offloading the different stages of the processing pipeline onto the component that best suits the computations involved. We discuss the rationale behind the chosen partitioning and explain

(Source code available at https://github.com/torrvision/ELAS_SoC)
why it is the most suitable, describe the key traits required in the design of efficient accelerators, as well as the changes made to best adapt the algorithm to the platform. Tested on both the KITTI and New Tsukuba datasets, our system outperforms the frame rate of the original by up to ∼30×, with a rate of 47 fps (1242×375 images), and, in addition, improves upon its accuracy, all the while consuming under 4 W of power.

Fig. 1. ELAS overview: extract a set of support points from gradient images, which are then used to establish priors for the dense matching stage.

II. RELATED WORK
The pursuit of real-time stereo began in the 1980s [14]. Initial implementations of dense stereo minimized relatively simple matching costs, e.g., the Sum of Absolute Differences (SAD) or Sum of Squared Differences (SSD), between left and right image patches evaluated along the epipolar lines. Kayaalp and Eckman [15] were among the first to present such a system, capable of estimating disparity over a 64-disparity range in about one second for 256×256 images. The first system capable of at least 30 fps, on 200×200 images, was presented by the authors of [16], [17]. Similarly to [15], they used a Sum of Sum of Absolute Differences (SSAD), but rather than summing over the different color channels, they summed over the six different cameras of their multi-camera system.

The first FPGA implementation of dense stereo matching used 16 Xilinx 4025 FPGAs [18]. It relied on the Census Transform (CT) [19] and computed 24 disparity levels over 320×240 images at 42 fps. Over about the next decade, FPGAs were repeatedly demonstrated as suitable platforms for dense real-time stereo; however, they mostly implemented only variations of the SAD, SSAD, zero-mean SAD (ZSAD) and CT with additional noise-suppressing post-processing steps [20], [21], [22], [23], [24]. Hence, the accuracy of such approaches was not typically comparable with state-of-the-art models, which were formulated in global energy minimization frameworks [13], [25], [26].

Notable improvements in the accuracy of FPGA implementations were made by incorporating Semi-Global Matching (SGM) [11]. For instance, Gehrig et al. [27] incorporated SGM into a low-power FPGA design, while the authors of [8] proposed a similar solution, but only aggregated costs over 4 directions with a rank transform [19]. SGM's larger scope managed to partly bridge the accuracy gap between strictly local operators and global optimization methods whilst remaining suitable for acceleration. However, SGM still has disadvantages, such as large memory requirements and a fronto-parallel bias. Alternative recent approaches have shown some improvements in both frame rate and accuracy [28], [29], [30], [31], [9], [10], [32]; however, they still lack the accuracy of SGM.

During the same period (2000-2010), graphics processing units (GPUs) began to appear as alternative platforms for algorithm acceleration [33], [34], [35]. Although they offered speedup for sliding-window algorithms such as local stereo, GPUs typically under-performed and consumed more power than their FPGA counterparts [7]. They were therefore a less favorable option for truly embedded and real-time systems.

Currently, the fastest CPU stereo algorithm on the Middlebury dataset [13], normalized w.r.t. resolution, is Efficient Large-Scale Stereo [12]. It competes with SGM in accuracy, but its diverse computational nature has seen it overlooked in favor of other fully FPGA-implementable algorithms. With new closely coupled CPU/FPGA System-on-Chip devices, however, it stands to benefit greatly in terms of acceleration.

III. PRELIMINARIES
A. Original ELAS algorithm
The Efficient Large-Scale Stereo Matching (ELAS) algorithm [12] relies on the assumption that not all correspondences are equally difficult. It first establishes a set of sparse correspondences whose estimation is simpler and at the same time comes with a higher degree of confidence. These correspondences provide a coarse approximation of the scene geometry and are used to define a slanted plane prior which guides the dense matching stage.

An overview of ELAS is shown in Fig. 1. To obtain the set of sparse but confident correspondences (the "support points"), the stereo pair first passes through a SAD matching stage over the horizontal and vertical gradients of the images, Fig. 1(b-d). The resulting set is sparse, as only pixels with sufficiently unambiguous disparity values are kept. This criterion is measured by comparing the distance between the first and second minima of the SAD evaluations across the disparity range. The results from this stage then undergo a further filtering procedure, Fig. 1(e), to remove implausible and redundant values, which would respectively corrupt or unnecessarily complicate the coarse 3D representation. This filtering process compares the sparse values to neighbors within a window region to ensure that they are consistent, and removes identical values along the same row or column.

The set of support points is then used to guide the dense stereo matching stage (Fig. 1(h)) in two separate ways. First, the set of support points is used to define a slanted plane prior which guides the dense matching stage. To this end, ELAS uses Delaunay triangulation to construct a mesh which approximates the coarse scene geometry (Fig. 1(g)). Second, this slanted plane prior is used to limit the disparity values evaluated during the dense matching stage, with the range expanded to include immediately neighboring values.

Fig. 2. Pooling support points within a sub-region to create a grid vector. The sparse set of correspondences in every given grid region of an image is pooled together to create a characteristic search vector. This search vector is expanded to include the immediate neighbors of the included support points.

B. Platform
We use Xilinx's ZC706 development board with the XC7Z045 SoC; this is a heterogeneous chip that incorporates an ARM Cortex-A9 processor operating at 800 MHz and a 28 nm Kintex-series FPGA on the same die. The collocation of the two components ultimately serves to increase the overall throughput of the system, as it allows for rapid and efficient exchange of data. The resources available on the FPGA include: 218,600 Look-Up Tables (LUTs), 437,200 Flip-Flops (FFs), 900 DSP48 blocks, and 1,090 18 Kb Block RAMs (BRAMs).
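To give a feel for these resource figures, the following is a back-of-the-envelope estimate (our own sketch, not taken from the paper) of how many 18 Kb BRAMs the line buffers of a single sliding-window accelerator consume; the assumptions (8-bit pixels, one buffered line per window row, the function name itself) are illustrative only.

```cpp
#include <cassert>

// Capacity of one Zynq-7000 block RAM in bits (18 Kb).
constexpr int kBramBits = 18 * 1024;

// Estimate BRAM usage for the line buffers of one W x W window over an
// image of the given width. Assumes one buffered line per window row;
// the paper's actual design details may differ.
int brams_for_line_buffers(int image_width, int window, int bits_per_px = 8) {
    int bits_per_line = image_width * bits_per_px;
    int brams_per_line = (bits_per_line + kBramBits - 1) / kBramBits;  // ceil
    return brams_per_line * window;
}
```

For a 1242-pixel-wide KITTI row of 8-bit pixels, one line fits in a single BRAM, so a 9×9 window's line buffers need on the order of nine BRAMs out of the 1,090 available, which is consistent with the paper's later observation that BRAM usage is purely a function of the buffered image rows.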
C. High level synthesis
Vivado High-Level Synthesis (VHLS) provides a higher-abstraction approach to FPGA block implementation by synthesizing designs described in a high-level language such as C/C++ into equivalent hardware descriptions. A deep understanding of the underlying hardware architecture is still required, but it alleviates the burden of adopting a low-level hardware description language such as VHDL/Verilog. By abstracting fine-grained, less critical details of the design, VHLS accelerates and facilitates development with FPGAs. Although accelerators designed with this higher-abstraction approach may not be as optimized or as resource-efficient as those designed with low-level hardware description languages, the ability to deploy, modify and test them much more rapidly is a reasonable compromise. Hence, we implemented all accelerators using VHLS.

IV. SYSTEM OVERVIEW
Fig. 3. Overview of the System-on-Chip (SoC) implementation. Compute-intensive tasks are offloaded to FPGA accelerators, whereas conditional/sequential tasks are handled by the ARM CPU. Communication between the CPU and FPGA is handled by Direct Memory Access blocks in the FPGA.

Fig. 3 shows the overall system implemented on the Zynq SoC platform. Determining which parts of the algorithm are offloaded onto dedicated accelerators and which are processed on the ARM CPU is a twofold process. First, the entire algorithm's CPU runtime is profiled to get an estimate of the time spent in each function. The main bottlenecks are then identified and the algorithm is mapped and broken down into its main components. In the subsequent step, the computational nature of the different components is evaluated and matched to the most appropriate component for processing.

a) FPGA:
FPGA accelerators are severely hampered if they need to communicate with external memory or if they contain many divergent datapaths. However, they excel at performing a variety of operations simultaneously on a large range of data. As such, functions that process blocks of data with well-defined, relatively local memory access patterns and limited amounts of conditional branching can benefit tremendously from such acceleration.

In ELAS, the functions responsible for support point extraction, filtering and block matching fit these criteria. Therefore, as denoted by the green blocks in Fig. 3, they are offloaded onto dedicated FPGA accelerators. These accelerators can either have data transferred between them directly (Sparse → Filtering) or to and from the RAM through Direct Memory Access (DMA) blocks in the FPGA (programmed by the CPU). The accelerators process the data in-stream, only storing small portions of the overall stream (cf. Sec. V) and outputting data at the same rate at which it is received. This processing style best complements the raster pixel readout of modern image sensors and allows for top-level pipelining between successive FPGA accelerators.

b) ARM CPU:
Functions with very unpredictable memory access patterns, as well as those with a high amount of conditional branching, are very ill-suited for FPGA acceleration. These, instead, benefit more from the ARM CPU's faster processing speeds, its sequential processing style (invariant to branching) and its equal, but longer, access to memory (disregarding cache hits/misses).

As denoted by the yellow blocks in Fig. 3, the ELAS processes that are handled in this manner are the Delaunay triangulation as well as the remapping of the slanted plane priors into disparity priors. Grid vector extraction and one-hot encoding (cf. Sec. VI) are also done on the ARM. Although grid vector extraction would appear to be a candidate for FPGA acceleration, as it "pools" values within a local memory region with a window-like operation, as shown in Fig. 2, in reality, since it only operates on a single value at a time, it benefits more from the faster clock of the ARM.

Fig. 4. Memory requirements for sliding-window operations in FPGA accelerators. Line buffers (blue) are used to store large amounts of data but can only provide a single value per clock cycle. Window buffers (green) are local registers used to store immediately required data.

V. KEY ACCELERATOR DESIGN TRAITS
A. FPGA memory management
To achieve the required parallelism within the FPGA blocks, all the necessary data for a set of computations must be available on the same clock cycle as those computations are to occur. Memory management, therefore, has the largest impact on accelerator throughput and latency. Fig. 4 shows the combination of Block RAMs and local memory that are used as the core components for this purpose. Block RAMs (BRAMs) are the most resource-efficient stores of large quantities of data. In the accelerators, they behave as line buffers, storing previously received pixel information in the FPGA fabric. For a given W×W matching window, each image's pixel data is collected by a set of 2×W line buffers. On every clock cycle, each line buffer shifts its contents at a given index to the line buffer above it, and a new pixel is read into the free space created at that index of the bottom-most buffer. Similarly, the data that was in the topmost buffer is shifted out, as it is no longer required; only a fraction of the overall image data is ever held in the FPGA.

Each BRAM, however, is only able to read and write one value per clock cycle. Therefore, a set of line buffers only supplies one column of pixel data per clock cycle. Most computations in the accelerators, however, operate over windows of pixels, and therefore this alone is insufficient to meet the memory requirements of a high-throughput/low-latency design. Instead, an additional W×W "window" buffer is necessary (Fig. 4). As it consists of local registers within the FPGA block, which are all instantaneously accessible, this is resource-expensive. On every clock cycle, the contents of the window are updated by shifting all columns once to the left. The rightmost column is read in from the values stored in the line buffers (including the latest pixel value).

By combining the use of storage elements in this manner, we efficiently achieve access to all the necessary data on the same clock cycle on which it is used.
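A software analogue of this buffering scheme can be sketched as follows. This is our own illustration, not the paper's HLS source: it assumes a 3×3 window, a toy image width, and a simplified set of W line buffers (the actual design uses 2×W per image), but the shift pattern per incoming pixel is the one described above.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch of the line-buffer + window-buffer combination.
// 'lines' plays the role of the BRAM line buffers, 'win' the role of
// the register-based window buffer. One push per pixel of the raster
// stream; afterwards all W*W window values are simultaneously readable.
constexpr int W = 3;     // window size (assumed for the sketch)
constexpr int COLS = 8;  // toy image width

struct SlidingWindow {
    uint8_t lines[W][COLS] = {};  // line buffers (BRAM in hardware)
    uint8_t win[W][W] = {};       // window buffer (registers in hardware)
    int col = 0;

    void push(uint8_t px) {
        // Shift the window one column to the left.
        for (int r = 0; r < W; ++r)
            for (int c = 0; c < W - 1; ++c)
                win[r][c] = win[r][c + 1];
        // Fill the rightmost column from the line buffers, while each
        // line buffer passes its value at this column up one level.
        for (int r = 0; r < W - 1; ++r) {
            win[r][W - 1] = lines[r + 1][col];
            lines[r][col] = lines[r + 1][col];
        }
        // The newest pixel enters the bottom row.
        win[W - 1][W - 1] = px;
        lines[W - 1][col] = px;
        col = (col + 1) % COLS;
    }
};
```

After streaming in three rows of an image, `win` holds the 3×3 neighborhood ending at the most recent pixel, with no re-reads of earlier data, which is the property the hardware version exploits to feed every window computation on each clock cycle.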
No additional clock cycles need to be spent on memory access.

B. Pipelining
Although VHLS handles the timing considerations and data-flow control of FPGA accelerators, the throughput and latency it achieves depend on the propagation delays within the accelerators, as well as the desired amount of pipelining and the overall clock frequency.

When maximizing the throughput of an accelerator, pipelining is necessary when its total internal propagation delay exceeds the clock period to which it is constrained (the pixel read-in/out rate). By introducing pipelining, the accelerator is able to meet the clock frequency constraint by dividing and spreading its operation over multiple clock cycles. Each sub-stage is separated from the next with flip-flops that store intermediate values, and therefore pipelining improves utilization. Ultimately, it results in the accelerator being shared across a set of inputs, as each sub-stage processes a new input on every clock cycle; a larger amount of data is being simultaneously acted upon.

Other than the flip-flop requirement, the trade-off with pipelining is that the number of clock cycles between first input and output increases by the number of pipeline stages, and the propagation delay experienced by a single pixel is longer, as each sub-stage's delay is extended to that of the longest sub-stage. In image processing, however, as large quantities of pixel data pass through the accelerators, the latency introduced by pipelining is not only negligible, but significantly outweighed by the ability to output data at a much faster rate. It plays a significant role in achieving the desired frame rate in our design.

VI. PLATFORM-CONSCIOUS ALGORITHM CHANGES
A. Feature selection
The efficient implementation of the original ELAS uses SIMD accelerators with fixed widths of 16 bytes for feature extraction and matching. Such an implementation, however, lacks flexibility, since the number of pixels it can process is limited and must be a multiple of 16 (for 8-bit values). The result is that for a given W×W window, only a subset of the pixels contained within it are used for matching purposes.

Due to our memory management, all pixel data within a window is available, and therefore no speed penalty is incurred by using all of it (Fig. 4). This also improves accuracy, since we use a larger number of pixels in the matching process. We use Census Transform descriptors with the Hamming distance instead of the SAD, as this achieves illumination-invariant matching without the need for an additional pre-processing step (the Sobel filter of Fig. 1(b)).

Even if the pre-processing is discounted, the SAD generally requires more resources than the CT. As shown in Fig. 5, unlike the CT, which reuses previously extracted features, the SADs must be recomputed every time. Also, SAD implementations that achieve similar throughput, such as the one in [36], require an additional window buffer to store previous column SAD computations (bottom of Fig. 5). Thus the resource requirement is greater, due to both the greater number of computations and the greater need for memory.

These changes result in a feature descriptor that is shorter in bit length (81/25 bits for the two matching stages, compared to the original 512-bit descriptors).

Fig. 5. Comparison of the extraction and matching implementations with the Census Transform and the Sum of Absolute Differences.

B. Measuring ambiguity
The support point extraction is done slightly differently to the original algorithm. We replace the original criterion, which assumes a match is unambiguous if

  m1 / m2 ≤ 0.9,   (1)

where m1 and m2 are the first and second minima respectively. This is equivalent to thresholding

  m1 ≤ 0.9 m2 = T_err(m2).   (2)

However, implementing such a comparison on an FPGA requires a number of DSP blocks. Hence we approximate the threshold T_err(m2) with a shift-sum:

  T_err(m2) = 0.9 m2 ≈ m2/2 + m2/4 + m2/8 + m2/16 = 0.9375 m2.   (3)

This eliminates the need for DSP blocks, as shift-summing is fully accomplished within the LUT fabric of the FPGA.
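The DSP-free test maps directly onto shifts and adds. The following plain C++ sketch (our illustration; the function names are ours, not the paper's) shows the approximation of Eq. (3):

```cpp
#include <cassert>
#include <cstdint>

// Shift-sum approximation of the threshold T_err(m2) = 0.9 * m2:
// m2/2 + m2/4 + m2/8 + m2/16 = 0.9375 * m2, computed with no multiplier,
// so on the FPGA it reduces to pure LUT logic.
inline uint32_t t_err(uint32_t m2) {
    return (m2 >> 1) + (m2 >> 2) + (m2 >> 3) + (m2 >> 4);
}

// A match with best cost m1 and second-best cost m2 is kept as a
// support point only when m1 falls below the approximated threshold.
inline bool unambiguous(uint32_t m1, uint32_t m2) {
    return m1 <= t_err(m2);
}
```

Note that, for integer costs, each shift truncates, so the computed threshold is at most 0.9375·m2 and slightly lower for values that are not multiples of 16; this makes the test marginally stricter, which only prunes borderline support points.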
Fig. 6. One-dimensional simplification of redundancy verification. (A) Searching both forwards and backwards propagates redundancy into the value's non-existence. (B) Searching strictly backwards retains the value at the desired frequency.
C. Filtering support points
The original algorithm uses both past and future values in the data stream for its redundancy check. As shown in the simplified 1-D example of Fig. 6(A), the shared values used to flag redundancy are often themselves made redundant by instances further ahead. Instead, the "filter" FPGA accelerator relies only on past values when determining redundancy. This ensures the shared values become less frequent rather than non-existent.
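The 1-D behavior of Fig. 6(B) can be sketched as follows. This is our own toy model, not the accelerator's source: the `LOOKBACK` distance is an assumed parameter, and a value is dropped only when an identical value appeared among the previous entries.

```cpp
#include <cassert>
#include <vector>

// Backward-only redundancy filter (1-D sketch): drop a streamed value
// when an identical value occurred within the previous LOOKBACK entries.
// Because only the past is inspected, the first of a run of identical
// values always survives - the value becomes less frequent, not absent.
constexpr int LOOKBACK = 2;  // assumed toy neighborhood size

std::vector<int> filter_redundant(const std::vector<int>& in) {
    std::vector<int> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        bool redundant = false;
        for (std::size_t k = 1; k <= LOOKBACK && k <= i; ++k)
            if (in[i - k] == in[i]) { redundant = true; break; }
        if (!redundant) out.push_back(in[i]);
    }
    return out;
}
```

A two-sided check on the run {5, 5, 5} could flag every element via some identical neighbor; the backward-only version necessarily keeps the first 5, matching the figure's point that the value is retained at a reduced frequency.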
D. Data Reduction
To accommodate the static bit-width of the accelerator ports and to minimize the data transferred to the FPGA, we introduce new data reduction steps to ELAS (on the ARM). Referring to Fig. 3, the first one-hot encodes the grid vectors from a variable-length byte array into a statically sized bitwise representation. The second converts the result of the Delaunay triangulation from a variable mesh of triangles into a static, input-image-sized matrix of disparity priors.

VII. EXPERIMENTAL RESULTS
To demonstrate the effectiveness of our approach, we provide a detailed evaluation across differently parameterized sets of implementations and evaluate them on both the KITTI and New Tsukuba datasets. As FPGA accelerators cannot be easily reconfigured for different image resolutions during testing, we only use the 310 of the 400 KITTI image pairs that share the same 1242×375 resolution.

A. Accuracy
We begin by verifying how the accuracy of the FPGA-accelerated version of ELAS, following the modifications made to adapt it to the SoC platform, compares to that of the original algorithm. To this end, we use the standard accuracy metric from the KITTI benchmark, which measures the relative number of estimated disparities that differ from the ground truth by both an absolute amount of at least 3 and a relative one of at least 5%.

To ensure a fair comparison, the number of support points used to establish the prior should be approximately the same. As not all pixels are considered for support point extraction in the original algorithm, we find that this occurs when we use a fraction of the total number of support points extracted by our method. We also use the same window sizes for both matching stages, i.e., 9×9 and 5×5, respectively. As previously explained, accuracy is measured without post-processing/refinement, as these processes are not unique to ELAS and are more susceptible to dataset "fine-tuning". With these parameters, the original implementation, tested without post-processing over the same image set, obtains an average error of 17.9%, while our implementation achieves an improved error of approximately 16%.

B. Per-frame processing time
Fig. 7 (left axes) illustrates the proportion of the overall processing time of the ARM against that of the FPGA. The FPGA portion is inclusive of the time spent transferring data by the DMAs. Additionally, the results are reported across different support point densities, which we regulate through down-sampling. As shown, although the time spent on the ARM is proportional to the number of support points used and shortens significantly with down-sampling, it nonetheless dominates the overall processing time.

In contrast, the combined processing time of the FPGA accelerators is mostly unaffected by changes in parameters such as matching window size, disparity range or the number of support points. As they have a constant throughput of 1 pixel/clock cycle, their processing time is, instead, predominantly a function of image resolution. On average, the FPGA accelerators take only about 4 ms per frame, and the difference between the two datasets corresponds exactly to their pixel-count ratio.

The line plot in Fig. 7 (right axes) shows the error percentage vs. the number of support points controlled by down-sampling. Whilst slightly unintuitive, the best accuracy is not achieved with the largest number of support points (the peak occurs at an intermediate fraction of the support points). This is likely due to the noisy nature of sparse correspondence matching. Following this peak, accuracy gradually decreases with a reduced number of support points, as the resulting planar surfaces become less accurate coarse approximations of the scene geometry.

Interestingly, although the FPGA accelerators run in constant time, the matching window size of the support point extraction stage is negatively correlated with the overall frame rate. Larger windows do incur a greater initial latency to account for the additionally used line buffers, but this difference is negligible (evidenced by the invariance of the frame rate to the window size of dense matching) and cannot account for this effect. Instead, the frame rate reduction is actually due to the increased number of support points resulting from extraction with larger matching windows; this impacts the ARM's workload. Therefore, the main bottleneck is the Delaunay triangulation, which is in stark contrast to the one reported in the original CPU implementation.

Fig. 7. Processing time and related accuracy w.r.t. down-sampling.
C. Power and resource consumption
The high-throughput capability of the accelerators is limited by the number of circuit elements available within the FPGA fabric. In order to report this "resource utilization" (Table I), we split it into the three main types of blocks (we exclude DSP blocks, as they are negligibly used). As expected, the LUTs, used for the combinational logic and instantaneous memory, are the most predominantly utilized, and this amount depends strongly on the window size of dense matching. The flip-flops, utilized primarily for pipelining, share a similar dependency, but to a lesser extent. The BRAMs, used as line buffers, are purely a function of the cumulative image rows required for a given set of windows.

One of the most important advantages of the proposed implementation is the power-efficient computing that it enables (cf. the last two columns of Table I). The ARM processor, running at a steady 800 MHz, accounts for a constant but majority share of the power. In contrast, the power consumed by the FPGA is much more controlled and directly proportional to the portion of FPGA logic that is being utilized. Altogether, however, the results highlight that the implementation is not only capable of running the algorithm in real-time, it succeeds in doing this with under 3 W of power (in contrast to powerful desktop CPUs, which alone typically require an order of magnitude more).

D. Throughput
To further increase the frame rate, we explore operating over multiple images simultaneously by taking advantage of the additional core of the ARM and separate but identical accelerators in the FPGA. This effectively doubles the system, and therefore the resources and power used by the FPGA double (minus some shared overhead). Conversely, the ARM's power consumption remains the same. Such a doubled configuration utilizes 56.8% of the LUTs, 35% of the FFs and 22.8% of the BRAMs, and keeps overall power consumption under the 4 W reported earlier.

TABLE I
IMPACT OF THE WINDOW SIZES USED DURING MATCHING ON FRAME RATE, ACCURACY AND RESOURCE UTILIZATION
(Columns: data set; support and dense matching window sizes; FPS and Error % for CPU ELAS and for the 1/8 and 1/32 support-point fractions; Resource Utilization [%] as LUT, FF and BRAM; Power (Watts) for ARM and FPGA. Rows cover KITTI (1242×375) and New Tsukuba (640×480) with window sizes from 7×7 to 13×13.)

TABLE II
TIME BREAKDOWNS (ms), CONES (900×750)

Stage           | i7-only (orig.) | ARM+FPGA | i7+FPGA
Support Points  | 118             | 3.5      | 3.5
Triangulation   | 7               | 84.42    | 7
Matching        | 359             | 3.5      | 3.5

VIII. COMPARISON & DISCUSSION
Comparing these results to what was achieved in the originalpaper, it is clear that parallelizing key parts of the algorithmhas successfully led to significantly faster - up to 30 × -real-time frame-rates. Despite the achievement, the resultsalso reveal some of ELAS’s weaknesses for power-limitedplatforms. From Table II, where the time breakdown foreach stage is compared across systems, we see how the SoCmanages 100 × throughput increase for both matching stageseven though it processes more data, i.e. all pixels consideredfor support point and full matching windows. However, withthe low-power CPU paling in performance compared to itsdesktop counterpart, the triangulation procedure - seeminglyinsignificant in the original paper - is >
12 times longer anddominates the processing time on the SoC. Thus, althoughELAS is exemplified as one of the fastest stereo algorithm,its dependence on a very sequential procedure makes it alsodependent on a powerful processor to achieve maximal speedup. In the last column of Table II, we show the processingtimes which one could obtain if an SoC, combining the sameFPGA with an Intel Core i7 CPU instead of the ARM, were used to accelerate the algorithm with the same proposedapproach. On such a platform, the frame rate of ELAS exceeds70fps, but power consumption would also exceed 100W.In terms of accuracy, the improvements over the originalcan be attributed to the full matching windows/CT as opposedto the randomly sub-sampled SAD of the original. These sub-sampled windows were needed in the original to speed up theCPU processing time. In contrast, with the FPGA accelerators,not only is window matching speed independent of the numberof pixels considered, but full windows result in a more efficientuse of resources. As a corollary, unlike CPU ELAS whoseruntime is coupled to its tailored windows, our embeddedversion can accommodate various window sizes and disparityranges without being concerned about the impact on latency.We compare the performance of our system to the fastestimplementation currently reported on the KITTI benchmark,referred to as “CSCT+SGM+MF” (CSM) [37] and which,at its core, implements SGM - the competing algorithm inembedded, real-time systems. It reports an 8.24% error rateat 156 fps on a 250W NVIDIA Titan X. As CSM’s resultincorporates smoothing/refinement both inherently throughSGM and through an additional median filter, we pass theresults of our most accurate configuration through a medianfilter for the sake of comparison. 
With only this one additional post-processing step, we obtain a new error rate of 9.52%, already slightly better than the accuracy ELAS reports on the KITTI benchmark following all post-processing/refinement. Although CSM may still be marginally more accurate with a faster frame rate, it requires substantially more power at 250 W. On a frame rate per Watt basis, this is equivalent to roughly a 10× improvement over the faster Titan X system and 5× over the slower Tegra X1 one.
The FPGA accelerators in this system were described in C++ and then converted into logic with VHLS. In our experience, although the tool did eliminate the need for writing low-level VHDL/Verilog, it still relied heavily on the user's deep knowledge of the target circuit. The original algorithm had to be completely re-engineered to comply with the hardware framework. As well as motivating the previously described modifications, this included adhering to a regular, timing-strict processing chain, minimizing any inter-process dependencies and eliminating conditional operations/branches/exceptions.
IX. CONCLUSION
In this work, we have disassembled and reconstructed the ELAS algorithm onto an ARM + FPGA SoC with the purpose of evaluating its suitability for low-power, real-time embedded systems. By taking advantage of the immense parallelism available with FPGAs, and by better adapting the algorithm for it, we not only successfully accelerate the frame rate by up to 30×, but we also demonstrate an improvement in accuracy. All this is achieved with under 4 W of power, which makes it 5-10× more efficient, on a frame rate per Watt basis, than competing algorithms on KITTI.
Through the iterative process of adapting the algorithm to the platform, as well as the starkly different resulting processing time breakdown we obtained, fundamental principles were gleaned for the future design of accurate, but ultimately real-world applicable, algorithms. Specifically, with parallelism being of paramount importance, any strictly sequential or iterative processes must be kept to a minimum, as these will cause severe bottlenecks. Their acceleration depends on faster processors, and as CPU frequency is directly proportional to power consumption, this quickly incurs greater power requirements that are unrealistic in space, aerial or mobile robotics. Conversely, accelerators excel at simultaneously processing vast amounts of data as long as it is available and effectively managed in the fabric of the FPGA. Therefore, compromises that may have made sense for a strictly CPU system, such as sacrificing accuracy for speed by computing with fewer pixels, are no longer necessary and should be entirely avoided.
Finally, although the newly developed high-level design tools by Xilinx do indeed facilitate the accessibility, speed and transportability of designing on FPGAs, one must still possess a strong understanding of hardware design in order to efficiently implement accelerators with them.
REFERENCES
[1] H. Oleynikova, D. Honegger, and M. Pollefeys, "Reactive avoidance using embedded stereo vision for MAV flight," in ICRA, 2015.
[2] G. Camellini, M. Felisa, P. Medici, P. Zani, F. Gregoretti, C. Passerone, and R. Passerone, "3DV-An embedded, dense stereovision-based depth mapping system," in Intelligent Vehicles Symposium Proceedings, 2014.
[3] N. A. Zainuddin, Y. M. Mustafah, Y. A. M. Shawgi, and N. K. A. M. Rashid, "Autonomous navigation of mobile robot using Kinect sensor," in International Conference on Computer and Communication Engineering, 2014.
[4] C. Lehnert, I. Sa, C. McCool, B. Upcroft, and T. Perez, "Sweet pepper pose detection and grasping for automated crop harvesting," in ICRA, 2016.
[5] S. L. Hicks, I. Wilson, L. Muhammed, J. Worsfold, S. M. Downes, and C. Kennard, "A depth-based head-mounted visual display to aid navigation in partially sighted individuals," PLoS ONE, 2013.
[6] O. Miksik, V. Vineet, M. Lidegaard, R. Prasaath, M. Nießner, S. Golodetz, S. L. Hicks, P. Perez, S. Izadi, and P. H. S. Torr, "The semantic paintbrush: Interactive 3D mapping and recognition in large outdoor spaces," in ACM CHI, 2015.
[7] J. Fowers, G. Brown, P. Cooke, and G. Stitt, "A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications," in FPGA, 2012.
[8] C. Banz, S. Hesselbarth, H. Flatt, H. Blume, and P. Pirsch, "Real-time stereo vision system using semi-global matching disparity estimation: Architecture and FPGA-implementation," in SAMOS, 2010.
[9] M. Pérez-Patricio, A. Aguilar-González, M. Arias-Estrada, H.-R. Hernandez-de Leon, J.-L. Camas-Anzueto, and J. de Jesús Osuna-Coutiño, "An FPGA stereo matching unit based on fuzzy logic," Microprocessors and Microsystems, 2016.
[10] G. Cocorullo, P. Corsonello, F. Frustaci, and S. Perri, "An efficient hardware-oriented stereo matching algorithm," Microprocessors and Microsystems, 2016.
[11] H. Hirschmuller, "Accurate and efficient stereo processing by semi-global matching and mutual information," in CVPR, 2005.
[12] A. Geiger, M. Roser, and R. Urtasun, "Efficient large-scale stereo matching," in ACCV, 2010.
[13] H. Hirschmuller and D. Scharstein, "Evaluation of cost functions for stereo matching," in CVPR, 2007.
[14] M. Drumheller and T. Poggio, "On parallel stereo," in ICRA, 1986.
[15] A. E. Kayaalp and J. L. Eckman, "Near real-time stereo range detection using a pipeline architecture," IEEE Transactions on Systems, Man, and Cybernetics, 1990.
[16] T. Kanade, H. Kano, S. Kimura, A. Yoshida, and K. Oda, "Development of a video-rate stereo machine," in IROS, 1995.
[17] T. Kanade, A. Yoshida, K. Oda, H. Kano, and M. Tanaka, "A stereo machine for video-rate dense depth mapping and its new applications," in CVPR, 1996.
[18] J. Woodfill and B. Von Herzen, "Real-time stereo vision on the PARTS reconfigurable computer," in FCCM, 1997.
[19] R. Zabih and J. Woodfill, "Non-parametric local transforms for computing visual correspondence," in ECCV, 1994.
[20] M. Hariyama, Y. Kobayashi, H. Sasaki, and M. Kameyama, "FPGA implementation of a stereo matching processor based on window-parallel-and-pixel-parallel architecture," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2005.
[21] S. Perri, D. Colonna, P. Zicari, and P. Corsonello, "SAD-based stereo matching circuit for FPGAs," in Electronics, Circuits and Systems, 2006.
[22] C. Cuadrado, A. Zuloaga, J. L. Martin, J. Laizaro, and J. Jimenez, "Real-time stereo vision processing system in a FPGA," in IECON, 2006.
[23] C. Georgoulas, L. Kotoulas, G. C. Sirakoulis, I. Andreadis, and A. Gasteratos, "Real-time disparity map computation module," Microprocessors and Microsystems, 2008.
[24] K. Ambrosch and W. Kubinger, "Accurate hardware-based stereo vision," CVIU, 2010.
[25] D. Scharstein and R. Szeliski, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," IJCV, 2002.
[26] ——, "High-accuracy stereo depth maps using structured light," in CVPR, 2003.
[27] S. K. Gehrig, F. Eberli, and T. Meyer, "A real-time low-power stereo vision engine using semi-global matching," in ICVS, 2009.
[28] C. Georgoulas and I. Andreadis, "A real-time fuzzy hardware structure for disparity map computation," Journal of Real-Time Image Processing, 2011.
[29] M. Werner, B. Stabernack, and C. Riechert, "Hardware implementation of a full HD real-time disparity estimation algorithm," IEEE Transactions on Consumer Electronics, 2014.
[30] W. Wang, J. Yan, N. Xu, Y. Wang, and F.-H. Hsu, "Real-time high-quality stereo vision system in FPGA," IEEE Transactions on Circuits and Systems for Video Technology, 2015.
[31] M. Pérez-Patricio and A. Aguilar-González, "FPGA implementation of an efficient similarity-based adaptive window algorithm for real-time stereo matching," Journal of Real-Time Image Processing, 2015.
[32] Y. Li, K. Huang, and L. Claesen, "SoC and FPGA oriented high-quality stereo vision system," in FPL, 2016.
[33] M. Gong and Y.-H. Yang, "Near real-time reliable stereo matching using programmable graphics hardware," in CVPR, 2005.
[34] J. Lu, G. Lafruit, and F. Catthoor, "Fast variable center-biased windowing for high-speed stereo on programmable graphics hardware," in ICIP, 2007.
[35] I. Ernst and H. Hirschmüller, "Mutual information based semi-global stereo matching on the GPU," in ISVC, 2008.
[36] O. Rahnama, A. Makarov, and P. H. S. Torr, "Real-time depth processing for embedded platforms," in Proceedings of the SPIE, 2017.
[37] D. Hernandez-Juarez, A. Chacón, A. Espinosa, D. Vázquez, J. C. Moure, and A. M. López, "Embedded real-time stereo estimation via semi-global matching on the GPU,"