A Fully Pipelined FPGA Accelerator for Scale Invariant Feature Transform Keypoint Descriptor Matching
Luka Daoud, Muhammad Kamran Latif, H S. Jacinto, Nader Rafla
Department of Electrical and Computer Engineering, Boise State University, Boise, ID 83725, USA
Abstract
The scale invariant feature transform (SIFT) algorithm is considered a classical feature extraction algorithm within the field of computer vision. SIFT keypoint descriptor matching is a computationally intensive process due to the amount of data consumed. In this paper, we designed a fully pipelined hardware accelerator architecture for SIFT keypoint descriptor matching. It was implemented and tested on a field programmable gate array (FPGA). The proposed hardware architecture is able to properly handle the memory bandwidth necessary for a fully pipelined implementation and hits the roofline performance model, achieving the potential maximum throughput. The fully pipelined matching architecture was designed based on the cosine angle distance approach. It was optimized for 16-bit fixed-point operations and implemented in hardware using a Xilinx Zynq-based FPGA development board. Our proposed architecture shows a noticeable reduction in area resources compared with its counterparts in the literature while maintaining high throughput by alleviating the memory bandwidth restrictions. The results show a reduction in device resources of up to 91% in LUTs and 79% in BRAMs. Our hardware implementation is 15.7× faster than the comparable software approach.

Keywords: Scale Invariant Feature Transform, SIFT, Matching algorithm, FPGA, Pipeline, Acceleration, High Level Synthesis, HLS.
1. Introduction
Object recognition using feature-based algorithms is generally computationally intensive. The scale-invariant feature transform (SIFT) algorithm, proposed in 1999 by David Lowe [1], is a classical and well-known algorithm within the field of computer vision. SIFT is a feature-based algorithm that can be applied in object recognition. The best candidate match for a SIFT keypoint is found by identifying its nearest neighbor in the keypoint database. The matching process often involves operating on data-at-rest, but more recently real-time applications using feature-based object recognition have gained popularity. Feature-extraction-based object recognition is commonly applied in a variety of applications such as medical imaging [2], satellite imaging [3], facial recognition [4], and the landing of unmanned aerial vehicles (UAVs) [5].

The various steps in extracting SIFT descriptors often require complex software routines with intensive computations [1]. However, in a running scenario, keypoint extraction occurs only once per test image. Keypoint descriptor matching, in contrast, must be performed every time a test image is compared with a possible match in the database. Because the database must be accessed each time, the overall matching time for the test image increases as the size of the database grows.

SIFT descriptor matching is based on the nearest-neighbor algorithm [1] where, for a single test keypoint descriptor, the Euclidean distances [6] between the test descriptor and each descriptor in the descriptor database are calculated. The calculated distances are then sorted such that the minimum and second minimum distances are found. A positive match between the test descriptor and the descriptor database is found if the ratio of the minimum distance to the second minimum distance is below a pre-set threshold, as suggested by David Lowe in [1].

A SIFT keypoint descriptor is an array of 128 elements, calculated from all pixels of the image in a 16 × 16 sliding window centered on the keypoint. The descriptor generated by this method can be defined mathematically as:

$$d_k^{\alpha} = \{ f_{k,1}^{\alpha}, f_{k,2}^{\alpha}, \ldots, f_{k,128}^{\alpha} \}.$$

The Euclidean distance between two descriptors, $d_k^{\alpha}$ and $d_m^{\beta}$, is thus calculated:

$$\sum_{i=1}^{128} \frac{(f_{k,i}^{\alpha} - f_{m,i}^{\beta})^2}{(f_{k,i}^{\alpha} + f_{m,i}^{\beta})}.$$

In the process of matching a descriptor, $d_k^{\alpha}$, with a database, the Euclidean distances between $d_k^{\alpha}$ and the database's descriptors are calculated. Calculating Euclidean distances is computationally intensive; however, resource consumption can be reduced effectively by changing the distance calculation. Instead of the conservative approach of calculating the Euclidean distance as described, cosine angle distances can be calculated between the descriptors [7]. Since SIFT descriptors are normalized during keypoint extraction, the angular distances obtained by taking the arc-cosine of the dot-products of the normalized descriptors prove to be a close approximation of the Euclidean distances [7]. Using angular distance significantly reduces hardware resource consumption.

If an image of $m$ descriptors is represented by a matrix of size $m \times 128$ [...]

• A hardware implementation of SIFT keypoint descriptor matching based on cosine angle distance on FPGA, including:
  – A fully pipelined architecture.
  – Minimal resource utilization.
  – A high-throughput hardware accelerator.
• An analysis of memory bandwidth usage and its effect on the overall computational performance.

The rest of this paper is organized as follows: Section 2 summarizes the literature and the related work on SIFT descriptor matching on accelerating platforms. Section 3 provides background and related definitions along with the software approach to the matching algorithm based on calculating cosine angle distances. Section 4 studies the computation and memory bandwidth optimization. Section 5 presents our proposed matching architecture on FPGA. Section 6 evaluates our proposed matching architecture and provides the experimental results. Finally, Section 7 concludes the paper.

Preprint submitted to Microprocessors and Microsystems Journal, February 08, 2019.
2. Related Work
This section focuses on different approaches to calculating nearest-neighbor distances for descriptor matching algorithms on FPGA-based accelerators. It also provides a brief overview of matching implementations on other platforms.

There have been several hardware-based implementations of descriptor matching on FPGA [8, 9, 10, 11]. Most recently, Vourvoulakis et al. [8] proposed an FPGA-based architecture for SIFT descriptor matching based on calculating the distances between the descriptors in the database. The similarity between descriptors was determined from the minimum value of the SAD (Sum of Absolute Distances) calculators. Their implementation compared the currently extracted descriptor with 128 previously detected ones to find a potential match. The authors proposed a moving window of 16 descriptors to fit the entire matching architecture on an FPGA. In their implementation, a total of 8 clock-cycles were required to calculate 128 SAD values and report a potential match using a single matching core, which required significant memory resources.

Lentaris et al. [9] implemented a pipelined architecture for SIFT descriptor matching using the Euclidean norm for computing distances between descriptors. In their implementation, a finite state machine fetches the descriptors from the test image, $d_{k,i}^{\alpha}$, and the descriptors from the database, $d_{k,i}^{\beta}$, stored in memory one by one. Each descriptor pair is passed to a chi-square distance state, where the similarity of the two descriptors is evaluated by calculating the distance between them. The distance-calculating state consists of 128 chi-square ($\chi^2$) calculators, and each calculator performs the $(d_{k,i}^{\alpha} - d_{k,i}^{\beta})^2 / (d_{k,i}^{\alpha} + d_{k,i}^{\beta})$ calculation, where $i$ is the index of the element in the 128-dimensional vector. Each multiplier and divider in the chi-square state is 16-bit and produces a 32-bit result.
The output from the $\chi^2$ calculators is summed using a linear systolic array, and the result is passed to the matching state to keep track of the two best matches. At the end of the database, the distance ratio of these two matches is compared with a fixed threshold to accept or reject the best match. Their technique of matching by calculating the Euclidean distance necessitated more resources than our approach, as explained in Section 6.

Wang et al. [10] proposed an embedded System-on-Chip for feature detection and matching. Their system extracts binary robust independent elementary features (BRIEF) [12] descriptors from the detected SIFT ones. Unlike SIFT descriptors, which have 128 elements, the BRIEF descriptor is a vector of 64 elements. BRIEF matching was performed by calculating the distance between two BRIEF descriptors. A successful match is reported if the calculated distance is smaller than a minimum threshold [10].

Kapela et al. [11] presented a hardware-software platform in which fast retina keypoint (FREAK) [13] descriptors were extracted in software and matched by calculating the Hamming distance, which was implemented on a Xilinx Zynq-7000 FPGA. Their proposed matching core included multiple Hamming distance calculator circuits running in parallel to calculate the distance between the descriptors. The overall performance of their system depends on the number of Hamming distance cores. Additionally, the number of LUTs and registers increases proportionally with the number of Hamming calculators.

Condello et al. [14] presented an OpenCL-based feature matching algorithm that made use of the capabilities of GPUs to speed up the matching process for speeded-up robust feature (SURF) descriptors. The matching algorithm uses Euclidean distances to find the nearest neighbors of a test descriptor among the others in the database.
They implemented their matching core on NVIDIA's GTX275, which has a theoretical peak of 2760 GFlops. However, the latency of global memory access affected the computational power of the GPU, as it limited memory reuse during the distance computation step of the matching process.

Fassold et al. [15] used NVIDIA's Tesla K20 GPU for SIFT descriptor matching by calculating the nearest neighbors between descriptors using Euclidean distance. Their implementation of the matching architecture on the GPU achieved 13 milliseconds for a set of 2,800 descriptors.

The matching algorithm in most of these implementations is based on calculating the nearest-neighbor distances between the current feature and the features in the database. To the best of our knowledge, this paper is the first attempt at a hardware implementation of the SIFT matching algorithm on FPGA in which the matching technique is based on calculating the nearest-neighbor distances using cosine angle distance rather than the traditional descriptor distance calculations. The following part of this paper describes in greater detail the SIFT matching algorithm based on the cosine angle distance technique.

3. Matching Algorithm based on Cosine Angle Distance

An image is a 2-D array of pixels that carry information, and keypoint descriptors are highly distinctive features in an image. A SIFT descriptor is a vector of 128 elements that describe a scale-invariant local image region. It can be given as $d_k^{\alpha}$, the $k$-th descriptor in an image $\tilde{\alpha}$:

$$d_k^{\alpha} = \{ f_{k,1}^{\alpha}, f_{k,2}^{\alpha}, \ldots, f_{k,128}^{\alpha} \},$$

where $f_{k,i}^{\alpha}$ is the $i$-th element of the $k$-th descriptor of image $\tilde{\alpha}$ and ($0 \le f_{k,i}^{\alpha} \le$ [...]). An image $\tilde{\alpha}$ that has a set of $m$ descriptors is described as $d^{\alpha}$:

$$d^{\alpha} =
\begin{bmatrix} d_1^{\alpha} \\ d_2^{\alpha} \\ \vdots \\ d_m^{\alpha} \end{bmatrix}
=
\begin{bmatrix}
f_{1,1}^{\alpha} & f_{1,2}^{\alpha} & f_{1,3}^{\alpha} & \cdots & f_{1,128}^{\alpha} \\
f_{2,1}^{\alpha} & f_{2,2}^{\alpha} & f_{2,3}^{\alpha} & \cdots & f_{2,128}^{\alpha} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
f_{m,1}^{\alpha} & f_{m,2}^{\alpha} & f_{m,3}^{\alpha} & \cdots & f_{m,128}^{\alpha}
\end{bmatrix}.$$

The dot-product operation of two descriptors, $d_l^{\alpha}$ and $d_s^{\beta}$, is denoted as $dp_{l,s}^{\alpha,\beta}$ and is calculated in Equation 1.
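$$dp_{l,s}^{\alpha,\beta} = d_l^{\alpha} \odot d_s^{\beta} = \sum_{i=1}^{128} f_{l,i}^{\alpha} \cdot f_{s,i}^{\beta} \tag{1}$$

Thus, $dp_k^{\alpha,\beta}$ is the dot-product of the $k$-th descriptor of image $\tilde{\alpha}$ with each descriptor of image $\tilde{\beta}$, defined as

$$dp_k^{\alpha,\beta} =
\begin{bmatrix} dp_{k,1}^{\alpha,\beta} \\ dp_{k,2}^{\alpha,\beta} \\ \vdots \\ dp_{k,n}^{\alpha,\beta} \end{bmatrix}
= d_k^{\alpha} \odot
\begin{bmatrix} d_1^{\beta} \\ d_2^{\beta} \\ \vdots \\ d_n^{\beta} \end{bmatrix}
=
\begin{bmatrix} d_k^{\alpha} \odot d_1^{\beta} \\ d_k^{\alpha} \odot d_2^{\beta} \\ \vdots \\ d_k^{\alpha} \odot d_n^{\beta} \end{bmatrix}
=
\begin{bmatrix}
\sum_{i=1}^{128} f_{k,i}^{\alpha} \cdot f_{1,i}^{\beta} \\
\sum_{i=1}^{128} f_{k,i}^{\alpha} \cdot f_{2,i}^{\beta} \\
\vdots \\
\sum_{i=1}^{128} f_{k,i}^{\alpha} \cdot f_{n,i}^{\beta}
\end{bmatrix}.$$

Therefore, the dot-product of all descriptors of image $\tilde{\alpha}$ and image $\tilde{\beta}$ can be denoted as $dp^{\alpha,\beta}$, defined by

$$dp^{\alpha,\beta} =
\begin{bmatrix} dp_1^{\alpha,\beta} \\ dp_2^{\alpha,\beta} \\ \vdots \\ dp_m^{\alpha,\beta} \end{bmatrix}
=
\begin{bmatrix}
dp_{1,1}^{\alpha,\beta} & dp_{2,1}^{\alpha,\beta} & dp_{3,1}^{\alpha,\beta} & \cdots & dp_{m,1}^{\alpha,\beta} \\
dp_{1,2}^{\alpha,\beta} & dp_{2,2}^{\alpha,\beta} & dp_{3,2}^{\alpha,\beta} & \cdots & dp_{m,2}^{\alpha,\beta} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
dp_{1,n}^{\alpha,\beta} & dp_{2,n}^{\alpha,\beta} & dp_{3,n}^{\alpha,\beta} & \cdots & dp_{m,n}^{\alpha,\beta}
\end{bmatrix}^{T}.$$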
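As a short worked step, the relation between the dot-product of Equation 1 and the standard Euclidean norm can be written out explicitly. The derivation below is a sketch that assumes every descriptor has unit Euclidean norm, which is our reading of the normalization performed during keypoint extraction; it is not a formula taken from the paper.

```latex
\begin{aligned}
\lVert d_l^{\alpha} - d_s^{\beta} \rVert^2
  &= \lVert d_l^{\alpha} \rVert^2 + \lVert d_s^{\beta} \rVert^2
     - 2\,\bigl(d_l^{\alpha} \odot d_s^{\beta}\bigr) \\
  &= 2 - 2\, dp_{l,s}^{\alpha,\beta}
  \qquad \text{(assuming } \lVert d_l^{\alpha} \rVert = \lVert d_s^{\beta} \rVert = 1\text{)} .
\end{aligned}
```

Under that assumption, a larger dot-product always means a smaller Euclidean distance. Because the arc-cosine is monotonically decreasing on $[-1, 1]$, ranking database descriptors by $\cos^{-1}(dp_{l,s}^{\alpha,\beta})$ gives the same nearest and second-nearest neighbors as ranking them by Euclidean distance, although the ratio between the two smallest values is not identical in the two measures.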
In the SIFT matching algorithm, the cosine inverse (arc-cosine), denoted by $ci$, of each dot-product operation is calculated. Similarly, $ci^{\alpha,\beta}$ is the arc-cosine of $dp^{\alpha,\beta}$, defined as

$$ci^{\alpha,\beta} =
\begin{bmatrix} ci_1^{\alpha,\beta} \\ ci_2^{\alpha,\beta} \\ \vdots \\ ci_m^{\alpha,\beta} \end{bmatrix}
=
\begin{bmatrix}
ci_{1,1}^{\alpha,\beta} & ci_{2,1}^{\alpha,\beta} & ci_{3,1}^{\alpha,\beta} & \cdots & ci_{m,1}^{\alpha,\beta} \\
ci_{1,2}^{\alpha,\beta} & ci_{2,2}^{\alpha,\beta} & ci_{3,2}^{\alpha,\beta} & \cdots & ci_{m,2}^{\alpha,\beta} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
ci_{1,n}^{\alpha,\beta} & ci_{2,n}^{\alpha,\beta} & ci_{3,n}^{\alpha,\beta} & \cdots & ci_{m,n}^{\alpha,\beta}
\end{bmatrix}^{T}.$$

The SIFT matching algorithm iterates through several steps to check whether a single descriptor of image $\tilde{\alpha}$ corresponds to a descriptor in image $\tilde{\beta}$. The efficient design of a matching algorithm depends largely on the platform on which it is implemented. In this section, a software approach is detailed, with a description of the resulting implementation of the SIFT matching algorithm based on our proposed angular distance measure between descriptors.

Algorithm 1: Software approach for SIFT descriptor matching.
Input: kth descriptor of image α with size 1 × 128; database descriptors of image β with size m × 128
Output: Matched result for kth descriptor with the database descriptors
  for j = 1 to m do
    for k = 1 to 128 do
      p[i][j] += A[i][k] · B[j][k];
    end for
  end for
  [sort_vals, index] = sort(arccos(p[i][1..m]));
  if sort_vals(1) < (threshold ∗ sort_vals(2)) then
    return Match Found
  else
    return No Match Found
  end if
Algorithm 1 provides the computational software flow of the SIFT matching algorithm for the $k$-th descriptor, $d_k^{\alpha}$, of image $\tilde{\alpha}$ with the database descriptors, $d^{\beta}$, of image $\tilde{\beta}$, where image $\tilde{\beta}$ has $n$ descriptors, represented as

$$d^{\beta} = \{ d_1^{\beta}, d_2^{\beta}, \ldots, d_n^{\beta} \}.$$

The first step of the SIFT matching algorithm is to calculate the dot-product of the descriptor, $d_k^{\alpha}$, with each descriptor in the database according to Equation 1. The result of the dot-product operation is a vector, $dp_k^{\alpha,\beta}$, of $n$ elements, shown as

$$dp_k^{\alpha,\beta} = [\, dp_{k,1}^{\alpha,\beta}, dp_{k,2}^{\alpha,\beta}, \ldots, dp_{k,n}^{\alpha,\beta} \,].$$

Next, the arc-cosine is calculated for each element of $dp_k^{\alpha,\beta}$, saving the result in memory or cache, presented mathematically as

$$ci_k^{\alpha,\beta} = [\, ci_{k,1}^{\alpha,\beta}, ci_{k,2}^{\alpha,\beta}, \ldots, ci_{k,n}^{\alpha,\beta} \,].$$

The resulting output array, $ci_k^{\alpha,\beta}$, is sorted in ascending order, from which the first and second minimums are obtained.

David Lowe defined a threshold criterion [1], typically 0.6, to determine matching success. Matching success between the $k$-th descriptor, $d_k^{\alpha}$, of image $\tilde{\alpha}$ and the database descriptors, $d^{\beta}$, of image $\tilde{\beta}$ is determined according to Equation 2.

$$\begin{cases}
\text{Match} & \text{if } \textit{minimum} < (0.6 \times \textit{second minimum}) \\
\text{No Match} & \text{otherwise}
\end{cases} \tag{2}$$

The calculations listed are repeated for each descriptor in image $\tilde{\alpha}$ to determine the matching features in image $\tilde{\beta}$. As Algorithm 1 shows, the SIFT matching algorithm requires a large number of calculations and equally large memory resources, so its execution time quickly becomes dominated by both calculation and memory access latency.
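The software flow of this section (dot-products, arc-cosine, the two smallest angles, and the ratio test of Equation 2) can be sketched in plain Python. This is a minimal illustration under our own assumptions: the function names `normalize` and `match_descriptor` are hypothetical, descriptors are assumed to be unit-normalized lists of floats, and the toy vectors have 4 elements instead of 128 to keep the example short.

```python
import math

def normalize(vector):
    """Scale a vector to unit Euclidean norm (assumed pre-processing step)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def match_descriptor(query, database, threshold=0.6):
    """Return the index of the matching database descriptor, or None.

    Follows the flow of Algorithm 1: dot-product with every database
    descriptor, arc-cosine of each result, then the ratio test of Equation 2.
    """
    angles = []
    for reference in database:
        dot = sum(q * r for q, r in zip(query, reference))
        dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
        angles.append(math.acos(dot))
    if len(angles) < 2:
        return None  # the ratio test needs a second minimum
    order = sorted(range(len(angles)), key=angles.__getitem__)
    best, second = order[0], order[1]
    if angles[best] < threshold * angles[second]:
        return best
    return None

# Toy data: one query descriptor and two small databases.
query = normalize([1.0, 0.0, 0.0, 0.0])
distinct = [normalize([1.0, 0.05, 0.0, 0.0]),
            normalize([0.0, 1.0, 0.0, 0.0]),
            normalize([0.5, 0.5, 0.0, 0.0])]
ambiguous = [normalize([1.0, 1.0, 0.0, 0.0]),
             normalize([1.0, 0.0, 1.0, 0.0])]
```

With the toy data, `match_descriptor(query, distinct)` accepts the first database entry because its angle is far smaller than the runner-up, while `match_descriptor(query, ambiguous)` rejects the match because the two best angles are equal. The full sort is kept here to mirror Algorithm 1; only the two smallest values are actually needed.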
4. Proposed Optimization of Memory Bandwidth
In this section, we study the impact of memory bandwidth on the overall performance of the matching process and explore an optimization scheme to fully utilize the computation core and the memory bandwidth.
Image descriptors are streamed to the SIFT descriptor matching subsystem via a memory attached to the computing core. The total memory bandwidth plays a vital role in achieving maximum performance for a given system. In order for the matching core to start processing, one descriptor for each image, $\tilde{\alpha}$ and $\tilde{\beta}$, should be ready at the input ports of the matching core. We assume that the $k$-th descriptor of image $\tilde{\alpha}$ is always ready at the input port of the computational core. Since each descriptor is composed of 128 elements¹, each data transfer between memory and the computational core is 256 bytes.

In order to study the effect of memory bandwidth on the overall performance of the system, let us assume that only one computational core exists in the system, and that it is pipelined and runs at 100 MHz. To execute one operation, a full descriptor (256 bytes) should be ready at the input port of the computational core. Hence, the memory bandwidth affects the system throughput. For example, if the memory bandwidth reaches 32 bytes per clock-cycle (3.2 GB/s), the computational core waits 8 clock-cycles to completely receive a single descriptor before starting the process. This achieves 12.5 mega-operations per second (M op/s). When the memory bandwidth increases to 6.4 GB/s, similarly, the performance [...]

¹ Each element would be nominally composed of a 16-bit fixed point for the angular distance method.

Figure 1: Performance and memory bandwidth effect. (Plot of Speed (M op/s) against Bandwidth (GB/s), with the bound based on speed limitation.)
In this paper, we implemented the matching core on a Zedboard [16]. The platform includes two DDR3 memory components. The multi-protocol DDR controller is configured for 32-bit wide accesses to a 512 MB address space. With 32-bit data width access to the DDR memory, 64 bits (8 bytes) are accessed in one clock-cycle. This limits the performance of the matching core to 100/32 M op/s for a 100 MHz clock, since the core waits 32 clock-cycles to receive a complete descriptor before starting its operation. However, by optimizing the memory access, we can achieve the peak performance of the platform, 100 M op/s, as illustrated in Section 4.2.
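The bandwidth-bound throughput figures quoted in this section follow from a simple model: the core can start one operation only after a whole descriptor has arrived, and it can never exceed one operation per clock-cycle. The sketch below encodes that model; the function name `throughput_mops` and its parameters are our own, and the model is a simplification for illustration rather than the authors' roofline analysis.

```python
import math

def throughput_mops(bytes_per_cycle, clock_mhz=100.0, descriptor_bytes=256):
    """Estimated throughput in M op/s for a single pipelined core.

    One operation needs one full descriptor, so the core waits
    ceil(descriptor_bytes / bytes_per_cycle) cycles per operation.
    The wait is never shorter than one cycle (compute bound).
    """
    cycles_per_descriptor = max(1, math.ceil(descriptor_bytes / bytes_per_cycle))
    return clock_mhz / cycles_per_descriptor

def bandwidth_gb_per_s(bytes_per_cycle, clock_mhz=100.0):
    """Convert a per-cycle transfer width into GB/s at the given clock."""
    return bytes_per_cycle * clock_mhz / 1000.0

for width in (8, 32, 64, 256, 512):
    print(f"{bandwidth_gb_per_s(width):5.1f} GB/s -> {throughput_mops(width):6.3f} M op/s")
```

For 32 bytes per cycle this reproduces the 12.5 M op/s example, and for the 8 bytes per cycle available on the Zedboard it gives 100/32 M op/s. Beyond 256 bytes per cycle the estimate saturates at 100 M op/s, which is the speed bound of the core.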
Due to the maximum memory bandwidth limitations presented by the hardware platform, our goal is to increase throughput by executing one dot-product operation every clock-cycle. To ensure that one dot-product operation can be computed every clock-cycle, a new descriptor must be valid every clock-cycle. In order to alleviate the memory bandwidth bottleneck, an internal memory (cache) is used for storing 32 descriptors. Since the time to calculate 32 dot-product operations is 32 clock-cycles (one operation per clock-cycle), one complete descriptor can be fetched from external memory within that time period. The newly fetched descriptor executes a dot-product operation with each of the 32 descriptors stored in the internal cache. This fetching optimization allows 32 dot-product operations to be calculated in 32 clock-cycles. While the 32 dot-products of the current descriptor are executing, a new descriptor is received, and the process is repeated until all descriptors of image $\tilde{\beta}$ have been processed.

Therefore, in order to alleviate the memory bandwidth restriction, the descriptors of image $\tilde{\alpha}$ are divided into blocks of 32 descriptors each. Each block is passed to an internal cache, and the dot-product operation is executed between that block and all descriptors of image $\tilde{\beta}$. Architecturally, two first-in first-out (FIFO) buffers are used to store the descriptors from external memory as a linear cache, with a total fetch time of 1024 clock-cycles per block². As a result of this block latency, the highest throughput is achieved when image $\tilde{\beta}$ has more than 32 descriptors. To further reduce computation time and memory requirements for SIFT descriptor matching, the architecture must be fully compatible with the platform. In our approach, the matching architecture is implemented such that full utilization of the core is achieved.
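The blocking scheme described above can be illustrated with a small software model. The generator below enumerates descriptor pairs in the blocked order (one cached block of image α descriptors, with image β descriptors streamed past it), and the second function gives a rough cycle estimate. Both function names are hypothetical, and the cycle model is an assumption of ours: it charges the full block fill time for every block and ignores any overlap the control logic may achieve.

```python
def blocked_schedule(num_alpha, num_beta, block=32):
    """Yield (alpha_index, beta_index) pairs in the blocked processing order."""
    for start in range(0, num_alpha, block):
        cached = range(start, min(start + block, num_alpha))  # block held in cache
        for beta_index in range(num_beta):   # one streamed descriptor at a time
            for alpha_index in cached:       # one dot-product per clock-cycle
                yield alpha_index, beta_index

def estimated_cycles(num_alpha, num_beta, block=32, fetch_cycles=32):
    """Rough cycle count, assuming block fills are not overlapped with compute."""
    blocks = -(-num_alpha // block)      # ceiling division
    fill = block * fetch_cycles          # 32 x 32 = 1024 cycles per block
    per_beta = max(block, fetch_cycles)  # fetch hides behind the dot-products
    return blocks * (fill + num_beta * per_beta)

pairs = list(blocked_schedule(40, 3))
print(len(pairs), pairs[0], pairs[32])
print(estimated_cycles(64, 100))
```

The schedule visits every (α, β) pair exactly once, so the blocked order changes only when each dot-product is computed, not which ones are computed. In this model the 1024-cycle fill is amortized over the whole of image β, which is why the throughput approaches one operation per cycle only when image β contains many descriptors.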
5. Proposed Matching Algorithm Architecture on Hardware
FPGAs are generally used to accelerate computational processes by increasing concurrent operations. They further increase the overall throughput of the system by pipelining and overlapping instructions. The goal of this work is to accelerate SIFT descriptor matching on FPGA while efficiently handling the memory bandwidth limitations that are often seen in software implementations, as explained in Section 4.

In our proposed architecture, the descriptors of image $\tilde{\alpha}$ and image $\tilde{\beta}$ are streamed from external memory into an internal FIFO. Each descriptor is composed of 128 elements and its location, $(x, y)$, in the image. Although each element could be represented as a double-precision floating-point value to increase accuracy, floating-point adder circuits [17] are more complicated and consume more resources. Therefore, for further optimization, each element of the descriptor is represented as a 16-bit fixed-point value, with a total of 32 bits for its location, leading to an individual descriptor size of 2080 bits.

The proposed SIFT matching architecture [18] consists of four main sub-cores and two internal caches to alleviate memory bottlenecks, all of which are fully pipelined and implemented on the FPGA:
• Dot Product.
• Cosine Inverse.
• Minimum Search (MIN FIND).
• Match Check.
• Descriptor Cache (DES MEM).
• Minimum(s) Cache (MIN MEM).

² 32 × 32 = 1024.

To make use of high-level synthesis design [19], Xilinx System Generator® is used for designing and implementing the Dot Prod and the Cosine Inverse blocks as IP cores. Figure 2 shows a block diagram of our proposed SIFT descriptor matching accelerator core. In this architecture, the descriptors are streamed from memory to the matching core. The internal caches (buffers) hold the descriptors to alleviate the memory bottleneck and fully utilize the matching core. This makes a complete descriptor available at every clock-cycle, with the external memory bandwidth optimized for increased throughput as explained in Section 4.2.
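The descriptor layout described above (128 fixed-point elements of 16 bits plus 32 bits of location) can be checked with a small packing sketch. Several details here are assumptions made only for illustration and are not stated in the paper: the fixed-point scaling (`FRACTION_BITS = 15`), the split of the 32 location bits into two unsigned 16-bit coordinates, the little-endian byte order, and the function names.

```python
import struct

FRACTION_BITS = 15  # assumed scaling: 1.0 maps to 2**15 (not stated in the paper)

def to_fixed(value):
    """Quantize a non-negative real value to an unsigned 16-bit fixed-point word."""
    return max(0, min(0xFFFF, round(value * (1 << FRACTION_BITS))))

def pack_descriptor(elements, x, y):
    """Pack 128 elements and an (x, y) location into one 2080-bit record."""
    if len(elements) != 128:
        raise ValueError("a SIFT descriptor has 128 elements")
    words = [to_fixed(e) for e in elements]
    # 128 unsigned 16-bit elements followed by two unsigned 16-bit coordinates.
    return struct.pack("<128HHH", *words, x, y)

record = pack_descriptor([0.5] * 128, 10, 20)
bits = len(record) * 8
cycles_at_8_bytes = -(-len(record) // 8)  # ceiling division
print(len(record), bits, cycles_at_8_bytes)
```

The packed record is 260 bytes, that is 2080 bits, and moving it over an 8-byte-per-cycle interface takes 33 clock-cycles once the location is included.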
Figure 2: Matching core architecture for the proposed SIFT descriptor matching, detailing the sub-modules and caches necessary for operation as controlled by the control unit. $Z^{-n}$ is a shift register with $n$ pipeline stages.

Since the descriptor size is 260 bytes (including the coordinates) and the memory bandwidth is 8 bytes per clock, it takes 33 clock-cycles to fetch a complete descriptor. In order to match the fetching time, 33 descriptors of image $\tilde{\alpha}$ are saved in a separate descriptor cache (DES MEM), while each descriptor of image $\tilde{\beta}$ falls through in 33 clock-cycles and is saved into Register.

The outputs of DES MEM and Register are split into descriptors, which are handed to the Dot Product core, and their coordinates, which are passed through a shift register with a corresponding number of pipeline stages. The output of the Dot Product core is passed to a Cosine Inverse core of 52 pipeline stages, which approximates the resulting output to 16 bits.

To calculate the minimum and second minimum on the fly, the previous minimum and second minimum are retrieved from the minimum(s) cache (MIN MEM) and passed to the minimum-search core (MIN FIND) along with the currently calculated cosine-inverse. Initially, MIN MEM is empty; the minimum and second minimum values delivered are temporarily taken to be the maximum value (0xFFFF). For each new block of descriptors of image $\tilde{\alpha}$, MIN MEM should be flushed. Therefore, a multiplexer is used to pass the constant value 0xFFFF when a new block of descriptors is delivered to DES MEM. Otherwise, it passes the current value(s) in MIN MEM. This is controlled by the Control Unit, seen in Figure 2.

The Control Unit in the proposed SIFT matching architecture allows several operations to run concurrently by using a scheduling method, increasing the overall throughput of the system.
An example of scheduled concurrent operation is the following: while the dot-product operation executes on the final descriptor of image $\tilde{\beta}$, DES MEM is simultaneously filled with the subsequent descriptors of image $\tilde{\alpha}$, so that the following descriptor of image $\tilde{\beta}$ already has a new reference descriptor block. The Control Unit is aware of the total number of descriptors and the processing time of each core, which enables it to drive the control signals that receive new descriptors, enable the internal cores, and control the internal caches.

The SIFT matching algorithm begins by calculating the product of each element of the $k$-th descriptor of image $\tilde{\alpha}$ with the corresponding element of a descriptor of image $\tilde{\beta}$. Since the descriptor comprises 128 elements, 128 multipliers are required to calculate these products in parallel. The multiplier outputs are then added to obtain the resulting dot-product by the Dot Product core, which is composed of 128 multiplication cores and seven levels of tree adders³. The multipliers of the Dot Product core were implemented in the FPGA's digital signal processing (DSP) slices with 3 pipeline stages. The seven levels of adders are equivalent to 7 pipeline stages. Figure 3 shows the internal architecture of the Dot Product core, with a total of 10 pipeline stages.

In order to implement the cosine-inverse in hardware, a coordinate rotation digital computer (CORDIC) core provided by Xilinx using System Generator for DSP [20] is used. The Cosine Inverse core is shown in Figure 4. It is composed of two CORDIC cores: one to calculate the square root of an input and another to find the polar coordinates of a feature, labeled in Figure 4 as Square Root and Polar Sys, respectively. The Square Root and Polar Sys cores have 37 and 11 internal pipeline stages, respectively.
The calculation of $1 - x^2$ as an input to the Square Root core is completed in 4 pipeline stages, for a total of 52 pipeline stages for the Cosine Inverse core alone.

The Polar Sys core within the Cosine Inverse core has two inputs, $u$ and $v$, of a Cartesian system, and two outputs, magnitude, $\rho$, and angle, $\theta$. The relation between the Cartesian system, $(u, v)$, and the polar system, $(\rho, \theta)$, is described for a right triangle with hypotenuse $\rho$ and sides $u$ and $v$ as $\rho = \sqrt{u^2 + v^2}$ and $\theta = \tan^{-1}(v/u)$. To obtain the arc-cosine, only $\theta$ is needed: $\theta = \cos^{-1} x$, where $x$ is the input to the Cosine Inverse core. The two inputs of the Polar Sys core, $(u, v)$, are thus obtained as $u = x$ and $v = \sqrt{1 - x^2}$.

Figure 3: Dot product core. (Stages shown: Dot Product Multiplication Stage, Sum of Products Stage 1 through Sum of Products Final Stage 7, Output.)

Figure 4: Internal depiction of the "Cosine Inverse" core, including square root calculation and the polar coordinate translation module. $Z^{-n}$ is a shift register with $n$ pipeline stages.

³ Seven levels of adders are required since $\log_2 128 = 7$, meaning for each tree we can compute a segment of the 128 elements.

After calculating the arc-cosine for each descriptor in the database, two minimum values are determined, which highlight the database descriptors that are potential candidates for similarity with the image descriptor under consideration. In the software approach to descriptor matching, the output values are stored in memory and then a sorting algorithm is applied to find the minimum and second minimum values. In the hardware approach to descriptor matching, a typical sorting algorithm is resource-inefficient due to its memory needs; thus, in our design, the MIN FIND core finds both the minimum and second minimum values on the fly. This is done by retrieving the previous minimum and second minimum values from MIN MEM and comparing them with the currently calculated cosine-inverse. The pseudo-code representing the hardware operation of the MIN FIND core is shown in Algorithm 2.
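The Dot Product and Cosine Inverse stages described above can be mimicked in software to make the structure concrete. The sketch below is a behavioral model only: it uses Python floats and `math.atan2` in place of the 16-bit fixed-point CORDIC cores, and the function names are our own. It shows the pairwise adder tree (seven levels for 128 products) and the identity $\cos^{-1} x = \tan^{-1}(\sqrt{1 - x^2} / x)$ evaluated through a Cartesian-to-polar conversion.

```python
import math

def adder_tree_sum(values):
    """Sum values by pairwise reduction; return (total, number_of_levels)."""
    level = list(values)
    levels = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd-sized levels with a zero operand
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        levels += 1
    return level[0], levels

def dot_product(a, b):
    """Element-wise products (parallel multipliers) followed by the adder tree."""
    products = [x * y for x, y in zip(a, b)]
    return adder_tree_sum(products)

def cosine_inverse(x):
    """Arc-cosine via the polar angle of the point (u, v) = (x, sqrt(1 - x^2))."""
    v = math.sqrt(max(0.0, 1.0 - x * x))  # square-root stage
    return math.atan2(v, x)               # polar-translation stage, angle only

total, levels = dot_product([1] * 128, [2] * 128)
print(total, levels)
print(cosine_inverse(0.5), math.acos(0.5))
```

For 128 inputs the reduction takes seven levels, matching the seven adder stages of the hardware tree, and `cosine_inverse` agrees with `math.acos` over the whole input range because the two-argument arctangent handles negative `x` correctly. With the stage counts given in the text, the modeled path has 3 + 7 = 10 pipeline stages for the dot-product and 4 + 37 + 11 = 52 for the cosine inverse.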
Algorithm 2: Comparison scheme for calculating the minimum and second minimum values to highlight database descriptors as potential candidates for similarity with the image descriptor.
Input: Output of the cosine inverse (curr_val), recent minimum value from the memory (prev_min), and recent second minimum value (prev_sec_min).
Output: Updated minimum value (min) and second minimum value (sec_min).
  if curr_val < prev_min then
    min = curr_val;
    sec_min = prev_min;
  else if curr_val < prev_sec_min then
    min = prev_min;
    sec_min = curr_val;
  else
    min = prev_min;
    sec_min = prev_sec_min;
  end if
  return min, sec_min

5.4. Match Check Core

The final step of the SIFT matching algorithm is to check whether an actual match occurs by passing the minimum and second minimum values to the Match Check core. The Match Check core applies Equation 2 to check for a match between the descriptor of image $\tilde{\alpha}$ and a descriptor of image $\tilde{\beta}$. The hardware design of the Match Check core consists of 3 pipeline stages and needs no multiplier. The multiplication of the second minimum by 0.6 is replaced by additions, since $(0.6)_d \approx (0.10011)_b$ can be used as a constant value. Therefore, minimum × $(100000)_b$ is compared with second minimum × $(10011)_b$ to check for a match between the two descriptors, as previously mentioned in Algorithm 1. Figure 5 shows the three pipelined stages of adder circuits that implement Equation 2, where the multiplication of a number by the constant $(10011)_b$ is done by adding the number to itself shifted left once, and then adding the result to the same number shifted left four times. The shifting is done by appending zero bits to the right side of the number.

Figure 5: Matching Check core.
6. Experimental Results and Evaluation
In order to evaluate the proposed hardware architecture for SIFT descriptor matching, we used a Xilinx® Zynq-7000-based Zedboard. The Zedboard has programmable logic and two ARM Cortex-A9 processor cores. The SIFT hardware matching core was implemented in the Zedboard's programmable logic, while a single ARM Cortex-A9 processor was used only for simulation within the Xilinx® Software Development Kit.
In order for the SIFT matching core to start processing, the descriptors $d^{\alpha}$ and $d^{\beta}$ of both images, $\tilde{\alpha}$ and $\tilde{\beta}$, should be ready at the input ports of the matching core. The descriptors for both images were initially stored on an SD card used with the Zedboard, whose contents were then loaded into the external DRAM by the on-board processing system (PS). To check the matches between two images, the PS initializes and facilitates the direct memory access (DMA) transfer of the provided descriptors from the DRAM to the descriptor buffers.

The PS then initiates the matching process with the matching core over the advanced extensible interface (AXI), whereupon the matching core is provided with the number of descriptor blocks, the number of descriptors per block, and a start signal. The Xilinx® EDA design suite was used to synthesize and implement the design, including the matching core, DMA, and FIFO buffers. The Xilinx® Software Development Kit was used to read the descriptors for both images from the SD card into memory and pass them to the fabric buffers/descriptor caches. The fabric clock of the Zedboard has a range of 100 kHz to 250 MHz; however, the AXI DMA has a maximum frequency of 150 MHz or 120 MHz for AXI4 and AXI4-Lite, respectively [21]. Due to the frequency limitations of the Zedboard's systems, all experiments were run at a nominal frequency of 100 MHz.
The SIFT matching algorithm was fully synthesized and implemented onto the Zed-board’s fabric with a 135 MHz maximum frequency with a normal clock-speed of 100MHz . The implemented SIFT matching core has only one computational element of the“Dot Product”, “Cosine Inverse”, “MIN FIND”, and “Match Check ” cores. The match-ing core includes a ”Control Unit”, additional internal memories for Descriptors and Min-imum(s) Caches, and other registers for pipelining and synchronizing the data flow of thealgorithm. Table 1 summarizes the overall resource utilization for individual components,with the “Others” category collecting registers and multiplexers used for synchronizing de-scriptors and control signals. Table 1: Our proposed SIFT matching algorithm architecture utilization report for Zedboard FPGA imple-mentation.
Core                           LUT    FF     DSP    BRAM
Dot-Product & Cosine Inverse   [...]
Minimum Search                 82     66     0      0
Matching Check                 42     283    0      0
Control Unit                   92     2132   0      0
Descriptor Cache               [...]
Minimum(s) Cache               [...]
Others                         112    327    0      0
Total                          3710   6365   132    30

In order to evaluate our SIFT matching core based on the cosine angle distance approach, several experiments were conducted. The experiments were run on four different images, image 1, image 2, image 3, and image 4, which have 579, 638, 882, and 1021 descriptors, respectively. We chose image 4 as the database image against which the other three images were checked for potential matches. To check the correctness of the matching points, image 4 was also matched against itself to test the self-matching ability of the SIFT matching core. Figure 6 shows the matching points between the selected images and the database.

Figure 6: Matching points for selected images with different numbers of descriptors. (a) Image 1 matching points (579 descriptors). (b) Image 2 matching points (638 descriptors). (c) Image 3 matching points (882 descriptors). (d) Image 4 matching points (1021 descriptors).

For comparison of the descriptor matching time with a traditional software approach, the SIFT matching algorithm was executed on a 64-bit Intel® Core 2 Duo CPU running at 3.16 GHz using MATLAB® on the same set of images. The two approaches scale differently: the software approach shows a quadratic increase in computational time with an increasing number of descriptors, compared with a linear increase for our proposed hardware approach.

Our SIFT matching core accelerated the computation for the selected images by (11.5 ∼ 15.7)× for (579 ∼ 1021) descriptors compared with the software implementation.

Footnote (on the 100 MHz clock): Used in this context due to limitations of the DMA core and AXI4-Lite provided by the Vivado toolset.

Table 2: Utilized resources of our architecture core vs. other implementations in the literature.
Parameter               [9]         [8]          [11]        Proposed
FPGA used               Virtex6     Cyclone IV   Zynq-7000   Zynq-7000
Image Size              512 × 384   640 × 480    512 × [...]
[...]                   [...]                    294 (75%)   30
Clock Frequency (MHz)   172         100          100         100

Table footnote: The authors reported 1697 kbits of memory, which is equivalent to 213 BRAM in the best case of memory utilization.
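To make the data flow through the four cores concrete, the following is a minimal floating-point software model of cosine angle matching. It is a sketch under stated assumptions, not the hardware design: the function names are hypothetical, the arithmetic is not 16-bit fixed-point, and the match check is assumed to be a nearest versus second-nearest ratio test with an illustrative threshold of 0.8, which this section does not specify.

```python
import math


def cosine_angle(a, b):
    """Angle in radians between two non-zero descriptors ("Dot Product" + "Cosine Inverse")."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def match_descriptors(query, database, ratio=0.8):
    """Return (query_index, database_index) pairs that pass the assumed ratio check."""
    matches = []
    for qi, q in enumerate(query):
        best, second, best_idx = math.inf, math.inf, -1
        for di, d in enumerate(database):
            angle = cosine_angle(q, d)
            # "MIN FIND": keep the two smallest angles seen so far.
            if angle < best:
                best, second, best_idx = angle, best, di
            elif angle < second:
                second = angle
        # "Match Check" (assumed criterion): nearest must clearly beat second nearest.
        if best_idx >= 0 and best < ratio * second:
            matches.append((qi, best_idx))
    return matches


# Tiny 2-D example; real SIFT descriptors have 128 elements.
database = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [[1.0, 0.05], [1.0, 0.9], [1.0, 0.4142]]
print(match_descriptors(query, database))
```

The third query vector lies almost exactly between two database vectors, so its two smallest angles are nearly equal and the ratio check rejects it, while the first two queries are accepted.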
7. Conclusions
In this paper, a fully pipelined accelerator for keypoint descriptor matching in the SIFT object recognition algorithm was designed and implemented on an FPGA. The matching core was constructed from four main computational sub-modules and two local caches. Utilizing a close construction and 16-bit fixed-point calculations helped alleviate memory bandwidth restrictions in order to achieve maximum throughput. An experimental system was designed on a Xilinx® Zedboard, where the matching core was implemented on the programmable fabric and the Zynq processing system initialized the matching process. Our proposed SIFT matching architecture consumes fewer resources and accelerates the matching process: 9.11, 6.75, and 6.08 milliseconds elapsed for calculating the matching points of 882, 638, and 579 descriptors, respectively, against an image of 1021 descriptors. Our hardware implementation additionally utilized 91% fewer LUTs and 79% fewer BRAMs compared with the state-of-the-art hardware matching core. Future work includes the extension of the hardware architecture into a fully pipelined vision system with an increased number of computation cores.
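As a rough consistency check on the reported times, one can estimate how many clock cycles each descriptor comparison takes. The sketch below assumes that the elapsed times were measured at the 100 MHz nominal clock and that every query descriptor is compared against all 1021 database descriptors; the function and constant names are illustrative.

```python
CLOCK_HZ = 100e6             # nominal fabric clock used in the experiments
DATABASE_DESCRIPTORS = 1021  # descriptors in the database image (image 4)

# (query descriptors, reported elapsed time in milliseconds)
REPORTED = [(882, 9.11), (638, 6.75), (579, 6.08)]


def cycles_per_pair(num_query, elapsed_ms,
                    num_db=DATABASE_DESCRIPTORS, clock_hz=CLOCK_HZ):
    """Clock cycles spent per descriptor pair, given total elapsed time."""
    pairs = num_query * num_db
    cycles = elapsed_ms * 1e-3 * clock_hz
    return cycles / pairs


for num_query, elapsed_ms in REPORTED:
    ratio = cycles_per_pair(num_query, elapsed_ms)
    print(f"{num_query} descriptors: {ratio:.3f} cycles per pair")
```

Under these assumptions all three cases come out slightly above one cycle per descriptor pair, which is consistent with a fully pipelined core that accepts one comparison per clock plus a small fixed overhead, and with the linear growth in matching time reported for the hardware.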
References

[1] D. G. Lowe, Object Recognition from Local Scale-Invariant Features, in: Proceedings of the Seventh IEEE International Conference on Computer Vision, Vol. 2, 1999, pp. 1150–1157.
[2] H. B.-S. Lee D-H, Lee D-W, Possibility Study of Scale Invariant Feature Transform (SIFT) Algorithm Application to Spine Magnetic Resonance Imaging, PLoS ONE 11 (4).
[3] Y. Jiang, Y. Xu, Y. Liu, Performance evaluation of feature detection and matching in stereo visual odometry, Neurocomputing 120 (2013) 380–390, Image Feature Detection and Description.
[4] J. Križaj, V. Štruc, N. Pavešić, Adaptation of SIFT Features for Face Recognition under Varying Illumination, in: The 33rd International Convention MIPRO, 2010, pp. 691–694.
[5] A. Cesetti, E. Frontoni, A. Mancini, A. Ascani, P. Zingaretti, S. Longhi, A Visual Global Positioning System for Unmanned Aerial Vehicles Used in Photogrammetric Applications, Journal of Intelligent & Robotic Systems 61 (1) (2011) 157–168.
[6] B. Kolman, D. R. Hill, Elementary Linear Algebra, Pearson Education, 2004.
[7] G. Qian, S. Sural, Y. Gu, S. Pramanik, Similarity Between Euclidean and Cosine Angle Distance for Nearest Neighbor Queries, in: Proceedings of the 2004 ACM Symposium on Applied Computing, SAC '04, ACM, New York, NY, USA, 2004, pp. 1232–1237.
[8] J. Vourvoulakis, J. Kalomiros, J. Lygouras, FPGA-based architecture of a real-time SIFT matcher and RANSAC algorithm for robotic vision applications, Multimedia Tools and Applications 77 (8) (2018) 9393–9415.
[9] G. Lentaris, I. Stamoulias, D. Soudris, M. Lourakis, HW/SW Codesign and FPGA Acceleration of Visual Odometry Algorithms for Rover Navigation on Mars, IEEE Transactions on Circuits and Systems for Video Technology 26 (8) (2016) 1563–1577.
[10] J. Wang, S. Zhong, L. Yan, Z. Cao, An Embedded System-on-Chip Architecture for Real-time Visual Detection and Matching, IEEE Transactions on Circuits and Systems for Video Technology 24 (3) (2014) 525–538.
[11] R. Kapela, K. Gugala, P. Sniatala, A. Swietlicka, K. Kolanowski, Embedded platform for local image descriptor based object detection, Applied Mathematics and Computation 267 (2015) 419–426, The Fourth European Seminar on Computing (ESCO 2014).
[12] M. Calonder, V. Lepetit, M. Ozuysal, T. Trzcinski, C. Strecha, P. Fua, BRIEF: Computing a Local Binary Descriptor Very Fast, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (7) (2012) 1281–1298.
[13] A. Alahi, R. Ortiz, P. Vandergheynst, FREAK: Fast Retina Keypoint, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 510–517.
[14] G. Condello, P. Pasteris, D. Pau, M. Sami, An OpenCL-based feature matcher, Signal Processing: Image Communication 28 (4) (2013) 345–350, Special Issue: VS&AR.
[15] H. Fassold, H. Stiegler, J. Rosner, M. Thaler, W. Bailer, A GPU-accelerated two stage visual matching pipeline for image and video retrieval, in: Content-Based Multimedia Indexing (CBMI), 2015 13th International Workshop on, IEEE, 2015, pp. 1–5.
[16] Xilinx Inc., ZC702 Evaluation Board for the Zynq-7000 XC7Z020 User Guide (September, 2015).
[17] L. Daoud, D. Zydek, H. Selvaraj, A Survey on Design and Implementation of Floating Point Adder in FPGA, in: Progress in Systems Engineering, Springer, 2015, pp. 885–892.
[18] L. Daoud, M. K. Latif, N. Rafla, SIFT Keypoint Descriptor Matching Algorithm: A Fully Pipelined Accelerator on FPGA, in: Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ACM, 2018, pp. 294–294.
[19] L. Daoud, D. Zydek, H. Selvaraj, A Survey of High Level Synthesis Languages, Tools, and Compilers for Reconfigurable High Performance Computing, in: Advances in Systems Science, Springer, 2014, pp. 483–492, DOI: 10.1007/978-3-319-01857-7_47.
[20] Xilinx Inc., Vivado Design Suite Reference Guide: Model-Based DSP Design Using System Generator (May, 2019).
[21] Xilinx Inc., AXI DMA v7.1: LogiCORE IP Product Guide (October, 2017).