Computer Vision Accelerators for Mobile Systems based on OpenCL GPGPU Co-Processing
Guohui Wang · Yingen Xiong · Jay Yun · Joseph R. Cavallaro
Abstract
In this paper, we present an OpenCL-based heterogeneous implementation of a computer vision algorithm — an image inpainting-based object removal algorithm — on mobile devices. To take advantage of the computation power of the mobile processor, the algorithm workflow is partitioned between the CPU and the GPU based on profiling results on mobile devices, so that the computationally intensive kernels are accelerated by the mobile GPGPU (general-purpose computing using graphics processing units). By exploring implementation trade-offs and utilizing the proposed optimization strategies at different levels, including algorithm optimization, parallelism optimization, and memory access optimization, we significantly speed up the algorithm with the CPU-GPU heterogeneous implementation while preserving the quality of the output images. Experimental results show that heterogeneous computing based on GPGPU co-processing can significantly speed up computer vision algorithms and make them practical on real-world mobile devices.
Keywords
Mobile SoC · Computer vision · CPU-GPU partitioning · Co-processing · OpenCL
G. Wang and J. R. Cavallaro
Department of Electrical and Computer Engineering, Rice University, Houston, Texas 77005, USA
E-mail: [email protected], [email protected]

Y. Xiong and J. Yun
Qualcomm Technologies Inc., San Diego, California, USA

1 Introduction

Mobile devices have evolved rapidly and become ubiquitous over the past decade, giving rise to new application demands through the convergence of mobile computing, wireless communication, and digital imaging technologies. On one hand, as mobile processors have become more powerful and versatile over the past several years, we have witnessed rapid growth in the demand for computer vision applications running on mobile devices, such as image editing, augmented reality, and object recognition [3, 6, 11, 20, 21, 25, 30, 31]. On the other hand, with recent advances in the fields of computer vision and augmented reality, the emerging algorithms have become more complex and computationally intensive. The long processing times caused by this high computational complexity prevent these computer vision algorithms from being practically used in mobile applications.

To address this problem, researchers have been exploring general-purpose computing using graphics processing units (GPGPU) to speed up image processing and computer vision algorithms, taking advantage of the heterogeneous architecture of modern mobile processors [1, 2, 5, 8, 9, 12, 23, 24, 28]. On desktop computers and supercomputers, numerous programming models, such as the Compute Unified Device Architecture (CUDA) [18] and the Open Computing Language (OpenCL) [13, 15], have been extensively studied and used to facilitate parallel GPGPU programming. In contrast, due to the lack of parallel programming models in the mobile domain, the OpenGL ES (Embedded Systems) programming model was commonly used to harness the computing power of the mobile GPU [14]. However, OpenGL ES was originally designed for 3D graphics rendering, and its inherent limitations lead to poor flexibility and scalability, as well as limited parallel performance.
Recently, emerging programming models such as the OpenCL embedded profile [13] and RenderScript [7] have become supported by state-of-the-art mobile processors, making mobile GPGPU feasible on real-world mobile devices for the first time [21, 26, 27].

Fig. 1 Architecture of a typical mobile platform.

In this paper, we take the exemplar-based image inpainting algorithm for object removal as a case study to explore the capability of the mobile GPGPU to accelerate computer vision algorithms using OpenCL. The remainder of this paper is organized as follows. Section 2 introduces the architecture of modern mobile SoCs and the OpenCL programming model for mobile GPGPU. Section 3 briefly explains the exemplar-based inpainting algorithm for object removal. We analyze the algorithm workflow and propose a method to map the algorithm between the mobile CPU and GPU in Section 4. To adapt the complex algorithm to the limited hardware resources of mobile processors, we further study implementation trade-offs and optimization strategies to reduce the processing time in Section 5. Section 6 presents experimental results on a practical mobile device, which show that our optimized GPU implementation achieves significant speedup, enabling fast interactive object removal applications on a practical mobile device. Section 7 concludes the paper.
2 Mobile SoC Architecture and the OpenCL Programming Model

As shown in Fig. 1, a modern mobile SoC (system-on-chip) chipset typically consists of a multi-core mobile CPU, a mobile GPU with multiple programmable shaders, and an image signal processor (ISP) [10, 19, 22]. Unlike their desktop counterparts, the mobile CPU and GPU share the same system memory via a system bus. Mobile platforms also contain a variety of sensors and accelerators. Modern mobile platforms tend to employ heterogeneous architectures that integrate several application-specific co-processors to enable computationally intensive algorithms such as face detection. However, the space limitations and power constraints of mobile devices limit the number of integrated hardware co-processors, so it is preferable to seek general-purpose computing power inside the mobile processor itself. Mobile GPUs are suitable candidates for accelerating computationally intensive tasks due to their highly parallel architecture.

The lack of good parallel programming models has long been an obstacle to performing general-purpose computing on mobile GPUs. As a compromise, researchers used the OpenGL ES programming model for GPGPU to achieve performance improvements and energy efficiency on mobile devices during the past decade. For instance, Singhal et al. implemented and optimized an image processing toolkit on handheld GPUs [24]. Nah et al. proposed an OpenGL ES-based implementation of ray tracing, called MobiRT, and studied the CPU-GPU hybrid architecture [16]. Ensor et al. presented GPU-based image analysis on mobile devices, in which the Canny edge detection algorithm was implemented using OpenGL ES [5]. Researchers have also attempted to accelerate feature detection and extraction using mobile GPUs [1, 8, 23, 31]. Recently, the performance characteristics and energy efficiency of mobile CPU-GPU heterogeneous computing have been studied [2, 28].

Fig. 2 OpenCL programming model and hierarchical memory architecture.

Thanks to the unified programmable shader architecture and emerging parallel programming frameworks such as OpenCL, the new generation of mobile GPUs has gained true general-purpose parallel computing capability. OpenCL is a programming framework designed for heterogeneous computing across various platforms [13]. Fig. 2 shows the programming model and the hierarchical memory architecture of OpenCL. In OpenCL, a host processor (typically a CPU) manages the OpenCL context and offloads parallel tasks to several compute devices (for instance, GPUs). The parallel jobs are divided into work groups, each of which consists of many work items, the basic processing units that execute a kernel in parallel.
OpenCL defines a hierarchical memory model containing a large off-chip global memory with a long latency of several hundred clock cycles, and a small but fast on-chip local memory that can be shared by the work items in the same work group.

J Sign Process Syst (2014)

Fig. 3 Major steps of the exemplar-based image inpainting algorithm for object removal.

Fig. 4 An example of the object removal algorithm. The mask image indicates the object to be removed from the original image.

To efficiently and fully utilize the limited computation resources of a mobile processor and achieve high performance, partitioning the tasks between the CPU and GPU, exploiting the algorithmic parallelism, and optimizing the memory accesses all need to be carefully considered. Few prior works have studied the methodology of using OpenCL to program mobile GPUs and achieve speedup. Leskela et al. demonstrated a prototype of the OpenCL Embedded Profile emulated via OpenGL ES on mobile devices and showed advantages in performance and energy efficiency [12]. In our previous work, we explored the GPGPU capability of mobile processors to accelerate computer vision algorithms such as an image editing algorithm (object removal) and feature extraction based on SIFT (the scale-invariant feature transform) [26, 27], observing both performance improvements and reduced energy consumption.
3 Exemplar-Based Inpainting Algorithm for Object Removal

In this paper, we take the exemplar-based inpainting algorithm for object removal as a case study to show the methodology of using the mobile GPU as a co-processor to accelerate computer vision algorithms. The object removal algorithm involves raw image pixel manipulation, iterative image processing, sum of squared differences (SSD) computation, and so on, which are typical operations in many computer vision algorithms. The case study of the object removal implementation therefore represents a class of computer vision algorithms, including image stitching, object recognition, motion estimation, and texture analysis and synthesis. By studying and evaluating the performance of the exemplar-based object removal algorithm on mobile devices with CPU-GPU partitioning, the feasibility and advantages of using the mobile GPU as a co-processor can be demonstrated. Furthermore, the optimization techniques proposed in this paper can be applied to other computer vision algorithms with similar operation patterns or algorithm workflows.

Object removal is one of the most important image editing functions. As shown in Fig. 4, the key idea of object removal is to fill in the hole that is left behind after removing an unwanted object, generating a visually plausible result image. The exemplar-based inpainting algorithm for object removal preserves both structural and textural information by replicating patches in a best-first order, which yields good image quality for object removal applications [4, 29]. Meanwhile, the algorithm is computationally efficient thanks to its block-based sampling, which makes it especially attractive for a parallel implementation.

The major steps of the exemplar-based inpainting algorithm for object removal proposed by Criminisi et al. are depicted in Fig. 3 [4]. Assume we have a source image S with a target region Ω to be filled in after an object is removed (the empty region).
Table 1 Specification of the experimental setup.

Mobile SoC             Snapdragon 8974
CPU                    Krait 400, quad-core
  Max clock frequency  2.15 GHz
  Compute units        4
  Local memory         32 KB/compute unit
GPU                    Adreno 330
  Max clock frequency  400 MHz
  Compute units        4
  Local memory         8 KB/compute unit
Operating system       Android 4.2.2
Development toolset    Android NDK r9
Instruction set        ARM-v7a

The left image of Fig. 3 illustrates this setup. The source region is denoted as Φ (Φ = S − Ω), and the border of the object region is denoted as δΩ. Image patches are filled into the object region Ω one by one, in order of their priority values. Given an image patch Ψp centered at pixel p, for p ∈ δΩ, the priority value P(p) is defined as the product of two terms:

P(p) = C(p) · D(p),    (1)

in which C(p) is the confidence term indicating the amount of reliable information surrounding the pixel p, and D(p) is the data term representing the strength of the texture and structural information along the border δΩ in each iteration. C(p) and D(p) are defined as follows:

C(p) = ( Σ_{q ∈ Ψp ∩ Φ} C(q) ) / |Ψp|,    D(p) = |∇I⊥p · np| / α,    (2)

where |Ψp| is the area of Ψp, α is a normalization factor (for a typical grey-level image, α = 255), and np is a unit vector orthogonal to δΩ at the point p.

According to the priority values for all patches along the border δΩ of the target region, we select the candidate patch with the maximum priority value. Then, we search the image region Φ and find the patch Ψq̃ that best matches the patch Ψp (we call this step findBestPatch). The goal of findBestPatch is to find the best matching patch Ψq̃ among the candidate patches Ψq in the source region Φ for an object patch Ψp, based on a distance metric. The sum of squared differences (SSD) is typically used as the distance metric to measure the similarity between patches [4]. We denote the color value of a pixel x by Ix = (Rx, Gx, Bx). For an object patch Ψp, the best patch Ψq̃ is chosen by computing:

Ψq̃ = arg min_{Ψq ∈ Φ} d(Ψp, Ψq),    (3)

in which d(Ψp, Ψq) is defined over the corresponding pixel pairs of the two patches as follows:

d(Ψp, Ψq) = Σ_{p ∈ Ψp ∩ Φ, q ∈ Ψq ∩ Φ} (Ip − Iq)².    (4)
Assume that the size of the original image is M × N, and the size of the patch is P × P. The complexity of findBestPatch can then be estimated as O(M·N·P²). Based on our profiling (presented in Section 4), the findBestPatch compute kernel accounts for most of the computation in the whole object removal algorithm.

Once the best matching patch is found, we copy the pixel values of Ψq̃ into Ψp. The aforementioned search-and-copy process is repeated until the whole target region Ω is filled. More details of the algorithm can be found in reference [4].

4 Algorithm Workflow Analysis and Partitioning

In this section, we analyze the workflow of the object removal algorithm and describe the algorithm partitioning between the CPU and GPU to fully utilize the resources of the mobile SoC chipset.

4.1 Experimental Setup

The profiling and experiments are performed on a development platform based on a Qualcomm Snapdragon 8974 chipset [22], which supports the OpenCL Embedded Profile for both the CPU and GPU. The details of the experimental setup are listed in Table 1.

4.2 Algorithm Mapping

Fig. 5 shows the workflow of the exemplar-based inpainting algorithm for object removal. The algorithm can be partitioned into three stages: an initialization stage, an iterative computation stage, and a finalization stage. The blocks with slashed lines are core functions inside the iterative stage and represent most of the computational workload. We map the core functions to OpenCL kernels to exploit the 2-dimensional pixel-level and block-level parallelism in the algorithm. The CPU handles the OpenCL context initialization, memory object management, and kernel launching. By analyzing the algorithm, we partition the core functions into eight OpenCL kernels based on the properties of the computations, as shown in Table 2. In each OpenCL kernel, the fact that no dependency exists among image blocks allows us to naturally partition the tasks into work groups.
Fig. 5 Algorithm workflow of the exemplar-based object removal algorithm. Note that one block might be mapped to one or multiple OpenCL kernels depending on the computation and memory access patterns.
Table 2 Breakdown of execution time for the OpenCL kernel functions running only on the CPU.

Kernel function                                        Exec. time [s]   %
Convert RGB image to gray-scale image                  0.08             0.09%
Update border of the area to be filled                 0.60             0.69%
Mark source pixels to be used                          0.66             0.76%
Update pixel priorities in the filling area            0.45             0.52%
Update pixel confidence in the filling area            0.36             0.41%
Find the best matching patch                           84.4             97.0%
Update RGB and grayscale image of the filling patch    0.46             0.53%
Total time                                             87.0             100%

To represent color pixel values in the RGBA (red, green, blue, alpha) format, we use efficient vector data types such as cl_uchar4 to take advantage of the built-in vector functions of OpenCL.

To better optimize the OpenCL-based implementation, we first measure the timing of the OpenCL kernels. Table 2 shows the breakdown of processing time when running the program on a single core of the CPU of our test device. The kernel that finds the best match for the current patch (denoted findBestPatch) occupies most of the processing time (97%), so optimizing the findBestPatch kernel is the key to improving performance.
4.3 The findBestPatch Kernel Function

The mapping of the findBestPatch kernel used in the OpenCL implementation is shown in Fig. 6.

Fig. 6 Algorithm mapping of the findBestPatch kernel function using OpenCL.

To perform a full search for the best patch Ψq̃ matching the current filling patch Ψp in the findBestPatch OpenCL kernel, we spawn M × N work items, each computing an SSD value between two P × P patches. We partition these M × N work items into work groups according to the compute capability of the GPU. The number of 2-dimensional work groups can be expressed as

( ⌈M/localSize.x⌉, ⌈N/localSize.y⌉ ).    (5)

In our implementation, each work group contains 8 × 8 work items (localSize.x = 8, localSize.y = 8); therefore, each work group performs the SSD computations for 64 patch candidates. The parallel implementation of the SSD computation in the findBestPatch kernel function is detailed in Algorithm 1.
Table 3 Description of the images in the test dataset. These images are selected to represent different types of image scenes and different shapes of objects to be removed.

Image        Image type        Image size   Object type        Object size
WalkingMan   Complex scene     512 × 384    Blob               78 × …
…            …                 … × 639      Small blob         59 × …
…            …                 … × 384      Big/random shape   155 × …
…            …                 … × 550      Long strip         1024 × …

Fig. 7 The best patch mapping found by a full search, and the result images. First row: original images. Second row: masks covering the unwanted objects. Third row: best patch mapping; the small squares indicate the best patches found by the findBestPatch() function. Fourth row: result images.
Fig. 8 Diagram of the reduced search area with search factor α.

Algorithm 1
Compute SSD values between all candidate image patches and the image patch to be filled, using an OpenCL kernel function.
1. Input: (a) position of the object patch Ψp: (px, py); (b) patch size P.
2. Output: SSD array ssdArray.
3. Begin OpenCL kernel:
4.   Get the global ID of the current work item: (i, j);
5.   float sum = 0;
6.   int src_x, src_y;  // source pixel position
7.   int tgt_x, tgt_y;  // target pixel position
8.   int winsize = P/2;
9.   for (int h = −winsize; h ≤ winsize; h++)
10.    for (int w = −winsize; w ≤ winsize; w++)
11.      src_x = i + w; src_y = j + h;
12.      tgt_x = px + w; tgt_y = py + h;
13.      if ((src_x, src_y) or (tgt_x, tgt_y) is out of the image)
14.        continue;
15.      end if
16.      if (pixel (tgt_x, tgt_y) is inside the source region Φ)
17.        Read pixel (tgt_x, tgt_y) data into tgtData;
18.        Read pixel (src_x, src_y) data into srcData;
19.        sum += (tgtData − srcData)²;
20.      end if
21.    end for
22.  end for
23. Store sum into ssdArray;
24. End OpenCL kernel

5.3 Reducing the Search Area

Instead of searching the whole image, we can restrict the search to a neighborhood of the object region controlled by a search area factor α. Assume the width and height of the object area are w and h, respectively. The new search area is formed by expanding the object area by αh upward and downward, and by αw to the left and right, as shown in Fig. 8. The search area factor α has a range of 0 ≤ α < max(M/h, N/w). Assume the object region is centered at coordinate (o_x, o_y). Fig. 8 shows a typical case in which the whole search area lies inside the image. For more general cases, the boundaries of the new search area become:

B_left = max(0, o_x − (1/2 + α)·w),
B_right = min(N, o_x + (1/2 + α)·w),
B_top = max(0, o_y − (1/2 + α)·h),
B_bottom = min(M, o_y + (1/2 + α)·h).    (6)

By defining the search factor α this way, we can easily adjust the search area. Moreover, this method allows the search area to grow in all four directions with equal chance, increasing the possibility of finding a better patch. Since there are no useful pixels in the object area for patch matching, only the candidate patches outside the object region are compared with the object patch. The actual size of the search area (SA) can therefore be expressed as:

SA = (2α + 1)w · (2α + 1)h − wh = ((2α + 1)² − 1) · wh.    (7)

The complexity of findBestPatch can then be estimated as O(((2α + 1)² − 1) · wh · P²). Thus, when we reduce the search area (by reducing α), the complexity of searching for the best patch drops significantly.

Fig. 9 demonstrates the effect of reducing the search factor α. After reducing the search factor, the best matching patches are limited to a small region around the object region. Even when the search factor is reduced to as little as α = 0.05, we still get visually plausible results. Choosing a proper search factor α is therefore critical for practical applications to achieve good performance and efficiency. Based on our experiments, for a regular object region, a small search factor (on the order of the α = 0.05 used above) generates visually plausible results while greatly reducing the processing time.

In addition to the time reduction, reducing the search area can also reduce possible false matching. As a comparison metric, SSD can roughly represent the similarity of two patches, but it cannot accurately reflect the structural and color information embedded in the patches. Therefore, the patches with the best distance scores (SSD in this algorithm) may not be the best patches for filling the hole and generating visually plausible results, especially for complex scenes with diverse color information and structures. The algorithm can sometimes lead to false matching, in which the reported “best” patch
Fig. 9 The effect of reducing the search area. The search area factor α is defined as in Fig. 8.
Fig. 10 The relationship among the search area, the object area, and the overlap area for the best patch search.
Fig. 11 Impact of the increased patch size. The processing time of the 13 × 13 and 17 × 17 patches is normalized by the time of the 9 × 9 patch (the patch size used by Criminisi et al. [4]).

Under such circumstances, the artificial effects introduced by the false matching degrade the quality of the result images significantly. Fortunately, spatial locality can be observed in most natural images: the visually plausible matching patches (in terms of color and texture similarity) tend to reside in the area surrounding the candidate patch with high probability. By reducing the search area to a certain degree, we can reduce the possibility of false matching and therefore generate visually plausible result images.

5.4 Optimizing Patch Size

The object removal algorithm is iterative, with one object patch processed in each iteration, so roughly wh/P² iterations are needed to finish the whole process. Consequently, if we increase the patch size P, fewer iterations are needed to fill the object area, which may lead to a shorter processing time. Meanwhile, the complexity of the SSD computation, O(P²), becomes higher for each patch matching, which tends to increase the processing time. Therefore, it is not straightforward to determine the impact of increasing the patch size P on the overall complexity and performance.

We use Fig. 10 to help analyze the overall computation complexity. First, we assume the search factor α is defined as in the previous section, with the search area size SA as in (7). Second, because the patch size is P, any candidate patch within (P − 1) pixels of the object area Ω will partially overlap with the object area. We define this overlap area as OA, whose size can be calculated as:

OA = (2(P − 1) + w) · (2(P − 1) + h) − wh.    (8)

If the candidate patch lies outside the overlap area (that is, in the area SA − OA), the complexity of the SSD computation can be estimated as O(P²). When the candidate patch and the object area overlap (that is, the candidate patch is in the area OA), only the pixels Iq in the intersection of the candidate patch Ψq and the source region Φ (Iq ∈ Ψq ∩ Φ) are used in the computation, so the complexity can be estimated as O(kP²), where k is a constant representing the average ratio of the overlapping area. For a typical rectangular search area and object area, k can be approximated as 0.5. The overall computation complexity can therefore be calculated as in (9).

Experimental results show that the processing time can be reduced by increasing the patch size while reducing the search area. From Fig. 10, an intuitive explanation is that when we reduce the search area, more candidate patches overlap with the object region; the bigger the patches are, the more overlap there is, and the smaller the amount of required computation becomes. Thus, as we increase the patch size and reduce the search area, the processing time decreases. Equation (9) shows that for a bigger search factor α, the (2α + 1)² term dominates the complexity, and the change caused by an increased patch size is negligible. However, when α becomes smaller, the search area decreases; when SA becomes comparable to or even smaller than the object area Ω, the (2α + 1)² term becomes less dominant. Therefore, for a smaller search factor α, increasing the patch size P reduces the computation complexity. Experimental results shown in Fig. 11 verify the above analysis.
Fig. 12 The experimental results for an increased patch size (α = 0.05 in this experiment).
Complexity_overall ≈ O( (wh/P²) · SA · [ Prob(Ψq ∈ (SA − OA)) · O(P²) + Prob(Ψq ∈ OA) · O(kP²) ] )
= O( wh · [ SA − (1 − k) · OA ] )
= O( wh · [ ((2α + 1)² − 1)·wh − (1 − k)·((2(P − 1) + w)(2(P − 1) + h) − wh) ] )
= O( wh · [ ((2α + 1)² − k)·wh − (1 − k)·(2(P − 1) + w)(2(P − 1) + h) ] ).    (9)
Fig. 13 A detailed diagram showing how the 8 × 8 work items in one work group process adjacent candidate patches.

In Fig. 11, the processing time for patch sizes of 13 × 13 and 17 × 17 is normalized by the processing time of the 9 × 9 patch. For larger search factors α, the patch size has little influence on the processing time, while for smaller α, the larger patch sizes run measurably faster, as predicted by (9).

Table 4
Local memory usage for the “WalkingMan” image.

Patch size   Data          Local memory
9 × 9        Source data   1024 bytes
             Patch data    324 bytes
             Label data    324 bytes
             Total         1672 bytes
13 × 13      Source data   1600 bytes
             Patch data    676 bytes
             Label data    676 bytes
             Total         2952 bytes
17 × 17      Source data   2304 bytes
             Patch data    1156 bytes
             Label data    1156 bytes
             Total         4616 bytes

As shown in Fig. 13, the n-th work item works on the n-th candidate patch. Most of the pixels in adjacent candidate patches overlap; for instance, patch 1 and patch 2 share most of their image pixels, and only one column of pixels in each patch differs. Similarly, all adjacent candidate patches processed by the 8 × 8 work items of a work group overlap. For a P × P patch, a tile of (P + 8 − 1) × (P + 8 − 1) pixels is shared among the work items. Thus, we can load these pixels into the on-chip local memory to allow data sharing and avoid unnecessary accesses to the global memory.
1) pixels are actuallyshared among work items. Thus, we can load these pix-els into the on-chip local memory to allow data sharingand avoid unnecessary accesses to the global memory.In our OpenCL implementation, ( P + 8 − · sizeof N o r m a li z e d t i m e r a t i o Search factor α Patch w/o local memPatch w/o local memPatch w/ local memPatch w/ local mem P r o c e ss i n g t i m e ( s e c o nd s ) Search factor α Patch w/o local memPatch w/o local memPatch w/o local memPatch w/ local memPatch w/ local memPatch w/ local mem w/o local memoryw/ local memory Fig. 14
Fig. 14 Performance comparison between using and not using the local memory. Results for the “WalkingMan” image are shown; experiments on the other test images produce similar results.
In our OpenCL implementation, (P + 8 − 1)² · sizeof(cl_uchar4) bytes of source image data, P² · sizeof(cl_uchar4) bytes of patch data, and P² · sizeof(cl_int) bytes of patch pixel label data are loaded into the local memory. Table 4 lists the local memory usage for different patch sizes for the “WalkingMan” test image; the requirement for every patch size listed in Table 4 fits into the 8 KB local memory of the Adreno GPU. In addition, if we carefully design the method of loading data from the global memory to the local memory by data striping, we can coalesce the global memory accesses to further reduce latency. The parallel data loading from the global memory to the local memory is shown in Algorithm 2, in which coalesced global memory access is achieved.

Experimental results in Fig. 14 demonstrate the performance improvement gained by utilizing the local memory to enable data sharing between the work items inside the

Algorithm 2
Parallel data loading from the global memory to the local memory in the findBestPatch kernel function.
1. Get the global ID of the current work item: (gid.x, gid.y);
2. Get the local ID of the current work item: (lid.x, lid.y);
3. Get the local work group size: (lsize.x, lsize.y);
4. local_index = lid.y ∗ lsize.x + lid.x;
5. group_size = lsize.x ∗ lsize.y;
6. Calculate the local memory size: local_mem_size;
7. while (local_index < local_mem_size)
8.   Calculate the global memory address;
9.   if (local_index < P ∗ P)
10.    Calculate the patch data address;
11.    Load patchPixelLabel into the local memory;
12.    Load patchData into the local memory;
13.  end if
14.  Load srcData into the local memory;
15.  local_index += group_size;
16. end while
17. barrier(CLK_LOCAL_MEM_FENCE);
18. Start the SSD computation from here (as in Algorithm 1, but reading the needed data from the local memory).
17. barrier(CLK LOCAL MEM FENCE);18. Start SSD computation from here. (Similar to Algorithm1, just read needed data from local memory.) (a) (b) (c) (d)
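The strided loading loop of Algorithm 2 can be mirrored in plain C to show why the access pattern coalesces: on each loop iteration, the whole work group touches one contiguous run of global addresses. This is an illustrative sketch, not the authors' kernel; the per-work-item loops are serialized here so that the snippet runs on a CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the cooperative copy in Algorithm 2: "work item" lid starts at
 * index lid and advances by group_size, so on iteration k the group as a
 * whole reads the contiguous range [k*group_size, (k+1)*group_size) -- the
 * pattern that lets the hardware coalesce global memory reads. */
void cooperative_load(const int *global_mem, int *local_mem,
                      size_t local_mem_size, size_t group_size)
{
    for (size_t lid = 0; lid < group_size; lid++) {   /* each work item */
        for (size_t i = lid; i < local_mem_size; i += group_size) {
            local_mem[i] = global_mem[i];             /* strided load */
        }
    }
}
```

In the real kernel the outer loop disappears (each work item executes the inner loop in parallel), and the copy is followed by barrier(CLK_LOCAL_MEM_FENCE) before any work item consumes data loaded by its neighbors.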
Fig. 15
An interactive object removal demo on Android with the OpenCL acceleration: (a) original image; (b) a mask indicating the object area; (c) intermediate result; (d) final result image after iterative editing.

same work group. We observe on average a 30% reduction in processing time after using the local memory.
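The local-memory budget behind Table 4 can be sketched as a quick arithmetic check. The 8×8 work-group tile width and the exact buffer layout are illustrative assumptions rather than the authors' exact bookkeeping; the 8 KB limit is the Adreno figure quoted in the text.

```c
#include <assert.h>
#include <stddef.h>

enum { GROUP_DIM = 8,               /* assumed 8x8 work-group tiling     */
       LOCAL_MEM_BYTES = 8 * 1024   /* Adreno local memory size (8 KB)   */ };

/* Bytes of local memory needed per work group for a P x P patch search:
 * a (P + GROUP_DIM - 1)^2 tile of cl_uchar4 source pixels, plus P^2
 * cl_uchar4 patch pixels and P^2 cl_int patch pixel labels. */
size_t local_bytes_needed(size_t P)
{
    size_t tile = (P + GROUP_DIM - 1) * (P + GROUP_DIM - 1);
    return tile * 4u              /* cl_uchar4 source tile  */
         + P * P * 4u             /* cl_uchar4 patch pixels */
         + P * P * sizeof(int);   /* cl_int labels          */
}
```

Under this model, every patch size used in the experiments (9 × 9 up to 17 × 17) stays well below the 8 KB limit, consistent with the statement about Table 4.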
Experimental Results

We implemented the exemplar-based inpainting algorithm for object removal on a test platform based on the Snapdragon chipset, using OpenCL and the Android NDK [7, 13], and we applied the proposed optimization techniques discussed in Section 5. To demonstrate the efficiency and practicality of the proposed implementation, we developed an interactive OpenCL Android demo on the test platform. Fig. 15 shows screenshots of the demo application, in which an end user can draw a customized mask on the touchscreen to cover an unwanted object and then remove the object by pressing a button. The demo allows iterative editing, so the user can keep editing an image until a satisfying result is obtained.

Tables 5 to 8 list the processing times of the OpenCL-based implementations, including the CPU-only results (utilizing the multi-core CPU on the chip) and the CPU-GPU heterogeneous implementations. The OpenCL implementations with the proposed optimizations significantly outperform the serial implementation, and the CPU-GPU heterogeneous implementations further improve on the CPU-only implementations.

Table 5 shows the experimental results for the "WalkingMan" image, whose size is 512 × 384.
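On the host side, launching a kernel such as findBestPatch requires each global work size to be a multiple of the corresponding local work size (OpenCL 1.1 has no non-uniform work groups). A minimal rounding helper, assuming an 8 × 8 work group for illustration; the extra work items are masked off inside the kernel:

```c
#include <assert.h>
#include <stddef.h>

/* Round a global work size up to the next multiple of the local work size,
 * as required when enqueueing an NDRange under OpenCL 1.1. */
size_t round_up(size_t global, size_t local)
{
    return (global + local - 1) / local * local;
}
```

For example, a 234 × 311 search area with an 8 × 8 work group would be launched as a 240 × 312 NDRange.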
Table 5
Total processing time (in seconds) for the "WalkingMan" image, with OpenCL kernels running on the GPU. Within each group, the three columns give patch sizes 9 × 9, 13 × 13, and 17 × 17.

Search factor α | Search area | CPU-only (s) | CPU+GPU w/o local memory (s) | CPU+GPU w/ local memory (s)
Full | 512 × 384 | 23.26 / 22.48 / 23.68 | 20.37 / 19.66 / 19.15 | 12.18 / 11.59 / 11.12
2    | 382 × 384 | 18.08 / 16.97 / 18.30 | 16.32 / 14.93 / 14.52 | 9.31 / 8.82 / 8.42
1    | 234 × 311 | 13.37 / 9.91 / 9.74   | 8.17 / 7.80 / 7.61    | 5.34 / 4.69 / 4.94
0.5  | 156 × 248 | 11.27 / 9.80 / 8.20   | 5.29 / 4.24 / 3.95    | 3.93 / 2.93 / 2.43
0.2  | 109 × 176 | 7.13 / 5.20 / 3.79    | 2.95 / 2.25 / 1.90    | 2.26 / 1.59 / 1.33
0.05 | 86 × 139  | 6.02 / 4.73 / 3.85    | 2.12 / 1.65 / 1.45    | 2.04 / 1.52 / 1.25
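As a sanity check, the headline reductions quoted in the text follow directly from Table 5 and the 87-second serial baseline of Table 2. A small sketch:

```c
#include <assert.h>
#include <math.h>

/* Fractional reduction of time t relative to a baseline time. */
double reduction(double baseline, double t)
{
    return 1.0 - t / baseline;
}
```

Here reduction(87.0, 20.37) ≈ 0.766 (the 76.6% figure for the default heterogeneous configuration) and reduction(20.37, 1.25) ≈ 0.939 (the 93.9% figure with all optimizations applied).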
Table 6
Total processing time (in seconds) for the "River" image, with OpenCL kernels running on the GPU. Within each group, the three columns give patch sizes 9 × 9, 13 × 13, and 17 × 17.

Search factor α | Search area | CPU-only (s) | CPU+GPU w/o local memory (s) | CPU+GPU w/ local memory (s)
Full | … × 639   | 25.94 / 25.01 / 27.41 | 18.28 / 16.19 / 16.01 | 14.77 / 13.80 / 13.35
2    | 295 × 465 | 12.23 / 12.15 / 12.68 | 8.52 / 7.93 / 7.95    | 6.71 / 6.94 / 6.59
1    | 177 × 279 | 7.76 / 6.98 / 5.34    | 5.09 / 3.13 / 3.11    | 4.15 / 2.74 / 2.60
0.5  | 118 × 186 | 4.80 / 3.91 / 3.82    | 2.81 / 2.06 / 1.42    | 2.34 / 1.87 / 1.50
0.2  | 83 × 130  | 3.21 / 1.66 / 1.57    | 1.93 / 1.18 / 1.05    | 1.84 / 1.34 / 1.06
0.05 | 65 × 102  | 2.29 / 1.72 / 1.91    | 1.39 / 1.15 / 1.01    | 1.66 / 1.16 / 1.00
Table 7
Total processing time (in seconds) for the "Dive" image, with OpenCL kernels running on the GPU. Within each group, the three columns give patch sizes 9 × 9, 13 × 13, and 17 × 17.

Search factor α | Search area | CPU-only (s) | CPU+GPU w/o local memory (s) | CPU+GPU w/ local memory (s)
Full | … × 384   | 64.48 / 62.86 / 61.47  | 44.83 / 40.45 / 36.65 | 44.62 / 34.97 / 31.08
2    | 576 × 384 | 65.11 / 64.83 / 63.55  | 44.73 / 40.29 / 36.61 | 44.67 / 35.00 / 31.11
1    | 465 × 384 | 52.17 / 54.31 / 54.99  | 34.22 / 33.26 / 19.33 | 28.62 / 28.65 / 27.81
0.5  | 310 × 369 | 54.36 / 37.07 / 36.82  | 21.78 / 22.73 / 8.24  | 19.98 / 23.04 / 16.68
0.2  | 217 × 260 | 40.92 / 29.64 / 27.69  | 15.14 / 9.45 / 5.66   | 14.01 / 10.40 / 7.40
0.05 | 171 × 205 | 35.44 / 22.22 / 20.171 | 13.74 / 7.90 / 5.40   | 14.21 / 7.84 / 5.31

With the default parameter configuration (full search, patch size 9 × 9), the serial version of the implementation running on one CPU core takes 87 seconds to finish the processing (shown in Table 2), which is far too long for a practical mobile application. The fact that iterative editing is required under many circumstances makes the serial implementation even less practical. Table 5 shows the experimental results for the OpenCL-based parallel solutions. With the default parameter configuration, the multi-core CPU-only version reduces the processing time to 23.26 seconds, and the heterogeneous CPU-GPU implementation further reduces it to 20.37 seconds (a 76.6% time reduction compared to the 87-second serial implementation).

With all the proposed optimization techniques applied, we observe a significant speedup. With search factor α = 0.05, patch size 17 × 17, and the local memory enabled, the processing time is reduced to only 1.25 seconds, a 93.9% reduction compared to the 20.37 seconds for the default parameter configuration (full search, 9 × 9). According to the study by Niida et al. [17], users of mobile applications can tolerate an average processing time of several seconds for mobile services before they start to feel frustrated. By accelerating the object removal algorithm using heterogeneous CPU-GPU partitioning on mobile devices, we successfully reduce the run time, which makes these types of computer vision algorithms feasible in practical mobile applications.

We can draw similar conclusions from the timing results for the other test images, shown in Tables 6, 7, and 8. To demonstrate the effectiveness of the proposed optimization schemes, the speedups gained from the proposed optimization strategies are summarized in Table 9. We observe speedups from 8.44X to 28.3X with all the proposed optimizations applied to the heterogeneous OpenCL implementations.

Table 8
Total processing time (in seconds) for the "Hill" image, with OpenCL kernels running on the GPU. Within each group, the three columns give patch sizes 9 × 9, 13 × 13, and 17 × 17.

Search factor α | Search area | CPU-only (s) | CPU+GPU w/o local memory (s) | CPU+GPU w/ local memory (s)
Full | 1024 × 550 | 153.36 / 217.15 / 208.62 | 248.55 / 329.53 / 313.33 | 94.61 / 103.18 / 90.30
2    | 1024 × 50  | 21.07 / 15.86 / 12.14    | 20.77 / 21.49 / 20.88    | 18.9 / 11.54 / 12.22
1    | 1024 × 30  | 14.46 / 14.00 / 12.73    | 16.80 / 11.41 / 12.73    | 14.7 / 13.61 / 9.31
0.5  | 1024 × 20  | 12.88 / 14.15 / 12.09    | 15.61 / 12.09 / 12.67    | 15.9 / 11.28 / 8.91
0.2  | 1024 × 14  | 13.33 / 14.00 / 11.77    | 15.50 / 14.22 / 12.84    | 16.2 / 11.54 / 9.37
0.05 | 1024 × 12  | 11.52 / 15.46 / 11.18    | 15.60 / 17.11 / 13.38    | 15.9 / 13.34 / 8.79
Table 9
Speedup for OpenCL-based heterogeneous implementations with the proposed algorithmic and memory optimizations.

Image | Processing time w/o opt. (s), full search | Processing time w/ opt. (s), α = 0.05 | Speedup

Conclusion

The emerging heterogeneous architecture of mobile SoC processors, together with the support of parallel programming models such as OpenCL, enables general-purpose computing on the mobile GPU. As a case study, we present an OpenCL-based mobile GPU implementation of an object removal algorithm. We analyze the workload of the computationally-intensive kernels of the algorithm and partition the algorithm between the mobile CPU and GPU. Algorithm mapping methodology and optimization techniques for the OpenCL-based parallel implementation are discussed, and several implementation trade-offs are examined in light of the architectural properties of the mobile GPU. We perform experiments on a real mobile platform powered by a Snapdragon mobile SoC. The experimental results show that the CPU-GPU heterogeneous implementation reduces the processing time to 20.37 seconds, compared to 87 seconds for the single-thread CPU-only version. With the proposed optimization strategies, the processing time can be further reduced without degrading the visual image quality: when we apply the proposed optimizations (setting the search factor to α = 0.05 and increasing the patch size to 17 × 17), the processing time drops to 1.25 seconds.

Acknowledgment
This work was supported in part by Qualcomm and by the US National Science Foundation under grants CNS-1265332, ECCS-1232274, and EECS-0925942.
References
1. Bordallo Lopez M, Nykänen H, Hannuksela J, Silven O, Vehviläinen M (2011) Accelerating image recognition on mobile devices using GPGPU. In: Proc. of SPIE, vol 7872, p 78720R
2. Cheng KT, Wang Y (2011) Using mobile GPU for general-purpose computing - a case study of face recognition on smartphones. In: Proc. IEEE Int. Symp. VLSI Design, Automation and Test (VLSI-DAT), pp 1–4, DOI 10.1109/VDAT.2011.5783575
3. Clemons JL (2013) Computer architectures for mobile computer vision systems. Ph.D. Thesis, University of Michigan. URL http://hdl.handle.net/2027.42/97782
4. Criminisi A, Perez P, Toyama K (2003) Object removal by exemplar-based inpainting. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), vol 2, pp 721–728, DOI 10.1109/CVPR.2003.1211538
5. Ensor A, Hall S (2011) GPU-based image analysis on mobile devices. arXiv preprint arXiv:1112.3110
6. Fernández V, Orduna JM, Morillo P (2012) Performance characterization of mobile phones in augmented reality marker tracking. In: Int. Conf. Computational and Mathematical Methods in Science and Engineering (CMMSE), pp 537–549
7. Google (2013) Android Development Guide. URL http://developer.android.com/index.html
8. Hofmann R (2012) Extraction of natural feature descriptors on mobile GPUs. Thesis, Hochschulschriftenserver der Universität Koblenz-Landau
9. Hofmann R, Seichter H, Reitmayr G (2012) A GPGPU accelerated descriptor for mobile devices. In: IEEE Int. Symp. Mixed and Augmented Reality (ISMAR), pp 289–290
10. Imagination Technologies Limited (2013) PowerVR Graphics
11. Lee SE, Zhang Y, Fang Z, Srinivasan S, Iyer R, Newell D (2009) Accelerating mobile augmented reality on a handheld platform. In: Proc. IEEE Int. Conf. Computer Design (ICCD), pp 419–426
12. Leskela J, Nikula J, Salmela M (2009) OpenCL Embedded Profile prototype in mobile device. In: Proc. IEEE Workshop Signal Process. Syst. (SiPS), pp 279–284, DOI 10.1109/SIPS.2009.5336267
13. Munshi A (2010) The OpenCL Specification v1.1, the Khronos Group
14. Munshi A, Leech J (2009) The OpenGL ES 2.0 Specification, the Khronos Group
15. Munshi A, Gaster B, Mattson TG, Fung J, Ginsburg D (2011) OpenCL Programming Guide. Addison-Wesley
16. Nah JH, Kang YS, Lee KJ, Lee SJ, Han TD, Yang SB (2010) MobiRT: an implementation of OpenGL ES-based CPU-GPU hybrid ray tracer for mobile devices. In: ACM SIGGRAPH ASIA, p 50
17. Niida S, Uemura S, Nakamura H (2010) Mobile services - user tolerance for waiting time. IEEE Veh Technol Mag 5(3):61–67, DOI 10.1109/MVT.2010.937850
18. NVIDIA Corp (2013) CUDA toolkit v5.5. URL https://developer.nvidia.com/cuda-toolkit
19. NVIDIA Corporation (2013) NVIDIA Tegra mobile processor
20. Paucher R, Turk M (2010) Location-based augmented reality on mobile phones. In: IEEE Computer Society Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), pp 9–16
21. Pulli K, Baksheev A, Kornyakov K, Eruhimov V (2012) Real-time computer vision with OpenCV. Communications of the ACM 55(6):61–69
22. Qualcomm Inc (2013) Qualcomm Snapdragon Processor
23. Rister B, Wang G, Wu M, Cavallaro JR (2013) A fast and efficient SIFT detector using the mobile GPU. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Process. (ICASSP), pp 2674–2678
24. Singhal N, Park IK, Cho S (2010) Implementation and optimization of image processing algorithms on handheld GPU. In: Proc. IEEE Int. Conf. Image Processing (ICIP), pp 4481–4484, DOI 10.1109/ICIP.2010.5651740
25. Wagner D, Reitmayr G, Mulloni A, Drummond T, Schmalstieg D (2008) Pose tracking from natural features on mobile phones. In: Proc. Int. Symp. Mixed and Augmented Reality (ISMAR), pp 125–134
26. Wang G, Rister B, Cavallaro JR (2013) Workload analysis and efficient OpenCL-based implementation of SIFT algorithm on a smartphone. In: Proc. IEEE Global Conf. Signal and Information Processing (GlobalSIP), to appear
27. Wang G, Xiong Y, Yun J, Cavallaro JR (2013) Accelerating computer vision algorithms using OpenCL framework on the mobile GPU - a case study. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), pp 2629–2633
28. Wang YC, Cheng KTT (2012) Energy and performance characterization of mobile heterogeneous computing. In: IEEE Workshop on Signal Processing Systems (SiPS), pp 312–317
29. Xiong Y, Liu D, Pulli K (2009) Effective gradient domain object editing on mobile devices. In: Proc. Asilomar Conf. Signals, Systems and Computers (ASILOMAR), pp 1256–1260, DOI 10.1109/ACSSC.2009.5469959
30. Yang X, Cheng KT (2012) Accelerating SURF detector on mobile devices. ACM Multimedia. URL http://lbmedia.ece.ucsb.edu/resources/ref/acmmm12.pdf