Accelerating Discrete Wavelet Transforms on Parallel Architectures
David Barina, Michal Kula, Michal Matysek, Pavel Zemcik
Centre of Excellence IT4Innovations
Faculty of Information Technology
Brno University of Technology
Bozetechova 1/2, Brno, Czech Republic
{ibarina,ikula,imatysek,zemcik}@fit.vutbr.cz
ABSTRACT
The 2-D discrete wavelet transform (DWT) lies at the heart of many image-processing algorithms. Recently, several studies have compared the performance of this transform on various shared-memory parallel architectures, especially on graphics processing units (GPUs). All of these studies, however, considered only separable calculation schemes. We show that the corresponding separable parts can be merged into non-separable units, which halves the number of steps. In addition, we introduce an optional optimization approach that reduces the number of arithmetic operations. The discussed schemes were adapted to the OpenCL framework and pixel shaders, and then evaluated using GPUs of the two biggest vendors. We demonstrate the performance of the proposed non-separable methods by comparison with the existing separable schemes. The non-separable schemes outperform their separable counterparts on numerous setups, especially in the pixel shaders.
Keywords
Discrete wavelet transform, Image processing, Synchronization, Graphics processors
INTRODUCTION
The discrete wavelet transform has become a very popular image-processing tool in recent decades. The widespread use of this transform has resulted in the development of fast algorithms for all sorts of computer systems, including shared-memory parallel architectures. At present, the GPU is considered a typical representative of such parallel architectures. In this regard, several studies have compared the performance of various 2-D DWT computational approaches on GPUs. All of these studies are based on separable schemes, whose operations are oriented either horizontally or vertically. These schemes comprise the convolution and the lifting. The lifting requires fewer arithmetic operations than the convolution, at the cost of introducing some data dependencies. The number of operations should be proportional to transform performance. However, the data dependencies may also form a bottleneck, especially on shared-memory parallel architectures.
In this paper, we show that the fastest scheme for a given architecture can be obtained by fusing the corresponding parts of the separable schemes into new structures. Several new non-separable schemes are obtained in this way. More precisely, the underlying operations of these schemes can be associated with neither the horizontal nor the vertical axis. In addition, we present an approach in which each scheme can be adapted to a particular platform in order to reduce the number of operations. This possibility was completely omitted in existing studies. Our reasoning is supported by extensive experiments on GPUs using OpenCL and pixel shaders (fragment shaders in OpenGL terminology). The presented schemes are general, and they are not limited to any specific type of DWT. To clarify the situation, they all compute the same values.

The rest of this paper is organized as follows. Section Background formally introduces the problem definition. Section Related Work briefly presents the existing separable approaches. Section Proposed Schemes presents the proposed non-separable schemes. Section Optimization Approach discusses the optimization approach that reduces the number of operations. Section Evaluation evaluates the performance on GPUs in the pixel shaders and the OpenCL framework. Finally, Section Conclusions closes the paper. This section is followed by an Appendix for readers not familiar with signal-processing notations.

BACKGROUND
Since the separable schemes are built on the one-dimensional transform, the widely-used z-transform is used for the description of the underlying FIR filters. The transfer function of the filter (g_k) is the polynomial

G(z) = \sum_k g_k z^{-k},

where k refers to the time axis. Below in the text, the one-dimensional transforms are used in conjunction with two-dimensional signals. For this case, the transfer function of the filter (g_{k_m,k_n}) is defined as the bivariate polynomial

G(z_m, z_n) = \sum_{k_m} \sum_{k_n} g_{k_m,k_n} z_m^{-k_m} z_n^{-k_n},

where the subscript m refers to the horizontal axis and n to the vertical one. The G^*(z_m, z_n) = G(z_n, z_m) is the polynomial transposed to the polynomial G(z_m, z_n). A shortened notation G is written only to keep the notation readable.

A discrete wavelet transform is a signal-processing tool suitable for the decomposition of a signal into low-pass and high-pass components. In detail, the single-scale transform splits the input signal into two components, according to the parity of its samples. Therefore, the DWT is described by a 2x2 matrix built from two filters, G_0 and G_1. The transform can also be represented by the polyphase matrix

[ G_0(o)  G_0(e) ]
[ G_1(o)  G_1(e) ],   (1)

where the polynomials G(e) and G(o) refer to the even and odd terms of G. This polyphase matrix defines the convolution scheme. To avoid misunderstandings, it is necessary to say that, in this paper, column vectors are transformed to become other columns. For example, y = M x and y = M_2 M_1 x are transforms represented by one and two matrices, respectively. Following the algorithm by Sweldens [14, 4], the convolution scheme in (1) can be factored into a sequence

\prod_k [ 1  U^{(k)} ] [ 1        0 ]
        [ 0  1       ] [ P^{(k)}  1 ]   (2)

of K pairs of short filterings, known as the lifting scheme. The filters employed in (2) are referred to as the lifting steps.
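As a concrete illustration of the factorization in (2), the following sketch (our addition, not taken from the paper) implements the single-scale CDF 5/3 transform through its single pair of lifting steps, with P(z) = -1/2 (1 + z) and U(z) = 1/4 (1 + z^{-1}); periodic border handling stands in for the symmetric extension a real codec would use.

```python
import numpy as np

def cdf53_single_scale(x):
    """Single-scale 1-D CDF 5/3 DWT via the lifting scheme.

    One predict step P and one update step U (K = 1 pair), applied to
    the even/odd polyphase components of the input signal.
    """
    even = x[0::2].astype(float)
    odd = x[1::2].astype(float)
    # predict: odd[k] -= (even[k] + even[k+1]) / 2
    odd -= 0.5 * (even + np.roll(even, -1))
    # update: even[k] += (odd[k-1] + odd[k]) / 4
    even += 0.25 * (np.roll(odd, 1) + odd)
    return even, odd  # low-pass and high-pass components

def cdf53_inverse(even, odd):
    """Inverse transform: undo the lifting steps in reverse order."""
    even = even - 0.25 * (np.roll(odd, 1) + odd)
    odd = odd + 0.5 * (even + np.roll(even, -1))
    x = np.empty(even.size + odd.size)
    x[0::2], x[1::2] = even, odd
    return x
```

Each lifting step reads values produced by the previous step, which is exactly the data dependency that the barriers discussed below must protect when the samples are processed by independent units.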
Usually, the first step P^{(k)} in the k-th pair is referred to as the predict and the second one U^{(k)} as the update. The lifting scheme reduces the number of operations by up to half. Since this paper is mostly focused on a single pair of steps, the superscript (k) is omitted in the text below. Note that the number of operations is calculated as the number of distinct (in a column) terms of all polynomials in all matrices, excluding units on diagonals.

Considering the shared-memory parallel architectures, the processing of single or several samples is mapped to independent processing units. In order to avoid race conditions during data exchange, the units must use some synchronization method (barrier). In the lifting scheme, the barriers are required before the lifting steps. In the convolution scheme, the barrier is only required before starting the calculation. In this paper, the barriers are indicated by the | symbol. For example, M_2 | M_1 are two adjacent lifting steps separated by the barrier. For simplicity, the number of barriers is also called the number of steps in the text below.

The 2-D transform is defined as the tensor product of 1-D transforms. Consequently, the transform splits the signal into a quadruple of wavelet coefficients. Therefore, the 2-D DWT is described by 4x4 matrices

N_V | N_H |,

where N_H and N_V are the 1-D transforms in the horizontal and vertical directions. For the well-known Cohen-Daubechies-Feauveau (CDF) 9/7 wavelet, such as used in the JPEG 2000 standard, these matrices are graphically illustrated in Figure 1. Here, only the horizontal part is shown. Particularly, the filters in the figure are of sizes 9 and 7 taps. The four types of circles represent the quadruple of wavelet coefficients. Figures shown are for illustration purposes only.

Figure 1: Horizontal part of the separable convolution scheme for the CDF 9/7 wavelet. Two appropriately chosen pairs of matrix rows are depicted in separate subfigures.
The arrows point to the destination operand and denote a multiply-accumulate operation, with multiplication by a real constant. The arrows in the same row overlap.

Another scheme used for the 2-D transform is the separable lifting. Similarly to the previous case, the predict and update lifting steps can be applied in both directions sequentially. Moreover, the horizontal and vertical steps can be arbitrarily interleaved thanks to the linear nature of the filters. Therefore, the scheme is defined as

S_V^U | S_H^U | T_V^P | T_H^P |,

wherein the predict steps T always precede the update steps S. The above mapping corresponds to a single P and U pair of lifting steps. For multiple pairs, the scheme is separately applied to each such pair. In order to describe the 2-D matrices, the lifting steps must be extended into two dimensions as

[ G  ]   [ G(z_m, z_n)  ]   [ G(z_m) ]
[ G* ] = [ G*(z_m, z_n) ] = [ G(z_n) ].

Then, the individual steps are defined as

T_H^P = [ 1    0   0  0 ]      T_V^P = [ 1   0   0  0 ]
        [ P    1   0  0 ]              [ 0   1   0  0 ]
        [ 0    0   1  0 ]              [ P*  0   1  0 ]
        [ 0    0   P  1 ],             [ 0   P*  0  1 ],

S_H^U = [ 1  U  0  0 ]         S_V^U = [ 1  0  U*  0  ]
        [ 0  1  0  0 ]                 [ 0  1  0   U* ]
        [ 0  0  1  U ]                 [ 0  0  1   0  ]
        [ 0  0  0  1 ],                [ 0  0  0   1  ].

For the CDF wavelets, the matrices are also illustrated in Figure 2, again showing the horizontal part only.

Figure 2: The horizontal part of the separable lifting scheme for the CDF wavelets; (a) T_H^P, (b) S_H^U.
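The step structure above can be sketched as in-place kernels over the four polyphase components of an image. The snippet below is our addition: the lifting polynomials are reduced to illustrative constants (so the transposition G* has no effect), each step corresponds to one GPU kernel followed by a barrier, and the check confirms that horizontal and vertical steps commute, which justifies the arbitrary interleaving mentioned above.

```python
import numpy as np

# Four polyphase components (horizontal parity first): ee, oe, eo, oo.
# p is an illustrative constant predict coefficient.
p = -0.5
rng = np.random.default_rng(1)
comp = {k: rng.standard_normal((4, 4)) for k in ('ee', 'oe', 'eo', 'oo')}

def t_h(c):
    # horizontal predict T_H^P: on a GPU, one kernel and one barrier
    c['oe'] += p * c['ee']
    c['oo'] += p * c['eo']

def t_v(c):
    # vertical predict T_V^P
    c['eo'] += p * c['ee']
    c['oo'] += p * c['oe']

# The two predict steps commute; the update steps behave analogously.
a = {k: v.copy() for k, v in comp.items()}; t_h(a); t_v(a)
b = {k: v.copy() for k, v in comp.items()}; t_v(b); t_h(b)
print(all(np.allclose(a[k], b[k]) for k in comp))
```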
RELATED WORK
This section briefly reviews the papers that motivated our research. So far, several papers have compared the performance of the separable lifting and separable convolution schemes on GPUs. Notably, Tenllado et al. [15] compared these schemes on GPUs using pixel shaders. The authors mapped the data to 2-D textures constituted by four floating-point elements. They concluded that the separable convolution is more efficient than the separable lifting scheme in most cases. They further noted that fusing several consecutive kernels might significantly speed up the execution, even if the complexity of the resulting fused pixel program is higher.

Kucis et al. [8] compared the performance of several recently published schedules for computing the 2-D DWT using the OpenCL framework. All of these schedules use separable schemes, either the convolution or the lifting. In more detail, the work compares a convolution-based algorithm proposed in [5] against several lifting-based methods [2, 16] in the horizontal part of the transform. The authors concluded that the lifting-based algorithm of Blazewicz et al. [2] is the fastest method. Furthermore, van der Laan et al. [16] compared the performance of their separable lifting-based method against a convolution-based method. They concluded that the lifting is the fastest method. The authors also compared the performance of implementations in CUDA and pixel shaders, based on the work of Tenllado [15]. The CUDA implementation proved to be the faster choice. In this regard, the authors noted that the speedup in CUDA occurs because CUDA effectively makes use of on-chip memory. Such use is not possible in pixel shaders, which exchange data through off-chip memory. Other important separable approaches can be found in [11, 6, 13, 12].

This paper is based on our previous works [1, 9]. In those works, we introduced several non-separable schemes for the calculation of the 2-D DWT. However, we had not considered important structures, such as polyconvolutions.
We contribute this consideration with this paper. Moreover, the differences and similarities between the separable schemes and their non-separable counterparts are discussed here in a unified manner. All these schemes are also thoroughly analyzed and evaluated. Considering the existing papers, we see that a possible fusion of the separable parts into new non-separable structures has not been considered. Therefore, we investigate this promising technique in the following sections.

PROPOSED SCHEMES
As stated above, the existing approaches did not study the possibility of a partial fusion of the lifting polyphase matrices. This section presents three alternative non-separable schemes for the calculation of the 2-D transform. The contribution of this paper starts with this section. To avoid confusion, please note that the proposed schemes compute the same values as the original ones.

The non-separable convolution scheme is a counterpart to the separable convolution. Unlike the separable scheme, all horizontal and vertical calculations are performed in a single step

N |,

where N = N_V N_H is the product of the 1-D transforms in the horizontal and vertical directions. The drawback of this scheme is that it requires the highest number of arithmetic operations. For the CDF 9/7 wavelet, the matrix is graphically illustrated in Figure 3. Here, the 2-D filters are of sizes 9x9, 7x9, 9x7, and 7x7. These sizes make the calculation computationally demanding. Aside from the GPUs, this approach was earlier discussed by Hsia et al. [7].

Figure 3: Non-separable convolution scheme for the CDF 9/7 wavelet. The individual rows of N are depicted in separate subfigures. The sizes are, from top to bottom and left to right, 9x9, 7x9, 9x7, and 7x7.

In order to reduce the computational complexity, it would be a good idea to construct some smaller non-separable steps. Indeed, the non-separable convolution can be broken into smaller units, referred to here as the non-separable polyconvolutions. For a single pair of lifting steps, the scheme follows from the mapping

N^{P,U} |,

where

N^{P,U} = [ V*V  V*U  U*V  U*U ]
          [ V*P  V*   U*P  U*  ]
          [ P*V  P*U  V    U   ]
          [ P*P  P*   P    1   ]

and V = PU + 1. For the CDF wavelets, the scheme is graphically illustrated in Figure 4. In this case, the employed filters are of sizes 5x5, 3x5, 5x3, and 3x3. Note that, specifically for the CDF 9/7 wavelet, only half of the operations are required compared to the non-separable convolution. For the sake of completeness, it should be pointed out that it is also possible to formulate a separable polyconvolution scheme. In our experiments, however, this scheme did not prove to be useful concerning the performance.

Figure 4: Non-separable polyconvolution scheme for the CDF wavelets. The individual rows of N^{P,U} are depicted in separate subfigures.

By combining the corresponding horizontal and vertical steps of the separable lifting scheme, the non-separable lifting scheme is formed. The number of operations is slightly increased. The scheme consists of a spatial predict and a spatial update step, thus two steps in total for each pair of the original lifting steps. Formally, for each pair of P and U, the scheme follows from

S^U | T^P |,

where

T^P = [ 1    0   0  0 ]        S^U = [ 1  U  U*  U*U ]
      [ P    1   0  0 ]              [ 0  1  0   U*  ]
      [ P*   0   1  0 ]              [ 0  0  1   U   ]
      [ P*P  P*  P  1 ],             [ 0  0  0   1   ].

Note that the spatial filters PP* and UU* may be computationally demanding, depending on their sizes. However, the situation is always better than in the previous two cases. For the CDF wavelets, the scheme is graphically illustrated in Figure 5.

Figure 5: Non-separable lifting scheme for the CDF wavelets. Subfigures (a)-(b) depict T^P and (c)-(d) depict S^U.

OPTIMIZATION APPROACH
This section presents an optimization approach that reduces the number of operations, while the number of steps remains unaffected. Such an approach was not covered in the existing studies.

Regardless of the underlying platform, an important observation can be made. A very special form of the operations guarantees that the processing units never access the results belonging to their neighbors. These operations comprise only constants. Since the convolution is a linear operation, the polynomials can be pulled out of the original matrices and calculated in a different step. Formally, the original polynomials are split as P = P_0 + P_1 and U = U_0 + U_1, where P_0 and U_0 are constant.
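The scheme algebra above can be sanity-checked numerically. The sketch below is our illustration, not part of the original text: the lifting polynomials are reduced to illustrative constants (for which the transposition has no effect), and the check confirms that the fused spatial steps S^U T^P reproduce the polyconvolution matrix N^{P,U} with V = PU + 1.

```python
import numpy as np

p, u = -0.5, 0.25        # illustrative constant lifting coefficients
v = p * u + 1            # the auxiliary term V = PU + 1

# Separable lifting steps as 4x4 polyphase matrices (constant P, U).
T_H = np.array([[1, 0, 0, 0], [p, 1, 0, 0], [0, 0, 1, 0], [0, 0, p, 1]], float)
T_V = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [p, 0, 1, 0], [0, p, 0, 1]], float)
S_H = np.array([[1, u, 0, 0], [0, 1, 0, 0], [0, 0, 1, u], [0, 0, 0, 1]], float)
S_V = np.array([[1, 0, u, 0], [0, 1, 0, u], [0, 0, 1, 0], [0, 0, 0, 1]], float)

T = T_V @ T_H            # fused spatial predict step T^P
S = S_V @ S_H            # fused spatial update step S^U

# Non-separable polyconvolution matrix N^{P,U} for scalar P and U.
N = np.array([[v * v, v * u, u * v, u * u],
              [v * p, v,     u * p, u    ],
              [p * v, p * u, v,     u    ],
              [p * p, p,     p,     1    ]], float)

print(np.allclose(S @ T, N))  # the single-step scheme computes the same values
```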
As a next step, the constant parts P_0 and U_0 are substituted into the separable lifting scheme. The separable lifting scheme was chosen because it has the lowest number of operations. This part is illustrated in Figure 6. In contrast, the P_1 and U_1 are kept in the original schemes. These two steps are then computed without any barrier. The observation is further exploited to adapt the schemes to a particular platform.

Figure 6: Separable lifting scheme with the polynomials P_0 and U_0; subfigures (a) T_H^{P_0}, (b) T_V^{P_0}, (c) S_H^{U_0}, and (d) S_V^{U_0}.

Now, the improved schemes for the shaders and OpenCL are briefly described. These schemes exploit the above-described observation with the polynomials P_0 and U_0. On recent GPUs, the OpenCL schemes also omit memory barriers due to the SIMD-32 architecture. Note that the non-separable polyconvolution scheme makes sense only when K > 1, which is the case for the CDF 9/7 wavelet. Implementations in the pixel shaders map input and output data to 2-D textures. There is no possibility to retain some results in registers, and the results are exchanged through textures in off-chip memory. Considering the OpenCL implementations, the data format is not constrained. The image is divided into overlapping blocks, and on-chip memory shared by all threads in a block is utilized to exchange the results. Additionally, some results are passed in registers.

EVALUATION
This paper explores the performance for three frequently used wavelets, namely CDF 5/3, CDF 9/7 [3], and DD 13/7 [14]. Their fundamental properties, the number of steps and arithmetic operations, are listed in Table 1. Note that the number of operations is commonly proportional to transform performance. Additionally, the number of steps corresponds to the number of synchronizations on parallel architectures, which also form a performance bottleneck.

Table 1: The total number of steps and arithmetic operations for the optimized schemes.

(a) CDF 5/3
scheme                     steps  operations
                                  OpenCL  shaders
separable convolution      2      20      22
separable lifting          4      16      16
non-separable convolution  1      23      39
non-separable lifting      2      18      18

(b) CDF 9/7
scheme                     steps  operations
                                  OpenCL  shaders
separable convolution      2      56      58
separable polyconv.        4      40      56
separable lifting          8      32      32
non-separable convolution  1      152     200
non-separable polyconv.    2      46      62
non-separable lifting      4      36      36

(c) DD 13/7
scheme                     steps  operations
                                  OpenCL  shaders
separable convolution      2      60      60
separable lifting          4      32      32
non-separable convolution  1      203     228
non-separable lifting      2      50      50
The experiments in this paper were performed on GPUs of the two biggest vendors, NVIDIA and AMD, using OpenCL and pixel shaders. In these experiments, only the transform performance was measured, usually in gigabytes per second (GB/s). The host system does not help in the calculation, i.e. with respect to padding or pre/post-processing. Results for only two GPUs are shown for the sake of brevity: the AMD Radeon HD 6970 and the NVIDIA Titan X. Their technical parameters are summarized in Table 2.

Table 2: Specifications of the evaluated GPUs.

label             AMD 6970         NVIDIA Titan X
model             Radeon HD 6970   Titan X (Pascal)
multiprocessors   24               28
total processors  1 536            3 584
processor clock   880 MHz          1 417 MHz
performance       2 703 GFLOPS     10 157 GFLOPS
memory clock      1 375 MHz        2 500 MHz
bandwidth         176 GB/s         480 GB/s
on-chip memory    32 KiB           96 KiB
The implementations were created using DirectX HLSL and OpenCL. The HLSL implementation is used on the NVIDIA Titan X, whereas the OpenCL implementation is used on the AMD 6970. The results of the performance comparison are shown in Figures 7, 8, and 9. The value on the x-axis is the image resolution in kilo/megapixels (kpel or Mpel). Except for the convolutions for the DD 13/7 wavelet, the non-separable schemes always outperform their separable counterparts. For the CDF wavelets, which have short lifting filters, the non-separable (poly)convolutions achieve a better performance than the non-separable lifting scheme. Unfortunately, for the DD 13/7 wavelet, which is characterized by a high number of operations in the lifting filters, the results are not conclusive. Considering the implementation in pixel shaders, similar results were also achieved on other GPUs, including NVIDIA unified architectures and AMD GPUs based on the Graphics Core Next (GCN) architecture. For the OpenCL implementation, on the other hand, the non-separable schemes only proved to be useful on very long instruction word (VLIW) architectures.

Looking at the experiments with the pixel-shader implementations, some transients can be seen at the beginning of the plots (in the region below 2 Mpel). We concluded that these transients are caused by a suboptimal use of the cache system, or alternatively by some overhead of the graphics API. It would be interesting to show some measures provided by an OpenCL profiler. Our profiling revealed that the implementations exhibit an occupancy of only 95.24 %. This occupancy is caused by the use of 256 threads in the OpenCL work groups and by the maximal number of 1344 threads per multiprocessor (5 work groups of 256 threads is 1280 out of 1344).
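The occupancy figure quoted above follows directly from the two limits mentioned in the text; the short calculation below (our restatement of that arithmetic) makes the rounding to whole work groups explicit.

```python
group_size = 256       # threads per OpenCL work group
max_threads = 1344     # maximum resident threads per multiprocessor

# Only whole work groups can be resident, so 5 groups (1280 threads) fit.
resident = (max_threads // group_size) * group_size
occupancy = 100.0 * resident / max_threads
print(resident, f"{occupancy:.2f} %")   # 1280 95.24 %
```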
Figure 7: Performance for the CDF 5/3 wavelet; (a) OpenCL, (b) pixel shader. The plots compare the separable lifting and convolution with the non-separable lifting and convolution.

Figure 8: Performance for the CDF 9/7 wavelet; (a) OpenCL, (b) pixel shader. The plots compare the separable lifting, polyconvolution, and convolution with their non-separable counterparts.

Figure 9: Performance for the DD 13/7 wavelet; (a) OpenCL, (b) pixel shader. The plots compare the separable lifting and convolution with the non-separable lifting and convolution.
CONCLUSIONS
This paper presented and discussed several non-separable schemes for the computation of the 2-D discrete wavelet transform on parallel architectures, exemplarily on modern GPUs. As an option, an optimization approach leading to a reduction in the number of operations was presented. Using this approach, the schemes were adapted to the OpenCL framework and pixel shaders. The implementations were then evaluated using GPUs of the two biggest vendors. Considering OpenCL, the schemes exploit features of recent GPUs, such as warp-based execution. For the CDF wavelets, the non-separable schemes exhibit a better performance than their separable counterparts in both OpenCL and pixel shaders.

In the evaluation, we reached the following conclusions. Fusing several consecutive steps of the schemes might significantly speed up the execution, irrespective of their higher complexity. The non-separable schemes outperform their separable counterparts on numerous setups, especially considering the pixel shaders. All of the schemes are general, and they can be used with any discrete wavelet transform. In future work, we plan to focus on general-purpose processors and multi-scale transforms.
Acknowledgements
This work has been supported by the Ministry of Education, Youth and Sports from the National Programme of Sustainability (NPU II) project IT4Innovations excellence in science (no. LQ1602), and the Technology Agency of the Czech Republic (TA CR) Competence Centres project V3C – Visual Computing Competence Center (no. TE01020415).
APPENDIX
For readers who are not familiar with signal-processing notations, the relationship between polyphase matrices and data-flow diagrams is explained here. The 2-D discrete wavelet transform divides the image into four polyphase components. Therefore, the 4x4 matrix

T_H^P = [ 1  0  0  0 ]
        [ P  1  0  0 ]
        [ 0  0  1  0 ]
        [ 0  0  P  1 ]

maps four polyphase components onto another four components, while using two 2-D FIR filters represented by the polynomials P. Moreover, when we substitute a particular polynomial, say P(z) = -1/2 (1 + z^{-1}), into the matrix, the mapping gets a specific shape. Such a substitution is illustrated by the data-flow diagram in Figure 10. The solid arrows correspond to multiplication by -1/2.

Figure 10: Visual representation of the polyphase matrix T_H^P. The four polyphase components are represented by colored circles.
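As a code counterpart of the data-flow diagram in Figure 10, the sketch below (our addition) applies the horizontal predict step with the substituted P(z) = -1/2 (1 + z^{-1}) to the four polyphase components of an image; periodic borders keep the sketch short.

```python
import numpy as np

def predict_h(ee, oe, eo, oo):
    """Horizontal predict step T_H^P with P(z) = -1/2 (1 + z^-1).

    Each horizontally-odd sample accumulates -1/2 times its current and
    previous horizontally-even neighbor (z^-1 is a one-sample delay).
    """
    oe = oe - 0.5 * (ee + np.roll(ee, 1, axis=1))
    oo = oo - 0.5 * (eo + np.roll(eo, 1, axis=1))
    return ee, oe, eo, oo
```

On a constant image, the predicted (odd) components vanish, as expected of a predict step.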
REFERENCES
[1] Barina, D., Kula, M., and Zemcik, P. Parallel wavelet schemes for images. Journal of Real-Time Image Processing, in press. doi: 10.1007/s11554-016-0646-3.
[2] Blazewicz, M., Ciznicki, M., Kopta, P., Kurowski, K., and Lichocki, P. Two-Dimensional Discrete Wavelet Transform on Large Images for Hybrid Computing Architectures: GPU and CELL, pages 481–490. Springer, 2012. ISBN 978-3-642-29737-3. doi: 10.1007/978-3-642-29737-3_53.
[3] Cohen, A., Daubechies, I., and Feauveau, J.-C. Biorthogonal bases of compactly supported wavelets. Communications on Pure and Applied Mathematics, 45(5):485–560, 1992. ISSN 1097-0312. doi: 10.1002/cpa.3160450502.
[4] Daubechies, I. and Sweldens, W. Factoring wavelet transforms into lifting steps. Journal of Fourier Analysis and Applications, 4(3):247–269, 1998. ISSN 1069-5869. doi: 10.1007/BF02476026.
[5] Galiano, V., Lopez, O., Malumbres, M. P., and Migallon, H. Improving the discrete wavelet transform computation from multicore to GPU-based algorithms. In Int. Conf. on Computational and Mathematical Methods in Science and Engineering, pages 544–555, June 2011. ISBN 978-84-614-6167-7.
[6] Galiano, V., Lopez, O., Malumbres, M., and Migallon, H. Parallel strategies for 2D discrete wavelet transform in shared memory systems and GPUs. The Journal of Supercomputing, 64(1):4–16, 2013. ISSN 0920-8542. doi: 10.1007/s11227-012-0750-5.
[7] Hsia, C. H., Guo, J. M., Chiang, J. S., and Lin, C. H. A novel fast algorithm based on SMDWT for visual processing applications. In IEEE International Symposium on Circuits and Systems, pages 762–765, May 2009. doi: 10.1109/ISCAS.2009.5117860.
[8] Kucis, M., Barina, D., Kula, M., and Zemcik, P. 2-D discrete wavelet transform using GPU. In International Symposium on Computer Architecture and High Performance Computing Workshop, pages 1–6. IEEE, Oct. 2014. ISBN 978-1-4799-7014-8. doi: 10.1109/SBAC-PADW.2014.13.
[9] Kula, M., Barina, D., and Zemcik, P. New non-separable lifting scheme for images. In IEEE International Conference on Signal and Image Processing, pages 292–295. IEEE, 2016. ISBN 978-1-5090-2375-2. doi: 10.1109/SIPROCESS.2016.7888270.
[10] Mallat, S. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674–693, 1989. ISSN 0162-8828. doi: 10.1109/34.192463.
[11] Matela, J. GPU-based DWT acceleration for JPEG2000. In Annual Doctoral Workshop on Mathematical and Engineering Methods in Computer Science, pages 136–143, 2009. ISBN 978-80-87342-04-6.
[12] Quan, T. M. and Jeong, W.-K. A fast discrete wavelet transform using hybrid parallelism on GPUs. IEEE Transactions on Parallel and Distributed Systems, 27(11):3088–3100, Nov. 2016. ISSN 1045-9219. doi: 10.1109/TPDS.2016.2536028.
[13] Song, C., Li, Y., Guo, J., and Lei, J. Block-based two-dimensional wavelet transform running on graphics processing unit. IET Computers & Digital Techniques, 8(5):229–236, Sept. 2014. ISSN 1751-8601. doi: 10.1049/iet-cdt.2013.0141.
[14] Sweldens, W. The lifting scheme: A custom-design construction of biorthogonal wavelets. Applied and Computational Harmonic Analysis, 3(2):186–200, 1996. ISSN 1063-5203. doi: 10.1006/acha.1996.0015.
[15] Tenllado, C., Setoain, J., Prieto, M., Pinuel, L., and Tirado, F. Parallel implementation of the 2D discrete wavelet transform on graphics processing units: Filter bank versus lifting. IEEE Transactions on Parallel and Distributed Systems, 19(3):299–310, 2008. ISSN 1045-9219. doi: 10.1109/TPDS.2007.70716.
[16] van der Laan, W. J., Jalba, A. C., and Roerdink, J. B. T. M. Accelerating wavelet lifting on graphics hardware using CUDA. IEEE Transactions on Parallel and Distributed Systems, 22(1):132–146, 2011.