Sparse Representation based Multi-sensor Image Fusion: A Review
Qiang Zhang, Yi Liu, Rick S. Blum, Fellow, IEEE, Jungong Han, and Dacheng Tao, Fellow, IEEE
Abstract—As a result of several successful applications in computer vision and image processing, sparse representation (SR) has attracted significant attention in multi-sensor image fusion. Unlike the traditional multiscale transforms (MSTs) that presume the basis functions, SR learns an over-complete dictionary from a set of training images for image fusion, and it achieves more stable and meaningful representations of the source images. By doing so, the SR-based fusion methods generally outperform the traditional MST-based image fusion methods in both subjective and objective tests. In addition, they are less susceptible to mis-registration among the source images, thus facilitating practical applications. This survey paper proposes a systematic review of the SR-based multi-sensor image fusion literature, highlighting the pros and cons of each category of approaches. Specifically, we start by performing a theoretical investigation of the entire system from three key algorithmic aspects: (1) sparse representation models; (2) dictionary learning methods; and (3) activity levels and fusion rules. Subsequently, we show how the existing works address these scientific problems and design the appropriate fusion rules for each application, such as multi-focus image fusion and multi-modality (e.g., infrared and visible) image fusion. Finally, we carry out some experiments to evaluate the impact of these three algorithmic components on the fusion performance when dealing with different applications. This article is expected to serve as a tutorial and source of reference for researchers preparing to enter the field or who desire to employ the sparse representation theory in other fields.
Index Terms—Image fusion, Sparse representation, Dictionary learning, Activity level
• Q. Zhang is with the Key Laboratory of Electronic Equipment Structure Design, Ministry of Education, Xidian University, China. He and Y. Liu are also with the Center for Complex Systems, School of Mechano-Electronic Engineering, Xidian University, Xi'an, Shaanxi 710071, China. Email: [email protected], yliu [email protected].
• R. S. Blum is with the Electrical and Computer Engineering Department, Lehigh University, Bethlehem, PA 18015, United States. Email: [email protected].
• J. Han is with the Department of Computer and Information Sciences, Northumbria University, Newcastle upon Tyne NE1 8ST, U.K. Email: [email protected]. (corresponding author)
• D. Tao is with the School of Information Technologies in the Faculty of Engineering and Information Technologies at the University of Sydney, J12/318 Cleveland St, Darlington, NSW 2008, Australia. Email: [email protected].

I. INTRODUCTION

Due to recent technological advancements, extensive varieties of imaging sensors have been employed in many applications, including remote sensing, medical imaging, video surveillance, machine vision and security. Thus, finding a way to most effectively utilize the information captured from these multiple sensors, possibly of different modalities, is of considerable interest. Image fusion provides one versatile solution, whereby multiple aligned images acquired by different sensors are merged into a composite image. A properly fused image is more informative than any of the individual input images and can thus better interpret the scene [1]. As a result, multi-sensor image fusion has always been an active research topic, facilitating a variety of vision-related applications.

To date, a large number of image fusion algorithms have been proposed [2], [3], [4], [5], [6], among which multiscale transform-based (MST) fusion methods are the most popular [7], [8], [9]. Traditional MST fusion methods are generally those based on pyramids [10] and wavelet transforms [11]. Recently developed fusion methods can be considered as their variations and extensions employing multiscale geometric analysis (MGA) tools, such as the Curvelet Transform [12], the Shearlet Transform [13] and the Nonsubsampled Contourlet Transform (NSCT) [14]. Thorough reviews of such methods can be found in [2], [7].

Sparse representation (SR) [15] has recently drawn significant interest in computer vision and image processing due to its enhanced performance in many applications, such as face recognition [15], action recognition [16] and object tracking [17]. The main idea of SR theory lies in the fact that an image signal can be represented as a linear combination of the fewest possible atoms or transform basis primitives in an over-complete dictionary. Sparsity means that only a small number of atoms are required to accurately reconstruct a signal, i.e., the coefficients become sparse. Over-completeness indicates that the number of atoms in the dictionary is larger than the dimension of the signal. Thus, a sufficient number of atoms in an over-complete dictionary permits an accurate sparse representation of signals [18].

Not surprisingly, SR has also attracted significant attention in the research field of image fusion [18], [19], [20], [21]. Similar to the traditional MST-based image fusion methods, most of the SR-based image fusion methods also belong to the transform-domain-based techniques.
However, there are two main differences between the SR-based and the traditional MST-based fusion methods [18], [19]:

1) The traditional MSTs usually fix their basis functions in advance for image analysis and fusion.
Due to the limitations of predefined basis functions, some significant features (e.g., edges) of the source images may not be well expressed and extracted, thereby dramatically degrading the fusion performance. In contrast, SR generally learns an over-complete dictionary from a set of training images, which captures intrinsic, data-driven image representations that tend to be domain agnostic. The over-complete dictionary contains richer basis atoms, allowing more meaningful and stable representations of the source images. By doing so, SR-based fusion methods generally outperform the traditional MST-based image fusion methods in both subjective and objective tests.

2) The traditional MST-based fusion methods are implemented in a multiscale manner, where the selection of the MST decomposition level becomes crucial and tricky. To ensure that spatial details can be extracted from the source images, the decomposition level is often set too large. In this case, one coefficient in the low-pass band has a great impact on a large set of pixels in the fused image. Accordingly, an error in the low-pass sub-band (mainly caused by noise or mis-registration between the source images) will lead to serious artificial effects [19]. The fusion of the high-pass sub-band coefficients is also sensitive to noise and mis-registration in this case. Consequently, the MST-based fusion methods are generally sensitive to mis-registration, impeding their usage in practical applications where a perfect spatial alignment of the different source images is unachievable. In contrast, the SR-based fusion methods are generally implemented in a patch-based way. More specifically, the source images are first divided into a number of patches of the same size, and the fusion is carried out at the patch level. Moreover, in order to reduce block artifacts and improve the robustness against mis-registration, a sliding window with a step length equal to a fixed number of pixels (e.g., one pixel) is often used in the SR-based fusion methods. In other words, these patches overlap by a fixed number of pixels along the horizontal and vertical directions. Generally, SR-based fusion methods are more robust to mis-registration than MST-based ones.

1. As discussed later, parts of the SR-based fusion methods belong to the spatial-domain-based methods.
Since Yang and Li [18] took the first step in applying SR theory to the image fusion field, a number of SR-based image fusion methods have been proposed. As shown in Fig. 1, the growing appeal of this research area can be observed from the steady increase in the number of scientific papers published in academic journals and magazines since 2010.

Fig. 1: Numbers of publications on SR-based fusion methods, obtained from the Web of Science indexing service.

The basic idea behind SR-based image fusion is that image signals can be represented as a linear combination of a "few" atoms from a pre-learned dictionary, and the sparse coefficients describe the salient features of the source images. As shown in Fig. 2, the main steps in most SR-based image fusion methods are: (a) segment the source images into overlapping patches and rewrite each of these patches as a vector; (b) perform sparse representation of the source image patches using pre-defined or learned dictionaries; (c) combine the sparse representations by some fusion rule; (d) reconstruct the fused image from its sparse representation.

The dictionaries employed in these methods may be directly obtained from some fixed basis (e.g., DCT or wavelets) [18]. They can also be learned from a set of auxiliary images (a global trained dictionary) [22] or from the input images themselves (an adaptively trained dictionary) [23] using learning methods such as K-SVD [24]. Sometimes, a pair of coupled dictionaries is even simultaneously learned from a high-spatial-resolution image and its spatially-degraded version; using the coupled dictionaries makes it possible to produce a fused image with higher spatial resolution [25], [26].

Different sparse representation models have been used in image fusion methods. They include: (1) the traditional SR model [15], in which a sparsity constraint (using the $\ell_0$-norm or $\ell_1$-norm) is imposed on the representation coefficients; (2) the non-negative SR model [27], in which sparsity and non-negativity constraints are jointly imposed on the representation coefficients; (3) the robust SR model [28], in which a sparsity constraint is imposed on the reconstruction errors as well as on the representation coefficients; (4) the group-sparsity SR model [29], in which the nonzero representation coefficients are forced to occur in clusters (called group-sparsity) rather than appear randomly; (5) the joint sparse representation (JSR) model [30], which assumes that different signals from various sensors observing the same scene form an ensemble; all signals in one ensemble share a common sparse component, and each has an individual sparse component.

When fusing the source image patches, the $\ell_0$- or $\ell_1$-norm of the representation coefficients [18] is generally used, possibly augmented with other information, to calculate the activity level [9], which measures the information contained in these representation coefficients that is deemed useful for the fusion. Statistical characteristics, such as the sparseness level [27] of the representation coefficients, might also be employed to determine the activity level during the fusion. The energy of the sparse reconstruction errors [28] has been used to determine the activity level when fusing multi-focus images.

Fig. 2: Diagram of the SR-based image fusion method. (Credit to [2])
With an activity level calculation defined, a maximum-selection or a weighted-averaging fusion rule can be employed to directly combine the source image patches, or to indirectly combine the representation coefficients of the source image patches [9]. If the representation coefficients are combined, the fused image is reconstructed using the pre-learned dictionary and the combined representation coefficients (called the transform-domain fusion method) [27], [29], [30], [31]. Otherwise, the fused image is directly obtained from the source image patches according to their activity levels (called the spatial-domain fusion method) [23], [28]. The preferred approach depends on the specific intended application (e.g., fusion of multi-focus images or multi-modality images).

Based on the above analysis, in this paper we review sparse representation (SR) image fusion methods from the following four key aspects: (1) sparse representation models; (2) dictionary learning methods; (3) activity levels and fusion rules; and finally, (4) applications to multi-focus image and multi-modality (e.g., infrared and visible) image fusion.
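To make steps (a)-(d) of Fig. 2 concrete, the following is a minimal Python sketch of the generic SR-based fusion pipeline. It assumes two registered grayscale sources and a pre-learned dictionary `D` of vectorized atoms; the max-$\ell_1$ rule, patch size, stride and sparsity level are illustrative choices, not settings prescribed by any particular method in this survey.

```python
# A minimal sketch of the generic SR-based fusion pipeline (steps (a)-(d)
# in Fig. 2). D is assumed pre-learned, with rows matching patch pixels.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def extract_patches(img, size=8, stride=1):
    """Slide a size x size window over img and vectorize each patch."""
    H, W = img.shape
    coords = [(i, j) for i in range(0, H - size + 1, stride)
                     for j in range(0, W - size + 1, stride)]
    P = np.stack([img[i:i + size, j:j + size].ravel() for i, j in coords], axis=1)
    return P, coords                      # P: (size*size) x num_patches

def fuse_sr(img_a, img_b, D, size=8, stride=1, k=5):
    Pa, coords = extract_patches(img_a, size, stride)
    Pb, _ = extract_patches(img_b, size, stride)
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
    Xa = omp.fit(D, Pa).coef_.T           # sparse codes, M x num_patches
    Xb = omp.fit(D, Pb).coef_.T
    # Max-L1 fusion rule: per position, keep the code with larger L1 norm.
    keep_a = np.abs(Xa).sum(axis=0) >= np.abs(Xb).sum(axis=0)
    Pf = D @ np.where(keep_a, Xa, Xb)     # reconstruct fused patches
    # Place the fused patches back, averaging the overlapping pixels.
    out = np.zeros(img_a.shape)
    cnt = np.zeros(img_a.shape)
    for idx, (i, j) in enumerate(coords):
        out[i:i + size, j:j + size] += Pf[:, idx].reshape(size, size)
        cnt[i:i + size, j:j + size] += 1
    return out / cnt
```

The overlap-and-average reconstruction at the end is what suppresses the block artifacts mentioned above when a one-pixel stride is used.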
As pointed out previously, multi-sensor image fusion has always been a hot research topic in the area of image processing, and a considerable number of publications emerge every year. The early reviews [2], [5], [9], [32], [33], which focus mainly on traditional MST-based [2], [9] or spatial-domain-based (e.g., patch-based) fusion methods, are outdated, as they miss important recent advances such as SR-based image fusion methods. In addition, most of them are limited to one single application of image fusion, such as multi-focus [9], medical [5] or remote sensing image fusion [32], [33]. In this paper, on the other hand, we thoroughly discuss the SR-based fusion methods as well as their applications to the fusion of both multi-focus and multi-modality images. Recently, some review papers have also appeared on sparse representation theory [34], [35] with the aim of explaining the mathematical and theoretical aspects of SR models, but they do not specifically discuss image fusion problems. To the best of our knowledge, there are no previous papers in which SR-based fusion methods are reviewed and evaluated. Therefore, it is desirable to provide a thorough survey of SR-based image fusion, which may be useful to a variety of audiences, ranging from image fusion learners intending to quickly grasp the current progress in this research area as a whole, to image fusion practitioners interested in applying SR methods to their own problems.
The rest of this paper is organized as follows. The available SR models are thoroughly reviewed in Section II. In Section III, dictionary learning methods are surveyed. In Section IV, the activity level calculations and fusion rules exploited in the literature for different applications are discussed. In Section V, the impact of the choices of the components presented in Sections II, III and IV on the fusion performance is examined. Finally, conclusions and suggestions for future work are provided in Section VI. Fig. 3 summarizes the structure of this paper.
We assume that the reader has some basic knowledge of linear algebra and optimization theory. Throughout the paper, a vector is denoted by a lowercase letter and a matrix by a capital letter. All the elements in a vector or a matrix are real-valued. Given a vector $x$ and a matrix $X$, the related notations used in this paper are listed in Table 1.

II. SPARSE REPRESENTATION MODELS

Since the traditional SR model [15] was first applied to multi-sensor image fusion, many of its extensions have also been applied to image fusion. For example, a non-negative sparse representation (NNSR) model was introduced for image fusion in [27]. Unlike the traditional SR model, which just imposes a sparsity constraint on the representation coefficients, the NNSR model imposes joint sparsity and non-negativity constraints on the representation coefficients. From the image patch encoding point of view, the interpretation of the NNSR model is more intuitive than that of the traditional SR model.
Fig. 3: Organization of this paper.

TABLE 1: List of vector and matrix related notations.
$x(i)$: the $i$-th entry of the vector $x$
$X(i,j)$: the $(i,j)$-th entry of the matrix $X$
$\|x\|_0$: $\ell_0$-norm of the vector $x$, i.e., the number of nonzero entries in $x$
$\|x\|_1$: $\ell_1$-norm of the vector $x$, $\|x\|_1 = \sum_i |x(i)|$
$\|x\|_2$: $\ell_2$-norm of the vector $x$, $\|x\|_2 = \sqrt{\sum_i x(i)^2}$
$\|X\|_0$: $\ell_0$-norm of the matrix $X$, i.e., the number of nonzero entries in $X$
$\|X\|_1$: $\ell_1$-norm of the matrix $X$, $\|X\|_1 = \sum_{i,j} |X(i,j)|$
$\|X\|_F$: Frobenius norm of the matrix $X$, $\|X\|_F = \sqrt{\sum_{i,j} X(i,j)^2}$
$\|X\|_{2,1}$: $\ell_{2,1}$-norm of the matrix $X$, $\|X\|_{2,1} = \sum_j \sqrt{\sum_i X(i,j)^2}$
$(\cdot)^T$: transpose of a vector or a matrix
$X^\dagger$: pseudo-inverse of the matrix $X$

Assuming the imaging sensors observe the same scene, the source images captured by these sensors are expected to possess common (or redundant) and complementary (distinct) features. Such ideas map well into the joint sparse representation (JSR) model [30], in which each sensor image from the same ensemble is automatically decomposed into a common component that is shared by all the images and an innovation component that describes the individual differences. As a result, the JSR model has attracted much attention in image fusion, especially in multi-modality image fusion.

In [28], a robust sparse representation (RSR) model was introduced to extract the detailed information in a set of multi-focus input images. The RSR model replaces the conventional least-squared reconstruction error with a so-called sparse reconstruction error. By using RSR, any multi-focus image can be decomposed into a fully-defocused image and a sparse but detailed image represented by the sparse reconstruction error. Distinct from traditional SR-based fusion methods, the reconstruction errors are employed instead of the usual sparse representation coefficients to guide the fusion process; the experimental results verify the superiority of this approach over the traditional SR-based methods.

In this section, we review some SR models that have been applied in multi-sensor image fusion. We start by introducing some specific concepts related to sparse representation, so that the reader can understand the basics of this theory. Then we extend these concepts to some more complex representation models.

A. Traditional sparse representation model

The sparse representation model relies on the assumption that many important signals can be represented, or approximately represented, as a linear combination of a "few" atoms from a redundant dictionary [19], [23]. That is, given such a redundant dictionary $D \in \mathbb{R}^{n \times M}$ ($n < M$) containing $M$ prototype $n$-dimensional signals, referred to as atoms, formed by the columns of the matrix $D$, a signal $y \in \mathbb{R}^n$ can be expressed as $y = Dx$ or $y \approx Dx$. The vector $x \in \mathbb{R}^M$ contains the coefficients that represent the signal $y$ in terms of the dictionary $D$. As the dictionary is redundant, the vector $x$ is not unique. Thus, the SR model was proposed as a method for determining the solution vector $x$ with the fewest non-zero components [23]. Mathematically, this can be achieved exactly (assuming negligible noise) or inexactly (considering noise) by solving the optimization problem

$$\min_x \|x\|_0 \quad \text{s.t.} \quad y = Dx, \qquad (1)$$

or

$$\min_x \|x\|_0 \quad \text{s.t.} \quad \|y - Dx\|_2 \le \varepsilon. \qquad (2)$$
The optimization of the above formulations is NP-hard and thus requires approximate techniques, such as the matching pursuit (MP) [36], orthogonal matching pursuit (OMP) [37] or simultaneous OMP (SOMP) [38] algorithms, to obtain solutions with low complexity. Based on recent developments in SR and compressed sensing, the non-convex $\ell_0$-minimization problems in (1) and (2) can be relaxed to the convex $\ell_1$-minimization problems [15], [39]

$$\min_x \|x\|_1 \quad \text{s.t.} \quad y = Dx, \qquad (3)$$

and

$$\min_x \|x\|_1 \quad \text{s.t.} \quad \|y - Dx\|_2 \le \varepsilon. \qquad (4)$$

Solutions can be obtained by using linear programming methods [15], [40].

B. Non-negative sparse representation model

Considering that properly scaled black and white images can be interpreted as images with positive entries, the authors of [27] introduced a non-negative sparse representation (NNSR) model and applied it to the fusion of infrared and visible light images. Different from the traditional SR model, which only emphasizes the sparsity constraint using the $\ell_0$-norm or $\ell_1$-norm, NNSR jointly imposes sparsity and non-negativity constraints on the representation coefficients. It can also be seen as an extension of the traditional non-negative matrix factorization [41] which adds a sparsity-inducing penalty.

Let $Y = [y_1, y_2, ..., y_N]$ be an observed non-negative data matrix of size $n \times N$ representing a set of $N$ source image patches, each column of which is a data vector (i.e., an image patch) $y_i \in \mathbb{R}^n$. Then, given a dictionary $D \in \mathbb{R}^{n \times M}$ with $M$ non-negative prototype atoms, the NNSR model coefficients can be obtained from

$$\min_{x_i} \sum_{i=1}^N \left( \|y_i - Dx_i\|_2^2 + \lambda \|x_i\|_1 \right) \quad \text{s.t.} \quad D \ge 0, \; x_i \ge 0, \; i = 1, 2, ..., N, \qquad (5)$$

where $x_i \in \mathbb{R}^M$ denotes the representation coefficients of the data $y_i$. Here, owing to the non-negativity, the $\ell_1$-norm of the vector $x_i$ is simply the sum of the components of $x_i$, and $\lambda$ refers to the regularization parameter. When $\lambda = 0$, NNSR reduces to non-negative matrix factorization. This problem can be simply and efficiently solved by the non-negative sparse coding algorithm [42].

2. Here, a matrix $D = [d_{i,j}]$ is called non-negative if each of its elements $d_{i,j}$ is non-negative. For simplicity, a non-negative matrix $D$ is denoted by $D \ge 0$.

Similar to the traditional SR model, NNSR can also encode the source images efficiently by using a few "active" components. In contrast, the non-negativity constraint makes the representation purely additive (allowing no subtractions), thus enabling NNSR to achieve an easy and intuitive interpretation of the encodings of the source images [27].
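For a fixed dictionary, the non-negative sparse coding step of (5) can be approximated with a simple projected ISTA iteration, as in the sketch below. This is a stand-in for the algorithm of [42], not a reproduction of it; `lam`, the step size and the iteration count are illustrative assumptions.

```python
# A minimal sketch of non-negative sparse coding for problem (5),
# written as projected ISTA for a fixed non-negative dictionary D.
import numpy as np

def nn_sparse_code(y, D, lam=0.1, n_iter=200):
    """Approximately solve min_x ||y - Dx||_2^2 + lam*||x||_1, x >= 0."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)           # gradient of the data-fit term
        # For x >= 0 the l1 penalty is linear, so the proximal step reduces
        # to subtracting lam/L and projecting onto the non-negative orthant.
        x = np.maximum(0.0, x - (grad + lam) / L)
    return x
```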
C. Joint sparse representation model

The term "joint sparsity", that is, the common sparsity of the entire signal ensemble, was first introduced in [43]. Three joint sparsity models (JSMs) for different situations were presented: JSM-1 (sparse common component + innovations), JSM-2 (common sparse supports) and JSM-3 (non-sparse common + sparse innovations). When different imaging sensors observe the same scene, the source images captured by the sensors are generally expected to possess both "common (or correlated)" and "innovation (or complementary)" information. Accordingly, it is not surprising that JSM-1 has been shown to be more suitable for many
image fusion applications, especially for the fusion of multi-modality images [30], when compared with JSM-2 and JSM-3.

In the JSM-1 (or JSM) model, all signals share a common component while each individual signal contains an innovation component. Let $Y_k \in \mathbb{R}^{n \times L}$ ($k = 1, 2, ..., K$) denote the $L$ signals of dimension $n$ from the $k$-th sensor, which can be represented as [30]

$$Y_k = Y_C + Y_{U_k} = D X_C + D X_{U_k}, \quad k = 1, 2, ..., K, \qquad (6)$$

where $Y_C = D X_C$ denotes the common component for all signals, and $Y_{U_k} = D X_{U_k}$ denotes the innovation component for the $k$-th individual signal. $D \in \mathbb{R}^{n \times M}$ ($n < M$) is an over-complete dictionary. $X_C$ and $X_{U_k} \in \mathbb{R}^{M \times L}$ are the sparse coefficient matrices for the common and innovation components, respectively. Let

$$Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_K \end{bmatrix} \in \mathbb{R}^{nK \times L}, \qquad (7)$$

$$D = \begin{bmatrix} D & D & 0 & \cdots & 0 \\ D & 0 & D & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ D & 0 & 0 & \cdots & D \end{bmatrix} \in \mathbb{R}^{nK \times (K+1)M}, \qquad (8)$$

$$X = \begin{bmatrix} X_C \\ X_{U_1} \\ \vdots \\ X_{U_K} \end{bmatrix} \in \mathbb{R}^{(K+1)M \times L}, \qquad (9)$$

where $0 \in \mathbb{R}^{n \times M}$ is a matrix of zeros. Under the assumed sparseness, the coefficients of the JSM model can be computed using [30], [44], [45]

$$\min_X \|X\|_0 \quad \text{s.t.} \quad \|Y - DX\|_F \le \varepsilon, \qquad (10)$$

where $\varepsilon \ge 0$ is the error tolerance. Similar to solving (3) in the traditional SR model, the joint sparse coefficient matrix $X$ of the JSM model in (10) can be obtained by using the previously discussed sparse approximation algorithms (e.g., the OMP algorithm [37]). Fig. 4 illustrates the common and complementary information obtained by using the JSR model, where Fig. 4(c) contains the common background information acquired by the two sensors, while Fig. 4(d) and (e) contain the complementary information between the two source images. In particular, the man behind the tree captured by the infrared imaging sensor is clearly displayed in Fig. 4(e).

Considering that the subspace spanned by the innovation component might not be the same as the subspace spanned by the common component, Zhang et al. [30] presented a generalized version of the JSM model. In the generalized JSM model, the signals from one ensemble are assumed to depend on two dictionaries, i.e., the common dictionary $D_C \in \mathbb{R}^{n \times M}$ and the innovation dictionary $D_U \in \mathbb{R}^{n \times M}$, instead of a single dictionary as in the JSM model. Accordingly, (6) and the dictionary matrix $D$ in (8) are extended in the generalized JSM model [30], respectively, to

$$Y_k = Y_C + Y_{U_k} = D_C X_C + D_U X_{U_k}, \quad k = 1, 2, ..., K, \qquad (11)$$

$$D = \begin{bmatrix} D_C & D_U & 0 & \cdots & 0 \\ D_C & 0 & D_U & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ D_C & 0 & 0 & \cdots & D_U \end{bmatrix} \in \mathbb{R}^{nK \times (K+1)M}. \qquad (12)$$

According to (10), the generalized JSM model can be solved by the same methods as those for the traditional SR and JSM models. In [30], the generalized JSM model is shown to be sometimes superior to the JSM model in terms of the ability to extract detailed information from the resulting image representations, but with little extra computational complexity.

Fig. 4: Illustration of the common and innovation information obtained by using the JSR model. (a) and (b) test images captured by two different sensors; (c) the common component between the two test images; (d) and (e) the innovation components of the test images in (a) and (b), respectively.
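The block structure of (7)-(9) is straightforward to materialize in code. The sketch below builds the stacked data matrix and block dictionary for $K$ sensors sharing one base dictionary `D` and codes the result with OMP, per (10); the sparsity level is an illustrative assumption.

```python
# Sketch of the stacked JSM-1 system in (7)-(9): the first block column
# holds D for every sensor (common component); the remaining K block
# columns are block-diagonal (one innovation block per sensor).
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def build_jsr_system(Ys, D):
    """Ys: list of K data matrices, each n x L; D: n x M base dictionary."""
    K = len(Ys)
    n, M = D.shape
    Y = np.vstack(Ys)                               # nK x L, as in (7)
    D_bar = np.zeros((n * K, (K + 1) * M))          # as in (8)
    for k in range(K):
        D_bar[k*n:(k+1)*n, 0:M] = D                 # common block
        D_bar[k*n:(k+1)*n, (k+1)*M:(k+2)*M] = D     # innovation block
    return Y, D_bar

def jsr_code(Y, D_bar, k_sparse=10):
    """Solve (10) column by column with OMP; rows 0..M-1 of the result
    are X_C, rows (k+1)M..(k+2)M-1 are X_Uk, as in (9)."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k_sparse, fit_intercept=False)
    return omp.fit(D_bar, Y).coef_.T                # (K+1)M x L
```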
D. Group sparse representation model

Most of the existing SR models mentioned previously assume that the non-zero coefficients appear randomly and do not consider the intrinsic structure of the signals. For that, Li et al. introduced a group sparse representation (GSR) model [29], in which a cluster-structured sparsity prior is incorporated and the non-zero elements are forced to occur in clusters (called group-sparsity), rather than appear randomly.

Let $G = \{G_1, G_2, ..., G_g\}$ be a partition of the index set $\{1, 2, ..., M\}$, where $g$ is the number of groups. Given a dictionary $D = [D_{G_1}, D_{G_2}, ..., D_{G_g}] \in \mathbb{R}^{n \times M}$, where $D_{G_i}$ denotes the sub-dictionary with columns identical to those of $D$ in group $G_i$, any signal $y \in \mathbb{R}^n$ can be represented as [29]

$$y = Dx = [D_{G_1}, D_{G_2}, ..., D_{G_g}] \left[ x_{G_1}^T, x_{G_2}^T, ..., x_{G_g}^T \right]^T, \qquad (13)$$

where $x = [x_{G_1}^T, x_{G_2}^T, ..., x_{G_g}^T]^T \in \mathbb{R}^M$ denotes the representation coefficients, and $x_{G_i}$ ($i = 1, 2, ..., g$) are the representation coefficients with respect to the sub-dictionary $D_{G_i}$. In the GSR model, the sparse representation coefficients are found from

$$\min_x \|x\|_{2,0} \quad \text{s.t.} \quad y = Dx \;\; \text{or} \;\; \|y - Dx\|_2 \le \varepsilon, \qquad (14)$$

where $\|x\|_{2,0} = \sum_{i=1}^g I(\|x_{G_i}\|_2)$, and $I(\cdot)$ is an indicator function, i.e.,

$$I(\|x_{G_i}\|_2) = \begin{cases} 1, & \text{if } \|x_{G_i}\|_2 > 0 \\ 0, & \text{otherwise} \end{cases}. \qquad (15)$$

Similarly, the non-convex $\ell_{2,0}$-minimization problem in (14) can be relaxed to the following convex $\ell_{2,1}$-minimization problem:

$$\min_x \|x\|_{2,1} \quad \text{s.t.} \quad y = Dx \;\; \text{or} \;\; \|y - Dx\|_2 \le \varepsilon, \qquad (16)$$

where $\|x\|_{2,1} = \sum_{i=1}^g \|x_{G_i}\|_2$. The GSR model can be effectively solved via the Group Orthogonal Matching Pursuit (GOMP) algorithm [46].

Fig. 5 illustrates the representation coefficients obtained by using the SR model and the GSR model. In the GSR model, a dictionary containing 8 sub-dictionaries (i.e., $g = 8$ in (13)) is employed. As shown in Fig. 5(b) and (d), the coefficients obtained by using the SR model are sparsely and randomly distributed along the entire horizontal axis. In contrast, the coefficients obtained by using the GSR model are sparsely located at only a few segments along the horizontal axis, as shown in Fig. 5(c) and (e). This demonstrates that in the GSR model each local patch can be well reconstructed by using only a few sub-dictionaries, instead of a few random dictionary atoms.

Fig. 5: Illustration of GSR coefficients. (a) Test image; (b) and (c) SR coefficients and GSR coefficients for the red rectangle patch in (a), respectively; (d) and (e) SR coefficients and GSR coefficients for the white rectangle patch in (a), respectively.
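A block-greedy pursuit in the spirit of GOMP [46] can be sketched as follows: at each step the sub-dictionary most correlated with the residual is selected, and the signal is then re-fit on the union of selected groups. The group definitions and the stopping rule below are illustrative assumptions, not the exact algorithm of [46].

```python
# A minimal block-greedy pursuit for the GSR model in (16).
import numpy as np

def group_omp(y, D, groups, n_groups=3):
    """groups: list of index arrays partitioning the columns of D."""
    residual = y.copy()
    selected = []
    for _ in range(n_groups):
        # Pick the group whose atoms best explain the current residual.
        scores = [np.linalg.norm(D[:, g].T @ residual) for g in groups]
        best = int(np.argmax(scores))
        if best not in selected:
            selected.append(best)
        idx = np.concatenate([groups[g] for g in selected])
        # Least-squares refit on the union of the selected groups.
        coef, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
        residual = y - D[:, idx] @ coef
    x = np.zeros(D.shape[1])
    x[idx] = coef
    return x          # nonzeros occur only inside the selected groups
```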
E. Robust sparse representation model

As discussed previously, the traditional SR, NNSR, JSR and GSR models impose either an $\ell_0$-norm or $\ell_1$-norm minimization on the representation coefficients to achieve a sparse representation of a signal, while imposing an $\ell_2$-norm minimization on the reconstruction errors (e.g., the component $\|y_i - Dx_i\|_2^2$ in (5)). These approaches work well for signals with small levels of Gaussian noise. However, if the signal contains non-Gaussian noise or is corrupted by sparse but strong "outliers", they may not achieve a satisfactory result [15].

5. In fact, the problems in (3) and (4) are equivalent to the following problem: $\min_x \|y - Dx\|_2^2 + \lambda \|x\|_1$. Thus, the traditional SR model also imposes an $\ell_2$-norm minimization on the reconstruction errors.

In [28], Zhang and Levine presented a robust sparse representation (RSR) model by imposing sparse constraints on the reconstruction errors as well as on the representation coefficients. More specifically, let $Y = [y_1, y_2, ..., y_N]$ be an observed data matrix of size $n \times N$, each column of which is a data vector $y_i \in \mathbb{R}^n$. Further, suppose the observed data $Y$ is partially corrupted by errors or noise $E \in \mathbb{R}^{n \times N}$. Then, given a dictionary $D \in \mathbb{R}^{n \times M}$ with $M$ prototype atoms, the coefficients of the RSR model are assumed to follow [28]

$$\min_{X,E} \|X\|_1 + \lambda \|E\|_1 \quad \text{s.t.} \quad Y = DX + E, \qquad (17)$$

where the matrix $X \in \mathbb{R}^{M \times N}$ denotes the sought-after matrix of coefficients, and each of its columns $x_i \in \mathbb{R}^M$ denotes the sparse coefficient vector for the data $y_i$. $\lambda > 0$ is a parameter used to balance the effects of the two components in (17). The optimization problem in (17) is convex and can be solved by various methods. In [28], the authors used the linearized alternating direction method with adaptive penalty (LADMAP) [47], [48] to solve this problem because of its high efficiency.

Here, we perform an experiment to demonstrate the robustness of the RSR model to non-Gaussian noise or sparse "outliers". Similar to [15], we select half of the images in the Extended Yale B database for training and the rest for testing. In the experiment, the pixel intensities of the original images are used as features and stacked as columns of the dictionary matrix $D$ and the data matrix $Y$. Then the representation coefficient matrix $X$ and the reconstruction error matrix $E$ are obtained by solving (17).

As shown in Fig. 6, the images reconstructed by the RSR model are superior to those reconstructed by the traditional SR model. For example, there are some ghosts near the eye regions labeled by a green rectangle in Fig. 6(b1), reconstructed using the traditional SR model. This phenomenon looks more severe in Fig. 6(b2). In contrast, these ghosts are greatly reduced in the images reconstructed by the RSR model, as shown in Fig. 6(c1) and 6(c2). This also demonstrates that the RSR model is more robust to non-Gaussian noise or sparse "outliers" than the traditional SR model.

In order to effectively extract and utilize multiple features for each local image patch during the fusion process, Zhang and Levine generalized the RSR model to multi-task sparsity pursuit and presented a multi-task RSR (MRSR) model [28]. In MRSR, the multi-task sparsity pursuit is achieved by enforcing a joint sparsity constraint on the reconstruction errors across all the tasks.

Let $Y_k = [y_{k,1}, y_{k,2}, ..., y_{k,N}] \in \mathbb{R}^{n_k \times N}$ ($k = 1, 2, ..., K$) consist of $K$ feature matrices for $K$ different types of features. The vector $y_{k,i} \in \mathbb{R}^{n_k}$ denotes the $k$-th type of feature, of dimension $n_k$, for the $i$-th image patch. Correspondingly, the columns $y_{k,i} \in \mathbb{R}^{n_k}$ ($k = 1, 2, ..., K$) in these matrices with the same index $i$ and different $k$ denote different types of features for the same $i$-th image patch. $N$ denotes the total number of patches in the image under consideration. Then the MRSR coefficients are assumed to satisfy [28]

$$\min_{X_k, E_k} \sum_{k=1}^K \|X_k\|_1 + \lambda \|E\|_{2,1} \quad \text{s.t.} \quad Y_k = D_k X_k + E_k, \quad k = 1, 2, ..., K, \qquad (18)$$

where $D_k \in \mathbb{R}^{n_k \times M_k}$ is a dictionary with $M_k$ prototype atoms for the $k$-th type of feature. $X_k \in \mathbb{R}^{M_k \times N}$ and $E_k \in \mathbb{R}^{n_k \times N}$ denote the SR coefficients and the reconstruction errors for the $k$-th feature matrix $Y_k$, respectively. The joint error matrix $E$ is formed by vertically concatenating the matrices $E_1, E_2, ..., E_K$.
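Because $Y = DX + E$ can be rewritten column-wise as $y = [D, (1/\lambda)I]\,w$ with $w = [x; \lambda e]$, so that $\|w\|_1 = \|x\|_1 + \lambda\|e\|_1$, a relaxed, unconstrained version of (17) can be handed to any off-the-shelf $\ell_1$ solver. The sketch below uses this substitution with scikit-learn's Lasso as a simple stand-in for the LADMAP solver of [28]; `alpha` and `lam` are illustrative assumptions.

```python
# Sketch of a relaxed RSR (17) for a single column y, via an augmented
# dictionary [D, (1/lam)*I] and a plain Lasso (l1-penalized) solver.
import numpy as np
from sklearn.linear_model import Lasso

def rsr_code(y, D, lam=1.0, alpha=0.01):
    n, M = D.shape
    A = np.hstack([D, np.eye(n) / lam])       # augmented dictionary
    w = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000).fit(A, y).coef_
    x, e = w[:M], w[M:] / lam                 # undo the scaling on e
    return x, e                               # sparse codes and sparse error
```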
Fig. 6: Reconstructed results for images with occlusions. (a1) and (a2) are occluded test images of the first subject in the Extended Yale B database with 23% and 61% occlusion, respectively; (b1) and (b2) are reconstructed images using the dictionary atoms from the first subject and their corresponding SR coefficients for (a1) and (a2), respectively; (c1) and (c2) are reconstructed images using the dictionary atoms from the first subject and their corresponding RSR coefficients for (a1) and (a2), respectively; (d1) and (d2) indicate the RSR reconstruction errors for (a1) and (a2), respectively.

As discussed in [28], [49], the corresponding columns in the matrices $E_1, E_2, ..., E_K$ with the same index will be compelled to have similar magnitudes by imposing the $\ell_{2,1}$-norm minimization on the matrix $E$. As for the RSR model, the optimization problem of MRSR can also be solved using LADMAP [47], [48].

F. Summary of SR models

A close look at the aforementioned algorithms reveals that the essential difference among the SR models discussed above is where they apply the constraints: on the representation coefficients, on the reconstruction errors, or on both. It can also be noticed that the traditional SR, NNSR, JSR and GSR models impose different constraints on the representation coefficients but the same least-squared minimization constraint on the reconstruction errors. These SR models can thus be called least-squared-error-based models. In contrast, the RSR model replaces the conventional least-squares reconstruction error with a so-called sparse reconstruction error. Therefore, the RSR and MRSR models can be called sparse-error-based models.

Compared with the least-squared-error-based SR methods, using the sparse error significantly improves the robustness of the RSR model against non-Gaussian noise or sparse but strong corruptions, thereby facilitating practical applications. More importantly, many important features, including the detailed information contained in an image, can be captured by the sparse error components obtained using the RSR model. Table 2 summarizes the previously mentioned sparse representation models.

Basically, the NNSR, JSR, GSR, RSR, and MRSR models improve the traditional SR model in various aspects, and they generally perform better than the SR model when applied to multi-sensor fusion applications. However, it is difficult to explain the suitability of a model for a specific application from a general point of view. Instead, we draw conclusions based on the experimental results, which reveal that the RSR model seems to be more suitable for multi-focus image fusion; the NNSR and JSR models are more suitable for multi-modality image fusion; and the GSR model can facilitate both, as it achieves generally good results for these two applications. It is necessary to point out that the performance may be further improved if the dictionary of a model complies with the characteristics of the data. That is to say, it does not make sense to expect a universal dictionary that can enhance the performance of all the models. As a result, designing an appropriate dictionary for each model deserves further investigation.
III. DICTIONARY LEARNING METHODS

Constructing a good dictionary is of fundamental importance to the performance of an SR-based image fusion method. Generally, there are two categories of methods for constructing an over-complete dictionary.
TABLE 2: Summary of the sparse representation models employed in multi-sensor image fusion.
Category | Model | Representation coefficient constraints | Reconstruction error constraints
Least-squared-error-based | SR | Sparsity constraint | Least-squared minimization constraint
Least-squared-error-based | NNSR | Sparsity and non-negativity constraints | Least-squared minimization constraint
Least-squared-error-based | JSR | Sparsity constraint on common and innovation components | Least-squared minimization constraint
Least-squared-error-based | GSR | Group-sparsity constraint | Least-squared minimization constraint
Sparse-error-based | RSR | Sparsity constraint | Sparsity constraint
Sparse-error-based | MRSR | Sparsity constraint | Joint sparsity constraint across the error matrices of multiple tasks

The first category uses some fixed basis [18], [50]. In [18], for instance, an over-complete separable version of the DCT dictionary is constructed by sampling cosine waves with different frequencies. In [50], a hybrid dictionary consisting of a DCT basis, a wavelet 'db1' basis, a Gabor basis and a ridgelet basis is constructed. Employing a fixed basis has the advantages of simplicity and fast implementation. Since this approach is not customized to the input image data, however, it may provide inferior performance for certain types of data and applications.

The second category of methods constructs an over-complete dictionary by using some learning method, such as PCA, MOD or K-SVD [24]. These methods can be further divided into global-trained-dictionary-based [19], [22], [44], [50] and adaptively-trained-dictionary-based [23], [27], [28], [30], [45], according to the employed training images. In the former methods, a public training database that generally contains many high-resolution images is employed to construct the training data for dictionary learning. For example, in [19], the training data consists of 100,000 patches randomly sampled from a database of 40 high-quality images. In the latter methods, the input images are directly used to construct the training data. For example, in [27], the training data for dictionary learning contains 20,000 patches randomly sampled from the source infrared and visible images. In [23], local patches from the input multi-focus images are used as the training samples to learn a dictionary. In [28], the input image patches are directly employed to construct an over-complete dictionary. These dictionaries are adaptive to the input image data and thus have the potential to outperform the commonly used fixed dictionaries. Accordingly, these learned dictionaries are more widely adopted in SR-based image fusion. In the rest of this section, we review some dictionary learning methods used in multi-sensor image fusion.

6. It should be noted that the methods to be discussed are adopted for the global-trained dictionaries as well as the adaptively-trained dictionaries.

A. Dictionary learning via K-SVD

Let $Y = [y_1, y_2, ..., y_N] \in \mathbb{R}^{n \times N}$ be a training data matrix, where $y_i \in \mathbb{R}^n$ is the $i$-th sampled data vector. Our goal is to learn a dictionary $D = [d_1, d_2, ..., d_M] \in \mathbb{R}^{n \times M}$ and a sparse coefficient matrix $X = [x_1, x_2, ..., x_N] \in \mathbb{R}^{M \times N}$, such that the product of $D$ and $X$ approximates the original data matrix $Y$ efficiently.
The over-complete dictionary $D$, together with the coefficient matrix $X$, can be obtained from the matrix $Y$ by solving

$$\min_{D,X} \|Y - DX\|_F^2 \quad \text{s.t.} \quad \|x_i\|_0 \le \tau, \; i = 1, 2, ..., N, \qquad (19)$$

where $\tau$ denotes the upper bound for the number of non-zero entries in $x_i$. The solution of (19) for both $D$ and $X$ can be obtained by the popular dictionary learning algorithm K-SVD [24], which iteratively alternates between two steps: sparse coding (find $X$) and dictionary updating (find $D$).

In the sparse coding step, $D$ is assumed to be fixed, and the optimization problem (19) reduces to a search for sparse representations with coefficients summarized in the matrix $X$. For that, the criterion is rewritten as

$$\|Y - DX\|_F^2 = \sum_{i=1}^N \|y_i - Dx_i\|_2^2. \qquad (20)$$

Therefore, the problem in (19) can be decoupled into $N$ optimization problems of the form

$$\min_{x_i} \|y_i - Dx_i\|_2^2 \quad \text{s.t.} \quad \|x_i\|_0 \le \tau, \; i = 1, 2, ..., N. \qquad (21)$$

This problem can be efficiently solved by the MP [36] and OMP [37] algorithms mentioned in Section II.

In the dictionary updating stage, the coefficient matrix $X$ and the dictionary $D$ are both assumed to be fixed. Only one column $d_k$ in the dictionary and the coefficients that correspond to it (i.e., the $k$-th row of $X$, denoted $x_k^T$) are considered at a time. For that, the multiplication $DX$ in (19) is decomposed into the sum of $M$ rank-1 matrices. During the updating, $M - 1$ terms are assumed fixed and one, the $k$-th, remains in question. More specifically, the metric in (19) is rewritten as [24]

$$\|Y - DX\|_F^2 = \left\| Y - \sum_{j=1}^M d_j x_j^T \right\|_F^2 = \left\| \left( Y - \sum_{j \ne k} d_j x_j^T \right) - d_k x_k^T \right\|_F^2 = \left\| E_k - d_k x_k^T \right\|_F^2, \qquad (22)$$

where $E_k$ stands for the error over all the $N$ samples when the $k$-th atom is removed. Minimizing (22) is equivalent to finding the rank-1 matrix that best approximates the error term $E_k$ in Frobenius norm. The rank-1 matrix is described by the atom $d_k$ and the row vector $x_k^T$, which can be obtained simply by applying a singular value decomposition (SVD) to $E_k$. Moreover, to ensure the sparsity of the vector $x_k^T$, some modifications are further performed on (22); more details can be found in [24].
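The following is a compact sketch of the two K-SVD stages in (19)-(22): OMP for the sparse coding stage, then a rank-1 SVD update per atom, restricted to the samples that actually use the atom (which is the sparsity-preserving modification mentioned above). Initialization and iteration counts are illustrative assumptions; see [24] for the full algorithm.

```python
# A compact K-SVD iteration following (19)-(22).
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def ksvd(Y, M=64, tau=5, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    n, N = Y.shape
    D = rng.standard_normal((n, M))
    D /= np.linalg.norm(D, axis=0)               # unit-norm atoms
    for _ in range(n_iter):
        # Sparse coding stage: solve (21) for every training vector.
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=tau, fit_intercept=False)
        X = omp.fit(D, Y).coef_.T                # M x N
        # Dictionary update stage: revisit each atom in turn, as in (22).
        for k in range(M):
            users = np.nonzero(X[k, :])[0]       # samples that use atom k
            if users.size == 0:
                continue
            E_k = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
            U, s, Vt = np.linalg.svd(E_k, full_matrices=False)
            D[:, k] = U[:, 0]                    # best rank-1 approximation
            X[k, users] = s[0] * Vt[0, :]        # support (sparsity) preserved
    return D, X
```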
B. MODJSR: dictionary learning for the JSR model

In [30], the authors present a dictionary learning method (termed MODJSR) for the JSR model. Similar to the traditional dictionary learning methods using K-SVD, MODJSR is also implemented by alternating a sparse coding stage and a dictionary updating stage. In the second stage, the dictionary update is performed via the "Landweber" update [51] with an initial point obtained by the method of optimal directions (MOD). This method is shown to have higher computational efficiency than the K-SVD method.

Suppose $Y_k \in \mathbb{R}^{n \times L}$ ($k = 1, ..., K$) are signals from the same ensemble, i.e., from different source images of the same scene. Motivated by dictionary learning for the standard SR model, the MODJSR dictionary learning problem for the JSR model is defined as [30]

$$\min_{D,X} \|Y - DX\|_F^2 \quad \text{s.t.} \quad \|x_t\|_0 \le \tau, \; t = 1, 2, ..., L. \qquad (23)$$

Here, the data matrix $Y \in \mathbb{R}^{nK \times L}$, the dictionary matrix $D \in \mathbb{R}^{nK \times (K+1)M}$ and the coefficient matrix $X \in \mathbb{R}^{(K+1)M \times L}$ are constructed as in (7), (8), and (9), respectively. $\tau$ denotes the maximal number of non-zero coefficients in each column of $X$.

Adopting the block-coordinate descent idea, an alternating strategy with two stages is used to solve (23). The first stage employs joint sparse coding: fixing the dictionary $D$, the joint sparse coefficient matrix $X$ can be obtained by solving (10) via OMP [37], taking advantage of its simplicity and fast execution.

The second stage updates the dictionary. Fixing the joint sparse coefficient matrix $X$, the dictionary $D$ in (23) could be updated simply by $\hat{D} = Y X^T (X X^T)^{-1}$ with MOD. However, $X X^T$ may not always be full rank. The majorization method could also be directly employed, but it is slow because the "Landweber" update is a gradient update. If the dictionary is updated by the "Landweber" update, the initial point can be obtained by MOD. Then $D$ is found by solving [30]

$$\min_D f(D) = \min_D \|Y - DX\|_F^2 = \min_D \sum_{k=1}^K \left\| Y_k - D \left( X_C + X_{U_k} \right) \right\|_F^2. \qquad (24)$$

The optimum of the objective function satisfies

$$\frac{d}{dD} f(D) = 0. \qquad (25)$$

Hence,

$$W = DH, \qquad (26)$$

where $W = \sum_{k=1}^K Y_k (X_C + X_{U_k})^T$ and $H = \sum_{k=1}^K (X_C + X_{U_k})(X_C + X_{U_k})^T$. Since $X$ is sparse, the non-zero elements of $H$ are often concentrated on the diagonal with $H_{ii} > 0$ ($i = 1, ..., M$), and $\text{rank}(H) = M$ holds with high probability [52] due to diagonal dominance. When $\text{rank}(H) = M$, the dictionary $D$ is simply updated by $D = W H^{-1}$. Otherwise, it is updated by the "Landweber" rule as [30], [51]

$$D^{[k+1]} = D^{[k]} + \frac{1}{\sigma} \left( W - D^{[k]} H \right) H^T, \qquad (27)$$

where $\sigma$ is a constant satisfying $\sigma > \|H^T H\|_F$. A good initial point, obtained by MOD and given by $D^{[0]} = W H^\diamond$, is employed when updating the dictionary for higher computational efficiency. Here $H^\diamond$ is computed as $H^\diamond = U \Sigma^\dagger U^T$, where the matrices $U$ and $\Sigma$ result from the SVD of the matrix $H$, i.e., $H = U \Sigma U^T$.
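The MOD-style core of this update, (26), reduces to a couple of lines of linear algebra. The sketch below accumulates $W$ and $H$ from the joint codes and uses a pseudo-inverse to cover the rank-deficient case; the Landweber fallback of (27) is omitted for brevity.

```python
# Sketch of the MOD-style dictionary update in (26) for MODJSR.
import numpy as np

def mod_update(Ys, X_C, X_Us):
    """Ys: list of K data matrices (n x L); X_C and each X_Us[k]: M x L codes."""
    W = sum(Y_k @ (X_C + X_Uk).T for Y_k, X_Uk in zip(Ys, X_Us))
    H = sum((X_C + X_Uk) @ (X_C + X_Uk).T for X_Uk in X_Us)
    return W @ np.linalg.pinv(H)   # D = W H^{-1} when H has full rank
```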
C. Dictionary learning based on joint patch clustering

Since the connection between sparsity and clustering was shown to be desirable in image restoration tasks [31], [53], some new dictionary learning frameworks combined with clustering of non-local patches were recently presented [54], [55]. Motivated by clustering-based dictionary learning techniques, the authors of [31] presented an efficient dictionary learning method based on joint patch clustering for multi-modal image fusion. This is also the first attempt at applying clustering-based dictionary learning to image fusion.

Conventional dictionary learning methods based on K-SVD, such as the ones discussed in the previous subsections, generally produce redundant or highly structured dictionaries [31]. The dictionary learning proposed in [31] aims to remove this redundancy while maintaining or improving the quality of the multimodal image fusion. Under the assumption that common image structures are distributed across the source images from different sensor modalities, patches from different source images are clustered together according to local structural similarities. Then sub-dictionaries that best describe the underlying structure of each cluster are constructed using only a few principal components. Finally, these sub-dictionaries are combined to form a final dictionary.

Since each sub-dictionary consists of only a few principal components of each joint patch cluster, the final dictionary ends up much smaller than those learned by K-SVD. Although it is more compact, the constructed dictionary still contains the most informative components from each joint patch cluster. As a result, the computational complexity of the subsequent fusion method is greatly reduced while the fusion performance is maintained.
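The cluster-then-PCA construction can be sketched in a few lines: patches pooled from all modalities are clustered, and each cluster contributes only its leading principal components to the final dictionary. The cluster count and the number of components per cluster below are illustrative assumptions, not the settings of [31].

```python
# Sketch of joint patch clustering followed by per-cluster PCA
# sub-dictionaries, combined into one compact dictionary.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def clustered_dictionary(patches, n_clusters=8, n_components=4, seed=0):
    """patches: num_patches x n matrix pooled from all source modalities."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(patches)
    subs = []
    for c in range(n_clusters):
        members = patches[labels == c]
        k = min(n_components, len(members))  # guard small clusters
        if k == 0:
            continue
        pca = PCA(n_components=k).fit(members)
        subs.append(pca.components_.T)       # n x k sub-dictionary
    return np.hstack(subs)                   # compact final dictionary
```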
D. Learning sub-dictionaries for the ASR model

In the traditional SR models introduced in Section II, a highly redundant dictionary is always needed to satisfy signal reconstruction requirements, since the structures vary significantly across different image patches. However, this may result in potential visual artifacts as well as high computational cost. To address this problem, the authors of [22] introduced an adaptive sparse representation (ASR) model, in which a set of more compact sub-dictionaries are learned from numerous high-quality image patches that have been pre-classified into several corresponding categories based on their gradient information.

Let $P = \{p_1, p_2, ..., p_N\}$ be a training data set, where $p_i \in \mathbb{R}^n$ is the $i$-th sampled image patch. The patches in $P$ are first classified into $K$ categories $\{P_k \mid k = 1, 2, ..., K\}$ according to their dominant gradient directions. Then a total of $K + 1$ sub-dictionaries $\{D_0, D_1, ..., D_K\}$ are obtained, in which $D_0$ is learned from all the patches in $P$ having no clear dominant direction, whereas $D_k$ ($k = 1, 2, ..., K$) is learned from the patches in the corresponding subset $P_k$ that have the specific dominant direction described by category $k$. In this method, the dominant gradient direction of each signal $y_i$ is first computed, after which the sub-dictionary $D_{k_i}$ is adaptively selected as the dictionary. An example of ASR dictionary learning with $K = 6$ is shown in Fig. 7.

Fig. 7: Learning sub-dictionaries in the ASR model. (a) Illustration of the dominant orientation division; (b)-(h) learned sub-dictionaries $\{D_k \mid k = 0, 1, ..., 6\}$, respectively. (Credit to [22])
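The classification step can be sketched with a gradient-orientation histogram per patch: a patch goes to one of the $K$ orientation bins if one bin clearly dominates, and to the "no dominant direction" class otherwise. The dominance threshold below is an illustrative assumption, not the criterion used in [22].

```python
# Sketch of the ASR patch classification step.
import numpy as np

def classify_patch(patch, K=6, dominance=0.3):
    """patch: s x s array. Returns 0 (no dominant direction) or 1..K."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    theta = np.mod(np.arctan2(gy, gx), np.pi)     # orientations in [0, pi)
    hist, _ = np.histogram(theta, bins=K, range=(0, np.pi), weights=mag)
    total = hist.sum()
    if total == 0 or hist.max() / total < dominance:
        return 0                                  # patch trains/uses D_0
    return int(np.argmax(hist)) + 1               # patch trains/uses D_k
```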
7. For pan-sharpening, the training sets may be constructed from theHR panchromatic source images. user-defined factor) to generate a LR image. The latter isthen up-sampled back to the original size using Bicubicinterpolation and the resulting image is seen as a LRimage. A pair of training sets (cid:8) y Hi ∈ R n | i = 1 , , ..., N (cid:9) , (cid:8) y Li ∈ R n | i = 1 , , ..., N (cid:9) are thus created by extractingpatches from the original HR image I and its degradedLR version, respectively, in which y Hi and y Li with thesame index i correspond to the same spatial position in theHR and LR images. Based on the assumption that sparsecoefficients of the LR image patch y Li over the LR dictionary D L ∈ R n × M are the same as those of the HR image patch y Hi over the HR dictionary D H ∈ R n × M , the coupleddictionaries D H and D L can be learned by solving thefollowing optimization problem [25] { D H , D L , X } =arg min D H ,D L ,X N (cid:88) i =1 (cid:13)(cid:13)(cid:13) y Hi − D H x i (cid:13)(cid:13)(cid:13) + N (cid:88) i =1 (cid:13)(cid:13)(cid:13) y Li − D L x i (cid:13)(cid:13)(cid:13) s.t. ∀ i (cid:107) x i (cid:107) ≤ τ , (28)where X = [ x , x , ...., x N ] ∈ R M × N is the matrixcontaining the sparse coefficients, and τ controls thesparsity level. By introducing auxiliary variables Y H =[ y H , y H , ..., y HK ] ∈ R n × N , Y L = [ y L , y L , ..., y LN ] ∈ R n × N , Y = (cid:104)(cid:0) Y H (cid:1) T , (cid:0) Y L (cid:1) T (cid:105) T ∈ R n × N , and D = (cid:104) ( D H ) T , ( D L ) T (cid:105) T ∈ R n × M , problem (28) is equivalentlytransformed to (19) and can thus be efficiently solved byK-SVD. As discussed in this section, many dictionary learning meth-ods have been presented or applied to multi-sensor imagefusion. Among these methods, the K-SVD method, thanksto its simplicity and generalization, is the most broadlyadopted by the existing SR-based fusion methods. To someextent, the learning procedure of the ASR dictionary and thecoupled dictionary are also K-SVD like based on the sameprinciple. It is worthwhile pointing out that each dictionarylearning method has its pros and cons, meaning that thereis no universal dictionary that suits all applications.Using these methods, a globally-trained dictionary oran adaptively-trained dictionary can be generated duringthe fusion process. These learned dictionaries are adaptiveto the input image data and usually perform better thanthe fixed dictionaries in terms of the extraction and rep-resentation of significant features in an image. However,these learned dictionaries generally contain a large num-ber of atoms in order to accurately reconstruct an inputimage patch. This increases the redundancy among thedictionary atoms and thus degrades the subsequent fusionperformance to some extent. Moreover, this also increasesthe computational complexity of a fusion method. In Table.3, we compare some existing dictionary learning methodswith respect to the number of sub-dictionaries, redundancy,applicable model and consumed computation power. Nev-ertheless, how to learn a dictionary with a fixed smallnumber of atoms and yet maintain a good representationcapability for different SR models and fusion applicationsis desirable and still a challenging problem in multi-sensorimage fusion. Fig. 
Fig. 8: Procedure to construct the training sets for the coupled dictionaries.

TABLE 3: Comparison of different dictionary learning methods.

Method | Number of dictionaries | Redundancy | Applicable models | Computational efficiency
K-SVD-DL | 1 | high | SR, RSR, MRSR | low
MOD-DL | 1 | high | JSR | high
PCA-DL | 1 (multiple sub-dictionaries) | low | SR, GSR, RSR, MRSR | high
ASR-DL | >1 (specific dominant directions) + 1 (common) | low | SR, RSR, MRSR | medium
Coupled-DL | 2 | high | SR, NNSR, RSR, MRSR | low

F. Summary of dictionary learning methods

As discussed in this section, many dictionary learning methods have been presented or applied to multi-sensor image fusion. Among these methods, K-SVD, thanks to its simplicity and generality, is the most broadly adopted by the existing SR-based fusion methods. To some extent, the learning procedures for the ASR dictionary and the coupled dictionary are also K-SVD-like, being based on the same principle. It is worth pointing out that each dictionary learning method has its pros and cons, meaning that there is no universal dictionary that suits all applications.

Using these methods, a globally-trained dictionary or an adaptively-trained dictionary can be generated during the fusion process. These learned dictionaries are adaptive to the input image data and usually perform better than fixed dictionaries in terms of the extraction and representation of significant image features. However, these learned dictionaries generally contain a large number of atoms in order to accurately reconstruct an input image patch. This increases the redundancy among the dictionary atoms and thus degrades the subsequent fusion performance to some extent. Moreover, it also increases the computational complexity of a fusion method. In Table 3, we compare some existing dictionary learning methods with respect to the number of sub-dictionaries, redundancy, applicable models and consumed computational power. Nevertheless, how to learn a dictionary with a fixed small number of atoms that maintains a good representation capability for different SR models and fusion applications remains a challenging problem in multi-sensor image fusion.

IV. ACTIVITY LEVELS AND FUSION RULES IN DIFFERENT APPLICATIONS

So far, SR-based image fusion methods have been used in a wide variety of applications, such as multi-focus image fusion and multi-modality (e.g., infrared and visible light) image fusion. These applications target different fusion goals and thus require different fusion strategies. In this section, we will review some applications of SR-based fusion methods for fusing multi-focus images or infrared with visible images.
A. Multi-focus image fusion

Due to the limited depth-of-focus of optical lenses in CCD devices, it is often not possible to obtain an image in which all of the relevant objects are in focus. As shown in Fig. 9, this issue can be overcome by multi-focus image fusion, in which several images with different focus points (e.g., Fig. 9(a) and Fig. 9(b)) are combined to form a composite
image (e.g., Fig. 9(c)) with full focus. The basic requirement for multi-focus image fusion is that only the focused regions should be extracted from the given multi-focus input images and preserved in the fused image, while all of the defocused regions should be discarded.

8. The test images in Fig. 9 and in Fig. 11 below were downloaded from http://home.ustc.edu.cn/~liuyu1.

Fig. 9: Illustration of multi-focus image fusion. (a) Focus on the flower; (b) focus on the clock; (c) fused image with full focus.

As shown in Fig. 2 in Section I, SR-based multi-focus image fusion generally involves the following steps: (1) Divide the source images into a large number of image patches of the same size (e.g., $8 \times 8$). In order to reduce block artifacts and improve robustness to mis-registration, a sliding window with a step length of a fixed number of pixels (e.g., one pixel) is often used in this step; that is, these patches overlap by a fixed number of pixels along the horizontal and vertical directions, respectively. (2) Re-order each of these patches as a vector of $n$ dimensions (e.g., $n = 8 \times 8 = 64$). (3) Sparsely code these vectors via the different SR models and pre-constructed dictionaries introduced in Sections II and III. The traditional SR model introduced in Section II.A is the most widely used in multi-focus image fusion. Dictionaries directly learned from a set of high-resolution training images using K-SVD are also the most popular in these methods. (4) Define activity levels and then construct the fused image with different fusion rules.

The activity level reflects the importance of each local image patch. In particular, for multi-focus image fusion, the activity level should reflect the focus information of each image patch. In SR-based multi-focus image fusion methods, the activity level is generally defined as the $\ell_0$-norm, $\ell_1$-norm or $\ell_2$-norm of the sparse coefficient vector for each image patch, i.e.,

$$A(p_i^k) = \|x_i^k\|_j, \qquad (29)$$

where $p_i^k$ denotes the $i$-th patch from the $k$-th source image, $x_i^k$ denotes the representation coefficient vector corresponding to the patch $p_i^k$, and $j = 0$, 1, or 2 indicates which norm function is employed to define the activity level.

Sometimes, relatively more sophisticated activity levels are also defined. For example, in [23], the correlation between the sparse representation of the input images and the pooled features obtained in the previous dictionary learning phase is used as the decision map for the fusion. As opposed to most SR-based multi-focus image fusion methods, which employ the sparse representation coefficients to define the activity level, the fusion method presented in [28] employs the sparse reconstruction error, more specifically the norm of each column vector in the sparse error matrix obtained by the RSR model, to define the activity level for each source image patch.
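Equation (29) and the maximum-selection rule discussed next amount to only a few lines of code; the following minimal sketch makes them explicit. The default norm choice is illustrative.

```python
# Minimal sketch of the activity level in (29) and a maximum-selection rule.
import numpy as np

def activity(x, j=1):
    """x: sparse coefficient vector of one patch; j in {0, 1, 2}."""
    if j == 0:
        return np.count_nonzero(x)
    return np.linalg.norm(x, ord=j)

def select_max(codes, j=1):
    """codes: K candidate coefficient vectors for the same patch position.
    Returns the candidate whose activity level is largest."""
    return codes[int(np.argmax([activity(x, j) for x in codes]))]
```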
There are two different ways to construct the fused image after the activity level of each image patch is determined. Accordingly, SR-based multi-focus image fusion methods are divided into two categories: transform-domain-based and spatial-domain-based. In the transform-domain-based fusion methods [18], [22], [29], [50], [60], [61], [62], [63], [64], the representation coefficients of the fused image patches are first obtained from the corresponding representation coefficients of the source image patches according to their activity levels. Then the fused image patches are constructed by multiplying the pre-defined dictionary with the obtained representation coefficients. On the other hand, in the spatial-domain-based fusion methods [23], [28], the fused image patches are directly extracted from the source image patches according to their activity levels.

In general, both the maximum-selection and weighted-averaging fusion rules (or fusion strategies) may be employed to determine the fused image patches or their representation coefficients. However, in the SR-based multi-focus fusion methods, the maximum-selection fusion rule is more popular. In this approach, the fused image patch, or its sparse representation, is selected as the input image patch, or its sparse representation, with the highest activity level. Some state-of-the-art SR-based multi-focus image fusion methods are summarized in Table 4.

B. Multi-modality image fusion

It is becoming more common to employ multiple types of imaging sensors in video surveillance to improve robustness, in which case visible light and infrared imaging sensors are normally combined. Image fusion allows the information captured by these different sensors to be sufficiently and effectively integrated to create a composite image containing more useful information than any of the individual input images. This image can be used to better interpret the scene [3]. Multi-modality image fusion has also been widely applied to many other fields, such as medical imaging.

A video surveillance application is shown in Fig. 10(a), where the moving person is evident in the image taken by the infrared video camera. However, the scene environment (e.g., the hedges and the shrubs) is better displayed in the visible-light image (Fig. 10(b)), in which the moving targets are difficult to see. By fusing the two input images, the moving target from the infrared camera and the background scene (or the environment) from the visible light camera are well integrated. As shown in Fig. 10(c), the fused image clearly shows that there is a man in the scene. SR has also been applied to multi-modality image fusion, including the fusion of infrared and visible light images [19], [27], [30], [31], [44], [45], [65]. Due to the different imaging technologies of the sensors, multi-modality images of the same scene captured by different image sensors provide redundant and complementary information. The basic job of a multi-modality image fusion approach is to properly employ the redundant and complementary information available from the different input images [66].

Interestingly, this notion maps well into the JSR model, and this is reflected by the fact that, in addition to the traditional SR model, the JSR model is popular in multi-modality image fusion [30], [44], [45], [67]. The reason is that in the JSR model, all the signals from the same ensemble are automatically decomposed into a common component that is shared by all the signals and an innovation component that describes each individual signal.
Interestingly, this notion maps well onto the JSR model, which is reflected by the fact that, in addition to the traditional SR model, the JSR model is popular in multi-modality image fusion [30], [44], [45], [67]. The reason is that, in the JSR model, all the signals from the same ensemble are automatically decomposed into a common component that is shared by all the signals and an innovation component that describes each individual signal. The common component captures the redundant information among all the signals, while the innovation components capture the complementary information [45]. Accordingly, JSR already extracts the information needed for fusion. In the subsequent fusion phase, the innovation components of the input images are combined by using a weighted-averaging [30], [45] or a summing [44], [67] fusion strategy, and the final fused image is obtained by integrating the common component shared by all the input images with the previously combined innovation component.

Finally, it should be noted that almost all SR-based multi-modality image fusion methods are transform-domain-based. This may result from the fact that patches from the multi-modality input images corresponding to the same spatial positions have greatly diverse characteristics because of the different sensor technologies. Consequently, many spatial artifacts would be introduced during the fusion if a spatial-domain-based method, which directly selects the source patches with the higher activity levels, were adopted; a transform-domain-based method may reduce such artifacts to some extent. Table 5 summarizes some state-of-the-art SR-based multi-modality image fusion methods.
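To make the common/innovation decomposition concrete, the following is a minimal Python sketch of JSR-based fusion for one pair of co-located patches. It is an illustration under simplifying assumptions rather than the implementation of [30], [44], [45] or [67]: the joint model y_k = D(x_c + x_k) is solved with OMP over a stacked block dictionary, and the innovation components are combined by weighted averaging with weights derived from their $\ell_1$-norms.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def jsr_fuse_patch(y1, y2, D, k=12):
    """Fuse two co-located patch vectors with a joint-sparsity (JSR) sketch.

    y1, y2 : flattened patches (length n) from the two modalities
    D      : shared dictionary with unit-norm columns, shape (n, m)
    """
    n, m = D.shape
    Z = np.zeros_like(D)
    # Block dictionary of the joint model: [y1; y2] = Dj [xc; x1; x2]
    Dj = np.vstack([np.hstack([D, D, Z]),
                    np.hstack([D, Z, D])])
    # Renormalize block columns so OMP's unit-norm assumption holds,
    # then undo the scaling on the recovered coefficients.
    scale = np.linalg.norm(Dj, axis=0)
    x = orthogonal_mp(Dj / scale, np.concatenate([y1, y2]), n_nonzero_coefs=k)
    x = x / scale
    xc, x1, x2 = x[:m], x[m:2 * m], x[2 * m:]
    # Weighted averaging of the innovation (complementary) components,
    # with weights from their l1-norms; the common component is kept as-is.
    a1, a2 = np.abs(x1).sum(), np.abs(x2).sum()
    w1 = a1 / (a1 + a2 + 1e-12)
    return D @ (xc + w1 * x1 + (1.0 - w1) * x2)
```

Replacing the weighted average with a plain sum of x1 and x2 gives the summing strategy used in [44], [67].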
As discussed in the previous sections, SR models, learned dictionaries and activity levels are three important issues in SR-based fusion methods. In this section, we discuss the impact of these three components on the fusion performance in the context of the previous two applications. For this purpose, we employ two sets of test images, shown in Fig. 11 and Fig. 12, containing 10 pairs of multi-focus images and 10 pairs of infrared and visible images, respectively. We employ the mutual information (MI) [71], the gradient preservation quality metric $Q_G$ [72], the structural similarity (SSIM) fusion quality metric $Q_S$, and two phase-congruency fusion quality metrics $Q_{ZP}$ [73] and $Q_{PC}$ [10] to evaluate the different fusion methods.
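As an example of how such metrics are computed, the sketch below evaluates the MI metric of [71], which sums the mutual information between the fused image and each source image, estimated here from joint gray-level histograms; the 256-bin histogram and the base-2 logarithm are assumptions of this illustration rather than details taken from [71].

```python
import numpy as np

def mutual_information(a, b, bins=256):
    """Mutual information between two gray images via their joint histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()                 # joint gray-level distribution
    px = pxy.sum(axis=1, keepdims=True)       # marginal of image a
    py = pxy.sum(axis=0, keepdims=True)       # marginal of image b
    nz = pxy > 0                              # skip empty histogram cells
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

def mi_fusion_metric(src_a, src_b, fused):
    """MI fusion metric in the spirit of [71]: MI(F, A) + MI(F, B)."""
    return mutual_information(fused, src_a) + mutual_information(fused, src_b)
```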
TABLE 5: Some state-of-the-art SR-based multi-modality image fusion methods.

| Method | Model | Dictionary | Fusion rule |
| [19], [31], [65], [68], [69], [70] | SR | Learned from a set of images [19], [69], [70]; learned from source images [31], [65], [68] | Maximum $\ell_1$-norm selection of representation coefficient vectors [19], [69], [70]; maximum selection of (absolute) coefficient vector entries [65], [68]; summing of representation coefficient vectors [31] |
| [29] | Group SR | Learned from a set of images | Maximum $\ell_1$-norm selection of representation coefficient vectors |
| [22] | Adaptive SR | Multiple dictionaries with different dominant directions learned from a set of images | Maximum $\ell_1$-norm selection of representation coefficient vectors |
| [27] | NNSR | Learned from source images | Maximum $\ell_1$-norm and sparseness selection of representation coefficient vectors |
| [30], [44], [45], [67] | JSR | Learned from a set of images [44], [67]; learned from source images [30], [45] | Summing of representation coefficient vectors [44], [67]; weighted averaging of representation coefficient vectors [30], [45] |

Fig. 11: 10 pairs of multi-focus test images. The top row contains the 10 input images with the focus on the left part, and the bottom row contains the corresponding input images with the focus on the right part.

Fig. 12: 10 pairs of multi-modality test images. The top row contains the 10 visible input images, and the bottom row contains the corresponding infrared input images.

In these experiments, the SR-based fusion methods are applied on a patch-by-patch basis. That is, the source images are first divided into many patches of the same size, and these patches are then fused. The size of the patches is set to 8 × 8, following the experimental results in [18]; accordingly, the dimension of the dictionary atoms is also set to 8 × 8 = 64. In addition, in order to improve the robustness to mis-registration and reduce spatial artifacts, a sliding window technique is employed, i.e., the window moves with a step of one pixel so that neighboring patches overlap.

Next, the impact of the different sparse representation models (listed in Table 6) on the fusion performance is discussed. Table 7 provides the scores of the different fusion
9. The dictionaries used in the models listed in Table 6 are learned from a database containing 24 high-resolution training images downloaded from http://r0k.us/graphics/kodak/.

methods on the two sets of test images. The results indicate that the sparse representation model has a great effect on the fusion performance: as shown in Table 7, the fusion performance varies significantly with the sparse representation model employed in an image fusion method. Table 7 also shows that GSR performs the best among the six models considered here; in terms of most quality metrics, it achieves the highest scores for the fusion of multi-focus images as well as for the fusion of infrared and visible images. This may be due to the cluster-structure sparsity prior employed in the GSR model. In addition to GSR, RSR and NNSR also achieve satisfactory results when applied to multi-focus image fusion and multi-modality image fusion, respectively. However, for multi-modality image fusion, JSR does not achieve as satisfactory a result as it did in [30]. This might be due to the employed dictionary KSVD-512, which was learned for SR rather than for JSR.

TABLE 6: Different fusion methods with various SR models and their key parameters.
| Model | Dictionary | Fusion rule |
| SR [15], [18], [19] | Learned from a set of images using the K-SVD method [19] | Maximum $\ell_1$-norm selection of representation coefficient vectors |
| ASR [22] | Multiple dictionaries [22] with different dominant directions learned from a set of images | Maximum $\ell_1$-norm selection of representation coefficient vectors |
| GSR [29] | Learned from a set of images using the patch-clustering-based method [31] | Maximum $\ell_1$-norm selection of representation coefficient vectors |
| NNSR [27] | Learned from a set of images using the method in [42] | Maximum $\ell_1$-norm selection of representation coefficient vectors |
| JSR [30], [43], [45] | Learned from a set of images using the K-SVD method [19] | Maximum $\ell_1$-norm selection of representation coefficient vectors |
| RSR [28] | Learned from a set of images using the K-SVD method [19] | Maximum $\ell_1$-norm of sparse reconstruction errors |

TABLE 7: Performance of different SR models on the two sets of test images. Scores for all image pairs in each dataset are averaged.
| Test images | Model | MI | $Q_G$ | $Q_S$ | $Q_{ZP}$ | $Q_{PC}$ |
| Multi-focus images | SR | 4.1267 | 0.7584 | 0.5008 | 0.9533 | 0.6846 |
| | ASR | 4.0889 | 0.7548 | 0.4976 | 0.9444 | 0.6773 |
| | GSR | 4.6534 | — | — | — | — |
| | NNSR | 4.0504 | 0.7565 | 0.4994 | 0.9574 | 0.6615 |
| | JSR | 4.6081 | 0.7666 | 0.5108 | 0.9565 | 0.6934 |
| | RSR | — | — | — | — | — |
| Visible-infrared images | SR | — | — | — | — | — |
| | ASR | — | — | — | — | — |
| | GSR | — | — | — | — | — |
| | NNSR | — | — | — | — | — |
| | JSR | 2.4258 | 0.6178 | 0.4205 | 0.7815 | 0.3992 |
| | RSR | 2.7403 | 0.6335 | — | — | — |
In this part, we study the effect of the employed dictionary on the fusion performance. In all the experiments conducted here, we employ the traditional SR model and the maximum $\ell_1$-norm selection fusion rule during the fusion process. Moreover, we test two kinds of over-complete dictionaries on the two sets of test images. The first is a 2-D over-complete DCT dictionary of size 512 (DCT-512 for short) [18]. The second includes four globally trained dictionaries of sizes 128, 256, 512, and 1024. These four dictionaries (KSVD-128, KSVD-256, KSVD-512, and KSVD-1024 for short) are all learned from image samples using the iterative K-SVD algorithm [24]; the training data consist of 50,000 8 × 8 patches randomly taken from the database mentioned in Section V.A. We also test three sets of adaptively trained dictionaries (denoted by D_vi-512, D_ir-512, and D_joint-512) on the infrared-visible test image set (i.e., the second set of test images). Each dictionary in the D_vi-512 set consists of 512 atoms and is learned from the corresponding visible input image in the second set of test images using the iterative K-SVD algorithm. Similarly, each dictionary in the D_ir-512 set is learned from the corresponding infrared input image, and each dictionary in the D_joint-512 set is learned jointly from the corresponding visible and infrared test images.

Table 8 provides the fusion scores of the different dictionaries on the two sets of test images. According to Table 8: (1) As expected, the globally learned dictionaries usually perform better than the fixed DCT dictionary. (2) The adaptively trained dictionaries in the D_vi-512 and D_joint-512 sets, especially those in the former set, perform competitively with the global dictionary having the same number of atoms when applied to multi-modality image fusion. However, the dictionaries in the D_ir-512 set, which are adaptively learned from the infrared input images, do not perform better than the globally learned dictionary or than those in the D_vi-512 and D_joint-512 sets. This may be due to the fact that fewer patches in the infrared images contain significant structures; as a result, the dictionaries in the D_ir-512 set have weak representation power, which reduces the fusion performance. In contrast, the visible input images contain many more patches with significant structures, and correspondingly the dictionaries in the D_vi-512 set achieve better fusion performance. (3) The number of dictionary atoms also has a great impact on the fusion performance. As shown in Table 8, the dictionary KSVD-512 obtains the highest fusion performance among the four global dictionaries studied, when applied to multi-focus image fusion as well as to multi-modality image fusion. For the dictionary KSVD-128, the number of dictionary atoms seems too small, and some image patches (e.g., those with significant details) are not well represented; the fusion performance is therefore not comparable to that obtained using the dictionaries KSVD-256 and KSVD-512. However, if the number of dictionary atoms is too large, the atoms become too redundant, which also degrades the fusion performance; KSVD-1024 is one such example. In addition, a larger dictionary increases the computational complexity of a fusion method.
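As a rough illustration of how such a global dictionary can be trained, the sketch below gathers random 8 × 8 patches from a set of grayscale training images and learns the atoms with scikit-learn's MiniBatchDictionaryLearning, used here merely as a convenient stand-in for the iterative K-SVD algorithm [24]; the function name, the mean removal, and the parameter choices are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d

def learn_global_dictionary(images, n_atoms=512, n_patches=50000, seed=0):
    """Learn a global dictionary (columns = atoms) from 8x8 training patches."""
    patches = []
    for img in images:
        p = extract_patches_2d(img, (8, 8),
                               max_patches=n_patches // len(images),
                               random_state=seed)
        patches.append(p.reshape(len(p), -1))
    X = np.vstack(patches).astype(float)
    X -= X.mean(axis=1, keepdims=True)   # remove patch means, as is common
    # Stand-in for K-SVD [24]: alternating sparse coding / dictionary update
    dico = MiniBatchDictionaryLearning(n_components=n_atoms, random_state=seed)
    dico.fit(X)
    return dico.components_.T            # shape (64, n_atoms)
```

Training an adaptive dictionary such as those in the D_vi-512 set would amount to calling the same routine with a single source image as input.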
TABLE 8: Performance of different dictionaries on the two sets of test images. Scores for all image pairs in each dataset are averaged.

| Test images | Dictionary | MI | $Q_G$ | $Q_S$ | $Q_{ZP}$ | $Q_{PC}$ |
| Multi-focus images | DCT-512 | 3.9947 | 0.7350 | 0.4732 | 0.8941 | 0.6443 |
| | KSVD-128 | 3.8924 | 0.7439 | 0.4737 | 0.9012 | 0.6620 |
| | KSVD-256 | 4.0344 | 0.7575 | 0.5003 | 0.9523 | 0.6826 |
| | KSVD-512 | — | — | — | — | — |
| | KSVD-1024 | 4.0588 | 0.7532 | 0.4919 | 0.9321 | 0.6753 |
| Visible-infrared images | DCT-512 | 2.3280 | 0.5892 | 0.3939 | 0.8195 | 0.4082 |
| | KSVD-128 | 2.0021 | 0.5941 | 0.4009 | 0.7680 | 0.3858 |
| | KSVD-256 | 2.2390 | 0.6179 | 0.4210 | 0.8287 | 0.4175 |
| | KSVD-512 | — | — | — | — | — |
| | KSVD-1024 | — | — | — | — | — |
| | D_vi-512 | 2.3111 | 0.6121 | 0.4196 | — | — |
| | D_ir-512 | 2.1703 | 0.6051 | 0.4126 | 0.8106 | 0.4059 |
| | D_joint-512 | 2.2774 | 0.6109 | 0.4184 | 0.8357 | 0.4216 |

Finally, we discuss the impact of the three activity level measures, the $\ell_0$-norm, $\ell_1$-norm and $\ell_2$-norm of the representation coefficients in (29), on the fusion performance. In this experiment, we employ the traditional SR model and the maximum-selection fusion rule during the fusion process. The quantitative values obtained by the image fusion quality measures reported in Table 9 indicate that the $\ell_1$-norm of the representation coefficients is the best choice among the three activity levels considered here: it achieves higher scores for the fusion of multi-focus images as well as for the fusion of multi-modality images, especially for the former.

TABLE 9: Performance of different activity levels on the two sets of test images. Scores for all image pairs in each dataset are averaged.

| Test images | Activity level | MI | $Q_G$ | $Q_S$ | $Q_{ZP}$ | $Q_{PC}$ |
| Multi-focus images | $\ell_0$-norm | — | — | — | — | — |
| | $\ell_1$-norm | 4.1267 | — | — | — | — |
| | $\ell_2$-norm | 4.0761 | 0.7557 | 0.5025 | 0.9473 | 0.6755 |
| Visible-infrared images | $\ell_0$-norm | — | — | — | — | — |
| | $\ell_1$-norm | 2.3239 | — | — | — | — |
| | $\ell_2$-norm | 2.3390 | 0.6135 | 0.4217 | 0.8453 | 0.4242 |
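For completeness, the three activity levels of Eq. (29) compared in Table 9 amount to the following generic one-liners, where the $\ell_0$ case simply counts the nonzero coefficients.

```python
import numpy as np

def activity_level(x, j=1):
    """Activity level of Eq. (29): the lj-norm of a coefficient vector x."""
    if j == 0:
        return float(np.count_nonzero(x))   # l0: number of nonzero coefficients
    return float(np.linalg.norm(x, ord=j))  # l1 or l2 norm
```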
CONCLUSION AND DISCUSSION

SR-based image fusion methods have attracted much attention recently. Sparse representation models, dictionary learning, and fusion rules are three key components of these techniques. In this paper, we have presented a thorough survey of the issues related to SR-based fusion methods. The following conclusions can be drawn.

For representation models, the traditional SR model is the most popular in image fusion. Extensions, such as the ASR, GSR, NNSR, JSR, and RSR models, have also been applied to image fusion. Fusion performance varies with these models depending on the application. For example, GSR generally achieves better fusion performance when applied to multi-focus image fusion as well as to infrared and visible image fusion. RSR and NNSR may also be good choices for the fusion of multi-focus images and multi-modality images, respectively.

Regarding the dictionaries, over-complete dictionaries with a fixed basis (e.g., a DCT basis) and those learned from a set of training images (globally trained dictionaries) or from the input images themselves (adaptively trained dictionaries) have been applied to image fusion. Generally, the learned dictionaries achieve better fusion performance than those with a fixed basis. The number of atoms in a dictionary has a strong impact on the fusion performance. A compact dictionary with good representation capability is greatly desirable in image fusion for high fusion performance and computational efficiency; however, obtaining one is still a challenging problem in this area.

For fusion strategies, the $\ell_0$-norm, $\ell_1$-norm and $\ell_2$-norm of the representation coefficients or reconstruction errors are usually employed as the activity level, and the maximum-selection fusion rule is employed in most of the existing SR-based image fusion methods. Designing more sophisticated activity levels and fusion rules for SR-based image fusion presents an interesting research topic for the future.

Moreover, most of the current SR-based fusion methods are performed in a patch-based way. In order to improve the robustness to mis-registration while reducing spatial artifacts, a sliding window technique is often employed. This results in some loss of information in the fused image and a huge increase in computational complexity. A good alternative fusion strategy might consist of integrating some local consistency prior into the SR models during the sparse coding phase for each image patch.

Finally, while this paper has mainly reviewed SR-based fusion methods applied to multi-focus and multi-modality image fusion, it is worth noting that SR theory has also been exploited in other image fusion applications, such as remote sensing image fusion (also called pan-sharpening) [26], [59], [74], [75] and multi-exposure image fusion [76]. SR-based pan-sharpening in particular is a hot topic in this field.

ACKNOWLEDGMENTS
This work is supported by the National Natural Science Foundation of China under Grant No. 61104212, by the Natural Science Basic Research Plan in Shaanxi Province of China (Program No. 2016JM6008), by the Fundamental Research Funds for the Central Universities under Grant No. NSIY211416, and by the National Science Foundation under Grant No. ECCS-1405579.

REFERENCES
[1] Y. Liu, S. Liu, and Z. Wang, "Multi-focus image fusion with dense SIFT," Information Fusion, vol. 23, pp. 139–155, 2015.
[2] S. Li, X. Kang, L. Fang, J. Hu, and H. Yin, "Pixel-level image fusion: A survey of the state of the art," Information Fusion, vol. 33, pp. 100–112, 2017.
[3] Q. Zhang, Y. Wang, M. D. Levine, X. Yuan, and L. Wang, "Multi-sensor video fusion based on higher order singular value decomposition," Information Fusion, vol. 24, pp. 54–71, 2015.
[4] S. Pertuz, D. Puig, M. A. Garcia, and A. Fusiello, "Generation of all-in-focus images by noise-robust selective fusion of limited depth-of-field images," IEEE Transactions on Image Processing, vol. 22, no. 3, pp. 1242–1251, 2013.
[5] A. P. James and B. Dasarathy, "Medical image fusion: A survey of the state of the art," Information Fusion, vol. 19, pp. 4–19, 2014.
[6] S. Li, X. Kang, and J. Hu, "Image fusion with guided filtering," IEEE Transactions on Image Processing, vol. 22, no. 7, pp. 2864–2875, 2013.
[7] S. Li, B. Yang, and J. Hu, "Performance comparison of different multi-resolution transforms for image fusion," Information Fusion, vol. 12, no. 2, pp. 74–84, 2011.
[8] G. Pajares and J. M. D. L. Cruz, "A wavelet-based image fusion tutorial," Pattern Recognition, vol. 37, no. 9, pp. 1855–1872, 2004.
[9] Z. Zhang and R. S. Blum, "A categorization of multiscale-decomposition-based image fusion schemes with a performance study for a digital camera application," Proceedings of the IEEE, vol. 87, no. 8, pp. 1315–1326, 1999.
[10] Q. Zhang, Z. Ma, and L. Wang, "Multimodality image fusion by using both phase and magnitude information," Pattern Recognition Letters, vol. 34, no. 2, pp. 185–193, 2013.
[11] Y. Liu, J. Jin, Q. Wang, Y. Shen, and X. Dong, "Region level based multi-focus image fusion using quaternion wavelet and normalized cut," Signal Processing, vol. 97, no. 7, pp. 9–30, 2014.
[12] L. Guo, M. Dai, and M. Zhu, "Multifocus color image fusion based on quaternion curvelet transform," Optics Express, vol. 20, no. 17, pp. 18846–18860, 2012.
[13] L. Wang, B. Li, and L. Tian, "Multi-modal medical image fusion using the inter-scale and intra-scale dependencies between image shift-invariant shearlet coefficients," Information Fusion, vol. 19, no. 1, pp. 20–28, 2014.
[14] K. P. Upla, M. V. Joshi, and P. P. Gajjar, "An edge preserving multiresolution fusion: Use of contourlet transform and MRF prior," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 6, pp. 3210–3220, 2015.
[15] J. Wright, A. Y. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.
[16] T. Guha and R. K. Ward, "Learning sparse representations for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 8, pp. 1576–1588, 2012.
[17] X. Yuan, X. Liu, and S. Yan, "Visual classification with multitask joint sparse representation," IEEE Transactions on Image Processing, vol. 21, no. 10, pp. 4349–4360, 2012.
[18] B. Yang and S. Li, "Multifocus image fusion and restoration with sparse representation," IEEE Transactions on Instrumentation and Measurement, vol. 59, no. 4, pp. 884–892, 2010.
[19] Y. Liu, S. Liu, and Z. Wang, "A general framework for image fusion based on multi-scale transform and sparse representation," Information Fusion, vol. 24, pp. 147–164, 2015.
[20] Q. Zhang and X. Maldague, "An adaptive fusion approach for infrared and visible images based on NSCT and compressed sensing," Infrared Physics & Technology, vol. 74, pp. 11–20, 2016.
[21] Q. Wei, J. Bioucas-Dias, N. Dobigeon, and J.-Y. Tourneret, "Hyperspectral and multispectral image fusion based on a sparse representation," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 7, pp. 3658–3668, 2015.
[22] Y. Liu and Z. Wang, "Simultaneous image fusion and denoising with adaptive sparse representation," IET Image Processing, vol. 9, no. 5, pp. 347–357, 2015.
[23] M. Nejati, S. Samavi, and S. Shirani, "Multi-focus image fusion using dictionary-based sparse representation," Information Fusion, vol. 25, pp. 72–84, 2015.
[24] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311–4322, 2006.
[25] H. Yin, S. Li, and L. Fang, "Simultaneous image fusion and super-resolution using sparse representation," Information Fusion, vol. 14, no. 3, pp. 229–240, 2013.
[26] M. Guo, H. Zhang, J. Li, L. Zhang, and H. Shen, "An online coupled dictionary learning approach for remote sensing image fusion," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 4, pp. 1284–1294, 2014.
[27] J. Wang, J. Peng, X. Feng, G. He, and J. Fan, "Fusion method for infrared and visible images by using non-negative sparse representation," Infrared Physics & Technology, vol. 67, pp. 477–489, 2014.
[28] Q. Zhang and M. D. Levine, "Robust multi-focus image fusion using multi-task sparse representation and spatial context," IEEE Transactions on Image Processing, vol. 25, no. 5, pp. 2045–2058, 2016.
[29] S. Li, H. Yin, and L. Fang, "Group-sparse representation with dictionary learning for medical image denoising and fusion," IEEE Transactions on Biomedical Engineering, vol. 59, no. 12, pp. 3450–3459, 2012.
[30] Q. Zhang, Y. Fu, H. Li, and J. Zou, "Dictionary learning method for joint sparse representation-based image fusion," Optical Engineering, vol. 52, no. 5, pp. 057006-1–057006-11, 2013.
[31] M. Kim, D. K. Han, and H. Ko, "Joint patch clustering-based dictionary learning for multimodal image fusion," Information Fusion, vol. 27, pp. 198–214, 2016.
[32] G. Vivone, L. Alparone, J. Chanussot, M. D. Mura, A. Garzelli, G. Licciardi, R. Restaino, and L. Wald, "A critical comparison among pansharpening algorithms," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 5, pp. 2565–2586, 2015.
[33] L. Loncan, L. B. de Almeida, J. M. Bioucas-Dias, X. Briottet, J. Chanussot, N. Dobigeon, S. Fabre, W. Liao, G. A. Licciardi, M. Simoes et al., "Hyperspectral pansharpening: A review," IEEE Geoscience and Remote Sensing Magazine, vol. 3, no. 3, pp. 27–46, 2015.
[34] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, and S. Yan, "Sparse representation for computer vision and pattern recognition," Proceedings of the IEEE, vol. 98, no. 6, pp. 1031–1044, 2010.
[35] Z. Zhang, Y. Xu, J. Yang, X. Li, and D. Zhang, "A survey of sparse representation: Algorithms and applications," IEEE Access, vol. 3, pp. 490–530, 2015.
[36] S. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3397–3415, 1993.
[37] A. M. Bruckstein, D. L. Donoho, and M. Elad, "From sparse solutions of systems of equations to sparse modeling of signals and images," SIAM Review, vol. 51, no. 1, pp. 34–81, 2009.
[38] J. A. Tropp, A. C. Gilbert, and M. J. Strauss, "Algorithms for simultaneous sparse approximation: part I: Greedy pursuit," Signal Processing, vol. 86, no. 3, pp. 572–588, 2006.
[39] E. J. Candes, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis?" Journal of the ACM, vol. 58, no. 3, pp. 1–37, 2011.
[40] D. L. Donoho and Y. Tsaig, "Fast solution of $\ell_1$-norm minimization problems when the solution may be sparse," IEEE Transactions on Information Theory, vol. 54, no. 11, pp. 4789–4812, 2008.
[41] D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[42] W. Dong, F. Fu, G. Shi, X. Cao, J. Wu, G. Li, and X. Li, "Hyperspectral image super-resolution via non-negative structured sparse representation," IEEE Transactions on Image Processing, vol. 25, no. 5, pp. 2337–2352, 2016.
[43] D. Baron, M. B. Wakin, M. F. Duarte, S. Sarvotham, and R. G. Baraniuk, "Distributed compressed sensing," Preprint, vol. 22, no. 10, pp. 2729–2732, 2012.
[44] H. Yin and S. Li, "Multimodal image fusion with joint sparsity model," Optical Engineering, vol. 50, no. 6, pp. 067007-1–067007-10, 2011.
[45] N. Yu, T. Qiu, F. Bi, and A. Wang, "Image features extraction and fusion based on joint sparse representation," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 5, pp. 1074–1082, 2011.
[46] A. Majumdar and R. K. Ward, "Fast group sparse classification," Canadian Journal of Electrical and Computer Engineering, vol. 34, no. 4, pp. 136–144, 2009.
[47] Z. Lin, R. Liu, and Z. Su, "Linearized alternating direction method with adaptive penalty for low-rank representation," in Advances in Neural Information Processing Systems, 2011, pp. 612–620.
[48] Y. Zhang, Z. Jiang, and L. S. Davis, "Learning structured low-rank representations for image classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 676–683.
[49] C. Lang, G. Liu, J. Yu, and S. Yan, "Saliency detection by multitask sparsity pursuit," IEEE Transactions on Image Processing, vol. 21, no. 3, pp. 1327–1338, 2012.
[50] B. Yang and S. Li, "Pixel-level image fusion with simultaneous orthogonal matching pursuit," Information Fusion, vol. 13, no. 1, pp. 10–19, 2012.
[51] L. Landweber, "An iteration formula for Fredholm integral equations of the first kind," American Journal of Mathematics, vol. 73, no. 3, pp. 615–624, 1951.
[52] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, Cambridge, UK, 1985.
[53] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Non-local sparse models for image restoration," in Proceedings of the 12th International Conference on Computer Vision, 2009, pp. 2272–2279.
[54] P. Chatterjee and P. Milanfar, "Clustering-based denoising with locally learned dictionaries," IEEE Transactions on Image Processing, vol. 18, no. 7, pp. 1438–1451, 2009.
[55] W. Dong, X. Li, L. Zhang, and G. Shi, "Sparsity-based image denoising via dictionary learning and structural clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 457–464.
[56] J. Yang, J. Wright, T. S. Huang, and Y. Ma, "Image super-resolution via sparse representation," IEEE Transactions on Image Processing, vol. 19, no. 11, pp. 2861–2873, 2010.
[57] M. T. Iqbal and J. Chen, "Unification of image fusion and super-resolution using jointly trained dictionaries and local information contents," IET Image Processing, vol. 6, no. 9, pp. 1299–1310, 2012.
[58] K. Ren and F. Xu, "Super-resolution images fusion via compressed sensing and low-rank matrix decomposition," Infrared Physics & Technology, vol. 68, pp. 61–68, 2015.
[59] X. X. Zhu and R. Bamler, "A sparse image fusion algorithm with application to Pan-sharpening," IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 5, pp. 2827–2836, 2013.
[60] R. Y. Ibrahim, J. Alirezaie, and P. Babyn, "Pixel level jointed sparse representation with RPCA image fusion algorithm," 2015, pp. 592–592.
[61] Y. Zhang and Y. Chen, "A new image-fusion technique based on blocked sparse representation," in Proceedings of the International Conference on Computer Science and Information Technology, 2014, pp. 53–60.
[62] Y. Yao, X. Xin, and P. Guo, "OMP or BP? A comparison study of image fusion based on joint sparse representation," in Proceedings of the 19th International Conference on Neural Information Processing, 2012, pp. 75–82.
[63] Y. Yao, P. Guo, X. Xin, and Z. Jiang, "Image fusion by hierarchical joint sparse representation," Cognitive Computation, vol. 6, no. 3, pp. 281–292, 2014.
[64] B. Yang, J. Luo, L. Guo, and F. Cheng, "Simultaneous image fusion and demosaicing via compressive sensing," Information Processing Letters, vol. 116, no. 7, pp. 447–454, 2016.
[65] X. Lu, B. Zhang, Y. Zhao, H. Liu, and H. Pei, "The infrared and visible image fusion algorithm based on target separation and sparse representation," Infrared Physics & Technology, vol. 67, pp. 397–407, 2014.
[66] Q. Zhang, L. Wang, H. Li, and Z. Ma, "Similarity-based multimodality image fusion with shiftable complex directional pyramid," Pattern Recognition Letters, vol. 32, no. 13, pp. 1544–1553, 2011.
[67] B. Yang and S. Li, "Visual attention guided image fusion with sparse representation," Optik, vol. 125, no. 17, pp. 4881–4888, 2014.
[68] M. Ding, L. Wei, and B. Wang, "Research on fusion method for infrared and visible images via compressive sensing," Infrared Physics & Technology, vol. 57, pp. 56–67, 2013.
[69] H. Yin, "Sparse representation with learned multiscale dictionary for image fusion," Neurocomputing, vol. 148, pp. 600–610, 2015.
[70] Y. Liu, S. Liu, and Z. Wang, "Medical image fusion by combining nonsubsampled contourlet transform and sparse representation," in Communications in Computer and Information Science, vol. 484, pp. 372–381, 2014.
[71] G. Qu, D. Zhang, and P. F. Yan, "Information measure for performance of image fusion," Electronics Letters, vol. 38, no. 7, pp. 313–315, 2002.
[72] C. S. Xydeas and V. Petrovic, "Objective image fusion performance measure," Electronics Letters, vol. 36, no. 4, pp. 308–309, 2000.
[73] Z. Liu, D. S. Forsyth, and R. Laganiere, "A feature-based metric for the quantitative evaluation of pixel-level image fusion," Computer Vision and Image Understanding, vol. 109, no. 1, pp. 56–68, 2008.
[74] M. R. Vicinanza, R. Restaino, G. Vivone, M. D. Mura, and J. Chanussot, "A pansharpening method based on the sparse representation of injected details," IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 1, pp. 180–184, 2015.
[75] C. Jiang, H. Zhang, H. Shen, and L. Zhang, "Two-step sparse coding for the pan-sharpening of remote sensing images," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 5, pp. 1792–1805, 2014.
[76] J. Wang, H. Liu, and N. He, "Exposure fusion based on sparse representation using approximate K-SVD,"