Automatic Image Registration in Infrared-Visible Videos using Polygon Vertices
Tanushri Chakravorty, Guillaume-Alexandre Bilodeau
LITIV Lab, École Polytechnique de Montréal, Montréal, QC, Canada
Email: [email protected], [email protected]

Eric Granger
LIVIA, École de technologie supérieure, Montréal, QC, Canada
Email: [email protected]
Abstract—In this paper, an automatic method is proposed to perform image registration of multiple targets in pairs of visible and infrared video sequences. In multimodal image analysis, such as image fusion systems, color and IR sensors are placed close to each other and capture the same scene simultaneously, but the videos are not aligned by default because of different fields of view, working principles and other camera specifications. Because the scenes are usually not planar, alignment must be performed continuously by extracting relevant common information. In this paper, we approximate the shape of the targets by polygons and use an affine transformation to align the two video sequences. After background subtraction, keypoints on the contours of the foreground blobs are detected using the DCE (Discrete Curve Evolution) technique. These keypoints are then described by the local shape of the polygon at each vertex. The keypoints are matched based on the convexity of the polygon vertices and the Euclidean distance between them, and only good matches for each local polygon in a frame are kept. To achieve a global affine transformation that maximizes the overlap of infrared and visible foreground pixels, the matched keypoints of each polygon are stored in a temporal buffer for a fixed number of frames. The transformation matrix is evaluated at each frame using this buffer, and the best matrix is selected based on an overlap ratio criterion. Our experimental results demonstrate that this method provides highly accurate registered images and outperforms a previous related method.
Index Terms—Image registration; feature matching; homography; multimodal analysis; temporal information
I. INTRODUCTION
In recent years there has been increasing interest in infrared-visible stereo pairs for video surveillance, because the two sensors complement each other. This has led to the development of a variety of applications, ranging from medical imaging and computer vision to remote sensing and astrophotography, that extract more information about an object of interest in an image. A visible camera provides information about the visual context of the objects in the scene, but under poor lighting conditions only limited information is captured. An infrared camera, on the other hand, provides enhanced contrast and rich information about objects when there is little light, especially in dark environments. An example of this difference is shown in Figure 1. Therefore, to benefit from both modalities, information must be extracted from both capturing sources, for which image registration is a necessary step.

Infrared-visible image registration is a very challenging problem, since the thermal and visible sensors capture different information about a scene [1]. The infrared sensor captures the heat signature emitted by objects, while the visible sensor captures the light reflected by objects. Due to this difference, correspondences between the visible and infrared images are hard to establish, as local intensities and textures do not match, as can be seen in Figure 1.

Fig. 1. Different image information captured of the same scene by (a) a visible camera and (b) an infrared camera, respectively. Note the absence of the striped texture of the shirt in infrared.
Detecting and matching common features such as appearance or shape in the images captured by both cameras therefore becomes very difficult. The problem becomes even more challenging as the number of targets in the scene increases, since the complexity of the system grows. Hence, to perform more accurate infrared-visible image registration for multiple targets such as people, we use a method based on keypoint features on target boundaries and on temporal information between matched keypoint pairs to calculate the best affine transformation matrix.

To establish feature point correspondence, the boundary regions observed in the visible and infrared videos are considered as noisy polygons, and the aim is to compute the correspondence between the vertices of these polygons. This information can further be used for the fusion of infrared and visible images to improve object detection, tracking, recognition, etc. Since infrared is relatively invariant to changes in illumination, it can identify objects under all lighting conditions, even in total darkness. The valuable information provided by infrared images is therefore a strong asset for the development of surveillance applications. In this paper, both cameras are placed parallel and close to each other in a stereo pair configuration, i.e. the cameras observe a common viewpoint [1]. Note that we do not assume that the scene is planar, but we do assume that all targets are approximately in the same plane; that is, the group of targets moves together through different planes throughout the videos. Our method can be generalized to many targets in different planes.

This paper extends the work of [2]. The contributions of this paper are:
1) We improve registration accuracy over the state of the art, since we consider each polygon obtained in a frame separately, which results in more precise keypoint matches and fewer outliers. If all the polygons are considered at once when finding matches, as in [2], much ambiguity arises and the method is limited to targets in a single plane at a time.
2) By considering each polygon separately, our method can be generalized to a scene with targets appearing simultaneously in many depth planes.
3) We propose a more robust criterion to evaluate the quality of a transformation, namely the overlap ratio of the intersection of the infrared and visible foreground pixels over the union of these foreground pixels. This allows us to update the current scene transformation matrix only if the new one improves the accuracy.

This paper is structured as follows. Section II discusses recent work in the field. The proposed algorithm is presented in Section III. Section IV presents the registration results and the accuracy of the algorithm with respect to the ground truth. Finally, Section V summarizes the paper.

II. PREVIOUS WORK
Image registration has mainly been applied in thermography and multimodal analysis, in which a particular scene is captured using visible and infrared sensors from different viewpoints to extract more information about it. To extract common information, image regions in both the infrared and the visible are often compared using a similarity measure such as LSS (Local Self-Similarity) [3] or MI (Mutual Information) [4]. LSS and MI are easy to compute over regions, but the procedure becomes slow when used as a feature for image registration, particularly when there are many targets and registration has to be computed at every frame. Therefore, features like boundaries [5], [6] and edges or connected edges are among the most popular approaches for extracting common information in this scenario.

Features such as corners are also used to perform matching in image registration [7]. The corners are detected in both the visible and infrared images, and their similarity is measured using the Hausdorff distance [8]. Furthermore, features such as line segments and virtual line intersections have also been used for registration [9]. To find correspondences between frames, recent methods based on blob tracking [10] and trajectory computation [1] have also been proposed. But these methods are complex, since they need many trajectories to achieve a good transformation matrix and hence more computation. Also, they only apply to planar scenes.

Recently, Sonn et al. [2] presented a fast registration method based on the DCE (Discrete Curve Evolution) polygonal approximation [11] of foreground blobs. The method gives promising results, but it is limited in precision because it only considers the scene globally. We extend this method by considering each target individually for better matching precision, which in turn allows calculating a transformation for each individual target. We also improve the transformation matrix selection. In our proposed method, we extract keypoint features on each contour. The advantage of keypoint features is that they are simple and easy to compute. Also, to obtain more matches, the keypoints are stored in a temporal buffer that is continually renewed after a fixed number of frames, resulting in a better transformation for the video sequence. Our experiments show that we obtain better results compared to a recent method for registering infrared-visible video sequences.

III. METHODOLOGY
Our proposed method consists of four main steps, as shown in Figure 2. The first step performs background subtraction according to the method explained in [12] to obtain the foreground blob regions (contours). The second step performs feature detection and extraction using the DCE (Discrete Curve Evolution) technique [11], which outputs significant keypoints on the contours. For feature description, the detected keypoints are described by the local polygon shape computed at each keypoint. The third step performs feature matching by comparing the feature descriptors obtained at the previous step using the similarity measures described in Section III-D. The corresponding matches are saved in a temporal buffer, which accumulates matches from recent observations to reduce noise and improve registration precision. The fourth and final step calculates the homography matrix from the matched descriptors stored in the temporal buffer, yielding the transformation matrix output of our algorithm. This process is applied at every frame, since the targets are assumed to move through different planes. The buffer is refreshed after a few frames so as to keep the latest matched keypoints, which helps in estimating the most recent homography that best transforms the scene. All the keypoints in the temporal buffer are used to calculate the transformation matrix. A sketch of this loop is given after Figure 2.
Fig. 2. System Block Diagram
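As a rough illustration of the four-step loop, the following Python sketch shows how the per-frame processing fits together. The helpers `describe_polygon`, `match_polygons` and `fit_and_score`, as well as the background subtractor objects, are hypothetical stand-ins for the steps detailed in Sections III-A to III-E; this is not the authors' implementation.

```python
from collections import deque

BUFFER_SIZE = 30  # frames kept in the temporal buffer (30 or 100 in Section IV)

def register_sequence(visible_frames, infrared_frames, bg_vis, bg_ir):
    """Sketch of the per-frame registration loop (Sections III-A to III-E)."""
    buffer = deque(maxlen=BUFFER_SIZE)      # FIFO temporal buffer of matches
    best_H, best_ratio = None, 0.0
    for vis, ir in zip(visible_frames, infrared_frames):
        # Step 1: background subtraction -> foreground masks and contours
        vis_mask, vis_contours = bg_vis.apply(vis)
        ir_mask, ir_contours = bg_ir.apply(ir)
        # Step 2: DCE polygon approximation and keypoint description
        vis_polys = [describe_polygon(c) for c in vis_contours]
        ir_polys = [describe_polygon(c) for c in ir_contours]
        # Step 3: polygon-by-polygon matching, accumulated over time
        buffer.extend(match_polygons(ir_polys, vis_polys))
        # Step 4: RANSAC homography over the buffer; keep the best matrix
        H, ratio = fit_and_score(buffer, ir_mask, vis_mask)
        if H is not None and ratio > best_ratio:
            best_H, best_ratio = H, ratio
        yield best_H
```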
A. Background Subtraction
The objective of background subtraction is to identify the foreground objects in an image. Techniques like frame differencing [13] compute the difference between two consecutive frames based on a threshold. Such a technique may not be useful for images with fast motion, since it relies on a global threshold, and it does not give complete object contours. This is why a proper background subtraction technique is needed: better background subtraction gives better registration accuracy. In this work we use a simple background subtraction technique based on temporal averaging [12]. The advantage of this method is that it is fast and selectively segments foreground blobs from the background. It produces a correct background image for each input frame and is able to efficiently remove ghosts and stationary objects from the background image. Any other method could be used. A minimal sketch of this step follows.
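The sketch below is a minimal temporal-averaging background subtractor in the spirit of [12], not the authors' exact implementation; the update rate `alpha` and the threshold `thresh` are illustrative assumptions.

```python
import cv2
import numpy as np

class TemporalAverageBG:
    """Temporal-averaging background subtraction, a rough sketch of [12]."""

    def __init__(self, alpha=0.05, thresh=30):
        self.alpha = alpha    # background update rate (assumed value)
        self.thresh = thresh  # per-pixel foreground threshold (assumed value)
        self.bg = None

    def apply(self, frame):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if self.bg is None:
            self.bg = gray.copy()
        # Running average of the background model
        self.bg = (1 - self.alpha) * self.bg + self.alpha * gray
        # Pixels far from the background model are declared foreground
        mask = (np.abs(gray - self.bg) > self.thresh).astype(np.uint8) * 255
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_NONE)
        return mask, contours
```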
B. Feature Extraction
After the background subtraction step, which provides the foreground blobs, an adapted DCE (Discrete Curve Evolution) [14] is used to detect salient points on the infrared and visible foreground blobs. These points represent points of visual significance along the contour, and the branches connected to them represent the shape significance. Branches that end at a concave point are discarded, and the output of the algorithm converges to a convex contour. The method also filters out noisy points along the contours. Hence, the method approximates each contour of a foreground blob by a polygon. The number of significant vertices determined by the DCE algorithm on each contour can be set by the user; it is set to 16 in our case. Before the matching process, internal contours (holes) are also removed from each contour. A simplified sketch of plain DCE is given below.
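As an illustration, here is a simplified sketch of plain DCE [11]: the least relevant vertex is removed iteratively until the requested number of vertices remains. The adapted variant of [14] additionally discards branches ending at concave points, which is not shown here.

```python
import numpy as np

def dce_polygon(contour, n_vertices=16):
    """Plain Discrete Curve Evolution [11] over a closed contour.

    `contour` is an (N, 2) array of boundary points. The relevance of a
    vertex combines the turn angle with the lengths of its two incident
    edges (the K measure of Latecki and Lakamper)."""
    pts = np.asarray(contour, dtype=np.float64).reshape(-1, 2)

    def relevance(p_prev, p, p_next):
        a, b = p - p_prev, p_next - p
        la, lb = np.linalg.norm(a), np.linalg.norm(b)
        if la == 0 or lb == 0:
            return 0.0  # degenerate edge: remove this vertex first
        cos_t = np.clip(np.dot(a, b) / (la * lb), -1.0, 1.0)
        beta = np.arccos(cos_t)            # turn angle at the vertex
        return beta * la * lb / (la + lb)  # relevance measure K

    while len(pts) > n_vertices:
        n = len(pts)
        scores = [relevance(pts[(i - 1) % n], pts[i], pts[(i + 1) % n])
                  for i in range(n)]
        pts = np.delete(pts, int(np.argmin(scores)), axis=0)
    return pts
```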
C. Feature Description
Each significant keypoint obtained by DCE is described by the local shape of the polygon at that point [2]. Properties of the local polygon vertices, namely convexity and the angle of the polygon, are used to describe the keypoints. Hence, a feature descriptor is composed of two components: the convexity and the angle of the polygon. For example, consider three consecutive points $P_1$, $P_2$, $P_3$ on a contour in clockwise order. The convexity of point $P_2$ is determined from the normal vector defined with respect to the two other points:

$$\vec{n} = \vec{P_{21}} \times \vec{P_{23}}, \qquad (1)$$

where $\vec{n}$ is the normal vector, $\vec{P_{21}}$ is the vector from $P_2$ to $P_1$, and $\vec{P_{23}}$ is the vector from $P_2$ to $P_3$. Each keypoint is given three-dimensional coordinates $(x, y, 0)$. After the cross product, $\vec{n}$ contains the value of the $z$ coordinate, which is evaluated to determine the convexity of the keypoint: if it is less than zero, the vertex is considered convex, otherwise it is concave. Only the contours satisfying this convexity criterion are kept for further processing.

The angle $\theta$ at a keypoint is computed by the law of cosines:

$$\theta = \cos^{-1} \frac{|\vec{P_{21}}|^2 + |\vec{P_{23}}|^2 - |\vec{P_{13}}|^2}{2\,|\vec{P_{21}}|\,|\vec{P_{23}}|}, \qquad (2)$$

where $\theta$ is the angle formed between $\vec{P_{21}}$ and $\vec{P_{23}}$, and $\vec{P_{13}}$ is the vector from $P_1$ to $P_3$. A direct transcription of these two equations is sketched below.
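The following sketch is a direct transcription of Eqs. (1) and (2), assuming clockwise vertex order in image coordinates; degenerate (zero-length) edges are handled by returning a descriptor the caller should discard.

```python
import numpy as np

def describe_vertex(p1, p2, p3):
    """Descriptor of keypoint P2 from its neighbours P1 and P3 (Eqs. 1-2).

    Points are 2D; a zero z-coordinate is appended for the cross product.
    Returns (z, theta): z < 0 indicates a convex vertex (Eq. 1) and theta
    is the angle at P2 in degrees (Eq. 2)."""
    p2 = np.asarray(p2, float)
    p21 = np.append(np.asarray(p1, float) - p2, 0.0)  # vector P2 -> P1
    p23 = np.append(np.asarray(p3, float) - p2, 0.0)  # vector P2 -> P3
    l21, l23 = np.linalg.norm(p21), np.linalg.norm(p23)
    if l21 == 0 or l23 == 0:
        return 0.0, 0.0  # degenerate edge; caller should discard this vertex
    z = np.cross(p21, p23)[2]  # sign of z gives the convexity (Eq. 1)
    # Angle at P2 via the law of cosines (Eq. 2)
    l13 = np.linalg.norm(np.asarray(p3, float) - np.asarray(p1, float))
    cos_t = (l21**2 + l23**2 - l13**2) / (2 * l21 * l23)
    theta = np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))
    return z, theta
```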
D. Feature Matching

To find correspondences between keypoints, each polygon is analysed separately, one by one, in both the visible and infrared foreground regions. This yields a larger number of matched pairs compared to the method in [2], where all the polygons are analysed at once for the whole image. It also allows us to register each target individually if required. Therefore, in this work we have to find both the best polygon matches and the best keypoint matches. The features are matched by comparing the feature descriptors obtained in the previous step. Similarity conditions on convexity, Euclidean distance and angle difference between the feature descriptors determine the degree of matching between the polygons. The matching criteria are the following [2]:
1) Convexity, $c$: calculated using Eq. 1. Only the keypoints satisfying the convexity criterion are kept.
2) $E_d$: the Euclidean distance between two keypoints,
$$E_d = |P_I - P_V|. \qquad (3)$$
3) $E_\theta$: the difference between the two keypoint angles,
$$E_\theta = |\theta_I - \theta_V|. \qquad (4)$$

The pairs of keypoints from the visible and infrared images that fulfil the convexity criterion $c$ (Eq. 1) are kept. Then the Euclidean distance $E_d$ between the two keypoints and the difference $E_\theta$ between the two keypoint angles are calculated using Eq. 3 and Eq. 4, respectively. The threshold for the Euclidean distance is $E_{dMax}$, and the maximum angle error is $E_{\theta Max}$. Only those pairs of keypoints for which $E_d \le E_{dMax}$ and $E_\theta \le E_{\theta Max}$ hold are kept; the other pairs are rejected. If a keypoint in the infrared has more than one match in the visible, the best match is selected by a score criterion, as in [2]. The score $S$ is calculated as

$$S = \alpha \frac{E_d}{E_{dMax}} + \frac{E_\theta}{E_{\theta Max}}. \qquad (5)$$

Additionally, contrary to [2], we only keep matches that lie on the best matching pairs of polygons. The matched keypoints of each polygon in both the visible and infrared images are saved in a temporal buffer of matches, since a significant number of matched keypoints may not be obtained from a single frame. The temporal buffer stores the matched keypoints for a few frames and is renewed with new keypoints, filling in a first-in-first-out fashion. This technique helps to attain a significant number of matched keypoints, which results in a more meaningful and accurate homography matrix. One or more temporal buffers of matches can be used: to register each object individually, a temporal buffer should be attributed to each object, and tracking may be required to distinguish the different buffers. In this work, to test our proposed improvements, we use a single temporal buffer and assume that all objects move together in a single plane. We will see later that, even in this case, considering matches on a polygon-by-polygon basis improves accuracy, because matching ambiguities are greatly reduced. A sketch of the matching step follows.
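The sketch below matches the keypoints of one infrared polygon against one visible polygon under the criteria of Eqs. (1) to (5); selecting the best-matching pairs of polygons would wrap this in an outer loop. The keypoint format ((x, y) position, z, theta) and the weight `ALPHA` are assumptions for illustration, not values from the paper.

```python
import numpy as np

E_D_MAX = 65.0      # Euclidean distance threshold (pixels), from Section IV
E_THETA_MAX = 40.0  # angle difference threshold (degrees), from Section IV
ALPHA = 0.5         # weight in the score of Eq. 5 -- illustrative assumption

def match_polygon(ir_keypoints, vis_keypoints):
    """Match keypoints of one IR polygon against one visible polygon.

    Keypoints are ((x, y), z, theta) triples from the descriptor step,
    where z < 0 marks a convex vertex (Eq. 1)."""
    matches = []
    for p_ir, z_ir, th_ir in ir_keypoints:
        best, best_score = None, float("inf")
        for p_vis, z_vis, th_vis in vis_keypoints:
            if z_ir >= 0 or z_vis >= 0:    # both vertices must be convex
                continue
            e_d = np.linalg.norm(np.asarray(p_ir) - np.asarray(p_vis))  # Eq. 3
            e_t = abs(th_ir - th_vis)                                   # Eq. 4
            if e_d > E_D_MAX or e_t > E_THETA_MAX:
                continue
            score = ALPHA * e_d / E_D_MAX + e_t / E_THETA_MAX           # Eq. 5
            if score < best_score:         # keep only the best candidate
                best, best_score = (p_ir, p_vis), score
        if best is not None:
            matches.append(best)
    return matches
```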
E. Transformation Matrix Calculation and Matrix Selection

The pairs of matched keypoints stored in the temporal buffer are used to determine an accurate transformation matrix for every frame of the video scene. The temporal buffer should not have too long a temporal extent, as the targets gradually move from one plane to another. Therefore, the saved matches are handled in a FIFO (first-in-first-out) manner: only the last few frames are used to calculate the transformation matrix. The matrix thus obtained is more accurate, since the most recently saved pairs of matches in the temporal buffer correspond to the polygons present in the last frames of the video sequence, which are approximately in the same plane.

To calculate the homography and filter out outlier points in the buffer, RANSAC (RANdom SAmple Consensus) is used [15]. The matrix is calculated for each frame, and the best matrix is saved and applied to the infrared foreground frame, yielding a transformed infrared foreground frame. To select the best matrix, the blob overlap ratio is evaluated at every frame:

$$BR = \frac{A_I \cap A_V}{A_I \cup A_V}, \qquad (6)$$

where $A_I$ and $A_V$ are the foreground regions of the transformed infrared and visible blobs, respectively. Only the matrix for which the overlap ratio $BR$ is closest to 1 is selected. This improved selection criterion contributes to an improvement in precision. A sketch of this step is given below.
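A minimal sketch of this step using OpenCV's RANSAC-based homography estimation follows; the reprojection threshold of 5.0 pixels is an illustrative choice, and `cv2.estimateAffine2D` could be substituted for a strictly affine model.

```python
import cv2
import numpy as np

def fit_and_score(buffer, ir_mask, vis_mask):
    """Fit a transformation to all buffered matches with RANSAC [15] and
    score it with the blob overlap ratio BR of Eq. 6.

    `buffer` holds (ir_point, vis_point) pairs; the masks are the binary
    foregrounds of the current infrared and visible frames."""
    if len(buffer) < 4:                        # a homography needs >= 4 pairs
        return None, 0.0
    src = np.float32([ir for ir, vis in buffer]).reshape(-1, 1, 2)
    dst = np.float32([vis for ir, vis in buffer]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    if H is None:
        return None, 0.0
    h, w = vis_mask.shape[:2]
    warped = cv2.warpPerspective(ir_mask, H, (w, h))   # transformed IR blobs
    inter = np.logical_and(warped > 0, vis_mask > 0).sum()
    union = np.logical_or(warped > 0, vis_mask > 0).sum()
    return H, (inter / union if union else 0.0)        # BR of Eq. 6
```

The caller keeps the matrix only when this ratio improves on the best one seen so far, which implements the selection criterion described above.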
IV. EVALUATION OF IMAGE REGISTRATION ACCURACY

We have used the alignment error to measure the accuracy of our method: the mean square error is evaluated at ground-truth points selected in the visible and infrared video frames. The alignment error measures how different the transformation model obtained by the image registration method is from the ground truth.

To test our new method, we calculated a single transformation matrix at each frame, as we use a single temporal buffer in the current implementation of our method. This choice has the benefit of allowing a comparison with the method of Sonn et al. [2], which only considers the scene globally at each frame. To allow comparison with the state of the art, we applied the method of Sonn et al. [2] to our video sequences using their default parameters (provided to us by the authors). Our method was tested on selected frames distributed throughout the video. The mean error was calculated for points selected over the regions of persons. The number of persons present in the scene varies from 1 to 5. In the selected test frames, they are approximately together in a single plane (see Figure 4d). We detail the results for the various numbers of people, and thus for the various numbers of planes potentially present at each frame.

For each sequence, two tests were done: one with a buffer size of 30 frames and the other with a buffer size of 100 frames. Both methods were tested with the same buffer size. A sketch of the alignment error computation follows.
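The sketch below shows one plausible reading of this metric, assuming $E_x$ and $E_y$ are the mean absolute errors per axis between the transformed infrared points and the ground-truth visible points; the exact averaging used by the authors may differ.

```python
import numpy as np

def alignment_error(H, ir_points, gt_vis_points):
    """Alignment error E = sqrt(Ex^2 + Ey^2) against ground-truth points.

    `ir_points` are ground-truth points in the infrared frame, mapped by
    the homography H onto the visible frame and compared with the
    corresponding `gt_vis_points`."""
    pts = np.hstack([np.asarray(ir_points, float),
                     np.ones((len(ir_points), 1))])  # homogeneous coordinates
    proj = (np.asarray(H) @ pts.T).T
    proj = proj[:, :2] / proj[:, 2:3]                # back to Cartesian
    err = np.abs(proj - np.asarray(gt_vis_points, float))
    ex, ey = err[:, 0].mean(), err[:, 1].mean()      # per-axis mean errors
    return np.sqrt(ex**2 + ey**2)
```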
TABLE I. MEAN REGISTRATION ERROR FOR 1-5 PERSONS IN THE VISIBLE AND INFRARED VIDEO SEQUENCES, FOR TEMPORAL BUFFER SIZES OF 30 AND 100 FRAMES RESPECTIVELY. MEAN REGISTRATION ERROR: $E = \sqrt{E_x^2 + E_y^2}$ COMPARED TO THE GROUND TRUTH.

The parameters used for our method are $E_{dMax} = 65$ and $E_{\theta Max} = 40$ degrees. Table I shows the mean registration error for 1 to 5 persons at the selected frames of the video, and shows that our method outperforms the previous work [2] for all numbers of people and for both buffer sizes. Since we consider each contour separately, one by one, in the video sequences, we obtain more keypoints (features), which helps the matching. The remaining noisy keypoints are filtered by the RANSAC algorithm.

The fact that our new method outperforms the method of Sonn et al. [2] even for a single person is significant. This can be explained by the background subtraction, which is not perfect and yields several polygons even in the case of one person, as shown in Figure 3. Considering each polygon individually allows us to select better matches and to remove contradictory information between polygon matches.

Fig. 3. (a) and (b) More than one polygon for a single person after the background subtraction step. Our method filters out the contradictory matches between polygons, since we consider each polygon individually.

This shows that matching polygon vertices globally is error prone, as the local shapes of the polygon vertices are not necessarily unique. Thus, we should ensure that the matching vertices are matched as a group from the same pair of polygons. Furthermore, because our transformation matrix criterion is better, we also update the current transformation matrix more adequately. The results show that the error varies between 0.5 and 2 pixels. Since the buffer size has a small impact on the matching between keypoints, the buffer size can be chosen depending on the application. Figure 4 shows transformation results. Note that the registration error increases when the people are not exactly in the same depth plane (see Figures 4e and 4f). For such cases, we can improve the results by calculating more than one transformation matrix, one for each person.

V. SUMMARY AND CONCLUSIONS
We have presented an alternative approach to other image registration methods, such as region-based, frame-by-frame keypoint-based and trajectory-based registration methods, that works for visible and infrared stereo pairs. The method uses features based on a polygonal approximation and a temporal buffer filled with matched keypoints. The results show that our method outperforms [2] for every tested sequence. As we consider each contour locally, one by one, in the video sequence, we obtain more features and hence more matches. To obtain the best transformation from these matches, we use a selection criterion based on the overlap ratio of the two transformed foreground infrared and visible images and select the best ratio, which improves precision and accuracy and thus yields the best transformation of a video scene.

In future work, we would manage the temporal buffer attribution for each blob by incorporating information from a tracker. This would give even more precise results in cases where the targets are at different depth planes.

ACKNOWLEDGMENT
This work was supported by the FRQ-NT Team research project grant No. 167442.

Fig. 4. Transformation results. Figures 4(a) and (b) show the overlapping regions of the transformed IR polygons over the visible polygons, and Figures 4(c) and (d) show the overlapping regions of the transformed IR over the visible foreground regions. Figures 4(e) and (f) show the results when the persons are not in the same image plane.

REFERENCES
[1] G.-A. Bilodeau, A. Torabi, and F. Morin, "Visible and infrared image registration using trajectories and composite foreground images," Image and Vision Computing, vol. 29, no. 1, pp. 41-50, 2011.
[2] S. Sonn, G.-A. Bilodeau, and P. Galinier, "Fast and accurate registration of visible and infrared videos," in CVPR Workshops, 2013.
[3] A. Torabi and G.-A. Bilodeau, "Local self-similarity-based registration of human ROIs in pairs of stereo thermal-visible videos," Pattern Recognition, vol. 46, no. 2, pp. 578-589, 2013.
[4] B. Zitová and J. Flusser, "Image registration methods: a survey," Image and Vision Computing, vol. 21, pp. 977-1000, 2003.
[5] H. Xishan and C. Zhe, "A wavelet-based multisensor image registration algorithm," in Signal Processing, 2002 6th International Conference on, vol. 1, 2002, pp. 773-776.
[6] M. Elbakary and M. Sundareshan, "Multi-modal image registration using local frequency representation and computer-aided design (CAD) models," Image and Vision Computing, vol. 25, no. 5, pp. 663-670, 2007.
[7] T. Hrkać, Z. Kalafatić, and J. Krapac, "Infrared-visual image registration based on corners and Hausdorff distance," in Image Analysis, ser. Lecture Notes in Computer Science, B. Ersbøll and K. Pedersen, Eds. Springer Berlin Heidelberg, 2007, vol. 4522, pp. 383-392.
[8] D. P. Huttenlocher, G. A. Klanderman, and W. J. Rucklidge, "Comparing images using the Hausdorff distance," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, pp. 850-863, 1993.
[9] E. Coiras, J. Santamaría, and C. Miravet, "Segment-based registration technique for visual-infrared images," Optical Engineering, vol. 39, pp. 282-289, 2000.
[10] J. Zhao and S. Cheung, "Human segmentation by fusing visible-light and thermal imaginary," in Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, 2009, pp. 1185-1192.
[11] X. Bai, L. J. Latecki, and W.-Y. Liu, "Skeleton pruning by contour partitioning with discrete curve evolution," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3, pp. 449-462, 2007.
[12] B. Shoushtarian and H. E. Bez, "A practical adaptive approach for dynamic background subtraction using an invariant colour model and object tracking," Pattern Recognition Letters, vol. 26, no. 1, pp. 5-26, 2005.
[13] M. Piccardi, "Background subtraction techniques: a review," in Systems, Man and Cybernetics, 2004 IEEE International Conference on, vol. 4, Oct 2004, pp. 3099-3104.
[14] G.-A. Bilodeau, P. St-Onge, and R. Garnier, "Silhouette-based features for visible-infrared registration," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, 2011, pp. 68-73.
[15] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381-395, 1981.