A Novel Registration & Colorization Technique for Thermal to Cross Domain Colorized Images

Abstract

Thermal images can be obtained as either grayscale images or pseudo colored images based on the thermal profile of the object being captured. We present a novel registration method that works on images captured via multiple thermal imagers, irrespective of make and internal resolution, as well as a colorization scheme that produces a colorized thermal image which is similar to an optical image while retaining the information of the thermal profile as part of the output, thus providing information from both domains jointly. We call this a cross domain colorized image. We also outline a new public thermal-optical paired database that we present as part of this paper, containing unique data points obtained via multiple thermal imagers. Finally, we compare the results with prior literature, show how our results differ, and discuss some future work that can be explored in this domain.

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Introduction

Thermal imaging has come a long way in the past few years, and a range of thermal imagers is now available in the market, all of which work in the infrared spectrum. Thermal images differ from optical images in that the former capture radiation emitted by an object, while the latter rely on reflected light. This means that a thermal imager can capture images in complete darkness, as long as there is a temperature difference between the object being captured and its surroundings. It needs to be noted, however, that when we talk about infrared images, we specifically mean Thermal Infrared (TIR) images. TIR images differ from Near Infrared (NIR) images in that NIR images incorporate optical band information in an infrared image. NIR images are quite similar to optical domain images and seem to possess sharper edges than TIR images, but for the same reason they would not work in complete darkness. Thus, while TIR images might seem more challenging to work with, we still chose them for the current research problem.

However, barring a few expensive models, all thermal imagers work with a primary thermal sensor and a secondary optical sensor for obtaining joint thermal images. The first problem with images obtained this way is that the 2 images are usually not centred at the same point. The shift is random and depends on the distance of the object being captured from the thermal imager. This is primarily because the thermal sensor and the optical sensor are physically located at different points in a typical thermal imager. Here, we propose a registration technique which obtains the optimal area of an optical image based on a thermal image captured via a thermal imager. However, there is a distinct absence of a diverse thermal-optical paired database, which is essential for training a deep learning network to learn color distributions. To counter this, we also introduce a paired image dataset [32] for our work. The data points (or images) that we use comprise images from a natural setting (greenery), a modern setting (buildings), a crowd setting and historical buildings. Although the number of paired images in this database is not very high (1843 pairs), the point we are trying to make is that we have used 2 different thermal imagers (of 2 different makes, with different resolutions across the internal cameras) and varied subject settings while training our networks. Our technique works under these constraints, showing that it is a robust method even under varying conditions.
This also provides the premise that the colorization algorithm we are proposing will get better with the addition of new data points, irrespective of the imager used. The deep learning algorithm that we train for transforming the thermal image inputs is used to obtain a color mask for a thermal prior. Once we have obtained this mask, we fuse it with the thermal image. This is done because our motivation was to preserve the information obtained in the thermal domain while producing a colorized image, rather than to produce a purely optical colorized image. This is a big challenge because TIR image data differs from optical image data due to the difference in the sensors being used. Moreover, the algorithm might provide different color outputs based on the thermal profile of the same scene, irrespective of the optical ground truth, based on the illumination profile of the thermal data. We followed a fusion technique because we wanted to create color images which are able to preserve the characteristics present in a thermal image for better human perception.

Suranjan Goswami, IEEE Student Member, Satish Kumar Singh, Senior Member, IEEE

This is based on the idea that a human is more likely to notice the features in a colorized image than in a grayscale image. After all, color can provide more information than grayscale [1], especially in domains dealing specifically with night vision problems such as surveillance and assistive driving, where perceptual information enhancement forms an important field. However, as we have pointed out earlier, it is likely that the model will improve further with the addition of new data points, bringing it closer to real world optical images in color, although it will never be able to produce actual real-world optical images since that was not the intention of this work.

While traditional methods for image colorization have included techniques like scribble-based colorization [16], labelling [17], color transfer from images based on similarity matching [18], [19], [20] and feature extraction [21], newer methods based on Convolutional Neural Networks (CNN) are rapidly coming to the forefront. Grayscale to color image conversion [9], [22], [23], [24], [25] is a very important topic in this field. While there has not been much work on thermal images in this respect, Near Infrared images, by virtue of their similarity to grayscale images, have received some attention [26], [27], [14]. While Berg et al. [2] have worked on thermal image colorization, their work focuses on only one domain of images, namely images from vehicle traffic. The authors explain that they focused on images offered in the KAIST-MS traffic scene dataset [3]. This is also the case with the work by X. Kuang et al. [4], who use a Generative Adversarial Network (GAN) to process images from the same database, and with [34], which uses a stacked encoder-decoder for colorization.
Multiple sources have reported that GANs are difficult to train because finding the correct hyperparameters needed to tune the networks is quite difficult [5], [6], [7], [8]. However, this becomes much easier when the images being used are from a single domain, as is the case in the previous works. Since we were focusing on a database where one of the aims was generalization, we did not take the route of using GANs for our work. Another avenue of work that could be considered related is style transfer for images via GAN [28], [29], [30] for data preparation. While this might seem to be a good method for synthetic generation of thermal images, we wanted to work with real world data, which is the reason we created our own database [32] to work on. In addition, we tried to make our database varied enough in terms of data to model a simple generalized network. Additionally, since most of the previously mentioned works focus exclusively on image colorization, none of them treats image registration separately. While the paper by Yan et al. [10] does touch on this aspect, we found that when we took real world images, we could simplify the algorithm they provided. Moreover, their paper did not address image scaling, since they assumed that the image pairs would be of the same resolution. As we explain in Section 2.1, while there are thermal imagers which work on this principle, they are much more expensive than the thermal imagers we have used, and thus much less accessible. Additionally, if there is no scaling problem present in a thermal imager, registration reduces to just a centre-shift problem, which is much less complex than the model we are proposing here.
In brief, our major contribution in this paper comprises a new simplified method for registration of thermal images with their optical counterparts, which is robust enough to work even with images obtained from different thermal imagers with widely varying resolutions of both optical and thermal images. We also propose a new deep learning architecture for creating a color mask so that new information can be produced for a thermal input in the form of color, which can then be fused with the thermal input to create a cross domain colorized image containing information from both the thermal and the optical images. It needs to be noted here that we have introduced a novel method wherein we are not simply colorizing the thermal images with the colors obtained from their optical counterparts. Instead, we produce a new image which has characteristics of both the thermal and the optical images in the luminance layer itself. This is introduced when we separate the mask into its luminance and chrominance sections and then fuse them separately to create a cross domain fused color image. Our method was trained and tested on images from several different classes, showing that it is robust under a wide variety of data as well. Finally, we also present a new unique database comprising thermal/optical paired images from several different classes as well as their raw, unscaled counterparts in [32].

Figure 1: A comparison of the sizes of the thermal image vs the optical image for a scene captured via 2 different thermal imagers. Figure (a) is captured via the Sonel KT400 and figure (b) via the FLIR E40. The images on the left are the thermal images and those on the right are the optical images. The thermal images are also inset on the bottom left of the optical images for a visual comparison. These images were used to obtain the Homography constants for the rescaling factors of the images obtained via the thermal imagers. Size labels in the figure: (a) 384 x 288 and 2592 x 1944; (b) 320 x 640 and 1536 x 2048.

Figure 2: Architecture of the Deep Learning Network we have used in our work. The grey blocks refer to 2D convolution layers. Each of the blue blocks comprises a 2D convolution layer, a Batch Normalization layer, an LReLU layer, a second Convolution 2D layer and a DropOut layer. The orange blocks contain a Convolution 2D, a BatchNorm and a DropOut layer each. The blue lines represent Dense layers while the green lines represent Concatenation layers. Each of the yellow blocks comprises Conv2DTranspose, BatchNorm, DropOut and ReLU layers. The red dotted layer represents an individual DropOut layer.

Proposed Method

Our method is based on 3 separate modules which work serially to produce an output for a thermal image input. It can be formally broken down into 3 phases:

1. Registration
2. Colorization
3. Post-processing

Of these, the first module (Registration) works only in the training phase whereas the next 2 modules work in both the training and testing phases. We cover each of these separately in the following parts.

Registration

The registration module works on thermal-optical image pairs. The basic idea behind it is that every image in the thermal domain has an optical pair. Since the thermal imagers we used can capture a thermal and an optical image of the same scene simultaneously, this is achievable. However, the problems encountered when dealing with a thermal-optical pair are that the thermal and the optical images are of different resolutions and have a random centre shift that depends on the distance of the object being captured, since the TIR and the optical sensors are physically located at different points on the imager itself. The FLIR imager has a 3.1 Megapixel optical sensor which produces images at a resolution of 1536x2048 pixels, and the Sonel imager has a 5 Megapixel optical sensor which produces images at a resolution of 2592x1944 pixels. This means that for the same scene, the images captured by the FLIR imager were in landscape format while those captured by the Sonel imager were in portrait mode, and they were of widely different resolutions. So, we needed to formulate a method that would work uniformly across images from both imagers.

For this, we used the concept of Homography to calculate the scale of each image pair for the different thermal imagers. Homography tells us that in order to find the region that maps each x_i (data in the first domain) to its corresponding X_i (data in the second domain), we simply need to compute the 3×3 homography matrix [11]. Since we are dealing with 2D images, we consider the case where one image can be treated as the perspective projection of the other. This is the standard method used for maps and aerial photography. Based on 4 corresponding points, we find the Homography constants for the images and rescale the images accordingly. It needs to be understood that the rescaling factor is constant for a particular thermal imager. Once we had the factor, we could apply it uniformly across all images captured via that thermal imager. However, while the rescaling factor was the same for all images captured via a particular thermal imager, it differed between imagers. The rescaling factor was 18% for images captured via the Sonel thermal imager and 36.5% for images captured via the FLIR thermal imager, uniformly across both axes. To obtain the 4 corresponding points in 2 images from 2 different domains, we simply took images in which the same pixel locations could be identified directly in both domains for the same scene. This is shown in Fig. 1, where we use LED lighting strips to create basic cases in which 4 corresponding points of interest can be obtained easily across the 2 domains. In that figure, (a) is the image obtained via the Sonel thermal imager, and (b) is the one obtained via the FLIR thermal imager. Both optical images have an inset of the original thermal image at the bottom left to give an idea of the difference in scale between the captured thermal and optical images.
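As an illustration of the step above, the following is a minimal sketch of estimating a homography from 4 point correspondences with a plain direct linear transform in NumPy and reading the per-axis scale off its diagonal. The function name, the marker coordinates and the shift are hypothetical and synthetic; the paper does not state which implementation was used.

```python
import numpy as np

def homography_from_points(src, dst):
    """Estimate the 3x3 homography H with dst ~ H @ src from 4 or more point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # Two linear constraints per correspondence (direct linear transform).
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the null vector of the stacked constraint matrix.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]

# Synthetic marker positions in the optical image (hypothetical values).
optical_pts = [(400.0, 300.0), (2200.0, 300.0), (2200.0, 1700.0), (400.0, 1700.0)]
true_scale, true_shift = 0.18, (-20.0, -15.0)  # assumed ground truth for the demo
thermal_pts = [(true_scale * x + true_shift[0], true_scale * y + true_shift[1])
               for x, y in optical_pts]

H = homography_from_points(optical_pts, thermal_pts)
scale_x, scale_y = H[0, 0], H[1, 1]  # per-axis rescaling factors
print(round(scale_x, 4), round(scale_y, 4))
```

For a pure scale-and-shift relation such as the synthetic one used here, the off-diagonal and perspective terms of H are close to zero, so the diagonal entries can be used directly as the rescaling factor.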
Once the rescaling problem was tackled and we had obtained the factors for the images from the 2 thermal imagers, we focused on the actual problem of registration. While many matching measures are used as registration scores for similar images, we found that most of them were ineffective for our use. Measures like the Structural Similarity Index Measure (SSIM) work by assessing how similar 2 images are, which does not work for thermal-optical pairs since features present in the optical domain (like painted patterns) might not be present in the thermal domain. Matching techniques based on illumination, like Histogram Matching, have a different problem. In these cases, the thermal and the optical images behave differently because of the nature of the sensors. While optical images rely on reflected light, thermal sensors capture emitted radiation instead. This results in completely different illumination profiles for the same scene, which renders illumination matching techniques useless. However, we reasoned that the internal entropy of an image should be the same regardless of the sensor used to capture it, since it represents the amount of disorder in an image. More intuitively, it represents the amount of information content in an image. Thus, we opted to use Mutual Information (MI) as the matching score for registration in our method. There has been previous work on using MI for thermal-grayscale registration [13]. However, it showed that the illumination would need to be corrected in both images for this to work. Our assumption, however, was that this would not yield any extra advantage for the thermal image, as MI does not work with the absolute values of the illumination but rather with its distribution.
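The point that MI depends on the joint distribution of intensities rather than on their absolute values can be illustrated with a small histogram-based estimate. This is a sketch under assumptions: the function name, the choice of 32 bins and the random test data are illustrative and are not taken from the paper.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Histogram-based mutual information (in nats) of two equally sized arrays."""
    joint, _, _ = np.histogram2d(np.ravel(a), np.ravel(b), bins=bins)
    p_ab = joint / joint.sum()                 # joint distribution
    p_a = p_ab.sum(axis=1, keepdims=True)      # marginal of a, shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)      # marginal of b, shape (1, bins)
    nz = p_ab > 0                              # skip empty cells to avoid log(0)
    return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])))

rng = np.random.default_rng(0)
scene = rng.integers(0, 256, size=(64, 64)).astype(float)
inverted = 255.0 - scene                       # same structure, opposite brightness
unrelated = rng.integers(0, 256, size=(64, 64)).astype(float)

mi_inverted = mutual_information(scene, inverted)
mi_unrelated = mutual_information(scene, unrelated)
print(mi_inverted > mi_unrelated)
```

Inverting the brightness leaves the score high because the two images remain statistically dependent, whereas an unrelated image scores low even though its intensity range is identical.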
Of course, this relies on the images being captured during the daytime, when the distribution can be clearly observed in both domains simultaneously. If the images are captured during night time, the thermal counterpart of the images may be visible, but the optical sensor would not register any data because there is no reflected light. We outline the algorithm (Algorithm 1) proposed for our Registration method below. In it, N stands for the total size of the flattened rescaled optical grayscale image and K stands for the size of the thermal image. This is 384x288 = 110,592 for images captured via the Sonel imager and 240x320 = 76,800 for images captured via the FLIR imager. Once the region has been calculated, the corresponding 2D index has to be calculated for the point on the resized optical image and the region has to be cut out, representing the registered image, followed by reshaping from the 1D array to the original 2D shape.

Algorithm 1

Input: Thermal-Optical image pair

Output: Cropped region of registered Optical image

1. for each image pair in database:
2.     flatten the thermal image as thermal
3.     flatten the grayscale image as grayscale
4.     temp = 0
5.     for i in 0 to N-K:
6.         if (temp < MI(thermal, grayscale[i to i+K])):
7.             x = i
8.             temp = MI(thermal, grayscale[i to i+K])
9.     save image corresponding to x from grayscale image in 2D format

In the above algorithm, temp represents the score that is used in each iteration to decide the patch which needs to be registered, and MI represents the Mutual Information function, which calculates the mutual information given 2 distributions. In mathematical terms, the registration process becomes an optimization problem based on Shannon's entropy, which can be expressed as below. Given a matrix of the thermal image A with shape (a, b) and an optical image matrix X with shape (x, y), we can define a patch $p_i$ where

$$p_i = (x_i : x_i + a,\; y_i : y_i + b) \tag{1}$$

as the (a, b) patch with the highest Mutual Information MI in the optical image, with $(x_i, y_i)$ being the starting index of the patch inside the image, where

$$0 \le x_i \le x - a, \quad 0 \le y_i \le y - b \tag{2}$$

since the values of $x_i$ and $y_i$ have to be bounded between 0 and the size of the image being considered for registration minus the size of the patch. We calculate the region with the highest Mutual Information in the optical image X based on the Mutual Information (MI) formula following Algorithm 1, where MI is defined as:

$$MI = H(A) + H(X) - H(A, X) = \sum_{a \in A} \sum_{x \in X} p_{AX}(a, x) \log \frac{p_{AX}(a, x)}{p_A(a)\, p_X(x)} \tag{3}$$

where H(X) represents the entropy of system X. However, the calculation of MI is computationally very expensive. As such, we simplified the method by reducing the number of checks that need to be performed as part of the search problem. Instead of searching the full optical image for the optimal region corresponding to the thermal image, we restrict the range of the check to half of the difference in size between the 2 images. Thus, the range reduces from (0 : x - a, 0 : y - b) to ((x-a)/2 - (x-a)/4 : (x-a)/2 + (x-a)/4, (y-b)/2 - (y-b)/4 : (y-b)/2 + (y-b)/4). We make an assumption here that the maximum shift in pixels would not be more than half the total difference in sizes between the 2 images. This is also shown to be valid experimentally for both imagers when we take an image at a very close range and check the parallax error present between the thermal and the resized optical images. We show this as Fig. 1 in the Supplementary materials section. However, while this reduces the time complexity by about 50%, we found that a lot of redundant checks were still being performed, because after the maximal region is found, the loop continues till the end of the range. Thus, we introduced a secondary check wherein the checking process only continues until 3 full rows of the checking range have passed without any update to the maximal region.

Figure 3: Comparative Analysis of Results. The first row represents the raw thermal images, the second row the outputs/masks obtained via the algorithm by Iizuka et al. [9], the third row the output after fusing the mask with the thermal inputs and the fourth row the output obtained from [34]. The fifth row represents the RGB output mask obtained via the proposed network while the 6th row represents the output obtained after fusion. The last row shows the ground truth obtained via the optical sensor. Row labels: Thermal Input, Baseline Mask, Baseline Output from [9], Output from [34], Proposed Network RGB Mask, Proposed Network Output, Ground Truth Optical Image. Column labels: (a) to (h).
We base this on the assumption that the search for the maximal MI region progresses gradually, wherein the optimal region is found by closing the gap in the maximal region. Thus, if there is a break in the check region, it means that the maximal region has already been reached, and further comparisons need not be performed. This brings a major change in the time complexity, reducing the average time by 91%. We name this Algorithm 2. However, we deduced that this could be reduced further, so we made a few more modifications to the process.
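The restricted search range and the 3-row stopping rule can be sketched as follows. This is an illustrative 2D variant that follows Eqs. 1 and 2 rather than the flattened indexing of Algorithm 1, and it is not the authors' code. The names `register` and `patience_rows`, the 32-bin MI estimate and the synthetic images are assumptions made for the example.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Histogram-based mutual information (in nats) of two equally sized arrays."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    nz = p_ab > 0
    return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])))

def register(thermal, optical_gray, patience_rows=3):
    """Return ((row, col), score) of the best-matching patch in the central search range.

    patience_rows=None disables the early stop and scans the whole restricted range.
    """
    a, b = thermal.shape
    x, y = optical_gray.shape
    # Central half of the admissible offsets: (x-a)/2 -/+ (x-a)/4, same for columns.
    r0, r1 = (x - a) // 4, 3 * (x - a) // 4
    c0, c1 = (y - b) // 4, 3 * (y - b) // 4
    best_score, best_pos, stale_rows = -1.0, (r0, c0), 0
    for r in range(r0, r1 + 1):
        improved = False
        for c in range(c0, c1 + 1):
            score = mutual_information(thermal, optical_gray[r:r + a, c:c + b])
            if score > best_score:
                best_score, best_pos, improved = score, (r, c), True
        stale_rows = 0 if improved else stale_rows + 1
        if patience_rows is not None and stale_rows >= patience_rows:
            break  # no update for `patience_rows` full rows: stop searching
    return best_pos, best_score

rng = np.random.default_rng(1)
optical = rng.integers(0, 256, size=(60, 80)).astype(float)
thermal = 255.0 - optical[12:42, 20:60]   # synthetic "thermal" patch, inverted brightness

full = register(thermal, optical, patience_rows=None)
fast = register(thermal, optical, patience_rows=3)
print(full[0], fast[0])
```

The early stop only helps when the score rises gradually towards the optimum, as assumed in the text. On noise-like data such as this synthetic example, the 3-row rule may stop before the true offset is reached, which is why the sketch also exposes a full scan for comparison.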

Now we take a region that is 30 pixels away from the edges of the thermal image in each direction. This is because we can see that the thermal band is present in all thermal images towards the rightmost region. Moreover, while capturing the images we prioritize keeping the regions with maximal information at the centre of the thermal image. Thus, reducing the range by trimming the sides of the thermal image theoretically does not result in a significant loss of the information contained in the region to be compared, and provides the same output as the original method. Finally, we cut out the registered optical image from the rescaled RGB matrix as:

$$Image_i = (x_i - 30 : (x_i - 30) + a,\; y_i - 30 : (y_i - 30) + b) \tag{4}$$

This process brings down the registration time complexity further, by 32% from the previous iteration. We call this Algorithm 3, and it is the final registration algorithm we propose in this work. We include a bar graph showing the time complexity comparison in Fig. 2 in the Supplementary section. It needs to be understood, however, that our method does not work for all pairs of thermal-optical images. This is covered in more detail in Section 3.4.
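The trimming step and the crop of Eq. 4 can be written as two short array operations. The sketch below assumes that $(x_i, y_i)$ is the best offset found for the trimmed thermal template. The helper names, the array sizes and the border check are illustrative additions, not part of the paper.

```python
import numpy as np

MARGIN = 30  # pixels trimmed from every side of the thermal image, as in the text

def trim_template(thermal, margin=MARGIN):
    """Drop a border of `margin` pixels so that only the central region is matched."""
    return thermal[margin:-margin, margin:-margin]

def crop_registered(optical_rgb, x_i, y_i, thermal_shape, margin=MARGIN):
    """Cut the full (a, b) patch out of the rescaled optical image, following Eq. 4."""
    a, b = thermal_shape
    top, left = x_i - margin, y_i - margin
    if top < 0 or left < 0:
        # Assumed guard: the paper does not say how matches near the border are handled.
        raise ValueError("match too close to the border for the chosen margin")
    return optical_rgb[top:top + a, left:left + b]

# Hypothetical sizes: a 384x288 thermal image and an optical image rescaled to 18%.
thermal = np.zeros((288, 384))
optical_rgb = np.zeros((350, 467, 3))

template = trim_template(thermal)
patch = crop_registered(optical_rgb, 50, 60, thermal.shape)
print(template.shape, patch.shape)
```

The cropped patch has the same height and width as the untrimmed thermal image, so the registered pair can be used directly for training.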

Colorization

While most deep learning based colorization techniques use the Luminance-Chrominance image model, our idea was different. Although earlier papers [14], [15], [9], [2] show that the LAB model is easier to work with for colorization, they were all based on the idea that only the chrominance values need to be calculated while the luminance values can be transferred directly from the grayscale image. While this works for the colorization of optical domain grayscale images, it fails for thermal images, because the optical grayscale images and the thermal grayscale images differ significantly from each other. Thus, instead of the LAB domain, we chose to work directly in the RGB domain. This not only makes the network output focus more on the chrominance channels, it also lets the final output focus more on the thermal image for the luminance channel. We finally separate the output into luminance and chrominance channels, fuse the luminance channel with the thermal input and use the chrominance channel as is, ensuring that the final image has information from the input thermal domain as well as color information obtained via the trained model. This is covered in more depth in Section 2.3.

Our method for colorization is a deep learning network. The basic framework for the network is based on the work described in [9], which works on the principle of creating global features and fusing them with mid-level priors to obtain joint features for chrominance values. We, however, do not use mid-level features at all, instead opting for a pixel level fusion with the intermediate output. Moreover, the network described in [9] only calculates the chrominance values from a grayscale prior. We modify it significantly to produce a mask directly in the RGB domain, which is then fed into the post processing step for creating the final colored image. Several changes have been made to the original architecture described in [9], such as using a Conv2DTranspose layer instead of an upsampling layer (so that the gradient information is better preserved during back propagation of loss) and using dropout and Batch Normalization so that the loss does not stagnate. More details on this can be found in Tables 1-3. We also add a post processing step which ensures that the final output contains the data from both the thermal and the optical domain concurrently in the same output image. This is detailed in Section 2.3. We divide the network into 3 subsections, namely the Low Level Features Layer, the Fusion Layer and the Mask Layer, in order to better describe it (Tables 1-3).

Table 1: Low Level Features Layer (Input: 200x200x1)

Layer Name | Details | Output
Block1 (D: 1) | Conv2D {W: (3,3), S: (1,1)} | 200x200
Block2 (D: 32) | Conv2D {W: (3,3), S: (2,2)}, BatchNorm, LReLU, Conv2D {W: (3,3), S: (1,1)}, DropOut | 100x100
Block3 (D: 64) | Conv2D {W: (3,3), S: (2,2)}, BatchNorm, LReLU, Conv2D {W: (3,3), S: (1,1)}, DropOut | 50x50
Block4 (D: 128) | Conv2D {W: (3,3), S: (2,2)}, BatchNorm, LReLU, Conv2D {W: (3,3), S: (1,1)}, DropOut | 25x25

Table 2: Fusion Layer (Input: 25x25x128)

Layer Name | Details | Output
Block5 (D: 64) | Conv2D {W: (3,3), S: (1,1)}, DropOut | 25x25
Block6 (D: 64) | Conv2D {W: (3,3), S: (2,2)}, BatchNorm, DropOut | 13x13
Block7 (D: 64) | Conv2D {W: (3,3), S: (2,2)}, BatchNorm, DropOut | 7x7
Block8 (D: 64) | Conv2D {W: (3,3), S: (2,2)}, BatchNorm, DropOut | 4x4
Block9 (D: 64) | Conv2D {W: (1,1), S: (2,2)} | 4x4
Block10 | Flatten | 64*4*4
Block11 Fully Connected | Dense {N: 512}, Dense {N: 256}, Dense {N: 200} | 200
Block12 | Concatenate {N: 25, axis = 1}, Concatenate {N: 25, axis = 1} | 200*25*25
Block13 (D: 200) | Reshape | 25x25
Block14 (D: 328) | Concatenate {(Block4, Block13), axis: 3} | 25x25

Table 3: Mask Layer (Input: 25x25x328)

Layer Name | Details | Output
Block15 Fusion Layer (D: 200) | Conv2D {W: (1,1)} | 25x25
Block16 (D: 128) | Conv2DTranspose {W: (3,3), S: (2,2)}, BatchNorm, DropOut, ReLU | 50x50
Block17 (D: 64) | Conv2DTranspose {W: (3,3), S: (2,2)}, BatchNorm, DropOut, ReLU | 100x100
Block18 (D: 32) | Conv2DTranspose {W: (3,3), S: (2,2)}, BatchNorm, DropOut, ReLU | 200x200
Block19 (D: 16) | Conv2D {W: (3,3), S: (1,1)}, DropOut | 200x200
Block20 (D: 3) | Conv2D {W: (3,3), S: (1,1)} | 200x200

In Tables 1-3, D represents the depth (or the number of layers) of the network at the corresponding level, W stands for the shape of the sliding window and S is the stride. N stands for the number of outputs in the case of the Dense layers and represents the number of times the previous layer is repeated in the case of a Concatenation layer. Conv2D represents a 2D Convolution layer, BatchNorm is the Batch Normalization layer, LReLU is a Leaky ReLU layer, DropOut is a Drop-out layer, Flatten stands for a flattening layer, Dense stands for a fully connected layer, Reshape represents a reshaping layer and Concatenate stands for a Concatenation layer along a specific axis. All DropOut layers have a parameter of 0.5.

The deep learning network that we use for the translation of the thermal image into an RGB mask is described in this section and shown in Fig. 2. It is an end to end colorization technique which produces a unique output for a given grayscale thermal input. We use elu as the activation function in all of the layers except the fusion layer and the last layer, which is the output layer. The last layer uses a sigmoid activation since the output lies in the range [0, 1]: we normalize the RGB values of all 3 channels of the output images by dividing them by 255, which is the highest value possible for any pixel. The fusion layer also has a sigmoid ($\sigma$) activation function. The fused feature [9] is described as:

$$y^{fusion}_{u,v} = \sigma\left(b + W \begin{bmatrix} y^{low} \\ y^{fused} \end{bmatrix}\right) \tag{5}$$

In Eq. 5, $y^{fusion}_{u,v} \in \mathbb{R}$ is the fused feature at (u, v). The element $y^{low} \in \mathbb{R}$ is a low level feature lying midway through the Low Level layer and $y^{fused} \in \mathbb{R}$ is the feature at the fusion layer. W is a 200 by 328 weight matrix while $b \in \mathbb{R}$ is a bias. Here, $\mathbb{R}$ represents the space of 200 parameters that are used in the feature space. These act as trainable parameters of the network, produced via training the layers and optimizing the loss gradient.
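Eq. 5 can be read as a per-pixel affine map followed by a sigmoid. The NumPy sketch below assumes, following Table 2, that the low level input is the 128-channel 25x25 output of Block4 and that the fusion-layer feature is the 200-element output of Block11 repeated at every pixel (Blocks 12-14), giving 328 channels. The random weights and the function names are placeholders for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse(y_low, y_fused, W, b):
    """Apply Eq. 5 at every pixel.

    y_low:   (rows, cols, 128) low level feature map
    y_fused: (200,) feature vector from the fully connected layers
    W:       (200, 328) weight matrix, b: (200,) bias
    """
    rows, cols, _ = y_low.shape
    # Repeat the global vector at every spatial position, then stack the channels.
    tiled = np.broadcast_to(y_fused, (rows, cols, y_fused.shape[0]))
    stacked = np.concatenate([y_low, tiled], axis=-1)    # (rows, cols, 328)
    return sigmoid(stacked @ W.T + b)                     # (rows, cols, 200)

rng = np.random.default_rng(0)
y_low = rng.standard_normal((25, 25, 128))
y_fused = rng.standard_normal(200)
W = 0.01 * rng.standard_normal((200, 328))
b = np.zeros(200)

y_fusion = fuse(y_low, y_fused, W, b)
print(y_fusion.shape)
```

Written this way, the fusion is equivalent to a 1x1 convolution over the 328-channel concatenation, which matches the Conv2D {W: (1,1)} entry of Block15 in Table 3.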
After the bias has been incorporated into the model, we finally start upscaling the layers via 2D transposed convolution in the mask layer while, at the same time, decreasing the depth of the output so that we obtain the final RGB mask, which needs to be fused back with the Infrared thermal input image to obtain the colorized fused image as the output. The input to the network is a 200x200 image, irrespective of what the initial thermal image size may be, so that we have a parity of inputs in the model. Since it is a grayscale image, it has only 1 layer while the output is a 200x200x3 RGB image. As can be seen from the architecture, we increase the depth of the layer in steps. Each block deals with a fixed output depth, which first increases to 128 and is then kept constant at 64, while the spatial size is halved at each level. For halving the shape of the input, we use convolution layers directly. Once the required shape of 25x25 is reached, we use the output of that layer to calculate pixel level priors and fuse them directly into the intermediate 25x25 mask to create a fused layer, which we then up-sample repeatedly to the original shape of 200x200 with a depth of 3. It is to be noted here that unlike the original paper by Iizuka et al., we do not use the upsample layer for our work, opting instead for a convolution transpose layer. The benefit of this is that the transpose layer retains the weights, thus providing a link between the original input across the layers and the output, which is double the size of the input along the 2 axes. We use logcosh as the loss function with ADAMAX as the optimizer for our network. A detailed explanation of why we use logcosh is given in Section 3.
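The spatial sizes listed in Tables 1-3 can be checked with a few lines of arithmetic. The sketch assumes 'same' padding, which the paper does not state explicitly. Under that assumption a stride-2 convolution maps a side length n to ceil(n / 2) and a stride-2 transposed convolution maps n to 2n.

```python
import math

def conv_out(size, stride):
    """Side length after a strided convolution, assuming 'same' padding."""
    return math.ceil(size / stride)

def deconv_out(size, stride):
    """Side length after a strided transposed convolution, assuming 'same' padding."""
    return size * stride

encoder = [200]
for _ in range(3):                 # Blocks 2-4 (Table 1)
    encoder.append(conv_out(encoder[-1], 2))

global_branch = [encoder[-1]]
for _ in range(3):                 # Blocks 6-8 (Table 2)
    global_branch.append(conv_out(global_branch[-1], 2))

decoder = [encoder[-1]]
for _ in range(3):                 # Blocks 16-18 (Table 3)
    decoder.append(deconv_out(decoder[-1], 2))

print(encoder)        # [200, 100, 50, 25]
print(global_branch)  # [25, 13, 7, 4]
print(decoder)        # [25, 50, 100, 200]
```

The odd sizes 13 and 7 in Table 2 follow from the rounding up, which is why the global branch ends at 4x4 rather than at a fractional size.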
Once the network is trained, we can input a thermal grayscale image and obtain an RGB mask for it. The mask is combined with the thermal input image in the post processing layer to obtain the final colorized output.

Post Processing

Once the mask is obtained for an input, we need to create the final output in the form of a colorized image from it. This is the post processing step, which can be further divided into 3 sub-steps:

1. Conversion to LAB domain
2. Fusion of Thermal Image
3. Conversion back to RGB domain

Our theory was that the mask, obtained in the RGB domain, contains mostly the chrominance information for an input thermal grayscale image. Thus, while creating the final colorized image, we gave a higher priority to the thermal image input. For the first step, we take the RGB mask and convert it into the LAB domain. This is because, at this stage, we wish to incorporate the information from the thermal image as the luminance channel of the output in order to create the final cross domain colorized image. For this to be successful, it is important to separate the chrominance from the luminance value of the output mask. However, we did not want to discard the luminance information contained in the mask completely either. This is where the second stage comes in. Once we obtain the luminance information of the output mask, we fuse it with the thermal image according to the equation given below:

L = (l/2 + thermal) / 2 (6)
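A minimal sketch of the luminance fusion in Eq. 6 follows, where l is the luminance channel of the mask and thermal is the thermal image. It assumes that both arrays have already been brought to the same scale (here the 0 to 100 range of the LAB L channel); the scaling and the function name `fuse_luminance` are assumptions for illustration. The RGB to LAB conversion and its inverse are not shown and would typically come from an image library, for example `rgb2lab` and `lab2rgb` in scikit-image.

```python
import numpy as np

def fuse_luminance(mask_l, thermal):
    """Eq. 6: L = (l / 2 + thermal) / 2, applied element-wise."""
    mask_l = np.asarray(mask_l, dtype=np.float64)
    thermal = np.asarray(thermal, dtype=np.float64)
    return (mask_l / 2.0 + thermal) / 2.0

# Toy 2x2 example; both inputs are ASSUMED to share the 0..100 scale.
mask_l = np.array([[100.0, 0.0], [100.0, 40.0]])
thermal = np.array([[100.0, 100.0], [0.0, 80.0]])
fused = fuse_luminance(mask_l, thermal)
print(fused)
```

Expanding the expression gives L = l/4 + thermal/2, so the thermal image contributes twice as much as the mask luminance.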

Figure 4: Output on the KAIST Multispectral Pedestrian dataset (columns, left to right: Thermal Image, Registered Image, Output)

In Eq. 6, L represents the final luminance channel of the output image, l stands for the mask luminance channel and thermal represents the thermal image. This equation gives more weight to the thermal image while creating the final luminance channel, thus incorporating the thermal image information into the final colorized output. Once the luminance channel is ready, we fuse it with the chrominance channels and convert the result back from the LAB domain into the RGB domain.

Experimental Results and Discussion

We created a database containing 1843 registered thermal-color pairs. We obtain 10 random crops from each pair, creating a total of 18,430 pairs of images. The FLIR thermal imager, however, has a maximum resolution of 240x320 for the thermal images, which is why we opted to create 200x200 images. Out of these 18,430 images, we use 18,411 for training and 19 for validation. Once the model for mask creation is trained, we go back to the original 1843 data points, obtain 500 images of 200x200 resolution from 500 randomly selected images and test the network on them. We provide a few of the results in Fig. 3.

We compare the output against a baseline for grayscale to color image conversion [9], using the authors' trained model. We wanted to retrain our model on this baseline, but the authors of that work did not respond to our emails. As such, we used their publicly available code for our work. We also wanted to compare our results against the TIR2Lab results presented in [2], but the code that the authors released publicly did not work properly on our end. We attributed this to the fact that they had stated on their page that it was still in the development stage and might not work properly. We therefore sent a few data points directly to the authors so that they could test them on their end and mail us the results. However, the authors said that they were unable to provide the output, citing other engagements. Instead, we show our results against the method detailed in [34], which reports better results for the same database as outlined in [2].

While the work by Iizuka et al. [9] concludes that changing the domain of the images from RGB to LAB works better for colorization of grayscale images, it needs to be understood that in the current work we are changing the domain of the input images from grayscale to a fused one.
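The crop generation described above can be sketched as follows. The key point is that the same random window is applied to the thermal image and to its registered optical counterpart, so that each crop pair stays aligned. The function name `random_paired_crops`, the use of a seeded NumPy generator and the toy arrays are assumptions for this example, not the code used to build the database.

```python
import numpy as np

def random_paired_crops(thermal, optical, n_crops=10, size=200, seed=None):
    """Cut n_crops aligned size x size windows from a registered pair."""
    rng = np.random.default_rng(seed)
    height, width = thermal.shape[:2]
    pairs = []
    for _ in range(n_crops):
        top = int(rng.integers(0, height - size + 1))
        left = int(rng.integers(0, width - size + 1))
        window = (slice(top, top + size), slice(left, left + size))
        pairs.append((thermal[window], optical[window]))
    return pairs

# Toy registered pair at the 240x320 FLIR thermal resolution.
thermal = np.arange(240 * 320, dtype=np.float64).reshape(240, 320)
optical = np.stack([thermal, thermal, thermal], axis=-1)  # stand-in RGB image
pairs = random_paired_crops(thermal, optical, seed=0)
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)
```

With 10 crops per pair, 1843 registered pairs yield the 18,430 training and validation pairs mentioned in the text.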
Since we are working with a mask, we decided to use the RGB domain directly, because the concatenation at the post processing level, explained in Section 2.3, happens in the LAB domain itself. We choose logcosh as the loss function in our deep learning method because the operational values become very small after normalization and lie in (0,1). This simplifies the calculation, since for such small values log(cosh(x)) is approximately equal to the value shown in Eq. 7:

$\log(\cosh(x)) \approx x^2 / 2$ (7)

Further, like Mean Squared Error (MSE), log-cosh does not discard outlier data as corrupt data. This means that it is somewhat resistant to the incorrect predictions that may occasionally occur.

However, the work outlined in [34] showed that by introducing skip connections in a stacked encoder-decoder model, they were able to outperform the original TIR2Lab method by a very high margin. That paper directly refers to [2] and uses a similar method while showing a better result. Hence we mailed the authors for access to the original code, but they did not reply to our mails. As such, we implemented their method independently for our paper and present the results for comparison.

Figure 5: Images where the registration algorithm is unable to provide a perfect match of thermal-optical pairs, for images captured via the Sonel thermal imager, (a) - (d), and for those captured by the FLIR thermal imager, (e) - (h)
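The behaviour of the log-cosh loss discussed in this section can be checked numerically. The sketch below relies on the standard Taylor expansion log(cosh(x)) = x^2/2 - x^4/12 + ..., which gives the quadratic form for small errors, and on the fact that for large |x| the function approaches |x| - log 2. The sample error values are arbitrary and chosen only for illustration.

```python
import math

def logcosh(x):
    """log(cosh(x)) for a single prediction error x."""
    return math.log(math.cosh(x))

# Small errors, as expected when targets are normalized to [0, 1].
for err in (0.01, 0.1, 0.5):
    print(err, logcosh(err), err ** 2 / 2)

# A large error grows only linearly, unlike a squared error.
big = 10.0
print(big, logcosh(big), abs(big) - math.log(2))
```

The quadratic approximation is very close for errors around 0.1 and degrades gradually as the error approaches 1, while large errors are penalized roughly linearly.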

Method                      Measure    L1    RMSE    PSNR    SSIM
Baseline [9]                Average
                            Low
                            High
Method presented in [34]    Average
                            Low
                            High
Our Method                  Average
                            Low
                            High

Table 4: Quantitative analysis of the results obtained by the different methods

The comparison with the method of [34] is shown in Fig. 3 and Table 4. Since ours is an independent reproduction of the code, we retrained the method on our database from the beginning and then obtained the output. The changes that we made as a result included using different paddings across the layers to maintain parity between input and output. We also used our input of 200x200 images instead of upscaling it to 256x320, because upscaling images before feeding them into the network for training would result in worse output images, as upscaling essentially amounts to creating new information, and the experiment is expected to be recreated at the original database image size. Lastly, we take the values of α and β as 0.3 and 0.7 for the combined error described in their paper, since the values were not mentioned in their work. Here 0.3 is the weight used for the luminance layer loss and 0.7 the weight used for the chrominance layer loss. We reasoned that, since the work deals with the colorization of images, the chrominance layer should, theoretically, be assigned more weight than the luminance layer. Keeping parity with our own method and the baseline, we create the mask and join it with the input to create the final output. Although the method is supposed to be end-to-end, without this step it cannot be compared to our method. Also, the output from the method itself looks like a mask. We also found interesting results, with the colorized images showing localized colors for regions with high illumination, when we trained the network using categorical_crossentropy as the loss function instead of logcosh.
Since categorical cross entropy loss rewards or penalizes only the prediction of the correct class, we find that the color mask for the thermal images, when trained using the categorical_crossentropy loss, has a completely different color profile, which can be explored further in future work. The results for this are shown in Fig. 3 in the Supplementary section.

We would like to state here that while we found one other large dataset of thermal images, from the Military Sensing Information Analysis Center (SENSIAC) [31], the data that they provide is military specific, obtained from soldiers, military vehicles and civilian vehicles. Our work, however, is geared towards using data obtained wholly from civilian scenarios, so that it may be applied directly to data obtained in daily life. Military data is also often obtained via very high-grade, state of the art cameras, which are not feasible for general use. Moreover, the dataset is not publicly available and requires a fee for its usage. This dataset was used in [33] for a new method of thermal image colorization, but the code for it is not available in the public domain, and we received no reply from the authors despite multiple emails of inquiry. Hence we could not compare that result in our work.

Quantitative Analysis

The quantitative analysis of the results is shown in Table 4. It needs to be understood that these results are not directly comparable, as our method was not meant to yield results in the actual optical domain, since we are creating a cross-domain output. However, while we understand that our method will not yield accurate results when compared with other methods that are trained to produce outputs closer to the optical domain, we had no other quantitative baseline to compare our results with. Thus, we compared our results with the baseline method by Iizuka et al. [9], treating the output from that method as a mask, similar to our proposed method. We also compare against a more recent work, [34], which deals with end-to-end thermal image colorization and reports much better results than TIR2Lab, the original paper which presented the idea in [2]. We use 4 different measures, namely L1, RMSE, PSNR and SSIM, for the quantitative analysis. In all of these measures, our method shows a clear improvement over the baseline work from [9] as well as over the results obtained from [34].
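For reference, three of the four measures can be written compactly with NumPy, as sketched below. The exact definitions used for Table 4 are not given in the text, so this sketch assumes images normalized to [0, 1], L1 taken as the mean absolute error, and PSNR computed against a peak value of 1.0. SSIM is omitted because it is considerably longer; a library routine such as `structural_similarity` in scikit-image would normally be used for it.

```python
import numpy as np

def l1_error(pred, target):
    """Mean absolute error between two images."""
    return float(np.mean(np.abs(pred - target)))

def rmse(pred, target):
    """Root mean squared error between two images."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Toy example: a uniform error of 0.5 on a 4x4 RGB image in [0, 1].
target = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.5)
print(l1_error(pred, target), rmse(pred, target), psnr(pred, target))
```

Lower L1 and RMSE values and higher PSNR values indicate outputs that are closer to the reference image.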

Qualitative Analysis and Limitations

As can be seen in the results shown in Fig. 3, while our method correctly segregates the color of the sky in most cases, in result (d) it mistakes the background cloud patterns for something else and imparts a reddish color to them. This also seems to be the case with (a), where the whole scene is reddish in color. While the buildings are clearly segregated by shades of the same color, the ground and the trees seem to be unnaturally colored as well. However, our database contains a large number of reddish hued buildings. This is primarily because one of the areas where the images were obtained was our college campus, where the main color scheme is reddish. Moreover, a lot of the historical buildings in our database also have a reddish hue, which may be why the network behaves in this way. In contrast, images which are mainly dominated by greenery, like (g), result in greenish hues throughout the image. In fact, this trend of assigning a similar color to similar connected areas is highly noticeable in the RGB mask obtained from the trained deep network, which for images (b), (d), (f), (h) and even in some places of (c) shows the blue characteristic of the sky. However, for crowd data, like (h), the network is not able to reproduce colors close to the real data, as shown by the unnatural reddish hue of the crowd. This is primarily because the crowd data is quite complex in comparison to the rest of the data types we have included in our database. Not only does the crowd in our database have bright colors, the background is also vividly different in every image, since we obtained this data from a festival ground, where brightly colored apparel was the norm.
As for the baseline, we find that almost all images produced via the colorization network described in [9] seem to become lighter in illumination. In some cases, like (b), (c) or (d), the whole image appears lighter. We treat these images as masks in order to keep parity between our final images and the images we get via these algorithms, especially since this method was not meant to colorize thermal images in the first place. Thus, as can be seen in the baseline outputs, while the fusion does impart some extra information to the images, the highly illuminated masks make the output images lighter in shade as well. The output from [34], which is an end-to-end method for colorization of thermal images, is similar to what is obtained in the baseline output. However, the output from [34] for (e) does add clarity to the input, although there seems to be no color present in the output images. We attribute this to the fact that the method described in [34] was developed for only a single domain of images, containing just roads and cars. Our database, on the other hand, contains images which vary widely in nature and are much more complex to model with just a stacked encoder-decoder model, even with skip connections.

We also check our method on the KAIST Multispectral Pedestrian (KAIST MS) dataset, which has a very low contrast difference throughout the images because they were obtained as consecutive frames from a thermal video, resulting in the same temperature range throughout. The images also needed to be resized for our use. Thus, we took 512x512 centre cropped images and applied Histogram Equalization (HE) to them. This was predicated on the assumption that, since the issue is the variation in temperature range, HE would bring different contrast levels to the images, thus providing a better range for the colorization method to work on while not affecting the actual temperature range in the grayscale thermal images. We show the results for this in Fig. 4.
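The preprocessing applied to the KAIST frames can be sketched as a centre crop followed by global histogram equalization. The implementation below is a plain NumPy version of the textbook method for 8-bit images and is not necessarily the routine used for the experiments; the 512x640 frame size and the synthetic low-contrast frame are assumptions for the example.

```python
import numpy as np

def center_crop(image, size=512):
    """Cut a size x size window from the centre of the image."""
    height, width = image.shape[:2]
    top = (height - size) // 2
    left = (width - size) // 2
    return image[top:top + size, left:left + size]

def equalize_hist(image):
    """Global histogram equalization for a uint8 grayscale image."""
    hist = np.bincount(image.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()       # CDF value at the darkest used level
    spread = cdf[-1] - cdf_min
    if spread == 0:                    # constant image: nothing to stretch
        return image.copy()
    lut = np.clip(np.round((cdf - cdf_min) * 255.0 / spread), 0, 255)
    return lut.astype(np.uint8)[image]

# Synthetic low-contrast frame (ASSUMED 512x640, values squeezed into 100..119).
rng = np.random.default_rng(0)
frame = rng.integers(100, 120, size=(512, 640), dtype=np.uint8)
cropped = center_crop(frame)
equalized = equalize_hist(cropped)
print(cropped.min(), cropped.max(), equalized.min(), equalized.max())
```

The mapping is monotonic, so the ordering of intensities in the thermal frame is preserved while the available contrast range is stretched to 0..255.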

Discussion

While our method seems to work with most data, it fails in a few cases. The first is failure in the registration module. We see that for data where the subject is moving, the thermal and the optical sensors are unable to capture the same image. This can be seen in Fig. 5 (b) and (d). This occurs because there is a lag between the operation times of the thermal and the optical sensors, as the 2 work serially. Consequently, the data captured is different in the 2 cases. While this does not affect the registration directly in some cases (where the moving subject is very small in comparison to the background), we do not take these cases into consideration while preparing the training set. This is because our work deals with colorization, and if the difference due to motion between the images of the input and the output domain becomes too large, the deep network cannot learn the parameters needed to create the color mask from the input.

The second case where the registration module fails to provide an accurate registration is that of parallax error, as can be observed in Fig. 5 (c). The optical and the thermal sensors are physically located at different positions in both the thermal imagers we have used for creating our database. This creates a shift in the image being captured. We observed that while this does not create much of an error when the objects being captured are far away, at nearby distances it becomes a major issue. In fact, the image captured becomes completely different when the object is very close to the imager, creating a shift in perspective that leaves our algorithm unable to provide a good match. We can notice that in Fig. 5 (c) the arches change in their perspective between the thermal and the optical images altogether, and in Fig. 5 (f) the leaves on the right side of the structure change the image between the thermal and the captured optical images.

Another problem that we have faced is that some images are highly bleached. By 'bleached', we mean that there is not sufficient difference between the different surfaces of the object in the thermal image. This is primarily the case when we deal with objects which are left for too long at a certain temperature (for example, an animal sitting in the sun). This can be seen in Fig. 5 (h) and (a). In the first one, one can see that due to the extreme difference between the foreground and the background, there is a distinct loss of features in the image, while in the second case, the background becomes blurry, so that the Mutual Information based algorithm is unable to get enough unique information to create a good match.

Finally, the last case where we see problems with registration is where there is a loss of data due to the sensor being unable to capture precise data. This can be seen in Fig. 5 (e) and (h). In Fig. 5 (e), the small branches in front of the tree are not captured in the thermal image, thus creating a different image for the algorithm, while in (h), the optical sensor is unable to capture the correct image corresponding to the thermal map due to fog and imperfect illumination.

Another problem arises in the deep network when using highly luminous thermal images. In fact, bleached thermal images, even if they are properly registered thanks to information obtained from the surroundings, tend to give very bad perceptual results when colorized. This is usually the case when the subject includes buildings or animals, which do not show much difference in contrast on the walls or the skin, resulting in a uniform single-color scheme.
This is more so because the thermal images are the ones that get the higher weight in the construction of the luminance layer of the final output image in the post processing step described in Section 2.3.

All of our data is available for public use in IEEE Dataport at [32]. We have also included the raw data for the unregistered images in the database so that others may use them for further research in other avenues. All experiments were carried out on a machine with an i7 7820X processor, 64 GB of RAM and a GTX 1080 Ti GPU with 11 Gbps GDDR5X memory. All coding was done in Python with Keras 2.2.4, using Tensorflow 1.13.1 as the backend. On average, one epoch of our program using the proposed algorithm takes 121 seconds to complete. The proposed algorithm reached loss saturation by the 27th epoch. The thermal imagers used were the FLIR E40 and the Sonel KT400.

Conclusion

We have presented a method which has been trained with data from Prayagraj (Allahabad) city in India for the modern setting images, and jointly from Chitrakoot and Prayagraj for the historical buildings images and the greenery data. The crowd data was collected at the Maha Kumbh Mela 2019, which occurs once every 12 years at Prayagraj. It is the biggest fair on earth and, to our knowledge, this is the only dataset that contains images from it. The data was collected over a period of 1.5 years and collated together. While the deep learning network has been trained, it needs to be understood that we have been able to use only 1843 images as of now, and it is expected that the results would improve once more data is added to this database.

Acknowledgements

We would like to thank the DIPR Lab, DRDO, Government of India for funding this research under grant number 2535/DIPR-II/MD/CARS-01 and the Computer Vision and Biometrics Lab (CVBL), IIIT Allahabad for providing the necessary facilities to conduct it. We would also like to thank Prof. Gaurav Sharma, University of Rochester, New York, USA for his expert guidance on the research work and Mr. Nand Kumar Yadav, IIIT Allahabad, without whose help the data collection drive would never have been completed.

References

[1] Tang, Jun. "A color image segmentation algorithm based on region growing." In 2010 2nd International Conference on Computer Engineering and Technology, vol. 6, pp. V6-634. IEEE, 2010.
[2] Berg, Amanda, Jorgen Ahlberg, and Michael Felsberg. "Generating visible spectrum images from thermal infrared." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1143-1152. 2018.
[3] Hwang, S., J. Park, N. Kim, Y. Choi, and I. S. Kweon. "Multispectral pedestrian detection: Benchmark dataset and baseline." In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1037-1045. IEEE, 2015.
[4] Kuang, Xiaodong, Xiubao Sui, Chengwei Liu, Yuan Liu, Qian Chen, and Guohua Gu. "Thermal Infrared Colorization via Conditional Generative Adversarial Network." arXiv preprint arXiv:1810.05399 (2018).
[5] Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein gan." arXiv preprint arXiv:1701.07875 (2017).
[6] Salimans, Tim, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. "Improved techniques for training gans." In Advances in neural information processing systems, pp. 2234-2242. 2016.
[7] Metz, L., B. Poole, D. Pfau, and J. Sohl-Dickstein. "Unrolled generative adversarial networks." arXiv preprint arXiv:1611.02163 (2016).
[8] Poole, Ben, Alexander A. Alemi, Jascha Sohl-Dickstein, and Anelia Angelova. "Improved generator objectives for gans." arXiv preprint arXiv:1612.02780 (2016).
[9] Iizuka, Satoshi, Edgar Simo-Serra, and Hiroshi Ishikawa. "Let there be color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification." ACM Transactions on Graphics (TOG) 35, no. 4 (2016): 110.
[10] Yan, Bin, Xiangshan Kong, Fuyou Jiang, Peng Xu, and Hui Yuan. "Multi-Spectral Stereo Image Matching Based On Adaptive Window." In 2018 International Conference on Advanced Control, Automation and Artificial Intelligence (ACAAI 2018). Atlantis Press, 2018.
[11] Dubrofsky, Elan. "Homography estimation." Master's thesis. Vancouver: University of British Columbia (2009).
[12] Liu, Lixiong, Bao Liu, Hua Huang, and Alan Conrad Bovik. "No-reference image quality assessment based on spatial and spectral entropies." Signal Processing: Image Communication 29, no. 8 (2014): 856-863.
[13] Li, Jing, Quan Pan, Tao Yang, and Yong mei Cheng. "Color based grayscale-fused image enhancement algorithm for video surveillance." In Third International Conference on Image and Graphics (ICIG'04), pp. 47-50. IEEE, 2004.
[14] Suárez, Patricia L., Angel D. Sappa, and Boris X. Vintimilla. "Learning to colorize infrared images." In International Conference on Practical Applications of Agents and Multi-Agent Systems, pp. 164-172. Springer, Cham, 2017.
[15] Cheng, Zezhou, Qingxiong Yang, and Bin Sheng. "Deep colorization." In Proceedings of the IEEE International Conference on Computer Vision, pp. 415-423. 2015.
[16] Levin, Anat, Dani Lischinski, and Yair Weiss. "Colorization using optimization." In ACM Transactions on Graphics (TOG), vol. 23, no. 3, pp. 689-694. ACM, 2004.
[17] Luan, Qing, Fang Wen, Daniel Cohen-Or, Lin Liang, Ying-Qing Xu, and Heung-Yeung Shum. "Natural image colorization." In Proceedings of the 18th Eurographics conference on Rendering Techniques, pp. 309-320. Eurographics Association, 2007.
[18] Welsh, Tomihisa, Michael Ashikhmin, and Klaus Mueller. "Transferring color to greyscale images." In ACM Transactions on Graphics (TOG), vol. 21, no. 3, pp. 277-280. ACM, 2002.
[19] Gupta, Raj Kumar, Alex Yong-Sang Chia, Deepu Rajan, Ee Sin Ng, and Huang Zhiyong. "Image colorization using similar images." In Proceedings of the 20th ACM international conference on Multimedia, pp. 369-378. ACM, 2012.
[20] Larsson, Gustav, Michael Maire, and Gregory Shakhnarovich. "Learning representations for automatic colorization." In European Conference on Computer Vision, pp. 577-593. Springer, Cham, 2016.
[21] Huang, Yi-Chin, Yi-Shin Tung, Jun-Cheng Chen, Sung-Wen Wang, and Ja-Ling Wu. "An adaptive edge detection based colorization algorithm and its applications." In Proceedings of the 13th annual ACM international conference on Multimedia, pp. 351-354. ACM, 2005.
[22] Cao, Yun, Zhiming Zhou, Weinan Zhang, and Yong Yu. "Unsupervised diverse colorization via generative adversarial networks." In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 151-166. Springer, Cham, 2017.
[23] Deshpande, Aditya, Jiajun Lu, Mao-Chuang Yeh, Min Jin Chong, and David Forsyth. "Learning diverse image colorization." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6837-6845. 2017.
[24] Guadarrama, Sergio, Ryan Dahl, David Bieber, Mohammad Norouzi, Jonathon Shlens, and Kevin Murphy. "Pixcolor: Pixel recursive colorization." arXiv preprint arXiv:1705.07208 (2017).
[25] Royer, Amelie, Alexander Kolesnikov, and Christoph H. Lampert. "Probabilistic image colorization." arXiv preprint arXiv:1705.04258 (2017).
[26] Limmer, Matthias, and Hendrik PA Lensch. "Infrared colorization using deep convolutional neural networks." In 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 61-68. IEEE, 2016.
[27] Suárez, Patricia L., Angel D. Sappa, and Boris X. Vintimilla. "Infrared image colorization based on a triplet dcgan architecture." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 18-23. 2017.
[28] Chen, Dongdong, Jing Liao, Lu Yuan, Nenghai Yu, and Gang Hua. "Coherent online video style transfer." In Proceedings of the IEEE International Conference on Computer Vision, pp. 1105-1114. 2017.
[29] Chen, Dongdong, Lu Yuan, Jing Liao, Nenghai Yu, and Gang Hua. "Stereoscopic neural style transfer." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6654-6663. 2018.
[30] Chen, Dongdong, Lu Yuan, Jing Liao, Nenghai Yu, and Gang Hua. "Stylebank: An explicit representation for neural image style transfer." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1897-1906. 2017.
[31] SENSIAC. 2008. Military Sensing Information Analysis Center. Retrieved from https://blogs.upm.es/gti-work/2013/05/06/sensiac-dataset-for-automatic-target-recognition-in-infrared-imagery/
[32] Data available at https://ieee-dataport.org/open-access/thermal-visual-paired-dataset
[33] Liu, Shuo, Mingliang Gao, Vijay John, Zheng Liu, and Erik Blasch. "Deep Learning Thermal Image Translation for Night Vision Perception." ACM Transactions on Intelligent Systems and Technology (TIST) 12, no. 1 (2020): 1-18.
[34] Tao, Dan, Junsheng Shi, and Feiyan Cheng. "Intelligent Colorization for Thermal Infrared Image Based on CNN." In 2020 IEEE International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA), vol. 1, pp. 1184-1190. IEEE, 2020.

Supplementary Materials

Figure 1: A comparison of the off-centre parallax error for objects photographed at a distance of around 40 cm from the thermal imager (a point where both the Sonel and the FLIR thermal imagers have a clear picture of the object being captured), focused manually via the focusing ring included in the imagers. We measured the positions of 2 corresponding pixels in the thermal and the resized optical images; in the resized optical images they are (113, 277) for the Sonel thermal imager (image (b)) and (294, 292) for the FLIR thermal imager (image (a)), both being well within the starting values of (16, 18) and (105, 73) respectively for the Mutual Information based comparison made by the registration method.

[Plot: Time Comparison between proposed Algorithms; vertical axis: Average time taken (in seconds)]

Figure 2: A comparison of the time taken on average between Algorithm 1, Algorithm 2 and Algorithm 3 for registration of thermal-optical pairs

Figure 3: Results obtained by changing the loss function of the proposed method from logcosh to categorical_crossentropy. The first row shows the input thermal image, the second row shows the output mask obtained via the trained Colorization network and the third row shows the output after the Post Processing step. The last row shows the optical domain images for the same.

Algorithm 3

Input: Thermal-Optical image pair

Output: Cropped region of registered Optical image

1. for each image pair in database:
  1.1. create an array from the thermal image in the range [30: length_thermal - 30, 30: height_thermal - 30] and name it as thermal image
  1.2. create a matrix from the grayscale optical image
  1.3. obtain the starting indices as (first1, first2) and the ending indices as (second1, second2):
    1.3.1. first1 = (length_optical - length_thermal)/4 + 30
    1.3.2. first2 = (height_optical - height_thermal)/4 + 30
    1.3.3. second1 = (length_optical - length_thermal)*3/4
    1.3.4. second2 = (height_optical - height_thermal)*3/4
  1.4. flatten the thermal image as thermal
  1.5. count = 0
  1.6. temp = 0
  1.7. for i in first1 to second1:
    1.7.1. check = 0
    1.7.2. for j in first2 to second2:
      1.7.2.1. if (count >= 3):
        1.7.2.1.1. continue
      1.7.2.2. obtain a region [i: i + length_thermal - 60, j: j + height_thermal - 60], flatten it and name it as optical
      1.7.2.3. calculate MI (thermal, optical) and name it as mi
      1.7.2.4. if (temp < mi):
        1.7.2.4.1. temp = mi
        1.7.2.4.2. x = i, y = j
        1.7.2.4.3. check = 1, count = 0
      1.7.2.5. if (check == 0):
        1.7.2.5.1. count = count + 1
  1.8. find the starting indices from (x, y) as (x - 30, y - 30)
  1.9. obtain the registered image as (x - 30: x - 30 + length_thermal, y - 30: y - 30 + height_thermal) from the RGB optical image

In the above algorithm, Algorithm 3, the shape of each optical image is (length_optical, height_optical) and that of each thermal image is (length_thermal, height_thermal).
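A runnable sketch of Algorithm 3 for a single image pair is given below, following the numbered steps as literally as possible. Several details are assumptions made for the example and are not specified in the pseudocode: Mutual Information is estimated from a 16-bin joint histogram, the grayscale optical image is taken as the channel mean, the loop bounds are treated as inclusive, and the best position is initialized to the first candidate. Passing `patience=None` is a hypothetical extension that disables the early stop of step 1.7.2.1 and searches the whole window.

```python
import numpy as np

def mutual_information(a, b, bins=16):
    """Histogram-based estimate of the mutual information of two arrays."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))

def register_pair(thermal, optical_rgb, margin=30, patience=3, bins=16):
    """Return the registered optical crop and its (top, left) origin."""
    lt, ht = thermal.shape
    gray = optical_rgb.mean(axis=2)                       # step 1.2 (ASSUMED grayscale)
    lo, ho = gray.shape
    inner = thermal[margin:lt - margin, margin:ht - margin]  # step 1.1
    first1 = (lo - lt) // 4 + margin                      # step 1.3
    first2 = (ho - ht) // 4 + margin
    second1 = (lo - lt) * 3 // 4
    second2 = (ho - ht) * 3 // 4
    best, x, y, count = 0.0, first1, first2, 0            # steps 1.5 and 1.6
    for i in range(first1, second1 + 1):                  # step 1.7
        improved = False
        stop = False
        for j in range(first2, second2 + 1):
            if patience is not None and count >= patience:
                stop = True      # every later candidate would be skipped
                break
            region = gray[i:i + lt - 2 * margin, j:j + ht - 2 * margin]
            mi = mutual_information(inner, region, bins)
            if best < mi:
                best, x, y = mi, i, j
                improved, count = True, 0
            if not improved:
                count += 1
        if stop:
            break
    top, left = x - margin, y - margin                    # step 1.8
    return optical_rgb[top:top + lt, left:left + ht], (top, left)  # step 1.9

# Synthetic check: the "thermal" image is a known crop of the gray optical image.
rng = np.random.default_rng(0)
optical = rng.integers(0, 256, size=(250, 250, 3)).astype(np.float64)
thermal = optical.mean(axis=2)[55:145, 70:160].copy()
crop, origin = register_pair(thermal, optical, patience=None)
print(origin, crop.shape)
```

On real thermal-optical pairs the intensities differ between domains, which is exactly why Mutual Information, rather than a direct pixel difference, is used as the matching score.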