Deep Sparse Band Selection for Hyperspectral Face Recognition
Fariborz Taherkhani, Jeremy Dawson, and Nasser M. Nasrabadi
West Virginia University
{ft0009@mix, jeremy.dawson@mail, nasser.nasrabadi@mail}.wvu.edu

Abstract.
Hyperspectral imaging systems collect and process information from specific wavelengths across the electromagnetic spectrum. The fusion of multi-spectral bands in the visible spectrum has been exploited to improve face recognition performance over conventional broad-band face images. In this book chapter, we propose a new Convolutional Neural Network (CNN) framework which adopts a structural sparsity learning technique to select the spectral bands that yield the best face recognition performance over all of the spectral bands. Specifically, in this method, images from all bands are fed to a CNN, and the convolutional filters in the first layer of the CNN are then regularized by employing a group Lasso algorithm to zero out the redundant bands during the training of the network. Contrary to other methods, which usually select the useful bands manually or in a greedy fashion, our method selects the optimal spectral bands automatically to achieve the best face recognition performance over all spectral bands. Moreover, experimental results demonstrate that our method outperforms state-of-the-art band selection methods for face recognition on several publicly available hyperspectral face image datasets.
In recent years, hyperspectral imaging has attracted much attention due to the decreasing cost of hyperspectral cameras used for image acquisition [1]. A hyperspectral image consists of many narrow spectral bands within the visible spectrum and beyond. This data is structured as a hyperspectral "cube", with the x and y coordinates making up the imaging pixels and the z coordinate the imaging wavelength, which, in the case of facial imaging, results in several co-registered face images captured at varying wavelengths. Hyperspectral imaging has provided new opportunities for improving the performance of different imaging tasks, such as face recognition in biometrics, which exploits the spectral characteristics of facial tissues to increase the inter-subject differences [2]. It has been demonstrated that, by adding the extra spectral dimension, the size of the feature space representing a face image is increased, which results in larger inter-class feature differences between subjects for face recognition. Beyond the surface appearance, spectral measurements in the infra-red (i.e., 700 nm to 1000 nm) can penetrate the sub-surface tissue, which can produce notably different biometric features for each subject [3].

A hyperspectral imaging camera simultaneously measures hundreds of adjacent spectral bands with a small spectral resolution (e.g., 10 nm). For example, AVIRIS hyperspectral imaging includes spectral bands from 400 nm to 2500 nm [4]. Such a large number of bands implies high-dimensional data, which remarkably influences the performance of face recognition. This is because redundancy exists between spectral bands, and some bands may hold less discriminative information than others. Therefore, it is advantageous to discard bands which carry little or no discriminative information during the recognition task.
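The cube structure described above can be sketched in a few lines of numpy; the spatial and spectral sizes used here are illustrative assumptions, not those of any particular dataset.

```python
import numpy as np

# Illustrative hyperspectral "cube": H x W spatial pixels, B spectral bands.
H, W, B = 64, 64, 33
cube = np.random.rand(H, W, B)

band_image = cube[:, :, 10]        # one co-registered face image at band index 10
pixel_spectrum = cube[32, 32, :]   # spectral signature of a single pixel

assert band_image.shape == (H, W)
assert pixel_spectrum.shape == (B,)
```

Slicing along the z axis recovers a single co-registered face image, while slicing along x and y recovers the per-pixel spectral signature discussed in the text.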
To deal with this problem, many band selection approaches have been proposed in order to choose the optimal and informative bands for face recognition. Most of these methods, such as those presented in [5], are based on dimensionality reduction, but in an ad-hoc fashion. These methods, however, suffer from a lack of comprehensive and consolidated evaluation due to a) the small number of subjects used during the testing of the methods, and b) the lack of publicly available datasets for comparison. Moreover, these studies do not compare the performance of their algorithms comprehensively with other face recognition approaches that could be used for this challenge with some modifications [3].

The development of hyperspectral cameras has introduced many useful techniques that merge spectral and spatial information. Since hyperspectral cameras have become more readily available, computational approaches introduced initially for remote sensing challenges have been leveraged for other applications, such as biomedical applications. Considering the vast person-to-person spectral variability for different types of tissue, hyperspectral imaging has the power to enhance the capability of automated systems for human re-identification. Recent face recognition protocols essentially apply spatial discriminants that are based on geometric facial features [4]. Many of these protocols have provided promising results on databases captured under controlled conditions. However, these methods often show a significant performance drop in the presence of variation in face orientation [2,6]. The work in [7], for instance, indicated that there is a significant drop in recognition performance for images of faces which are rotated more than 32 degrees from the frontal image used to train the model. Furthermore, the method in [8], which uses a light-fields model for pose-invariant face recognition, provided good recognition results for probe faces which are rotated more than 60 degrees from a gallery face.
The method, however, requires the manual determination of the 3D transformation to register face images. The methods that use geometric features can also perform poorly if subjects are imaged across varying spans of time. For instance, recognition performance can decrease by a maximum of 20% if imaging sessions are separated by a two-week interval [7]. Partial face occlusion also usually results in poor performance. An approach [9] that divided the face images into regions for isolated analysis can tolerate partial face occlusion without a decrease in matching accuracy. Thermal infrared imaging provides an alternative imaging modality that has been leveraged for face recognition [10]. However, algorithms based on thermal images utilize spatial features and have difficulty recognizing faces when presented with images containing pose variation.

A 3D morphable face approach has been introduced for face recognition across varying poses [11]. This method has provided good performance on a 68-subject dataset. However, it is currently computationally intensive and requires significant manual intervention. Many of the limitations of current face recognition methods can be overcome by leveraging spectral information. The interaction of light with human tissue has been explored comprehensively by many works [12] which consider the spectral properties of tissue. The epidermal and dermal layers of human skin are essentially a scattering medium that consists of several pigments such as hemoglobin, melanin, bilirubin, and carotene. Small changes in the distribution of these pigments cause considerable changes in the skin's spectral reflectance [13]. For instance, the effects are large enough to enable algorithms for the automated separation of melanin and hemoglobin from RGB images [14]. Recent work [15] has measured skin reflectance spectra over the visible wavelengths and introduced algorithms based on these spectra.
There are three common techniques used to construct a hyperspectral image: spatial scanning, spectral scanning, and snapshot imaging. These techniques are described in detail in the following sections.

Spatial scan systems capture each spectral band along a single dimension as a scanned composite image of the object or area being viewed. The scanning aspect of these systems refers to the narrow imaging field of view (e.g., a 1xN pixel array) of the system. The system creates images using an optical slit to allow only a thin strip of the image to pass through a prism or grating that then projects the diffracted scene onto an imaging sensor. By limiting the amount of scene (i.e., spatial) information entering the system at any given instant, most of the imaging sensor area can be utilized to capture spectral information. This reduction in spatial resolution allows for simultaneous capture of data at a higher spectral resolution. This data capture technique is a practical solution for applications where a scanning operation is possible, specifically for airborne mounted systems that image the ground surface as an aircraft flies overhead. Food quality inspection is another successful application of these systems, as they can rapidly detect defective or unhealthy produce on a production or sorting line. While this technique provides both high spatial and spectral resolution, line scan Hyperspectral Imaging Systems (HSIs) are highly susceptible to changes in the morphology of the target. This means the system must be fixed to a steady structure as the subject passes through its linear field of view or, alternatively, that the subject must remain stationary as the imaging scan is conducted.

Spectral scan HSIs, such as those employing an Acousto-Optical Tunable Filter (AOTF) or a Liquid Crystal Tunable Filter (LCTF), use tunable optical devices that allow specific wavelengths of electromagnetic radiation to pass through to a broadband camera sensor.
While the fundamental technology behind these tunable filters differs, their application achieves the same goal in a similar fashion by iteratively selecting the spectral bands of a subject that fall on the imaging sensor. Depending on the type of filter used, the integration time between the capture of each band can vary based upon the driving frequency of the tunable optics and the integration time of the imaging plane. One limitation of scanning HSIs is that all bands in the data cube cannot be captured simultaneously. Fig. 1 (a) [16] illustrates a diagram which depicts the creation of the hyperspectral data cube by spatial and spectral scanning.
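The line-by-line assembly of the data cube in a spatial scan system can be sketched as follows; `capture_line` is a hypothetical placeholder for one sensor readout, and all sizes are illustrative assumptions.

```python
import numpy as np

# Sketch of spatial (line-scan) acquisition: each readout is one 1 x W spatial
# line dispersed over B spectral channels; stacking H such lines as the target
# (or platform) moves yields the full (H, W, B) hyperspectral cube.
H, W, B = 48, 64, 33

def capture_line(row):
    # Hypothetical stand-in for the sensor readout of one spatial line (W x B).
    return np.random.rand(W, B)

cube = np.stack([capture_line(r) for r in range(H)], axis=0)
assert cube.shape == (H, W, B)
```

A spectral scan system would instead fill the cube one (H, W) band image at a time, and a snapshot system would acquire the whole cube in a single exposure.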
Fig. 1. Building the spectral data cube in both line scan and snapshot systems
In contrast to scanning methods, a snapshot hyperspectral camera captures hyperspectral image data in which all wavelengths are acquired instantly to create the hypercube, as shown in Fig. 1 (b) [16]. Snapshot hyperspectral technology is designed and built in configurations different from line scan imaging systems, often employing a prism to break up the light and cause the diffracted, spatially separated wavelengths to fall on different portions of the imaging sensor dedicated to collecting light energy from a specific wavelength. Software is used to sort the varying wavelengths of light falling onto different pixels into wavelength-specific groups. While conventional line-scan hyperspectral cameras build the data cube by scanning through various filtered wavelengths or spatial dimensions, the snapshot HSI acquires an image and the spectral signature at each pixel simultaneously. Snapshot systems have the advantage of faster measurement and higher sensitivity. However, one drawback is that the resolution is limited by down-sampling the light falling onto the imaging array into a smaller number of spectral channels.
Most hyperspectral face recognition approaches are extensions of typical face recognition methods which have been adjusted to this challenge. For example, each band of a hyperspectral image can be considered as a separate image, and as a result, gray-scale face recognition approaches can be applied to it. Considering a hyperspectral cube as a set of images, image-set classification approaches can be leveraged for this problem without using a dimensionality reduction algorithm [3]. For example, Pan et al. [2] used 31 spectral band signatures at manually-chosen landmarks on face images which were captured within the near infra-red spectrum. Their method provided high recognition accuracy under pose variations on a dataset which contains 1400 hyperspectral images from 200 people. However, the method does not achieve the same promising results on the public hyperspectral datasets used in [6]. Later on, Pan et al. [5] incorporated spatial and spectral information to improve the recognition results on the same dataset. Robila [17] distinguished spectral signatures of different face locations by leveraging spectral angle measurements. Their experiments are restricted to a very small dataset which consists of only 8 subjects. Di et al. [18] projected the cube of hyperspectral images to a lower dimensional space by using a two-dimensional PCA method, and then Euclidean distance was calculated for face recognition. Shen et al. [19] used Gabor wavelets on hyperspectral data cubes to generate 52 new cubes from each given cube. Then, they used an ad-hoc sub-sampling algorithm to reduce the large amount of data for face recognition.

A wide variety of approaches have been used to address the challenge of band selection for hyperspectral face recognition.
Some of these methods are information-based methods [20], transform-based methods [21], and search-based methods [22], along with other techniques which include maximization of a spectral angle mapper [23], high-order moments [24], wavelet analysis [25], and a scheme trading spectral for spatial resolution [26]. Nevertheless, there are still some challenges with these approaches due to the presence of local-minima problems, difficulties for real-time implementation, and high computational cost. Hyperspectral imaging techniques for face recognition have provided promising results in the field of biometrics, overcoming challenges such as pose variations, lighting variations, presentation attacks, and facial expression variations [27]. The fusion of narrow-band spectral images in the visible spectrum has been explored to enhance face recognition performance [28]. For example, Chang et al. [21] have demonstrated that the fusion of 25 spectral bands can surpass the performance of conventional broad band images for face recognition, mainly in cases where the training and testing images are collected under different types of illumination.

Despite the new opportunities provided by hyperspectral imaging, challenges still exist due to low signal-to-noise ratios, high dimensionality, and difficulty in data acquisition [29]. For example, hyperspectral images are usually stacked sequentially; hence, subject movements, specifically blinking of the eyes, can lead to band misalignment. This misalignment causes intra-class variations which cannot be compensated for by adding a spectral dimension. Moreover, adding a spectral dimension makes the recognition task challenging due to the difficulty of choosing the required discriminative information.
Furthermore, the spectral dimension raises a curse-of-dimensionality concern, because the ratio between the dimension of the data and the number of training samples becomes very large [3].

Sparse dictionary learning has also been extended to hyperspectral image classification [30]. Sparse-based hyperspectral image classification methods usually rank the contribution of each band to the classification task, such that each band is approximated by a linear combination of a dictionary which contains the other band images. The sparse coefficients represent the contribution of each dictionary atom to the target band image, where a large coefficient shows that the band has a significant contribution to classification, while a small coefficient indicates that the band has a negligible contribution.
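The band-ranking idea above can be sketched with a few iterations of ISTA (iterative soft thresholding), one standard sparse-coding solver; the dictionary sizes, the true mixing weights, and the λ value are all illustrative assumptions, not taken from any cited method.

```python
import numpy as np

# Each "band" (here a flattened band image) is approximated by a sparse linear
# combination of the other bands; large |coefficient| marks an important band.
rng = np.random.default_rng(0)
n_pixels = 500
D = rng.standard_normal((n_pixels, 19))            # dictionary: 19 other bands
# Synthetic target band: mixture of bands 0, 1, 2 plus small noise.
y = D[:, :3] @ np.array([0.8, 0.5, 0.3]) + 0.01 * rng.standard_normal(n_pixels)

lam = 0.1
step = 1.0 / np.linalg.norm(D, 2) ** 2             # 1 / largest eigenvalue of D^T D
x = np.zeros(D.shape[1])
for _ in range(200):                               # ISTA: gradient step + soft threshold
    g = x - step * (D.T @ (D @ x - y))
    x = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)

ranking = np.argsort(-np.abs(x))                   # bands sorted by contribution
assert set(ranking[:3]) == {0, 1, 2}               # the three true bands rank highest
```

The recovered coefficient magnitudes play the role of the per-band contribution scores described in the text.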
In recent years, deep learning methods have shown impressive learning ability in image retrieval [31,32,33], image generation [34,35,36], security applications [37,38], image classification [39,40,41], object detection [42,43], face recognition [44,45,46], and many other computer vision and biometrics tasks. In addition to improving performance in computer vision and biometrics tasks, deep learning in combination with reinforcement learning methods was able to defeat the human champion in challenging games such as Go [47]. CNN-based models have also been applied to hyperspectral image classification [48], band selection [49,50], and hyperspectral face recognition [51]. However, few of these methods have provided promising results for hyperspectral image classification, due to a sub-optimal learning process caused by an insufficient amount of training data and the use of comparatively small-scale CNNs [52].

Previous research on band selection for face recognition usually works in an ad hoc fashion where combinations of different bands are evaluated to determine the best recognition performance. For instance, Di et al. [18] manually choose two disjoint subsets of bands, centered at 540 nm and 580 nm, to examine their discrimination power. However, selecting the optimal bands manually may not be appropriate because of the huge search space of many spectral bands. In another case, Guo et al. [53] select the optimal bands by using an exhaustive search in such a way that the bands are first evaluated individually for face recognition, and a combination of the results is then selected by using a score-level fusion method. However, evaluating each band individually may not consider the complementary relationships between different bands. As a result, the selected subset of bands may not provide an optimal solution. To address this problem, Uzair et al.
[3] leverage a sequential backward selection algorithm to search for a set of the most discriminative bands. Sharma et al. [51] adopt a CNN-based model for band selection which uses a CNN to obtain the features from each spectral band independently, and then use Adaboost in a greedy fashion (similar to other methods in the literature) for feature selection to determine the best bands. This method selects one band at a time, which ignores the complementary relationships between different bands for face recognition.

In this book chapter, we propose a CNN-based model which adopts a Structural Sparsity Learning (SSL) technique to select the optimal bands to obtain the best recognition performance over all broad band images. We employ a group Lasso regularization algorithm [54] to sparsify the redundant spectral bands for face recognition. The group Lasso puts a constraint on the structure of the filters in the first layer of our CNN during the training process. This constraint is a loss term augmented to the total loss function used for face recognition to zero out the redundant bands during the training of the CNN. To summarize, the main contributions of this book chapter include:

1: Joint face recognition and spectral band selection: We propose an end-to-end deep framework which jointly recognizes hyperspectral face images and selects the optimal spectral bands for the face recognition task.

2: Using group sparsity to automatically select the optimal bands: We adopt a group sparsity technique to reduce the depth of the convolutional filters in the first layer of our CNN network. This is done to zero out the redundant bands during face recognition.
Contrary to most of the existing methods, which select the optimal bands in a greedy fashion or manually, our group sparsity technique selects the optimal bands automatically to obtain the best face recognition performance over all the spectral bands.

3: Comprehensive evaluation and best recognition accuracy: We evaluate our algorithm comprehensively on three standard publicly available hyperspectral face image datasets. The results indicate that our method outperforms state-of-the-art spectral band selection methods for face recognition.
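The group Lasso mechanism behind contribution 2 can be sketched in numpy. This is a minimal illustration, not the authors' training code: the filter shape, the λ value, and which bands are "useful" are illustrative assumptions. All first-layer weights touching one input band form a group, and the group Lasso proximal operator shrinks each group as a whole, zeroing out redundant bands.

```python
import numpy as np

# First-layer filters: (out_channels, n_bands, k, k); one group per input band.
rng = np.random.default_rng(1)
out_ch, n_bands, k = 8, 33, 3
W = rng.standard_normal((out_ch, n_bands, k, k)) * 0.01
W[:, 5] *= 50.0    # pretend band 5 learned large weights (a useful band)
W[:, 20] *= 50.0   # ... likewise band 20

def group_lasso_prox(W, lam):
    """Per-band proximal step: W_g <- W_g * max(0, 1 - lam / ||W_g||_2)."""
    W = W.copy()
    for g in range(W.shape[1]):
        norm = np.linalg.norm(W[:, g])
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        W[:, g] *= scale
    return W

W_sparse = group_lasso_prox(W, lam=0.2)
kept = [g for g in range(n_bands) if np.linalg.norm(W_sparse[:, g]) > 0]
assert set(kept) == {5, 20}   # only the bands with large group norms survive
```

In the actual framework this penalty is a term added to the face recognition loss and applied throughout training, so band selection and recognition are learned jointly.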
The sparsity of signals has been a powerful tool in many classical signal processing applications, such as denoising and compression. This is because most natural signals can be represented compactly by only a few coefficients that carry the most principal information in a certain dictionary or basis. More recently, sparse data representation has also been leveraged in pattern recognition and computer vision through the development of the compressed sensing (CS) framework and sparse modeling of signals and images. These applications are essentially based on the fact that, when contrasted with the high dimensionality of natural signals, the signals in the same category usually lie in a low-dimensional subspace. Thus, for each sample, there is a sparse representation with respect to some proper basis which encodes the important information. CS theory guarantees that a sparse signal can be recovered from its incomplete but incoherent projections with high probability. This enables the recovery of the sparse representation by decomposing the sample over an often over-complete dictionary constructed by, or learned from, the representative samples. Once the sparse representation vector is constructed, the important information can be obtained directly from the recovered vector.

Sparsity was also introduced to enhance the accuracy of prediction and interpretability of regression models by altering the model fitting process to choose only a subset of the provided covariates for use in the final model, rather than using all of them. Sparsity is important for many reasons, as follows:

a) It is desirable to have as small a number of neurons as possible in a neural network firing at a given time when a stimulus is presented. This means that a sparse model is faster, as it is possible to exploit that sparsity to construct faster specialized algorithms.
For instance, in structure from motion, the data matrix obtained when applying bundle adjustment is sparse, and many methods have been proposed to take advantage of this sparseness and speed things up. Sparse models are normally very scalable yet compact. Recent large-scale deep learning models can easily have more than 200k nodes, but they are not very functional precisely because they are not sparse.

b) Sparse models can allow more functionality to be compressed into a neural network. Therefore, it is desirable to have sparsity at the neural activity level in deep learning and to explore ways to keep more neurons inactive at any given time through neural region specialization. Neurological studies of biological brains indicate this region specialization: face regions fire if a face is presented, while other regions remain mainly inactive. This means finding ways to channel the stimuli to the
right regions of the deep model and prevent computations that end up producing no response. This can help make deep models not only more efficient but more functional as well.

c) In a deep neural network architecture, the main characteristic that matters is sparsity of connections; each unit should often be connected to comparatively few other units. In the human brain, estimates put the number of neurons on the order of 10^10 to 10^11; however, each neuron is only connected to roughly 10^4 other neurons on average. In deep learning, we see this in convolutional network architectures: each neuron receives information only from a very small patch in the lower layer.

d) Sparsity of connections can be considered as resembling sparsity of weights, because it is equivalent to having a fully connected network that has zero weights in most places. However, sparsity of connections is better, because we do not spend the computational cost of explicitly multiplying each input by zero and summing all those zeros.

Statisticians usually learn sparse models to understand which variables are most critical. However, this is an analysis strategy, not a strategy for making better predictions. Learning activations that are sparse does not really seem to matter either. Previously, researchers thought that part of the reason the Rectified Linear Unit (ReLU) worked well was that its activations were sparse. However, it was shown that all that matters is that they are piece-wise linear.

Our algorithm is closely related to a compression technique based on sparsity. Here, we also provide a brief overview of two other popular methods: quantization and decomposition.
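Points (c) and (d) above can be checked concretely: a convolution with a small kernel is exactly a fully connected layer whose weight matrix is zero almost everywhere. The signal length and kernel below are illustrative assumptions.

```python
import numpy as np

# A 1D convolution (kernel size 3) written as a fully connected weight matrix.
n, k = 100, 3
kernel = np.array([0.25, 0.5, 0.25])

W = np.zeros((n - k + 1, n))          # FC matrix realizing the same linear map
for i in range(n - k + 1):
    W[i, i:i + k] = kernel            # each output unit sees only a small patch

x = np.random.rand(n)
conv_out = np.convolve(x, kernel[::-1], mode="valid")  # direct correlation
assert np.allclose(W @ x, conv_out)   # identical outputs

sparsity = 1.0 - np.count_nonzero(W) / W.size
assert sparsity > 0.95                # over 95% of the FC weights are exactly zero
```

The sparse-connection form avoids ever multiplying by those zeros, which is the computational advantage point (d) describes.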
Initial research on neural network compression concentrated on removing useless connections by using weight decay. Hanson and Pratt [55] propose hyperbolic and exponential biases on the cost objective function. Optimal Brain Damage and Optimal Brain Surgeon [56] prune the networks by using second-order derivatives of the objectives. Recent research by Han et al. [57] alternates between pruning near-zero weights, which are encouraged by ℓ1 or ℓ2 regularization, and retraining the pruned networks. More complex regularizers have also been introduced. Wen et al. [58] and Li et al. [59] place structured sparsity regularizers on the weights, while Murray and Chiang [60] place them on the hidden units. Feng and Darrell [61] propose a nonparametric prior based on the Indian buffet process [62] on the network layers. Hu et al. [63] prune neurons based on the analysis of their outputs on a large dataset. Anwar et al. [64] use particular sparsity patterns: channel-wise (deleting a channel from a layer or feature map), kernel-wise (deleting all connections between two feature maps in successive layers), and intra-kernel-strided (deleting connections between two features with a specific stride and offset). They also introduce the use of a particle filter to determine the necessity of the connections and paths over the course of training. Another line of research introduces fixed network architectures with some subsets of connections deleted. For instance, LeCun et al. [65] delete connections between the first two convolutional feature maps in an entirely uniform fashion. This approach, however, only considers a pre-defined pattern in which the same number of input feature maps is assigned to each output feature map. Moreover, this method does not investigate how sparse connections influence the performance compared to dense networks. Likewise, Ciresan et al. [66] delete random connections in their MNIST experiments.
However, they do not aim to preserve the spatial convolutional density, and it may be challenging to harvest the savings on existing hardware. Ioannou et al. [67] investigate three kinds of hierarchical arrangements of filter groups for CNNs, which depend on different assumptions about the co-dependency of filters within each layer. These arrangements contain columnar topologies inspired by AlexNet [40], tree-like topologies previously used by Ioannou et al. [67], and root-like topologies. Finally, [68] introduces the depth multiplier technique to scale down the number of filters in each convolutional layer by a scalar. In this case, the depth multiplier can be considered a channel-wise pruning method, as introduced in [64]. However, the depth multiplier changes the network architecture before the training phase and deletes feature maps of each layer in a uniform fashion. With the exception of [64] and the depth multiplier [68], the above previous work performs connection pruning that results in nonuniform network architectures. Therefore, these approaches need additional effort to represent network connections and may or may not lead to a reduction in computational cost.

Decreasing the degree of redundancy of the model parameters can also be performed in the form of quantization of the network parameters. Arora et al. [69] propose to train CNNs with binary and ternary weights, respectively. Gong et al. [70] leverage vector quantization for parameters in fully connected layers. Anwar et al. [71] quantize a network with squared error minimization. Chen et al. [72] group network parameters randomly by using a hash function. Note that this method can be complementary to the network pruning method. For instance, Han et al. [73] merge connection pruning (Han et al. [57]) with quantization and Huffman coding.
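A minimal sketch of the magnitude-based pruning idea discussed above (in the spirit of Han et al. [57], but not their code): weights below a threshold are zeroed, and in the real method the surviving weights are then retrained, which is omitted here. The matrix size and threshold are illustrative assumptions.

```python
import numpy as np

# Magnitude pruning: keep only weights whose absolute value clears a threshold.
rng = np.random.default_rng(2)
W = rng.standard_normal((64, 64))       # a dense weight matrix

threshold = 0.5
mask = np.abs(W) >= threshold           # binary mask of surviving connections
W_pruned = W * mask

assert np.all(W_pruned[np.abs(W) < threshold] == 0)
kept_fraction = mask.mean()             # fraction of connections that remain
assert 0.0 < kept_fraction < 1.0
```

Alternating this pruning step with retraining lets the network recover accuracy while keeping only the surviving connections.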
Decomposition is another method, based on low-rank decomposition of the parameters. Decomposition approaches include truncated Singular Value Decomposition (SVD) [74], decomposition into rank-1 bases [75], Canonical Polyadic Decomposition (CPD) [76], sparse dictionary learning, asymmetric (3D) decomposition using a reconstruction loss on non-linear responses integrated with a rank selection method based on Principal Component Analysis (PCA) [77], and Tucker decomposition applying a kernel tensor reconstruction loss integrated with a rank selection approach based on global analytic variational Bayesian matrix factorization [78].
Krizhevsky et al. [40] used Dropout to regularize the fully connected layers in neural networks by randomly setting a subset of activations to zero over the course of training. Later, Wan et al. [79] introduced DropConnect, a generalization of Dropout that instead randomly zeroes out a subset of weights or connections. Recently, Han et al. [73] and Jin et al. [80] proposed a kind of regularization where dropped connections are unfrozen and the network is retrained. This method can be thought of as an incremental training approach.
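The contrast between the two schemes can be sketched directly: Dropout masks whole activations, DropConnect masks individual weights. The layer sizes and the drop probability p are illustrative assumptions, and inverted scaling by 1/(1-p) is used so expected activations match test time.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 0.5
x = rng.standard_normal(256)              # activations of a hidden layer
W = rng.standard_normal((128, 256))       # weights of the next layer

# Dropout: zero a random subset of *units*, then rescale.
drop_act = rng.random(256) >= p
y_dropout = W @ (x * drop_act) / (1 - p)

# DropConnect: zero a random subset of *weights*, then rescale.
drop_w = rng.random(W.shape) >= p
y_dropconnect = (W * drop_w) @ x / (1 - p)

assert y_dropout.shape == y_dropconnect.shape == (128,)
```

At test time both schemes use the full, unmasked layer; only training-time forward passes apply the random masks.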
Network architectures and compression are closely related. The purpose of compression is to eliminate redundancy in the network parameters. Therefore, knowledge about the traits that indicate the success of an architecture is advantageous. Other than the discovery that depth is an essential factor, little is known regarding such traits. Some previous research performs architecture search, but without the main purpose of performing compression. Recent work introduces skip connections, or shortcuts, into convolutional networks, such as residual networks [39].
The CNN is a well-known deep learning framework which was inspired by the visual cortex of animals. It was first widely applied to object recognition, but is now used in other areas as well, such as object tracking [81], pose estimation [82], visual saliency detection [83], action recognition [84], and object detection [85]. CNNs are similar to traditional neural networks in that they consist of neurons that self-optimize through learning. Each neuron receives an input and conducts an operation (such as a scalar product followed by a non-linear function), the basis of countless neural networks. From the given input image to the final output of the class scores, the entire network still represents a single perceptive score function. The last layer consists of a loss function associated with the classes, and all of the regular methodologies and techniques introduced for traditional neural networks can still be used. The only important difference between CNNs and traditional neural networks is that CNNs are essentially used in the field of pattern recognition within images. This gives us the opportunity to encode image-specific features into the architecture, making the network more suited for image-focused tasks, while further reducing the parameters required to set up the model. One of the largest limitations of traditional forms of neural networks is that they struggle with the computational complexity needed to process image data. Common machine learning datasets such as the MNIST database of handwritten digits are appropriate for most types of neural networks because of their relatively small image dimensionality of just 28 × 28. With this dataset, a single neuron in the first hidden layer will have 784 weights (28 × 28 × 1, considering that MNIST is normalized to just black and white values), which is manageable for most types of neural networks. Here, we use a CNN for our hyperspectral band selection for face recognition.
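The parameter count quoted above can be checked directly, and compared with the convolutional alternative; the 5 × 5 kernel size used for the comparison is an illustrative assumption.

```python
# Fully connected: one first-layer neuron sees every pixel of a 28 x 28 x 1
# MNIST image, so it needs one weight per pixel.
h, w, c = 28, 28, 1
fc_weights_per_neuron = h * w * c
assert fc_weights_per_neuron == 784

# Convolutional: a neuron with a 5 x 5 receptive field needs only k*k*c
# weights, independent of the image size, and those weights are shared
# across all spatial positions of the feature map.
k = 5
conv_weights_per_filter = k * k * c
assert conv_weights_per_filter == 25
```

For larger images the gap widens dramatically, which is why weight sharing and local receptive fields make CNNs practical for image data.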
We used VGG-19 [86] as our baseline CNN. The convolutional layer constructs the basic unit of a CNN, where most of the computation is conducted. It is basically a set of feature maps with neurons organized within them. The weights of the convolutional layer are a set of filters or kernels which are learned during training. These filters are convolved with the feature maps to create separate two-dimensional activation maps, stacked together along the depth dimension to provide the output volume. Neurons that exist in the same feature map share their weights, thereby decreasing the complexity of the network by keeping the number of weights low. The spatial extent of the sparse connectivity between the neurons of two layers is a hyperparameter named the receptive field. The hyperparameters that manage the size of the output volume are the depth (number of filters at a layer), the stride (for moving the filter), and the zero-padding (to manage the spatial size of the output). CNNs are trained by back-propagation, and the backward pass also performs a convolution operation, but with spatially flipped filters. Fig. 2 shows the basic convolution operation of a CNN.

One of the notable variants of a CNN is "Network In Network" (NIN), introduced by Lin et al. [87], in which the convolution filter is a Multi-Layer Perceptron (MLP) instead of the typical linear filter, and the fully connected layers are replaced by a Global Average Pooling (GAP) layer. The resulting structure is named the MLP-Conv layer, and the network consists of a stack of MLP-Conv layers. Unlike a regular CNN, NIN can improve the abstraction ability of the latent concepts. The authors provide justification that the outputs of the last MLP-Conv layer of NIN are confidence maps of the classes, leading to the possibility of conducting object recognition using NIN.
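As an illustration of the two operations just described, here is a minimal numpy sketch of a single-channel convolution and of global average pooling. This is a toy illustration, not part of the chapter's TensorFlow implementation, and the function names are our own:

```python
import numpy as np

def conv2d_single(feature_map, kernel):
    """Valid 2-D convolution (implemented as cross-correlation, as in most CNN
    libraries) of one feature map with one kernel, stride 1, no padding."""
    H, W = feature_map.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # each output value is the sum of an elementwise product
            out[i, j] = np.sum(feature_map[i:i + kH, j:j + kW] * kernel)
    return out

def global_average_pooling(activations):
    """GAP as used in NIN: collapse each H x W activation map to one scalar per channel."""
    # activations: (H, W, C) -> (C,)
    return activations.mean(axis=(0, 1))
```

Convolving a 4 × 4 map with a 3 × 3 kernel yields a 2 × 2 output, and GAP on an (H, W, C) volume yields one value per channel, which is what lets NIN drop the FC layers.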
The GAP layer within the architecture is used to reduce the number of parameters of our framework. Indeed, reducing the dimension of the CNN output by the GAP layer prevents our model from becoming over-parameterized and having a large dimension. Therefore, the chance of overfitting the model is potentially reduced. Basic CNN architectures have alternating convolutional and pooling layers; the latter reduce the spatial dimension of the activation maps (with little loss of information) and the number of parameters in the network, thereby decreasing the overall computational complexity. This also helps to manage the problem of overfitting. Some of the common pooling operations are max pooling, average pooling, stochastic pooling [88], spectral pooling [89], spatial pyramid pooling [90], and multiscale orderless pooling [91]. The work by Dosovitskiy et al. [92] evaluates the functionality of different components of a CNN and finds that max pooling layers can be replaced with convolutional layers with a stride of two. This applies essentially to simple networks, which have nevertheless proven to beat many existing intricate architectures. We used max pooling in our deep model. Fig. 3 shows the operation of max pooling.
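The 2 × 2, stride-2 max pooling used in our model can be sketched in a few lines of numpy (an illustrative toy, not the chapter's implementation):

```python
import numpy as np

def max_pool_2x2(x):
    """2 x 2 max pooling with stride 2 over one (H, W) activation map; H and W must be even."""
    H, W = x.shape
    # group the map into non-overlapping 2x2 windows and take the max of each
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))
```

On a 4 × 4 map this keeps one value per non-overlapping 2 × 2 window, halving each spatial dimension.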
Fig. 2. Convolution operation.

Fig. 3. Max pooling operation.

Neurons in this layer are Fully Connected (FC) to all neurons in the previous layer, as in a regular neural network. High-level reasoning is performed here. The neurons are not spatially arranged, so there cannot be a convolution layer after a fully connected layer. Currently, some deep architectures have their FC layers replaced, as in NIN, where the FC layer is replaced by a GAP layer.
The last FC layer serves as the classification layer that calculates the loss, i.e., a penalty for the discrepancy between the actual and desired outputs. For predicting a single class out of k mutually exclusive classes, we use the softmax loss. It is the most commonly and widely used loss function; specifically, it is multinomial logistic regression. It maps the predictions to non-negative values which are normalized to form a probability distribution over the classes. A large-margin classifier, such as an SVM, is trained by computing a hinge loss. For regressing to real-valued labels, a Euclidean loss can be calculated. We used the softmax loss to train our deep model. The softmax loss is formulated as follows:

L(w) = -\sum_{i=1}^{n} \sum_{j=1}^{k} y_i^{(j)} \log(p_i^{(j)}),   (1)

where n is the number of training samples, y_i is the one-hot encoding label for the i-th sample, and y_i^{(j)} is the j-th element in the label vector y_i. The variable p_i is the probability vector, and p_i^{(j)} is the j-th element of p_i, which indicates the probability that the CNN assigns to class j. The variable w denotes the parameters of the CNN.
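A minimal numpy sketch of the softmax loss of Eq. (1), with the usual max-subtraction trick for numerical stability. This is an illustration only; the chapter's model is trained in TensorFlow:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def softmax_loss(logits, y_onehot):
    """Eq. (1): L = -sum_i sum_j y_i^(j) * log(p_i^(j)), summed over the batch."""
    p = softmax(logits)
    return -np.sum(y_onehot * np.log(p))
```

For uniform logits over k classes, the per-sample loss is log(k), the value at which training typically starts.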
Fig. 4. ReLU activation function.
ReLU is the standard activation function used in CNN models. It is a piecewise-linear activation function with thresholding at zero, as shown in Eq. (2). It has been shown that the convergence of gradient descent is accelerated by applying ReLU. The ReLU activation function is shown in Fig. 4:

f(x) = \max\{0, x\}.   (2)

Our band selection algorithm can be used with any other deep architecture, including ResNet [93] and AlexNet [94]; there is no restriction on choosing a specific deep model when performing band selection in the first convolutional layer of these networks using our algorithm. We used the VGG-19 network since a) it is easy to implement in TensorFlow and is more popular than other deep models, and b) it achieved excellent results on the ILSVRC-2014 dataset (i.e., the ImageNet competition). The input to our VGG-19 based CNN is a fixed-size 224 × 224 hyperspectral image. The only pre-processing that we perform is to subtract the mean spectral value, calculated on the training set, from each pixel. The image is sent through a stack of convolutional operations, where we use filters with a very small receptive field of 3 × 3. This filter size is the smallest that captures the notion of left and right, up and down, and center. In one of the configurations, we can also use 1 × 1 convolutional filters, which can be considered a linear transformation of the input channels. The convolutional stride is set to 1 pixel. The spatial padding of the convolutional layer input is chosen such that the spatial resolution is preserved after convolution, which means that the padding is 1 pixel for the 3 × 3 convolutional layers. Spatial pooling is performed by five max-pooling layers, which follow some of the convolutional layers. Note that not all of the convolutional layers are followed by max-pooling. In the VGG-19 network, max-pooling is carried out over a 2 × 2 pixel window, with a stride of 2.
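The padding arithmetic mentioned above can be checked with the standard output-size formula; this helper is our own illustration, not part of the chapter's code:

```python
def conv_out_size(h, kernel, stride=1, pad=0):
    """Output spatial size of a convolution/pooling layer:
    floor((h + 2*pad - kernel) / stride) + 1."""
    return (h + 2 * pad - kernel) // stride + 1

# 3x3 convolution with stride 1 and padding 1 preserves spatial size,
# while a 2x2 max pool with stride 2 halves it:
print(conv_out_size(224, kernel=3, stride=1, pad=1))  # 224
print(conv_out_size(224, kernel=2, stride=2, pad=0))  # 112
```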
The stack of convolutional layers is followed by two FC layers: the first has 4096 nodes, and the second has k nodes (i.e., one for each class). The second layer is essentially the softmax layer. Each hidden layer is followed by a ReLU non-linearity. The overall architecture of VGG-19 is shown in Fig. 5.
Fig. 5. Block diagram of hyperspectral band selection for face recognition based on a structurally sparsified CNN.
We propose a regularization scheme which uses an SSL technique to specify the optimal spectral bands to obtain the best face recognition performance over all the spectral bands. Our regularization method is based on a group Lasso algorithm [54] which shrinks sets of groups of weights during the training of our CNN architecture. By using this regularization method, our algorithm recognizes face images with high accuracy and, simultaneously, forces some groups of weights corresponding to redundant bands to become zero. In our framework, this goal is achieved by adding an ℓ1 norm over the ℓ2 norms of the groups as a sparsity constraint term to the total loss function of the network for face recognition. Depending on how much sparsity we want to impose on our model, we scale the sparsity term by a hyperparameter. The hyperparameter creates a balance between the face recognition loss and the sparsity constraint during the training step. If we enlarge the hyperparameter value, we impose more sparsity on our model, and if the hyperparameter is set to a value close to zero, we impose less sparsity on our model.

In our regularization framework, the hyperspectral images are directly fed to the CNN. Therefore, the depth of each convolutional filter in the first layer of the CNN is equal to the number of spectral bands, and all the weights belonging to the same channel for all the convolutional filters in the first layer construct a group of weights. As a result, the number of groups in our regularization scheme is equal to the number of spectral bands. The group Lasso regularization algorithm attempts to zero out the groups of weights that are related to the redundant bands during the training of our CNN.
Suppose that w denotes all the weights of the convolutional filters of our CNN, and w_1 denotes all the weights in the first convolutional layer. Each weight tensor in a given layer is a 4-D tensor (i.e., in \mathbb{R}^{L \times C \times P \times Q}, where L, C, P, and Q are the dimensions of the weight tensor along the axes of the filters, channels, spatial height, and width, respectively). The proposed loss function which uses SSL to train our CNN is formulated as follows:

L(w) = L_r(w) + \lambda_g R_g(w),   (3)

where L_r(.) is the loss function used for face recognition, and R_g(.) is the SSL loss term applied to the convolutional filters in the first layer. The variable λ_g is a hyperparameter used to balance the two loss terms in (3). Since the group Lasso can effectively zero out all of the weights in some groups [54], we leverage it in our total loss function to zero out groups of weights corresponding to the redundant spectral bands in the band selection process. Indeed, the total loss function in (3) consists of two terms, of which the first performs face recognition, while the second performs band selection based on the SSL. These two terms are optimized jointly during the training of the network.

In this section, we describe the loss function, L_r(w), that we have used for face recognition. We use the center loss [95] to learn a set of discriminative features for hyperspectral face images. The softmax loss typically used in a CNN only forces the CNN features of different classes to stay apart. The center loss not only does this, but also efficiently brings the CNN features of the same class close to each other. Therefore, by considering the center loss during the training of the network, not only are the inter-class feature differences enlarged, but the intra-class feature variations are also reduced.
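A numpy sketch of the combined objective of Eq. (3), with the group term formed per channel of the first-layer filter bank as described above. The recognition loss is passed in as a scalar, and the function names are our own illustration, not the chapter's implementation:

```python
import numpy as np

def group_lasso_penalty(w1):
    """Group-Lasso term: one group per spectral band (channel axis of the
    first-layer filter bank), penalized by the l2 norm of each group."""
    # w1 has shape (L, C, P, Q): filters, channels/bands, height, width
    group_norms = np.sqrt(np.sum(w1 ** 2, axis=(0, 2, 3)))  # one norm per band
    return np.sum(group_norms)

def total_loss(recognition_loss, w1, lambda_g):
    """Eq. (3): L(w) = L_r(w) + lambda_g * R_g(w)."""
    return recognition_loss + lambda_g * group_lasso_penalty(w1)
```

Because the penalty sums unsquared group norms, gradient descent can drive an entire band's group exactly toward zero rather than merely shrinking individual weights.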
The center loss function for face recognition is formulated as follows:

L_r(w) = -\sum_{i=1}^{n} \sum_{j=1}^{k} y_i^{(j)} \log(p_i^{(j)}) + \gamma \sum_{i=1}^{n} \| f(w, x_i) - c_{y_i} \|_2^2,   (4)

where n is the number of training samples, f(w, x_i) is the output of the CNN, and x_i is the i-th image in the training batch. The variable y_i is the one-hot encoding label corresponding to the sample x_i, y_i^{(j)} is the j-th element in the vector y_i, k is the number of classes, and p_i is the output of the softmax applied to the output of the CNN (i.e., f(w, x_i)). The variable c_{y_i} indicates the center of the features corresponding to the class of the i-th sample. The variable γ is a hyperparameter used to balance the two terms in the center loss.

Assume that each hyperspectral image has C spectral bands. Since, in our regularization scheme, hyperspectral images are directly fed to the CNN, the depth of each convolutional filter in the first layer of our CNN is equal to C. Here, we adopt a group Lasso to regularize the depth of each convolutional filter in the first layer of our CNN. We use the group Lasso because it can effectively zero out all of the weights in some groups [54]. Therefore, the group Lasso can zero out groups of weights which correspond to redundant spectral bands. In the setup of our group Lasso regularization, the weights belonging to the same channel for all the convolutional filters in the first layer form a group (red squares in Fig. 5) which can be removed during the training step by using the R_g(w) function defined in (3). Therefore, there are C groups in our regularization framework. The group Lasso regularization on the first-layer parameters is an ℓ1 norm over group ℓ2 norms, which can be expressed as follows:

R_g(w) = \sum_{g=1}^{C} \| w_1^{(g)} \|_2,   (5)

where w_1^{(g)} is a subset of weights (i.e., a group of weights) from w_1, and C is the total number of groups. Generally, different groups may overlap in the group Lasso regularization.
However, this does not happen in our case. The notation \|\cdot\|_2 represents an ℓ2 norm on the parameters of the group w_1^{(g)}. Therefore, the group Lasso regularization as a sparsity constraint for band selection can be expressed as follows:

R_g(w) = \sum_{c=1}^{C} \sqrt{ \sum_{l=1}^{L} \sum_{p=1}^{P} \sum_{q=1}^{Q} w(l, c, p, q)^2 },   (6)

where w(l, c, p, q) denotes a weight located in the l-th convolutional filter, c-th channel, and (p, q) spatial position. In this formulation, all of the weights w(:, c, :, :) (i.e., the weights which have the same index c) belong to the same group w_1^{(c)}. Therefore, R_g(w) is an ℓ1 regularization term in which the ℓ1 norm is applied to the ℓ2 norm of each group.

The proposed framework automatically selects the optimal bands from all spectral bands for face recognition during the training phase. For clarification, note that in a typical RGB image, we have three bands and the depth of each filter in the first convolutional layer is three. Here, however, there are C spectral bands and, as a consequence, the depth of each filter in the first layer is C. As shown in Fig. 5, hyperspectral images are fed into the CNN directly. The group Lasso efficiently removes redundant weight groups (associated with different spectral bands) to improve the recognition accuracy during the training phase. At the beginning of training, the depth of the filters is C; once we start to sparsify the depth of the convolutional filters, the depth of each filter is reduced (i.e., C' ≪ C).

It should be noted that the dashed cube in Fig. 5 is not part of our CNN architecture. It shows the structure of the convolutional filters in the first layer after several epochs of training the network using the loss function defined in (3).
Fig. 6. Face recognition accuracy of each individual band on the UWA-HSFD.
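Once training with the group-Lasso term has driven some channel groups toward zero, the surviving bands can be read off from the per-band group norms. This is an illustrative sketch; the threshold tol is our assumption, not a value given in the chapter:

```python
import numpy as np

def surviving_bands(w1, tol=1e-3):
    """A spectral band survives the sparsification only if the l2 norm of its
    weight group w1[:, c, :, :] in the first-layer filter bank is non-negligible.
    tol is an illustrative threshold (an assumption, not from the chapter)."""
    group_norms = np.sqrt(np.sum(w1 ** 2, axis=(0, 2, 3)))  # one norm per band
    return np.flatnonzero(group_norms > tol)
```

For a filter bank in which only channels 1 and 3 carry non-zero weights, this returns exactly those band indices.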
We use the VGG-19 [86] architecture as shown in Fig. 5, with the same filter sizes, pooling operations, and convolutional layers. However, the depth of the filters in the first convolutional layer of our CNN is set to the number of hyperspectral bands. The network uses filters with a receptive field of 3 × 3. We set the convolutional stride to 1 pixel. To preserve the spatial resolution after convolution, the spatial padding of the convolutional layer is fixed to 1 pixel for all the 3 × 3 convolutional layers. In this framework, each hidden layer is followed by a ReLU activation function. We apply batch normalization (i.e., shifting inputs to zero mean and unit variance) after each convolutional and fully connected layer, and before applying the ReLU activation function. Batch normalization potentially helps to achieve faster learning as well as higher overall accuracy. Furthermore, batch normalization allows us to use a higher learning rate, which potentially provides another boost in speed.

In this section, we describe how we initialize the parameters of our network for the training phase. Thousands of images are needed to train such a deep model. For this reason, we initialize the parameters of our network with a VGG-19 network pre-trained on the ImageNet dataset and then fine-tune it as a classifier using the CASIA-WebFace dataset [96]. CASIA-WebFace contains 10,575 subjects and 494,414 images. As far as we know, this is the largest publicly available face image dataset, second only to the private Facebook dataset. In our case, however, since the depth of the filters in the first layer is the number of spectral bands, we initialize these filters by duplicating the filters of the pre-trained VGG-19 network in the first convolutional layer. For example, assume that the depth of the filters in the first layer is n (i.e., we have n spectral bands). In such a case, we duplicate the filters of the first layer n times as an initialization point for training the network.
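One possible reading of this duplication-based initialization, sketched in numpy. The exact duplication scheme (tiling along the channel axis and truncating) is our assumption, since the chapter only states that the pretrained first-layer filters are duplicated:

```python
import numpy as np

def init_first_layer(pretrained_w, n_bands):
    """Hypothetical sketch: tile a pretrained first-layer filter bank of shape
    (L, 3, P, Q) along the channel axis until n_bands channels are covered,
    then truncate. This specific scheme is an assumption for illustration."""
    L, c_in, P, Q = pretrained_w.shape
    reps = -(-n_bands // c_in)                      # ceil(n_bands / c_in)
    tiled = np.tile(pretrained_w, (1, reps, 1, 1))  # (L, reps * c_in, P, Q)
    return tiled[:, :n_bands, :, :]
```

After tiling, channel c of the new filter bank is a copy of pretrained channel c mod 3, so every spectral band starts from a sensible pretrained filter.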
We use the Adam optimizer [97] with the default hyperparameter values (ε = 10^{-8}, β1 = 0.9, β2 = 0.999) to minimize the total loss function of our network defined in (3). The Adam optimizer is a robust and well-adapted optimizer that can be applied to a variety of non-convex optimization problems in the field of deep neural networks. We set the learning rate to 0.001 to minimize loss function (3) during the training process. The hyperparameter λ_g is selected by cross-validation in our experiments. We ran the CNN model for 100 epochs, although the model nearly converged after 30 epochs. The batch size in all experiments is fixed to 32. We implemented our algorithm in TensorFlow, and all experiments are conducted on two GeForce GTX TITAN X 12GB GPUs.

We performed our experiments on three standard and publicly available hyperspectral face image datasets: CMU [98], HK PolyU [18], and UWA [99]. Descriptions of these datasets are as follows:
CMU-HSFD:
The face cubes in this dataset were obtained with a spectro-polarimetric camera. The spectral wavelength range during image acquisition is from 450 nm to 1100 nm with a step size of 10 nm. The images in this dataset were collected in multiple sessions from 48 subjects.

HK PolyU-HSFD:
The face images in this dataset were obtained using an indoor system built around CRI's VariSpec Liquid Crystal Tunable Filter with a halogen light source. The spectral wavelength range during image acquisition is from 400 nm to 720 nm with a step size of 10 nm, which creates 33 bands in total. There are 300 hyperspectral face cubes captured from 24 subjects. For each subject, the hyperspectral face cubes were collected over multiple sessions spanning an average of five months.

UWA-HSFD:
Similar to the HK PolyU dataset, the face images in this dataset were acquired using an indoor imaging system made up of CRI's VariSpec Liquid Crystal Tunable Filter integrated with a Photonfocus camera. However, the camera exposure time was set and altered based on the signal-to-noise ratio of the different bands. Therefore, this dataset has the advantage of lower noise levels in comparison to the other two datasets. There are 70 subjects in this dataset, and the spectral wavelength range during image acquisition is from 400 nm to 720 nm with a step size of 10 nm. Table 1 summarizes the datasets that we have used in our experiments.

Fig. 7. Samples of hyperspectral images: (a) CMU-HSFD, (b) UWA-HSFD, (c) HK PolyU-HSFD.

Dataset    Subjects  HS Cubes  Bands  Spectral Range
CMU        48        147       65     450-1090 nm
HK PolyU   24        113       33     400-720 nm
UWA        70        120       33     400-720 nm

Table 1. A summary of the hyperspectral face datasets.

We explore the influence of the hyperparameter λ_g defined in (3) on face recognition performance. Fig. 8 shows the CMC curves for the CMU, HK PolyU, and UWA HSFD datasets with different values of λ_g. We can see that the total network loss defined in (3) is not significantly sensitive to λ_g when it is set within the tested interval.

Fig. 8. Accuracy of our model using different values of λ_g: (a) UWA-HSFD, (b) HK PolyU-HSFD, (c) CMU-HSFD.

We used the strategy presented in [95] to update the center of each class (i.e., c_{y_i} in (4)). In this strategy, first, instead of updating the centers with respect to the entire training set, we update the centers based on a mini-batch such that, in each iteration, the centers are obtained by averaging the features of the corresponding classes. Second, to prevent large perturbations caused by a few mislabeled samples, we scale the update by a small factor of 0.001 to control the learning rate of the centers, as suggested in [95].

RGB cameras produce three bands over the whole visible spectrum, whereas a hyperspectral imaging camera divides this range into many narrow bands (e.g., 10 nm wide). These two types of imaging cameras are the extreme cases of spectral resolution. Even though RGB cameras divide the visible spectrum into three bands, the bands are wide, and their center wavelengths are selected to approximate the human visual system rather than to maximize the performance of the face recognition task.

In this work, we conducted experiments to find the optimal number of bands and their center wavelengths that maximize face recognition accuracy. Our method adopts the SSL technique during the training of our CNN to automatically select the spectral bands which provide the maximum recognition accuracy. The results indicate that, for the CMU dataset, maximum discrimination power can be achieved by using a small number of bands, more than the three bands of RGB but far fewer than all the spectral bands. Specifically, the results demonstrate that the most discriminative spectral wavelengths for face recognition are obtained from a subset of red and green wavelengths.

Fig. 9. Face recognition accuracy of each individual band on the HK PolyU-HSFD.

Fig. 10. Face recognition accuracy of each individual band on the CMU-HSFD.

In addition to the improvement in face recognition accuracy, other advantages of band selection include a reduction in computational complexity, a reduction in the cost and time of image acquisition for hyperspectral cameras, and a reduction in the redundancy of the data. This is because one can capture only the bands which are more discriminative for the face recognition task instead of capturing images from the entire visible spectrum. Table 2 indicates the optimal spectral bands selected by our method from all of the bands. Our algorithm selects 4 bands including { } for the CMU dataset, 3 bands including { } for PolyU, and 3 bands including { } for the UWA dataset. The results show that SSL selects the optimal bands from the green and red spectra and ignores bands within the blue spectrum. Fig. 11 and Fig. 12 show some of the face images from the bands which are selected by our algorithm. The experimental results indicate that the blue wavelength bands are discarded earlier during the sparsification procedure because they are less discriminative and less useful compared to the green, red, and IR ranges for the task of face recognition. The group sparsity technique used in our algorithm automatically selects the optimal bands by combining the informative bands so that the selected bands carry the most discriminative information for the task of face recognition.

Fig. 11. Images of selected bands from the UWA dataset.
Fig. 12.
Images of selected bands from the CMU dataset.
Dataset    Bands
CMU        { } nm
HK PolyU   { } nm
UWA        { } nm

Table 2. Center wavelengths of the selected bands for the different hyperspectral datasets.
Fig. 6, Fig. 9, and Fig. 10 show the face recognition accuracy of each individual band on the UWA, HK PolyU, and CMU datasets, respectively. In Table 3, we report the maximum and minimum accuracy obtained over the spectral bands when each band is used individually during training. We also report the case where we use all bands, without the SSL technique, for face recognition. Finally, we provide the results of our framework in the case where we use SSL during training. The results show that using SSL not only removes the redundant spectral bands for the face recognition task, but also improves the recognition performance in comparison to the case where all the spectral bands are used for face recognition. These improvements are around 0.59%, 0.36%, and 0.32% on the CMU, HK PolyU, and UWA datasets, respectively.
Dataset    Min    Max    All the bands  SSL
CMU        96.73  98.82  99.34          99.93
HK PolyU   90.91  96.46  99.52          99.88
UWA        91.86  97.41  99.63          99.95

Table 3. Accuracy (%) of our band selection algorithm in different cases.
Methods                   CMU    PolyU  UWA

Hyperspectral
Spectral Angle [17]       .      .      .
Spectral Eigenface [5]    .      .      .
2D PCA [18]               .      .      .
3D Gabor Wavelets [19]    .      .      .

Image Set Classification
DCC [100]                 .      .      .
MMD [101]                 .      .      .
MDA [102]                 .      .      .
AHISD [103]               .      .      .
SHISD [103]               .      .      .
SANP [104]                .      .      .
CDL [105]                 .      .      .
PLS [3]                   .      .      .
PLS* [3]                  .      .      .

Grayscale and RGB
SRC [106]                 .      .      .
CRC [107]                 .      .      .
LCVBP+RLDA [108]          .      .      .

CNN-Based Models
S-CNN [51]                98.8   97.2   -
S-CNN+SVM [51]            99.2   99.3   -
S-CNN+SVM* [51]           99.4   99.6   -
Deep-Baseline             99.3   99.5   99.6
Deep-SSL

Table 4. Comparing the accuracy (%) of different band selection methods with our proposed method.
We compared our proposed algorithm with several existing face recognition techniques that have been extended to hyperspectral face recognition. We categorize these methods into four groups: four existing hyperspectral face recognition methods [17,5,18,19], eight image-set classification methods [100,101,102,103,104,105,3], three RGB/grayscale face recognition algorithms [106,107,108], and one existing CNN-based model for hyperspectral face recognition [51]. For a fair comparison, our experimental setup is consistent with the compared methods, including the number of images in the gallery and probe data. Specifically, for the PolyU-HSFD dataset, we use the first 24 subjects, which contain 113 hyperspectral image cubes. For each subject, we randomly select two cubes for the gallery, and we use the remaining 63 cubes as probes. The CMU-HSFD dataset includes 48 subjects, each with 4 to 20 cubes obtained in different sessions and under different lighting conditions. We use only the cubes which were obtained with all lights turned on. Thus, there are 147 hyperspectral cubes of 48 subjects, such that each subject has 1 to 5 cubes. We construct the gallery by randomly selecting one cube per subject, and we use the remaining 99 cubes as probes. For the UWA-HSFD dataset, we randomly select one cube for each of the 70 subjects to construct the gallery, and we use the remaining 50 cubes as probes.

Table 4 indicates the average accuracy of the compared methods when all bands are available to the different algorithms during face recognition. Deep-Baseline is the case where we use all the bands in our CNN framework for face recognition; in this case, we turn off the SSL regularization term in (3). Deep-SSL is the case where we perform face recognition using the SSL regularization term. We report the face recognition accuracy of Deep-SSL in Table 4 to compare it with the best recognition results reported in the literature. The results show that Deep-SSL outperforms the state-of-the-art methods, including the PLS* and S-CNN+SVM* methods. The symbol * represents the case where an algorithm performs face recognition using its optimal hyperspectral bands.

Please email us if you want to receive the data and the source code of our proposed algorithm presented in this book chapter.
10 Conclusion
In this work, we proposed a CNN-based model which uses an SSL technique to select the optimal spectral bands to obtain the best face recognition performance from all the spectral bands. In this method, the convolutional filters in the first layer of our CNN are regularized using a group Lasso algorithm to remove the redundant bands during training. Experimental results indicate that our method automatically selects the optimal bands and obtains better face recognition performance than that achieved using conventional broad-band (RGB) face images. Moreover, the results indicate that our model outperforms existing methods which also perform band selection for face recognition.

References
1. D. W. Allen, “An overview of spectral imaging of human skin toward face recognition,” in
Face Recognition Across the Imaging Spectrum , pp. 1–19, Springer, 2016.2. Z. Pan, G. Healey, M. Prasad, and B. Tromberg, “Face recognition in hyperspectral images,”
IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 25, no. 12, pp. 1552–1560, 2003.3. M. Uzair, A. Mahmood, and A. Mian, “Hyperspectral face recognition with spatiospectralinformation fusion and PLS regression,”
IEEE Transactions on Image Processing , vol. 24,no. 3, pp. 1127–1137, 2015.4. F. A. Kruse et al. , “Comparison of AVIRIS and Hyperion for hyperspectral mineral map-ping,” in , vol. 4, 2002.5. Z. Pan, G. Healey, and B. Tromberg, “Comparison of spectral-only and spectral/spatial facerecognition for personal identity verification,”
EURASIP journal on Advances in SignalProcessing , vol. 2009, p. 8, 2009.6. D. M. Ryer, T. J. Bihl, K. W. Bauer, and S. K. Rogers, “Quest hierarchy for hyperspectralface recognition,”
Advances in Artificial Intelligence , vol. 2012, p. 1, 2012.7. R. Gross, J. Shi, and J. F. Cohn,
Quo vadis face recognition?
Carnegie Mellon University,The Robotics Institute, 2001.8. R. Gross, I. Matthews, and S. Baker, “Appearance-based face recognition and light-fields,”
IEEE Transactions on Pattern Analysis and Machine Intelligence , vol. 26, no. 4, pp. 449–465, 2004.9. A. M. Mart´ınez, “Recognizing imprecisely localized, partially occluded, and expressionvariant faces from a single sample per class,”
IEEE Transactions on Pattern Analysis &Machine Intelligence , no. 6, pp. 748–763, 2002.10. J. Wilder, P. J. Phillips, C. Jiang, and S. Wiener, “Comparison of visible and infra-red im-agery for face recognition,” in
Proceedings of the Second International Conference on Au-tomatic Face and Gesture Recognition , pp. 182–187, IEEE, 1996.11. V. Blanz, S. Romdhani, and T. Vetter, “Face identification across different poses and illu-minations with a 3D morphable model,” in
Proceedings of Fifth IEEE International Con-ference on Automatic Face Gesture Recognition , pp. 202–207, IEEE, 2002.12. R. R. Anderson and J. A. Parrish, “The optics of human skin,”
Journal of investigativedermatology , vol. 77, no. 1, pp. 13–19, 1981.13. E. A. Edwards and S. Q. Duntley, “The pigments and color of living human skin,”
AmericanJournal of Anatomy , vol. 65, no. 1, pp. 1–33, 1939.14. N. Tsumura, H. Haneishi, and Y. Miyake, “Independent-component analysis of skin colorimage,”
JOSA A , vol. 16, no. 9, pp. 2169–2176, 1999.15. E. Angelopoulo, R. Molana, and K. Daniilidis, “Multispectral skin color modeling,” in
Pro-ceedings of the 2001 IEEE Computer Society Conference on Computer Vision and PatternRecognition. CVPR 2001 , vol. 2, pp. II–II, IEEE, 2001.16. N. A. Hagen and M. W. Kudenov, “Review of snapshot spectral imaging technologies,”
Optical Engineering , vol. 52, no. 9, p. 090901, 2013.17. S. A. Robila, “Toward hyperspectral face recognition,” in
Image Processing: Algorithmsand Systems VI , vol. 6812, p. 68120X, International Society for Optics and Photonics, 2008.18. W. Di, L. Zhang, D. Zhang, and Q. Pan, “Studies on hyperspectral face recognition invisible spectrum with feature band selection,”
IEEE Transactions on Systems, Man, andCybernetics-Part A: Systems and Humans , vol. 40, no. 6, pp. 1354–1361, 2010.19. L. Shen and S. Zheng, “Hyperspectral face recognition using 3D Gabor wavelets,” in
Pat-tern Recognition (ICPR), 2012 21st International Conference on , pp. 1574–1577, IEEE,2012.ecture Notes in Computer Science: Authors’ Instructions 2720. P. Bajcsy and P. Groves, “Methodology for hyperspectral band selection,”
Photogrammetric Engineering & Remote Sensing, vol. 70, no. 7, pp. 793–802, 2004.
21. C.-I. Chang, Q. Du, T.-L. Sun, and M. L. Althouse, "A joint band prioritization and band-decorrelation approach to band selection for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 37, no. 6, pp. 2631–2641, 1999.
22. F. Melgani and L. Bruzzone, "Classification of hyperspectral remote sensing images with support vector machines," IEEE Transactions on Geoscience and Remote Sensing, vol. 42, no. 8, pp. 1778–1790, 2004.
23. N. Keshava, "Best bands selection for detection in hyperspectral processing," in Acoustics, Speech, and Signal Processing (ICASSP '01), 2001 IEEE International Conference on, vol. 5, pp. 3149–3152, IEEE, 2001.
24. Q. Du, "Band selection and its impact on target detection and classification in hyperspectral image analysis," in Advances in Techniques for Analysis of Remotely Sensed Data, 2003 IEEE Workshop on, pp. 374–377, IEEE, 2003.
25. S. Kaewpijit, J. Le Moigne, and T. El-Ghazawi, "Automatic reduction of hyperspectral imagery using wavelet spectral analysis," IEEE Transactions on Geoscience and Remote Sensing, vol. 41, no. 4, pp. 863–871, 2003.
26. J. C. Price, "Spectral band selection for visible-near infrared remote sensing: spectral-spatial resolution tradeoffs,"
IEEE Transactions on Geoscience and Remote Sensing, vol. 35, no. 5, pp. 1277–1285, 1997.
27. H. Steiner, A. Kolb, and N. Jung, "Reliable face anti-spoofing using multispectral SWIR imaging," in Biometrics (ICB), 2016 International Conference on, pp. 1–8, IEEE, 2016.
28. H. J. Bouchech, S. Foufou, and M. Abidi, "Dynamic best spectral bands selection for face recognition," in Information Sciences and Systems (CISS), 2014 48th Annual Conference on, pp. 1–6, IEEE, 2014.
29. F. Taherkhani and M. Jamzad, "Restoring highly corrupted images by impulse noise using radial basis functions interpolation," IET Image Processing, vol. 12, no. 1, pp. 20–30, 2017.
30. Y. Chen, N. M. Nasrabadi, and T. D. Tran, "Hyperspectral image classification via kernel sparse representation," IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 1, pp. 217–231, 2013.
31. V. Talreja, F. Taherkhani, M. C. Valenti, and N. M. Nasrabadi, "Attribute-guided coupled GAN for cross-resolution face recognition," arXiv preprint arXiv:1908.01790, 2019.
32. F. Taherkhani, V. Talreja, H. Kazemi, and N. Nasrabadi, "Facial attribute guided deep cross-modal hashing for face image retrieval," pp. 1–6, IEEE, 2018.
33. V. Talreja, F. Taherkhani, M. C. Valenti, and N. M. Nasrabadi, "Using deep cross modal hashing and error correcting codes for improving the efficiency of attribute guided facial image retrieval," pp. 564–568, IEEE, 2018.
34. F. Taherkhani, H. Kazemi, and N. M. Nasrabadi, "Matrix completion for graph-based deep semi-supervised learning," in
Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
35. H. Kazemi, S. Soleymani, F. Taherkhani, S. Iranmanesh, and N. Nasrabadi, "Unsupervised image-to-image translation using domain-specific variational information bound," in Advances in Neural Information Processing Systems, pp. 10369–10379, 2018.
36. H. Kazemi, F. Taherkhani, and N. M. Nasrabadi, "Unsupervised facial geometry learning for sketch to photo synthesis," pp. 1–5, IEEE, 2018.
37. V. Talreja, M. C. Valenti, and N. M. Nasrabadi, "Multibiometric secure system based on deep learning," pp. 298–302, IEEE, 2017.
38. V. Talreja, T. Ferrett, M. C. Valenti, and A. Ross, "Biometrics-as-a-service: A framework to promote innovative biometric recognition in the cloud," pp. 1–6, IEEE, 2018.
39. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, June 2016.
40. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Advances in Neural Information Processing Systems, pp. 1097–1105, Dec. 2012.
41. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, Sept. 2014.
42. D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, "Scalable object detection using deep neural networks," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, June 2014.
43. S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Advances in Neural Information Processing Systems, pp. 91–99, Dec. 2015.
44. S. Soleymani, A. Dabouei, J. Dawson, and N. M. Nasrabadi, "Defending against adversarial iris examples using wavelet decomposition," arXiv preprint arXiv:1908.03176, 2019.
45. S. Soleymani, A. Dabouei, S. M. Iranmanesh, H. Kazemi, J. Dawson, and N. M. Nasrabadi, "Prosodic-enhanced Siamese convolutional neural networks for cross-device text-independent speaker verification," pp. 1–7, IEEE, 2018.
46. S. Soleymani, A. Dabouei, J. Dawson, and N. M. Nasrabadi, "Adversarial examples to fool iris recognition systems," arXiv preprint arXiv:1906.09300, 2019.
47. D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search,"
Nature, vol. 529, p. 484, Jan. 2016.
48. Z. Zhong, J. Li, L. Ma, H. Jiang, and H. Zhao, "Deep residual networks for hyperspectral image classification," in Geoscience and Remote Sensing Symposium (IGARSS), 2017 IEEE International, pp. 1824–1827, IEEE, 2017.
49. Y. Zhan, D. Hu, H. Xing, and X. Yu, "Hyperspectral band selection based on deep convolutional neural network and distance density," IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 12, pp. 2365–2369, 2017.
50. Y. Zhan, H. Tian, W. Liu, Z. Yang, K. Wu, G. Wang, P. Chen, and X. Yu, "A new hyperspectral band selection approach based on convolutional neural network," in Geoscience and Remote Sensing Symposium (IGARSS), 2017 IEEE International, pp. 3660–3663, IEEE, 2017.
51. V. Sharma, A. Diba, T. Tuytelaars, and L. Van Gool, "Hyperspectral CNN for image classification & band selection, with application to face recognition," 2016.
52. N. Li, C. Wang, H. Zhao, X. Gong, and D. Wang, "A novel deep convolutional neural network for spectral-spatial classification of hyperspectral data," International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences, vol. 42, no. 3, 2018.
53. Z. Guo, D. Zhang, L. Zhang, and W. Liu, "Feature band selection for online multispectral palmprint recognition," IEEE Transactions on Information Forensics and Security, vol. 7, no. 3, pp. 1094–1099, 2012.
54. M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables,"
Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.
55. S. J. Hanson and L. Y. Pratt, "Comparing biases for minimal network construction with back-propagation," in Advances in Neural Information Processing Systems, pp. 177–185, 1989.
56. B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Advances in Neural Information Processing Systems, pp. 164–171, 1993.
57. S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.
58. W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.
59. H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient ConvNets," arXiv preprint arXiv:1608.08710, 2016.
60. K. Murray and D. Chiang, "Auto-sizing neural networks: With applications to n-gram language models," arXiv preprint arXiv:1508.05051, 2015.
61. J. Feng and T. Darrell, "Learning the structure of deep convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2749–2757, 2015.
62. T. L. Griffiths and Z. Ghahramani, "The Indian buffet process: An introduction and review," Journal of Machine Learning Research, vol. 12, pp. 1185–1224, 2011.
63. H. Hu, R. Peng, Y.-W. Tai, and C.-K. Tang, "Network trimming: A data-driven neuron pruning approach towards efficient deep architectures," arXiv preprint arXiv:1607.03250, 2016.
64. S. Anwar, K. Hwang, and W. Sung, "Structured pruning of deep convolutional neural networks,"
ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 13, no. 3, p. 32, 2017.
65. Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal brain damage," in Advances in Neural Information Processing Systems, pp. 598–605, 1990.
66. D. C. Cireşan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, "High-performance neural networks for visual object classification," arXiv preprint arXiv:1102.0183, 2011.
67. Y. Ioannou, D. Robertson, R. Cipolla, and A. Criminisi, "Deep roots: Improving CNN efficiency with hierarchical filter groups," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1231–1240, 2017.
68. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
69. S. Arora, A. Bhaskara, R. Ge, and T. Ma, "Provable bounds for learning some deep representations," in International Conference on Machine Learning, pp. 584–592, 2014.
70. Y. Gong, L. Liu, M. Yang, and L. Bourdev, "Compressing deep convolutional networks using vector quantization," arXiv preprint arXiv:1412.6115, 2014.
71. S. Anwar, K. Hwang, and W. Sung, "Fixed point optimization of deep convolutional neural networks for object recognition," pp. 1131–1135, IEEE, 2015.
72. W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, "Compressing neural networks with the hashing trick," in International Conference on Machine Learning, pp. 2285–2294, 2015.
73. S. Han, J. Pool, S. Narang, H. Mao, S. Tang, E. Elsen, B. Catanzaro, J. Tran, and W. J. Dally, "DSD: Regularizing deep neural networks with dense-sparse-dense training flow," arXiv preprint arXiv:1607.04381, vol. 3, no. 6, 2016.
74. E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting linear structure within convolutional networks for efficient evaluation," in Advances in Neural Information Processing Systems, pp. 1269–1277, 2014.
75. M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," arXiv preprint arXiv:1405.3866, 2014.
76. V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, "Speeding-up convolutional neural networks using fine-tuned CP-decomposition," arXiv preprint arXiv:1412.6553, 2014.
77. X. Zhang, J. Zou, K. He, and J. Sun, "Accelerating very deep convolutional networks for classification and detection,"
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 1943–1955, 2016.
78. Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, "Compression of deep convolutional neural networks for fast and low power mobile applications," arXiv preprint arXiv:1511.06530, 2015.
79. L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus, "Regularization of neural networks using DropConnect," in International Conference on Machine Learning, pp. 1058–1066.
80. X. Jin, X. Yuan, J. Feng, and S. Yan, "Training skinny deep neural networks with iterative hard thresholding methods," arXiv preprint arXiv:1607.05423, 2016.
81. J. Fan, W. Xu, Y. Wu, and Y. Gong, "Human tracking using convolutional neural networks," IEEE Transactions on Neural Networks, vol. 21, no. 10, pp. 1610–1623, 2010.
82. A. Toshev and C. Szegedy, "DeepPose: Human pose estimation via deep neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1653–1660, 2014.
83. R. Zhao, W. Ouyang, H. Li, and X. Wang, "Saliency detection by multi-context deep learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1265–1274, 2015.
84. J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in International Conference on Machine Learning, pp. 647–655, 2014.
85. Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Transactions on Neural Networks and Learning Systems, 2019.
86. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
87. M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in
European Conference on Computer Vision, pp. 818–833, Springer, 2014.
88. M. D. Zeiler and R. Fergus, "Stochastic pooling for regularization of deep convolutional neural networks," arXiv preprint arXiv:1301.3557, 2013.
89. O. Rippel, J. Snoek, and R. P. Adams, "Spectral representations for convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 2449–2457, 2015.
90. A. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 427–436, 2015.
91. Y. Gong, L. Wang, R. Guo, and S. Lazebnik, "Multi-scale orderless pooling of deep convolutional activation features," in European Conference on Computer Vision, pp. 392–407, Springer, 2014.
92. J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, "Striving for simplicity: The all convolutional net," arXiv preprint arXiv:1412.6806, 2014.
93. S. Zagoruyko and N. Komodakis, "Wide residual networks," arXiv preprint arXiv:1605.07146, 2016.
94. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
95. Y. Wen, K. Zhang, Z. Li, and Y. Qiao, "A discriminative feature learning approach for deep face recognition," in
European Conference on Computer Vision, pp. 499–515, Springer, 2016.
96. D. Yi, Z. Lei, S. Liao, and S. Z. Li, "Learning face representation from scratch," arXiv preprint arXiv:1411.7923, 2014.
97. D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
98. L. J. Denes, P. Metes, and Y. Liu, Hyperspectral Face Database. Carnegie Mellon University, The Robotics Institute, 2002.
99. M. Uzair, A. Mahmood, and A. S. Mian, "Hyperspectral face recognition using 3D-DCT and partial least squares," in BMVC, 2013.
100. T.-K. Kim, J. Kittler, and R. Cipolla, "Discriminative learning and recognition of image set classes using canonical correlations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 1005–1018, 2007.
101. R. Wang, S. Shan, X. Chen, and W. Gao, "Manifold-manifold distance with application to face recognition based on image set," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1–8, IEEE, 2008.
102. R. Wang and X. Chen, "Manifold discriminant analysis," in
Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 429–436, IEEE, 2009.
103. H. Cevikalp and B. Triggs, "Face recognition based on image sets," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 2567–2573, IEEE, 2010.
104. Y. Hu, A. S. Mian, and R. Owens, "Face recognition using sparse approximated nearest points between image sets," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 10, pp. 1992–2004, 2012.
105. R. Wang, H. Guo, L. S. Davis, and Q. Dai, "Covariance discriminative learning: A natural and efficient approach to image set classification," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 2496–2503, IEEE, 2012.
106. J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, 2009.
107. L. Zhang, M. Yang, and X. Feng, "Sparse representation or collaborative representation: Which helps face recognition?," in Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 471–478, IEEE, 2011.
108. S. H. Lee, J. Y. Choi, Y. M. Ro, and K. N. Plataniotis, "Local color vector binary patterns from multichannel face images for face recognition,"