Average Biased ReLU Based CNN Descriptor for Improved Face Retrieval
Shiv Ram Dubey, Member, IEEE, and Soumendu Chakraborty, Member, IEEE
Abstract—Convolutional neural networks (CNN) such as AlexNet, GoogLeNet and VGGNet have proven to be very discriminative feature descriptors for many computer vision problems. A CNN model trained over one dataset performs reasonably well over another dataset of a similar type and outperforms hand-designed feature descriptors. The Rectified Linear Unit (ReLU) layer discards some information in order to introduce non-linearity. In this paper, it is proposed that the discriminative ability of the deep image representation using a trained model can be improved by the Average Biased ReLU (AB-ReLU) at the last few layers. AB-ReLU improves the discriminative ability in two ways: 1) it exploits some of the discriminative negative information that ReLU discards, and 2) it suppresses the irrelevant positive information that ReLU passes. The VGGFace model, already trained in MatConvNet over the VGG-Face dataset, is used as the feature descriptor for face retrieval over other face datasets. The proposed approach is tested over six challenging, unconstrained and robust face datasets (PubFig, LFW, PaSC, AR, etc.) in a retrieval framework. It is observed that AB-ReLU consistently performs better than ReLU when used with the pre-trained VGGFace model over the face datasets.
I. INTRODUCTION
Image descriptors are the fundamental signature for image matching. Most of the early research focused on the design of hand-crafted descriptors such as the Scale Invariant Feature Transform (SIFT) [1] and the Local Binary Pattern (LBP) [2]. Hand-designed descriptors have shown very promising performance in several computer vision problems such as image matching [3], face recognition [4], [5], image retrieval [6], texture classification [7], [8], [9], [10], [11], biomedical image analysis [12], [13], [14], and object detection [15], [16]. Several descriptors have also been proposed for face retrieval, such as [5], [17], [18], [19], [20], [21]. The main drawback of hand-designed descriptors is their limited discriminative power, which stems from their data-independent nature. Over the last few years, deep convolutional neural networks have attracted the full attention of researchers in the computer vision community. The first remarkable work was AlexNet, introduced by Krizhevsky et al. in 2012 [22] for the ImageNet classification task [23]. After AlexNet, several CNN models were proposed for ImageNet classification, such as VGGNet [24], GoogLeNet [25] and ResNet [26]. Over time, the networks became deeper and deeper: from AlexNet (8 layers) to VGGNet (16 and 19 layers) to GoogLeNet (22 layers) to ResNet (152 layers).
S. R. Dubey is with the Computer Vision Group, Indian Institute of Information Technology (IIIT), Sri City, Andhra Pradesh, India (Email: [email protected]). S. Chakraborty is with the Indian Institute of Information Technology (IIIT), Lucknow, Uttar Pradesh, India (Email: [email protected]).
Deep neural networks have also been proposed for the face recognition task. Some recent and renowned deep learning based approaches are DeepFace [27], FaceNet [28], VGGFace [29], Bilinear CNN (BCNN) [30], Deep CNN (DCNN) [31] and All-in-One CNN [32], among others. DeepFace used a nine-layer deep neural network for face representation [27]. The number of parameters in DeepFace is very high, as it does not use weight sharing. DeepFace reported an accuracy of 97.35% on the Labeled Faces in the Wild (LFW) database [27], [33]. FaceNet was also proposed as a feature extractor for face recognition and clustering [28]. It uses a deep convolutional network as the feature embedding and reported 99.63% accuracy over the LFW face database. VGGFace utilized convolutional neural network (CNN) based end-to-end learning for face recognition [29]. It is trained over the very large scale VGGFace database with 2.6M images from 2.6K subjects. RoyChowdhury et al. used the Bilinear CNN (BCNN) [34] for the face recognition task [30]. They converted the standard pre-trained VGGFace model into a BCNN without any extra training cost and reported 89.5% rank-1 recall over the IJB-A benchmark [30], [35]. The DCNN has 18 layers, consisting of 10 convolution layers, 5 pooling layers, 1 dropout layer, 1 fully connected layer, and 1 softmax layer [31]. It is trained over the CASIA-WebFace dataset and evaluated over the IJB-A (97.70% rank-10 accuracy) and LFW (97.45% accuracy) datasets [31], [35], [33]. Very recently, Ranjan et al. proposed the All-in-One CNN for facial analysis [32]. It is a multi-purpose network tackling face detection, face alignment, pose estimation, gender recognition, smile detection, age estimation and face recognition through a single network. The All-in-One CNN utilizes a multi-task learning framework by regularizing the shared parameters of the CNN [32]. In this work, the VGGFace model is used as the feature extractor for the face retrieval experiments.

Pre-trained models have also been used for several other computer vision tasks. Marmanis et al. used a pre-trained CNN model (trained over the ImageNet database) as the initial feature extractor for an Earth observation classification task [36]. They observed 92.4% accuracy over the UC Merced Land Use benchmark, which is far better than hand-designed approaches [36]. Liu et al. fused CNN features with hand-designed features for content-based image retrieval [37]. It has also been reported that if a pre-trained CNN model trained over photos is directly applied at a more abstract level, such as sketches, the performance degrades drastically [38]. Very recently, Bansal et al. showed that a network trained over still face images can also be used effectively for face verification in videos [39]. Pre-trained CNN models over the ImageNet database have also been successfully applied to medical imaging for mammogram analysis [40]. Schwarz et al. used pre-trained CNN features for RGB-D object recognition and pose estimation [41]. A CNN trained over an image classification database has also shown promising performance for event detection in videos [42]. Karpathy and Fei-Fei used a CNN pre-trained on ImageNet [23] for generating sentences from images [43]. A trained CNN model was fine-tuned for cross-scene crowd counting by Zhang et al. [44]. Pre-trained CNN models have also been used for content-based image retrieval [45].
Some researchers have also adopted transfer learning to utilize a network trained on one domain in some other domain, such as Deep transfer [46] and Residual transfer [47]. Very recently, Ge et al. used pre-trained VGG convolutional neural networks for remote-sensing image retrieval [48]. In this paper also, a pre-trained network is used for the face retrieval task.

Some researchers have also focused on different layers of the CNN model. Wen et al. used the center loss function instead of the softmax loss function for face recognition [49]. The ReLU discards the negative values, which actually represent the absence of events and might be useful to improve the discriminative ability. In order to retain the negative values discarded by ReLU, a Rectified Factor Network is introduced in [50]. A Parametric Rectified Linear Unit (PReLU) is used by He et al. as a generalization of the Rectified Linear Unit (ReLU) by turning the slope of the negative region into a parameter of each neuron [51]. The ReLU also has the "dying gradient" problem, where the gradient flow through a unit can remain zero forever [22]. Leaky ReLU (LReLU) tries to fix the dying gradient problem during training by considering a small negative slope [52]. The LReLU is extended to the randomized leaky rectified linear unit (RReLU) by drawing the small negative slope at random [53]. An exponential linear unit (ELU) is proposed by Clevert et al., which also considers the ReLU's negative values [54]. Most of the existing rectifier units do not consider the negative values, which might be important. These rectifier units are also not dependent upon the input data. In this paper, a new data dependent rectifier unit is proposed to boost the discriminative power of the VGGFace descriptor at testing time.

The main contributions of this paper are as follows:
• The suitability of using a pre-trained CNN model over other databases of a similar type is explored.
• A new data dependent Average Biased Rectified Linear Unit (AB-ReLU) is proposed to boost the discriminative power of the pre-trained network at testing time.
• The suitability of the proposed AB-ReLU is tested at different layers of the network.
• The image retrieval experiments are conducted over six challenging face datasets.

The rest of the paper is organized as follows: Section 2 reviews the VGGFace model and the rectified linear unit; Section 3 proposes a new data dependent rectified linear unit and the modified VGGFace descriptor; Section 4 presents the experimental setup; Section 5 presents the results and discussion; and finally Section 6 sets out the concluding remarks.
TABLE I: VGGFace Layer Description. In the Filter column, f, s and p represent the filter size, stride and padding, respectively. In the Volume Size column, the first value is the spatial dimension of the volume and the second value is the depth of the volume, i.e., 224,3 represents a volume of size 224 × 224 × 3. The last fully connected layer and the softmax layer are not shown because the output of 'relu7' is considered as the 4096-dimensional feature vector in this work.
No. | Layer Name | Layer Type | Filter                  | Volume Size
0   | input      | Image      | n/a                     | 224,3
1   | conv1_1    | Conv       | f:3,3,64, s:1, p:1      | 224,64
2   | relu1_1    | Relu       | n/a                     | 224,64
3   | conv1_2    | Conv       | f:3,64,64, s:1, p:1     | 224,64
4   | relu1_2    | Relu       | n/a                     | 224,64
5   | pool1      | Pool       | f:2, s:2, p:0           | 112,64
6   | conv2_1    | Conv       | f:3,64,128, s:1, p:1    | 112,128
7   | relu2_1    | Relu       | n/a                     | 112,128
8   | conv2_2    | Conv       | f:3,128,128, s:1, p:1   | 112,128
9   | relu2_2    | Relu       | n/a                     | 112,128
10  | pool2      | Pool       | f:2, s:2, p:0           | 56,128
11  | conv3_1    | Conv       | f:3,128,256, s:1, p:1   | 56,256
12  | relu3_1    | Relu       | n/a                     | 56,256
13  | conv3_2    | Conv       | f:3,256,256, s:1, p:1   | 56,256
14  | relu3_2    | Relu       | n/a                     | 56,256
15  | conv3_3    | Conv       | f:3,256,256, s:1, p:1   | 56,256
16  | relu3_3    | Relu       | n/a                     | 56,256
17  | pool3      | Pool       | f:2, s:2, p:0           | 28,256
18  | conv4_1    | Conv       | f:3,256,512, s:1, p:1   | 28,512
19  | relu4_1    | Relu       | n/a                     | 28,512
20  | conv4_2    | Conv       | f:3,512,512, s:1, p:1   | 28,512
21  | relu4_2    | Relu       | n/a                     | 28,512
22  | conv4_3    | Conv       | f:3,512,512, s:1, p:1   | 28,512
23  | relu4_3    | Relu       | n/a                     | 28,512
24  | pool4      | Pool       | f:2, s:2, p:0           | 14,512
25  | conv5_1    | Conv       | f:3,512,512, s:1, p:1   | 14,512
26  | relu5_1    | Relu       | n/a                     | 14,512
27  | conv5_2    | Conv       | f:3,512,512, s:1, p:1   | 14,512
28  | relu5_2    | Relu       | n/a                     | 14,512
29  | conv5_3    | Conv       | f:3,512,512, s:1, p:1   | 14,512
30  | relu5_3    | Relu       | n/a                     | 14,512
31  | pool5      | Pool       | f:2, s:2, p:0           | 7,512
32  | fc6        | Conv       | f:7,512,4096, s:1, p:0  | 1,4096
33  | relu6      | Relu       | n/a                     | 1,4096
34  | fc7        | Conv       | f:1,4096,4096, s:1, p:0 | 1,4096
35  | relu7      | Relu       | n/a                     | 1,4096

II. RELATED WORKS
In this section, first the original VGGFace model used in this work is described in detail, and then the original rectified linear unit is presented.
A. VGGFace Model
In this work, the original pre-trained VGGFace model is taken from the MatConvNet library [55], released by the Visual Geometry Group of the University of Oxford (model: http://www.robots.ox.ac.uk/~vgg/software/vgg_face/; dataset: http://www.robots.ox.ac.uk/~vgg/data/vgg_face/). This model is based on the CNN implementation of the VGG-Very-Deep-16 architecture as described in [29]. It is trained over the VGGFace database, which consists of 2.6M face images from 2,622 subjects.
Fig. 1: The original rectified linear unit (ReLU) function [23]. All negative input values are converted to zero, whereas all positive input values are passed as they are.

The layers of the VGGFace model are summarized in Table I. In this table, the last fully connected layer and the softmax layer of VGGFace are not listed, as they are not required in this work. The output of 'relu7' is considered as the VGGFace feature descriptor. The filter size, stride and padding are given in the Filter column with fields f, s and p, respectively. A filter size f:3,128,256 means a total of 256 filters of dimension 3 × 3 with depth 128, and a volume size 112,64 means a volume of spatial size 112 × 112 with depth 64. In this work, changes are made in selected rectified linear unit (ReLU) layers, especially in the last few layers, as described in the next section.
B. Rectified Linear Unit
The rectified linear unit (ReLU) in a neural network is used to introduce non-linearity [22]. The ReLU simply works like a filter: it suppresses the negative signals and passes the positive signals. Consider $I_v^n$ as the input volume to the ReLU at the $n$th layer of a network and $I_v^{n+1}$ as the output volume of the ReLU for the $(n+1)$th layer. Suppose the input volume $I_v^n$ is $d$-dimensional and $D_k$ is the size of the input volume in the $k$th dimension $\forall k \in [1, d]$. Then, an element at position $\rho = (\rho_1, \rho_2, \cdots, \rho_d)$ of the output volume $I_v^{n+1}$ is computed from the corresponding element of the input volume $I_v^n$ as follows,

$$I_v^{n+1}(\rho) = \begin{cases} I_v^n(\rho), & \text{if } I_v^n(\rho) > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

where $\rho$ is $d$-dimensional, $D_k$ is the size of $I_v^n$ in the $k$th dimension, and $\rho_k \in [1, D_k]$ $\forall k \in [1, d]$. The ReLU function is illustrated in Fig. 1. It is linear in the positive range and zero in the negative range. The main drawback of ReLU is that it passes all positive values, even those that might not be important, and blocks all negative values, even those that might be important. This problem is addressed in the next section by introducing a data dependent ReLU.
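For concreteness, Eq. (1) can be applied to the whole input volume in one vectorized operation. The following NumPy sketch (an illustration only, not the paper's MatConvNet code) applies it element-wise to an array of any dimensionality:

```python
import numpy as np

def relu(x):
    """Standard ReLU, Eq. (1): pass positive responses, zero out the rest."""
    return np.maximum(x, 0.0)
```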
III. PROPOSED FACE DESCRIPTOR

In this section, first a data dependent average biased rectified linear unit (AB-ReLU) is proposed; then it is applied with the existing pre-trained VGGFace model [29] to create a more discriminative face descriptor; and finally the AB-ReLU based VGGFace descriptor is used for face retrieval.

Fig. 2: The average biased rectified linear unit (AB-ReLU) function: (a) AB-ReLU when $A_v^n < 0$; (b) AB-ReLU when $A_v^n \ge 0$. Here, $\beta$ represents the average biased factor. The effective bias is positive in (a) because $\beta$ is negative, whereas the effective bias is negative in (b) because $\beta$ is positive. A negative $\beta$ indicates that some negative values are also important, whereas a positive $\beta$ indicates that not all positive values are important.

A. Average Biased Rectified Linear Unit
It can be noticed from the ReLU in the previous section that it is not data dependent: it removes all negative signals and passes all positive signals, which can lead to less discriminative features. In this section, this problem is resolved by introducing a new data dependent ReLU, named the average biased rectified linear unit (AB-ReLU). The AB-ReLU is data dependent through the average of the input volume. It also works like a filter and passes only those signals which satisfy the average biased criterion. The average biased criterion ensures that only important features get passed, irrespective of their sign. Suppose AB-ReLU is used in a network at the $n$th layer, and $I_v^n$ and $I_v^{n+1}$ are the input volume and output volume for this layer, respectively. Then, the $\rho$th element of the output volume $I_v^{n+1}$ is given by the following equation,

$$I_v^{n+1}(\rho) = \begin{cases} I_v^n(\rho) - \beta, & \text{if } I_v^n(\rho) - \beta > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

where $\rho = (\rho_1, \rho_2, \cdots, \rho_d)$ represents the position of an element, $d$ is the dimension of $I_v^n$, $D_k$ is the size of $I_v^n$ in the $k$th dimension, $\rho_k \in [1, D_k]$ $\forall k \in [1, d]$, and $\beta$ is the average biased factor defined as follows,

$$\beta = \alpha \times A_v^n \qquad (3)$$

where $\alpha$ is a parameter to be set empirically and $A_v^n$ is the average of the input volume, computed as follows,

$$A_v^n = \frac{\sum_{\rho_1=1}^{D_1} \sum_{\rho_2=1}^{D_2} \cdots \sum_{\rho_d=1}^{D_d} I_v^n(\rho_1, \rho_2, \cdots, \rho_d)}{D_1 \times D_2 \times \cdots \times D_d} \qquad (4)$$

The AB-ReLU leads to two behaviors, i.e., a positive AB-ReLU and a negative AB-ReLU, based upon the input data. This behavior is illustrated in Fig. 2, where Fig. 2a shows the positive AB-ReLU function and Fig. 2b depicts the negative AB-ReLU function. The positive AB-ReLU corresponds to the positively biased scenario, where the input data volume has a negative majority, i.e., $A_v^n < 0$; it allows some prominent negative signals to pass, converting them into positive signals through the addition of the magnitude of the average biased factor $\beta$. Similarly, if the input data volume has a positive majority, i.e., $A_v^n \ge 0$, then AB-ReLU blocks even some inferior positive signals, along with all negative signals, by subtracting the average biased factor $\beta$. The default value of $\alpha$ is set to 1. In the next subsection, AB-ReLU is used to construct the descriptor.
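The following NumPy sketch summarizes Eqs. (2)-(4); it is an illustration of the definition above, not the authors' MatConvNet implementation. Note that the mean in Eq. (4) is taken over the entire input volume, so the same threshold $\beta$ is applied to every element:

```python
import numpy as np

def ab_relu(x, alpha=1.0):
    """Average Biased ReLU, Eqs. (2)-(4).

    beta = alpha * mean(x) is computed over the whole input volume.
    When mean(x) < 0 the threshold shifts left, so some prominent
    negative responses survive (shifted into the positive range);
    when mean(x) >= 0 it shifts right, so weak positive responses
    are suppressed along with all negative ones.
    """
    beta = alpha * x.mean()           # Eqs. (3)-(4): average biased factor
    return np.maximum(x - beta, 0.0)  # Eq. (2): biased thresholding
```

Setting alpha = 0 recovers the standard ReLU, which makes the parameterization of the descriptor variants described below (α = 1, 2, 5) easy to express.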
In this subsection, the VGGFace model is used with AB-ReLU to construct the improved VGGFace descriptors. The AB-ReLU is applied directly over the pre-trained VGGFace model at some layers, in place of the simple ReLU. The output of layer 35 (i.e., ReLU) of the original pre-trained VGGFace model, after reshaping into a 1-D array, is used as the baseline VGGFace descriptor and represented by VGGFace35ReLU (or just 35R as a shorthand notation). The first descriptor is obtained by simply replacing the last ReLU, i.e., at layer 35, with AB-ReLU and converting its output into a 1-D array. This descriptor is represented by VGGFace35AB-ReLU (i.e., 35AR) for α = 1. The other variants of this descriptor are VGGFace35AB-ReLU2 (i.e., 35AR2) and VGGFace35AB-ReLU5 (i.e., 35AR5) for α = 2 and α = 5, respectively. Similarly, other descriptors are generated by replacing some ReLU layers of VGGFace with AB-ReLU. In the second descriptor, VGGFace33AB-ReLU (i.e., 33AR) for α = 1, layer 34 and layer 35 are removed, the ReLU at layer 33 is replaced with AB-ReLU, and the output of layer 33 is considered as the descriptor after reshaping into a 1-D array. Its other variants are VGGFace33AB-ReLU2 (i.e., 33AR2) and VGGFace33AB-ReLU5 (i.e., 33AR5) for α = 2 and α = 5, respectively. In the VGGFace33AB-ReLU_35 (i.e., 33AR_35) descriptor, the ReLU at layer 33 is replaced with AB-ReLU, while the output of layer 35 using ReLU is considered as the descriptor. AB-ReLU is applied at multiple layers, i.e., at layer 33 and layer 35, in VGGFace33,35AB-ReLU (i.e., 33,35AR). The AB-ReLU is also applied at layer 30. Two descriptors, namely VGGFace30AB-ReLU (i.e., 30AR) and VGGFace30AB-ReLU_35 (i.e., 30AR_35), are considered for the experiments. In VGGFace30AB-ReLU, the output of layer 30 (i.e., AB-ReLU) is taken as the descriptor, whereas in VGGFace30AB-ReLU_35, the AB-ReLU is used at layer 30 and the output of the last layer (i.e., layer 35) is taken as the descriptor. In the experiment section, the shorthand notations of the descriptors are used.

The effect of AB-ReLU with the pre-trained VGGFace model (VGGFace35AB-ReLU) is illustrated with an example face image in Fig. 3. The example face image displayed in Fig. 3a is taken from the LFW database [33]. This face image is used as the input to the pre-trained VGGFace model, and the features are computed before and after layer 35. Fig. 3b shows the input signal to the last layer (i.e., layer 35). The output signal of ReLU at layer 35 is displayed in Fig. 3c. Figs. 3d, 3e, and 3f illustrate the output signals of AB-ReLU for α = 1, 2, and 5, respectively. For this example, $A_v^n < 0$ at layer 35; it can also be observed from Fig. 3 that AB-ReLU passes more signal as compared to ReLU.
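As a sketch of how the 35AR descriptor is assembled, assume a hypothetical helper forward_to_layer(img, k) that runs the pre-trained VGGFace model up to (but excluding) the activation at layer k and returns that input volume; the helper's name and signature are illustrative, not part of MatConvNet:

```python
import numpy as np

def vggface35_ab_relu_descriptor(img, forward_to_layer, alpha=1.0):
    """Sketch of the 35AR descriptor: the input volume to layer 35
    ('relu7' in Table I) is passed through AB-ReLU instead of ReLU
    and flattened into a 4096-dimensional 1-D feature vector."""
    x = forward_to_layer(img, 35)     # 1x1x4096 volume entering layer 35
    return ab_relu(x, alpha).ravel()  # reuses the ab_relu sketch above
```

The 35AR2 and 35AR5 variants correspond to alpha=2.0 and alpha=5.0, while the 33AR-style variants apply the same idea at layer 33 instead.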
IV. EXPERIMENTAL SETUP

In this paper, the image retrieval framework is adopted for the experiments. The face retrieval is performed using the introduced AB-ReLU based VGGFace descriptor. In face retrieval, the top matching faces are returned from a database for a given query face, based on the description of the faces. The best matching faces are decided from the similarity scores between the query face and the database faces. In this work, the similarity score is taken as the distance between the descriptor of the query face and the descriptor of a database face: a lower distance between two feature descriptors represents a higher similarity between the corresponding face images, and vice versa.
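The retrieval step itself reduces to sorting the gallery by distance to the query descriptor. A minimal sketch, with the distance function passed in as a parameter:

```python
import numpy as np

def retrieve_top_k(query_desc, gallery_descs, dist_fn, k=10):
    """Rank gallery descriptors by ascending distance to the query
    and return the indices of the k best-matching faces."""
    dists = np.array([dist_fn(query_desc, g) for g in gallery_descs])
    return np.argsort(dists)[:k]
```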
A. Distance Measures
In image retrieval, the performance also depends upon the distance measure used for computing the similarity scores. In order to compute the performance, the top few faces are retrieved. The Chi-square (Chisq) distance is used in most of the experiments in this work. The Euclidean, Cosine, Earth Mover's Distance (EMD), L1, and D1 distances are also evaluated to find the most suitable distance in the current scenario [56], [6].
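A common form of the Chi-square distance between two non-negative descriptors is sketched below; the exact variant used in [56], [6] may differ in normalization, so treat the constant factor and the guard term as assumptions:

```python
import numpy as np

def chi_square_distance(p, q, eps=1e-10):
    """Chi-square distance: smaller means more similar.
    eps avoids division by zero in all-zero feature dimensions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))
```

Since the descriptors here are outputs of ReLU or AB-ReLU layers, they are non-negative, which is the setting in which the Chi-square distance is well defined.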
B. Evaluation Criteria
In order to present the face retrieval and comparison results, standard evaluation metrics are used in this paper, namely precision, recall, F-score, and retrieval rank. Each image of a database is treated as the query image (i.e., probe) one by one, with the remaining images as the gallery, to report the average performance over the full database. The average retrieval precision (ARP) and average retrieval rate (ARR) over the full database are computed as the average of the mean precisions (MP) and mean recalls (MR), respectively, over all categories. The MP and MR for a category are calculated as the mean of the precisions and recalls, respectively, obtained by turning each image of that category into the query one by one. The precision ($Pr$) and recall ($Re$) for a query image are calculated as follows,

$$Pr = \frac{\#\,\text{Correct Retrieved Images}}{\#\,\text{Retrieved Images}}, \qquad Re = \frac{\#\,\text{Correct Retrieved Images}}{\#\,\text{Similar Images In Database}} \qquad (5)$$

The F-score is calculated from the ARP and ARR values with the help of the following equation,

$$F\text{-}score = \frac{2 \times ARP \times ARR}{ARP + ARR} \qquad (6)$$

In order to assess the effective rank of the correctly retrieved faces, the average normalized modified retrieval rank (ANMRR) metric is adopted [57]. Better retrieval performance is indicated by higher values of ARP, ARR and F-score, and by a lower value of ANMRR, and vice versa.
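Eqs. (5) and (6) translate directly into code. The sketch below computes per-query precision and recall, which are then averaged into MP/MR per category and into ARP/ARR over the database as described above (ANMRR is omitted here; see [57] for its definition):

```python
def precision_recall(retrieved_labels, query_label, num_similar):
    """Eq. (5): precision and recall for one query, given the labels
    of the retrieved images and the number of same-category images
    available in the database."""
    correct = sum(1 for lbl in retrieved_labels if lbl == query_label)
    return correct / len(retrieved_labels), correct / num_similar

def f_score(arp, arr):
    """Eq. (6): F-score from ARP and ARR (both on the same scale)."""
    return 2.0 * arp * arr / (arp + arr)
```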
C. Databases Used
Six challenging, unconstrained and robust face databases are used to demonstrate the effectiveness of the proposed AB-ReLU based VGGFace descriptor: PaSC [58], LFW [33], PubFig [59], FERET [60], [61], AR [62], [63], and ExYaleB [64], [65]. The Viola-Jones object detection method [66] is adopted to detect and crop the face regions in the images. The faces are resized to 224 × 224, and 'zerocenter' normalization is applied before feeding them to the proposed AB-ReLU based VGGFace model.
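A minimal preprocessing sketch using OpenCV's Viola-Jones cascade is given below; the specific cascade file and the mean used for 'zerocenter' normalization are assumptions here (the paper relies on the normalization bundled with the MatConvNet VGGFace model):

```python
import cv2
import numpy as np

def preprocess_face(image_path, mean_image):
    """Detect a face with the Viola-Jones cascade [66], crop it,
    resize to 224x224 and subtract the model mean ('zerocenter')."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                    # image dropped, as in the paper
    x, y, w, h = faces[0]              # keep the first detection
    crop = cv2.resize(img[y:y+h, x:x+w], (224, 224))
    return crop.astype(np.float32) - np.asarray(mean_image, np.float32)
```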
Fig. 3: An example illustrating ReLU and AB-ReLU in terms of the final feature of layer 35 of the VGGFace model: (a) an example image from the LFW database [33]; (b) the input to layer 35, reshaped to 1-D; (c) the output of layer 35 with ReLU; (d)-(f) the output of layer 35 with AB-ReLU for α = 1, 2, and 5, respectively.

The PaSC still-images face database consists of 9,376 images from 293 individuals, with 32 images per individual [58]. The PaSC database exhibits effects such as blur, pose and illumination variation and is regarded as one of the difficult databases. After face detection with the Viola-Jones detector, 8,718 faces remain in this database.

In the current scenario, unconstrained face retrieval is in great demand due to the increasing number of face images on the Internet. In this paper, the LFW [33] and PubFig [59] databases are considered for this purpose. These two databases collected images from the Internet in an unconstrained manner, without subject cooperation, and with several variations such as pose, lighting, expression, scene and camera. In the image retrieval framework, it is required to retrieve more than one (typically 5, 10, etc.) top matching images; in that case, a sufficient number of images should be available for each category in the database. Considering this fact, all individuals having at least 20 images are taken from the LFW database (i.e., 2,984 faces from 62 individuals) [33]. The Public Figure database (i.e., PubFig) consists of 6,472 faces from 60 individuals [59]. Following the URLs given in the PubFig face database, the images were downloaded directly from the Internet after removing the dead URLs.

In order to test the robustness of the descriptor, the FERET, AR and Extended Yale B face databases are used. "Portions of the research in this paper use the FERET database of facial images collected under the FERET program, sponsored by the DOD Counterdrug Technology Development Program Office" [60], [61]. The cropped version of the Color-FERET database, having 4,053 faces from 141 people (only subjects having at least 20 faces), is considered in this work. Several variations such as expression and pose (13 different poses) are present in the FERET database. The cropped version of the AR face database is also used for the experiments [62], [63]. The AR database has a masking effect where some portions of the face are occluded, along with illumination and color effects. In total, 2,600 face images from 100 people are available in the AR database. The Extended Yale B (ExYaleB) database is characterized by severe illumination differences (i.e., 64 types of illumination) [64], [65]. In total, 2,432 cropped faces from 38 persons, with 64 faces per person, are present in the ExYaleB database for the face retrieval.

V. RESULTS AND DISCUSSIONS
In this work, the content-based image retrieval framework is adopted for the experiments and comparison. In this section, first a result comparison is presented with the similarity measure fixed to the Chi-square distance, and then the performance of the proposed VGGFace35AB-ReLU descriptor is tested with different similarity measures.
A. Results Comparison
Several VGGFace descriptors with AB-ReLU at different layers, namely VGGFace35ReLU (35R), VGGFace35AB-ReLU (35AR), VGGFace35AB-ReLU2 (35AR2), VGGFace35AB-ReLU5 (35AR5), VGGFace33ReLU (33R), VGGFace33AB-ReLU (33AR), VGGFace33AB-ReLU2 (33AR2), VGGFace33AB-ReLU5 (33AR5), VGGFace33AB-ReLU_35 (33AR_35), VGGFace33,35AB-ReLU (33,35AR), VGGFace30AB-ReLU_35 (30AR_35), and VGGFace30AB-ReLU (30AR), are used for the experiments. The average retrieval precision (ARP) for the topmost match (i.e., rank-1 accuracy) is reported in Table II over the PaSC, LFW, PubFig, FERET, AR and ExYaleB databases.
TABLE II: Average retrieval precision, ARP (%), for the topmost match using AB-ReLU and VGGFace based descriptors over the PaSC, LFW, PubFig, FERET, AR and ExYaleB databases. This is equivalent to the rank-1 accuracy. The result of the best performing descriptor for each database is highlighted in bold.
Database | 35R | 35AR | 35AR2 | 35AR5 | 33R | 33AR | 33AR2 | 33AR5 | 33AR_35 | 33,35AR | 30AR_35 | 30AR
PaSC | 93.06 | 93.88
ExYaleB | 85.77 | 86.39 | 86.27 | 85.90 | 86.92 | 86.55 | 86.18 | 85.53 | 85.90 | 85.81 | 86.72
TABLE III: ARP (%) for 5 retrieved images using AB-ReLU and VGGFace based descriptors over the PaSC, LFW, PubFig, FERET, AR and ExYaleB databases. The result of the best descriptor for each database is highlighted in bold.
Database | 35R | 35AR | 35AR2 | 35AR5 | 33R | 33AR | 33AR2 | 33AR5 | 33AR_35 | 33,35AR | 30AR_35 | 30AR
PaSC | 87.91 | 89.33
TABLE IV: ARP (%) for 10 retrieved images using AB-ReLU and VGGFace based descriptors over the PaSC, LFW, PubFig, FERET, AR and ExYaleB databases. The result of the best descriptor for each database is highlighted in bold.
Database | 35R | 35AR | 35AR2 | 35AR5 | 33R | 33AR | 33AR2 | 33AR5 | 33AR_35 | 33,35AR | 30AR_35 | 30AR
PaSC | 83.11 | 85.08
TABLE V: Average retrieval rate, ARR (%), for 10 retrieved images using AB-ReLU and VGGFace based descriptors over the PaSC, LFW, PubFig, FERET, AR and ExYaleB databases. The result of the best descriptor for each database is highlighted in bold.
Database | 35R | 35AR | 35AR2 | 35AR5 | 33R | 33AR | 33AR2 | 33AR5 | 33AR_35 | 33,35AR | 30AR_35 | 30AR
PaSC | 28.06 | 28.74
TABLE VI: F-score (%) for 10 retrieved images using AB-ReLU and VGGFace based descriptors over the PaSC, LFW, PubFig, FERET, AR and ExYaleB databases. The result of the best descriptor for each database is highlighted in bold.
Database | 35R | 35AR | 35AR2 | 35AR5 | 33R | 33AR | 33AR2 | 33AR5 | 33AR_35 | 33,35AR | 30AR_35 | 30AR
PaSC | 41.89 | 42.89
It is observed from Table II that the performance of 35AR and 35AR2 is better, mainly over the unconstrained databases, whereas the performance of 30AR is better over robust databases like AR and ExYaleB. It is also noted that the performance of AB-ReLU (35AR) is improved as compared to ReLU (35R). Table III lists the ARP values when the 5 best faces are retrieved. In this result, the performance is generally better for the parameter α = 2, i.e., 35AR2 and 33AR2. The picture becomes clearer in Table IV, where the ARP is reported for 10 retrieved images.
TABLE VII: Average normalized modified retrieval rank (ANMRR), in %, for 10 retrieved images using AB-ReLU and VGGFace based descriptors over the PaSC, LFW, PubFig, FERET, AR and ExYaleB databases. The result of the best performing descriptor (i.e., the least ANMRR value) for each database is highlighted in bold.
Database | 35R | 35AR | 35AR2 | 35AR5 | 33R | 33AR | 33AR2 | 33AR5 | 33AR_35 | 33,35AR | 30AR_35 | 30AR
PaSC | 4.40
TABLE VIII: ARP, in %, for 10 retrieved images using the VGGFace35AB-ReLU (35AR) descriptor with different distance measures. The ARP value highlighted in bold represents the best distance over a database.
Database | Euclidean | Cosine | L1 | D1 | Chi-square
PaSC | 84.89 | 84.84 | 85.01 | 85.01
LFW | 97.65 | 97.64 | 97.66 | 97.66
PubFig | 95.64 | 95.63 | 95.67 | 95.67
FERET
ExYaleB | 71.45 | 71.34 | 71.48 | 71.48

Descriptors constructed at the last layer (i.e., layer 35) are superior, except over the AR database. One possible reason is that the training faces of the VGGFace database are not masked. The results in Table IV confirm that AB-ReLU is better suited for the descriptor as compared to ReLU, at both layer 35 and layer 33. The ARR and F-score for 10 retrieved images are summarized in Table V and Table VI, respectively. A similar trend is observed in the ARR and F-score results: 35AR and 35AR2 are the best performing VGGFace based descriptors. Some variations can be seen in the ANMRR results for the same 10 best matching retrieved images in Table VII, as compared to the ARP, ARR and F-score, because ANMRR heavily penalizes the ranks of falsely retrieved images. Still, 35AR is better over the PaSC and FERET databases, and 35AR2 is better over the PubFig and ExYaleB databases. It can be noticed that the F-score and ANMRR over the LFW database are best for the 35AR and 30AR_35 descriptors, respectively. This means that, although the true positive rate of the 30AR_35 descriptor over the LFW database is lower than that of 35AR, the faces retrieved using 30AR_35 are closer to the query face in terms of their ranks.
B. Effect of Similarity Measure
In the comparison results of the previous subsection, the Chi-square distance was adopted as the similarity measure. This experiment is conducted to find the most suitable similarity measure for the proposed descriptor. The ARP values using the VGGFace35AB-ReLU (i.e., 35AR) descriptor over each database are presented in Table VIII. In this experiment, the 10 top matching images are retrieved with different distances. The Euclidean, Cosine, L1, D1 and Chi-square distances are evaluated and reported in Table VIII. It is noticed that the Chi-square distance based similarity measure is better suited for each database except the FERET database.

VI. CONCLUSION
In this paper, an average biased rectified linear unit (AB-ReLU) is proposed for image representation using a CNN model. The AB-ReLU is data dependent and adjusts the threshold according to whether the data is positive or negative dominated. It uses the average of the input volume to adjust the input volume itself. The advantage of AB-ReLU is that it passes the important negative signals and blocks the irrelevant positive signals, based on the nature of the input volume. The AB-ReLU is applied over the pre-trained VGGFace model at the last few layers, replacing the conventional ReLU layers. Face retrieval experiments are conducted to test the performance of the AB-ReLU based VGGFace descriptor. Six challenging face databases are considered, including three unconstrained and three robust databases. Based on the experimental analysis, it is concluded that the AB-ReLU layer is better suited than the simple ReLU layer at the last layer of a pre-trained CNN model for feature description. Favorable performance is reported in both unconstrained and robust scenarios. It is also found that the Chi-square distance is better suited to the proposed descriptor for face retrieval.

ACKNOWLEDGEMENT
I gratefully acknowledge the support of NVIDIA Corporation with the donation of the GeForce Titan X Pascal used for this research.

REFERENCES

[1] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[2] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.
[3] S. R. Dubey, S. K. Singh, and R. K. Singh, "Rotation and illumination invariant interleaved intensity order-based local descriptor," IEEE Transactions on Image Processing, vol. 23, no. 12, pp. 5323–5333, 2014.
[4] T. Ahonen, A. Hadid, and M. Pietikainen, "Face description with local binary patterns: Application to face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 12, pp. 2037–2041, 2006.
[5] S. Chakraborty, S. Singh, and P. Chakraborty, "Local gradient hexa pattern: A descriptor for face recognition and retrieval," IEEE Transactions on Circuits and Systems for Video Technology, 2016.
[6] S. R. Dubey, S. K. Singh, and R. K. Singh, "Multichannel decoded local binary patterns for content-based image retrieval," IEEE Transactions on Image Processing, vol. 25, no. 9, pp. 4018–4032, 2016.
[7] S. K. Roy, B. Chanda, B. Chaudhuri, D. K. Ghosh, and S. R. Dubey, "A complete dual-cross pattern for unconstrained texture classification," 2017, pp. 741–746.
[8] S. Liao, M. W. Law, and A. C. Chung, "Dominant local binary patterns for texture classification," IEEE Transactions on Image Processing, vol. 18, no. 5, pp. 1107–1118, 2009.
[9] S. K. Roy, B. Chanda, B. B. Chaudhuri, S. Banerjee, D. K. Ghosh, and S. R. Dubey, "Local jet pattern: A robust descriptor for texture classification," arXiv preprint arXiv:1711.10921, 2017.
[10] L. Liu, Y. Long, P. W. Fieguth, S. Lao, and G. Zhao, "BRINT: Binary rotation invariant and noise tolerant texture classification," IEEE Transactions on Image Processing, vol. 23, no. 7, pp. 3071–3084, 2014.
[11] S. K. Roy, B. Chanda, B. B. Chaudhuri, S. Banerjee, D. K. Ghosh, and S. R. Dubey, "Local directional zigzag pattern: A rotation invariant descriptor for texture classification," Pattern Recognition Letters, 2018.
[12] S. R. Dubey, S. K. Singh, and R. K. Singh, "Local wavelet pattern: A new feature descriptor for image retrieval in medical CT databases," IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5892–5903, 2015.
[13] ——, "Local bit-plane decoded pattern: A novel feature descriptor for biomedical image retrieval," IEEE Journal of Biomedical and Health Informatics, vol. 20, no. 4, pp. 1139–1147, 2016.
[14] ——, "Local diagonal extrema pattern: A new and efficient feature descriptor for CT image retrieval," IEEE Signal Processing Letters, vol. 22, no. 9, pp. 1215–1219, 2015.
[15] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2005, pp. 886–893.
[16] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.
[17] S. R. Dubey and S. Mukherjee, "LDOP: Local directional order pattern for robust face retrieval," arXiv preprint arXiv:1803.07441, 2018.
[18] S. Chakraborty, S. K. Singh, and P. Chakraborty, "Local directional gradient pattern: A local descriptor for face recognition," Multimedia Tools and Applications, vol. 76, no. 1, pp. 1201–1216, 2017.
[19] ——, "Centre symmetric quadruple pattern: A novel descriptor for facial image recognition and retrieval," Pattern Recognition Letters, 2017.
[20] S. R. Dubey, "Local directional relation pattern for unconstrained and robust face retrieval," arXiv preprint arXiv:1709.09518, 2017.
[21] ——, "Face retrieval using frequency decoded local descriptor," arXiv preprint arXiv:1709.06508, 2017.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[23] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[24] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[27] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701–1708.
[28] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
[29] O. M. Parkhi, A. Vedaldi, A. Zisserman et al., "Deep face recognition," in BMVC, vol. 1, no. 3, 2015, p. 6.
[30] A. R. Chowdhury, T.-Y. Lin, S. Maji, and E. Learned-Miller, "One-to-many face recognition with bilinear CNNs," in IEEE Winter Conference on Applications of Computer Vision (WACV), 2016, pp. 1–9.
[31] J.-C. Chen, V. M. Patel, and R. Chellappa, "Unconstrained face verification using deep CNN features," in IEEE Winter Conference on Applications of Computer Vision (WACV), 2016, pp. 1–9.
[32] R. Ranjan, S. Sankaranarayanan, C. D. Castillo, and R. Chellappa, "An all-in-one convolutional neural network for face analysis," in 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), 2017, pp. 17–24.
[33] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[34] T.-Y. Lin, A. RoyChowdhury, and S. Maji, "Bilinear CNN models for fine-grained visual recognition," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1449–1457.
[35] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, and A. K. Jain, "Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1931–1939.
[36] D. Marmanis, M. Datcu, T. Esch, and U. Stilla, "Deep learning earth observation classification using ImageNet pretrained networks," IEEE Geoscience and Remote Sensing Letters, vol. 13, no. 1, pp. 105–109, 2016.
[37] P. Liu, J.-M. Guo, C.-Y. Wu, and D. Cai, "Fusion of deep learning and compressed domain features for content-based image retrieval," IEEE Transactions on Image Processing, vol. 26, no. 12, pp. 5706–5717, 2017.
[38] P. Ballester and R. M. de Araújo, "On the performance of GoogLeNet and AlexNet applied to sketches," in AAAI, 2016, pp. 1124–1128.
[39] A. Bansal, C. Castillo, R. Ranjan, and R. Chellappa, "The do's and don'ts for CNN-based face verification," arXiv preprint arXiv:1705.07426, 2017.
[40] G. Carneiro, J. Nascimento, and A. P. Bradley, "Unregistered multiview mammogram analysis with pre-trained deep learning models," in International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 652–660.
[41] M. Schwarz, H. Schulz, and S. Behnke, "RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features," in IEEE International Conference on Robotics and Automation (ICRA), 2015, pp. 1329–1335.
[42] S. Zha, F. Luisier, W. Andrews, N. Srivastava, and R. Salakhutdinov, "Exploiting image-trained CNN architectures for unconstrained video classification," in British Machine Vision Conference (BMVC), 2015.
[43] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
[44] C. Zhang, H. Li, X. Wang, and X. Yang, "Cross-scene crowd counting via deep convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 833–841.
[45] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li, "Deep learning for content-based image retrieval: A comprehensive study," in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 157–166.
[46] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, "Simultaneous deep transfer across domains and tasks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4068–4076.
[47] M. Long, H. Zhu, J. Wang, and M. I. Jordan, "Unsupervised domain adaptation with residual transfer networks," in Advances in Neural Information Processing Systems, 2016, pp. 136–144.
[48] Y. Ge, S. Jiang, Q. Xu, C. Jiang, and F. Ye, "Exploiting representations from pre-trained convolutional neural networks for high-resolution remote sensing image retrieval," Multimedia Tools and Applications, pp. 1–27.
[49] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, "A discriminative feature learning approach for deep face recognition," in European Conference on Computer Vision, Springer, 2016, pp. 499–515.
[50] D.-A. Clevert, A. Mayr, T. Unterthiner, and S. Hochreiter, "Rectified factor networks," in Advances in Neural Information Processing Systems, 2015, pp. 1855–1863.
[51] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
[52] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proc. ICML, vol. 30, no. 1, 2013.
[53] B. Xu, N. Wang, T. Chen, and M. Li, "Empirical evaluation of rectified activations in convolutional network," arXiv preprint arXiv:1505.00853, 2015.
[54] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," arXiv preprint arXiv:1511.07289, 2015.
[55] A. Vedaldi and K. Lenc, "MatConvNet: Convolutional neural networks for MATLAB," in Proceedings of the 23rd ACM International Conference on Multimedia, 2015, pp. 689–692.
[56] S. Murala, R. Maheshwari, and R. Balasubramanian, "Local tetra patterns: A new feature descriptor for content-based image retrieval," IEEE Transactions on Image Processing, vol. 21, no. 5, pp. 2874–2886, 2012.
[57] K. Lu, N. He, J. Xue, J. Dong, and L. Shao, "Learning view-model joint relevance for 3D object retrieval," IEEE Transactions on Image Processing, vol. 24, no. 5, pp. 1449–1459, 2015.
[58] J. R. Beveridge, P. J. Phillips, D. S. Bolme, B. A. Draper, G. H. Givens, Y. M. Lui, M. N. Teli, H. Zhang, W. T. Scruggs, K. W. Bowyer et al., "The challenge of face recognition from digital point-and-shoot cameras," in IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems (BTAS), 2013, pp. 1–8.
[59] N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar, "Attribute and simile classifiers for face verification," in IEEE 12th International Conference on Computer Vision, 2009, pp. 365–372.
[60] P. J. Phillips, H. Wechsler, J. Huang, and P. J. Rauss, "The FERET database and evaluation procedure for face-recognition algorithms," Image and Vision Computing, vol. 16, no. 5, pp. 295–306, 1998.
[61] P. J. Phillips, H. Moon, S. A. Rizvi, and P. J. Rauss, "The FERET evaluation methodology for face-recognition algorithms," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 10, pp. 1090–1104, 2000.
[62] A. M. Martinez, "The AR face database," CVC Technical Report, 1998.
[63] A. M. Martínez and A. C. Kak, "PCA versus LDA," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228–233, 2001.
[64] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: Illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
[65] K.-C. Lee, J. Ho, and D. J. Kriegman, "Acquiring linear subspaces for face recognition under variable lighting," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 684–698, 2005.
[66] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2001, pp. I–I.
Shiv Ram Dubey has been with the Indian Institute of Information Technology (IIIT), Sri City since June 2016, where he is currently an Assistant Professor of Computer Science and Engineering. He received the Ph.D. degree in Computer Vision and Image Processing from the Indian Institute of Information Technology (IIIT), Allahabad in 2016. Before that, from August 2012 to February 2013, he was a Project Officer in the Computer Science and Engineering Department at the Indian Institute of Technology (IIT), Madras.

He was a recipient of several awards, including the Best PhD Award at the CICT17 PhD Symposium at the Indian Institute of Information Technology & Management (IIITM) Gwalior, the Early Career Research Award from SERB, Govt. of India, and the NVIDIA GPU Award from NVIDIA. He received Certificates of Reviewing from Biosystems Engineering and from Computers in Biology and Medicine, Elsevier, in 2015 and 2016, respectively. He received the Best Paper Award at IEEE UPCON 2015, a prestigious conference of the IEEE UP Section. His research interests include Computer Vision, Deep Learning, Image Processing, Image Feature Description, Image Matching, Content-Based Image Retrieval, Medical Image Analysis and Biometrics.