Acceleration of Large Margin Metric Learning for Nearest Neighbor Classification Using Triplet Mining and Stratified Sampling
Parisa Abdolrahim Poorheravi, Benyamin Ghojogh, Vincent Gaudet, Fakhri Karray, Mark Crowley
Parisa Abdolrahim Poorheravi*, Benyamin Ghojogh*, Vincent Gaudet, Fakhri Karray, Mark Crowley
Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada
Emails: {pabdolra, bghojogh, vcgaudet, karray, mcrowley}@uwaterloo.ca
* The first two authors contributed equally to this work.

Abstract—Metric learning is one of the techniques in manifold learning with the goal of finding a projection subspace that increases and decreases the inter- and intra-class variances, respectively. Some metric learning methods are based on triplet learning with anchor-positive-negative triplets. Large margin metric learning for nearest neighbor classification is one of the fundamental methods in this family. Recently, Siamese networks have been introduced with the triplet loss. Many triplet mining methods have been developed for Siamese networks; however, these techniques have not been applied to the triplets of large margin metric learning for nearest neighbor classification. In this work, inspired by the mining methods for Siamese networks, we propose several triplet mining techniques for large margin metric learning. Moreover, a hierarchical approach is proposed, for acceleration and scalability of optimization, where triplets are selected by stratified sampling in hierarchical hyper-spheres. We analyze the proposed methods on three publicly available datasets, i.e., the Fisher Iris, ORL faces, and MNIST datasets.
I. INTRODUCTION
Distance metric learning is one of the fundamental and most competitive techniques in machine and manifold learning [1]. The goal of metric learning is to find a proper metric whose subspace discriminates the classes by increasing and decreasing the inter- and intra-class variances, respectively [2], [3] (e.g., see Fig. 1). This goal was first introduced by Fisher Discriminant Analysis (FDA) [2], [4].

Some metric learning methods make use of anchor-positive-negative triplets, where the positive and negative instances are the data points having the same and different class labels with respect to an anchor instance, respectively. One of the first metric learning methods based on triplets was large margin metric learning for nearest neighbor classification [5], [6]. This method uses Semi-Definite Programming (SDP) optimization [7], as SDP has been found to be useful for metric learning [5], [6], [8], [9]. Later, the concept of a triplet cost function was proposed in the field of neural networks by introducing Siamese networks [10]–[12]. The triplet loss can be either in the form of Hinge loss [11] or softmax [13]; examples of the former and the latter are [5], [6], [11] and [14]–[16], respectively.

Solving SDP problems requires the interior point method [17], which is iterative and slow, especially for big data.

Fig. 1. Metric learning for decreasing and increasing the intra- and inter-class variances, respectively, by pulling the positives (same-class instances) toward the anchor while pushing the negatives (other-class instances) away.
This can be improved and accelerated by selecting the most important data points for embedding [18]. For example, we care more about the nearest or farthest positives and negatives than about all the data points. This technique is referred to as triplet mining in the literature, where the positive and negative instances with respect to an anchor make a triplet [11].

After the introduction of Siamese networks in the literature, different triplet mining techniques were developed for Siamese training using triplets. However, these mining methods have not been implemented or proposed for the previously developed concept of large margin metric learning for nearest neighbor classification. In this work, inspired by the mining techniques for Siamese networks, we propose different triplet mining methods for large margin metric learning. By only considering the most valuable points of the dataset with respect to anchors, the SDP optimization speeds up while preserving an acceptable classification accuracy in large margin metric learning.

In addition to proposing triplet mining techniques for the optimization, we propose a hierarchical approach for further acceleration of metric learning. This approach includes iterative selection of data subsets by hierarchical stratified sampling [19] to train the embedding subspace. Not only does this approach accelerate the SDP optimization by reducing time complexity, but it also improves performance in some cases due to the effectiveness of model averaging [20] and the reduction of estimation variance by stratified sampling [21]. We also use the proposed triplet mining techniques in combination with the proposed hierarchical approach for the sake of acceleration.

The remainder of the paper is organized as follows. In Section II, we review the foundations of large margin metric learning, the triplet loss, and Siamese networks. We discuss the triplet mining methods that have already been proposed for Siamese triplet training, i.e., batch all [22], batch hard [23], batch semi-hard [11], easiest/hardest positives and easiest/hardest negatives, and negative sampling [18]. Section III proposes how to use the triplet mining techniques in the SDP optimization of large margin metric learning. The hierarchical approach is proposed in Section IV. We report the experimental results in Section V. Finally, Section VI concludes the paper and provides possible future directions.

II. BACKGROUND
A. Large Margin Metric Learning for Nearest Neighbor Classification

k-Nearest Neighbor (k-NN) classification is highly impacted by the distance metric utilized for measuring the differences between data points. The Euclidean distance does not weight the points and values them all equally. A general distance metric can be viewed as the Euclidean distance after projection of the points onto a discriminative subspace. This projection can be viewed as a linear transformation with a projection matrix denoted by L [24]. We call this general metric the Mahalanobis distance [1], [2]:

D := \|x_i - x_j\|_M^2 := \|L^\top (x_i - x_j)\|_2^2 = (x_i - x_j)^\top M (x_i - x_j),   (1)

where M := LL^\top. The matrix M must be positive semi-definite, i.e., M \succeq 0, for the metric to satisfy convexity and the triangle inequality [17].

In order to improve the k-NN classification performance, we should decrease and increase the intra- and inter-class variances of data, respectively [3]. As can be seen in Fig. 1, one way to achieve this goal is to pull the data points of the same class toward one another while pushing the points of different classes away.

Let y_il be one (zero) if the data points x_i and x_l are (are not) from the same class. Moreover, let \eta_{ij} be one if x_j is amongst the k-nearest neighbors of x_i with the same class label; otherwise, it is zero. For tackling the goal of pulling together the points of a class and pushing different classes away, the following cost function can be minimized [5]:

\sum_{i,j} \eta_{ij} \|L^\top (x_i - x_j)\|_2^2 + c \sum_{i,j,l} \eta_{ij} (1 - y_{il}) \big[ 1 + \|L^\top (x_i - x_j)\|_2^2 - \|L^\top (x_i - x_l)\|_2^2 \big]_+,   (2)

where [\cdot]_+ := \max(\cdot, 0) is the standard Hinge loss. The first term in Eq. (2) pulls the same-class points toward each other. The second term, on the other hand, is a triplet loss [11] which increases and decreases the inter- and intra-class variances, respectively.

Inspired by support vector machines, the cost function (2) can be restated using slack variables:

minimize_{M, \xi_{ijl}}  L := \sum_{i,j} \eta_{ij} \|x_i - x_j\|_M^2 + c \sum_{i,j,l} \eta_{ij} (1 - y_{il}) \xi_{ijl}
subject to  \|x_i - x_l\|_M^2 - \|x_i - x_j\|_M^2 \geq 1 - \xi_{ijl},  \xi_{ijl} \geq 0,  M \succeq 0,   (3)

which is an SDP problem [7]. The first terms in the objective functions of Eqs. (2) and (3) are equivalent because of Eq. (1). The Hinge loss in Eq. (2) can be approximated using non-negative slack variables, denoted by \xi_{ijl}. The second term of the objective function in Eq. (3), together with the first and second constraints, plays the role of the Hinge loss.
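To make the optimization concrete, the following is a minimal CVXPY sketch (our own illustration, not the authors' implementation) of Problem (3) for a given list of (anchor, positive, negative) index triplets; the function name, the default value of c, and the choice of solver defaults are assumptions.

```python
import cvxpy as cp

def solve_lmnn_sdp(X, triplets, c=1.0):
    """Solve the SDP of Eq. (3) over a given set of (anchor i, positive j, negative l) triplets.

    X: (n, d) NumPy data matrix; triplets: list of (i, j, l) index tuples.
    Returns the learned positive semi-definite metric matrix M.
    """
    d = X.shape[1]
    M = cp.Variable((d, d), PSD=True)              # M must be positive semi-definite
    xi = cp.Variable(len(triplets), nonneg=True)   # slack variables xi_ijl >= 0

    # First term of Eq. (3): sum of ||x_i - x_j||_M^2 over anchor-positive pairs.
    pairs = sorted({(i, j) for (i, j, _) in triplets})
    pull = sum(cp.quad_form(X[i] - X[j], M) for (i, j) in pairs)

    # Constraints of Eq. (3): ||x_i - x_l||_M^2 - ||x_i - x_j||_M^2 >= 1 - xi_ijl.
    constraints = [
        cp.quad_form(X[i] - X[l], M) - cp.quad_form(X[i] - X[j], M) >= 1 - xi[t]
        for t, (i, j, l) in enumerate(triplets)
    ]

    cp.Problem(cp.Minimize(pull + c * cp.sum(xi)), constraints).solve()
    return M.value
```

In practice, such a routine would be called with the triplets produced by one of the mining schemes of Section III rather than with all possible triplets.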
B. Triplet Loss and Siamese Networks

As explained for Eq. (2), the second term in that equation is the triplet loss, which pushes the classes away and pulls the points of a class together. In Eq. (2), x_i, x_j, and x_l are the anchor, positive, and negative instances, respectively. The goal of the triplet loss is to make the anchor and positive instances closer and to push the negative instances away, as also seen in Fig. 1.

Recently, the triplet loss has been used for training neural networks which are called Siamese or triplet networks [11]. A Siamese network is composed of three sub-networks which share their weights. The anchor, positive, and negative instances are fed to these sub-networks and the triplet loss is used to tune their weights. Siamese networks are usually used for learning a discriminative embedding space. In this work, we propose several triplet mining methods inspired by the triplet mining techniques already existing in the literature for Siamese nets.
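For illustration, here is a minimal NumPy sketch (our own, not from the paper) of the Hinge-based triplet loss used by Siamese networks, computed on batches of embedded anchor-positive-negative triplets; the margin value of 1.0 is an illustrative choice mirroring the unit margin of Eq. (3).

```python
import numpy as np

def hinge_triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge (max-margin) triplet loss: [margin + d(a,p)^2 - d(a,n)^2]_+ averaged over triplets.

    anchor, positive, negative: arrays of shape (n_triplets, embedding_dim).
    """
    d_ap = np.sum((anchor - positive) ** 2, axis=1)  # squared anchor-positive distances
    d_an = np.sum((anchor - negative) ** 2, axis=1)  # squared anchor-negative distances
    return np.mean(np.maximum(0.0, margin + d_ap - d_an))
```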
III. PROPOSED TRIPLET MINING

The optimization problem in Eq. (3) considers all the negative instances, even in large datasets. The SDP for solving Problem (3) is very time-consuming and slow [7]. Hence, Problem (3) becomes intractable for large datasets, as has been noted in [5]. This motivated us to use triplet mining on the data for further improvement upon [5], [6]. There exist several triplet mining methods which were proposed for Siamese network training. Inspired by those, we propose here triplet mining techniques in the objective function of Eq. (3) to facilitate the optimization process. In the following, we propose k-batch all, k-batch hard, k-batch semi-hard, extreme distances, and negative sampling for large margin metric learning.

A. k-Batch All

One of the mining methods to be considered is batch all, which takes all the positives and negatives of the data batch into account for the Siamese neural network [22]. The method proposed in [5], [6] is a batch-all version which takes only the k nearest positives and all the negatives. This makes sense because the SDP is slow and cannot handle all possible permutations of positive and negative instances. Here, we call this method k-batch all (k-BA), where the objective in Eq. (3) becomes:

L = \sum_{i,j} \eta_{ij} \|x_i - x_j\|_M^2 + c \sum_{i,j,l} \eta_{ij} (1 - y_{il}) \xi_{ijl}.   (4)

B. k-Batch Hard

Another mining method for Siamese networks is batch hard, in which the farthest positive and nearest negative with respect to the anchor are considered [23]. The farthest positive is the hardest one to be classified as a neighbor of the anchor. Likewise, the nearest negative is the hardest one to be separated from the anchor's class. In this work, we consider k positive and k negative instances and we call this k-batch hard (k-BH), where the objective in Eq. (3) becomes:

L = \sum_{i,j} \gamma^+_{ij} \|x_i - x_j\|_M^2 + c \sum_{i,j,l} \gamma^+_{ij} \eta^-_{il} (1 - y_{il}) \xi_{ijl},   (5)

where \gamma^+_{ij} is one (zero) if x_j is (is not) amongst the k-farthest neighbors of x_i with the same class label. Similarly, \eta^-_{il} is one (zero) if x_l is (is not) amongst the k-nearest neighbors of x_i with a different class label.

C. k-Batch Semi-Hard

Batch semi-hard is another method, for Siamese networks, in which the hardest negatives (closest to the anchor) that are farther from the anchor than the positive are taken into account [11]. In our work, we have k positive instances and, for each, we consider k negatives. We call this method k-batch semi-hard (k-BSH), in which the cost in Eq. (3) can be modeled as:

L = \sum_{i,j} \eta_{ij} \|x_i - x_j\|_M^2 + c \sum_{i,j} \eta_{ij} \sum_l \eta^\mp_{il} (1 - y_{il}) \xi_{ijl},   (6)

where \eta_{ij}, as defined before, is one (zero) if x_j is (is not) amongst the k-nearest neighbors of x_i with the same class label, and \eta^\mp_{il} is one (zero) if x_l is (is not) amongst the k-nearest neighbors of x_i with a different class label that are farther from x_i than x_j is.
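As an illustration of how the indicator variables above can be formed in practice, the following NumPy sketch (our own, not from the paper) builds the \eta_{ij}, \gamma^+_{ij}, and \eta^-_{il} masks from pairwise Euclidean distances and class labels; the helper name and the exclusion of self-pairs are assumptions.

```python
import numpy as np

def mining_masks(X, y, k):
    """Indicator masks for k-nearest same-class (eta), k-farthest same-class (gamma+),
    and k-nearest other-class (eta-) neighbors of every anchor."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # (n, n) pairwise distances
    same = (y[:, None] == y[None, :]) & ~np.eye(n, dtype=bool)  # same class, excluding the anchor itself
    diff = y[:, None] != y[None, :]

    eta = np.zeros((n, n), dtype=bool)        # k-nearest positives
    gamma_pos = np.zeros((n, n), dtype=bool)  # k-farthest positives
    eta_neg = np.zeros((n, n), dtype=bool)    # k-nearest negatives
    for i in range(n):
        pos = np.where(same[i])[0]
        neg = np.where(diff[i])[0]
        eta[i, pos[np.argsort(D[i, pos])[:k]]] = True
        gamma_pos[i, pos[np.argsort(D[i, pos])[-k:]]] = True
        eta_neg[i, neg[np.argsort(D[i, neg])[:k]]] = True
    return eta, gamma_pos, eta_neg
```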
D. Extreme Distances

Considering that every instance could be chosen based on its distance to the anchor (whether it is nearest or farthest), we have four different cases [25]. Easy and hard positives correspond to the nearest and farthest positives, respectively; easy and hard negatives correspond to the farthest and nearest negatives, respectively. Easy Positive-Easy Negative (EPEN), Easy Positive-Hard Negative (EPHN), Hard Positive-Easy Negative (HPEN), and Hard Positive-Hard Negative (HPHN) are the four possible cases. HPHN is equivalent to the batch hard method explained in Section III-B. Since we take k instances from both the positive and negative sets, the costs in Eq. (3) for the other three cases are as follows:

k-EPEN:
L = \sum_{i,j} \eta_{ij} \|x_i - x_j\|_M^2 + c \sum_{i,j,l} \eta_{ij} \gamma^-_{il} (1 - y_{il}) \xi_{ijl},   (7)

k-EPHN:
L = \sum_{i,j} \eta_{ij} \|x_i - x_j\|_M^2 + c \sum_{i,j,l} \eta_{ij} \eta^-_{il} (1 - y_{il}) \xi_{ijl},   (8)

k-HPEN:
L = \sum_{i,j} \gamma^+_{ij} \|x_i - x_j\|_M^2 + c \sum_{i,j,l} \gamma^+_{ij} \gamma^-_{il} (1 - y_{il}) \xi_{ijl},   (9)

where \gamma^-_{il} is one (zero) if x_l is (is not) amongst the k-farthest neighbors of x_i with a different class label. The hardest cases are useful due to the concept of opposition learning [26] and the fact that data points which are more difficult to separate are better to be emphasized. Moreover, the easiest cases have also been found to be effective in the literature [16].

E. Negative Sampling

In negative sampling, another mining method proposed for Siamese networks, each negative's probability of occurrence for every positive instance is calculated using a stochastic probability distribution. The distribution of pairwise distances, denoted by q(D), of two points can be estimated as [18]:

q(D) \propto D^{d-2} (1 - 0.25 D^2)^{(d-3)/2},   (10)

where d is the dimensionality of the data and D is defined by Eq. (1). For an anchor x_i, the probability of a negative instance x_l with distance D from x_i can be calculated as [18]:

P(x_l \mid x_i) \propto \min(\lambda, q^{-1}(D)),   (11)

where the constant \lambda gives all the negatives a minimum chance of selection. One can use a roulette wheel strategy for selecting negative instances using the probability in Eq. (11) [27].

In this work, we select the k-nearest positives and sample k negatives for every anchor-positive pair. We call this method k-negative sampling (k-NS) and its cost function in Eq. (3) is:

L = \sum_{i,j} \eta_{ij} \|x_i - x_j\|_M^2 + c \sum_{i,j,l} \eta_{ij} \rho^-_{il} (1 - y_{il}) \xi_{ijl},   (12)

where \rho^-_{il} is one (zero) if x_l is (is not) a sampled negative for the (x_i, x_j) anchor-positive pair.
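To make the sampling step concrete, here is a small NumPy sketch (our own illustration, not the authors' code) of distance-weighted negative sampling following Eqs. (10)-(11); the clipping constant lam, the numerical safeguards, and the normalization details are assumptions.

```python
import numpy as np

def sample_negatives(D_anchor_to_negs, d, k, lam=0.5, rng=None):
    """Sample k negatives for one anchor with probability ~ min(lambda, 1/q(D)),
    where q(D) ~ D^(d-2) * (1 - 0.25 * D^2)^((d-3)/2) as in Eq. (10).

    D_anchor_to_negs: distances from the anchor to all candidate negatives.
    d: data dimensionality; lam: clipping constant (illustrative value).
    """
    rng = np.random.default_rng() if rng is None else rng
    D = np.clip(D_anchor_to_negs, 1e-8, 2.0 - 1e-8)       # keep q(D) well-defined
    q = D ** (d - 2) * (1.0 - 0.25 * D ** 2) ** ((d - 3) / 2.0)
    w = np.minimum(lam, 1.0 / q)                           # Eq. (11), up to normalization
    p = w / w.sum()                                        # roulette-wheel probabilities
    return rng.choice(len(D), size=k, replace=False, p=p)
```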
IV. PROPOSED HIERARCHICAL LARGE MARGIN METRIC LEARNING WITH STRATIFIED SAMPLING

The triplet mining methods introduced in the previous section are promising techniques for better and faster performance of large margin metric learning; however, they can be further improved, as explained here. We propose a hierarchical approach for accelerating large margin metric learning.
The main idea is to feed portions of the data to the optimization for training, in order to tackle the slow pace of SDP. However, in order to take the whole training data into account, the portions of data should be introduced to the optimization problem hierarchically. This technique has a divide-and-conquer manner to accelerate the training phase [28]. It can also improve the performance of the embedding model due to model averaging [20], [29] and the reduction of estimation variance by stratified sampling [21].

The procedure of this hierarchical approach can be found in Algorithm 1. As can be seen in this algorithm, the approach is iterative. In every iteration, several hyper-spheres are considered in the space of data and the triplets are sampled from inside the hyper-spheres (see Line 10 in Algorithm 1). We employ stratified sampling [19], where the classes of data are considered to be the strata. The SDP optimization, Eq. (3), is solved at every iteration using merely the sampled triplets rather than the whole data (see Line 11 in Algorithm 1). We factorize the matrix M in Eq. (1) into LL^\top using eigenvalue decomposition:

M = \Psi \Sigma \Psi^\top = \Psi \Sigma^{1/2} \Sigma^{1/2} \Psi^\top = L L^\top,   (13)

which can be done because M \succeq 0. As Eq. (1) shows, metric learning can be viewed as the Euclidean distance after projection onto the subspace spanned by the columns of L, i.e., the column space of L. Hence, the whole data are projected into the metric subspace trained by the sampled triplets (see Line 13 in Algorithm 1). Note that, to prevent the data from collapsing onto a low-rank subspace, one can slightly strengthen the diagonal of M, which results in larger eigenvalues without affecting the projection directions [30].

At every iteration, the number of hyper-spheres, denoted by n_s, and their radius, denoted by r, are determined by functions that are decreasing and increasing with respect to the iteration index, respectively. This is because, as the algorithm progresses, we want to make the hyper-spheres coarser to see more of the data, but at the same time the number of hyper-spheres should decrease so as not to have much overlap between the sampling areas. The size of the stratified sample in every hyper-sphere can also change with the iteration index because, in the late iterations, there is no need to consider all the data in a hyper-sphere but only a part of them. For the stratified sampling size, we sample a portion of each available class (i.e., stratum) within the hyper-sphere.

We initialize the radius r to a small fraction of σ, the number of hyper-spheres n_s to a fixed fraction of n (clipped to lie between a lower and an upper bound), and the sampling portion to p_τ := 1. At every iteration, the radius is increased by a fixed step Δr (a fraction of σ), the number of hyper-spheres is decreased by a fixed percentage down to a lower bound, and the sampling portion is decreased by a fixed step down to a lower bound, where σ is the average standard deviation of the data along the features.

Procedure: Hierarchical Metric Learning(X, k)
Input: X: dataset, k: number of neighbors
1:  Initialize r
2:  Initialize n_s
3:  Initialize p_τ
4:  for τ from 1 to T do
5:      r := increasing function of τ
6:      n_s := decreasing function of τ
7:      p_τ := decreasing function of τ
8:      for s from 1 to n_s do
9:          c_s ∼ range(X)   (center of the s-th hyper-sphere)
10:         {x_{i,a}, x_{i,p}, x_{i,n}}_{i=1}^{n_τ} ← draw a stratified triplet sample with sampling portion p_τ within the s-th hyper-sphere
11:     Solve optimization (3)
12:     Decompose M = LL^\top using Eq. (13)
13:     Project X onto Col(L): X ← L^\top X
Algorithm 1: Hierarchical Large Margin Metric Learning
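The following Python sketch (our own illustration, not the authors' code) shows the two purely numerical steps of Algorithm 1: class-stratified sampling of points inside one hyper-sphere, and the decomposition/projection of Eq. (13). The function names, the minimum of one sample per stratum, and the row-wise data layout are assumptions.

```python
import numpy as np

def stratified_sample_in_sphere(X, y, center, radius, portion, rng):
    """Take a class-stratified sample of the points falling inside one hyper-sphere."""
    inside = np.linalg.norm(X - center, axis=1) <= radius
    picked = []
    for c in np.unique(y[inside]):                     # classes are the strata
        idx = np.where(inside & (y == c))[0]
        m = max(1, int(np.ceil(portion * len(idx))))   # sample a portion of each stratum
        picked.append(rng.choice(idx, size=m, replace=False))
    return np.concatenate(picked) if picked else np.array([], dtype=int)

def project_with_metric(X, M):
    """Eq. (13): factor M = Psi Sigma Psi^T = L L^T with L = Psi Sigma^(1/2), then project x_i <- L^T x_i."""
    eigvals, eigvecs = np.linalg.eigh(M)               # M is symmetric positive semi-definite
    L = eigvecs @ np.diag(np.sqrt(np.clip(eigvals, 0.0, None)))
    return X @ L                                       # rows of X are data points
```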
V. EXPERIMENTAL RESULTS AND ANALYSIS

A. Datasets and Setup
In this work, we use three publicly available datasets. The first dataset is the Fisher Iris data [31], which includes 150 data points in three classes with dimensionality of 4. The second dataset is the ORL faces data [32] with 40 classes (subjects), each having 10 images. The facial images are 112 × 92 pixels. The third dataset is the MNIST digits data [33] with 28 × 28-pixel images.

The Iris dataset was randomly split into train, validation, and test sets. In the ORL dataset, the first six faces of every subject made the training data and the rest of the images were split into test and validation sets. A subset of MNIST with 400-100-100 images was also taken for train-validation-test. Note that the SDP in large margin metric learning cannot handle very large datasets due to the slow pace of optimization. The ORL dataset was further projected onto the 15 leading eigenfaces [34] as pre-processing [24]. The validation set was used for determining the optimal values of k and c. The MNIST data were also projected onto the principal component analysis subspace with dimensionality 30.
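As a small illustration of this pre-processing step (not the authors' exact pipeline), such dimensionality reduction can be done with scikit-learn's PCA; the component counts follow the text, while the variable names and the dummy stand-in data are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Dummy stand-ins for the training matrices (rows are flattened images).
X_orl_train = np.random.rand(240, 112 * 92)    # 40 subjects x 6 training faces each
X_mnist_train = np.random.rand(400, 28 * 28)   # 400 training digits

# ORL: project onto the 15 leading eigenfaces; MNIST: onto 30 principal components.
X_orl_15 = PCA(n_components=15).fit_transform(X_orl_train)
X_mnist_30 = PCA(n_components=30).fit_transform(X_mnist_train)
```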
B. Comparison of Triplet Mining Methods in the Non-Hierarchical and Hierarchical Approaches

For each dataset, we report the accuracy of the k-nearest neighbor classification using the Mahalanobis distance for the different triplet mining methods. Table I reports the accuracies and run-times for the Iris, ORL faces, and MNIST datasets.

In all datasets, k-BH obtained the highest accuracy in the non-hierarchical approach, whereas in the hierarchical approach, k-BSH obtained a top accuracy. The reason for k-BH and k-BSH to have acceptable performance is the use of the hard (near) negative instances in training, which helps avoid overfitting to the training data. In the ORL faces data, the best accuracy is for k-BH and k-EPHN. This is because, in both of these methods, the hardest negative instances are used for training, again helping to avoid overfitting. For the same reason, k-BSH has the second best performance on this dataset. Moreover, we see that the result of k-NS is acceptable on this data, which is due to the effectiveness of the probability distribution used for sampling from the negative instances. This distribution was recently proposed for Siamese training [18]; however, the results show that it is also effective for triplet mining in large margin metric learning.

In the case of the Iris data, due to the small size and simplicity of the dataset, the accuracies are all perfect in the hierarchical approach. In this approach, for the ORL and MNIST datasets, the highest accuracies are for k-BSH, which can be interpreted as explained above. As is evident in the table, the hierarchical approach either outperforms the non-hierarchical approach (due to model averaging) or has comparable results with much less consumed time.

In the non-hierarchical approach, we tested k-BA merely on the Iris dataset because the two other datasets are too large for k-BA, as it considers all the negative instances. For the same reason, it is very time-consuming; hence, the longest time belongs to k-BA in Table I. For the ORL and MNIST datasets, the longest times belong to k-HPEN and k-BSH, respectively, mainly due to handling the hard cases in optimization. As the table shows, the hierarchical approach is scalable and much faster because of sampling. For this reason, we could run k-BA efficiently for all three datasets in this approach. Note that the computer used for the simulations had an Intel Core-i7 CPU at 1.80 GHz with 16 GB of RAM.

TABLE I
COMPARING ACCURACIES AND RUN-TIME OF THE PROPOSED TRIPLET MINING METHODS IN BOTH NON-HIERARCHICAL AND HIERARCHICAL METRIC LEARNING FOR NEAREST NEIGHBOR CLASSIFICATION.

Dataset    | Approach          | Measure      | k-BA   | k-BH   | k-BSH  | k-HPEN | k-EPEN | k-EPHN | k-NS
Iris       | Non-Hierarchical  | Accuracy (%) | 72.73  | 100    | 86.36  | 95.45  | 81.82  | 95.45  | 72.73
Iris       | Non-Hierarchical  | Time (sec)   | 832.85 | 5.51   | 6.62   | 4.77   | 5.34   | 5.11   | 5.06
Iris       | Hierarchical      | Accuracy (%) | 100    | 100    | 100    | 100    | 100    | 100    | 100
Iris       | Hierarchical      | Time (sec)   | 23.73  | 9.72   | 4.54   | 7.25   | 4.73   | 5.05   | 4.64
ORL Faces  | Non-Hierarchical  | Accuracy (%) | –      | 85.00  | 78.75  | 72.50  | 75.00  | 85.00  | 77.50
ORL Faces  | Non-Hierarchical  | Time (sec)   | –      | 16.13  | 18.61  | 19.59  | 19.19  | 16.31  | 19.05
ORL Faces  | Hierarchical      | Accuracy (%) | 76.25  | 76.25  | 81.25  | 78.75  | 78.75  | 81.25  | 63.75
ORL Faces  | Hierarchical      | Time (sec)   | 0.39   | 0.93   | 0.79   | 4.36   | 1.07   | 0.95   | 0.39
MNIST      | Non-Hierarchical  | Accuracy (%) | –      | 82.00  | 79.00  | 82.00  | 78.00  | 82.00  | 78.00
MNIST      | Non-Hierarchical  | Time (sec)   | –      | 122.21 | 182.13 | 152.89 | 173.18 | 135.64 | 170.33
MNIST      | Hierarchical      | Accuracy (%) | 71.00  | 77.00  | 79.00  | 81.00  | 75.00  | 78.00  | 79.00
MNIST      | Hierarchical      | Time (sec)   | 27.17  | 1.56   | 1.55   | 0.49   | 1.02   | 1.70   | 1.55

Fig. 2. The top ten ghost faces in different triplet mining methods (rows: k-BH, k-BSH, k-HPEN, k-EPEN, k-EPHN, k-NS).

C. Comparison of Triplet Mining Methods By Ghost Faces
As Eq. (13) shows, metric learning can be viewed as the Euclidean distance after projection onto the subspace spanned by the columns of L. In the eigenvalue decomposition, the eigenvectors and eigenvalues are sorted from leading to trailing.

Inspired by eigenfaces [34] and Fisherfaces [35], for large margin metric learning, we can visualize the eigen-subspaces (column spaces of L) for the facial dataset in order to display the ghost faces. Here, we consider the top ten columns of L. The ghost faces of the ORL face dataset are depicted in Fig. 2. As seen in this figure, the k-NS features are the most discriminative, distinguishing the different classes using various extracted features including the eyes, eyebrows, cheeks (for eyeglasses), chin, hair, and nose. In second place after k-NS, the k-BSH, k-HPEN, and k-EPEN features are diverse enough (including the eyes, cheeks, nose, and hair) to discriminate the classes. The k-BH and k-EPHN features have mostly concentrated on the eyes and eyebrows. This makes sense because many of the subjects in the ORL face dataset wear eyeglasses.
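For reference, the following is a minimal sketch (our own, not the authors' code) of how the leading columns of a projection matrix can be reshaped into ghost-face images in the spirit of eigenfaces; it assumes a hypothetical matrix L whose columns live in the original pixel space with the ORL image dimensions, whereas in the paper the faces are first reduced to 15 eigenfaces and would have to be mapped back through that basis.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_ghost_faces(L, img_shape=(112, 92), n_faces=10):
    """Reshape the top columns of the projection matrix L into images ("ghost faces")."""
    fig, axes = plt.subplots(1, n_faces, figsize=(2 * n_faces, 2))
    for i, ax in enumerate(axes):
        ax.imshow(L[:, i].reshape(img_shape), cmap="gray")  # i-th projection direction as an image
        ax.axis("off")
    plt.show()
```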
VI. CONCLUSION AND FUTURE DIRECTION

Large margin metric learning for nearest neighbor classification makes use of SDP optimization, which is very slow and computationally expensive because of the interior point optimization method, especially when the data scale up. In this paper, inspired by the state-of-the-art triplet mining techniques for Siamese network training, we proposed and analyzed several triplet mining methods for large margin metric learning. These triplet mining methods make the set of triplets smaller by limiting the instances to the most important ones. This speeds up the optimization and makes it more efficient. The proposed triplet mining techniques were k-BA, k-BH, k-BSH, k-HPEN, k-EPEN, k-EPHN, and k-NS. Moreover, we suggested a new hierarchical approach which, in combination with the triplet mining methods, reduces the training time considerably and makes the method scalable. Our experiments on three publicly available datasets verified the effectiveness of the proposed approaches. A possible future direction is to try the proposed hierarchical approach using stratified sampling on other subspace learning methods.

REFERENCES
[1] B. Kulis et al., "Metric learning: A survey," Foundations and Trends in Machine Learning, vol. 5, no. 4, pp. 287–364, 2013.
[2] A. Globerson and S. T. Roweis, "Metric learning by collapsing classes," in Advances in Neural Information Processing Systems, 2006, pp. 451–458.
[3] B. Ghojogh, M. Sikaroudi, S. Shafiei, H. R. Tizhoosh, F. Karray, and M. Crowley, "Fisher discriminant triplet and contrastive losses for training Siamese networks," in IEEE International Joint Conference on Neural Networks (IJCNN), 2020.
[4] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning. Springer Series in Statistics New York, 2001, vol. 1, no. 10.
[5] K. Q. Weinberger, J. Blitzer, and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," in Advances in Neural Information Processing Systems, 2006, pp. 1473–1480.
[6] K. Q. Weinberger and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," Journal of Machine Learning Research, vol. 10, no. Feb, pp. 207–244, 2009.
[7] L. Vandenberghe and S. Boyd, "Semidefinite programming," SIAM Review, vol. 38, no. 1, pp. 49–95, 1996.
[8] B. Alipanahi, M. Biggs, A. Ghodsi et al., "Distance metric learning vs. Fisher discriminant analysis," in Proceedings of the 23rd National Conference on Artificial Intelligence, vol. 2, 2008, pp. 598–603.
[9] K. Q. Weinberger and L. K. Saul, "Unsupervised learning of image manifolds by semidefinite programming," International Journal of Computer Vision, vol. 70, no. 1, pp. 77–90, 2006.
[10] R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality reduction by learning an invariant mapping," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 2006, pp. 1735–1742.
[11] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
[12] E. Hoffer and N. Ailon, "Deep metric learning using triplet network," in International Workshop on Similarity-Based Pattern Recognition. Springer, 2015, pp. 84–92.
[13] M. Ye, X. Zhang, P. C. Yuen, and S.-F. Chang, "Unsupervised embedding learning via invariant and spreading instance feature," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6210–6219.
[14] J. Goldberger, G. E. Hinton, S. T. Roweis, and R. R. Salakhutdinov, "Neighbourhood components analysis," in Advances in Neural Information Processing Systems, 2005, pp. 513–520.
[15] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh, "No fuss distance metric learning using proxies," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 360–368.
[16] H. Xuan, A. Stylianou, and R. Pless, "Improved embeddings with easy positive triplet mining," in The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 2474–2482.
[17] S. Boyd, S. P. Boyd, and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[18] C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl, "Sampling matters in deep embedding learning," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2840–2848.
[19] V. Barnett, Elements of Sampling Theory. English Universities Press, London, 1974.
[20] G. Claeskens, N. L. Hjort et al., "Model selection and model averaging," Cambridge Books, 2008.
[21] B. Ghojogh and M. Crowley, "The theory behind overfitting, cross validation, regularization, bagging, and boosting: tutorial," arXiv preprint arXiv:1905.12787, 2019.
[22] S. Ding, L. Lin, G. Wang, and H. Chao, "Deep feature learning with relative distance comparison for person re-identification," Pattern Recognition, vol. 48, no. 10, pp. 2993–3003, 2015.
[23] A. Hermans, L. Beyer, and B. Leibe, "In defense of the triplet loss for person re-identification," arXiv preprint arXiv:1703.07737, 2017.
[24] I. Jolliffe, Principal Component Analysis. Springer, 2011.
[25] M. Sikaroudi, B. Ghojogh, A. Safarpoor, F. Karray, M. Crowley, and H. Tizhoosh, "Offline versus online triplet mining based on extreme distances of histopathology patches," arXiv preprint arXiv:2007.02200, 2020.
[26] H. R. Tizhoosh, "Opposition-based learning: a new scheme for machine intelligence," in International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC'06), vol. 1. IEEE, 2005, pp. 695–701.
[27] E.-G. Talbi, Metaheuristics: From Design to Implementation. John Wiley & Sons, 2009, vol. 74.
[28] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. MIT Press, 2009.
[29] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[30] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K.-R. Mullers, "Fisher discriminant analysis with kernels," in Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop. IEEE, 1999, pp. 41–48.
[31] D. Dua and C. Graff, "UCI machine learning repository," 2017. [Online]. Available: http://archive.ics.uci.edu/ml
[32] AT&T Laboratories Cambridge, "ORL face dataset," 2001. [Online]. Available: http://cam-orl.co.uk/facedatabase.html
[33] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[34] M. Turk and A. Pentland, "Face recognition using eigenfaces," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1991, pp. 586–587.
[35] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 711–720, 1997.