Hyperspherical embedding for novel class classification
Rafael S. Pereira (DEXL-LNCC, Petropolis, Brazil)
Alexis Joly (Zenith-INRIA, Montpellier, France)
Patrick Valduriez (Zenith-INRIA, Montpellier, France)
Fabio Porto (DEXL-LNCC, Petropolis, Brazil)
1. ABSTRACT
Deep learning models have become increasingly useful in many different industries. In the domain of image classification, convolutional neural networks have proved able to learn robust features for the closed set problem, as shown on many different datasets, such as MNIST [9], Fashion-MNIST [20], CIFAR-10 and CIFAR-100 [8], and ImageNet [2]. These approaches use deep neural networks with dense layers and softmax activation functions in order to learn features that can separate classes in a latent space. However, this traditional approach is not useful for identifying classes unseen in the training set, known as the open set problem [21]. A similar problem occurs in scenarios involving learning on small data. To tackle both problems, few-shot learning [19] has been proposed. In particular, metric learning learns features that obey constraints of a metric distance in the latent space in order to perform classification. However, while this approach proves useful for the open set problem, state-of-the-art implementations require pairwise training, where both positive and negative examples of similar images are presented during the training phase, which limits their applicability in large-data or large-class scenarios, given the combinatorial nature of the possible inputs [17]. In this paper, we present a constraint-based approach applied to the representations in the latent space under the normalized softmax loss, proposed by [18]. We experimentally validate the proposed approach for the classification of unseen classes on different datasets, using both metric learning and the normalized softmax loss, in disjoint and joint scenarios. Our results show that our proposed strategy not only can be trained efficiently on a larger set of classes, as it does not require pairwise learning, but also yields better classification results than the metric learning strategies, surpassing their accuracy by a significant margin.
2. INTRODUCTION
Humans have the ability to identify many different types of objects [3]. Even when we are not able to name a certain object, we can tell its differences from a second object, which contributes to identifying objects we have never seen before and grouping them into classes based on prior knowledge.
Metric learning [7] is a well-adopted approach that identifies novel classes without fine-tuning a model on these classes. The approach applies an optimization strategy which guarantees that the classes a model has seen during optimization form disjoint clusters in the latent space according to a certain metric distance. Some common approaches that use this strategy are the triplet loss [11], contrastive loss [5], prototypical networks [13], constellation loss [10], and matching networks [16]. Another approach in metric learning is called similarity learning, where the model receives pairs of inputs and learns that they are similar if they belong to the same class, or dissimilar otherwise, as discussed in [14]. During inference on novel classes, the former uses the labeled points of a novel class the model was not optimized upon to obtain a representation in the latent space for that class, and then calculates the distance between new points and each class representation. In the latter approach, a similarity score is calculated between every (class, query) pair in order to find the most similar pair. However, while enforcing metric properties on the latent space leverages the model's knowledge for novel classes, it requires pairwise learning, which limits the scalability of such approaches given the number of possible pairs. In this paper we consider the normalized softmax loss (NSL), proposed by [18], and show how it enforces a latent space that obeys the cosine similarity. Based on this, we then present a methodology to apply the
NSL to the novel class classification problem. Considering a trained artificial neural network, we add a new neuron to its last layer and infer the weights that connect the penultimate layer of the network to this neuron. The connection and the new neuron are used to classify a novel class from a few labeled samples of it. Our approach to the open set problem allows us to classify new classes without fine-tuning the model: we use the same network parameters the model was optimized upon to classify its seen classes, only adding a new neuron along with its inferred connection. We evaluate state-of-the-art approaches to the open set problem against our proposed approach, both in the disjoint and joint scenarios, as defined in Section 4.4, for different datasets. The experimental results show that our approach outperforms other metric learning strategies and, additionally, induces a more scalable training process, as it does not require pairwise learning, enabling the open set technique to deal with large datasets.
The remainder of this paper is structured as follows. First, we present theoretical background in Section 4. Our methodology and how we classify new classes are described in Section 5. Next, we present results on the joint and disjoint open set problems in Section 6. Moreover, we present the use of the NSL approach on a more complex dataset in the field of botany in Section 7. Finally, we conclude in Section 8.
3. PROBLEM FORMALIZATION
Let there be a dataset D composed of a set of examples x ∈ X, such that X ⊆ R^N, where N is the number of features describing each instance x. A classification task aims to build a function that takes the features describing each example and projects the object into a space R^k, where k is the number of classes in Y. We want to build a function Y = φ(X), with Y ⊆ R^M, modelled by a neural network, where features F characterizing examples in X sharing the same class y are clustered together according to a metric distance in Y. We aim to use this metric property for the identification of novel classes. Furthermore, we refer to the mean point, in a given metric space, of the points of a class as the class prototype [13], which represents a virtual class centroid. To model the function as a neural network and enforce the metric properties on Y, we define the network with two modules. The first is an embedding module that takes F as input and returns a representation F′ in Y. This representation is fed to a classification module composed of k neurons, where k is the total number of classes in Y. The modules are connected by a matrix of learnable weights W of size (M × k). To enforce the metric space, we constrain the optimization procedure to guarantee that the vector of features extracted from x shares high cosine similarity with the vector of weights w for the class y to which x is classified.
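As an illustration of this two-module decomposition, the Keras sketch below wires an embedding module to a k-neuron classification head through an (M × k) weight matrix. All layer sizes and the input shape are illustrative assumptions, not the architecture used in our experiments.

```python
import tensorflow as tf

M, k = 64, 10  # latent dimension and number of training classes (illustrative)

# Embedding module: projects the input features F into the latent space Y.
embedding = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(M),
])

# Classification module: k neurons, connected to the embedding through the
# learnable weight matrix W of size (M x k).
classifier = tf.keras.layers.Dense(k, activation="softmax")

model = tf.keras.Sequential([embedding, classifier])
```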
4. PRELIMINARIES
4.1 The open set problem
The classification problem can be formulated as a closed set or an open set problem. In the closed set context, the optimization process trains a model to learn features that can identify the classes whose samples are present in the training set. The approach does not require the identification of classes absent from the training set. This is commonly tackled using the softmax cross-entropy loss [6][12][15]. In contrast, in the open set problem we are interested not only in identifying the classes present in the training set, but also in using the model to classify new classes by exploiting properties of the latent space yielded during optimization.
Deep neural networks learn a set of transformations and relations among the inputs in order to obtain the desired output during optimization. The network can be broken into two parts: the projection module, which takes the data and transforms it into a representation in the latent space, and the processing of that representation, applied by the layer implementing the desired task. The latter enforces the latent space to have properties defined by the task. For example, the regression task is defined as y = Σ_i w_i x_i + b, so, given a fixed w, if x ≈ x′ then y ≈ y′. In the classification problem, the softmax cross-entropy approach enforces the inequality w_i · x_i + b_i >> w_j · x_i + b_j, where i, j represent different classes [18]. A proposed alternative that optimizes the latent space directly, and can enforce metric properties that allow the model to be used for novel classes, is known as metric learning.

4.2 Metric learning
The metric learning approach learns a set of features that obey a metric distance in the latent space. The model can be optimized to learn a similarity metric between pairs, as proposed in [14], or can enforce the latent space to obey a predefined metric distance such as the Euclidean distance or cosine similarity. Some strategies, such as the contrastive loss [5], learn on pairs of data, while others, like the triplet loss [11], learn on triplets. Optimization in these approaches aims to obtain disjoint clusters for each class of interest in the latent space, according to a predefined metric distance. As a consequence, classification can be performed for novel classes by taking the representation of an anchor example and calculating the metric distance between a query point and the anchor.

4.3 Normalized softmax loss
Proposed in [18], the NSL (Normalized Softmax Loss) is a modification of the softmax loss that enforces a cosine similarity metric between classes in the latent space. It enforces the features projected into the latent space to be contained in a hypersphere (N > 3), where each region of the sphere contains features belonging to a certain class. To discuss it, we first present the usual loss function adopted in classification tasks, the softmax loss, which corresponds to the use of the softmax activation function along with cross-entropy, in Equation 1. It presents the value of the activation for a class i in a classification problem of N classes.

$$G_i(x) = \frac{e^{w_i \cdot x + b_i}}{\sum_{j=1}^{N} e^{w_j \cdot x + b_j}} \qquad (1)$$

Since 0 ≤ G_i(x) ≤ 1, the optimization of this function aims to obtain w and b so that w_i · x_i + b_i >> w_j · x_i + b_j for all i ≠ j, in order to maximize G_i(x_i) while minimizing G_i(x_j). This approach has been extensively used in the context of the closed set problem [15][12][6], but lacks generalization capability for the open set problem. Given this, the NSL contribution lies in constraining the softmax function. Those constraints come in the form of defining |b| = 0, |w| = 1 and |x| = S, S ∈ R. They enforce that the weights w must be contained in the surface of a hypersphere of radius one, while x is contained in the surface of a second hypersphere of radius S; both hyperspheres are of dimension N, equal to the number of neurons in the penultimate layer. Equation 2 presents what those constraints imply for the optimization process and the nature of the extracted features. A visual example of the features and weights of the NSL can be seen in Figures 2 and 3.

$$
\begin{aligned}
w_i \cdot x_i + b_i &\gg w_j \cdot x_i + b_j \\
w_i \cdot x_i + 0 &\gg w_j \cdot x_i + 0 \\
|w_i||x_i|\cos(\theta(w_i, x_i)) &\gg |w_j||x_i|\cos(\theta(w_j, x_i)) \\
S\cos(\theta(w_i, x_i)) &\gg S\cos(\theta(w_j, x_i))
\end{aligned}
\qquad (2)
$$

Since S is a hyperparameter, optimization happens on θ. In turn, cos(θ) is a function ranging in the interval [-1, 1]. In this context, to maximize the left side of Equation 2 the angle between x_i and w_i must be close to 0, while to minimize the right side of Equation 2, θ(w_j, x_i) must become close to π. Those values would be achieved if the loss value reached zero.

4.4 Joint and disjoint scenarios
In this section, we briefly define the joint and disjoint scenarios, as well as the seen and unseen classes for both. First, consider the dataset D with a set of classes C. From C we take a subset of classes C′ that a model will see during its training phase, and its complement C∗ that the model does not see, so that C = C′ ∪ C∗. In the disjoint scenario, a neural network is optimized on C′. After optimization, we sample a subset of C∗ and use the knowledge obtained from C′ to classify only examples from this subset. The seen classes are those contained in C′, while the unseen classes are the subset of C∗. Metrics are calculated on the test set, and the method used to perform this transfer of knowledge is specific to each approach; the ones we explore in this work are discussed in Sections 5.1 and 5.2. The joint scenario follows mostly the same methodology as the disjoint scenario. The major difference is that during inference we do not discard the classes used during optimization; instead, the model performs inference over the union of C′ and the subset of C∗ as possible answers. This scenario assumes we wish to extend a model in use to novel classes without retraining. The disjoint scenario assumes that the problem of interest may not have much data, and that it is preferable to optimize on a related dataset with larger amounts of data.
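To make the NSL constraints concrete, the following is a minimal sketch (not the authors' code) of a classification head that removes the bias term, L2-normalizes weights and features, and scales the resulting cosine by S, as in Equation 2. The value S = 16 and the layer name are illustrative assumptions.

```python
import tensorflow as tf

class NormalizedSoftmaxHead(tf.keras.layers.Layer):
    """Cosine classification head under the NSL constraints:
    |b| = 0, |w| = 1 and |x| = S (Equation 2)."""

    def __init__(self, n_classes, scale=16.0, **kwargs):
        super().__init__(**kwargs)
        self.n_classes = n_classes
        self.scale = scale  # the hyperparameter S (feature hypersphere radius)

    def build(self, input_shape):
        # One weight vector per class and no bias term.
        self.w = self.add_weight(name="class_weights",
                                 shape=(int(input_shape[-1]), self.n_classes),
                                 initializer="glorot_uniform", trainable=True)

    def call(self, x):
        x = tf.math.l2_normalize(x, axis=-1)       # project features onto the unit sphere
        w = tf.math.l2_normalize(self.w, axis=0)   # |w_i| = 1 for every class i
        cosine = tf.matmul(x, w)                   # cos(theta(w_i, x)) for every class
        return tf.nn.softmax(self.scale * cosine)  # logits are S * cos(theta)
```

Training then proceeds with the ordinary categorical cross-entropy over these outputs; only the angles θ are optimized, since S is fixed.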
5. METHODOLOGY
In this paper we aim to compare pairwise strategies, commonly used in metric learning, against the normalized softmax loss approach for the open set problem. We consider both the problem where seen and unseen classes are disjoint during inference, and the scenario where the model must identify both seen and unseen classes together. To guarantee that the extracted features for unseen classes are robust, the model does not treat them the same as the seen classes. In the first scenario, we optimize a model using N classes from the training set. After optimization runs for a number of epochs (e.g., 5), we evaluate the model accuracy on the N seen classes and on the M unseen classes of the test set, using their training set to infer the weights for the novel classes. The approach implementing the metric learning strategies is discussed in Section 5.1, while the one applying the normalized softmax loss is presented in Section 5.2.
5.1 Metric learning
When tackling the open set problem, we are interested in optimizing models in which the full knowledge the network obtains during optimization can be exploited for classes outside of the training set. The usual softmax cross-entropy approach lacks the ability to extract features that obey this property: the weights w between the penultimate layer and the classification layer are as important as the representation x of the penultimate layer in the latent space, as seen in Equation 1, and the former is undefined for novel classes. Usual approaches for classifying novel classes are explored in metric learning [5][11][14][16], where model optimization occurs by learning a metric space that creates clusters based on a metric distance [5][11][16], or by learning whether two examples are similar or dissimilar [14]. These strategies allow classification to be performed for classes outside of the training set, as the representation in the latent space can be exploited via a K-nearest-neighbours approach. However, all these strategies are based on pairs or triplets, which limits their scalability when optimizing with larger amounts of data per class, or with a large number of classes in a small data scenario. In our work, to classify new classes, we adopt the proposal of [13], which builds a single representation in the latent space for each class and then builds a K-nearest-neighbour (KNN) model with the same metric distance the model was optimized on. To perform predictions on the test set, we extract the sample representation in the latent space and feed it to the KNN model.

5.2 Weight inference with the normalized softmax loss
As already discussed in Section 4.3, the NSL learns a metric space that enforces cosine similarity. To minimize its loss function it must satisfy Equation 2. To maximize the left side of Equation 2, Equation 3 must be satisfied, whereas to minimize the right side of Equation 2, Equation 4 must be satisfied, for all class pairs i, j. In the limit, when the loss value goes to zero, both Equations 3 and 4 must hold. In the specific case where the angle between w_i and x_i is close to zero, Equation 5 is satisfied, since x_im/|x_im| ≈ w_i when Equation 3 holds, where x_im is the feature vector of an example m belonging to class i and w_i is the weight vector that connects the penultimate layer to the neuron associated with class i. Given this relation, we propose Equation 6 as a form of inferring the weights that connect the penultimate layer of the neural network to a neuron representing a novel class, using labeled samples from this novel class, in a manner similar to [13]. A geometrical representation showing the relationship between these weights and the feature vectors of the latent space is given in Figures 2 and 3. One can see that even when the loss does not reach zero, the center of the feature vectors is aligned with the corresponding class weights. We also show that Equation 6 has a small weight inference error at non-zero losses. To do so, we optimized different models on subsets of the CIFAR-100 dataset, calculated the cosine distance matrix between the original class weights and the inferred weights, and present the relative error of the reconstruction in Figure 4. In the equations below, x_im denotes example m of class i.
$$\frac{w_i \cdot x_{im}}{|w_i||x_{im}|} \approx 1 \qquad (3)$$

$$\frac{w_i \cdot x_{jn}}{|w_i||x_{jn}|} \approx -1 \qquad (4)$$

$$\frac{x_{im} \cdot x_{in}}{|x_{im}||x_{in}|} \approx 1 \qquad (5)$$

$$w_i = \frac{1}{N}\sum_{m=1}^{N} \frac{x_{im}}{|x_{im}|} \qquad (6)$$

A dataflow depicting our approach to infer the weights for novel classes is presented in Figure 1.

Figure 1: Diagram presenting the approach to infer weights for the decision layer for new classes.
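A minimal NumPy sketch of this inference step (Equation 6), assuming an already trained embedding module, is given below; the names embedding, W and novel_class_images are hypothetical.

```python
import numpy as np

def infer_class_weights(latent_vectors):
    """Equation 6: the weight vector of a novel class is the mean of the
    L2-normalized latent representations of its labeled samples."""
    norms = np.linalg.norm(latent_vectors, axis=1, keepdims=True)
    return (latent_vectors / norms).mean(axis=0)

# Hypothetical usage: extend a trained NSL model with one new class.
# latent = embedding.predict(novel_class_images)    # shape (n_samples, M)
# w_new = infer_class_weights(latent)               # shape (M,)
# W = np.concatenate([W, w_new[:, None]], axis=1)   # W grows from (M, k) to (M, k + 1)
```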
6. RESULTS
6.1 Experimental setup
In this section we present the experimental setup. All experiments for the Fashion-MNIST and CIFAR datasets were performed using Google Colaboratory and its GPU instances. Experiments using the Pl@ntNet dataset were performed on a Dell PowerEdge R730 server with 2 Intel(R) Xeon(R) E5-2690 v3 CPUs @ 2.60GHz and 768 GB of RAM, running Linux CentOS 7.7.1908, kernel version 3.10.0-1062.4.3.el7.x86_64. The machine is equipped with a single NVIDIA Pascal P100 GPU with 16 GB of RAM. Implementations use Python 3.7 along with the Keras deep learning library.
6.2 Influence of the number of samples
A common scenario in which the identification of unseen classes appears is one where the amount of available data for the classes of interest is small, or where optimizing another model to include the new classes is too costly. Therefore, all strategies discussed in this paper classify new classes based on labeled examples, without retraining. The influence of the number of samples needed to perform classification tasks using the NSL approach is shown on three different datasets in Figure 5, for a model optimized for 30 epochs. We define the class prototype in the same way as [13].

Figure 2: Embedding obtained on the CIFAR-10 dataset when using a latent space with two dimensions; each color represents a different class. Inner points are the class weights, while outer points come from the training set. Notice how classes on the outer circle are aligned with the inner circle.

Figure 3: Embedding obtained on the Pl@ntNet dataset, discussed in Section 7, when using a latent space with two dimensions.

Figure 4: Relative error for rebuilding the optimized weights using Equation 6 on the CIFAR-100 dataset when optimizing on different numbers of classes, over 30 different runs.

Figure 5: F1 score on the test set for three different datasets (CIFAR-10, Fashion-MNIST, MNIST), where weights are: (a) Trained: the weights found during optimization; (b) Single anchor: weights inferred from a single random anchor example; and (c) Class prototype: weights inferred using the training set to build a class prototype.
As can be seen in Figure 5, weights inferred using Equation 6 on the training set (c) maintain the same model accuracy as the weights obtained during optimization. We also observe that model quality, when inferring from a single example, decays with task complexity. The x axis shows how the class weights were obtained: class prototype uses the whole training set to infer the weights according to our methodology, while single anchor uses a single random example from the training set. The y axis presents the F1 score on the test set.
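The two inference modes compared in Figure 5 differ only in how many labeled samples are fed to Equation 6. Reusing the hypothetical infer_class_weights helper sketched in Section 5.2, the comparison could be set up as follows; n_samples=1 reproduces the single-anchor setting, while n_samples=None uses the whole training set (class prototype).

```python
import numpy as np

def infer_from_subset(embedding_model, class_images, n_samples=None):
    """Infer novel-class weights from all labeled samples (class prototype)
    or from a random subset (n_samples=1 is the single-anchor setting)."""
    if n_samples is not None:
        idx = np.random.choice(len(class_images), size=n_samples, replace=False)
        class_images = class_images[idx]
    latent = embedding_model.predict(class_images)
    return infer_class_weights(latent)  # Equation 6, sketched in Section 5.2
```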
6.3 Ablation study
In this section, we present in Table 1 an ablation study comparing the NSL against the softmax loss applied to the same network architecture. We define the following deep neural network blocks: an Inception block is a single Inception module, as proposed in [15]; a VGG block is a composition of two convolutional layers with 3x3 kernels and a 2x2 max-pooling layer; and a ResNet block is built with a 3x3 convolution with a residual connection. For the number of filters, we increase them in the order < , , >.

Architecture   Number of blocks   SL      NSL
Inception      1                  0.473
Inception      2                  0.47
Inception      3                  0.42
ResNet         1                  0.1
ResNet         2                  0.1
ResNet         3                  0.1
VGG            1                  0.175
VGG            2                  0.187
VGG            3                  0.643

Table 1: Accuracy with the softmax loss (SL) and the normalized softmax loss (NSL) for different architectures and depths.
In Table 1 we compare the normalized softmax loss against the softmax loss on different architectures. We note that for all architecture variations, as well as variations in network depth, the normalized softmax loss outperformed the softmax loss. We also note that with the softmax loss, the choice of base architecture (Inception, ResNet, VGG) influenced model accuracy as much as depth did. In contrast, the normalized softmax loss was influenced more by model depth on this dataset.
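For illustration, the VGG and ResNet blocks described above could be sketched as follows. The filter counts are left as parameters, since the original values were lost, and the 1x1 projection on the shortcut is our assumption to make the residual shapes match.

```python
from tensorflow.keras import layers

def vgg_block(x, filters):
    """Two 3x3 convolutions followed by 2x2 max pooling."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.MaxPooling2D(2)(x)

def resnet_block(x, filters):
    """A 3x3 convolution with a residual (skip) connection; the 1x1
    projection aligns channel counts and is an assumption of this sketch."""
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.add([shortcut, x])
```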
6.4 Disjoint scenario
In this section we show the results of evaluating model accuracy on the test set for seen and unseen classes, employing a VGG-based model with two blocks. To perform the experiment on a dataset with K classes, we optimize the model with K − N classes and use the network structure to classify the N unseen classes, without considering the seen ones as possible answers. The approach to do so was discussed in Section 5. Results are presented in Tables 2 and 3.

Table 2: Model results for the CIFAR-10 dataset. N refers to the number of unseen classes. NSL shows model accuracy on unseen classes when a model is optimized on 10 − N classes and only identifies N. Triplet and Contrastive refer to the same problem using the triplet and contrastive loss, respectively.

Table 3: Model results for the Fashion-MNIST dataset. N refers to the number of unseen classes. The NSL column shows model accuracy on unseen classes when a model is optimized on 10 − N classes and only identifies N. Triplet and Contrastive refer to the same problem using the triplet and contrastive loss, respectively.

In Tables 2 and 3, we compare the NSL against two metric learning strategies in a disjoint setting. In the first line, we present a scenario where we train with 8 random classes and evaluate on the other two; the second line trains with 7 classes, and so forth. Our results show that on both datasets the NSL outperformed these metric learning strategies when evaluating novel classes in a disjoint scenario.

6.5 Joint scenario
In this section we present results where the novel classes must be integrated into the classification process along with the classes used for optimization. The model is optimized with 10 − N classes and we evaluate its accuracy on these and on the N unseen classes, in a scenario with 10 possible classes. Results are presented in Tables 4 and 5.

Table 4: Model results for the CIFAR-10 dataset. N refers to the number of unseen classes. Seen shows model accuracy for the seen classes after optimization for 5 epochs, and Unseen shows model performance when identifying the new classes using the NSL.

Table 5: Model results for the Fashion-MNIST dataset. N refers to the number of unseen classes. Seen shows model accuracy for the seen classes after optimization for 5 epochs, and Unseen shows model performance when identifying the new classes using the NSL.

Tables 4 and 5 depict the results of comparing our NSL-based approach with the metric learning strategies (triplet loss and contrastive loss) on both the CIFAR-10 and Fashion-MNIST datasets, considering a joint scenario. The NSL outperformed both approaches on the two evaluated datasets, on both seen and unseen class predictions.
7. CASE STUDY: THE PL@NTNET DATASET
In order to assess the approach on real-world data, we evaluate it on the closed and open set problems using the Pl@ntNet dataset. Pl@ntNet [1] is a mobile application that allows users to take pictures of different plant species; through deep learning techniques, these pictures are processed and the most probable species for each plant is returned. The dataset is challenging because the available pictures have varying levels of quality and cover many species from many different parts of the world, as shown in Figure 6. Given this, we wish to evaluate how our approach fares on the closed set problem compared to traditional cross-entropy approaches, as well as to evaluate it as an open set problem. The problem is relevant to the evaluation of the proposed approach given that the scenario is usually represented by a long-tail distribution, in which some classes are very common while others are rare and lack significant training data. The subset of the Pl@ntNet data we used was obtained from [4] and has a total of 182 classes.

Figure 6: Distribution of the training dataset (number of examples per class). Note how the data is highly imbalanced: some species are very common while others are rare.
As is clear from Figure 6, there is high imbalance among classes in the Pl@ntNet dataset. Also, many classes in the training set have very small amounts of data. Since many plant species have few samples, we are interested in exploring the performance of the NSL where a model is optimized only on the more common species, and weights to classify the uncommon species are inferred, as discussed in Section 5.2. To this end, we perform experiments by optimizing the model only on classes where the number of samples is larger than or equal to N = { , , , }, which results in C = { , , , }, where C is the number of classes, and present results for the joint and disjoint settings. Unseen classes are selected randomly among those with number of samples M, M < N, and results are presented as the average of 30 runs. All models are optimized for 100 epochs, and the weights that minimize the validation loss are used for inference. The model architecture is presented in Figure 7. Pre-processing of the data only includes normalizing it to the [0, 1] range and reshaping it to a < , , > shape.

Figure 7: Network architecture used to perform the Pl@ntNet experiments.

Table 6: Model accuracy on the test set (number of classes, NSL accuracy, SL accuracy) after optimizing for 100 epochs with weighted cross-entropy.
Models are compiled with weighted cross-entropy, to take the class imbalance into account in the loss function and ensure that solutions which output only the majority class are penalized. For the open set tasks, we report balanced accuracy to better account for class imbalance.
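One common recipe for both ingredients is sketched below; the paper does not give its exact weighting formula, so the inverse-frequency scheme here is an assumption.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def class_weights(labels):
    """Per-class weights inversely proportional to frequency, so that a model
    predicting only the majority class is penalized by the weighted loss."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Keras accepts the dictionary directly during training:
# model.fit(x_train, y_train, class_weight=class_weights(y_train))

# Balanced accuracy averages per-class recall, which mitigates the imbalance:
# score = balanced_accuracy_score(y_true, y_pred)
```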
In this section we present results on model accuracy for both the disjoint and joint open set problems, when the model was optimized on different numbers of classes. As already discussed, we selected the seen classes by filtering on the number of samples ≥ N. Given this, the relation between the number of samples and the number of classes is as follows: [25, ], [50, ], [100, ], [200, ].

In this subsection we present the disjoint analysis for the Pl@ntNet dataset. We instantiate our base model without the last layer and then perform classification on novel classes, randomly sampled from the total unseen classes, by inferring the weights between the penultimate layer and the decision layer. We report balanced accuracy for the model on novel classes, over 30 runs for each value of N. Results are presented in Figure 8.

Figure 8: Comparison of models with the same architecture, optimized on the same amount of data and number of epochs, but with different numbers of seen classes, on their ability to classify novel classes.

In Figure 8 we compare models trained on different numbers of classes with the same amount of data, by presenting their ability to identify novel classes in the disjoint scenario. Our results show that the diversity of classes seen during training made the model more robust to novel classes, as the model trained with 10 classes performed worst of all on novel classes. However, it is also important to note that optimizing on a higher number of classes is a more complex problem, requiring more data, more updates, or a more complex model to learn robust features for all seen classes. This is shown by the curve obtained from the model optimized on 43 classes, which ranks second worst. Our best result was obtained by the model with 28 classes, which had high class diversity while also learning robust features during training.

In this subsection, we present the joint analysis for the Pl@ntNet dataset. We present results where we instantiate the base model trained on N classes and then add M unseen classes, so that the model must classify among N + M classes. Weights for the M classes are inferred as described in Section 5.2, and we report accuracy for the overall model, as well as for the M classes alone. Results are presented in Figures 9 and 10.

Figure 9: Analysis for the joint scenario showing the results of different models on overall accuracy.

Figure 10: Analysis for the joint scenario showing the results of different models on the unseen classes.

In Figures 9 and 10 we present our results for the joint scenario, evaluating overall model quality as new classes are added, as well as the balanced accuracy calculated only on the unseen classes. Our conclusions from the disjoint scenario in Figure 8 also hold here: in both cases, the model trained on 28 classes has the highest balanced accuracy for the same number of total classes, and the model optimized on 43 classes shows the worst balanced accuracy. The lack of diversity of the model optimized on 10 classes can be seen to affect its quality as the number of novel classes increases, in both settings.
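A heavily simplified sketch of one joint-scenario run, reusing the hypothetical infer_class_weights helper from Section 5.2; all names and the data layout are assumptions, not the actual evaluation code.

```python
import numpy as np

def joint_run(embedding, W_seen, unseen_sets, x_test, y_test, m, seed=0):
    """One joint-scenario run: extend the decision layer with m unseen classes
    by weight inference (Equation 6), then classify over seen + unseen classes."""
    rng = np.random.default_rng(seed)
    W = W_seen  # (M x n_seen) weights learned for the seen classes
    for c in rng.choice(len(unseen_sets), size=m, replace=False):
        latent = embedding.predict(unseen_sets[c])  # labeled samples of class c
        W = np.concatenate([W, infer_class_weights(latent)[:, None]], axis=1)
    z = embedding.predict(x_test)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine needs unit features
    return ((z @ W).argmax(axis=1) == y_test).mean()  # accuracy over N + m classes
```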
8. CONCLUSION
In this paper, we showed how the normalized softmax loss can be employed for the open set problem. We presented results on different datasets for both the disjoint and joint open set problems and compared them to metric learning strategies. We showed that the NSL-based approach yields superior results, producing more robust features with a less costly optimization procedure, as it does not require pairwise training. Results on a real-world use case, a subset of the Pl@ntNet data, show how our approach can be employed to identify classes unseen during optimization, with the weights associated with the classification of new data inferred by the approach.
9. ACKNOWLEDGMENTS
The authors would like to thank Petrobras for supporting this work through the project "Development of an Intelligent Software Platform". We would also like to thank the INRIA-Brazil Associated Team cooperation project HPDaSc.
10. REFERENCES
[1] A. Affouard, H. Goëau, P. Bonnet, J.-C. Lombardo, and A. Joly. Pl@ntNet app in the era of deep learning. In ICLR: International Conference on Learning Representations, 2017.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition, CVPR09, 2009.
[3] C. Fields. Editorial: How humans recognize objects: Segmentation, categorization and individual identification. Frontiers in Psychology, 7:400, 2016.
[4] C. Garcin. Pl@ntNet dataset.
[5] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2, CVPR 2006, pages 1735-1742, USA, 2006. IEEE Computer Society.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016.
[7] M. Kaya and H. Ş. Bilge. Deep metric learning: A survey. Symmetry, 11(9), 2019.
[8] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[9] Y. LeCun, C. Cortes, and C. Burges. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
[10] A. Medela and A. Picon. Constellation loss: Improving the efficiency of deep metric learning loss functions for the optimal embedding of histopathological images. Journal of Pathology Informatics, 11(1):38, 2020.
[11] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815-823, 2015.
[12] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Y. Bengio and Y. LeCun, editors, International Conference on Learning Representations (ICLR), 2015.
[13] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 4077-4087. Curran Associates, Inc., 2017.
[14] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. S. Torr, and T. M. Hospedales. Learning to compare: Relation network for few-shot learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1199-1208, 2018.
[15] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1-9, 2015.
[16] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching networks for one shot learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29, pages 3630-3638. Curran Associates, Inc., 2016.
[17] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. NormFace: L2 hypersphere embedding for face verification. In Proceedings of the 25th ACM International Conference on Multimedia, MM '17, pages 1041-1049, New York, NY, USA, 2017. Association for Computing Machinery.
[18] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu. CosFace: Large margin cosine loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5265-5274, 2018.
[19] Y. Wang and Q. Yao. Few-shot learning: A survey. CoRR, abs/1904.05046, 2019.
[20] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017.
[21] R. Yoshihashi, W. Shao, R. Kawakami, S. You, M. Iida, and T. Naemura. Classification-reconstruction learning for open-set recognition. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.