Entropic Out-of-Distribution Detection
David Macêdo, Tsang Ing Ren, Cleber Zanchettin, Adriano L. I. Oliveira, Teresa Ludermir
Isotropy Maximization Loss and Entropic Score: Accurate, Fast, Efficient, Scalable, and Turnkey Neural Networks Out-of-Distribution Detection Based on the Principle of Maximum Entropy
David Macêdo, Member, IEEE, Tsang Ing Ren, Member, IEEE, Cleber Zanchettin, Member, IEEE, Adriano L. I. Oliveira, Senior Member, IEEE, and Teresa Ludermir, Senior Member, IEEE
Abstract—Current out-of-distribution (OOD) detection approaches require cumbersome procedures that add undesired side effects to the solution. In this paper, we argue that the low OOD detection performance of neural networks is due to the anisotropy of the cross-entropy SoftMax loss and its extreme propensity to produce low entropy (high confidence) posterior probability distributions, in direct disagreement with the Principle of Maximum Entropy. Consequently, we propose IsoMax, a loss that is isotropic (distance-based) and produces high entropy (low confidence) posterior probability distributions despite still relying on cross-entropy minimization. Additionally, we propose a speedy Entropic Score for OOD detection. IsoMax loss works as a seamless SoftMax loss drop-in replacement that keeps the overall solution accurate, fast, efficient, scalable, and turnkey. Our experiments confirmed that the OOD detection performance of neural networks may be significantly improved without relying on techniques such as adversarial training or validation, data augmentation, ensemble methods, generative approaches, model architectural changes, metric learning, or additional classifiers or regressions. The results also showed that our straightforward approach is competitive against state-of-the-art solutions while avoiding the undesired drawbacks of previous methods.
Index Terms—Isotropic Maximization Loss, Entropic Score, Accurate, Fast, Efficient, Scalable, Turnkey, Neural Networks, Out-of-Distribution Detection, Principle of Maximum Entropy
• David Macêdo, Tsang Ing Ren, Cleber Zanchettin, Adriano L. I. Oliveira, and Teresa Ludermir are with the Centro de Informática, Universidade Federal de Pernambuco, Brazil. E-mail: see [email protected]
• David Macêdo was with the Montreal Institute for Learning Algorithms, University of Montreal, Canada.

1 INTRODUCTION

NEURAL networks have been used as classifiers in a wide range of applications. Their design usually assumes that the model receives an instance of one of the training classes at inference. When this assumption holds, the neural network tends to present satisfactory performance. However, in real-world applications, this assumption is often not fulfilled. Additionally, neural networks are known to present overconfident predictions even for objects they were not trained to recognize [1].

The ability to detect whether an input applied to a neural network cannot be reliably classified is essential for critical applications in medicine, finance, agriculture, and engineering. In such situations, it is better to have a system that acknowledges that it is unable to decide. The rapid adoption of neural networks in modern applications makes the development of this capability a primary necessity from a practical point of view.

The mentioned problem has been studied under many similar points of view and nomenclatures, such as Open Set Recognition [2], [3] and Open World Recognition [4], [5]. Recently, [6] defined out-of-distribution (OOD) detection as the task of evaluating whether a sample belongs to the in-distribution on which a neural network was trained. [6] also established baseline datasets and metrics for OOD detection. Additionally, they established the baseline performance for this task by proposing an OOD detection approach that uses the maximum predicted probability as the score to detect whether an example belongs to the in-distribution.

Despite being a fundamental task, current OOD detection approaches are based on ad-hoc techniques that produce severe side effects on the solution. ODIN [7] and Mahalanobis [8] require input preprocessing, which makes inference slow and increases computational cost and energy consumption. Additionally, these solutions present hyperparameters that must be tuned using unrealistic access to out-of-distribution samples or the cumbersome process of generating adversarial examples. Adversarial training methods such as ACET [9] usually imply longer training times and reduced scalability for large-sized images.

Another major drawback present in recent OOD detection approaches is the classification accuracy drop [10], [11], which is a harmful side effect because classification is usually the primary aim of the system, whereas OOD detection is an auxiliary task [12].

In some cases, OOD detection proposals require undesired model structural changes [13] or even ensemble methods [14], [15]. Finally, there are solutions based on uncertainty or confidence estimation/calibration [16], [17], [18], [19], [20]. Despite their additional complexity, slower inference, and higher energy/computation requirements, they may present OOD detection performance worse than ODIN [11], [21].

In this paper, we argue that the low OOD detection performance of neural networks is mainly due to two factors. First, the SoftMax anisotropy, which does not concentrate high-level representations in the feature space, making OOD detection difficult [9].
Second, the propensity of the cross-entropy loss to generate extremely overconfident (low entropy) posterior probability distributions, which is in direct conflict with the Principle of Maximum Entropy. Throughout this work, we further develop those claims with both theoretical motivations and experimental results.

Hence, we propose IsoMax, a loss that is isotropic (distance-based) and generates inferences with high mean entropy posterior probability distributions, in agreement with the Principle of Maximum Entropy. Our principled approach is accurate, fast, efficient, scalable, and turnkey, besides producing competitive performance. Additionally, our solution works as a seamless SoftMax loss drop-in replacement, which facilitates its incorporation into current and future projects. Unlike most contemporary approaches, our proposal is viable from an economical and environmental point of view. Furthermore, detection is a speedy procedure that can be achieved by a straightforward negative entropy calculation.
2 BACKGROUND
ODIN was proposed in [7] by combining SoftMax input preprocessing and temperature calibration. Despite significantly outperforming the baseline, the input preprocessing introduced in ODIN considerably increases the inference delay by requiring a backpropagation operation and a second inference to perform the final prediction on a single sample. Considering that backpropagation is typically slower than inference, input preprocessing makes ODIN inference at least three times slower. Additionally, input preprocessing also makes the inference power consumption at least three times higher, which is a severe limitation from an economical and environmental perspective [22]. Several subsequent OOD detection proposals incorporated input preprocessing and its associated drawbacks [7], [8], [11], [23].
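To make the cost of these two techniques concrete, the sketch below illustrates ODIN-style scoring in PyTorch. This is our own illustration, not the reference implementation: the function name (odin_score) is ours, and the temperature and epsilon values are placeholders for hyperparameters that ODIN tunes per dataset.

```python
import torch
import torch.nn.functional as F

def odin_score(model, x, temperature=1000.0, epsilon=0.0014):
    """Illustrative ODIN-style scoring (not the reference implementation).

    Combines temperature calibration with input preprocessing: the gradient
    step on the input requires one extra backward pass and a second forward
    pass, which is why this kind of inference is roughly three times slower
    and more power-hungry than a plain one.
    """
    x = x.clone().requires_grad_(True)
    logits = model(x) / temperature                  # temperature calibration
    loss = F.cross_entropy(logits, logits.argmax(dim=1))
    loss.backward()                                  # backpropagation to the input
    # Perturb the input against the gradient sign (input preprocessing).
    x_perturbed = x - epsilon * x.grad.sign()
    with torch.no_grad():
        calibrated = model(x_perturbed) / temperature  # second inference
    # The maximum calibrated softmax probability is the detection score.
    return F.softmax(calibrated, dim=1).max(dim=1).values
```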
Temperature calibration consists of changing the scale of the logits of a pretrained model. Both input preprocessing and temperature calibration require hyperparameter tuning. ODIN required unrealistic access to out-of-distribution samples to validate these hyperparameters. Even if some supposed OOD samples are indeed available at design time, using those examples to validate hyperparameters makes the solution overfit to detecting this particular type of out-distribution [21]. In real-world applications, the system will probably operate under a different/novel/unknown out-distribution, and the OOD detection performance could degrade significantly. Therefore, using design-time out-of-distribution samples to validate hyperparameters generates unrealistic OOD detection performance expectations.

The seminal work introduced by [8], which we call the Mahalanobis method, overcomes the necessity of access to out-of-distribution samples by validating the required hyperparameters on adversarial examples, producing more realistic performance estimates. Hence, in this work, we only consider validation on adversarial samples for competing methods. As our approach is turnkey, it does not require hyperparameter validation.

However, validation using adversarial examples has the disadvantage of adding a cumbersome procedure to the process. Even worse, the generation of adversarial samples itself requires the definition of hyperparameters such as the maximum adversarial perturbation. For research datasets, we may know those values, but they may be hard to find for novel real-world data. The same drawbacks apply to methods based on adversarial training such as ACET [9], which also implies slower training. Solutions based on adversarial training may also present scalability problems when used in applications dealing with real-world large-sized images. Moreover, the Mahalanobis approach still requires input preprocessing, which brings to this solution the drawbacks previously associated with that technique.
The feature ensemble introduced in Mahalanobis also presents limitations. Since it requires training/inference of ad-hoc classification/regression models on features produced in many neural network layers, this approach may not scale to applications using large-sized images, as it would require using those shallow models in spaces of thousands of dimensions.
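The following sketch illustrates the kind of class-conditional Gaussian scoring the Mahalanobis method performs on a single feature layer; in the actual method, this computation is repeated over many layers (the feature ensemble) and the per-layer scores are combined by a regression model validated on adversarial examples. The function names and the regularization constant below are our own assumptions.

```python
import torch

def fit_gaussian(features, labels, num_classes):
    """Per-class means and a shared (tied) covariance over training features."""
    dim = features.size(1)
    means = torch.stack([features[labels == c].mean(dim=0)
                         for c in range(num_classes)])
    centered = features - means[labels]
    cov = centered.t() @ centered / features.size(0)
    # Small diagonal term added for numerical stability (our assumption).
    precision = torch.linalg.inv(cov + 1e-6 * torch.eye(dim))
    return means, precision

def mahalanobis_score(feature, means, precision):
    """Negative squared Mahalanobis distance to the closest class mean."""
    diff = feature.unsqueeze(0) - means                       # (C, D)
    d2 = torch.einsum('cd,de,ce->c', diff, precision, diff)   # (C,)
    return (-d2).max()
```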
Contributions.
In this paper, we develop an OOD detection approach that avoids all previously mentioned requirements and side effects. Throughout this work, we follow the "SoftMax loss" expression as defined in [24]. The first component of our solution is the Isotropy Maximization (IsoMax) loss, which works as a drop-in replacement for the SoftMax one, as the swap of SoftMax loss for IsoMax loss requires neither model, data, nor training procedure modifications. IsoMax uses distance-based logits to fix the SoftMax loss anisotropy caused by its affine transformations. Moreover, we introduce what we call the Entropic Scale (E_s), a multiplicative factor applied to the logits throughout training that is, nevertheless, removed during inference to achieve high entropy posterior probability distributions. This scheme allows us to build high entropy (low confidence) posterior probability distributions committed to our prior knowledge, as stated by the Principle of Maximum Entropy.

The second part of our proposal is the speedy Entropic Score, defined as the negative entropy of the output probabilities, which is used for OOD detection. Furthermore, we provide theoretical motivation, based on the Principle of Maximum Entropy, to explain why the solution works. Finally, we present substantial experiments that confirm our theoretical assumptions and show that the overall solution is competitive with approaches that operate under more favorable and less restrictive conditions.

Indeed, the principled way we construct our approach allows it to be accurate (no classification accuracy drop), fast, and energy/computationally efficient (no input preprocessing, no adversarial training). Additionally, it is also scalable (no feature ensemble, no adversarial training) and turnkey (no post-processing for hyperparameter validation; no access to out-of-distribution samples or generation of adversarial examples is required). Modern loss enhancement techniques such as outlier exposure [25], [26] may be readily adapted to also work with IsoMax loss.

Fig. 1. (a) Cross-entropy SoftMax loss simultaneously minimizes both the cross-entropy and the entropy of the posterior probabilities. (b) IsoMax loss produces low entropy posterior probabilities for a low Entropic Scale (E_s = 1). (c) IsoMax loss produces medium mean entropy for an intermediate Entropic Scale (E_s = 3). (d) In agreement with the Principle of Maximum Entropy, IsoMax loss can minimize the cross-entropy while producing high mean entropies for a high Entropic Scale (E_s = 10). An Entropic Scale equal to ten is enough to produce extremely high entropy posterior probability distributions. (e) Higher values of the Entropic Scale correlate to higher mean entropy and increased OOD detection performance regardless of the out-distribution under consideration. Isotropy enables IsoMax loss to produce higher OOD performance than SoftMax one even for a unitary value of the Entropic Scale. IsoMax loss classification accuracies are similar to SoftMax ones and insensitive to E_s.

3 ISOMAX LOSS AND ENTROPIC SCORE
Let x represent the input applied to a neural network and f_θ(x) represent the high-level feature vector produced by it. For this work, the underlying structure of the neural network does not matter. Considering k to be the correct class for a particular training example x, we can write the SoftMax loss associated with this specific training sample as:

$$ \mathcal{L}_S(\hat{y}^{(k)}\,|\,x) = -\log\frac{\exp(w_k^\top f_\theta(x) + b_k)}{\sum_j \exp(w_j^\top f_\theta(x) + b_j)} \qquad (1) $$

In equation (1), w_j and b_j represent, respectively, the weights and biases associated with class j. From a geometric perspective, the term w_j^⊤ f_θ(x) + b_j represents a hyperplane in the high-level feature space. It divides the feature space into two subspaces that we call the positive and negative subspaces. The deeper inside the positive subspace the feature f_θ(x) is located, the more likely the example belongs to the considered class. Therefore, training neural networks using SoftMax loss does not incentivize the agglomeration of the representations of the examples associated with a particular class into a limited region of the hyperspace. The immediate consequence is the propensity of SoftMax loss trained neural networks to make highly confident predictions on examples that lie in regions far away from the training examples, which explains their low out-of-distribution detection performance [9].

The main characteristic of the Mahalanobis distance used in [8] is to be locally isotropic around the produced prototypes. The fact that it achieved high OOD detection performance indicates that deploying locally isotropic spaces around class prototypes improves OOD detection. SoftMax loss trained neural networks are based on affine transformations in the last layer, which are essentially inner products. Consequently, the last layer representations of such networks tend to align in the direction of the weight vectors, producing a preferential direction in space and, therefore, anisotropy. Designing a loss that depends only on the distances of high-level representations to class prototypes is a possible way to avoid the mentioned anisotropy. A distance-based loss forbids the network from learning preferred directions in the feature space and enforces local isotropy during training, avoiding metric learning post-processing or hyperparameter validation.

Distance-based losses have been studied in the context of few-shot learning. [27] used metric and transfer learning on pretrained features, while [28] proposed an offline procedure to calculate prototypes. In both cases, prototypes are not calculated seamlessly during the network backpropagation training. Additionally, while [27] used the Mahalanobis distance, [28] proposed the squared Euclidean one.

In IsoMax, to build a straightforward procedure to perform OOD detection, distance-based logits are incorporated directly into the loss used to train the neural network. Therefore, the prototypes are treated as usual weights and learned during the regular backpropagation procedure. We experimentally observed that using the regular non-squared Euclidean distance performed best.
Therefore, the IsoMax loss is constructed with the negative of the non-squared Euclidean distance, which is given by the expression $-\lVert f_\theta(x) - p_j^\phi \rVert$, where $p_j^\phi$ represents the seamlessly learnable prototype associated with class j. The class prototypes have the same dimension as the last layer representations. As there is no bias, the IsoMax loss has fewer parameters than the SoftMax one.
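A minimal sketch of such a distance-based last layer, under the assumptions stated above (prototypes as ordinary learnable weights, no bias, negative non-squared Euclidean distances as logits); the class name DistanceLogits is ours, not the authors':

```python
import torch
import torch.nn as nn

class DistanceLogits(nn.Module):
    """Isotropic last layer: logits are negative non-squared Euclidean
    distances to learnable class prototypes (no inner product, no bias)."""
    def __init__(self, num_features, num_classes):
        super().__init__()
        # Prototypes are regular weights learned by backpropagation;
        # the paper later reports that zero initialization works best.
        self.prototypes = nn.Parameter(torch.zeros(num_classes, num_features))

    def forward(self, features):
        # Pairwise Euclidean distances between features and prototypes: (N, C).
        return -torch.cdist(features, self.prototypes)
```

Because these logits depend only on distances, no direction in the feature space is privileged, which provides the local isotropy argued for above.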
The Principle of Maximum Entropy, formulated by E. T. Jaynes to unify the statistical mechanics and information theory entropy concepts [29], [30], states that, when estimating probability distributions, we should choose the one that produces the maximum entropy consistent with the given constraints [31]. Following this principle, we avoid introducing additional assumptions or biases. In other words, from a set of trial probability distributions that satisfactorily describe the prior knowledge available, the one that presents the maximal information entropy (the least informative option) represents the best choice.

The Principle of Maximum Entropy has been studied as a regularization factor in classification tasks [32], [33]. In some cases, it has also been used in classification tasks as a direct optimization procedure, without connection to the mechanism of cross-entropy minimization or backpropagation. For example, in [34], [35], [36], the maximization of the entropy subject to a constraint on the expected classification error is shown to be equivalent to solving an unconstrained Lagrangian. Despite being theoretically well-grounded [37], [38], [39], [40], direct entropy maximization presents high computational complexity, as it is an NP-complete problem [37], [39].

Alternatively, modern neural networks are trained using computationally efficient cross-entropy minimization. However, this procedure does not prioritize posterior probability distributions with high entropy. Actually, exactly the opposite is true. Indeed, the minimization of cross-entropy has the undesired side effect of producing overconfident, low mean entropy posterior probability distributions. Hence, we use the Principle of Maximum Entropy as motivation to construct high entropy posterior probabilities while still relying on computationally efficient cross-entropy minimization. Additionally, we present substantial experimental evidence showing that increased posterior probability distribution entropy correlates with improved OOD detection performance.
Unlike the previously mentioned works, we are neither using the Principle of Maximum Entropy to motivate the construction of regularization mechanisms (such as label smoothing or confidence penalty) nor performing direct Maximum Entropy optimization (see https://mtlsites.mit.edu/Courses/6.050/2003/notes). Indeed, the entropy is not even calculated during IsoMax loss training. Since our approach does not directly maximize the entropy, we cannot state that the proposed method produces the highest available mean entropy for the posterior probability distribution. Nevertheless, the experiments show that our approach's average entropy is high enough to improve the OOD detection performance significantly.
Thus, our approach may be seen as a computationally efficient procedure to obtain high entropy posterior distributions while avoiding the extremely high computational cost of a direct entropy maximization.

$$ \mathcal{L}_{\text{SoftMax}} = -\log\frac{\exp(L_k)}{\sum_j \exp(L_j)} \rightarrow 0 \;\Longrightarrow\; P(y|x) \rightarrow 1 \;\Longrightarrow\; \bar{H}_{\text{SoftMax}} \rightarrow 0 \qquad (2) $$

Equation (2) explains the behavior of the cross-entropy and the entropy for the SoftMax loss. L_j represents the logit associated with class j, and L_k represents the logit associated with the correct class k. When minimizing the first term of the mentioned equation, extremely high probabilities are generated. Consequently, very low entropy posterior probability distributions are produced. The usual cross-entropy loss minimization tends to generate unrealistic, overconfident (low entropy) probability distributions. Therefore, we have an opposition between cross-entropy loss minimization and the Principle of Maximum Entropy.

$$ \mathcal{L}_{\text{IsoMax}} = -\log\frac{\exp(E_s \times L_k)}{\sum_j \exp(E_s \times L_j)} \rightarrow 0 \;\nRightarrow\; P(y|x) \rightarrow 1 \;\Longrightarrow\; \bar{H}_{\text{IsoMax}} \rightarrow \text{high} \qquad (3) $$

The IsoMax loss straightforwardly conciliates these apparently contradictory objectives by multiplying the logits by what we call the Entropic Scale E_s, which is present during training but removed for inference. Equation (3) demonstrates how the Entropic Scale allows the production of high entropy posterior distributions while relying on cross-entropy minimization. The Entropic Scale present during training allows the argument of the exponential functions, E_s × L_j, to become high enough to produce a low loss without producing extremely high probabilities for the correct classes, as those probabilities are calculated with the Entropic Scale removed. Hence, it is possible to build posterior probability distributions with high mean entropy, in agreement with the fundamental Principle of Maximum Entropy, despite using cross-entropy minimization.
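The effect captured by equations (2) and (3) can be checked numerically. The sketch below is our own illustration with made-up distance-based logits: scaled by E_s = 10, the correct class already dominates the training softmax (near-zero loss), while the same logits with the scale removed yield an inference distribution whose entropy remains close to the maximum log(10).

```python
import torch
import torch.nn.functional as F

entropic_scale = 10.0
# Hypothetical distance-based logits for one example and 10 classes:
# the correct class (index 0) is only slightly closer to its prototype.
logits = -torch.tensor([[1.0, 1.8, 1.9, 2.0, 2.0, 2.1, 2.1, 2.2, 2.2, 2.3]])
target = torch.tensor([0])

train_loss = F.cross_entropy(entropic_scale * logits, target)
probs = F.softmax(logits, dim=1)            # inference: Entropic Scale removed
entropy = -(probs * probs.log()).sum()

print(train_loss.item())  # near zero: E_s widens the logit gaps during training
print(entropy.item())     # high entropy: not far from log(10) ~ 2.30
```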
TABLE 1
Accurate, fast, efficient, scalable, and turnkey out-of-distribution detection approaches (neither classification accuracy drop, adversarial training, input preprocessing, temperature calibration, feature ensemble, nor ad-hoc post-processing classification/regression). Since there is no hyperparameter to tune, no access to out-of-distribution or adversarial examples is required. SoftMax+MPS means training with SoftMax loss and performing OOD detection using the Maximum Probability Score (MPS) [6]. SoftMax+ES means training with SoftMax loss and performing OOD detection using the Entropic Score. IsoMax+ES means training with IsoMax loss and performing OOD detection using the Entropic Score. The best results are in bold. To the best of our knowledge, IsoMax+ES presents the state of the art under these assumptions.

Out-of-Distribution Detection: Accurate, Fast, Efficient, Scalable, and Turnkey
Model | In-Data (training) | Out-Data (unseen) | TNR@TPR95 (%): SoftMax+MPS / SoftMax+ES / IsoMax+ES | AUROC (%) | DTACC (%)
DenseNet | CIFAR10 | SVHN | 32.2 / 33.2 / — | — | —
DenseNet | CIFAR10 | TinyImageNet | 55.8 / 59.8 / — | — | —
DenseNet | CIFAR10 | LSUN | 64.9 / 69.5 / — | — | —
DenseNet | CIFAR100 | SVHN | 20.6 / 24.9 / — | — | —
DenseNet | CIFAR100 | TinyImageNet | 19.4 / 23.7 / — | — | —
DenseNet | CIFAR100 | LSUN | 18.8 / 24.4 / — | — | —
DenseNet | SVHN | CIFAR10 | 81.5 / 83.7 / — | — | —
DenseNet | SVHN | TinyImageNet | 88.2 / 90.0 / — | — | —
DenseNet | SVHN | LSUN | 86.4 / 88.4 / — | — | —
ResNet | CIFAR10 | SVHN | 43.1 / 44.5 / — | — | —
ResNet | CIFAR10 | TinyImageNet | 46.3 / 48.0 / — | — | —
ResNet | CIFAR10 | LSUN | 51.2 / 53.3 / — | — | —
ResNet | CIFAR100 | SVHN | 15.9 / 18.0 / — | — | —
ResNet | CIFAR100 | TinyImageNet | 18.5 / 22.4 / — | — | —
ResNet | CIFAR100 | LSUN | 18.4 / 22.4 / — | — | —
ResNet | SVHN | CIFAR10 | 67.3 / 67.7 / — | — | —
ResNet | SVHN | TinyImageNet | 66.9 / 67.3 / — | — | —
ResNet | SVHN | LSUN | 62.2 / 62.5 / — | — | —
TABLE 2
Unfair comparison of approaches with different requirements and side effects. ODIN uses input preprocessing, temperature calibration, and adversarial validation. Mahalanobis uses input preprocessing, feature ensemble, ad-hoc post-processing classification/regression models, and adversarial validation. Input preprocessing makes ODIN and Mahalanobis inference three times slower and three times less energy/computationally efficient. ACET uses adversarial training, which implies slower training and reduced scalability for large-scale images. ODIN, Mahalanobis, and ACET present hyperparameters that need to be validated for each dataset. IsoMax+ES neither uses those techniques nor presents hyperparameters to tune for novel datasets. The best results are in bold (2% tolerance).
ODIN/ACET/Mahalanobis present special requirements. ODIN/Mahalanobis produce undesired side effects.
Model | In-Data (training) | Out-Data (unseen) | AUROC (%): ODIN / ACET / IsoMax+ES / Mahalanobis | DTACC (%): ODIN / ACET / IsoMax+ES / Mahalanobis
DenseNet | CIFAR10 | SVHN | 92.8 / NA / — / — | —
DenseNet | CIFAR10 | TinyImageNet | 97.2 / NA / — / — | —
DenseNet | CIFAR10 | LSUN | 98.5 / NA / — / — | —
DenseNet | CIFAR100 | SVHN | 88.2 / NA / 88.8 / — | —
DenseNet | CIFAR100 | TinyImageNet | 85.3 / NA / 91.1 / — | —
DenseNet | CIFAR100 | LSUN | 85.7 / NA / 93.1 / — | —
DenseNet | SVHN | CIFAR10 | 91.9 / NA / — / — | —
DenseNet | SVHN | TinyImageNet | 94.8 / NA / — / — | —
DenseNet | SVHN | LSUN | 94.1 / NA / — / — | —
ResNet | CIFAR10 | SVHN | 86.5 / — / 93.8 / 95.5 | 77.8 / NA / — / —
ResNet | CIFAR10 | TinyImageNet | 93.9 / 85.9 / 95.2 / — | —
ResNet | CIFAR10 | LSUN | 93.7 / 85.8 / 97.3 / — | —
ResNet | CIFAR100 | SVHN | 72.0 / — / — / 84.4 | 67.7 / NA / — / 76.5
ResNet | CIFAR100 | TinyImageNet | 83.6 / 75.2 / — / — | —
ResNet | CIFAR100 | LSUN | 81.9 / 69.8 / — / 82.3 | 74.6 / NA / — / 79.7
ResNet | SVHN | CIFAR10 | 92.1 / 97.3 / — / — | —
ResNet | SVHN | TinyImageNet | 92.9 / 97.7 / 97.1 / — | —
ResNet | SVHN | LSUN | 90.7 / — / 96.6 / — | —
TABLE 3
Performance metrics of neural networks trained using SoftMax and IsoMax losses for a combination of in-distributions and models. IsoMax loss produces very similar train and test classification accuracy while presenting much higher OOD detection performance (Table 1).
Test Accuracy (%)
Model | Data | SoftMax Loss | IsoMax Loss
DenseNet | SVHN | 96.7 | 96.7
DenseNet | CIFAR10 | 94.9 | 95.1
DenseNet | CIFAR100 | 75.7 | 76.2
ResNet | SVHN | 96.7 | 96.6
ResNet | CIFAR10 | 95.4 | 95.4
ResNet | CIFAR100 | 75.8 | 75.3

Hence, we can define the IsoMax loss as:

$$ \mathcal{L}_I(\hat{y}^{(k)}\,|\,x) = -\log\frac{\exp(-E_s \lVert f_\theta(x)-p_k^\phi \rVert)}{\sum_j \exp(-E_s \lVert f_\theta(x)-p_j^\phi \rVert)} = -\log\frac{\exp\!\big(-E_s\sqrt{(f_\theta(x)-p_k^\phi)\cdot(f_\theta(x)-p_k^\phi)}\big)}{\sum_j \exp\!\big(-E_s\sqrt{(f_\theta(x)-p_j^\phi)\cdot(f_\theta(x)-p_j^\phi)}\big)} \qquad (4) $$

In the previous equation, k is the correct class. Experimentally, we observed that using Xavier [41] or Kaiming [42] initialization for the prototypes made the OOD detection performance oscillate: sometimes it improved, sometimes it decreased. Additionally, we experimentally observed a classification accuracy drop when using E_s with the affine transformations used in the SoftMax loss. Hence, we decided to always initialize all prototypes to zero and to indeed use non-squared Euclidean distance-based logits.

To calculate the cross-entropy loss, deep learning libraries usually combine the logarithm and probability calculations into a single computation. However, we experimentally observed that sequentially computing these calculations as stand-alone operations significantly improves IsoMax performance. Since the prototypes are regular learnable network weights, weight decay was applied to them.
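Putting the pieces together, below is a hedged sketch of the training-time loss as described by equation (4) and the surrounding text (Entropic Scale on distance-based logits; probability and logarithm computed as stand-alone operations). The function name and the numerical-stability constant are our assumptions, not the authors' reference code; the logits are assumed to come from a layer such as the DistanceLogits sketch shown earlier.

```python
import torch
import torch.nn.functional as F

def isomax_loss(logits, targets, entropic_scale=10.0):
    """Training-time IsoMax loss sketch: `logits` are the negative
    non-squared Euclidean distances to the class prototypes.

    The Entropic Scale multiplies the logits during training only;
    softmax and logarithm are computed sequentially as stand-alone
    operations, as the text recommends.
    """
    probs = F.softmax(entropic_scale * logits, dim=1)
    # The small constant is our own numerical-stability safeguard.
    return F.nll_loss(torch.log(probs + 1e-12), targets)
```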
Finally, we can define the inference probabilities as:

$$ p_I(y^{(i)}\,|\,x) = \frac{\exp(-\lVert f_\theta(x)-p_i^\phi \rVert)}{\sum_j \exp(-\lVert f_\theta(x)-p_j^\phi \rVert)} = \frac{\exp\!\big(-\sqrt{(f_\theta(x)-p_i^\phi)\cdot(f_\theta(x)-p_i^\phi)}\big)}{\sum_j \exp\!\big(-\sqrt{(f_\theta(x)-p_j^\phi)\cdot(f_\theta(x)-p_j^\phi)}\big)} \qquad (5) $$

Out-of-distribution detection approaches typically define a score to be used during inference to evaluate whether an example should be considered out-of-distribution. In a seminal work, [43] demonstrated that the entropy presents the optimum measure of the randomness of a source of symbols. More broadly, we currently understand entropy as a measure of the uncertainty we have about a random variable. Therefore, considering that the uncertainty in classifying a specific sample should be an optimum metric to evaluate whether a particular example is out-of-distribution, we define our score to perform OOD detection, called the Entropic Score, as the negative entropy of the output probabilities:

$$ \mathrm{ES} = -H(y\,|\,x) = \sum_{i=1}^{N} p(y^{(i)}\,|\,x)\,\log p(y^{(i)}\,|\,x) \qquad (6) $$

By using the negative entropy as a score to evaluate whether a particular sample is out-of-distribution, we consider the information provided by all available outputs rather than relying on a single network output, for example, the maximum probability (as in the baseline, ODIN, and ACET) or the distance to the nearest prototype (as in Mahalanobis). Additionally, from a practical perspective, using this a priori score avoids the need to train an ad-hoc additional regression model to detect out-of-distribution samples, which is required, for example, in Mahalanobis. Even more importantly, since no regression model needs to be trained, there is no need for unrealistic access to out-of-distribution samples or for generating adversarial examples for hyperparameter validation. Since ES is a predefined, non-trainable score, it is available as soon as the neural network training finishes.
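A corresponding inference-time sketch of equation (5) and the Entropic Score of equation (6), again under our own naming assumptions:

```python
import torch
import torch.nn.functional as F

def isomax_inference_probs(logits):
    """Inference probabilities of equation (5): the Entropic Scale is
    removed, so the plain distance-based logits are used directly."""
    return F.softmax(logits, dim=1)

def entropic_score(probs):
    """Entropic Score of equation (6): the negative entropy of the outputs.

    In-distribution samples (low entropy) receive higher scores, so inputs
    whose score falls below a chosen threshold are flagged as OOD.
    The small constant is our own numerical-stability safeguard."""
    return (probs * torch.log(probs + 1e-12)).sum(dim=1)
```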
4 EXPERIMENTAL RESULTS

The source code to reproduce all the results is available as supplementary material. Considering that outlier exposure may be integrated into, and benefit, both SoftMax and IsoMax losses, all experiments were performed without relying on outlier data [25], [26]. Similar arguments hold for approaches based on background samples [44]. All datasets, models, training procedures, and metrics followed the baseline established in [6] and subsequently used in major OOD detection papers [7], [8], [9] (see https://github.com/facebookresearch/odin). In this paper, only approaches that do not present classification accuracy drop were compared.

In our experiments, we trained 100-layer DenseNets [45] and 34-layer ResNets [46] from scratch on the CIFAR10 [47], CIFAR100 [47], and SVHN [48] datasets using SoftMax and IsoMax losses, following the same protocol presented in [8] (300 epochs; initial learning rate of 0.1 with a learning rate decay rate of ten at epochs 150, 200, and 250; and a weight decay of 0.0001).

We used resized images from the TinyImageNet dataset [49] and the Large-scale Scene UNderstanding dataset (LSUN) [50], following the same protocol used in [8], to create out-distribution data. To evaluate the OOD detection performance, we added these out-of-distribution images to the validation images of the CIFAR10, CIFAR100, and SVHN datasets to form the composed test sets.

The performance was evaluated using three detection metrics. First, we calculated the True Negative Rate at 95% True Positive Rate (TNR@TPR95). Additionally, we evaluated the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Detection Accuracy (DTACC), which corresponds to the maximum classification probability over all possible thresholds δ:

$$ \mathrm{DTACC} = 1 - \min_{\delta}\Big\{ P_{\text{in}}\big(o(x) \le \delta\big)\,P(x \text{ is from } P_{\text{in}}) + P_{\text{out}}\big(o(x) > \delta\big)\,P(x \text{ is from } P_{\text{out}}) \Big\} \qquad (7) $$

where o(x) is a given OOD detection score. It is assumed that both positive and negative samples have equal probability of being in the test set, i.e., P(x is from P_in) = P(x is from P_out). All the above metrics follow the calculation detailed in [8].
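For reference, a hedged sketch of how the three metrics can be computed from detection scores; this is our own illustration, assuming higher scores for in-distribution samples and equal test priors, as in equation (7).

```python
import numpy as np

def detection_metrics(scores_in, scores_out):
    """TNR@TPR95, AUROC, and DTACC from detection scores
    (higher score = more likely in-distribution)."""
    thresholds = np.unique(np.concatenate([scores_in, scores_out]))
    tpr = np.array([(scores_in >= t).mean() for t in thresholds])
    tnr = np.array([(scores_out < t).mean() for t in thresholds])
    # TNR at the largest threshold that still keeps TPR at or above 95%.
    tnr_at_tpr95 = tnr[tpr >= 0.95].max()
    # AUROC by trapezoidal integration over the ROC curve (with endpoints).
    fpr = 1.0 - tnr
    order = np.argsort(fpr)
    auroc = np.trapz(np.concatenate(([0.0], tpr[order], [1.0])),
                     np.concatenate(([0.0], fpr[order], [1.0])))
    # DTACC: best balanced detection accuracy over all thresholds, which
    # matches equation (7) under equal in/out priors.
    dtacc = 0.5 * (tpr + tnr).max()
    return tnr_at_tpr95, auroc, dtacc
```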
To define the global hyperparameter E_s, we trained DenseNets on SVHN using E_s equal to 1, 3, and 10. We validated these possible values using the TNR@TPR95 metric and CIFAR100 as out-distribution (Fig. 1). It is important to emphasize that CIFAR100 was never used as an out-distribution in the subsequent experiments.

Fig. 1(a) shows that the SoftMax loss minimizes both the cross-entropy and the entropy of the posterior distribution. Fig. 1(d) shows that IsoMax loss is capable of minimizing the cross-entropy while keeping a high average posterior probability entropy, as recommended by the Principle of Maximum Entropy. Fig. 1(e) shows that OOD detection performance increases for higher Entropic Scales.

It was possible to define the Entropic Scale E_s as a global hyperparameter because the experiments showed that, once validated on a single metric, model, in-distribution, and out-distribution, the global value defined for it generalized well to all other metrics, models, in-distributions, and out-distributions (Fig. 1(e)). Considering that E_s = 10 already produces very high entropy probability distributions, we see no reason to increase it even more. Therefore, all experiments in this paper used E_s = 10, and no validation was performed for each new dataset, making our proposal turnkey. The value E_s = 10 presented the best OOD detection performance and generalizes well to unseen out-distributions, as required for a satisfactory global hyperparameter candidate. Consequently, this same value was used for all other experiments (combinations of models, in-distributions, and out-distributions in Tables 1 and 2). Once E_s = 10 was confirmed as an adequate global hyperparameter, the experiments showed that IsoMax loss trained networks present classification accuracy extremely similar to SoftMax ones for all other datasets and models (Table 3).

In Table 1, SoftMax+MPS presents the worst results. The Entropic Score produces a small positive effect when applied to SoftMax loss trained networks. However, the combination of IsoMax loss with the same Entropic Score significantly improves the OOD detection performance across almost all metrics, in-distributions, and out-distributions.
Table 2 shows an unfair comparison of approaches that present different requirements and side effects. Input preprocessing (and, consequently, slower and more power-consuming inference) and validation on adversarial samples are used in both ODIN and Mahalanobis, while temperature calibration is required only in ODIN. Feature ensemble and ad-hoc classification/regression models, which may imply limited scalability, are mandatory in Mahalanobis. ACET requires adversarial training, which may restrict its use to small-scale images. ODIN, Mahalanobis, and ACET have hyperparameters tuned for each in-distribution. IsoMax+ES presents neither the mentioned drawbacks nor side effects.

Regardless of the previous considerations, the table shows that IsoMax+ES significantly outperforms ODIN in all evaluated scenarios. Additionally, IsoMax+ES usually outperforms ACET (sometimes by a large margin). Moreover, in more than half of the cases, even operating under much more favorable conditions, Mahalanobis surpasses IsoMax+ES by less than 2%. In some scenarios, the latter even overcomes the former despite avoiding hyperparameter validation, being native, scalable, and straightforward to implement and use, and presenting inference that is at least three times faster and more power-efficient. IsoMax+ES performs particularly well in one of the CIFAR100 cases, which may suggest that the fact that ES uses all outputs to decide works even better when many classes are present. We speculate that recent advances in data augmentation techniques may help to improve IsoMax+ES OOD detection performance even further [51], [52].
5 CONCLUSION
In this paper, we proposed the IsoMax loss and the Entropic Score to show that the OOD detection performance of neural networks can be significantly improved in an accurate, fast, efficient, scalable, and turnkey way simply by replacing the SoftMax loss and using a predefined, meaningful, and information-theoretically well-founded score, without relying on ad-hoc techniques, thereby avoiding their associated drawbacks, requirements, and side effects. However, if the above-mentioned limitations are not a concern for a particular application, those techniques may be combined with IsoMax loss to achieve even higher OOD detection performance. Another promising direction could be the use of recent specialized data augmentation techniques. In future work, we intend to make E_s a learnable parameter.

REFERENCES

[1] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, "On calibration of modern neural networks," International Conference on Machine Learning, 2017.
[2] W. J. Scheirer, A. Rocha, A. Sapkota, and T. E. Boult, "Towards open set recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
[3] W. J. Scheirer, L. P. Jain, and T. E. Boult, "Probability models for open set recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
[4] A. Bendale and T. Boult, "Towards open world recognition," Computer Vision and Pattern Recognition, 2015.
[5] E. Rudd, L. P. Jain, W. J. Scheirer, and T. Boult, "The extreme value machine," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[6] D. Hendrycks and K. Gimpel, "A baseline for detecting misclassified and out-of-distribution examples in neural networks," International Conference on Learning Representations, 2017.
[7] S. Liang, Y. Li, and R. Srikant, "Enhancing the reliability of out-of-distribution image detection in neural networks," International Conference on Learning Representations, 2018.
[8] K. Lee, K. Lee, H. Lee, and J. Shin, "A simple unified framework for detecting out-of-distribution samples and adversarial attacks," Neural Information Processing Systems, 2018.
[9] M. Hein, M. Andriushchenko, and J. Bitterwolf, "Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem," Computer Vision and Pattern Recognition, 2018.
[10] E. Techapanurak, M. Suganuma, and T. Okatani, "Hyperparameter-free out-of-distribution detection using softmax of scaled cosine similarity," arXiv preprint arXiv:1905.10628, 2019.
[11] Y.-C. Hsu, Y. Shen, H. Jin, and Z. Kira, "Generalized ODIN: Detecting out-of-distribution image without learning from out-of-distribution data," arXiv preprint arXiv:2002.11297, 2020.
[12] N. Carlini, A. Athalye, N. Papernot, W. Brendel, J. Rauber, D. Tsipras, I. Goodfellow, A. Madry, and A. Kurakin, "On evaluating adversarial robustness," arXiv preprint arXiv:1902.06705, 2019.
[13] Q. Yu and K. Aizawa, "Unsupervised out-of-distribution detection by maximum classifier discrepancy," International Conference on Computer Vision, 2019.
[14] A. Vyas, N. Jammalamadaka, X. Zhu, D. Das, B. Kaul, and T. L. Willke, "Out-of-distribution detection using an ensemble of self supervised leave-out classifiers," European Conference on Computer Vision, 2018.
[15] B. Lakshminarayanan, A. Pritzel, and C. Blundell, "Simple and scalable predictive uncertainty estimation using deep ensembles," Neural Information Processing Systems, 2017.
[16] A. Kendall and Y. Gal, "What uncertainties do we need in bayesian deep learning for computer vision?" Neural Information Processing Systems, 2017.
[17] C. Leibig, V. Allken, M. S. Ayhan, P. Berens, and S. Wahl, "Leveraging uncertainty information from deep neural networks for disease detection," Scientific Reports, 2017.
[18] A. Malinin and M. Gales, "Predictive uncertainty estimation via prior networks," Neural Information Processing Systems, 2018.
[19] V. Kuleshov, N. Fenner, and S. Ermon, "Accurate uncertainties for deep learning using calibrated regression," arXiv preprint arXiv:1807.00263, 2018.
[20] A. Subramanya, S. Srinivas, and R. V. Babu, "Confidence estimation in deep neural networks via density modelling," arXiv preprint arXiv:1707.07013, 2017.
[21] A. Shafaei, M. Schmidt, and J. J. Little, "A less biased evaluation of out-of-distribution sample detectors," British Machine Vision Conference, 2019.
[22] R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni, "Green Artificial Intelligence," arXiv preprint arXiv:1907.10597, 2019.
[23] T. DeVries and G. W. Taylor, "Learning confidence for out-of-distribution detection in neural networks," arXiv preprint arXiv:1802.04865, 2018.
[24] W. Liu, Y. Wen, Z. Yu, and M. Yang, "Large-margin softmax loss for convolutional neural networks," International Conference on Machine Learning, 2016.
[25] D. Hendrycks, M. Mazeika, and T. Dietterich, "Deep anomaly detection with outlier exposure," International Conference on Learning Representations, 2019.
[26] A.-A. Papadopoulos, M. R. Rajati, N. Shaikh, and J. Wang, "Outlier exposure with confidence control for out-of-distribution detection," arXiv preprint arXiv:1906.03509, 2019.
[27] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka, "Distance-based image classification: Generalizing to new classes at near-zero cost," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
[28] J. Snell, K. Swersky, and R. S. Zemel, "Prototypical networks for few-shot learning," Neural Information Processing Systems, 2017.
[29] E. T. Jaynes, "Information theory and statistical mechanics," Physical Review, 1957.
[30] ——, "Information theory and statistical mechanics. II," Physical Review, 1957.
[31] T. M. Cover and J. A. Thomas, "Elements of Information Theory," Wiley Series in Telecommunications and Signal Processing, 2006.
[32] A. Dubey, O. Gupta, R. Raskar, and N. Naik, "Maximum-entropy fine grained classification," Neural Information Processing Systems, 2018.
[33] G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton, "Regularizing neural networks by penalizing confident output distributions," arXiv preprint arXiv:1701.06548, 2017.
[34] D. Miller, A. Rao, K. Rose, and A. Gersho, "A maximum entropy approach for optimal statistical classification," IEEE Workshop on Neural Networks for Signal Processing, 1995.
[35] A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra, "A maximum entropy approach to natural language processing," Computational Linguistics, 1996.
[36] J. Shawe-Taylor and D. Hardoon, "PAC-Bayes analysis of maximum entropy classification," International Conference on Artificial Intelligence and Statistics, 2009.
[37] J. Pearl, "Probabilistic reasoning in intelligent systems: Networks of plausible inference," Morgan Kaufmann Publishers Inc., 1988.
[38] J. Williamson, "Objective Bayesian nets," We Will Show Them!, 2005.
[39] ——, "Philosophies of probability," Philosophy of Mathematics: Handbook of the Philosophy of Science, 2009.
[40] ——, "In defence of objective Bayesianism," Oxford University Press, 2013.
[41] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," International Conference on Artificial Intelligence and Statistics, 2010.
[42] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," International Conference on Computer Vision, 2015.
[43] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, 1948.
[44] A. R. Dhamija, M. Günther, and T. Boult, "Reducing network agnostophobia," Neural Information Processing Systems, 2018.
[45] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," Computer Vision and Pattern Recognition, 2017.
[46] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," Lecture Notes in Computer Science, 2016.
[47] A. Krizhevsky, "Learning multiple layers of features from tiny images," Science Department, University of Toronto, 2009.
[48] Y. Netzer and T. Wang, "Reading digits in natural images with unsupervised feature learning," Neural Information Processing Systems, 2011.
[49] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," Computer Vision and Pattern Recognition, 2009.
[50] F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao, "LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop," arXiv preprint arXiv:1506.03365, 2015.
[51] S. Thulasidasan, G. Chennupati, J. A. Bilmes, T. Bhattacharya, and S. Michalak, "On mixup training: Improved calibration and predictive uncertainty for deep neural networks," Neural Information Processing Systems, 2019.
[52] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, "CutMix: Regularization strategy to train strong classifiers with localizable features," International Conference on Computer Vision, 2019.