Entropic Out-of-Distribution Detection
David Macêdo, Tsang Ing Ren, Cleber Zanchettin, Adriano L. I. Oliveira, Teresa Ludermir
Isotropy Maximization Loss and Entropic Score: Accurate, Fast, Efficient, Scalable, and Turnkey Neural Networks Out-of-Distribution Detection Based on the Principle of Maximum Entropy
David Macêdo, Member, IEEE, Tsang Ing Ren, Member, IEEE, Cleber Zanchettin, Member, IEEE, Adriano L. I. Oliveira, Senior Member, IEEE, and Teresa Ludermir, Senior Member, IEEE
Abstract—Current out-of-distribution (OOD) detection approaches require cumbersome procedures that add undesired side effects to the solution. In this paper, we argue that the low OOD detection performance of neural networks is due to the anisotropy of the cross-entropy SoftMax loss and its extreme propensity to produce low entropy (high confidence) posterior probability distributions, in direct disagreement with the Principle of Maximum Entropy. Consequently, we propose IsoMax, a loss that is isotropic (distance-based) and produces high entropy (low confidence) posterior probability distributions despite still relying on cross-entropy minimization. Additionally, we propose a speedy Entropic Score for OOD detection. IsoMax loss works as a seamless SoftMax loss drop-in replacement that keeps the overall solution accurate, fast, efficient, scalable, and turnkey. Our experiments confirmed that the OOD detection performance of neural networks may be significantly improved without relying on techniques such as adversarial training or validation, data augmentation, ensemble methods, generative approaches, model architectural changes, metric learning, or additional classifiers or regressions. The results also showed that our straightforward approach is competitive against state-of-the-art solutions while avoiding the undesired drawbacks of previous methods.
Index Terms—Isotropic Maximization Loss, Entropic Score, Accurate, Fast, Efficient, Scalable, Turnkey, Neural Networks, Out-of-Distribution Detection, Principle of Maximum Entropy
• David Macêdo, Tsang Ing Ren, Cleber Zanchettin, Adriano L. I. Oliveira, and Teresa Ludermir are with the Centro de Informática, Universidade Federal de Pernambuco, Brazil. E-mail: see [email protected]
• David Macêdo was with the Montreal Institute for Learning Algorithms, University of Montreal, Canada.

1 INTRODUCTION

NEURAL networks have been used as classifiers in a wide range of applications. Their design usually assumes that the model receives an instance of one of the training classes at inference. When this assumption holds, the neural network tends to present satisfactory performance. However, in real-world applications, this assumption is often not fulfilled. Additionally, neural networks are known to present overconfident predictions even for objects they were not trained to recognize [1].

The ability to detect whether an input applied to a neural network cannot be reliably classified is essential for critical applications in medicine, finance, agriculture, and engineering. In such situations, it is better to have a system that acknowledges that it is unable to decide. The rapid adoption of neural networks in modern applications makes the development of this capability a primary necessity from a practical point of view.

The mentioned problem has been studied under many similar points of view and nomenclatures, such as Open Set Recognition [2], [3] and Open World Recognition [4], [5]. Recently, [6] defined out-of-distribution (OOD) detection as the task of evaluating whether a sample belongs to the in-distribution on which a neural network was trained. [6] also established baseline datasets and metrics for OOD detection. Additionally, they established the baseline performance for this task by proposing an OOD detection approach that uses the maximum predicted probability as the score to detect whether an example belongs to the in-distribution.

Despite being a fundamental task, current OOD detection approaches are based on ad-hoc techniques that produce severe side effects on the solution. ODIN [7] and Mahalanobis [8] require input preprocessing, which makes inference slow and increases computational cost and energy consumption. Additionally, these solutions present hyperparameters that must be tuned using unrealistic access to out-of-distribution samples or the cumbersome process of generating adversarial examples. Adversarial training methods such as ACET [9] usually imply longer training times and reduced scalability for large-sized images.

Another major drawback present in recent OOD detection approaches is the classification accuracy drop [10], [11], which is a harmful side effect because classification is usually the primary aim of the system, whereas OOD detection is an auxiliary task [12].

In some cases, OOD detection proposals require undesired model structural changes [13] or even ensemble methods [14], [15]. Finally, there are solutions based on uncertainty or confidence estimation/calibration [16], [17], [18], [19], [20]. Despite their additional complexity, slower inference, and higher energy/computation requirements, they may present OOD detection performance worse than ODIN [11], [21].

In this paper, we argue that the low OOD detection performance of neural networks is mainly due to two factors. First, the SoftMax anisotropy, which does not concentrate high-level representations in the feature space, making OOD detection difficult [9].
Second, the propensity of the cross-entropy loss to generate extremely overconfident (low entropy) posterior probability distributions, which is in direct conflict with the Principle of Maximum Entropy. Throughout this work, we further develop those claims with both theoretical motivations and experimental results.

Hence, we propose IsoMax, a loss that is isotropic (distance-based) and generates inferences with high mean entropy posterior probability distributions, in agreement with the Principle of Maximum Entropy. Our principled approach is accurate, fast, efficient, scalable, and turnkey, besides producing competitive performance. Additionally, our solution works as a seamless SoftMax loss drop-in replacement, which facilitates its incorporation into current and future projects. Unlike most contemporary approaches, our proposal is viable from an economical and environmental point of view. Furthermore, detection is a speedy procedure that can be achieved by a straightforward negative entropy calculation.
2 BACKGROUND
ODIN was proposed in [7] by combining SoftMax input preprocessing and temperature calibration. Despite significantly outperforming the baseline, the input preprocessing introduced in ODIN considerably increases the inference delay by requiring a backpropagation operation and a second inference to perform the final prediction on a single sample. Considering that backpropagation is typically slower than inference, input preprocessing makes ODIN inference at least three times slower. Additionally, input preprocessing also makes the inference power consumption at least three times higher, which is a severe limitation from an economical and environmental perspective [22]. Several subsequent OOD detection proposals incorporated input preprocessing and its associated drawbacks [7], [8], [11], [23].
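To make the cost of these two techniques concrete, the sketch below illustrates ODIN-style scoring in PyTorch. This is our own illustration, not the reference implementation: the function name (odin_score) is ours, and the temperature and epsilon values are placeholders for hyperparameters that ODIN tunes per dataset.

```python
import torch
import torch.nn.functional as F

def odin_score(model, x, temperature=1000.0, epsilon=0.0014):
    """Illustrative ODIN-style scoring (not the reference implementation).

    Combines temperature calibration with input preprocessing: the gradient
    step on the input requires one extra backward pass and a second forward
    pass, which is why this kind of inference is roughly three times slower
    and more power-hungry than a plain one.
    """
    x = x.clone().requires_grad_(True)
    logits = model(x) / temperature                  # temperature calibration
    loss = F.cross_entropy(logits, logits.argmax(dim=1))
    loss.backward()                                  # backpropagation to the input
    # Perturb the input against the gradient sign (input preprocessing).
    x_perturbed = x - epsilon * x.grad.sign()
    with torch.no_grad():
        calibrated = model(x_perturbed) / temperature  # second inference
    # The maximum calibrated softmax probability is the detection score.
    return F.softmax(calibrated, dim=1).max(dim=1).values
```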
Temperature calibration consists of changing the scale of the logits of a pretrained model. Both input preprocessing and temperature calibration require hyperparameter tuning. ODIN required unrealistic access to out-of-distribution samples to validate these hyperparameters. Even if some supposed OOD samples are indeed available at design time, using those examples to validate hyperparameters makes the solution overfit to detecting this particular type of out-distribution [21]. In real-world applications, the system will probably operate under a different/novel/unknown out-distribution, and the OOD detection performance could degrade significantly. Therefore, using design-time out-of-distribution samples to validate hyperparameters generates unrealistic OOD detection performance expectations.

The seminal work introduced by [8], which we call the Mahalanobis method, overcomes the necessity of access to out-of-distribution samples by validating the required hyperparameters on adversarial examples, producing more realistic performance estimates. Hence, in this work, we only consider validation on adversarial samples for competing methods. As our approach is turnkey, it does not require hyperparameter validation.

However, validation using adversarial examples has the disadvantage of adding a cumbersome procedure to the process. Even worse, the generation of adversarial samples itself requires the definition of hyperparameters such as the maximum adversarial perturbation. For research datasets, we may know those values, but they may be hard to find for novel real-world data. The same drawbacks apply to methods based on adversarial training such as ACET [9], which also implies slower training. Solutions based on adversarial training may also present scalability problems when used in applications dealing with real-world large-sized images. Moreover, the Mahalanobis approach still requires input preprocessing, which brings to this solution the drawbacks previously associated with that technique.
The feature ensemble introduced in Mahalanobis also presents limitations. Since it requires training/inference of ad-hoc classification/regression models on features produced in many neural network layers, this approach may not scale to applications using large-sized images, as it would require using those shallow models in spaces of thousands of dimensions.
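The following sketch illustrates the kind of class-conditional Gaussian scoring the Mahalanobis method performs on a single feature layer; in the actual method, this computation is repeated over many layers (the feature ensemble) and the per-layer scores are combined by a regression model validated on adversarial examples. The function names and the regularization constant below are our own assumptions.

```python
import torch

def fit_gaussian(features, labels, num_classes):
    """Per-class means and a shared (tied) covariance over training features."""
    dim = features.size(1)
    means = torch.stack([features[labels == c].mean(dim=0)
                         for c in range(num_classes)])
    centered = features - means[labels]
    cov = centered.t() @ centered / features.size(0)
    # Small diagonal term added for numerical stability (our assumption).
    precision = torch.linalg.inv(cov + 1e-6 * torch.eye(dim))
    return means, precision

def mahalanobis_score(feature, means, precision):
    """Negative squared Mahalanobis distance to the closest class mean."""
    diff = feature.unsqueeze(0) - means                       # (C, D)
    d2 = torch.einsum('cd,de,ce->c', diff, precision, diff)   # (C,)
    return (-d2).max()
```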
Contributions.
In this paper, we develop an OOD detection approach that avoids all previously mentioned requirements and side effects. Throughout this work, we follow the "SoftMax loss" expression as defined in [24]. The first component of our solution is the Isotropy Maximization (IsoMax) loss, which works as a drop-in replacement for the SoftMax one, as the swap of SoftMax loss for IsoMax loss requires neither model, data, nor training procedure modifications. IsoMax uses distance-based logits to fix the SoftMax loss anisotropy caused by its affine transformations. Moreover, we introduce what we call the Entropic Scale (E_s), a multiplicative factor applied to the logits throughout training that is, nevertheless, removed during inference to achieve high entropy posterior probability distributions. This scheme allows us to build high entropy (low confidence) posterior probability distributions committed to our prior knowledge, as stated by the Principle of Maximum Entropy.

The second part of our proposal is the speedy Entropic Score, defined as the negative entropy of the output probabilities, which is used for OOD detection. Furthermore, we provide theoretical motivation, based on the Principle of Maximum Entropy, to explain why the solution works. Finally, we present substantial experiments that confirm our theoretical assumptions and show that the overall solution is competitive with approaches that operate under more favorable and less restrictive conditions.

Indeed, the principled way we construct our approach allows it to be accurate (no classification accuracy drop), fast, and energy/computationally efficient (no input preprocessing, no adversarial training). Additionally, it is also scalable (no feature ensemble, no adversarial training) and turnkey (no post-processing for hyperparameter validation; no access to out-of-distribution samples or generation of adversarial examples is required). Modern loss enhancement techniques such as outlier exposure [25], [26] may be readily adapted to also work with IsoMax loss.

Fig. 1. (a) Cross-entropy SoftMax loss simultaneously minimizes both the cross-entropy and the entropy of the posterior probabilities. (b) IsoMax loss produces low entropy posterior probabilities for a low Entropic Scale (E_s = 1). (c) IsoMax loss produces medium mean entropy for an intermediate Entropic Scale (E_s = 3). (d) In agreement with the Principle of Maximum Entropy, IsoMax loss can minimize the cross-entropy while producing high mean entropies for a high Entropic Scale (E_s = 10). An Entropic Scale equal to ten is enough to produce extremely high entropy posterior probability distributions. (e) Higher values of the Entropic Scale correlate to higher mean entropy and increased OOD detection performance regardless of the out-distribution under consideration. Isotropy enables IsoMax loss to produce higher OOD performance than SoftMax one even for a unitary value of the Entropic Scale. IsoMax loss classification accuracies are similar to SoftMax ones and insensitive to E_s.

3 ISOMAX LOSS AND ENTROPIC SCORE
Let x represent the input applied to a neural network and f_θ(x) represent the high-level feature vector produced by it. For this work, the underlying structure of the neural network does not matter. Considering k to be the correct class for a particular training example x, we can write the SoftMax loss associated with this specific training sample as:

$$ \mathcal{L}_S(\hat{y}^{(k)}\,|\,x) = -\log\frac{\exp(w_k^\top f_\theta(x) + b_k)}{\sum_j \exp(w_j^\top f_\theta(x) + b_j)} \qquad (1) $$

In equation (1), w_j and b_j represent, respectively, the weights and biases associated with class j. From a geometric perspective, the term w_j^⊤ f_θ(x) + b_j represents a hyperplane in the high-level feature space. It divides the feature space into two subspaces that we call the positive and negative subspaces. The deeper inside the positive subspace the feature f_θ(x) is located, the more likely the example belongs to the considered class. Therefore, training neural networks using SoftMax loss does not incentivize the agglomeration of the representations of the examples associated with a particular class into a limited region of the hyperspace. The immediate consequence is the propensity of SoftMax loss trained neural networks to make highly confident predictions on examples that lie in regions far away from the training examples, which explains their low out-of-distribution detection performance [9].

The main characteristic of the Mahalanobis distance used in [8] is to be locally isotropic around the produced prototypes. The fact that it achieved high OOD detection performance indicates that deploying locally isotropic spaces around class prototypes improves OOD detection. SoftMax loss trained neural networks are based on affine transformations in the last layer, which are essentially inner products. Consequently, the last layer representations of such networks tend to align in the direction of the weight vectors, producing a preferential direction in space and, therefore, anisotropy. Designing a loss that depends only on the distances of high-level representations to class prototypes is a possible way to avoid the mentioned anisotropy. A distance-based loss forbids the network from learning preferred directions in the feature space and enforces local isotropy during training, avoiding metric learning post-processing or hyperparameter validation.

Distance-based losses have been studied in the context of few-shot learning. [27] used metric and transfer learning on pretrained features, while [28] proposed an offline procedure to calculate prototypes. In both cases, prototypes are not calculated seamlessly during the network backpropagation training. Additionally, while [27] used the Mahalanobis distance, [28] proposed the squared Euclidean one.

In IsoMax, to build a straightforward procedure to perform OOD detection, distance-based logits are incorporated directly into the loss used to train the neural network. Therefore, the prototypes are treated as usual weights and learned during the regular backpropagation procedure. We experimentally observed that using the regular non-squared Euclidean distance performed best.
Therefore, the IsoMax loss is constructed with the negative of the non-squared Euclidean distance, which is given by the expression $-\lVert f_\theta(x) - p_j^\phi \rVert$, where $p_j^\phi$ represents the seamlessly learnable prototype associated with class j. The class prototypes have the same dimension as the last layer representations. As there is no bias, the IsoMax loss has fewer parameters than the SoftMax one.
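A minimal sketch of such a distance-based last layer, under the assumptions stated above (prototypes as ordinary learnable weights, no bias, negative non-squared Euclidean distances as logits); the class name DistanceLogits is ours, not the authors':

```python
import torch
import torch.nn as nn

class DistanceLogits(nn.Module):
    """Isotropic last layer: logits are negative non-squared Euclidean
    distances to learnable class prototypes (no inner product, no bias)."""
    def __init__(self, num_features, num_classes):
        super().__init__()
        # Prototypes are regular weights learned by backpropagation;
        # the paper later reports that zero initialization works best.
        self.prototypes = nn.Parameter(torch.zeros(num_classes, num_features))

    def forward(self, features):
        # Pairwise Euclidean distances between features and prototypes: (N, C).
        return -torch.cdist(features, self.prototypes)
```

Because these logits depend only on distances, no direction in the feature space is privileged, which provides the local isotropy argued for above.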
The Principle of Maximum Entropy, formulated by E. T. Jaynes to unify the statistical mechanics and information theory entropy concepts [29], [30], states that, when estimating probability distributions, we should choose the one that produces the maximum entropy consistent with the given constraints [31]. Following this principle, we avoid introducing additional assumptions or biases. In other words, from a set of trial probability distributions that satisfactorily describe the prior knowledge available, the one that presents the maximal information entropy (the least informative option) represents the best choice.

The Principle of Maximum Entropy has been studied as a regularization factor in classification tasks [32], [33]. In some cases, it has also been used in classification tasks as a direct optimization procedure, without connection to the mechanism of cross-entropy minimization or backpropagation. For example, in [34], [35], [36], the maximization of the entropy subject to a constraint on the expected classification error is shown to be equivalent to solving an unconstrained Lagrangian. Despite being theoretically well-grounded [37], [38], [39], [40], direct entropy maximization presents high computational complexity, as it is an NP-complete problem [37], [39].

Alternatively, modern neural networks are trained using computationally efficient cross-entropy minimization. However, this procedure does not prioritize posterior probability distributions with high entropy. Actually, exactly the opposite is true. Indeed, the minimization of cross-entropy has the undesired side effect of producing overconfident, low mean entropy posterior probability distributions. Hence, we use the Principle of Maximum Entropy as motivation to construct high entropy posterior probabilities while still relying on computationally efficient cross-entropy minimization. Additionally, we present substantial experimental evidence showing that increased posterior probability distribution entropy correlates with improved OOD detection performance.
Unlike the previously mentioned works, we are neither using the Principle of Maximum Entropy to motivate the construction of regularization mechanisms (such as label smoothing or confidence penalty) nor performing direct Maximum Entropy optimization (see https://mtlsites.mit.edu/Courses/6.050/2003/notes). Indeed, the entropy is not even calculated during IsoMax loss training. Since our approach does not directly maximize the entropy, we cannot state that the proposed method produces the highest available mean entropy for the posterior probability distribution. Nevertheless, the experiments show that our approach's average entropy is high enough to improve the OOD detection performance significantly.
Thus, our approach may be seen as a computationally efficient procedure to obtain high entropy posterior distributions while avoiding the extremely high computational cost of a direct entropy maximization.

$$ \mathcal{L}_{\text{SoftMax}} = -\log\frac{\exp(L_k)}{\sum_j \exp(L_j)} \rightarrow 0 \;\Longrightarrow\; P(y|x) \rightarrow 1 \;\Longrightarrow\; \bar{H}_{\text{SoftMax}} \rightarrow 0 \qquad (2) $$

Equation (2) explains the behavior of the cross-entropy and the entropy for the SoftMax loss. L_j represents the logit associated with class j, and L_k represents the logit associated with the correct class k. When minimizing the first term of the mentioned equation, extremely high probabilities are generated. Consequently, very low entropy posterior probability distributions are produced. The usual cross-entropy loss minimization tends to generate unrealistic, overconfident (low entropy) probability distributions. Therefore, we have an opposition between cross-entropy loss minimization and the Principle of Maximum Entropy.

$$ \mathcal{L}_{\text{IsoMax}} = -\log\frac{\exp(E_s \times L_k)}{\sum_j \exp(E_s \times L_j)} \rightarrow 0 \;\nRightarrow\; P(y|x) \rightarrow 1 \;\Longrightarrow\; \bar{H}_{\text{IsoMax}} \rightarrow \text{high} \qquad (3) $$

The IsoMax loss straightforwardly conciliates these apparently contradictory objectives by multiplying the logits by what we call the Entropic Scale E_s, which is present during training but removed for inference. Equation (3) demonstrates how the Entropic Scale allows the production of high entropy posterior distributions while relying on cross-entropy minimization. The Entropic Scale present during training allows the argument of the exponential functions, E_s × L_j, to become high enough to produce a low loss without producing extremely high probabilities for the correct classes, as those probabilities are calculated with the Entropic Scale removed. Hence, it is possible to build posterior probability distributions with high mean entropy, in agreement with the fundamental Principle of Maximum Entropy, despite using cross-entropy minimization.
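The effect captured by equations (2) and (3) can be checked numerically. The sketch below is our own illustration with made-up distance-based logits: scaled by E_s = 10, the correct class already dominates the training softmax (near-zero loss), while the same logits with the scale removed yield an inference distribution whose entropy remains close to the maximum log(10).

```python
import torch
import torch.nn.functional as F

entropic_scale = 10.0
# Hypothetical distance-based logits for one example and 10 classes:
# the correct class (index 0) is only slightly closer to its prototype.
logits = -torch.tensor([[1.0, 1.8, 1.9, 2.0, 2.0, 2.1, 2.1, 2.2, 2.2, 2.3]])
target = torch.tensor([0])

train_loss = F.cross_entropy(entropic_scale * logits, target)
probs = F.softmax(logits, dim=1)            # inference: Entropic Scale removed
entropy = -(probs * probs.log()).sum()

print(train_loss.item())  # near zero: E_s widens the logit gaps during training
print(entropy.item())     # high entropy: not far from log(10) ~ 2.30
```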
TABLE 1
Accurate, fast, efficient, scalable, and turnkey out-of-distribution detection approaches (neither classification accuracy drop, adversarial training, input preprocessing, temperature calibration, feature ensemble, nor ad-hoc post-processing classification/regression). Since there is no hyperparameter to tune, no access to out-of-distribution or adversarial examples is required. SoftMax+MPS means training with SoftMax loss and performing OOD detection using the Maximum Probability Score (MPS) [6]. SoftMax+ES means training with SoftMax loss and performing OOD detection using the Entropic Score. IsoMax+ES means training with IsoMax loss and performing OOD detection using the Entropic Score. The best results are in bold. To the best of our knowledge, IsoMax+ES presents the state of the art under these assumptions.

Out-of-Distribution Detection: Accurate, Fast, Efficient, Scalable, and Turnkey
Model | In-Data (training) | Out-Data (unseen) | TNR@TPR95 (%): SoftMax+MPS / SoftMax+ES / IsoMax+ES | AUROC (%) | DTACC (%)
DenseNet | CIFAR10 | SVHN | 32.2 / 33.2 / — | — | —
DenseNet | CIFAR10 | TinyImageNet | 55.8 / 59.8 / — | — | —
DenseNet | CIFAR10 | LSUN | 64.9 / 69.5 / — | — | —
DenseNet | CIFAR100 | SVHN | 20.6 / 24.9 / — | — | —
DenseNet | CIFAR100 | TinyImageNet | 19.4 / 23.7 / — | — | —
DenseNet | CIFAR100 | LSUN | 18.8 / 24.4 / — | — | —
DenseNet | SVHN | CIFAR10 | 81.5 / 83.7 / — | — | —
DenseNet | SVHN | TinyImageNet | 88.2 / 90.0 / — | — | —
DenseNet | SVHN | LSUN | 86.4 / 88.4 / — | — | —
ResNet | CIFAR10 | SVHN | 43.1 / 44.5 / — | — | —
ResNet | CIFAR10 | TinyImageNet | 46.3 / 48.0 / — | — | —
ResNet | CIFAR10 | LSUN | 51.2 / 53.3 / — | — | —
ResNet | CIFAR100 | SVHN | 15.9 / 18.0 / — | — | —
ResNet | CIFAR100 | TinyImageNet | 18.5 / 22.4 / — | — | —
ResNet | CIFAR100 | LSUN | 18.4 / 22.4 / — | — | —
ResNet | SVHN | CIFAR10 | 67.3 / 67.7 / — | — | —
ResNet | SVHN | TinyImageNet | 66.9 / 67.3 / — | — | —
ResNet | SVHN | LSUN | 62.2 / 62.5 / — | — | —
TABLE 2
Unfair comparison of approaches with different requirements and side effects. ODIN uses input preprocessing, temperature calibration, and adversarial validation. Mahalanobis uses input preprocessing, feature ensemble, ad-hoc post-processing classification/regression models, and adversarial validation. Input preprocessing makes ODIN and Mahalanobis inference three times slower and three times less energy/computationally efficient. ACET uses adversarial training, which implies slower training and reduced scalability for large-scale images. ODIN, Mahalanobis, and ACET present hyperparameters that need to be validated for each dataset. IsoMax+ES neither uses those techniques nor presents hyperparameters to tune for novel datasets. The best results are in bold (2% tolerance).
ODIN/ACET/Mahalanobis present special requirements. ODIN/Mahalanobis produce undesired side effects.
Model | In-Data (training) | Out-Data (unseen) | AUROC (%): ODIN / ACET / IsoMax+ES / Mahalanobis | DTACC (%): ODIN / ACET / IsoMax+ES / Mahalanobis
DenseNet | CIFAR10 | SVHN | 92.8 / NA / — / — | —
DenseNet | CIFAR10 | TinyImageNet | 97.2 / NA / — / — | —
DenseNet | CIFAR10 | LSUN | 98.5 / NA / — / — | —
DenseNet | CIFAR100 | SVHN | 88.2 / NA / 88.8 / — | —
DenseNet | CIFAR100 | TinyImageNet | 85.3 / NA / 91.1 / — | —
DenseNet | CIFAR100 | LSUN | 85.7 / NA / 93.1 / — | —
DenseNet | SVHN | CIFAR10 | 91.9 / NA / — / — | —
DenseNet | SVHN | TinyImageNet | 94.8 / NA / — / — | —
DenseNet | SVHN | LSUN | 94.1 / NA / — / — | —
ResNet | CIFAR10 | SVHN | 86.5 / — / 93.8 / 95.5 | 77.8 / NA / — / —
ResNet | CIFAR10 | TinyImageNet | 93.9 / 85.9 / 95.2 / — | —
ResNet | CIFAR10 | LSUN | 93.7 / 85.8 / 97.3 / — | —
ResNet | CIFAR100 | SVHN | 72.0 / — / — / 84.4 | 67.7 / NA / — / 76.5
ResNet | CIFAR100 | TinyImageNet | 83.6 / 75.2 / — / — | —
ResNet | CIFAR100 | LSUN | 81.9 / 69.8 / — / 82.3 | 74.6 / NA / — / 79.7
ResNet | SVHN | CIFAR10 | 92.1 / 97.3 / — / — | —
ResNet | SVHN | TinyImageNet | 92.9 / 97.7 / 97.1 / — | —
ResNet | SVHN | LSUN | 90.7 / — / 96.6 / — | —
TABLE 3
Performance metrics of neural networks trained using SoftMax and IsoMax losses for a combination of in-distributions and models. IsoMax loss produces very similar train and test classification accuracy while presenting much higher OOD detection performance (Table 1).
Test Accuracy (%)
Model | Data | SoftMax Loss | IsoMax Loss
DenseNet | SVHN | 96.7 | 96.7
DenseNet | CIFAR10 | 94.9 | 95.1
DenseNet | CIFAR100 | 75.7 | 76.2
ResNet | SVHN | 96.7 | 96.6
ResNet | CIFAR10 | 95.4 | 95.4
ResNet | CIFAR100 | 75.8 | 75.3

Hence, we can define the IsoMax loss as:

$$ \mathcal{L}_I(\hat{y}^{(k)}\,|\,x) = -\log\frac{\exp(-E_s \lVert f_\theta(x)-p_k^\phi \rVert)}{\sum_j \exp(-E_s \lVert f_\theta(x)-p_j^\phi \rVert)} = -\log\frac{\exp\!\big(-E_s\sqrt{(f_\theta(x)-p_k^\phi)\cdot(f_\theta(x)-p_k^\phi)}\big)}{\sum_j \exp\!\big(-E_s\sqrt{(f_\theta(x)-p_j^\phi)\cdot(f_\theta(x)-p_j^\phi)}\big)} \qquad (4) $$

In the previous equation, k is the correct class. Experimentally, we observed that using Xavier [41] or Kaiming [42] initialization for the prototypes made the OOD detection performance oscillate: sometimes it improved, sometimes it decreased. Additionally, we experimentally observed a classification accuracy drop when using E_s with the affine transformations used in the SoftMax loss. Hence, we decided to always initialize all prototypes to zero and to indeed use non-squared Euclidean distance-based logits.

To calculate the cross-entropy loss, deep learning libraries usually combine the logarithm and probability calculations into a single computation. However, we experimentally observed that sequentially computing these calculations as stand-alone operations significantly improves IsoMax performance. Since the prototypes are regular learnable network weights, weight decay was applied to them.
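Putting the pieces together, below is a hedged sketch of the training-time loss as described by equation (4) and the surrounding text (Entropic Scale on distance-based logits; probability and logarithm computed as stand-alone operations). The function name and the numerical-stability constant are our assumptions, not the authors' reference code; the logits are assumed to come from a layer such as the DistanceLogits sketch shown earlier.

```python
import torch
import torch.nn.functional as F

def isomax_loss(logits, targets, entropic_scale=10.0):
    """Training-time IsoMax loss sketch: `logits` are the negative
    non-squared Euclidean distances to the class prototypes.

    The Entropic Scale multiplies the logits during training only;
    softmax and logarithm are computed sequentially as stand-alone
    operations, as the text recommends.
    """
    probs = F.softmax(entropic_scale * logits, dim=1)
    # The small constant is our own numerical-stability safeguard.
    return F.nll_loss(torch.log(probs + 1e-12), targets)
```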
Finally, we can define the inference probabilities as:

$$ p_I(y^{(i)}\,|\,x) = \frac{\exp(-\lVert f_\theta(x)-p_i^\phi \rVert)}{\sum_j \exp(-\lVert f_\theta(x)-p_j^\phi \rVert)} = \frac{\exp\!\big(-\sqrt{(f_\theta(x)-p_i^\phi)\cdot(f_\theta(x)-p_i^\phi)}\big)}{\sum_j \exp\!\big(-\sqrt{(f_\theta(x)-p_j^\phi)\cdot(f_\theta(x)-p_j^\phi)}\big)} \qquad (5) $$

Out-of-distribution detection approaches typically define a score to be used during inference to evaluate whether an example should be considered out-of-distribution. In a seminal work, [43] demonstrated that the entropy presents the optimum measure of the randomness of a source of symbols. More broadly, we currently understand entropy as a measure of the uncertainty we have about a random variable. Therefore, considering that the uncertainty in classifying a specific sample should be an optimum metric to evaluate whether a particular example is out-of-distribution, we define our score to perform OOD detection, called the Entropic Score, as the negative entropy of the output probabilities:

$$ \mathrm{ES} = -H(y\,|\,x) = \sum_{i=1}^{N} p(y^{(i)}\,|\,x)\,\log p(y^{(i)}\,|\,x) \qquad (6) $$

By using the negative entropy as a score to evaluate whether a particular sample is out-of-distribution, we consider the information provided by all available outputs rather than relying on a single network output, for example, the maximum probability (as in the baseline, ODIN, and ACET) or the distance to the nearest prototype (as in Mahalanobis). Additionally, from a practical perspective, using this a priori score avoids the need to train an ad-hoc additional regression model to detect out-of-distribution samples, which is required, for example, in Mahalanobis. Even more importantly, since no regression model needs to be trained, there is no need for unrealistic access to out-of-distribution samples or for generating adversarial examples for hyperparameter validation. Since ES is a predefined, non-trainable score, it is available as soon as the neural network training finishes.
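A corresponding inference-time sketch of equation (5) and the Entropic Score of equation (6), again under our own naming assumptions:

```python
import torch
import torch.nn.functional as F

def isomax_inference_probs(logits):
    """Inference probabilities of equation (5): the Entropic Scale is
    removed, so the plain distance-based logits are used directly."""
    return F.softmax(logits, dim=1)

def entropic_score(probs):
    """Entropic Score of equation (6): the negative entropy of the outputs.

    In-distribution samples (low entropy) receive higher scores, so inputs
    whose score falls below a chosen threshold are flagged as OOD.
    The small constant is our own numerical-stability safeguard."""
    return (probs * torch.log(probs + 1e-12)).sum(dim=1)
```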
4 EXPERIMENTAL RESULTS

The source code to reproduce all the results is available as supplementary material. Considering that outlier exposure may be integrated into, and benefit, both SoftMax and IsoMax losses, all experiments were performed without relying on outlier data [25], [26]. Similar arguments hold for approaches based on background samples [44]. All datasets, models, training procedures, and metrics followed the baseline established in [6] and subsequently used in major OOD detection papers [7], [8], [9] (see https://github.com/facebookresearch/odin). In this paper, only approaches that do not present classification accuracy drop were compared.

In our experiments, we trained 100-layer DenseNets [45] and 34-layer ResNets [46] from scratch on the CIFAR10 [47], CIFAR100 [47], and SVHN [48] datasets using SoftMax and IsoMax losses, following the same protocol presented in [8] (300 epochs; initial learning rate of 0.1 with a learning rate decay rate of ten at epochs 150, 200, and 250; and a weight decay of 0.0001).

We used resized images from the TinyImageNet dataset [49] and the Large-scale Scene UNderstanding dataset (LSUN) [50], following the same protocol used in [8], to create out-distribution data. To evaluate the OOD detection performance, we added these out-of-distribution images to the validation images of the CIFAR10, CIFAR100, and SVHN datasets to form the composed test sets.

The performance was evaluated using three detection metrics. First, we calculated the True Negative Rate at 95% True Positive Rate (TNR@TPR95). Additionally, we evaluated the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Detection Accuracy (DTACC), which corresponds to the maximum classification probability over all possible thresholds δ:

$$ \mathrm{DTACC} = 1 - \min_{\delta}\Big\{ P_{\text{in}}\big(o(x) \le \delta\big)\,P(x \text{ is from } P_{\text{in}}) + P_{\text{out}}\big(o(x) > \delta\big)\,P(x \text{ is from } P_{\text{out}}) \Big\} \qquad (7) $$

where o(x) is a given OOD detection score. It is assumed that both positive and negative samples have equal probability of being in the test set, i.e., P(x is from P_in) = P(x is from P_out). All the above metrics follow the calculation detailed in [8].
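For reference, a hedged sketch of how the three metrics can be computed from detection scores; this is our own illustration, assuming higher scores for in-distribution samples and equal test priors, as in equation (7).

```python
import numpy as np

def detection_metrics(scores_in, scores_out):
    """TNR@TPR95, AUROC, and DTACC from detection scores
    (higher score = more likely in-distribution)."""
    thresholds = np.unique(np.concatenate([scores_in, scores_out]))
    tpr = np.array([(scores_in >= t).mean() for t in thresholds])
    tnr = np.array([(scores_out < t).mean() for t in thresholds])
    # TNR at the largest threshold that still keeps TPR at or above 95%.
    tnr_at_tpr95 = tnr[tpr >= 0.95].max()
    # AUROC by trapezoidal integration over the ROC curve (with endpoints).
    fpr = 1.0 - tnr
    order = np.argsort(fpr)
    auroc = np.trapz(np.concatenate(([0.0], tpr[order], [1.0])),
                     np.concatenate(([0.0], fpr[order], [1.0])))
    # DTACC: best balanced detection accuracy over all thresholds, which
    # matches equation (7) under equal in/out priors.
    dtacc = 0.5 * (tpr + tnr).max()
    return tnr_at_tpr95, auroc, dtacc
```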
To define the global hyperparameter E_s, we trained DenseNets on SVHN using E_s equal to 1, 3, and 10. We validated these possible values using the TNR@TPR95 metric and CIFAR100 as out-distribution (Fig. 1). It is important to emphasize that CIFAR100 was never used as an out-distribution in the subsequent experiments.

Fig. 1(a) shows that the SoftMax loss minimizes both the cross-entropy and the entropy of the posterior distribution. Fig. 1(d) shows that IsoMax loss is capable of minimizing the cross-entropy while keeping a high average posterior probability entropy, as recommended by the Principle of Maximum Entropy. Fig. 1(e) shows that OOD detection performance increases for higher Entropic Scales.

It was possible to define the Entropic Scale E_s as a global hyperparameter because the experiments showed that, once validated on a single metric, model, in-distribution, and out-distribution, the global value defined for it generalized well to all other metrics, models, in-distributions, and out-distributions (Fig. 1(e)). Considering that E_s = 10 already produces very high entropy probability distributions, we see no reason to increase it even more. Therefore, all experiments in this paper used E_s = 10, and no validation was performed for each new dataset, making our proposal turnkey. The value E_s = 10 presented the best OOD detection performance and generalizes well to unseen out-distributions, as required for a satisfactory global hyperparameter candidate. Consequently, this same value was used for all other experiments (combinations of models, in-distributions, and out-distributions in Tables 1 and 2). Once E_s = 10 was confirmed as an adequate global hyperparameter, the experiments showed that IsoMax loss trained networks present classification accuracy extremely similar to SoftMax ones for all other datasets and models (Table 3).

In Table 1, SoftMax+MPS presents the worst results. The Entropic Score produces a small positive effect when applied to SoftMax loss trained networks. However, the combination of IsoMax loss with the same Entropic Score significantly improves the OOD detection performance across almost all metrics, in-distributions, and out-distributions.
Table 2 shows an unfair comparison of approaches that present different requirements and side effects. Input preprocessing (and, consequently, slower and more power-consuming inference) and validation on adversarial samples are used in both ODIN and Mahalanobis, while temperature calibration is required only in ODIN. Feature ensemble and ad-hoc classification/regression models, which may imply limited scalability, are mandatory in Mahalanobis. ACET requires adversarial training, which may restrict its use to small-scale images. ODIN, Mahalanobis, and ACET have hyperparameters tuned for each in-distribution. IsoMax+ES presents neither the mentioned drawbacks nor side effects.

Regardless of the previous considerations, the table shows that IsoMax+ES significantly outperforms ODIN in all evaluated scenarios. Additionally, IsoMax+ES usually outperforms ACET (sometimes by a large margin). Moreover, in more than half of the cases, even operating under much more favorable conditions, Mahalanobis surpasses IsoMax+ES by less than 2%. In some scenarios, the latter even overcomes the former despite avoiding hyperparameter validation, being native, scalable, and straightforward to implement and use, and presenting inference that is at least three times faster and more power-efficient. IsoMax+ES performs particularly well in one of the CIFAR100 cases, which may suggest that the fact that ES uses all outputs to decide works even better when many classes are present. We speculate that recent advances in data augmentation techniques may help to improve IsoMax+ES OOD detection performance even further [51], [52].
5 CONCLUSION
In this paper, we proposed the IsoMax loss and the Entropic Score to show that the OOD detection performance of neural networks can be significantly improved in an accurate, fast, efficient, scalable, and turnkey way simply by replacing the SoftMax loss and using a predefined, meaningful, and information-theoretically well-founded score, without relying on ad-hoc techniques, thereby avoiding their associated drawbacks, requirements, and side effects. However, if the above-mentioned limitations are not a concern for a particular application, those techniques may be combined with IsoMax loss to achieve even higher OOD detection performance. Another promising direction could be the use of recent specialized data augmentation techniques. In future work, we intend to make E_s a learnable parameter.

REFERENCES

[1] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, "On calibration of modern neural networks," International Conference on Machine Learning, 2017.
[2] W. J. Scheirer, A. Rocha, A. Sapkota, and T. E. Boult, "Towards open set recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
[3] W. J. Scheirer, L. P. Jain, and T. E. Boult, "Probability models for open set recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
[4] A. Bendale and T. Boult, "Towards open world recognition," Computer Vision and Pattern Recognition, 2015.
[5] E. Rudd, L. P. Jain, W. J. Scheirer, and T. Boult, "The extreme value machine," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[6] D. Hendrycks and K. Gimpel, "A baseline for detecting misclassified and out-of-distribution examples in neural networks," International Conference on Learning Representations, 2017.
[7] S. Liang, Y. Li, and R. Srikant, "Enhancing the reliability of out-of-distribution image detection in neural networks," International Conference on Learning Representations, 2018.
[8] K. Lee, K. Lee, H. Lee, and J. Shin, "A simple unified framework for detecting out-of-distribution samples and adversarial attacks," Neural Information Processing Systems, 2018.
[9] M. Hein, M. Andriushchenko, and J. Bitterwolf, "Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem," Computer Vision and Pattern Recognition, 2018.
[10] E. Techapanurak, M. Suganuma, and T. Okatani, "Hyperparameter-free out-of-distribution detection using softmax of scaled cosine similarity," arXiv preprint arXiv:1905.10628, 2019.
[11] Y.-C. Hsu, Y. Shen, H. Jin, and Z. Kira, "Generalized ODIN: Detecting out-of-distribution image without learning from out-of-distribution data," arXiv preprint arXiv:2002.11297, 2020.
[12] N. Carlini, A. Athalye, N. Papernot, W. Brendel, J. Rauber, D. Tsipras, I. Goodfellow, A. Madry, and A. Kurakin, "On evaluating adversarial robustness," arXiv preprint arXiv:1902.06705, 2019.
[13] Q. Yu and K. Aizawa, "Unsupervised out-of-distribution detection by maximum classifier discrepancy," International Conference on Computer Vision, 2019.
[14] A. Vyas, N. Jammalamadaka, X. Zhu, D. Das, B. Kaul, and T. L. Willke, "Out-of-distribution detection using an ensemble of self supervised leave-out classifiers," European Conference on Computer Vision, 2018.
[15] B. Lakshminarayanan, A. Pritzel, and C. Blundell, "Simple and scalable predictive uncertainty estimation using deep ensembles," Neural Information Processing Systems, 2017.
[16] A. Kendall and Y. Gal, "What uncertainties do we need in bayesian deep learning for computer vision?" Neural Information Processing Systems, 2017.
[17] C. Leibig, V. Allken, M. S. Ayhan, P. Berens, and S. Wahl, "Leveraging uncertainty information from deep neural networks for disease detection," Scientific Reports, 2017.
[18] A. Malinin and M. Gales, "Predictive uncertainty estimation via prior networks," Neural Information Processing Systems, 2018.
[19] V. Kuleshov, N. Fenner, and S. Ermon, "Accurate uncertainties for deep learning using calibrated regression," arXiv preprint arXiv:1807.00263, 2018.
[20] A. Subramanya, S. Srinivas, and R. V. Babu, "Confidence estimation in deep neural networks via density modelling," arXiv preprint arXiv:1707.07013, 2017.
[21] A. Shafaei, M. Schmidt, and J. J. Little, "A less biased evaluation of out-of-distribution sample detectors," British Machine Vision Conference, 2019.
[22] R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni, "Green Artificial Intelligence," arXiv preprint arXiv:1907.10597, 2019.
[23] T. DeVries and G. W. Taylor, "Learning confidence for out-of-distribution detection in neural networks," arXiv preprint arXiv:1802.04865, 2018.
[24] W. Liu, Y. Wen, Z. Yu, and M. Yang, "Large-margin softmax loss for convolutional neural networks," International Conference on Machine Learning, 2016.
[25] D. Hendrycks, M. Mazeika, and T. Dietterich, "Deep anomaly detection with outlier exposure," International Conference on Learning Representations, 2019.
[26] A.-A. Papadopoulos, M. R. Rajati, N. Shaikh, and J. Wang, "Outlier exposure with confidence control for out-of-distribution detection," arXiv preprint arXiv:1906.03509, 2019.
[27] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka, "Distance-based image classification: Generalizing to new classes at near-zero cost," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
[28] J. Snell, K. Swersky, and R. S. Zemel, "Prototypical networks for few-shot learning," Neural Information Processing Systems, 2017.
[29] E. T. Jaynes, "Information theory and statistical mechanics," Physical Review, 1957.
[30] ——, "Information theory and statistical mechanics. II," Physical Review, 1957.
[31] T. M. Cover and J. A. Thomas, "Elements of Information Theory," Wiley Series in Telecommunications and Signal Processing, 2006.
[32] A. Dubey, O. Gupta, R. Raskar, and N. Naik, "Maximum-entropy fine grained classification," Neural Information Processing Systems, 2018.
[33] G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton, "Regularizing neural networks by penalizing confident output distributions," arXiv preprint arXiv:1701.06548, 2017.
[34] D. Miller, A. Rao, K. Rose, and A. Gersho, "A maximum entropy approach for optimal statistical classification," IEEE Workshop on Neural Networks for Signal Processing, 1995.
[35] A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra, "A maximum entropy approach to natural language processing," Computational Linguistics, 1996.
[36] J. Shawe-Taylor and D. Hardoon, "PAC-Bayes analysis of maximum entropy classification," International Conference on Artificial Intelligence and Statistics, 2009.
[37] J. Pearl, "Probabilistic reasoning in intelligent systems: Networks of plausible inference," Morgan Kaufmann Publishers Inc., 1988.
[38] J. Williamson, "Objective Bayesian nets," We Will Show Them!, 2005.
[39] ——, "Philosophies of probability," Philosophy of Mathematics: Handbook of the Philosophy of Science, 2009.
[40] ——, "In defence of objective Bayesianism," Oxford University Press, 2013.
[41] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," International Conference on Artificial Intelligence and Statistics, 2010.
[42] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," International Conference on Computer Vision, 2015.
[43] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, 1948.
[44] A. R. Dhamija, M. Günther, and T. Boult, "Reducing network agnostophobia," Neural Information Processing Systems, 2018.
[45] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," Computer Vision and Pattern Recognition, 2017.
[46] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," Lecture Notes in Computer Science, 2016.
[47] A. Krizhevsky, "Learning multiple layers of features from tiny images," Science Department, University of Toronto, 2009.
[48] Y. Netzer and T. Wang, "Reading digits in natural images with unsupervised feature learning," Neural Information Processing Systems, 2011.
[49] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," Computer Vision and Pattern Recognition, 2009.
[50] F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao, "LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop," arXiv preprint arXiv:1506.03365, 2015.
[51] S. Thulasidasan, G. Chennupati, J. A. Bilmes, T. Bhattacharya, and S. Michalak, "On mixup training: Improved calibration and predictive uncertainty for deep neural networks," Neural Information Processing Systems, 2019.
[52] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, "CutMix: Regularization strategy to train strong classifiers with localizable features," International Conference on Computer Vision, 2019.