Natively Interpretable Machine Learning and Artificial Intelligence: Preliminary Results and Future Directions

Abstract

Machine learning models have become more and more complex in order to better approximate complex functions. Although fruitful in many domains, the added complexity has come at the cost of model interpretability. The once popular k-nearest neighbors (kNN) approach, which finds and uses the most similar data for reasoning, has received much less attention in recent decades due to numerous problems when compared to other techniques. We show that many of these historical problems with kNN can be overcome, and our contribution has applications not only in machine learning but also in online learning, data synthesis, anomaly detection, model compression, and reinforcement learning, without sacrificing interpretability. We introduce a synthesis between kNN and information theory that we hope will provide a clear path towards models that are innately interpretable and auditable. Through this work we hope to gather interest in combining kNN with information theory as a promising path to fully auditable machine learning and artificial intelligence.


Christopher J. Hazard∗, Christopher Fusting∗, Michael Resnick∗, Michael Auerbach∗, Michael Meehan∗, Valeri Korobov∗

January 21, 2019

arXiv [cs.LG]


∗ Diveplane Corporation. If you are interested in using our technology, please contact [email protected]. The authors would like to thank the investors, employees, and supporters of Diveplane Corporation for making this work possible.

As machine learning has matured, the need to understand, interpret, and explain models has become increasingly important [Alpaydin, 2016, Mohri et al., 2012, Goodfellow et al., 2016]. Machine learning models are interpreted in a variety of ways, including exploring the internals of a model [Skapura, 1996, Poerner et al., 2018], creating ex post rationalizations [Ribeiro et al., 2016, Google LLC, 2018], or using models that are interpretable from the beginning and maximize their accuracy [Wang and Rudin, 2015]. There is a perception, and some supporting evidence, that there exists a trade-off between accuracy and interpretability [Cano et al., 2006].

The motivating philosophy behind our work is that models should be innately interpretable. Specifically, our motivations are:

• decisions should be directly traceable to the training data that caused the decision to be made;
• the regions of the model should be easily characterized directly from the training data; and
• assumptions are minimal.

To achieve the aforementioned goals we combine k-Nearest Neighbors (kNN) with the principle of maximum entropy to create models that are easy to understand, make minimal assumptions, and are non-parametric.

k-Nearest Neighbors is one of the oldest, simplest, and most accurate algorithms for pattern classification and regression models [Hastie et al., 2001]. It is a simple technique that is easily implementable [Alpaydin, 1997]. The accuracy of kNN-based classification, prediction, and recommendation depends solely on a data model. Outputs from the model are usually traceable back to the exact data that influenced each decision.
This traceability enables detailed analysis of the decision inputs and characterization of the data local to the decision.

k-Nearest Neighbors was previously a dominant machine learning technology [Coomans and Massart, 1982, Breiman et al., 1984, Altman, 1992, Alpaydin, 1997] but was largely abandoned with the growing size of data and the computational complexity of finding the nearest k points [Raikwal and Saxena, 2012, Schuh et al., 2014, Hmeidi et al., 2008]. Many optimizations have been proposed over the years; they generally seek to reduce the number of distances actually computed [Pedregosa et al., 2011]. The optimizations include linear scan, Kd-trees, ball-trees, etc. [Pedregosa et al., 2011]. The curse of dimensionality has also been known to adversely affect kNN [Hastie et al., 2001, Indyk and Motwani, 1998, Schuh et al., 2013, Tao et al., 2009], and the selection of a distance function can be challenging [V. B. Surya et al., 2017]. Additionally, features may have to be scaled or standardized to prevent distance measures from being dominated by one of the features. The accuracy of kNN can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their relevance. Finally, kNN requires a value for the parameter k. If k is too small, the model may have low bias but be sensitive to noisy points and have too much variance. If k is too large, the neighborhood may include points from other classes and the model may have too little variance.

Copyright 2018-2019 Diveplane Corporation.

Our contributions in this paper are several. First, we bring numerous well-studied techniques together to improve the efficacy of kNN. Second, we connect kNN with information theory and describe numerous ways this can be applied to machine learning. Third, we illustrate how the first two contributions open the door to interpretable reinforcement learning.
Throughout this paper we discuss targetless models (models in which we are interested in predicting all the features using all the others), their utility in understanding the data, and introduce an imputation method that naturally arises from such models. The purpose of this paper is to offer a glimpse as to what the combination of kNN and information theory can offer in advancing the state of the art of machine learning and artificial intelligence.

We introduce the term targetless learning to describe our approach to kNN. Instead of the traditional approach of building a model that learns the mapping from a set of input feature variables to a set of target variables, or building multiple models to learn multiple mappings, our models consist of the relevant training data stored in a data structure that can be quickly queried. We may wish to predict and characterize any set of variables given any other set of variables. This flexibility is generally not emphasized in the related literature outside of a subset of work on semi-supervised learning and imputation [Tan et al., 2018, Zhao and Guo, 2015], and so we define two more terms to help us describe inputs and outputs in targetless learning.

Context features are the feature variables being used as inputs for a particular query.

Action features are the feature variables that are being labeled, acted upon, or predicted; in traditional targeted machine learning, these are usually referred to as target variables, labels, or responses. These terms further reflect the origin and potential for online learning applications of this approach.
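As an illustration only, the following minimal Python sketch shows how one stored data set can answer queries with different context and action features. The function name `knn_query`, the toy data, and the use of plain Euclidean distance with simple averaging are assumptions made for this sketch and are not taken from the paper.

```python
import math


def knn_query(cases, context, action_features, k=3):
    """Predict action features for a query from its k nearest stored cases.

    cases: list of dicts mapping feature name -> numeric value (the model is the data).
    context: dict of known feature values for this query (the context features).
    action_features: names of the features to predict for this query.
    """
    def distance(case):
        # Euclidean distance over the context features only (an assumption of this sketch).
        return math.sqrt(sum((case[f] - v) ** 2 for f, v in context.items()))

    neighbors = sorted(cases, key=distance)[:k]
    prediction = {
        f: sum(case[f] for case in neighbors) / len(neighbors)
        for f in action_features
    }
    # Returning the neighbors keeps the decision traceable to the training data.
    return prediction, neighbors


cases = [
    {"x": 1.0, "y": 2.0, "z": 3.0},
    {"x": 2.0, "y": 4.0, "z": 6.0},
    {"x": 3.0, "y": 6.0, "z": 9.0},
    {"x": 10.0, "y": 20.0, "z": 30.0},
]

# Same data, two different choices of context and action features.
pred_z, used_cases = knn_query(cases, {"x": 2.0, "y": 4.0}, ["z"], k=3)
pred_xy, _ = knn_query(cases, {"z": 6.0}, ["x", "y"], k=1)
```

No separate model is trained per mapping here; the choice of which features are context and which are action is made at query time.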

When determining the value of an unknown action feature, the action features of the k most similar points are averaged or their values voted upon to determine the most likely value. In general, similarity is determined by a distance metric. Unfortunately, as the number of dimensions increases it becomes difficult to distinguish points [Hinneburg et al., 2000, Beyer et al., 1999]. One proposed solution to this problem is to use the number of shared nearest neighbors as a similarity measure [Houle et al., 2010]. There is also evidence that fractional norms heading towards zero enable points to be distinguished more easily in high dimensional space [Aggarwal et al., 2001]. Fractional norms are represented as

$$\|x\|_p = \left( \sum_{i \in \Xi} w_i x_i^p \right)^{1/p}, \tag{1}$$

where p is the parameter for the Lebesgue space, Ξ is the feature set, and $w_i$ is the weight for each feature, often where $w_i = \frac{1}{n}$.

Motivated by this result we derived the Minkowski distance as $p \to 0$:

$$\lim_{p \to 0} d_p(x, y) = \left( \prod_{i \in \Xi} |x_i - y_i| \right)^{\frac{1}{|\Xi|}}. \tag{2}$$

When feature weights $w_i$ satisfy $\sum_{i \in \Xi} w_i = 1$ we have

$$\lim_{p \to 0} d_p(x, y) = \prod_{i \in \Xi} |x_i - y_i|^{w_i}. \tag{3}$$

Note that Equations 2 and 3 are geometric means and have the useful property of being scale invariant. Scale invariance means that scaling a feature by any factor will not affect the ordering of proximity, as the result is the same as multiplying all of the distances by a constant. Thus using p = 0 enables the data to be stored in its original form, not scaled, standardized, or normalized, which improves the transparency of the model and removes the need for that aspect of feature engineering. As $p \to 0$, the scale of features matters less, meaning that the Minkowski distance is approximately scale invariant with regard to p values that are relatively close to 0.

3.1.1 A Probabilistic Approach

We are unaware of any prior work that has investigated using p = 0 as a distance function. This is unsurprising, as p = 0 causes significant problems for any data set that has categorical data or other data that may be exactly equal. Consider a data set that has two features. If two points have equal values for the first feature, then the distance between the points will be zero regardless of the distance between the values of the other feature, due to the multiplication by zero.

Instead, consider distance probabilistically, in the sense that each feature “distance” is the probability that the two values are different given the observations or measurements of the feature values (instead of being the absolute value of their difference) as a way of handling uncertainty [Agarwal et al., 2016]. If we assume that each observation is independent, then simply multiplying the probabilities that each feature value is the same yields the same distance measure as Equation 2 when determining the conjunction of the probabilities. Solving the problems that come from exact matches using p = 0 means we can work directly with probabilities or deal with features on wildly different scales without having to standardize or otherwise scale them. Using the geometric mean to combine measurements of achieving different goals has been shown to be an effective objective function for multicriteria optimization [Harrington, 1965], and so using it to provide contrast between different similarities is a natural use.

Suppose we have made two observations of a value, x and y respectively, and we would like to know the “distance” between them. The obvious distance of |x − y| yields the maximum likelihood value of the distance, but does not yield the expected value of the distance.
Consider that there is considerable deviation among observations of the same value, meaning that there is likely to be a relatively large expected difference between observations of the same value. We use the term deviation to encompass both the error, which pertains to the difference between actual and measurement, and the residual, which pertains to the difference between actual and estimated. This generic use of deviation applies regardless of whether the observation is measured, predicted, or inferred, and regardless of whether the deviation is due to randomness or lack of additional information that would reduce the deviation.

Footnote: Using p = 0 is not a metric, and arguably not a distance function, as it fails the triangle inequality. Further, it is not technically p = 0 but rather $\lim_{p \to 0}$, but we use this abuse of notation for simplicity.

Footnote: We note that this realization was inspired by early blog post drafts of work done by Leinster and Meckes [2016] in that the generalized diversity index, which can be parameterized to measure the Shannon entropy, is nothing more than the reciprocal of the generalized mean when substituting p − 1 for p when dealing with probabilities [Tuomisto, 2010], and the Minkowski distance is just the generalized mean of the differences.

Consider two observations, x and y, with considerable deviation. If x = 100 ± 10 and y = 100 ± 10, the difference between x and y is likely to be greater than 0 even though the expected values are the same, yet a simple subtraction yields 0. Further, consider that x and y are feature vectors of length two of x = { . , } and y = { . , }. If we have a third observation that is z = { . , . }, using p = 0 for measuring the similarity between z and both x and y will yield x as infinitely closer than y because the difference between the first terms is zero and the multiplication makes the distance zero. Though this may sometimes be desirable, larger deviations for the first feature and smaller deviations for the second feature should yield y as more similar to z than x.

To solve the problem of zero expected distance for identical measurements despite deviation, and to address the high sensitivity of $L_p$ with close or exact matches with a low p value, we employ the Łukaszyk-Karmowski metric [Łukaszyk, 2003, 2004]. Given a probability density function of x, f, and a probability density function of y, g, the expected difference between them becomes

$$d(x, y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} |x - y| \, f(x) \, g(y) \, dx \, dy. \tag{4}$$

We assume that if both points are near enough to be worth determining the distance between them, then the distributions and parameters for the probability density functions should represent the local data. The two simple maximum entropy distributions on (−∞, ∞) given a point and a distance around the point are the Laplace distribution (double exponential) and Gaussian distribution, depending on whether the distance is represented as mean absolute error or standard deviation respectively. The Gaussian or normal distribution has a clean closed form solution. Letting $\mu_{xy} \equiv |x - y|$, the expected distance for two identical normal distributions becomes

$$d_{NN}(x, y) = \mu_{xy} + \frac{2\sigma}{\sqrt{\pi}} \exp\left( -\frac{\mu_{xy}^2}{4\sigma^2} \right) - \mu_{xy} \operatorname{erfc}\left( \frac{\mu_{xy}}{2\sigma} \right). \tag{5}$$

For the previous example of x = 100 ± 10 and y = 100 ± 10, the expected distance is approximately 11.28 (with $\mu_{xy} = 0$ and σ = 10, Equation 5 reduces to $2\sigma/\sqrt{\pi}$).

We denote the residual, r, for predicting each feature, i, as $r_i$. We have found that using the residuals in the kNN system with the Łukaszyk-Karmowski metric, calculating new residuals, and then feeding these back in, generally yields convergence of the residual values with notable convergence after only 3 or 4 iterations.

Measuring a distance value for each feature further enables parameterization regarding the type of data a feature holds. For example, nominal data can result in a distance of 1 if the values are not equal and 0 if they are equal. Thus, one-hot encoding, the expansion of nominal values into multiple features, is not needed. Ordinal data can use a distance of 1 between each ordinal type. Cyclical data can perform appropriate subtractions while keeping the data on a single dimension, which keeps the feature directly understandable rather than having to split the feature into two using trigonometry.

We now quantify the amount of information in a kNN model. Because our formulation of kNN uses a similarity measure based on distance, we first quantify each point x by the amount of distance it contributes to the k nearest points. In general, we define conviction as a normalized measure of how much surprisal one would expect for a given situation relative to the surprisal observed. If we have some form of prior distribution of data given all of the information observed up to that point, the surprisal is the amount of information gained when we observe a new sample, event, case, or state change and update the prior distribution to form a new posterior distribution after the event. The surprisal of an event of observing a random variable x ∼ X is defined as I(x) = − ln p(x). Thus, the conviction, π, can be expressed as

$$\pi(x) = \frac{E[I(X)]}{I(x)}. \tag{6}$$

This ratio results in conviction values π ∈ [0, ∞), where

• π = 0 means this point has an infinite amount of surprisal, that is, the point was previously thought to be impossible to exist within the dataset;
• π = 1 means this point has an average amount of surprisal, that is, it adds an average amount of information to the model; and
• π = ∞ means this point is not at all surprising, that is, it is so redundant that the point could be discarded without affecting the model at all.

This ratio is indicative of how much information is required to encode one aspect of the model relative to another, whether dealing with cases or features. In some cases conviction can act as a proxy for several matters, such as how confident we are about our data, whether the data is correct or anomalous, whether the data belongs together, or whether the data is useful in making predictions. In other cases, conviction can inform how the model will be harmed if data is removed, and additionally can be used to control the surprisal when performing data synthesis.

Conviction can be computed in a targeted or targetless manner. In a targeted manner, each feature or case is compared against another set of target cases or features one on one. In an untargeted manner, each case or feature is held out one by one and compared against the rest of the data in the model. When holding one out, the change in probability impact on other elements of the model indicates a measure of hubness or centrality of the data which, when isolated, has been found to be of significant importance for determining the influence of data on the model [Tomašev and Mladenić, 2014].

If the probability space over which conviction is normalized is broadened (or if surprisal is used without normalization), then even the model impact of combinations of features can be compared to that of combinations of observations.

In the following sub-sections, we discuss different forms of conviction that can be derived.
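Two of the quantities introduced above lend themselves to a short numerical sketch: the closed-form expected distance between two normally distributed observations with a shared standard deviation, and the conviction ratio. The Python below is a sketch under stated assumptions rather than a reference implementation. The function names are hypothetical, and the conviction function assumes a small discrete distribution in which the expected surprisal is the probability-weighted mean of the individual surprisals.

```python
import math


def expected_distance_normal(x, y, sigma):
    """Expected |X - Y| for X ~ N(x, sigma^2) and Y ~ N(y, sigma^2), independent."""
    mu = abs(x - y)
    return (
        mu
        + 2.0 * sigma / math.sqrt(math.pi) * math.exp(-(mu ** 2) / (4.0 * sigma ** 2))
        - mu * math.erfc(mu / (2.0 * sigma))
    )


def conviction(probabilities, index):
    """Expected surprisal divided by the surprisal of the event at `index`.

    Assumes every probability is strictly between 0 and 1.
    """
    surprisals = [-math.log(p) for p in probabilities]
    expected = sum(p * s for p, s in zip(probabilities, surprisals))
    return expected / surprisals[index]


# Two observations of 100 with deviation 10: the subtraction gives 0,
# but the expected distance is 2 * sigma / sqrt(pi).
same_value = expected_distance_normal(100.0, 100.0, 10.0)

# Far-apart observations: the expected distance approaches |x - y|.
far_apart = expected_distance_normal(0.0, 1000.0, 10.0)

probs = [0.5, 0.25, 0.125, 0.125]
common_event = conviction(probs, 0)  # above 1: less surprising than average
rare_event = conviction(probs, 3)    # below 1: more surprising than average
```

The sketch also shows the qualitative behaviour described for conviction: events that are more probable than average receive values above 1, and rarer events receive values below 1.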
More forms of surprisal and conviction can be conditioned and computed, opening up a rich area for different kinds of informativeness about various aspects of the model.

We note that conviction is related to, but not exactly the same as, feature importance or case influence, and so care must be taken when comparing the two.

We define prediction conviction as the amount of surprisal required to predict a value given a model and the model’s uncertainty. To characterize the model’s uncertainty, we use residuals.

Definition 1

Let ξ be the number of features in a model and n the number of observations. We define the residual function, $r : X \to \mathbb{R}^{\xi}$, on the training data X as

$$r(x) = J^{\Omega}_{1}(x), J^{\Omega}_{2}(x), \ldots, J^{\Omega}_{\xi}(x), \tag{7}$$

where $J^{\Omega}_{i}$ is the residual of the model on feature i at point x, parameterized by a set of hyperparameters Ω. We will refer to the residual function evaluated on all of the model data as $r_M$.

Typically, the feature residuals will be calculated as mean absolute error or standard deviation. Further, subsets of features may be used to compute the residual, particularly when performing targeted operations.
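One possible reading of the residuals over all of the model data is a leave-one-out mean absolute error per feature, where each feature is predicted from all of the others in a targetless fashion. The sketch below follows that reading. The function name `feature_residuals`, the Manhattan distance over the context features, and plain averaging of neighbor values are assumptions of this illustration, not details given in the paper.

```python
def feature_residuals(cases, k=2):
    """Leave-one-out mean absolute error per feature, each predicted from the rest."""
    features = list(cases[0].keys())
    residuals = {}
    for target in features:
        context = [f for f in features if f != target]
        errors = []
        for i, case in enumerate(cases):
            # Hold the case out, then rank the remaining cases by similarity.
            others = cases[:i] + cases[i + 1:]
            # Manhattan distance over the context features (an assumption).
            others.sort(key=lambda c: sum(abs(c[f] - case[f]) for f in context))
            neighbors = others[:k]
            predicted = sum(c[target] for c in neighbors) / len(neighbors)
            errors.append(abs(predicted - case[target]))
        residuals[target] = sum(errors) / len(errors)
    return residuals


cases = [
    {"a": 0.0, "b": 0.0},
    {"a": 1.0, "b": 2.0},
    {"a": 2.0, "b": 4.0},
    {"a": 3.0, "b": 6.0},
]
residuals = feature_residuals(cases, k=2)
```

On this toy data the interior points are predicted exactly and only the two end points contribute error, which is why the residual for `b` is twice that for `a`.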

Definition 2

Given a point x ∈ X and the set K of its k nearest neighbors, a distance function $d : \mathbb{R}^z \times Z \to \mathbb{R}$, and a distance exponent α, the distance contribution of x is the harmonic mean

$$\phi(x) = \left( \frac{1}{|K|} \sum_{k \in K} \frac{1}{d(x, k)^{\alpha}} \right)^{-1}. \tag{8}$$

The distance contribution reflects how much “distance” a point contributes to a graph connecting the nearest neighbors, which is the inverse of the density of points over a unit of distance in the Lebesgue space. The harmonic mean of the distance contribution reflects the inverse of the inverse distance weighting often employed with kNN, though other techniques may be substituted if inverse distance weighting is not employed.

We can quantify the information needed to express a distance contribution φ(x) by transforming it into a probability. We begin by selecting the exponential distribution to describe the distribution of residuals as it is the maximum entropy distribution constrained by the first moment. We represent this in typical nomenclature for the exponential distribution using the norm from Equation 1 as

$$\frac{1}{\lambda} = \|r(x)\|_p. \tag{9}$$

Footnote: It is possible to compare the results of estimating values for a feature, and then compute a conviction ratio comparing the mutual information among a set of estimated values as an information theoretic representation of mean decrease in accuracy or Shapley value. This would be an information theoretic result that is much closer to feature importance.

We can directly compare the distance contribution and p-normed magnitude of the residual. This is because the distance contribution and the norm of the residual are both on the same scale, with the distance contribution being the expected distance of new information that the point adds to the model, and the norm of the residual being the expected distance of deviation.
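The two quantities being compared here, the harmonic-mean distance contribution and the weighted p-norm of a residual vector, can be sketched in a few lines of Python. The function names are hypothetical. Raising an error on a zero distance is a choice made for this sketch only, since exact duplicates need separate handling. Uniform weights of 1/n are assumed when none are given.

```python
def distance_contribution(distances, alpha=1.0):
    """Harmonic mean of d(x, k)^alpha over the distances to the k nearest neighbors."""
    if any(d <= 0 for d in distances):
        # Exact duplicates need special handling that this sketch does not attempt.
        raise ValueError("distances must be strictly positive")
    return len(distances) / sum(1.0 / d ** alpha for d in distances)


def weighted_norm(values, p, weights=None):
    """Weighted p-norm (sum_i w_i |v_i|^p)^(1/p), with uniform weights 1/n by default."""
    n = len(values)
    if weights is None:
        weights = [1.0 / n] * n
    return sum(w * abs(v) ** p for w, v in zip(weights, values)) ** (1.0 / p)


phi_even = distance_contribution([2.0, 2.0, 2.0])    # equal distances: harmonic mean is 2
phi_mixed = distance_contribution([1.0, 2.0, 4.0])   # dominated by the closest neighbor
residual_magnitude = weighted_norm([3.0, 4.0], p=2, weights=[1.0, 1.0])
```

Because it is a harmonic mean, the distance contribution is pulled towards the smallest neighbor distance, so a point sitting close to even one other point contributes little distance.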
Given the entropy maximizing assumption of the exponential distribution of the distances, we can then determine the probability that a distance contribution is greater than or equal to the magnitude of the residual $\|r(x)\|_p$ in the form of cumulative residual entropy [Rao et al., 2004] as

$$P(\phi(x) \geq \|r(x)\|_p) = e^{-\frac{1}{\|r(x)\|_p} \cdot \phi(x)}. \tag{10}$$

We then convert the probability to self-information as

$$I(x) = -\ln P(\phi(x) \geq \|r(x)\|_p), \tag{11}$$

which simplifies to

$$I(x) = \frac{\phi(x)}{\|r(x)\|_p}. \tag{12}$$

As the distance contribution decreases, or as the residual vector magnitude increases, less information is needed to represent this point. We can then compare this to the expected value in regular conviction form, yielding a prediction conviction of

$$\pi_p = \frac{E I}{I(x)}, \tag{13}$$

where I is the self-information calculated for each point in the model.

Footnote: Other distributions may be selected by adjusting the assumptions slightly, such as the log-normal distribution. The log-normal distribution is the maximum entropy distribution assuming that we know the standard deviation rather than the mean, but this distribution assumes that something closer is less likely, and may be better suited for familiarity conviction. Further, the exact distribution of the distance contribution may be solved if the distributions of the features are known.

Feature prediction contribution is motivated by mean decrease in accuracy (MDA) [Archer and Kimes, 2008]. In MDA, scores are established for models with all the features, M, and models with each feature held out, $M_{-f_i}$, i = 1 . . . ξ. The difference $|M - M_{-f_i}|$ is the importance of each feature, where the result’s sign is altered depending on whether the goal is to maximize or minimize score. Feature prediction contribution differs from MDA in that feature prediction contribution measures the conditional entropy of adding a feature. This means using prediction conviction on features with significant information may yield higher contribution values even if the feature is independent.

Prediction contribution information, $\pi_c$, is correlated with accuracy and thus can be used as a surrogate. The expected self-information required to express a feature is given by

$$E I(M) = \frac{1}{\xi} \sum_{i=0}^{\xi} I(x_i),$$

and the expected self-information to express a feature without feature i is

$$E I(M_{-i}) = \frac{1}{\xi} \sum_{j=0}^{\xi} I_{-i}(x_j).$$

From these equations, we can more formally define prediction contribution of a feature and prediction conviction of a feature.

Definition 3

The prediction contribution of a feature, $\pi_c$, of feature i is

$$\pi_c(i) = \frac{E I(M) - E I(M_{-f_i})}{E I(M)}.$$

Definition 4

The prediction conviction of a feature, $\pi_p$, of feature i is

$$\pi_p(i) = \frac{\frac{1}{\xi} \sum_{j=0}^{\xi} E I(M_{-f_j})}{E I(M_{-f_i})}.$$

Familiarity conviction is a metric for describing surprisal of points in a model relative to the training data. This differs fundamentally from prediction conviction. Consider a data set that has data points at regular intervals, such as a data point for each corner in a grid. Given this grid data, prediction conviction will indicate that a data point very close to an existing data point will not be surprising and that it should be easy to predict given the low level of uncertainty. However, familiarity conviction would indicate a higher surprisal for such a point even though it is easy to label because the point is unusual with regard to the even distribution of the rest of the data points. This new point does not form another corner of the grid. The pair of prediction conviction and familiarity conviction can be used together to find and remove data points that are easy to predict but unusual with regard to uniqueness of data. These properties make familiarity conviction valuable for sanitizing data and reducing data as well as extracting patterns and anomalies, as is discussed in other sections.

Familiarity conviction is based on the similarity metrics as described in Section 3.1. As long as a low or zero value of p is used in $L_p$ space metrics for similarity, familiarity conviction is independent of the scale of the data provided and does not overreact to feature dominance based on feature scale and range.

Definition 5

Given a set of points $X \subset \mathbb{R}^z$ for every x ∈ X and an integer 1 ≤ k < |X| we define the distance contribution probability distribution, C, of X to be the set

$$C = \left\{ \frac{\phi(x_1)}{\sum_{i=1}^{n} \phi(x_i)}, \frac{\phi(x_2)}{\sum_{i=1}^{n} \phi(x_i)}, \ldots, \frac{\phi(x_n)}{\sum_{i=1}^{n} \phi(x_i)} \right\} \tag{14}$$

for a function $\phi : X \to \mathbb{R}$ that returns the distance contribution.

Note that because φ(0) = ∞ may be true under some circumstances, multiple identical points may need special consideration, such as splitting the distance contribution among those points.

Remark 1

Clearly C is a valid probability distribution. We will use this fact to compute the amount of information in C.

Definition 6

The point probability of a point $x_i$, i = 1, 2, . . . , n is

$$l(i) = \frac{\phi(x_i)}{\sum_{j} \phi(x_j)}, \tag{15}$$

where we see the index i is assigned the probability of the indexed point’s distance contribution.

Definition 7

The set of random variables that characterize the discrete distribution of point probabilities, L, is the set L = {l(1), l(2), . . . , l(n)}.

Remark 2

Because we have no additional knowledge of the distribution of points other than they follow the distribution of the data, we assume L is uniform as the distance probabilities have no trend or correlation.

Remark 3

A distance contribution is a discrete distribution of point probabilities.

The familiarity conviction of a point $x_i \in X$ is

$$\pi_f(x_i) = \frac{\frac{1}{|X|} \sum_{j} KL\left( L \,\|\, L - \{j\} \cup E\,l(j) \right)}{KL\left( L \,\|\, L - \{x_i\} \cup E\,l(i) \right)}, \tag{16}$$

where KL is the Kullback-Leibler divergence. Since we assume L is uniform, we have that the expected probability $E\,l(i) = \frac{1}{n}$. Equation 16 can thus be used to compute familiarity conviction.
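The definition of familiarity conviction leaves some operational details open, so the Python below is only one possible reading, offered as a sketch. It assumes that removing a point from L and adding back its expected probability means replacing that point's probability with 1/n and renormalizing, and it returns infinity when a point's divergence is zero. The function names and the example distance contributions are hypothetical.

```python
import math


def point_probabilities(contributions):
    """Normalize distance contributions so that they sum to 1."""
    total = sum(contributions)
    return [c / total for c in contributions]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def familiarity_convictions(contributions):
    """Mean divergence over all points divided by each point's own divergence."""
    probs = point_probabilities(contributions)
    n = len(probs)
    expected = 1.0 / n  # expected point probability under the uniform assumption
    divergences = []
    for i in range(n):
        # Assumption: swap point i's probability for the expected one, then renormalize.
        replaced = probs[:i] + [expected] + probs[i + 1:]
        total = sum(replaced)
        replaced = [q / total for q in replaced]
        divergences.append(kl_divergence(probs, replaced))
    mean_divergence = sum(divergences) / n
    return [mean_divergence / d if d > 0 else math.inf for d in divergences]


# Three typical points and one point with an unusually large distance contribution.
convictions = familiarity_convictions([1.0, 1.0, 1.0, 5.0])
```

Under this reading, the point whose distance contribution is far from typical receives a conviction below 1 (more surprising than average), while the three typical points receive values above 1.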

Having defined two methods (convictions) to measure surprisal in the space of points X, we introduce techniques that naturally fall out of the information-based representation. The ability to add and remove points to a model easily without retraining, coupled with familiarity and prediction conviction, enables numerous applications in model compression, online learning, anomaly detection, model to model comparison, reinforcement learning, synthetic data generation, and likely more techniques we have not considered. We detail some of these in the subsections following.

Accuracy is best from models with more data, but using kNN with additional data comes at a computational cost. Even for models where that relationship does not hold, the memory needed to store the data can be expensive. This important problem has received some attention with some promising results despite the difficulty of being an NP-hard problem [Gottlieb et al., 2014, Kontorovich et al., 2017]. Our probability and entropy based approach offers some new ways of looking at this problem in an interpretable manner, relaxing some constraints with regard to strict metric spaces.

The entropy measures we discuss can be employed for pruning the model of cases as well as for targetless feature pruning, thereby reducing data size while retaining information in the model. Overall, performing feature or case pruning can benefit any system by reducing the memory and possibly computational resources needed to use the model. Further, these techniques can be used to help direct training, that is, to identify which parts of the model may benefit from having more data. We note that model reduction may be targetless, and in such cases is not a substitute for feature engineering for a targeted application.
Rather, it is useful for removing redundant data to focus on the most or least surprising relationships.

Our entropy-based techniques are generally applicable to either feature or case pruning based on the amount of information that each feature or case provides to the model. A first step in data reduction can be to detect and remove anomalies from the model. Anomalous cases, arguably, reduce the average usefulness of decisions the model makes. Therefore, removing anomalies can actually improve the usefulness of the model while slightly reducing the model size. A next step to reduce model size would be to use our techniques to remove those cases or features that do not provide significant-enough amounts of information to the model.

There are a few approaches we use to determine which cases or features to prune. We can look at the surprisal of the case relative to the rest of the model. If the surprisal is low, the case may be removed from the model for being redundant, which may be done dynamically for online learning. In some circumstances, we keep the instances with the highest surprisal (i.e., the most informative) in order to cap the model size, or remove the case with the lowest surprisal in order to reduce the model size by a set amount. We can also keep the cases with a surprisal above a certain threshold, and perhaps vary that threshold, balancing model size and information. This also applies to features in the model. Those features with high surprisal can be kept (or alternatively, those features with low surprisal can be removed), thereby reducing the model size in the feature space. Further, both case pruning and feature pruning could be performed, reducing the model in both dimensions.

Though we have outlined a few surprisal and conviction measures in this paper for model reduction, they can be combined in other ways to perform model reduction, or used with other feature engineering techniques.
For example, feature prediction conviction could be used to determine whether to include features in a targetless model, based on high or low surprisal for predictions. Feature familiarity conviction could be used to determine which features have training data that most conforms to typical training data reflected in the rest of the model.
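The case-pruning rules described in this section (cap the model size by keeping the most surprising cases, or keep only cases above a surprisal threshold) can be sketched as follows. The function name `prune_by_surprisal` and the example surprisal values are hypothetical, and the sketch assumes the per-case surprisals have already been computed by some other means. It ranks once and does not recompute surprisal after each removal.

```python
def prune_by_surprisal(surprisals, max_cases=None, threshold=None):
    """Return the indices of the cases to keep, most surprising first.

    surprisals: one precomputed surprisal value per case.
    max_cases: if given, keep at most this many of the most surprising cases.
    threshold: if given, keep only cases whose surprisal is above this value.
    """
    ranked = sorted(range(len(surprisals)), key=lambda i: surprisals[i], reverse=True)
    if threshold is not None:
        ranked = [i for i in ranked if surprisals[i] > threshold]
    if max_cases is not None:
        ranked = ranked[:max_cases]
    return ranked


surprisals = [0.1, 2.0, 0.5, 1.5]
capped = prune_by_surprisal(surprisals, max_cases=2)
above_threshold = prune_by_surprisal(surprisals, threshold=0.4)
both = prune_by_surprisal(surprisals, max_cases=2, threshold=0.4)
```

The same function could be applied to per-feature surprisal values to prune in the feature dimension.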

In that kNN models are data, we can compare them using prediction and familiarity conviction directly. The use of such a comparison is of particular interest in online learning where one could detect a non-stationary process by comparing models built during different times. In reinforcement learning one could compare models to establish propensity for exploration versus exploitation.

Further, by measuring the amount of surprisal that one model contributes to another, some sparse machine learning problems can be transformed into dense problems. Consider numerous individuals that occasionally take some action of a certain type. A model may be trained for each individual, and then compared with surprisal against other individuals or against larger aggregated models to find which larger model would be least surprising for the individual to be included in. We have found promising initial results in commercial settings with these techniques. Combining these untargeted techniques with targeted model selection techniques, such as Bayesian information criterion [Schwarz et al., 1978] or Akaike information criterion [Akaike, 1973], is another potential future direction.

Sparse data is an issue for many machine learning systems, which fail to perform or perform poorly when trained with sparse data, especially if the data is sparse across multiple features. The sparse data can come from sporadic unobservability (i.e., failure to capture the feature at particular times) and unavailability (e.g., no collection of certain features at certain times, a common issue with historical datasets). Regardless of the cause, the combination of targetless kNN with entropy can be used to overcome sparsity by imputing missing features in sparse data, and doing so in a way that reflects the dataset.

In past approaches, generically solving for missing values in a data set has presented many challenging problems, such as Boolean satisfiability and other NP-complete and NP-hard problems. When the rules that determine the missing values are also unknown, the problem becomes even more difficult. Semi-supervised learning has been a rich area of study to attempt to tackle this problem [Triguero et al., 2015]. Our targetless approach, when combined with conviction, offers a novel technique for imputation that can fill in missing data across all of the features and scaffold knowledge from cases with missing data as it fills the missing data in.

Footnote: We are referring to the limited semi-supervised learning techniques that are typically available to data scientists. There are a number of statistical techniques available if sufficient information about a distribution is known, and domain-specific imputation techniques, particularly for language processing, such as that of Gemmeke and Cranen [2008].

The algorithm finds the cases with missing data that are least surprising, labels them, inserts the label into the model for the case, and repeats the process until no more missing data exists. Grouping the cases into batches (i.e., imputing multiple missing data points in a batch) can improve performance so that conviction only needs to be computed occasionally.

More specifically, the algorithm orders the cases with missing data by their surprisal for each feature, conditioned only on the data known for the case, from lowest to highest. For particularly sparse data, we have had success with first ordering the cases by those with the fewest nulls prior to ordering by entropy. Starting with the case with the lowest entropy, we determine a value for the missing data using kNN based on the known features. We can determine values for multiple features at once (in batch), or determine just one missing value, and then redetermine the entropy for the cases, and return to impute more data. This process continues until all of the data is complete. When we want to merely reduce the sparsity of the data (vs. filling in all of the values), we can continue until we hit a termination condition or sparsity threshold. Additional work is needed, but we believe a viable termination condition is ceasing data imputation when the entropy for the lowest-entropy case exceeds a threshold. We believe that, after that point, the additional informational value of the imputed data may be low, and, as such, we cease imputing missing data.
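
The ordering-and-imputing loop described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' algorithm: missing values are represented as `None`, distance is a plain mean absolute difference over the features two cases both know, and the mean distance to the nearest neighbors is used as a stand-in for the entropy or surprisal of a case. All function names are hypothetical.

```python
from typing import List, Optional

Row = List[Optional[float]]


def _distance(a: Row, b: Row) -> Optional[float]:
    """Mean absolute difference over features known in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    return sum(abs(x - y) for x, y in shared) / len(shared)


def _neighbors(data: List[Row], idx: int, feature: int, k: int):
    """(distance, value) pairs of the k nearest rows that know `feature`."""
    found = []
    for j, row in enumerate(data):
        if j == idx or row[feature] is None:
            continue
        d = _distance(data[idx], row)
        if d is not None:
            found.append((d, row[feature]))
    found.sort(key=lambda pair: pair[0])
    return found[:k]


def impute(data: List[Row], k: int = 3) -> List[Row]:
    """Fill in missing values (None) one at a time, easiest cases first."""
    data = [list(row) for row in data]  # work on a copy
    while True:
        candidates = []
        for i, row in enumerate(data):
            for f, value in enumerate(row):
                if value is not None:
                    continue
                near = _neighbors(data, i, f, k)
                if near:
                    nulls = sum(v is None for v in row)
                    # Stand-in for surprisal: mean distance to the neighbors.
                    proxy = sum(d for d, _ in near) / len(near)
                    candidates.append((nulls, proxy, i, f, near))
        if not candidates:
            return data  # complete, or nothing left that can be imputed
        # Fewest nulls first, then lowest proxy surprisal.
        _, _, i, f, near = min(candidates, key=lambda c: c[:4])
        data[i][f] = sum(v for _, v in near) / len(near)
```

Because each imputed value is written back before the next pass, later imputations can draw on earlier ones, which mirrors the scaffolding behavior described in the text. A threshold on the proxy could be added as a termination condition when only a reduction in sparsity is wanted.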

Footnote: Currently, we use entropy for the cases to determine what data to impute first, but we are also exploring the use of entropy for features. We would first order by feature entropy, and then by entropy of the cases missing the lowest-entropy feature. We would impute that data first, then recalculate entropies for the model, similar to the manner described above.

Footnote: We can also use the techniques described in this paper for synthetic data generation to determine the value for the missing feature, as described in Section 3.3.6.

The choice of a kNN model allows us a number of interesting opportunities related to providing explanations of the data. With kNN, the model is the data. The model is only complemented by kernel parameters to define nearest neighbors, and by any weights, adjustments or removal of data. In the assessment of the model for a decision or action suggestion, the data points that are within the kernel, or that are above a certain threshold for how much they impact the kernel values, are the data points that caused the decision. This is compelling from a machine decision auditability perspective because it means that the data relevant to each decision is directly identified, and that auditing, editing or removing the training data identified as being associated with a decision would have had a direct effect on the decision.

Below, we discuss numerous types of explanation data that can be derived from a kNN model. For each, we give a short description of how it is generated, and what a system or operator can do with it. As an overview, for each type of explanation data (or based on combinations of explanation data), that data can be passed to a system or human operator for review. The system or human operator can use the explanation data to decide whether to perform the action in question, perform a different action, or perform no action at all. Additionally, this data can be used after the fact to audit decisions that were made. Auditability begins, as alluded to above, with the fact that we can determine what data was used to suggest an action. The explanation data can also be generated either at the time of suggesting the action or at the time of the audit.

One of the measures we use as part of our explanation data is conviction, which can be seen, broadly, as the ratio of the expected surprisal to the actual surprisal of a particular case. Numerous types of conviction are discussed in this paper, such as familiarity conviction and prediction conviction. We also discuss formulating each in a targeted or untargeted manner. Any of these formulations for conviction may be used and provided as explanation data along with answers or suggested actions. As an example, targeted or untargeted familiarity conviction may be calculated and provided along with a suggested decision.
An overly low (or high) conviction score can be a cause for concern, and that suggested decision may be flagged for further review or ignored completely. In contrast, a suggested decision or action with a conviction score that is not concerning (e.g., a moderate conviction score, or, in some circumstances, a high conviction score), may be performed or acted upon without further review. A low conviction score may be associated with the data being outside the usual pattern of the data, and therefore be of concern for systems that are trying to perform “usual” actions. High conviction may be of concern when there is a desire to not perform actions that are too “usual.” A high- and low-pass filter can be used when there is a desire to perform actions that are neither overly usual nor overly unusual, but instead are somewhere in the middle in terms of how expected they are.

We can also use feature prediction contribution to find bias in decision making. Many models and data sets contain many features, and are used for many decisions. As described elsewhere herein, our techniques can help prune a model of cases and features. Nevertheless, there may always be features in models that, in an ideal world, would contribute little to nothing to certain decisions. For example, in the context of financing decisions to individuals, there are certain features that, if decisions were made based on them, would constitute undesirable bias. These may include gender or race. The feature prediction contribution can be provided to the system or human operator to help determine whether making the decision based on that feature would constitute unpermitted bias.
This could lead to pruning that feature from the model or taking other steps to reduce or eliminate its impact on the decision, such as the use of feature weights.

The local region of the model (which we will refer to as the local model) comes with no added cost with kNN, and therefore we can perform analysis on the local model and make other queries on relevant data. As a few examples, we can easily find counterfactual cases [Wachter et al., 2018] and the boundary conditions they yield by performing a query on the data that maximizes the ratio of the action features to the total set of features. We can also find out if any of the features in a case associated with the actions suggested by the model are outside the range of the corresponding features of the cases in the local model. Features being outside the range may be cause for concern and may cause the suggested action to not be performed.

As a counterpoint to the counterfactual cases, we can also determine archetype cases: cases that have the same action as the suggested action and are furthest from other cases with different actions. The distance from the case with the suggested action to the archetype case can be used as explanation data, audit data, and to determine whether to perform a suggested action without further review.

We can find ratios of different types of conviction in the local model to those of the total model, which can indicate which cases and features add significant information to a particular region, and which add less information. If the information is not correlated with accuracy (e.g., has low conviction in the local model and high conviction in the model as a whole), then it may be a measure of noise. If the cases that contribute to the suggestion of an action are above a noisiness threshold, then it may be inadvisable to perform a suggested action. On the other hand, if the noisiness of the cases used to suggest an action is low, the system or operator may be confident in performing the action. There are many other ratios of conviction that may be used as explanation and audit data, and different ones will be useful in different scenarios.

A “less similar” model can also be determined, where the closest k cases (by distance, by count, by density threshold) are excluded, and the distance to the next closest cases is determined. That distance can be used as explanation data, audit data, and to determine whether to perform a suggested action without further review. For example, a higher distance to the “less similar” cases can be an indicator that the suggested action is in a sparsely populated part of the model, and therefore should be reviewed before being acted upon.

We can also define feature residuals based on the local model (or a regional model that is the N nearest neighbors, where N can be the same, higher, or lower than the k used to define the local model). Here, we can use the mean absolute error, variance, or other moments or measures to assess how well the model predicts each feature when it is removed. We can also determine action probabilities, which, in the case of categorical actions, can be measured as the percentage of cases in the local model that have that categorical action. For continuous or ordinal actions, the action probability can be a probability measure based on the confidence interval of the suggested actions for a given tolerance (e.g., an action value of 250 may be 67% (the action probability) likely to be within +/-5 (the tolerance) of 250). Feature residuals and conviction can be used in conjunction; a prediction with high prediction conviction but also wide residuals may not be reliable, but a prediction with low prediction conviction but wide residuals may potentially be improved if further training data is added similar to the prediction. We can also determine local or regional model complexity (i.e., whether the variance is high, whether the accuracy is low, whether correlations among variables are low, etc.) and fractal dimensionality (i.e., placing a shape over the model and shrinking the scale of the shape and counting the number of smaller shapes needed to cover the extents of the model).
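
Two of the explanation measures discussed in this section, the conviction band used to flag a suggestion for review and the action probability within the local model, can be sketched as follows. This is an illustration only: the function names and the default band limits (0.5 and 2.0) are arbitrary assumptions, the local model is assumed to be non-empty, and the continuous case uses a simple empirical fraction in place of a confidence-interval calculation.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def needs_review(conviction: float, low: float = 0.5, high: float = 2.0) -> bool:
    """Flag a suggestion whose conviction falls outside the band [low, high]."""
    return conviction < low or conviction > high


def categorical_action_probabilities(
    local_actions: Sequence[Hashable],
) -> Dict[Hashable, float]:
    """Fraction of cases in the local model that took each action."""
    counts = Counter(local_actions)
    total = len(local_actions)
    return {action: n / total for action, n in counts.items()}


def continuous_action_probability(
    local_actions: Sequence[float], suggested: float, tolerance: float
) -> float:
    """Fraction of local cases whose action is within +/- tolerance."""
    hits = sum(abs(a - suggested) <= tolerance for a in local_actions)
    return hits / len(local_actions)
```

For instance, if four of the six cases in a local model have action values within 5 of a suggested value of 250, the empirical action probability is about 67%, matching the form of the example in the text.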

Given that prediction conviction is a method to express how surprising an observation is, we can reverse the math and use conviction to generate a new sample of data for a given amount of surprisal. The general approach is to randomly select or predict a feature of a case from the training data and then resample it based on the new condition. This approach is related to Gibbs sampling [Martino et al., 2015, Efros and Leung, 1999] in that it incrementally obtains new values for each feature conditioned on the previous values, though the conditioning and sampling is based on our approach to kNN.

Using the conditioned local residuals for a part of the model, as described in Section 3.2.1, we can parameterize the random number distribution to generate a new value for a given feature. Our resampling method is related to the approach used by the Mann-Whitney test [Mann and Whitney, 1947], a powerful and widely used nonparametric test to determine whether two sets of samples were drawn from the same distribution. In the Mann-Whitney test, samples are randomly checked against one another to see which is greater, and if both sets of samples were drawn from the same distribution then the expectation is that both sets of samples should have an equal chance of having a higher value when randomly chosen samples are compared against each other. Our approach for resampling a point is to randomly choose whether the new sample is greater or less than the other point and then draw a sample from the distribution using the feature's residual as the expected value. Just as the exponential distribution is entropy maximizing given the sole constraint of a positive mean, the double-sided exponential distribution (also known as the Laplace distribution) is the entropy maximizing distribution given a positive mean distance about a point.
The log-normal and other distributions may be used as well, depending on the types of residuals computed and assumptions made about the local distributions.

If a feature is not continuous but rather nominal, then the local residuals can populate a confusion matrix, and an appropriate sample can be drawn based on the probabilities for drawing a new sample given the previous value.

Suppose we would like to generate synthetic data with features $i \in \Xi$. If there are no conditions placed on the new synthetic data, then we start with a random feature $i$. Because the observations within the model are representative of the observations made so far, a random instance is chosen from the observations using the uniform distribution over all observations. Then the value for feature $i$ of this observation is resampled via the methods mentioned above. The value for feature $i$ then becomes a condition on subsequently generated features.

Next, suppose that we would like to generate feature $j$, given that features $i \in \Xi$ have corresponding values $x_i$. The model labels feature $j$ conditioned by all $x_i$ to find some value $t$. This new value $t$ becomes the expected value for the resampling process described above, and the local residual (or confusion matrix) becomes the appropriate parameter or parameters for the expected deviation to find the value for $x_j$.

The process for filling in the features for an instance may begin with no feature values subject to conditions, or some feature values may have been specified as conditions for the data to generate. Either way, the remaining features may be ordered (i.e., selected for determination of a new value) randomly or may be ordered via a feature conviction value. When a new value is generated, the process restarts with the new value as an additional condition.
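
The resampling step described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function names are hypothetical, the local residual is assumed to be supplied as a mean absolute deviation, and the confusion matrix is assumed to be a nested dictionary whose rows hold the probabilities of each new value given the previous one.

```python
import random
from typing import Dict


def resample_continuous(value: float, residual: float, rng: random.Random) -> float:
    """Draw a new value at a Laplace-distributed distance from `value`.

    A fair coin picks the direction and an exponential draw with mean
    `residual` picks the distance, which together give a Laplace sample.
    """
    if residual <= 0:
        return value
    sign = 1.0 if rng.random() < 0.5 else -1.0
    # expovariate takes a rate, so a mean of `residual` is a rate of 1/residual.
    return value + sign * rng.expovariate(1.0 / residual)


def resample_nominal(
    value: str, confusion: Dict[str, Dict[str, float]], rng: random.Random
) -> str:
    """Draw a new nominal value from the confusion-matrix row for `value`."""
    row = confusion[value]
    return rng.choices(list(row.keys()), weights=list(row.values()), k=1)[0]
```

To generate a full synthetic case, these functions would be called once per feature, with the model's conditioned prediction $t$ passed as `value` and the local residual recomputed after each new condition is added.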
Continuing with the double-sided exponential distribution as a maximum entropy distribution of distance in $L_p$ space, we can derive a closed-form solution for how to scale the exponential distributions based on a prediction conviction value.

Starting with Equation 13, we specify a value, $\nu$, for the prediction conviction as

$$\nu = \pi_p(x) = \frac{\mathbb{E}I}{I(x)} \tag{17}$$

which can be rearranged as

$$I(x) = \frac{\mathbb{E}I}{\nu}. \tag{18}$$

Substituting in the self-information from Equation 12, we have

$$\frac{\phi(x)}{\|r(x)\|_p} = \frac{\mathbb{E}I}{\nu}. \tag{19}$$

Note that the units on both sides of Equation 19 match. This is because the natural logarithm and exponential in the derivation of Equation 19 cancel out, leaving the result in nats. We can rearrange in terms of distance contribution as

$$\phi(x) = \|r(x)\|_p \cdot \frac{\mathbb{E}I}{\nu}. \tag{20}$$

To proceed further we need to make an assumption about the distribution of distance contributions $\phi$. Seeking to minimize the complexity of our assumptions, we simply observe that distances are supported by the positive reals. Constraining the first, or the first and second, moments and maximizing the entropy gives us the exponential and log-normal distributions respectively. For simplicity's sake we proceed with $e(\zeta)$ but note that in practice $\ln \mathcal{N}(\mu, \sigma)$ is often observed. One may distinguish among the distributions using likelihood curvature tools such as Fisher information.

If we let $p = 0$, which is desirable for conviction and other aspects of the similarity measure, we can rewrite the distance contribution in terms of a norm of the values observed for the number of features, $\xi$, each exponentially distributed with parameter $\zeta_i$ and hence with expected value $1/\zeta_i$. Taking the expected value of both sides we find

$$\left(\prod_i \frac{1}{\zeta_i}\right)^{1/\xi} = \left(\prod_i r_i\right)^{1/\xi} \frac{\mathbb{E}I}{\nu}. \tag{21}$$

Due to the number of ways of assigning surprisal across the features, many solutions may exist. However, unless otherwise specified or conditioned, we would want to distribute surprisal across the features holding expected proportionality constant.
This allows us to write the distance contribution, which becomes the mean absolute error for the exponential distribution, as

$$\frac{1}{\zeta_i} = r_i \left(\frac{\mathbb{E}I}{\nu}\right)^{1/\xi} \tag{22}$$

and solving for the $\zeta_i$ to parameterize the exponential distributions, we find

$$\zeta_i = \frac{1}{r_i} \left(\frac{\nu}{\mathbb{E}I}\right)^{1/\xi}. \tag{23}$$

Equation 23, taken with the value of the feature, becomes the distribution by which to generate a new random number under the maximum entropy assumption of exponentially distributed distance from the value.

The ability to randomly generate data with a controlled amount of surprisal is a novel way to characterize the classic exploration versus exploitation trade-off in searching for an optimal solution to a goal. Currently, pairing a means to search, such as Monte Carlo tree search [Abramson, 1987], with a universal function approximator, such as neural networks, is the most successful approach to solving difficult reinforcement learning problems without domain knowledge [Silver et al., 2017]. Because our data synthesis technique comes from the universal function approximator model (kNN) itself, we can create a reinforcement learning architecture that is similar and tightly coupled.

Because the synthetic data generation can be conditioned, we can condition the search on both the current state of the system, as it is currently observed, and a set of goal values for features. As the system is being trained, it can be continuously updated with the new training data. Once states are evaluated for their ultimate outcome, a new set of features or feature values can be updated or added to all of the observations indicating the final scores or measures of outcomes. Keeping track of which observations belong to which training sessions (or games) is a convenient way to track and update this data.
Given that the final score or multiple goal metrics are already in the kNN database, the synthetic data generation can query for new data conditioned upon having a high score or winning conditions, with a specified amount of conviction.

This results in a reinforcement learning algorithm that can be queried for the relevant training data for every decision, as described in Section 3.3.4. The commonality among the similar cases, boundary cases, archetypes, etc. can be combined to find when certain decisions are likely to yield a positive outcome, negative outcome, or a larger amount of surprisal, thus improving the quality of the model. By seeking high-surprisal moves, the system will improve the breadth of its observations and learning, though it may not perform well. Setting the conviction of the data synthesis to 1 yields a balanced trade-off between exploration and exploitation. As more information is learned, this conviction value may be reduced to focus on achieving goals.

The interpretability of reinforcement learning may help overcome many of the data-availability issues. For example, when training data is dangerous, expensive, or otherwise difficult to produce, we can generate synthetic data conditioned in value and conviction to match those difficult circumstances. As such, our method can provide the sampling strategy necessary for reinforcement learning with more control than with current techniques.
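
The relationship between a specified conviction and the surprisal requested of a synthetic case, as given by Equations 17 and 18, can be sketched as follows. The function names are hypothetical, and the expected surprisal is assumed to be a positive value supplied by the model.

```python
def conviction(expected_surprisal: float, surprisal: float) -> float:
    """Equation 17: ratio of expected surprisal to observed surprisal."""
    return expected_surprisal / surprisal


def target_surprisal(expected_surprisal: float, nu: float) -> float:
    """Equation 18: surprisal a synthetic case should carry for conviction nu."""
    return expected_surprisal / nu
```

A conviction of 1 therefore requests a case carrying exactly the expected surprisal.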

Although providing a rigorous review of our results and methods is out of the scope of this paper, we summarize a few here to motivate and encourage additional exploration. We tested classic kNN as implemented in scikit-learn [Pedregosa et al., 2011], standardizing the scale of the features as is common practice in machine learning, against kNN with fractional p values and $\lim p \to 0$ using uncertainty in distance as mentioned in Section 3.1.1 without standardization. We compared the results across a robust suite of 97 regression datasets and 78 classification datasets selected from among the benchmark data published by Olson et al. [2017].

On the classification datasets, scikit-learn's random forest implementation averaged an accuracy of 0.79. kNN already performs well on classification problems, averaging an accuracy of 0.76. However, using fractional p values, we saw the accuracy increase to 0.77, and allowing the p value of 0 based on uncertainty we saw the average accuracy improve to 0.78. Though the accuracy improvements of our techniques are slight, the use of low or zero p values means that we can maintain the data directly without scaling and that we have made a step toward accurate probability-based reasoning on data using conjunctions as described in Section 3.1.1.

On the regression datasets, scikit-learn's random forest implementation averaged an r-squared score of 0.77. In many situations, such as those involving extrapolation, kNN regression does not perform as well as other methods, and we saw this with the scikit-learn implementation resulting in an r-squared score of 0.53. However, our improvements yielded considerable gain in kNN's regression scores. Using fractional p values yielded an r-squared score of 0.57, which is a significant (p ≪ .001) improvement based on the Wilcoxon signed rank test.
Further allowing the use of a zero p value with uncertainty in distance as mentioned in Section 3.1.1, the r-squared score improved to 0.66, which is also a significant result (p ≪ .001).

We believe that kNN's regression scores can also be improved to be competitive with other cutting-edge algorithms. In Section 3.3.3, we showed how a targetless approach to data can fill in missing data, and from an auditability perspective it is easy to track the history of data imputation. Conversely, we are investigating how exputation approaches can be employed to synthesize likely data points outside the bounds of the training data. Knowledge of the features, such as their bounds, can help when reflecting or amplifying or synthesizing exputed data points, which can then be used for interpolation.

The reason for our belief that this is a core problem lies with the topology of data as the dimensionality grows. As the number of dimensions increases for a given set of data, many intuitive analytical techniques such as Euclidean norms and Gaussian kernels become inappropriate as the unit radius hypervolume goes toward zero and the probability of data points falling in sharp corners of a hypervolume goes toward one [Verleysen and François, 2005]. This implies that nearly all data points will be at or beyond the periphery, requiring extrapolation. Dealing with any kind of cost or value function to perform optimization will mean that nearly all points are Pareto optimal, meaning that it becomes increasingly more difficult to define “good” because nearly every point has some unique quality. In many cases, the larger number of dimensions can be helpful, but primarily when the structure is extracted and the dimensionality is reduced [Kittler, 1986, Kohavi and John, 1997, Stoppiglia and Dreyfus, 2003].
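
The claim that the unit-radius hypervolume goes toward zero as dimensionality grows can be checked numerically with the standard formula $V_n = \pi^{n/2} / \Gamma(n/2 + 1)$. The sketch below is illustrative only; the function names are hypothetical, and the second function assumes data spread uniformly over the enclosing cube $[-1, 1]^n$.

```python
import math


def unit_ball_volume(n: int) -> float:
    """Volume of the unit-radius ball in n dimensions."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)


def inscribed_fraction(n: int) -> float:
    """Fraction of the cube [-1, 1]^n that lies inside the unit ball."""
    return unit_ball_volume(n) / 2 ** n
```

Under the uniform assumption, one minus `inscribed_fraction(n)` is the probability that a point falls in the corners of the cube outside the ball, and it approaches one quickly as `n` grows.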
As nearly all new observations will be on the periphery, we believe extrapolation techniques, such as exputation, are likely to improve results while maintaining interpretability.

A dimensionality bottleneck is little different than the information bottlenecks used for generalization and variance reduction across other areas of machine learning [Tishby and Zaslavsky, 2015]. By employing clustering techniques in conjunction with model reduction techniques as mentioned in Section 3.3.1, selecting prototypes or archetypes to represent a cluster in a hierarchical fashion, we may be able to characterize the entropy flux between parts of the model, and hierarchical models and hierarchical explanations are natural consequences. We note the striking commonality of zero p value Lebesgue space as depicted in Appendix A, the conjunction of probabilities of independent distributions, and the core of the no-flattening theorem by Lin et al. [2017] that relates hierarchical architectures for neural networks and performance.

Our future work will include further efforts to use probability throughout all parts of kNN such that any form of entropy or probability can be calculated, and assumptions can be clearly interpreted.

Additional future work will be to characterize our work in the performance and scalability of targetless kNN queries with fractional and zero p values, which is outside the scope of this paper.

Maximizing the interpretability of artificial intelligence leads either to understanding the generalized relationships of the data, such as with symbolic or tree-based models, or to understanding the data itself. With the improved performance of computing and the advances in kNN, we conclude that using kNN provides a promising foundation for the future of interpretable artificial intelligence and machine learning.

References

B. D. Abramson. The Expected-outcome Model of Two-player Games. PhD thesis, Columbia University, New York, NY, USA, 1987.

P. K. Agarwal, B. Aronov, S. Har-Peled, J. M. Phillips, K. Yi, and W. Zhang. Nearest-neighbor searching under uncertainty II. ACM Transactions on Algorithms (TALG), 13(1):3, 2016.

C. C. Aggarwal, A. Hinneburg, and D. A. Keim. On the surprising behavior of distance metrics in high dimensional space. In International Conference on Database Theory, pages 420–434. Springer, 2001.

H. Akaike. Information theory and an extension of the maximum likelihood principle. Proceedings of the 2nd International Symposium on Information Theory, pages 267–281, 1973.

E. Alpaydin. Voting over multiple condensed nearest neighbor. Artificial Intelligence Review, pages 115–132, 1997.

E. Alpaydin. Machine learning: The new AI. pages 1013–1022, 2016.

N. Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.

K. J. Archer and R. V. Kimes. Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis, 52(4):2249–2260, 2008.

K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is “nearest neighbor” meaningful? In International Conference on Database Theory, pages 217–235. Springer, 1999.

L. Breiman, J. Friedman, C. Stone, and R. Olshen. Classification and regression trees. 1984.

J. R. Cano, F. Herrera, and M. Lozano. Evolutionary stratified training set selection for extracting classification rules with tradeoff precision-interpretability. Data and Knowledge Engineering, 60:90–108, 2006.

D. Coomans and D. Massart. Alternative k-nearest neighbour rules in supervised pattern recognition: Part 1. k-nearest neighbour classification by using alternative voting rules. Analytica Chimica Acta, 136:15–27, 1982.

A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV, page 1033. IEEE, 1999.

J. F. Gemmeke and B. Cranen. Using sparse representations for missing data imputation in noise robust speech recognition. In Signal Processing Conference, 2008 16th European, pages 1–5. IEEE, 2008.

I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. pages 1013–1022, 2016.

Google LLC. What-if tool. https://pair-code.github.io/what-if-tool/, 2018.

L.-A. Gottlieb, A. Kontorovich, and P. Nisnevitch. Near-optimal sample compression for nearest neighbors. In Advances in Neural Information Processing Systems, pages 370–378, 2014.

E. C. Harrington. The desirability function. Industrial Quality Control, 21(10):494–498, 1965.

T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning. page 63, 2001.

A. Hinneburg, C. C. Aggarwal, and D. A. Keim. What is the nearest neighbor in high dimensional spaces? In , pages 506–515, 2000.

I. Hmeidi, B. B Hawashin, and E. El-Qawasmeh. Performance of KNN and SVM classifiers on full word Arabic articles. Advanced Engineering Informatics, 22(1):106–111, 2008.

M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Can shared-neighbor distances defeat the curse of dimensionality? In International Conference on Scientific and Statistical Database Management, pages 482–500. Springer, 2010.

P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proc. of the 30th ACM Sym. on Theory of Computing, pages 604–613, 1998.

J. Kittler. Feature selection and extraction. pages 115–132, 1986.

R. Kohavi and G. John. Wrappers for feature subset selection. Artificial Intelligence, 1-2:273–323, 1997.

A. Kontorovich, S. Sabato, and R. Weiss. Nearest-neighbor sample compression: Efficiency, consistency, infinite dimensions. In Advances in Neural Information Processing Systems, pages 1573–1583, 2017.

T. Leinster and M. W. Meckes. Maximizing diversity in biology and beyond. Entropy, 18(3):88, 2016.

H. W. Lin, M. Tegmark, and D. Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6):1223–1247, 2017.

S. Lukaszyk. Probability metric, examples of approximation applications in experimental mechanics. PhD thesis, Cracow University of Technology, 2003.

S. Lukaszyk. A new concept of probability metric and its applications in approximation of scattered data sets. Computational Mechanics, 33(4):299–304, 2004.

H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, pages 50–60, 1947.

L. Martino, H. Yang, D. Luengo, J. Kanniainen, and J. Corander. A fast universal self-tuned sampler within Gibbs sampling. Digital Signal Processing, 47:68–83, 2015.

M. Mohri, R. Afshin, and T. Ameet. Foundations of machine learning. pages 1013–1022, 2012.

R. S. Olson, W. La Cava, P. Orzechowski, R. J. Urbanowicz, and J. H. Moore. PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining, 10(1):36, 2017.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

N. Poerner, H. Schütze, and B. Roth. Evaluating neural network explanation methods using hybrid documents and morphosyntactic agreement. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 340–350, 2018.

J. Raikwal and K. Saxena. Performance evaluation of SVM and k-nearest neighbor algorithm over medical data set. International Journal of Computer Applications, 50(14):975–985, 2012.

M. Rao, Y. Chen, B. C. Vemuri, and F. Wang. Cumulative residual entropy: a new measure of information. IEEE Transactions on Information Theory, 50(6):1220–1228, 2004.

M. T. Ribeiro, S. Singh, and C. Guestrin. “Why should I trust you”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1135–1144, 2016.

M. Schuh, W. T., and R. Angryk. Mitigating the curse of dimensionality for exact kNN retrieval. Proceedings of the Twenty-Seventh International Florida Artificial Intelligence Research Society Conference, pages 363–368, 2014.

M. A. Schuh, T. Wylie, and R. A. Angryk. Improving the performance of high-dimensional kNN retrieval through localized dataspace segmentation and hybrid indexing. In Proc. of the 17th ADBIS Conf., pages 344–357, 2013.

G. Schwarz et al. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

D. M. Skapura. Building neural networks. page 63, 1996.

H. Stoppiglia and G. Dreyfus. Ranking a random feature for variable and feature selection. Journal of Machine Learning Research, Special Issue on Variable/Feature Selection, 1-2, 2003.

Q. Tan, G. Yu, C. Domeniconi, J. Wang, and Z. Zhang. Incomplete multi-view weak-label learning. In IJCAI, pages 2703–2709, 2018.

Y. Tao, K. Yi, C. Sheng, and P. Kalnis. Quality and efficiency in high dimensional nearest neighbor search. In Proc. of the ACM SIGMOD Inter. Conf. on Mgmt. of Data, pages 563–576, 2009.

N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. In Information Theory Workshop (ITW), 2015 IEEE, pages 1–5. IEEE, 2015.

N. Tomašev and D. Mladenić. Hubness-aware shared neighbor distances for high-dimensional k-nearest neighbor classification. Knowledge and Information Systems, 39(1):89–122, 2014.

I. Triguero, S. García, and F. Herrera. Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study.

Knowledge and In-formation systems , 42(2):245–284, 2015.H. Tuomisto. A consistent terminology for quantify-ing species diversity? yes, it does exist.

Oecologia ,164(4):853–860, 2010. V. V. B. Surya, H. Prasath, A. Arafat, O. Lasass-meh, and A. Hassanat. Distance and similaritymeasures eﬀect on the performance of k-nearestneighbor classiﬁer. arXiv:1708.04321 , 2017.M. Verleysen and D. Fran¸cois. The curse of dimen-sionality in data mining and time series predic-tion. In

International Work-Conference on Arti-ﬁcial Neural Networks , pages 758–770. Springer,2005.S. Wachter, B. Mittelstadt, and C. Russell. Coun-terfactual explanations without opening the blackbox: Automated decisions and the gdpr.

HarvardJournal of Law and Technolog , 31(2), 2018.F. Wang and C. Rudin. Falling rule lists. In

ArtiﬁcialIntelligence and Statistics , pages 1013–1022, 2015.F. Zhao and Y. Guo. Semi-supervised multi-labellearning with incomplete labels. In

IJCAI , pages4062–4068, 2015.

A Geometric Mean Derivation

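The geometric mean can be derived from the generalized mean as

$$\lim_{p \to 0} \left( \sum_{i=1}^{n} w_i x_i^p \right)^{1/p} = \lim_{p \to 0} \exp\left( \ln \left( \sum_{i=1}^{n} w_i x_i^p \right)^{1/p} \right) = \lim_{p \to 0} \exp\left( \frac{\ln \left( \sum_{i=1}^{n} w_i x_i^p \right)}{p} \right).$$

Then using L'Hôpital's rule and the chain rule on the inner part of this equation, we can simplify. The rule applies because, for weights normalized so that $\sum_{i=1}^{n} w_i = 1$, both the numerator and the denominator vanish at $p = 0$. This gives

$$\lim_{p \to 0} \frac{\ln \left( \sum_{i=1}^{n} w_i x_i^p \right)}{p} = \lim_{p \to 0} \frac{\; \dfrac{\sum_{i=1}^{n} w_i x_i^p \ln x_i}{\sum_{i=1}^{n} w_i x_i^p} \;}{1} = \lim_{p \to 0} \frac{\sum_{i=1}^{n} w_i x_i^p \ln x_i}{\sum_{i=1}^{n} w_i x_i^p} = \frac{\sum_{i=1}^{n} w_i \ln x_i}{\sum_{i=1}^{n} w_i} = \frac{\ln \left( \prod_{i=1}^{n} x_i^{w_i} \right)}{\sum_{i=1}^{n} w_i}.$$

Therefore substituting back in the previous result yields

$$\lim_{p \to 0} \left( \sum_{i=1}^{n} w_i x_i^p \right)^{1/p} = \lim_{p \to 0} \exp\left( \frac{\ln \left( \prod_{i=1}^{n} x_i^{w_i} \right)}{\sum_{i=1}^{n} w_i} \right) = \left( e^{\ln \left( \prod_{i=1}^{n} x_i^{w_i} \right)} \right)^{1 / \sum_{i=1}^{n} w_i} = \left( \prod_{i=1}^{n} x_i^{w_i} \right)^{1 / \sum_{i=1}^{n} w_i}.$$

Setting all $w_i = \frac{1}{n}$ yields

$$\lim_{p \to 0} \left( \sum_{i=1}^{n} \frac{1}{n} x_i^p \right)^{1/p} = \left( \prod_{i=1}^{n} x_i \right)^{1/n}.$$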
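The limit derived in this appendix can be checked numerically. The following is a minimal Python sketch, not code from the paper: the function names `generalized_mean` and `weighted_geometric_mean` and the sample values are illustrative assumptions. It evaluates the weighted power mean for shrinking values of p, using uniform weights that sum to 1, and compares the result with the closed-form geometric mean.

```python
import math


def generalized_mean(xs, ws, p):
    """Weighted power mean (sum_i w_i * x_i**p) ** (1/p), defined for p != 0."""
    return sum(w * x ** p for x, w in zip(xs, ws)) ** (1.0 / p)


def weighted_geometric_mean(xs, ws):
    """Closed form of the limit: (prod_i x_i**w_i) ** (1 / sum_i w_i).

    Computed in log space to avoid overflow in the product.
    """
    log_sum = sum(w * math.log(x) for x, w in zip(xs, ws))
    return math.exp(log_sum / sum(ws))


# Illustrative positive values (hypothetical data, not from the paper).
xs = [1.0, 4.0, 16.0]
# Uniform weights w_i = 1/n, which sum to 1.
ws = [1.0 / len(xs)] * len(xs)

target = weighted_geometric_mean(xs, ws)

# As p shrinks toward 0, the power mean should approach the geometric mean.
for p in (1.0, 0.1, 0.001, 1e-6):
    approx = generalized_mean(xs, ws, p)
    print(f"p={p:g}  power mean={approx:.8f}  gap={abs(approx - target):.2e}")
```

The gap shrinks roughly in proportion to p, which matches the first-order behaviour expected of the limit. Very small values of p are best avoided in floating point, because the 1/p exponent amplifies rounding error in the inner sum.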