Natively Interpretable Machine Learning and Artificial Intelligence: Preliminary Results and Future Directions
Christopher J. Hazard, Christopher Fusting, Michael Resnick, Michael Auerbach, Michael Meehan, Valeri Korobov
arXiv [cs.LG]
Christopher J. Hazard∗, Christopher Fusting∗, Michael Resnick∗, Michael Auerbach∗, Michael Meehan∗, Valeri Korobov∗
January 21, 2019
Machine learning models have become more and more complex in order to better approximate complex functions. Although fruitful in many domains, the added complexity has come at the cost of model interpretability. The once popular k-nearest neighbors (kNN) approach, which finds and uses the most similar data for reasoning, has received much less attention in recent decades due to numerous problems when compared to other techniques. We show that many of these historical problems with kNN can be overcome, and our contribution has applications not only in machine learning but also in online learning, data synthesis, anomaly detection, model compression, and reinforcement learning, without sacrificing interpretability. We introduce a synthesis between kNN and information theory that we hope will provide a clear path towards models that are innately interpretable and auditable. Through this work we hope to gather interest in combining kNN with information theory as a promising path to fully auditable machine learning and artificial intelligence.
As machine learning has matured, the need to understand, interpret, and explain models has become increasingly important [Alpaydin, 2016, Mohri et al., 2012, Goodfellow et al., 2016]. Machine learning models are interpreted in a variety of ways, including exploring the internals of a model [Skapura, 1996, Poerner et al., 2018], creating ex post rationalizations [Ribeiro et al., 2016, Google LLC, 2018], or using models that are interpretable from the beginning and maximizing their accuracy [Wang and Rudin, 2015]. There is a perception, and some supporting evidence, that a trade-off exists between accuracy and interpretability [Cano et al., 2006].

∗ Diveplane Corporation. If you are interested in using our technology, please contact [email protected]. The authors would like to thank the investors, employees, and supporters of Diveplane Corporation for making this work possible.

The motivating philosophy behind our work is that models should be innately interpretable. Specifically, our motivations are:
• decisions should be directly traceable to the training data that caused the decision to be made;
• the regions of the model should be easily characterized directly from the training data; and
• assumptions are minimal.

To achieve the aforementioned goals we combine k-Nearest Neighbors (kNN) with the principle of maximum entropy to create models that are easy to understand, make minimal assumptions, and are nonparametric.

k-Nearest Neighbors is one of the oldest, simplest, and most accurate algorithms for pattern classification and regression models [Hastie et al., 2001]. It is a simple technique that is easily implementable [Alpaydin, 1997]. The accuracy of kNN-based classification, prediction, and recommendation depends solely on a data model. Outputs from the model are usually traceable back to the exact data that influenced each decision.
This traceability enables detailed analysis of the decision inputs and characterization of the data local to the decision.

k-Nearest Neighbors was previously a dominant machine learning technology [Coomans and Massart, 1982, Breiman et al., 1984, Altman, 1992, Alpaydin, 1997] but was largely abandoned with the growing size of data and the computational complexity of finding the nearest k points [Raikwal and Saxena, 2012, Schuh et al., 2014, Hmeidi et al., 2008]. Many optimizations have been proposed over the years; they generally seek to reduce the number of distances actually computed [Pedregosa et al., 2011]. The optimizations include linear scan, Kd-trees, ball-trees, etc. [Pedregosa et al., 2011]. The curse of dimensionality has also been known to adversely affect kNN [Hastie et al., 2001, Indyk and Motwani, 1998, Schuh et al., 2013, Tao et al., 2009] and the selection of a distance function can be challenging [V. B. Surya et al., 2017]. Additionally, features may have to be scaled or standardized to prevent distance measures from being dominated by one of the features. The accuracy of kNN can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their relevance. Finally, kNN requires a value for the parameter k. If k is too small, the model may have low bias but be sensitive to noisy points and have too high a variance. If k is too large, the neighborhood may include points from other classes and the model may have too little variance.

Copyright 2018-2019 Diveplane Corporation.

Our contributions in this paper are several. First, we bring numerous well-studied techniques together to improve the efficacy of kNN. Second, we connect kNN with information theory and describe numerous ways this can be applied to machine learning. Third, we illustrate how the first two contributions open the door to interpretable reinforcement learning.
Throughout this paper we discuss targetless models (models in which we are interested in predicting all the features using all the others) and their utility in understanding the data, and we introduce an imputation method that naturally arises from such models. The purpose of this paper is to offer a glimpse of what the combination of kNN and information theory can offer in advancing the state of the art of machine learning and artificial intelligence.

We introduce the term targetless learning to describe our approach to kNN. Instead of the traditional approach of building a model that learns the mapping from a set of input feature variables to a set of target variables, or building multiple models to learn multiple mappings, our models consist of the relevant training data stored in a data structure that can be quickly queried. We may wish to predict and characterize any set of variables given any other set of variables. This flexibility is generally not emphasized in the related literature outside of a subset of work on semi-supervised learning and imputation [Tan et al., 2018, Zhao and Guo, 2015], and so we define two more terms to help us describe inputs and outputs in targetless learning.
Context features are the feature variables being used as inputs for a particular query.
Action features are the feature variables that are being labeled, acted upon, or predicted; in traditional targeted machine learning, these are usually referred to as target variables, labels, or responses. These terms further reflect the origin of this approach and its potential for online learning applications.
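The idea of querying any set of action features from any set of context features can be illustrated with a minimal sketch. The function name `targetless_query`, the toy `CASES` data, and the plain Euclidean distance are all assumptions made for illustration; they are not the authors' implementation, and the distance used in this paper is developed in the sections that follow.

```python
import math

def targetless_query(cases, context, action_features, k=3):
    """Predict action features from context features using the k nearest stored cases.

    cases: list of dicts mapping feature name -> numeric value (the model is the data).
    context: dict of feature name -> value used as inputs for this query.
    action_features: list of feature names to predict.
    """
    def distance(case):
        # Placeholder Euclidean distance over the context features only (an assumption).
        return math.sqrt(sum((case[f] - v) ** 2 for f, v in context.items()))

    nearest = sorted(cases, key=distance)[:k]
    # Average the action feature values of the nearest cases.
    prediction = {f: sum(c[f] for c in nearest) / len(nearest) for f in action_features}
    return prediction, nearest

# Hypothetical toy data: any feature can serve as context or as action.
CASES = [
    {"temp": 1.0, "load": 10.0},
    {"temp": 2.0, "load": 20.0},
    {"temp": 3.0, "load": 30.0},
    {"temp": 10.0, "load": 100.0},
]

# Same stored data, two different mappings, no retraining.
load_given_temp, used_cases = targetless_query(CASES, {"temp": 2.0}, ["load"], k=3)
temp_given_load, _ = targetless_query(CASES, {"load": 100.0}, ["temp"], k=1)
```

Because the returned `used_cases` are the stored cases themselves, each prediction remains traceable to the data that produced it.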
When determining the value of an unknown action feature, the action features of the k most similar points are averaged, or their values voted upon, to determine the most likely value. In general, similarity is determined by a distance metric. Unfortunately, as the number of dimensions increases it becomes difficult to distinguish points [Hinneburg et al., 2000, Beyer et al., 1999]. One proposed solution to this problem is to use the number of shared nearest neighbors as a similarity measure [Houle et al., 2010]. There is also evidence that fractional norms heading towards zero enable points to be distinguished more easily in high dimensional space [Aggarwal et al., 2001]. Fractional norms are represented as

\[ \|x\|_p = \left( \sum_{i \in \Xi} w_i x_i^p \right)^{1/p}, \tag{1} \]

where $p$ is the parameter for the Lebesgue space, $\Xi$ is the feature set, and $w_i$ is the weight for each feature, often where $w_i = 1/n$.

Motivated by this result we derived the Minkowski distance as $p \to 0$:

\[ \lim_{p \to 0} d_p(x, y) = \left( \prod_{i \in \Xi} |x_i - y_i| \right)^{1/|\Xi|}. \tag{2} \]

When the feature weights $w_i$ satisfy $\sum_{i \in \Xi} w_i = 1$ we have

\[ \lim_{p \to 0} d_p(x, y) = \prod_{i \in \Xi} |x_i - y_i|^{w_i}. \tag{3} \]

Note that Equations 2 and 3 are geometric means and have the useful property of being scale invariant. Scale invariance means that scaling a feature by any factor will not affect the ordering of proximity, as the result is the same as multiplying all of the distances by a constant. Thus using $p = 0$ enables the data to be stored in its original form, not scaled, standardized, or normalized, which improves the transparency of the model and removes the need for that aspect of feature engineering. As $p \to 0$, the scale of features matters less, meaning that the Minkowski distance is approximately scale invariant with regard to $p$ values that are relatively close to 0.

3.1.1 A Probabilistic Approach

We are unaware of any prior work that has investigated using $p = 0$ as a distance function. This is unsurprising, as $p = 0$ causes significant problems for any data set that has categorical data or other data that may be exactly equal. Consider a data set that has two features. If two points have equal values for the first feature, then the distance between the points will be zero regardless of the difference between the values of the other feature, due to the multiplication by zero.

Instead, consider distance probabilistically, in the sense that each feature "distance" is the probability that the two values are different given the observations or measurements of the feature values (instead of being the absolute value of their difference) as a way of handling uncertainty [Agarwal et al., 2016]. If we assume that each observation is independent, then simply multiplying the probabilities that each feature value is the same yields the same distance measure as Equation 2 when determining the conjunction of the probabilities. Solving the problems that come from exact matches using $p = 0$ means we can work directly with probabilities or deal with features on wildly different scales without having to standardize or otherwise scale them. Using the geometric mean to combine measurements of achieving different goals has been shown to be an effective objective function for multicriteria optimization [Harrington, 1965], and so using it to provide contrast between different similarities is a natural use.

Suppose we have made two observations of a value, $x$ and $y$ respectively, and we would like to know the "distance" between them.
The obvious distance of $|x - y|$ yields the maximum likelihood value of the distance, but does not yield the expected value of the distance. Consider that there is considerable deviation among observations of the same value, meaning that there is likely to be a relatively large expected difference between observations of the same value. We use the term deviation to encompass both the error, which pertains to the difference between actual and measurement, and the residual, which pertains to the difference between actual and estimated. This generic use of deviation applies regardless of whether the observation is measured, predicted, or inferred, and regardless of whether the deviation is due to randomness or lack of additional information that would reduce the deviation.

Footnote: Using $p = 0$ does not yield a metric, and arguably not a distance function, as it fails the triangle inequality. Further, it is not technically $p = 0$ but rather $\lim_{p \to 0}$, but we use this abuse of notation for simplicity.

Footnote: We note that this realization was inspired by early blog post drafts of work done by Leinster and Meckes [2016] in that the generalized diversity index, which can be parameterized to measure the Shannon entropy, is nothing more than the reciprocal of the generalized mean when substituting $p - 1$ for $p$ when dealing with probabilities [Tuomisto, 2010], and the Minkowski distance is just the generalized mean of the differences.

Consider two observations, $x$ and $y$, with considerable deviation. If $x = 100 \pm 10$ and $y = 100 \pm 10$, then the distance between $x$ and $y$ is likely to be greater than 0 even though the expected values are the same, yet a simple subtraction yields 0. Further, consider that x and y are feature vectors of length two of x = { . , } and y = { . , } . If we have a third observation that is z = { . , . } , using $p = 0$ for measuring the similarity between $z$ and both $x$ and $y$ will yield $x$ as infinitely closer than $y$ because the difference between the first terms is zero and the multiplication makes the distance zero. Though this may sometimes be desirable, larger deviations for the first feature and smaller deviations for the second feature should yield $y$ as more similar to $z$ than $x$.

To solve the problem of zero expected distance for identical measurements despite deviation, and to address the high sensitivity of $L_p$ to close or exact matches with a low $p$ value, we employ the Łukaszyk-Karmowski metric [Łukaszyk, 2003, 2004]. Given a probability density function of $x$, $f$, and a probability density function of $y$, $g$, the expected difference between them becomes

\[ d(x, y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} |x - y| \, f(x) \, g(y) \, dx \, dy. \tag{4} \]

We assume that if both points are near enough to be worth determining the distance between them, then the distributions and parameters for the probability density functions should represent the local data. The two simple maximum entropy distributions on $(-\infty, \infty)$ given a point and a distance around the point are the Laplace distribution (double exponential) and the Gaussian distribution, depending on whether the distance is represented as mean absolute error or standard deviation respectively. The Gaussian or normal distribution has a clean closed form solution. Letting $\mu_{xy} \equiv |x - y|$, the expected distance for two normal distributions with identical standard deviation $\sigma$ becomes

\[ d_{NN}(x, y) = \mu_{xy} + \frac{2\sigma}{\sqrt{\pi}} \exp\left( -\frac{\mu_{xy}^2}{4\sigma^2} \right) - \mu_{xy} \, \mathrm{erfc}\left( \frac{\mu_{xy}}{2\sigma} \right). \tag{5} \]

For the previous example of $x = 100 \pm 10$ and $y = 100 \pm 10$, the expected distance is approximately 11.28.

We denote the residual, $r$, for predicting each feature, $i$, as $r_i$. We have found that using the residuals in the kNN system with the Łukaszyk-Karmowski metric, calculating new residuals, and then feeding these back in, generally yields convergence of the residual values, with notable convergence after only 3 or 4 iterations.

Measuring a distance value for each feature further enables parameterization regarding the type of data a feature holds. For example, nominal data can result in a distance of 1 if the values are not equal and 0 if they are equal. Thus, one-hot encoding, the expansion of nominal values into multiple features, is not needed. Ordinal data can use a distance of 1 between each ordinal type. Cyclical data can perform appropriate subtractions while keeping the feature in one dimension and directly understandable, rather than having to split the feature into two using trigonometry.

We now quantify the amount of information in a kNN model. Because our formulation of kNN uses a similarity measure based on distance, we first quantify each point $x$ by the amount of distance it contributes to the k nearest points. In general, we define conviction as a normalized measure of how much surprisal one would expect for a given situation relative to the surprisal observed. If we have some form of prior distribution of data given all of the information observed up to that point, the surprisal is the amount of information gained when we observe a new sample, event, case, or state change and update the prior distribution to form a new posterior distribution after the event. The surprisal of an event of observing a random variable $x \sim X$ is defined as $I(x) = -\ln p(x)$. Thus, the conviction, $\pi$, can be expressed as

\[ \pi(x) = \frac{E[I(X)]}{I(x)}. \tag{6} \]

This ratio results in conviction values $\pi \in [0, \infty)$, where
• $\pi = 0$ means this point has an infinite amount of surprisal, that is, the point was previously thought to be impossible to exist within the dataset;
• $\pi = 1$ means this point has an average amount of surprisal, that is, it adds an average amount of information to the model; and
• $\pi = \infty$ means this point is not at all surprising, that is, it is so redundant that the point could be discarded without affecting the model at all.

This ratio is indicative of how much information is required to encode one aspect of the model relative to another, whether dealing with cases or features. In some cases conviction can act as a proxy for several matters, such as how confident we are about our data, whether the data is correct or anomalous, whether the data belongs together, or whether the data is useful in making predictions. In other cases, conviction can inform how the model will be harmed if data is removed, and additionally can be used to control the surprisal when performing data synthesis.

Conviction can be computed in a targeted or targetless manner. In a targeted manner, each feature or case is compared against another set of target cases or features one on one. In an untargeted manner, each case or feature is held out one by one and compared against the rest of the data in the model. When holding one out, the change in probability impact on other elements of the model indicates a measure of hubness or centrality of the data which, when isolated, has been found to be of significant importance for determining the influence of data on the model [Tomašev and Mladenić, 2014].

If the probability space over which conviction is normalized is broadened (or if surprisal is used without normalization), then even the model impact of combinations of features can be compared to that of combinations of observations.

In the following sub-sections, we discuss different forms of conviction that can be derived.
More forms of surprisal and conviction can be conditioned and computed, opening up a rich area for different kinds of informativeness about various aspects of the model.

We note that conviction is related to, but not exactly the same as, feature importance or case influence, and so care must be taken when comparing the two.

We define prediction conviction as the amount of surprisal required to predict a value given a model and the model's uncertainty. To characterize the model's uncertainty, we use residuals.
Definition 1
Let $\xi$ be the number of features in a model and $n$ the number of observations. We define the residual function, $r : X \to \mathbb{R}^{\xi}$, on the training data $X$ as

\[ r(x) = \left( J_1^{\Omega}(x), J_2^{\Omega}(x), \ldots, J_{\xi}^{\Omega}(x) \right), \tag{7} \]

where $J_i^{\Omega}$ is the residual of the model on feature $i$ at point $x$, parameterized by a set of hyperparameters $\Omega$. We will refer to the residual function evaluated on all of the model data as $r_M$.

Typically, the feature residuals will be calculated as mean absolute error or standard deviation. Further, subsets of features may be used to compute the residual, particularly when performing targeted operations.
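As a rough illustration of how per-feature residuals and the expected distance of Equation 5 might interact, the sketch below computes leave-one-out mean absolute residuals for each feature, using the geometric mean of Equation 2 over per-feature expected distances. The function names, the toy `DATA`, the choice of k, the starting deviations, and the use of the mean absolute residual as the Gaussian deviation are all assumptions for illustration; this is not the authors' implementation, and convergence of the loop is not guaranteed by this sketch.

```python
import math

def expected_distance(x, y, sigma):
    """Equation 5: expected |X - Y| for two Gaussians sharing standard deviation sigma > 0."""
    mu = abs(x - y)
    return (mu
            + 2.0 * sigma / math.sqrt(math.pi) * math.exp(-mu * mu / (4.0 * sigma * sigma))
            - mu * math.erfc(mu / (2.0 * sigma)))

def geometric_distance(a, b, sigmas):
    """Equation 2 applied to per-feature expected distances (geometric mean)."""
    logs = [math.log(expected_distance(ai, bi, s)) for ai, bi, s in zip(a, b, sigmas)]
    return math.exp(sum(logs) / len(logs))

def loo_residuals(data, sigmas, k=3):
    """Leave-one-out mean absolute residual of each feature predicted from the others."""
    n_features = len(data[0])
    residuals = []
    for f in range(n_features):
        ctx = [j for j in range(n_features) if j != f]
        ctx_sigmas = [sigmas[j] for j in ctx]
        total = 0.0
        for i, row in enumerate(data):
            a = [row[j] for j in ctx]
            others = [r for m, r in enumerate(data) if m != i]
            others.sort(key=lambda r: geometric_distance(a, [r[j] for j in ctx], ctx_sigmas))
            nearest = others[:k]
            prediction = sum(r[f] for r in nearest) / len(nearest)
            total += abs(row[f] - prediction)
        residuals.append(total / len(data))
    return residuals

# Hypothetical toy data with two roughly linearly related features.
DATA = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.8]]

# Feed residuals back in as deviations for a few iterations (starting values are arbitrary).
sigmas = [1.0, 1.0]
for _ in range(4):
    sigmas = [max(r, 1e-9) for r in loo_residuals(DATA, sigmas, k=2)]
```

Note that `expected_distance` is strictly positive whenever `sigma` is positive, so identical feature values no longer collapse the geometric mean to zero.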
Definition 2
Given a point $x \in X$ and the set $K$ of its $k$ nearest neighbors, a distance function $d : \mathbb{R}^z \times Z \to \mathbb{R}$, and a distance exponent $\alpha$, the distance contribution of $x$ is the harmonic mean

\[ \phi(x) = \left( \frac{1}{|K|} \sum_{k \in K} \frac{1}{d(x, k)^{\alpha}} \right)^{-1}. \tag{8} \]

The distance contribution reflects how much "distance" a point contributes to a graph connecting the nearest neighbors, which is the inverse of the density of points over a unit of distance in the Lebesgue space. The harmonic mean of the distance contribution reflects the inverse of the inverse distance weighting often employed with kNN, though other techniques may be substituted if inverse distance weighting is not employed.

We can quantify the information needed to express a distance contribution $\phi(x)$ by transforming it into a probability. We begin by selecting the exponential distribution to describe the distribution of residuals, as it is the maximum entropy distribution constrained by the first moment. We represent this in typical nomenclature for the exponential distribution using the norm from Equation 1 as

\[ \frac{1}{\lambda} = \|r(x)\|_p. \tag{9} \]

Footnote: It is possible to compare the results of estimating values for a feature, and then compute a conviction ratio comparing the mutual information among a set of estimated values as an information theoretic representation of mean decrease in accuracy or Shapley value. This would be an information theoretic result that is much closer to feature importance.

We can directly compare the distance contribution and the p-normed magnitude of the residual. This is because the distance contribution and the norm of the residual are both on the same scale, with the distance contribution being the expected distance of new information that the point adds to the model, and the norm of the residual being the expected distance of deviation.
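The distance contribution of Equation 8 is small enough to sketch directly. The function name `distance_contribution` and the handling of zero distances (returning the limiting value 0) are assumptions for illustration; the distances to the k nearest neighbors are taken as given.

```python
def distance_contribution(distances, alpha=1.0):
    """Equation 8: harmonic mean of d(x, k)^alpha over the k nearest neighbors.

    distances: distances from the point to each of its k nearest neighbors.
    """
    if any(d == 0 for d in distances):
        # Limiting value of the harmonic mean when a neighbor coincides with the point.
        return 0.0
    inverse_sum = sum(1.0 / (d ** alpha) for d in distances)
    return len(distances) / inverse_sum

# A point whose neighbors are at distances 1 and 3 contributes 2 / (1 + 1/3) = 1.5.
example = distance_contribution([1.0, 3.0])
```

The harmonic mean is dominated by the closest neighbors, which mirrors inverse distance weighting: a point sitting next to an existing point contributes little distance regardless of how far its other neighbors are.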
Given the entropy maximizing assumption of the exponential distribution of the distances, we can then determine the probability that a distance contribution is greater than or equal to the magnitude of the residual $\|r(x)\|_p$ in the form of cumulative residual entropy [Rao et al., 2004] as

\[ P(\phi(x) \ge \|r(x)\|_p) = e^{-\frac{1}{\|r(x)\|_p} \cdot \phi(x)}. \tag{10} \]

We then convert the probability to self-information as

\[ I(x) = -\ln P(\phi(x) \ge \|r(x)\|_p), \tag{11} \]

which simplifies to

\[ I(x) = \frac{\phi(x)}{\|r(x)\|_p}. \tag{12} \]

As the distance contribution decreases, or as the residual vector magnitude increases, less information is needed to represent this point. We can then compare this to the expected value in regular conviction form, yielding a prediction conviction of

\[ \pi_p = \frac{E I}{I(x)}, \tag{13} \]

where $I$ is the self-information calculated for each point in the model.

Footnote: Other distributions may be selected by adjusting the assumptions slightly, such as the log-normal distribution. The log-normal distribution is the maximum entropy distribution assuming that we know the standard deviation rather than the mean, but this distribution assumes that something closer is less likely, and may be better suited for familiarity conviction. Further, the exact distribution of the distance contribution may be solved if the distributions of the features are known.

Feature prediction contribution is motivated by mean decrease in accuracy (MDA) [Archer and Kimes, 2008]. In MDA, scores are established for models with all the features, $M$, and models with each feature held out, $M_{-f_i}$, $i = 1 \ldots \xi$. The difference $|M - M_{-f_i}|$ is the importance of each feature, where the result's sign is altered depending on whether the goal is to maximize or minimize score. Feature prediction contribution differs from MDA in that feature prediction contribution measures the conditional entropy of adding a feature. This means using prediction conviction on features with significant information may yield higher contribution values even if the feature is independent.

Prediction contribution information, $\pi_c$, is correlated with accuracy and thus can be used as a surrogate. The expected self-information required to express a feature is given by

\[ E I(M) = \frac{1}{\xi} \sum_{i=0}^{\xi} I(x_i), \]

and the expected self-information to express a feature without feature $i$ is

\[ E I(M_{-i}) = \frac{1}{\xi} \sum_{j=0}^{\xi} I_{-i}(x_j). \]

From these equations, we can more formally define prediction contribution of a feature and prediction conviction of a feature.
Definition 3
The prediction contribution of a feature, $\pi_c$, of feature $i$ is

\[ \pi_c(i) = \frac{E I(M) - E I(M_{-f_i})}{E I(M)}. \]

Definition 4
The prediction conviction of a feature, $\pi_p$, of feature $i$ is

\[ \pi_p(i) = \frac{\frac{1}{\xi} \sum_{i=0}^{\xi} E I(M_{-f_i})}{E I(M_{-f_i})}. \]

Familiarity conviction is a metric for describing surprisal of points in a model relative to the training data. This differs fundamentally from prediction conviction. Consider a data set that has data points at regular intervals, such as a data point for each corner in a grid. Given this grid data, prediction conviction will indicate that a data point very close to an existing data point will not be surprising and that it should be easy to predict given the low level of uncertainty. However, familiarity conviction would indicate a higher surprisal for such a point even though it is easy to label, because the point is unusual with regard to the even distribution of the rest of the data points. This new point does not form another corner of the grid. Prediction conviction and familiarity conviction can be used together to find and remove data points that are easy to predict but unusual with regard to uniqueness of data. These properties make familiarity conviction valuable for sanitizing data and reducing data as well as extracting patterns and anomalies, as is discussed in other sections.

Familiarity conviction is based on the similarity metrics described in Section 3.1. As long as a low or zero value of $p$ is used in $L_p$ space metrics for similarity, familiarity conviction is independent of the scale of the data provided and does not overreact to feature dominance based on feature scale and range.

Definition 5
Given a set of points $X \subset \mathbb{R}^z$ for every $x \in X$ and an integer $1 \le k < |X|$ we define the distance contribution probability distribution, $C$ of $X$, to be the set

\[ C = \left\{ \frac{\phi(x_1)}{\sum_{i=1}^{n} \phi(x_i)}, \frac{\phi(x_2)}{\sum_{i=1}^{n} \phi(x_i)}, \ldots, \frac{\phi(x_n)}{\sum_{i=1}^{n} \phi(x_i)} \right\} \tag{14} \]

for a function $\phi : X \to \mathbb{R}$ that returns the distance contribution.

Note that because $\phi(0) = \infty$ may be true under some circumstances, multiple identical points may need special consideration, such as splitting the distance contribution among those points.

Remark 1
Clearly $C$ is a valid probability distribution. We will use this fact to compute the amount of information in $C$.

Definition 6
The point probability of a point $x_i$, $i = 1, 2, \ldots, n$ is

\[ l(i) = \frac{\phi(x_i)}{\sum_i \phi(x_i)}, \tag{15} \]

where we see the index $i$ is assigned the probability of the indexed point's distance contribution.

Definition 7
The set of random variables that characterize the discrete distribution of point probabilities, $L$, is the set $L = \{ l(1), l(2), \ldots, l(n) \}$.

Remark 2
Because we have no additional knowledge of the distribution of points other than that they follow the distribution of the data, we assume $L$ is uniform, as the distance probabilities have no trend or correlation.

Remark 3
A distance contribution is a discrete dis-tribution of point probabilities.
Definition 8
The familiarity conviction of a point $x_i \in X$ is

\[ \pi_f(x_i) = \frac{\frac{1}{|X|} \sum_i KL\left( L \,\|\, L - \{i\} \cup E\,l(i) \right)}{KL\left( L \,\|\, L - \{x_i\} \cup E\,l(i) \right)}, \tag{16} \]

where $KL$ is the Kullback-Leibler divergence. Since we assume $L$ is uniform, we have that the expected probability $E\,l(i) = 1/n$.

Equation 16 can thus be used to compute familiarity conviction.
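One possible reading of Equation 16 is sketched below. The interpretation of $L - \{i\} \cup E\,l(i)$ is an assumption: point $i$ is held out, the distance contributions of the remaining points are recomputed and rescaled to sum to $1 - 1/n$, and the held-out slot is given the expected probability $1/n$. The function names, the Euclidean distance, $\alpha = 1$, and the toy points are likewise illustrative assumptions, and the sketch assumes no duplicate points.

```python
import math

def distance_contributions(points, k):
    """Equation 8 with alpha = 1 for every point (Euclidean distance as a placeholder)."""
    phis = []
    for i, p in enumerate(points):
        nearest = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)[:k]
        phis.append(len(nearest) / sum(1.0 / d for d in nearest))
    return phis

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def familiarity_conviction(points, k=2):
    n = len(points)
    phis = distance_contributions(points, k)
    total = sum(phis)
    L = [phi / total for phi in phis]  # point probabilities, Equation 15
    divergences = []
    for i in range(n):
        rest = points[:i] + points[i + 1:]
        rest_phis = distance_contributions(rest, k)
        # Assumed reading: rescale the remaining mass and insert E l(i) = 1/n for point i.
        scale = (1.0 - 1.0 / n) / sum(rest_phis)
        q = [phi * scale for phi in rest_phis]
        q.insert(i, 1.0 / n)
        divergences.append(kl_divergence(L, q))
    mean_divergence = sum(divergences) / n
    return [mean_divergence / d if d > 0 else math.inf for d in divergences]

# Hypothetical 1-D data: a regular grid plus one point far from the rest.
POINTS = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (10.0,)]
convictions = familiarity_conviction(POINTS, k=2)
```

Under this reading, the isolated point at 10.0 receives the lowest familiarity conviction of the set, since replacing its large point probability with the expected value changes the distribution the most.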
Having defined two methods (convictions) to measure surprisal in the space of points $X$, we introduce techniques that naturally fall out of the information-based representation. The ability to add and remove points to a model easily without retraining, coupled with familiarity and prediction conviction, enables numerous applications in model compression, online learning, anomaly detection, model to model comparison, reinforcement learning, synthetic data generation, and likely more techniques we have not considered. We detail some of these in the subsections following.

Accuracy is best from models with more data, but using kNN with additional data comes at a computational cost. Even for models where that relationship does not hold, the memory needed to store the data can be expensive. This important problem has received some attention with some promising results despite being NP-hard [Gottlieb et al., 2014, Kontorovich et al., 2017]. Our probability and entropy based approach offers some new ways of looking at this problem in an interpretable manner, relaxing some constraints with regard to strict metric spaces.

The entropy measures we discuss can be employed for pruning the model of cases as well as for targetless feature pruning, thereby reducing data size while retaining information in the model. Overall, performing feature or case pruning can benefit any system by reducing the memory and possibly computational resources needed to use the model. Further, these techniques can be used to help direct training, that is, to identify which parts of the model may benefit from having more data. We note that model reduction may be targetless, and in such cases is not a substitute for feature engineering for a targeted application.
Rather, it is useful for removing redundant data to focus on the most or least surprising relationships.

Our entropy-based techniques are generally applicable to either feature or case pruning based on the amount of information that each feature or case provides to the model. A first step in data reduction can be to detect and remove anomalies from the model. Anomalous cases, arguably, reduce the average usefulness of decisions the model makes. Therefore, removing anomalies can actually improve the usefulness of the model while slightly reducing the model size. A next step to reduce model size would be to use our techniques to remove those cases or features that do not provide significant-enough amounts of information to the model.

There are a few approaches we use to determine which cases or features to prune. We can look at the surprisal of the case relative to the rest of the model. If the surprisal is low, the case may be removed from the model for being redundant, which may be done dynamically for online learning. In some circumstances, we keep the instances with the highest surprisal (i.e., the most informative) in order to cap the model size, or remove the case with the lowest surprisal in order to reduce the model size by a set amount. We can also keep the cases with a surprisal above a certain threshold, and perhaps vary that threshold, balancing model size and information. This also applies to features in the model. Those features with high surprisal can be kept (or alternatively, those features with low surprisal can be removed), thereby reducing the model size in the feature space. Further, both case pruning and feature pruning could be performed, reducing the model in both dimensions.

Though we have outlined a few surprisal and conviction measures in this paper for model reduction, they can be combined in other ways to perform model reduction, or used with other feature engineering techniques.
For example, feature prediction conviction could be used to determine whether to include features in a targetless model, based on high or low surprisal for predictions. Feature familiarity conviction could be used to determine which features have training data that most conforms to typical training data reflected in the rest of the model.
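The case pruning strategies described above (capping the model size by keeping the most surprising cases, or keeping cases above a surprisal threshold) can be sketched as follows. The function name `prune_cases` and its parameters are hypothetical, and the per-case surprisal values are assumed to have been computed elsewhere, for example from one of the conviction measures.

```python
def prune_cases(cases, surprisals, max_size=None, threshold=None):
    """Keep the most informative cases according to precomputed surprisal values.

    threshold: if given, drop cases whose surprisal is below it.
    max_size: if given, keep at most this many cases, preferring high surprisal.
    """
    ranked = sorted(zip(surprisals, range(len(cases))), reverse=True)
    if threshold is not None:
        ranked = [(s, i) for s, i in ranked if s >= threshold]
    if max_size is not None:
        ranked = ranked[:max_size]
    keep = sorted(i for _, i in ranked)  # preserve the original case order
    return [cases[i] for i in keep]

# Hypothetical cases and surprisal values.
cases = ["a", "b", "c", "d"]
surprisals = [0.1, 2.0, 0.5, 1.5]
capped = prune_cases(cases, surprisals, max_size=2)
filtered = prune_cases(cases, surprisals, threshold=0.5)
```

The same function could be applied to features instead of cases by passing feature names and per-feature surprisal values.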
Because kNN models are data, we can compare them using prediction and familiarity conviction directly. Such a comparison is of particular interest in online learning, where one could detect a non-stationary process by comparing models built during different times. In reinforcement learning one could compare models to establish propensity for exploration versus exploitation.

Further, by measuring the amount of surprisal that one model contributes to another, some sparse machine learning problems can be transformed into dense problems. Consider numerous individuals that occasionally take some action of a certain type. A model may be trained for each individual, and then compared with surprisal against other individuals or against larger aggregated models to find which larger model would be least surprising for the individual to be included in. We have found promising initial results in commercial settings with these techniques. Combining these untargeted techniques with targeted model selection techniques, such as Bayesian information criterion [Schwarz et al., 1978] or Akaike information criterion [Akaike, 1973], is another potential future direction.
Sparse data is an issue for many machine learning systems, which fail to perform or perform poorly when trained with sparse data, especially if the data is sparse across multiple features. Sparse data can come from sporadic unobservability (i.e., failure to capture the feature at particular times) and unavailability (e.g., no collection of certain features at certain times, a common issue with historical datasets). Regardless of the cause, the combination of targetless kNN with entropy can be used to overcome sparsity by imputing missing features in sparse data, and doing so in a way that reflects the dataset.

In past approaches, generically solving for missing values in a data set has presented many challenging problems, such as Boolean satisfiability and other NP-complete and NP-hard problems. When the rules that determine the missing values are also unknown, the problem becomes even more difficult. Semi-supervised learning has been a rich area of study to attempt to tackle this problem [Triguero et al., 2015]. Our targetless approach, when combined with conviction, offers a novel technique for imputation that can fill in missing data across all of the features and scaffold knowledge from cases with missing data as it fills the missing data in.

Footnote: We are referring to the limited semi-supervised learning techniques that are typically available to data scientists. There are a number of statistical techniques available if sufficient information about a distribution is known, and domain-specific imputation techniques, particularly for language processing, such as that of Gemmeke and Cranen [2008].

The algorithm finds the cases with missing data that are least surprising, labels them, inserts the label into the model for the case, and repeats the process until no more missing data exists. Grouping the cases into batches (i.e., imputing multiple missing data points in a batch) can improve performance so that conviction only needs to be computed occasionally.

More specifically, the algorithm orders the cases with missing data by their surprisal for each feature, conditioned only on the data known for the case, from lowest to highest. For particularly sparse data, we have had success with first ordering the cases by those with the fewest nulls prior to ordering by entropy. Starting with the case with the lowest entropy, we determine a value for the missing data using kNN based on the known features. We can determine values for multiple features at once (in batch), or determine just one missing value, then redetermine the entropy for the cases, and return to impute more data. This process continues until all of the data is complete. When we want merely to reduce the sparsity of the data (rather than fill in all of the values), we can continue until we hit a termination condition or sparsity threshold. Additional work is needed, but we believe a viable termination condition is to cease imputation when the entropy for the lowest-entropy case exceeds a threshold. We believe that, after that point, the additional informational value of the imputed data may be low, and, as such, we cease imputing missing data.
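The ordering-then-imputing loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `impute_least_surprising_first` is hypothetical, and the mean Manhattan distance to the k nearest donor cases over the known features is used as a stand-in for surprisal, which the paper computes from conviction and entropy instead.

```python
import numpy as np


def impute_least_surprising_first(data, k=3):
    """Fill NaN cells one at a time, least-surprising cell first.

    Assumption (not from the paper): the mean distance to the k nearest
    donor cases, measured on the features the case already knows, is
    used as a proxy for surprisal.
    """
    data = np.array(data, dtype=float)  # work on a copy
    while True:
        candidates = []
        for i, j in zip(*np.nonzero(np.isnan(data))):
            known = ~np.isnan(data[i])
            if not known.any():
                continue  # nothing to condition on for this case
            # Donors know feature j and every feature that case i knows.
            has_known = ~np.isnan(data[:, known]).any(axis=1)
            has_target = ~np.isnan(data[:, j])
            donors = np.flatnonzero(has_known & has_target)
            if donors.size == 0:
                continue
            dist = np.abs(data[donors][:, known] - data[i, known]).mean(axis=1)
            nearest = np.argsort(dist)[:k]
            proxy_surprisal = dist[nearest].mean()
            value = data[donors[nearest], j].mean()
            candidates.append((proxy_surprisal, i, j, value))
        if not candidates:
            return data  # complete, or nothing left that can be imputed
        # Impute the least surprising cell, then re-rank the remainder.
        _, i, j, value = min(candidates)
        data[i, j] = value
```

Because each imputed value is written back before the next ranking, later imputations can draw on earlier ones, which mirrors the scaffolding behavior described in the text. A batched variant would fill several of the lowest-ranked cells per pass before recomputing.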
Footnote: Currently, we use entropy for the cases to determine what data to impute first, but we are also exploring the use of entropy for features. We first order by feature entropy, and then by entropy of the cases missing the lowest-entropy feature. We would impute that data first, then recalculate entropies for the model, similar to the manner described above.

Footnote: We can also use the techniques described in this paper for synthetic data generation to determine the value for the missing feature, as described in Section 3.3.6.

The choice of a kNN model allows us a number of interesting opportunities related to providing explanations of the data. With kNN, the model is the data. The model is only complemented by kernel parameters to define nearest neighbors, and by any weights, adjustments or removal of data. In the assessment of the model for either a decision or an action suggestion, the data points that are within the kernel, or that are above a certain threshold for how much they impact the kernel values, are the data points that caused the decision. This is compelling from a machine decision auditability perspective because it means that the data relevant to each decision is directly identified, and that auditing, editing or removing the training data identified as being associated with a decision would have had a direct effect on the decision.

Below, we discuss numerous types of explanation data that can be derived from a kNN model. For each, we give a short description of how it is generated, and what a system or operator can do with it. As an overview, each type of explanation data (or a combination of them) can be passed to a system or human operator for review. The system or human operator can use the explanation data to decide whether to perform the action in question, perform a different action, or perform no action at all. Additionally, this data can be used after the fact to audit decisions that were made. Auditability begins, as alluded to above, with the fact that we can determine what data was used to suggest an action. The explanation data can be generated either at the time of suggesting the action or at the time of the audit.

One of the measures we use as part of our explanation data is conviction, which can be seen, broadly, as the ratio of the expected surprisal to the actual surprisal of a particular case. Numerous types of conviction are discussed in this paper, such as familiarity conviction and prediction conviction. We also discuss formulating each in a targeted or untargeted manner. Any of these formulations for conviction may be used and provided as explanation data along with answers or suggested actions. As an example, targeted or untargeted familiarity conviction may be calculated and provided along with a suggested decision.
An overly low (or high) conviction score can be a cause for concern, and that suggested decision may be flagged for further review or ignored completely. In contrast, a suggested decision or action with a conviction score that is not concerning (e.g., a moderate conviction score, or, in some circumstances, a high conviction score) may be performed or acted upon without further review. A low conviction score may be associated with the data being outside the usual pattern of the data, and therefore be of concern for systems that are trying to perform "usual" actions. High conviction may be of concern when there is a desire to not perform actions that are too "usual." A combined high- and low-pass filter can be used when there is a desire to perform actions that are neither overly usual nor overly unusual, but instead are somewhere in the middle in terms of how expected they are.

We can also use feature prediction contribution to find bias in decision making. Many models and data sets contain many features, and are used for many decisions. As described elsewhere herein, our techniques can help prune a model of cases and features. Nevertheless, there may always be features in models that, in an ideal world, would contribute little to nothing to certain decisions. For example, in the context of financing decisions for individuals, there are certain features that, if decisions were made based on them, would constitute undesirable bias. These may include gender or race. The feature prediction contribution can be provided to the system or human operator to help determine whether making the decision based on that feature would constitute unpermitted bias.
This could lead to pruning that feature from the model or taking other steps to reduce or eliminate its impact on the decision, such as the use of feature weights.

The local region of the model (which we will refer to as the local model) comes with no added cost with kNN, and therefore we can perform analysis on the local model and make other queries on relevant data. As a few examples, we can easily find counterfactual cases [Wachter et al., 2018] and the boundary conditions they yield by performing a query on the data that maximizes the ratio of the action features to the total set of features. We can also find out if any of the features in a case associated with the actions suggested by the model are outside the range of the corresponding features of the cases in the local model. Features being outside the range may be cause for concern and may cause the suggested action to not be performed.

As a counterpoint to the counterfactual cases, we can also determine archetype cases: cases that have the same action as the suggested action and are furthest from other cases with different actions. The distance from the case with the suggested action to the archetype case can be used as explanation data, audit data, and to determine whether to perform a suggested action without further review.

We can find ratios of different types of conviction in the local model to those of the total model, which can indicate which cases and features add significant information to a particular region, and which add less information. If the information is not correlated with accuracy (e.g., has low conviction in the local model and high conviction in the model as a whole), then it may be a measure of noise. If the cases that contribute to the suggestion of an action are above a noisiness threshold, then it may be inadvisable to perform a suggested action.
On the other hand, if the noisiness of the cases used to suggest an action is low, the system or operator may be confident in performing the action. There are many other ratios of conviction that may be used as explanation and audit data, and different ones will be useful in different scenarios.

A "less similar" model can also be determined, where the closest k cases (by distance, by count, or by density threshold) are excluded, and the distance to the next closest cases is determined. That distance can be used as explanation data, audit data, and to determine whether to perform a suggested action without further review. For example, a higher distance to the "less similar" cases can be an indicator that the suggested action is in a sparsely populated part of the model, and therefore should be reviewed before being acted upon.

We can also define feature residuals based on the local model (or a regional model that is the N nearest neighbors, where N can be the same as, higher than, or lower than the k used to define the local model). Here, we can use the mean absolute error, variance, or other moments or measures to predict how well the model predicts each feature when it is removed. We can also determine action probabilities, which, in the case of categorical actions, can be measured as the percentage of cases in the local model that have that categorical action. For continuous or ordinal actions, the action probability can be a probability measure based on the confidence interval of the suggested actions for a given tolerance (e.g., an action value of 250 may be 67% (the action probability) likely to be within +/-5 (the tolerance) of 250). Feature residuals and conviction can be used in conjunction; a prediction with high prediction conviction but also wide residuals may not be reliable, but a prediction with low prediction conviction but wide residuals may potentially be improved if further training data is added similar to the prediction.
We can also determine local or regional model complexity (i.e., whether the variance is high, whether the accuracy is low, whether correlations among variables are low, etc.) and fractal dimensionality (i.e., placing a shape over the model and shrinking the scale of the shape and counting the number of smaller shapes needed to cover the extents of the model).
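Several of the explanation quantities above fall directly out of a single nearest-neighbor query. The following is a minimal sketch under stated assumptions: the function name `explain_local_model` and its return keys are hypothetical, Manhattan distance is used only for simplicity, and the "less similar" distance is taken as the distance to the first case outside the k nearest.

```python
from collections import Counter

import numpy as np


def explain_local_model(cases, actions, query, k=5):
    """Derive simple explanation data from the k nearest cases.

    Assumptions (illustrative only): Manhattan distance, categorical
    actions, and a local model defined purely by neighbor count.
    """
    cases = np.asarray(cases, dtype=float)
    query = np.asarray(query, dtype=float)
    dist = np.abs(cases - query).sum(axis=1)
    order = np.argsort(dist)
    local = order[:k]  # indices of the cases that drive the decision

    # Action probability: share of local cases carrying each action.
    counts = Counter(actions[i] for i in local)
    action_probability = {a: c / len(local) for a, c in counts.items()}

    # Features of the query that fall outside the local model's range.
    lo, hi = cases[local].min(axis=0), cases[local].max(axis=0)
    out_of_range = np.flatnonzero((query < lo) | (query > hi))

    # "Less similar" distance: the nearest case once the local k are excluded.
    less_similar = float(dist[order[k]]) if len(order) > k else None

    return {
        "local_cases": local,
        "action_probability": action_probability,
        "out_of_range_features": out_of_range,
        "less_similar_distance": less_similar,
    }
```

Returning the indices of the local cases alongside the derived measures keeps each decision traceable to the training data behind it, which is the property the auditing discussion relies on.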
Given that prediction conviction is a method to express how surprising an observation is, we can reverse the math and use conviction to generate a new sample of data for a given amount of surprisal. The general approach is to randomly select or predict a feature of a case from the training data and then resample it based on the new condition. This approach is related to Gibbs sampling [Martino et al., 2015, Efros and Leung, 1999] in that it incrementally obtains new values for each feature conditioned on the previous values, though the conditioning and sampling are based on our approach to kNN.

Using the conditioned local residuals for a part of the model, as described in Section 3.2.1, we can parameterize the random number distribution to generate a new value for a given feature. Our resampling method is related to the approach used by the Mann-Whitney test [Mann and Whitney, 1947], a powerful and widely used nonparametric test to determine whether two sets of samples were drawn from the same distribution. In the Mann-Whitney test, samples are randomly checked against one another to see which is greater, and if both sets of samples were drawn from the same distribution then the expectation is that both sets of samples should have an equal chance of having a higher value when randomly chosen samples are compared against each other. Our approach for resampling a point is to randomly choose whether the new sample is greater or less than the other point and then draw a sample from the distribution using the feature's residual as the expected value. Just as the exponential distribution is entropy maximizing given the sole constraint of a positive mean, the double-sided exponential distribution (also known as the Laplace distribution) is the entropy maximizing distribution given a positive mean distance about a point.
The log-normal and other distributions may be used as well, depending on the types of residuals computed and assumptions made about the local distributions.

If a feature is not continuous but rather nominal, then the local residuals can populate a confusion matrix, and an appropriate sample can be drawn based on the probabilities for drawing a new sample given the previous value.

Suppose we would like to generate synthetic data with features $i \in \Xi$. If there are no conditions placed on the new synthetic data, then we start with a random feature $i$. Because the observations within the model are representative of the observations made so far, a random instance is chosen from the observations using the uniform distribution over all observations. Then the value for feature $i$ of this observation is resampled via the methods mentioned above. The value for feature $i$ then becomes a condition on subsequently generated features.

Next, suppose that we would like to generate feature $j$, given that features $i \in \Xi$ have corresponding values $x_i$. The model labels feature $j$ conditioned on all $x_i$ to find some value $t$. This new value $t$ becomes the expected value for the resampling process described above, and the local residual (or confusion matrix) becomes the appropriate parameter or parameters for the expected deviation to find the value for $x_j$.

The process for filling in the features for an instance may begin with no feature values subject to conditions, or some feature values may have been specified as conditions for the data to generate. Either way, the remaining features may be ordered (i.e., selected for determination of a new value) randomly or may be ordered via a feature conviction value. When a new value is generated, the process restarts with the new value as an additional condition.
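The two resampling steps described above (continuous and nominal) can be illustrated as follows. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, the residual is assumed to be a mean absolute deviation, and the confusion matrix is assumed to hold counts with one row per previous label.

```python
import numpy as np


def resample_continuous(expected, residual, rng):
    """Draw a new value at an exponentially distributed distance from
    `expected`, on a randomly chosen side (a Laplace draw overall).

    Assumption: `residual` is the local mean absolute deviation, used as
    the mean of the exponential distance.
    """
    side = rng.choice([-1.0, 1.0])  # greater or less than the point
    return expected + side * rng.exponential(scale=residual)


def resample_nominal(previous, confusion, labels, rng):
    """Draw a label using the confusion-matrix row for `previous`.

    Assumption: `confusion[r][c]` counts how often label r was observed
    as label c in the local model.
    """
    row = np.asarray(confusion[labels.index(previous)], dtype=float)
    return labels[rng.choice(len(labels), p=row / row.sum())]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(resample_continuous(10.0, 2.0, rng))
    print(resample_nominal("red", [[8, 2], [1, 9]], ["red", "blue"], rng))
```

In a full generator these calls would be chained feature by feature, with each new value added to the conditions used to predict the expected value and residual of the next feature.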
Continuing with the double-sided exponential distribution as a maximum entropy distribution of distance in $L_p$ space, we can derive a closed form solution for how to scale the exponential distributions based on a prediction conviction value.

Starting with Equation 13, we specify a value, $\nu$, for the prediction conviction as
\[ \nu = \pi_p(x) = \frac{\mathbb{E}I}{I(x)} \tag{17} \]
which can be rearranged as
\[ I(x) = \frac{\mathbb{E}I}{\nu}. \tag{18} \]
Substituting in the self-information from Equation 12, we have
\[ \frac{\phi(x)}{\lVert r(x) \rVert_p} = \frac{\mathbb{E}I}{\nu}. \tag{19} \]
Note that the units on both sides of Equation 19 match. This is because the natural logarithm and exponential in the derivation of Equation 19 cancel out, leaving the result in nats. We can rearrange in terms of distance contribution as
\[ \phi(x) = \lVert r(x) \rVert_p \cdot \frac{\mathbb{E}I}{\nu}. \tag{20} \]

To proceed further we need to make an assumption about the distribution of distance contributions $\phi$. Seeking to minimize the complexity of our assumptions, we simply observe that distances are supported by the positive reals. Constraining the first moment, or the first and second moments, and maximizing the entropy gives us the exponential and log-normal distributions respectively. For simplicity's sake we proceed with $e(\zeta)$ but note that in practice $\ln \mathcal{N}(\mu, \sigma)$ is often observed. One may distinguish among the distributions using likelihood curvature tools such as Fisher information.

If we let $p = 0$, which is desirable for conviction and other aspects of the similarity measure, we can rewrite the distance contribution in terms of a norm of the values observed for the number of features, $\xi$, each with an expected mean of $\frac{1}{\zeta_i}$. Taking the expected value of both sides we find
\[ \left( \prod_i \frac{1}{\zeta_i} \right)^{\frac{1}{\xi}} = \left( \prod_i r_i \right)^{\frac{1}{\xi}} \frac{\mathbb{E}I}{\nu}. \tag{21} \]
Due to the number of ways of assigning surprisal across the features, many solutions may exist. However, unless otherwise specified or conditioned, we would want to distribute surprisal across the features holding expected proportionality constant.
This allows us to write the distance contribution, which becomes the mean absolute error for the exponential distribution, as
\[ \frac{1}{\zeta_i} = r_i \left( \frac{\mathbb{E}I}{\nu} \right)^{\frac{1}{\xi}}, \tag{22} \]
and solving for the $\zeta_i$ to parameterize the exponential distributions, we find
\[ \zeta_i = \frac{1}{r_i} \left( \frac{\nu}{\mathbb{E}I} \right)^{\frac{1}{\xi}}. \tag{23} \]
Equation 23, taken with the value of the feature, becomes the distribution by which to generate a new random number under the maximum entropy assumption of exponentially distributed distance from the value.

The ability to randomly generate data with a controlled amount of surprisal is a novel way to characterize the classic exploration versus exploitation trade-off in searching for an optimal solution to a goal. Currently, pairing a means to search, such as Monte Carlo tree search [Abramson, 1987], with a universal function approximator, such as neural networks, is the most successful approach to solving difficult reinforcement learning problems without domain knowledge [Silver et al., 2017]. Because our data synthesis technique comes from the universal function approximator model (kNN) itself, we can create a reinforcement learning architecture that is similar and tightly coupled.

Because the synthetic data generation can be conditioned, we can condition the search on both the current state of the system, as it is currently observed, and a set of goal values for features. As the system is being trained, it can be continuously updated with the new training data. Once states are evaluated for their ultimate outcome, a new set of features or feature values can be updated or added to all of the observations indicating the final scores or measures of outcomes. Keeping track of which observations belong to which training sessions (or games) is a convenient way to track and update this data.
Given that the final score or multiple goal metrics are already in the kNN database, the synthetic data generation can query for new data conditioned upon having a high score or winning conditions, with a specified amount of conviction.

This results in a reinforcement learning algorithm that can be queried for the relevant training data for every decision, as described in Section 3.3.4. The commonality among the similar cases, boundary cases, archetypes, etc. can be combined to find when certain decisions are likely to yield a positive outcome, a negative outcome, or a larger amount of surprisal, thus improving the quality of the model. By seeking high surprisal moves, the system will improve the breadth of its observations and learning, though it may not perform well. Setting the conviction of the data synthesis to 1 yields a balanced trade-off between exploration and exploitation. As more information is learned, this conviction value may be reduced to focus on achieving goals.

The interpretability of reinforcement learning may help overcome many data-availability issues. For example, when training data is needed for dangerous, expensive, or otherwise difficult-to-produce circumstances, we can generate synthetic data conditioned in value and conviction to match those circumstances. As such, our method can provide the sampling strategy necessary for reinforcement learning with more control than current techniques.
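As a rough illustration of conviction-controlled sampling, the sketch below takes Equation 23 at face value, reading it as $\zeta_i = \frac{1}{r_i} (\nu / \mathbb{E}I)^{1/\xi}$, and uses the resulting rates to perturb a case. The function names `exponential_rates` and `synthesize_case` are hypothetical, and the exponent $1/\xi$ is an assumption about how the printed equation should be read rather than a confirmed detail.

```python
import numpy as np


def exponential_rates(residuals, conviction, expected_info):
    """Rate parameters zeta_i for each feature's exponential distance.

    Assumption: zeta_i = (1 / r_i) * (nu / E[I]) ** (1 / xi), where xi is
    the number of features, following Equation 23 as read here.
    """
    residuals = np.asarray(residuals, dtype=float)
    xi = residuals.size
    return (1.0 / residuals) * (conviction / expected_info) ** (1.0 / xi)


def synthesize_case(values, residuals, conviction, expected_info, rng):
    """Perturb each feature by an exponential distance on a random side."""
    rates = exponential_rates(residuals, conviction, expected_info)
    sides = rng.choice([-1.0, 1.0], size=rates.size)
    distances = rng.exponential(scale=1.0 / rates)  # mean distance = 1 / zeta_i
    return np.asarray(values, dtype=float) + sides * distances


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Lower conviction gives smaller rates, hence wider, more surprising draws.
    print(exponential_rates([2.0, 4.0], conviction=1.0, expected_info=1.0))
    print(exponential_rates([2.0, 4.0], conviction=0.25, expected_info=1.0))
    print(synthesize_case([1.0, 5.0], [2.0, 4.0], 1.0, 1.0, rng))
```

Under this reading, a search procedure could lower the conviction argument to widen the draws when it wants broader exploration and raise it to stay close to observed cases.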
Although providing a rigorous review of our results and methods is out of the scope of this paper, we summarize a few here to motivate and encourage additional exploration. We tested classic kNN as implemented in scikit-learn [Pedregosa et al., 2011], standardizing the scale of the features as is common practice in machine learning, against kNN with fractional p values and $\lim p \to 0$ using uncertainty in distance as mentioned in Section 3.1.1, without standardization. We compared the results across a robust suite of 97 regression datasets and 78 classification datasets selected from among the benchmark data published by Olson et al. [2017].

On the classification datasets, scikit-learn's random forest implementation averaged an accuracy of 0.79. kNN already performs well on classification problems, averaging an accuracy of 0.76. However, using fractional p values, we saw the accuracy increase to 0.77, and allowing a p value of 0 based on uncertainty, we saw the average accuracy improve to 0.78. Though the accuracy improvements of our techniques are slight, the use of low or zero p values means that we can maintain the data directly without scaling and that we have made a step toward accurate probability-based reasoning on data using conjunctions as described in Section 3.1.1.

On the regression datasets, scikit-learn's random forest implementation averaged an r-squared score of 0.77. In many situations, such as those involving extrapolation, kNN regression does not perform as well as other methods, and we saw this with the scikit-learn implementation, which resulted in an r-squared score of 0.53. However, our improvements yielded a considerable gain in kNN's regression scores. Using fractional p values yielded an r-squared score of 0.57, which is a significant (p ≪ 0.001) improvement based on the Wilcoxon signed rank test.
Further allowing the use of a zero p value with uncertainty in distance as mentioned in Section 3.1.1, the r-squared score improved to 0.66, which is also a significant result (p ≪ 0.001).

We believe that kNN's regression scores can also be improved to be competitive with other cutting edge algorithms. In Section 3.3.3, we showed how a targetless approach to data can fill in missing data, and from an auditability perspective it is easy to track the history of data imputation. Conversely, we are investigating how exputation approaches can be employed to synthesize likely data points outside the bounds of the training data. Knowledge of the features, such as their bounds, can help when reflecting, amplifying, or synthesizing exputed data points, which can then be used for interpolation.

The reason for our belief that this is a core problem lies with the topology of data as the dimensionality grows. As the number of dimensions increases for a given set of data, many intuitive analytical techniques such as Euclidean norms and Gaussian kernels become inappropriate, as the unit radius hypervolume goes toward zero and the probability that data points fall in the sharp corners of a hypervolume goes toward one [Verleysen and François, 2005]. This implies that nearly all data points will be at or beyond the periphery, requiring extrapolation. Dealing with any kind of cost or value function to perform optimization will mean that nearly all points are Pareto optimal, meaning that it becomes increasingly difficult to define "good" because nearly every point has some unique quality. In many cases, the larger number of dimensions can be helpful, but primarily when the structure is extracted and the dimensionality is reduced [Kittler, 1986, Kohavi and John, 1997, Stoppiglia and Dreyfus, 2003].
As nearly all new observations will be on the periphery, we believe extrapolation techniques, such as exputation, are likely to improve results while maintaining interpretability.

A dimensionality bottleneck is little different than the information bottlenecks used for generalization and variance reduction across other areas of machine learning [Tishby and Zaslavsky, 2015]. By employing clustering techniques in conjunction with model reduction techniques as mentioned in Section 3.3.1, selecting prototypes or archetypes to represent a cluster in a hierarchical fashion, we may be able to characterize the entropy flux between parts of the model; hierarchical models and hierarchical explanations are natural consequences. We note the striking commonality of zero p value Lebesgue space as depicted in Appendix A, the conjunction of probabilities of independent distributions, and the core of the no-flattening theorem by Lin et al. [2017] that relates hierarchical architectures for neural networks and performance. Our future work will include further efforts to use probability throughout all parts of kNN such that any form of entropy or probability can be calculated, and assumptions can be clearly interpreted.

Additional future work will be to characterize the performance and scalability of targetless kNN queries with fractional and zero p values, which is outside the scope of this paper.

Maximizing the interpretability of artificial intelligence leads either to understanding the generalized relationships of the data, as with symbolic or tree-based models, or to understanding the data itself. With the improved performance of computing and the advances in kNN, we conclude that using kNN provides a promising foundation for the future of interpretable artificial intelligence and machine learning.

References
B. D. Abramson. The Expected-outcome Model of Two-player Games. PhD thesis, Columbia University, New York, NY, USA, 1987.
P. K. Agarwal, B. Aronov, S. Har-Peled, J. M. Phillips, K. Yi, and W. Zhang. Nearest-neighbor searching under uncertainty II. ACM Transactions on Algorithms (TALG), 13(1):3, 2016.
C. C. Aggarwal, A. Hinneburg, and D. A. Keim. On the surprising behavior of distance metrics in high dimensional space. In International Conference on Database Theory, pages 420–434. Springer, 2001.
H. Akaike. Information theory and an extension of the maximum likelihood principle. Proceedings of the 2nd International Symposium on Information Theory, pages 267–281, 1973.
E. Alpaydin. Voting over multiple condensed nearest neighbor. Artificial Intelligence Review, pages 115–132, 1997.
E. Alpaydin. Machine learning: The new AI. pages 1013–1022, 2016.
N. Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.
K. J. Archer and R. V. Kimes. Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis, 52(4):2249–2260, 2008.
K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is "nearest neighbor" meaningful? In International Conference on Database Theory, pages 217–235. Springer, 1999.
L. Breiman, J. Friedman, C. Stone, and R. Olshen. Classification and regression trees. 1984.
J. R. Cano, F. Herrera, and M. Lozano. Evolutionary stratified training set selection for extracting classification rules with tradeoff precision-interpretability. Data and Knowledge Engineering, 60:90–108, 2006.
D. Coomans and D. Massart. Alternative k-nearest neighbour rules in supervised pattern recognition: Part 1. k-nearest neighbour classification by using alternative voting rules. Analytica Chimica Acta, 136:15–27, 1982.
A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. In ICCV, page 1033. IEEE, 1999.
J. F. Gemmeke and B. Cranen. Using sparse representations for missing data imputation in noise robust speech recognition. In Signal Processing Conference, 2008 16th European, pages 1–5. IEEE, 2008.
I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. pages 1013–1022, 2016.
Google LLC. What-if tool. https://pair-code.github.io/what-if-tool/, 2018.
L.-A. Gottlieb, A. Kontorovich, and P. Nisnevitch. Near-optimal sample compression for nearest neighbors. In Advances in Neural Information Processing Systems, pages 370–378, 2014.
E. C. Harrington. The desirability function. Industrial Quality Control, 21(10):494–498, 1965.
T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning. page 63, 2001.
A. Hinneburg, C. C. Aggarwal, and D. A. Keim. What is the nearest neighbor in high dimensional spaces? pages 506–515, 2000.
I. Hmeidi, B. B Hawashin, and E. El-Qawasmeh. Performance of kNN and SVM classifiers on full word Arabic articles. Advanced Engineering Informatics, 22(1):106–111, 2008.
M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Can shared-neighbor distances defeat the curse of dimensionality? In International Conference on Scientific and Statistical Database Management, pages 482–500. Springer, 2010.
P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proc. of the 30th ACM Sym. on Theory of Computing, pages 604–613, 1998.
J. Kittler. Feature selection and extraction. pages 115–132, 1986.
R. Kohavi and G. John. Wrappers for feature subset selection. Artificial Intelligence, 1-2:273–323, 1997.
A. Kontorovich, S. Sabato, and R. Weiss. Nearest-neighbor sample compression: Efficiency, consistency, infinite dimensions. In Advances in Neural Information Processing Systems, pages 1573–1583, 2017.
T. Leinster and M. W. Meckes. Maximizing diversity in biology and beyond. Entropy, 18(3):88, 2016.
H. W. Lin, M. Tegmark, and D. Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6):1223–1247, 2017.
S. Lukaszyk. Probability metric, examples of approximation applications in experimental mechanics. PhD thesis, Cracow University of Technology, 2003.
S. Lukaszyk. A new concept of probability metric and its applications in approximation of scattered data sets. Computational Mechanics, 33(4):299–304, 2004.
H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, pages 50–60, 1947.
L. Martino, H. Yang, D. Luengo, J. Kanniainen, and J. Corander. A fast universal self-tuned sampler within Gibbs sampling. Digital Signal Processing, 47:68–83, 2015.
M. Mohri, R. Afshin, and T. Ameet. Foundations of machine learning. pages 1013–1022, 2012.
R. S. Olson, W. La Cava, P. Orzechowski, R. J. Urbanowicz, and J. H. Moore. PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData Mining, 10(1):36, 2017.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
N. Poerner, H. Schütze, and B. Roth. Evaluating neural network explanation methods using hybrid documents and morphosyntactic agreement. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 340–350, 2018.
J. Raikwal and K. Saxena. Performance evaluation of SVM and k-nearest neighbor algorithm over medical data set. International Journal of Computer Applications, 50(14):975–985, 2012.
M. Rao, Y. Chen, B. C. Vemuri, and F. Wang. Cumulative residual entropy: a new measure of information. IEEE Transactions on Information Theory, 50(6):1220–1228, 2004.
M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1135–1144, 2016.
M. Schuh, W. T., and R. Angryk. Mitigating the curse of dimensionality for exact kNN retrieval. Proceedings of the Twenty-Seventh International Florida Artificial Intelligence Research Society Conference, pages 363–368, 2014.
M. A. Schuh, T. Wylie, and R. A. Angryk. Improving the performance of high-dimensional kNN retrieval through localized dataspace segmentation and hybrid indexing. In Proc. of the 17th ADBIS Conf., pages 344–357, 2013.
G. Schwarz et al. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
D. M. Skapura. Building neural networks. page 63, 1996.
H. Stoppiglia and G. Dreyfus. Ranking a random feature for variable and feature selection. Journal of Machine Learning Research, Special Issue on Variable/Feature Selection, 1-2, 2003.
Q. Tan, G. Yu, C. Domeniconi, J. Wang, and Z. Zhang. Incomplete multi-view weak-label learning. In IJCAI, pages 2703–2709, 2018.
Y. Tao, K. Yi, C. Sheng, and P. Kalnis. Quality and efficiency in high dimensional nearest neighbor search. In Proc. of the ACM SIGMOD Inter. Conf. on Mgmt. of Data, pages 563–576, 2009.
N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. In Information Theory Workshop (ITW), 2015 IEEE, pages 1–5. IEEE, 2015.
N. Tomašev and D. Mladenić. Hubness-aware shared neighbor distances for high-dimensional k-nearest neighbor classification. Knowledge and Information Systems, 39(1):89–122, 2014.
I. Triguero, S. García, and F. Herrera. Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowledge and Information Systems, 42(2):245–284, 2015.
H. Tuomisto. A consistent terminology for quantifying species diversity? Yes, it does exist. Oecologia, 164(4):853–860, 2010.
V. V. B. Surya, H. Prasath, A. Arafat, O. Lasassmeh, and A. Hassanat. Distance and similarity measures effect on the performance of k-nearest neighbor classifier. arXiv:1708.04321, 2017.
M. Verleysen and D. François. The curse of dimensionality in data mining and time series prediction. In International Work-Conference on Artificial Neural Networks, pages 758–770. Springer, 2005.
S. Wachter, B. Mittelstadt, and C. Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harvard Journal of Law and Technology, 31(2), 2018.
F. Wang and C. Rudin. Falling rule lists. In Artificial Intelligence and Statistics, pages 1013–1022, 2015.
F. Zhao and Y. Guo. Semi-supervised multi-label learning with incomplete labels. In IJCAI, pages 4062–4068, 2015.
A Geometric Mean Derivation
The geometric mean can be derived from the generalized mean as

\[
\lim_{p \to 0} \left( \sum_{i=1}^{n} w_i x_i^p \right)^{1/p}
= \lim_{p \to 0} \exp \ln \left( \sum_{i=1}^{n} w_i x_i^p \right)^{1/p}
= \lim_{p \to 0} \exp \left( \frac{\ln \left( \sum_{i=1}^{n} w_i x_i^p \right)}{p} \right).
\]

Then using L'Hôpital's rule and the chain rule on the inner part of this equation, we can simplify as

\[
\lim_{p \to 0} \frac{\ln \left( \sum_{i=1}^{n} w_i x_i^p \right)}{p}
= \lim_{p \to 0} \frac{\;\dfrac{\sum_{i=1}^{n} w_i x_i^p \ln x_i}{\sum_{i=1}^{n} w_i x_i^p}\;}{1}
= \lim_{p \to 0} \frac{\sum_{i=1}^{n} w_i x_i^p \ln x_i}{\sum_{i=1}^{n} w_i x_i^p}
= \frac{\sum_{i=1}^{n} w_i \ln x_i}{\sum_{i=1}^{n} w_i}
= \frac{\ln \left( \prod_{i=1}^{n} x_i^{w_i} \right)}{\sum_{i=1}^{n} w_i}.
\]

Therefore substituting back in the previous result yields

\[
\lim_{p \to 0} \left( \sum_{i=1}^{n} w_i x_i^p \right)^{1/p}
= \lim_{p \to 0} \exp \left( \frac{\ln \left( \prod_{i=1}^{n} x_i^{w_i} \right)}{\sum_{i=1}^{n} w_i} \right)
= \left( e^{\ln \left( \prod_{i=1}^{n} x_i^{w_i} \right)} \right)^{\frac{1}{\sum_{i=1}^{n} w_i}}
= \left( \prod_{i=1}^{n} x_i^{w_i} \right)^{\frac{1}{\sum_{i=1}^{n} w_i}}.
\]

Setting all $w_i = \frac{1}{n}$ yields

\[
\lim_{p \to 0} \left( \sum_{i=1}^{n} \frac{1}{n} x_i^p \right)^{1/p}
= \left( \prod_{i=1}^{n} x_i \right)^{\frac{1}{n}}.
\]
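The limit derived above can be checked numerically. The following is a minimal sketch, not part of the original work: the function names `generalized_mean` and `weighted_geometric_mean` and the sample values are illustrative assumptions. It also assumes that the weights sum to 1, which is the case where the quotient takes the 0/0 form that L'Hôpital's rule requires.

```python
import math


def generalized_mean(xs, ws, p):
    """Weighted power mean (sum_i w_i * x_i**p) ** (1/p), for p != 0."""
    return sum(w * x ** p for x, w in zip(xs, ws)) ** (1.0 / p)


def weighted_geometric_mean(xs, ws):
    """Closed form from the derivation: (prod_i x_i**w_i) ** (1 / sum_i w_i)."""
    log_prod = sum(w * math.log(x) for x, w in zip(xs, ws))
    return math.exp(log_prod / sum(ws))


# Illustrative data (an assumption, not taken from the paper).
xs = [1.0, 4.0, 16.0]
n = len(xs)
ws = [1.0 / n] * n  # uniform weights w_i = 1/n, which sum to 1

limit_value = weighted_geometric_mean(xs, ws)

# As p shrinks toward 0, the generalized mean approaches the geometric mean.
for p in (1.0, 0.1, 0.001, 1e-6):
    print(p, generalized_mean(xs, ws, p), limit_value)
```

With uniform weights the closed form reduces to the ordinary geometric mean, here the cube root of 1 · 4 · 16 = 64, and the printed generalized means move toward that value as p decreases.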