Asymmetric Impurity Functions, Class Weighting, and Optimal Splits for Binary Classification Trees
DAVID ZIMMERMANN
Abstract.
We investigate how asymmetrizing an impurity function affects the choice of optimal node splits when growing a decision tree for binary classification. In particular, we relax the usual axioms of an impurity function and show how skewing an impurity function biases the optimal splits to isolate points of a particular class when splitting a node. We give a rigorous definition of this notion, then give a necessary and sufficient condition for such a bias to hold. We also show that the technique of class weighting is equivalent to applying a specific transformation to the impurity function, and tie all these notions together for a class of impurity functions that includes the entropy and Gini impurity. We also briefly discuss cost-insensitive impurity functions and give a characterization of such functions.

1. Introduction
In supervised learning, decision trees and their related methods are among the most popular tools for classification. Their constructions are based on many criteria and parameters, among them a chosen function to measure impurity of a node or dataset. This impurity function informs the optimal (greedy) split for a given node when growing the tree. (There are splitting criteria that are not based on impurity functions, but we do not examine those here.) An impurity function satisfies certain axioms (which may vary slightly among different authors and contexts), among them the property that the impurity function is symmetric in its entries. Intuitively, this condition says that an impurity function treats all classes equally during tree construction. For example, a dataset or tree node that consists of 80% Class 0 points and 20% Class 1 points is equally “impure” or of the same “quality” as a tree node that consists of 20% Class 0 points and 80% Class 1 points. However, in many applications this is not necessarily desirable. A couple of contexts for which this may be the case:

• Imbalance in the number of occurrences of each class: If our dataset is highly imbalanced then detection of the rare class may be difficult. In this case one might, for example, consider an 80-20 mixture of points to be better or more informative than a 20-80 mixture of the same size, depending on which class is the rare class.

• Different costs for different misclassification types: The classic example of this is cancer detection, where the cost of a false negative is the death of a patient whereas the cost of a false positive (though often high) is not nearly as catastrophic. In this case as well, the quality of an 80-20 mixture of points might be considered different from the quality of a 20-80 mixture of the same size.

Both of these situations arise frequently in practice, and the problem of dealing with them is well-studied [6, 8, 10].
A common strategy that is used to deal with the first situation is oversampling or undersampling: one artificially increases the number of samples of the rare class or decreases the number of samples of the common class in order to balance the prior class probabilities. There are many oversampling and undersampling techniques [2, 3, 7]; perhaps the simplest technique, which is the one we will consider in this paper, is class weighting: one simply scales the weights of all points of a chosen class by some fixed factor. A strategy that is used to deal with the second situation is to incorporate different misclassification costs into the impurity function itself [1]. Along these lines, sensitivity of splitting criteria to different misclassification costs has been studied as well [4, 5]. Cost modification and class weighting are essentially just different perspectives on the same idea; for example, misclassifying a point of doubled weight incurs the same penalty as misclassifying an unweighted point with doubled misclassification cost. In this way, class weighting can be thought of either as a simple over/undersampling technique or as a modification of misclassification costs.

A different approach to dealing with imbalanced classes or misclassification costs is to choose an asymmetric impurity function to determine splits. Intuitively, the asymmetry in the impurity function should somehow naturally create a bias toward or against a particular class. Work by Marcellin, Zighed, and Ritschard [11, 12] considered the case of imbalanced classes and proposed a family of asymmetric impurity functions. They showed a change in the shapes of the precision-recall and ROC curves for several example datasets when using these asymmetric impurity functions in place of a symmetric function, giving an improvement in recall at lower-precision decision thresholds.
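To make class weighting concrete, here is a minimal sketch (not from the paper; the toy dataset, threshold-split setup, and scaling factor are invented for illustration) that computes weighted Gini impurities over candidate threshold splits of a one-dimensional dataset. Scaling the Class 1 weights down moves the optimal split so that the right child isolates a pure block of Class 1 points, the behavior examined with Figure 1 in Section 3.

```python
# Weighted Gini impurity of a list of (label, weight) pairs: g(p) = 2p(1 - p),
# where p is the weighted proportion of Class 1 points.
def gini(points):
    total = sum(w for _, w in points)
    p = sum(w for y, w in points if y == 1) / total
    return 2 * p * (1 - p)

# Impurity of a split, normalized by total weight.
def split_impurity(left, right):
    wl = sum(w for _, w in left)
    wr = sum(w for _, w in right)
    return (wl * gini(left) + wr * gini(right)) / (wl + wr)

# Class weighting: scale the weight of every Class 1 point by a fixed factor.
def reweight(points, class1_factor):
    return [(y, w * class1_factor if y == 1 else w) for y, w in points]

# Toy 1-D data: labels in positional order; candidate splits are thresholds.
data = [(y, 1.0) for y in [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]]

def best_threshold(points):
    return min(range(1, len(points)),
               key=lambda t: split_impurity(points[:t], points[t:]))

print(best_threshold(data))                  # 3: left child is pure Class 0
print(best_threshold(reweight(data, 0.25)))  # 8: right child is pure Class 1
```

Down-weighting Class 1 makes it cheap to leave stray Class 1 points in the left child but expensive to admit Class 0 points into the right child, which is why the optimal threshold moves.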
The parametrized family h_m : [0, 1] → R they proposed is given by

(1)  h_m(p) = p(1 − p)/((−2m + 1)p + m²),   m ∈ (0, 1),

where m is also the maximizer of h_m.

In this paper, we more closely investigate exactly how asymmetrizing an impurity function leads (at least locally) to favoring purity in one class over another when splitting a node. In particular, we relax the usual axioms of an impurity function (Definition 1), then compare two arbitrary impurity functions f and g and investigate what causes f to more strongly prefer purity in one class than g does for a given split. We give a rigorous definition of this notion (Definitions 4, 9), then state and prove a necessary and sufficient condition on f and g for such a comparison to hold (Theorem 12). We also show that class weighting is equivalent to applying a specific transformation to the impurity function (Definitions 24, 25, Theorem 26), and tie all of these preceding ideas together for a class of impurity functions that includes the entropy and Gini impurity (Definition 30, Theorem 33). We also give a characterization of cost-insensitive impurity functions (Definition 37, Theorem 38). Along the way, we consider the typical axioms imposed upon an impurity function and remark on each axiom's utility and necessity.

This paper is organized as follows: in Section 2 we state some preliminary terminology, conventions, and notation. In Section 3 we give motivation for our main definition and describe how certain performance metrics relate to a single split of a node. In Section 4 we give a modified definition of impurity function, then state and prove our main results about comparisons of impurity functions. In Section 5 we define a transformation on the set of impurity functions and show equivalence between this transformation and class weighting. We then relate this transformation back to Section 4 and briefly discuss cost-insensitive impurity functions.
Finally, in Section 6 we close with a few remarks about the axioms of an impurity function as typically stated in the literature.

2. Preliminary Terminology, Conventions, and Notation
Throughout this paper, we only concern ourselves with binary classification; all underlying distributions of data are assumed to have two classes. We will refer to one of the classes as negatives or Class 0, and to the other as positives or Class 1. We use the term positive prevalence of a tree node or dataset to refer to the weighted proportion of Class 1 points in said node or dataset. All trees are binary trees with each non-leaf node having two nonempty children. Given a node that splits into two children, we will refer to the child node with lower positive prevalence as the left child, and the child node with higher positive prevalence as the right child (if both nodes have the same positive prevalence, label them as left and right arbitrarily). We will always use the letters c, a, b (sometimes subscripted) to denote the positive prevalences of the parent node, the left child, and the right child, respectively, and will refer to a and b as the left and right positive prevalences. We will use the letters f and g to denote impurity functions. Finally, for a ≤ c ≤ b and a function f we adopt the convention

((b − c)/(b − a)) f(a) + ((c − a)/(b − a)) f(b) |_{(a,b)=(c,c)} = f(c).

3. Performance Metrics for a Single Split
In this section we provide some motivation and intuition for what follows in Sections 4 and 5. Let us begin with an example to illustrate the notion of “preference for purity in a given class” for one impurity function versus another.

In Figure 1, the top plot shows a collection of Class 0 points (blue ‘x’s) and Class 1 points (red circles) in the plane, all of unit weight, along with the optimal single split of this set (bold black line) with respect to the Gini impurity g(p) = 2p(1 − p). In this plot we can see that the Gini impurity chooses a split that gives a left child that is quite pure (i.e., has a low positive prevalence) and a right child that is also reasonably pure (i.e., has a high positive prevalence).

The second plot shows the same set of points, but now shows the optimal split with respect to the asymmetric impurity function f(p) = p − p³. In this plot we can see that this particular asymmetric impurity chooses a split with a right child that is much more pure than the right child produced by the Gini impurity, but with the tradeoff of lower purity in the left child. Note also that the region corresponding to the right child is smaller. In this example, the asymmetric impurity function f preferred purity for Class 1 points more strongly than the Gini impurity did, whereas the Gini impurity preferred purity for Class 0 points more strongly than f did.

Figure 1. Top: A set of Class 0 points (blue ‘x’s) and Class 1 points (red circles) along with the optimal split with respect to the Gini impurity (bold black line). Second from top: Same set of points, but with optimal split with respect to the impurity function f(p) = p − p³. Third from top: Same points, but with the Class 1 points’ weights halved. Weighted points are then split using Gini impurity. Bottom: Same points, but with the Class 1 points’ weights scaled by a factor of 5. Weighted points are then split using Gini impurity.

The third plot shows the same set of points, but now weighted so that all Class 1 points each have weight equal to 1/2 (which corresponds to undersampling Class 1 points). The split shown in this figure is the optimal split of this weighted set with respect to the Gini impurity. The Gini impurity on this weighted set shows similar behavior to the asymmetric impurity function, preferring purity for Class 1 points. Note that decreasing the weight of the Class 1 points increased the purity of the right child. This makes intuitive sense for the following reason: one can afford to “pollute” the left child with Class 1 points without ruining the purity very much since the Class 1 points are light; on the other hand, polluting the right child with even a few Class 0 points can quickly ruin the purity since the Class 0 points are now relatively heavy.

The bottom plot shows the same set of points, but now weighted so that all Class 1 points each have weight equal to 5. The split shown in this figure is the optimal split of this weighted set with respect to the Gini impurity. The optimal split of this weighted set has the purest left child of all, with the least pure right child. Note also that the region corresponding to this right child is larger than in the other plots.

Now consider the following decision tree of depth 1 generated by a single split of a given dataset. Suppose our dataset has total weight equal to W and has positive prevalence c. Suppose our single split yields children with positive prevalences a < b. Then the only nontrivial classifier we can make from this tree is to classify the points in the left child as negatives and points in the right child as positives.
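This classifier's counts can be tabulated with a small numerical sketch (illustrative only; the specific values of W, c, a, b below are arbitrary choices) using exact rational arithmetic; it confirms the identities PPV = b and NPV = 1 − a derived next.

```python
from fractions import Fraction as F

def confusion_matrix(W, c, a, b):
    """Depth-1 tree on a node of weight W, prevalence c, split into children
    with prevalences a < c < b: left child predicted negative, right positive."""
    wl = W * (b - c) / (b - a)   # weight of the left child
    wr = W * (c - a) / (b - a)   # weight of the right child
    tp = wr * b                  # Class 1 weight in the predicted-positive child
    fp = wr * (1 - b)
    fn = wl * a                  # Class 1 weight in the predicted-negative child
    tn = wl * (1 - a)
    return tp, fn, fp, tn

W, c, a, b = F(1), F(2, 5), F(1, 10), F(3, 5)
tp, fn, fp, tn = confusion_matrix(W, c, a, b)

assert tp + fn + fp + tn == W      # all the weight is accounted for
assert tp + fn == W * c            # actual-positive weight is preserved
assert tp / (tp + fp) == b         # PPV (precision) = b
assert tn / (tn + fn) == 1 - a     # NPV = 1 - a
```

Exact fractions are used so the identities hold with equality rather than up to floating-point error.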
Since the weights of the children are uniquely determined by their positive prevalences (see Proposition 3), we therefore have the following confusion matrix for this classifier:

                     Predicted Positive            Predicted Negative
Actual Positive      (W/(b − a)) (c − a) b         (W/(b − a)) (b − c) a
Actual Negative      (W/(b − a)) (c − a)(1 − b)    (W/(b − a)) (b − c)(1 − a)

Now we have the usual pairs of metrics to describe performance: true positive rate and false positive rate, and precision and recall. Another pair of metrics that describes classifier performance is positive predictive value (PPV) and negative predictive value (NPV). The PPV is just a synonym for precision. The NPV is the analogue of precision for negative points; i.e., the NPV is the number of true negatives divided by the total number of predicted negatives. Now the true positive rate (recall) and false positive rate for the classifier above do not have a particularly nice form, but the PPV (precision) and NPV do:

PPV = b,   NPV = 1 − a.

In other words, a good split – which tries to maximize b and minimize a – tries to locally maximize PPV and NPV. In our example above, the asymmetric impurity function gave us a split with higher PPV than the split that the Gini impurity gave (on the unweighted set), with the tradeoff of lower NPV. Weighting the Class 1 points instead by a factor 1/2 gave similar behavior. Equivalently, the Gini impurity on the unweighted set gave a split with higher NPV with the tradeoff of lower PPV. Weighting the Class 1 points by a factor of 5 gave an even higher NPV.

PPV and NPV are “opposing” metrics in the sense that, loosely speaking, forcing an improvement in one metric typically leads to a worsening of the other metric, and vice versa. The same is true of precision and recall. We will see in the next sections under what conditions an impurity function “tries harder” to maximize PPV (precision) at the potential expense of NPV and recall, and vice versa.

4. Comparison of Splitting Behavior for Different Impurity Functions
In much of the literature (e.g., the standard reference text [1] by Breiman et al.) an impurity function is defined to be a function f : [0, 1] → R that satisfies three axioms:

(1) f(p) is maximized only at p = 1/2;
(2) f(p) is minimized only at the endpoints p = 0, 1;
(3) f is symmetric, i.e., f(p) = f(1 − p).

It is also not uncommon to require (or implicitly assume) that f satisfies other properties such as concavity (often strict concavity), differentiability, and the condition that f(0) = f(1) = 0. These variations in convention are often minor, and most of the commonly used impurity functions in practice such as the entropy f(p) = −p log p − (1 − p) log(1 − p) and the Gini impurity f(p) = 2p(1 − p) satisfy all these properties anyway.

However, in this paper we relax most of the above properties. Let us now state the definition of impurity function that we will use throughout this paper.

Definition 1. A preimpurity function is a function f : [0, 1] → R that satisfies the following two properties:

(1) f is continuous on [0, 1] and C² on (0, 1);
(2) f'' < 0 on (0, 1).

If, in addition, f(0) = f(1) = 0, then we call f an impurity function.

Remark 2.
A couple of remarks are worth making here. Firstly, the smoothness condition above, while stronger than what is typically imposed, will prove to be a useful and convenient condition that facilitates the statements and proofs of the results throughout this section and the next. We suspect that such smoothness is not actually necessary for our results to hold anyway (see Remark 18). Concavity, on the other hand, is not only necessary to prove our results, but is also necessary in general to ensure that an impurity function behaves well when splitting a node; we elaborate on this assertion in Section 6. Again, most commonly used impurity functions, e.g. entropy and Gini impurity, satisfy these conditions as well. (These conditions do exclude, for example, the misclassification rate f(p) = min(p, 1 − p), but that will not concern us.)

Secondly, despite the fact that we do not really care about the value of our impurity functions at the endpoints, we will see (Corollary 20) that there is no loss of generality in fixing those values. We do want the flexibility of allowing for arbitrary values at the endpoints, however, and will therefore be using preimpurity functions when discussing optimal splits.

Recall the following basic facts about impurity of a node: The total impurity (with respect to a preimpurity function f) of a node n with positive prevalence c and total weight W is W · f(c). If n is split into two children with positive prevalences a and b with a ≤ b, then the combined total impurity of the children (which we will also refer to as the impurity of the split) is W_l · f(a) + W_r · f(b), where W_l, W_r are the total weights of the points in the left child and right child, respectively. Now W_l + W_r = W. If a = c = b, then the children's combined total impurity simplifies to W · f(c) again. Otherwise, we have a < c < b. Now the total weight of the Class 1 points in n is Wc. Then since we also have
Wc = W_l a + W_r b (since the total weight of Class 1 points in n is preserved), we can solve for W_l, W_r:

W_l = W · (b − c)/(b − a),   W_r = W · (c − a)/(b − a),

so that the total impurity of this split is

(2)  W · ( ((b − c)/(b − a)) f(a) + ((c − a)/(b − a)) f(b) ).

The optimal split with respect to f is then the split whose left and right positive prevalences minimize (2). We summarize the above observations as a proposition:

Proposition 3.
Let n be a node with positive prevalence c and total weight W. If n is split such that the left and right positive prevalences are equal to a and b, respectively, then the weights W_l, W_r of the left and right child are given by

W_l = W · (b − c)/(b − a),   W_r = W · (c − a)/(b − a),

and the total impurity of the split with respect to the preimpurity function f is equal to

W · ( ((b − c)/(b − a)) f(a) + ((c − a)/(b − a)) f(b) ).

We are now ready to start defining comparisons of preimpurity functions.
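As a numerical companion to Proposition 3 (a sketch, not from the paper; the candidate splits and the cubic f below are illustrative choices), the following evaluates the bracketed expression in (2) and shows two preimpurity functions ranking the same two candidate splits differently, which is exactly the kind of disagreement the definitions that follow are designed to compare.

```python
def split_impurity(f, c, a, b):
    # Bracketed expression in Eq. (2): impurity (per unit weight) of splitting
    # a node of positive prevalence c into children with prevalences a <= c <= b.
    if (a, b) == (c, c):
        return f(c)                # the convention adopted in Section 2
    return (b - c) / (b - a) * f(a) + (c - a) / (b - a) * f(b)

gini = lambda p: 2 * p * (1 - p)   # symmetric Gini impurity
f = lambda p: p - p**3             # an asymmetric preimpurity function

c = 0.4                            # parent node's positive prevalence
S = [(0.10, 0.60), (0.25, 0.75)]   # candidate (left, right) prevalence pairs

best = lambda h: min(S, key=lambda ab: split_impurity(h, c, *ab))
print(best(gini))  # (0.1, 0.6): Gini prefers the purer left child
print(best(f))     # (0.25, 0.75): f prefers the purer right child
```

The asymmetric f trades left-child purity for a higher right-child prevalence, while the Gini impurity does the opposite.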
Definition 4.
Let f, g be preimpurity functions. We say f is equivalent to g if for every node n, and every set of possible splits of n, the optimal split (or splits) with respect to f is the same as the optimal split with respect to g. In other words (see Remarks 5 and 6 below), f is equivalent to g if for all c ∈ (0, 1) and all finite subsets S ⊆ ([0, c) × (c, 1]) ∪ {(c, c)} we have

(3)  argmin_{(a,b) ∈ S} ( ((b − c)/(b − a)) f(a) + ((c − a)/(b − a)) f(b) ) = argmin_{(a,b) ∈ S} ( ((b − c)/(b − a)) g(a) + ((c − a)/(b − a)) g(b) ).

Remark 5.
In Definition 4 above, it suffices to consider only sets S with two elements, since the argmin of a function on a finite set can be determined by pairwise comparing the values of the function over all possible pairs of inputs. It is also clear, though perhaps worth re-emphasizing, that Definition 4 does not use the minimum values of the expressions in (3); only the minimizers matter, since those are what determine the splitting decision for a node. Hence we omit the total weight W of n in (3).

Remark 6.
Observe that every pair of possible splits of a node with positive prevalence c yields two (possibly nondistinct) elements (a_1, b_1), (a_2, b_2) ∈ ([0, c) × (c, 1]) ∪ {(c, c)}. Conversely, every pair of (possibly nondistinct) elements (a_1, b_1), (a_2, b_2) ∈ ([0, c) × (c, 1]) ∪ {(c, c)} is realizable as left and right positive prevalences of two splits of some dataset with positive prevalence c (see Proposition 7 below). Hence Equation (3) above does indeed characterize splitting equivalence of preimpurity functions.

Proposition 7.
Let c ∈ (0, 1), and let (a_1, b_1), (a_2, b_2) ∈ ([0, c) × (c, 1]) ∪ {(c, c)}. Then there exists a dataset D with positive prevalence c such that: there exist two splits of D, one of which has left and right positive prevalences a_1 and b_1, and the other of which has left and right positive prevalences a_2 and b_2.

Proof. Take R² as the feature space. If a_1 < c < b_1 and a_2 < c < b_2, let

R_1 = b_1 a_2 (c − a_1)(b_2 − c)(1 − c),   B_1 = (1 − b_1)(1 − a_2)(c − a_1)(b_2 − c)c,
R_2 = a_1 a_2 (b_1 − c)(b_2 − c)(1 − c),   B_2 = (1 − a_1)(1 − a_2)(b_1 − c)(b_2 − c)c,
R_3 = a_1 b_2 (b_1 − c)(c − a_2)(1 − c),   B_3 = (1 − a_1)(1 − b_2)(b_1 − c)(c − a_2)c,
R_4 = b_1 b_2 (c − a_1)(c − a_2)(1 − c),   B_4 = (1 − b_1)(1 − b_2)(c − a_1)(c − a_2)c;

if a_1 < c < b_1 and a_2 = c = b_2, let

R_1 = R_4 = b_1(c − a_1),   B_1 = B_4 = (1 − b_1)(c − a_1),
R_2 = R_3 = a_1(b_1 − c),   B_2 = B_3 = (1 − a_1)(b_1 − c)

(and symmetrically if a_1 = c = b_1 and a_2 < c < b_2); and if a_1 = a_2 = c = b_1 = b_2, let

R_1 = R_2 = R_3 = R_4 = c,   B_1 = B_2 = B_3 = B_4 = 1 − c.

For i = 1, 2, 3, 4, place a point of Class 1 with weight R_i and a point of Class 0 with weight B_i in the i-th quadrant. Take D to be the set of these points. A direct computation then shows that D has positive prevalence c, that the left and right half-planes have positive prevalences a_1 and b_1, respectively, and that the upper and lower half-planes have positive prevalences a_2 and b_2, respectively. □

Lemma 8.
For every preimpurity function f and every A, B, C ∈ R with A > 0, we have that f is equivalent to the preimpurity function f̃(p) = Af(p) + Bp + C.

Proof. A direct computation shows that for every fixed c ∈ (0, 1) and every finite subset S ⊆ ([0, c) × (c, 1]) ∪ {(c, c)} we have

argmin_{(a,b) ∈ S} ( ((b − c)/(b − a)) f̃(a) + ((c − a)/(b − a)) f̃(b) )
  = argmin_{(a,b) ∈ S} ( A( ((b − c)/(b − a)) f(a) + ((c − a)/(b − a)) f(b) ) + Bc + C )
  = argmin_{(a,b) ∈ S} ( ((b − c)/(b − a)) f(a) + ((c − a)/(b − a)) f(b) ). □

Definition 9.
Let f, g be preimpurity functions. We say f splits more positively purely (or more purely with respect to Class 1) than g if for every node n, and every set of possible splits of n, there exists an optimal split with respect to f that produces a right child whose positive prevalence is greater than or equal to the positive prevalence of every node produced by every optimal split of n with respect to g. In other words, f splits more positively purely than g if for all c ∈ (0, 1) and all finite subsets S ⊆ [0, c) × (c, 1] we have

(4)  max{ argmin_{b : (a,b) ∈ S} ( ((b − c)/(b − a)) f(a) + ((c − a)/(b − a)) f(b) ) } ≥ max{ argmin_{b : (a,b) ∈ S} ( ((b − c)/(b − a)) g(a) + ((c − a)/(b − a)) g(b) ) }.

Similarly, we say g splits more negatively purely (or more purely with respect to Class 0) than f if for every node n, and every set of possible splits of n, there exists an optimal split with respect to g that produces a left child whose positive prevalence is less than or equal to the positive prevalence of every node produced by every optimal split of n with respect to f; i.e., g splits more negatively purely than f if for all c ∈ (0, 1) and all finite subsets S ⊆ [0, c) × (c, 1] we have

(5)  min{ argmin_{a : (a,b) ∈ S} ( ((b − c)/(b − a)) g(a) + ((c − a)/(b − a)) g(b) ) } ≤ min{ argmin_{a : (a,b) ∈ S} ( ((b − c)/(b − a)) f(a) + ((c − a)/(b − a)) f(b) ) }.

Remark 10.
In (4), it again suffices to consider only sets S with two elements, since any finite S can be reduced to the subset that contains the two elements that attain the left- and right-hand sides of (4). Furthermore, concavity of f and g implies that the pair (c, c) is a maximizer of the expressions in (4). Since for every other pair (a, b) we have b > c, the only way either side of the inequality (4) can equal c is if S = {(c, c)}, in which case (4) becomes trivial. (A similar argument holds for (5).) Hence, for convenience, we may exclude the pair (c, c) from S in Definition 9.

In light of our discussion in Section 3, Definition 9 intuitively says that if f splits more positively purely than g then for any given node the optimal split with respect to f has a higher PPV than the optimal split with respect to g. This definition also assumes the convention that in case of ties, each of f and g chooses its optimal split with the highest right-child positive prevalence, hence the usage of max in (4). Analogous remarks hold when g splits more negatively purely than f.

At this point, let us give a few examples to illustrate Definition 9. Let f(p) = p − p³, g(p) = 2p(1 − p), as we did with our example in Section 3. Then f splits more positively purely than g, and g splits more negatively purely than f (a fact that will become clear when we reach Theorem 12). Suppose we have a node of total weight equal to 1 and positive prevalence equal to 40%, and suppose we have a choice of two possible splits: Split 1, which splits the node into a left child with weight 0.4 and positive prevalence 10%, and a right child with weight 0.6 and positive prevalence 60%; and Split 2, which splits the node into a left child with weight 0.7 and positive prevalence 25%, and a right child with weight 0.3 and positive prevalence 75%. We evaluate the impurities of Splits 1 and 2 with respect to f:

Split 1: 0.4 · f(0.10) + 0.6 · f(0.60) = 0.27,   Split 2: 0.7 · f(0.25) + 0.3 · f(0.75) = 0.2625,

so Split 2 is the optimal split with respect to f. Now we evaluate the impurities of Splits 1 and 2 with respect to g:

Split 1: 0.4 · g(0.10) + 0.6 · g(0.60) = 0.36,   Split 2: 0.7 · g(0.25) + 0.3 · g(0.75) = 0.375,

so Split 1 is optimal with respect to g. In this example we see f preferred the split that had the highly pure right child while g preferred the split with the highly pure left child.

A second example, one that illustrates Definition 9 graphically, is given in Figure 2. Now for every impurity function f, every node n of positive prevalence c (and unit total weight), and every split of n with left and right positive prevalences equal to a and b, the impurity of that split is equal to the y-value of the line segment between the points (a, f(a)) and (b, f(b)) at the point where p = c. In this example, let f(p) = p − p³, g(p) = 2p(1 − p) as before. Suppose we have a node of total weight equal to 1 and a positive prevalence of 45%. Suppose we have a choice of two splits: one split with left and right positive prevalences of 0% and 70%, and the other split with left and right positive prevalences of 25% and 95%. The top plot shows the graph of f along with the line segments corresponding to our two splits. We can graphically see that the line segment for Split 2 lies below the line segment for Split 1 when p = 0.45. So Split 2 has lower impurity, and is therefore optimal with respect to f. The bottom plot shows the graph of g along with the line segments corresponding to the same two splits. In this plot, we can see that the line segment for Split 1 lies below the line segment for Split 2 when p = 0.45. So Split 1 has lower impurity, and is therefore optimal with respect to g. As with our previous example, we see f preferred the split that had the highly pure right child while g preferred the split with the highly pure left child.

Figure 2. Top: Graph of impurity function f(p) = p − p³, along with two possible splits. Bottom: Graph of impurity function g(p) = 2p(1 − p), along with the same two splits.

A third example, one that illustrates Definition 9 on a dataset of points, is shown in the top two plots in Figure 1 in Section 3. Here, the set of possible splits is all splits whose boundary is a vertical line. Of course, for yet other examples, f and g might possibly choose the same split.

Lemma 11.
Let f, g be preimpurity functions. Let 0 ≤ a_1 < a_2 < b_1 < b_2 ≤ 1 and suppose f(a_1) = g(a_1) = f(b_1) = g(b_1) = 0. Suppose also that f''/g'' is increasing on (a_1, b_2). Then

f(a_2)/g(a_2) ≤ f(b_2)/g(b_2).

Furthermore, if f''/g'' is strictly increasing then the above conclusion is a strict inequality.

Proof. Observe that the hypotheses and conclusion are invariant under scaling of f and g by positive constants, and observe that strict concavity implies that f'(b_1) and g'(b_1) are both negative. So we may also suppose without loss of generality that f'(b_1) = g'(b_1). Let h = f''/g'' so that f'' = hg'', and let k = g − f. Then k(a_1) = k(b_1) = k'(b_1) = 0, and k'' = (1 − h)g''.

Claim: k ≥ 0 on [a_1, b_2].

Proof of claim: By Rolle's Theorem applied to k, there exists a c ∈ (a_1, b_1) such that k'(c) = 0. Now by Rolle's Theorem applied to k', there exists a d ∈ (c, b_1) such that k''(d) = 0. Since h is increasing and g'' < 0, we therefore have that k'' ≤ 0 on (a_1, d) and k'' ≥ 0 on (d, b_2). So k' is decreasing on (a_1, d) and increasing on (d, b_2). Since k'(c) = 0, we have that k' ≥ 0 on (a_1, c) and k' ≤ 0 on (c, d); and since k'(b_1) = 0, we have that k' ≤ 0 on (d, b_1) and k' ≥ 0 on (b_1, b_2). This implies that k is increasing on [a_1, c], decreasing on [c, b_1], and increasing on [b_1, b_2]. Finally, since k(a_1) = k(b_1) = 0, we therefore conclude that k ≥ 0 on [a_1, b_2], proving the claim.

Now the above claim shows that both k(a_2), k(b_2) ≥ 0, so that g(a_2) ≥ f(a_2) and g(b_2) ≥ f(b_2). Strict concavity of g implies that g(a_2) > 0 and g(b_2) < 0, so that

f(a_2)/g(a_2) ≤ 1 ≤ f(b_2)/g(b_2),

and the desired result follows. A straightforward modification of the above proof gives that our desired inequality is strict if f''/g'' is strictly increasing; details are omitted. □

We now present the main theorem of this section.
Theorem 12.
Let f, g be preimpurity functions. Then f splits more positively purely than g if and only if f''/g'' is increasing on (0, 1).

Proof. (⇐) Suppose f''/g'' is increasing. Fix c ∈ (0, 1) and a finite subset S ⊆ [0, c) × (c, 1] with |S| = 2. Let (a_1, b_1), (a_2, b_2) be the two elements of S, and suppose without loss of generality that b_1 < b_2 (if b_1 = b_2 then we immediately have equality in Definition 9 and we are done). So a_1, a_2 < c < b_1 < b_2. We therefore want to show that if (a_2, b_2) is the better of the two splits with respect to g, then (a_2, b_2) is also the better of the two splits with respect to f. More precisely, we want to show that if

(6)  ((b_2 − c)/(b_2 − a_2)) g(a_2) + ((c − a_2)/(b_2 − a_2)) g(b_2) ≤ ((b_1 − c)/(b_1 − a_1)) g(a_1) + ((c − a_1)/(b_1 − a_1)) g(b_1)

then

(7)  ((b_2 − c)/(b_2 − a_2)) f(a_2) + ((c − a_2)/(b_2 − a_2)) f(b_2) ≤ ((b_1 − c)/(b_1 − a_1)) f(a_1) + ((c − a_1)/(b_1 − a_1)) f(b_1).

By Lemma 8, we may suppose without loss of generality that f(a_1) = f(b_1) = g(a_1) = g(b_1) = 0. The above implication then reduces to

(8)  ((b_2 − c)/(b_2 − a_2)) g(a_2) + ((c − a_2)/(b_2 − a_2)) g(b_2) ≤ 0 ⇒ ((b_2 − c)/(b_2 − a_2)) f(a_2) + ((c − a_2)/(b_2 − a_2)) f(b_2) ≤ 0.

Strict concavity of f and g together with the fact that b_1 < b_2 implies f(b_2) < 0 and g(b_2) < 0. If a_2 ≤ a_1, then f(a_2) ≤ 0 and the conclusion of (8) holds automatically; so suppose a_2 > a_1, so that f(a_2) > 0 and g(a_2) > 0. Rearranging the inequalities in (8), we get that our desired condition is equivalent to

(9)  (b_2 g(a_2) − a_2 g(b_2))/(g(a_2) − g(b_2)) ≤ c ⇒ (b_2 f(a_2) − a_2 f(b_2))/(f(a_2) − f(b_2)) ≤ c.

It is therefore sufficient to show

(10)  (b_2 f(a_2) − a_2 f(b_2))/(f(a_2) − f(b_2)) ≤ (b_2 g(a_2) − a_2 g(b_2))/(g(a_2) − g(b_2)).

Clearing denominators and simplifying shows that (10) is equivalent to

f(b_2) g(a_2) ≤ g(b_2) f(a_2),   i.e.,   f(a_2)/g(a_2) ≤ f(b_2)/g(b_2).

But this follows from Lemma 11, and the desired conclusion follows.

(⇒) Suppose that f''/g'' is not increasing. Since f, g are C² and have nonvanishing second derivatives, f''/g'' is continuous. Hence there exists some interval (a, b) ⊆ [0, 1] such that f''/g'' is strictly decreasing on (a, b), i.e., g''/f'' is strictly increasing on (a, b). Choose a_1, a_2, b_1, b_2 such that a ≤ a_1 < a_2 < b_1 < b_2 ≤ b. By Lemma 8, we may assume without loss of generality that f(a_1) = f(b_1) = g(a_1) = g(b_1) = 0, so that f(a_2), g(a_2) > 0 and f(b_2), g(b_2) < 0. Then by Lemma 11 (reversing the roles of f and g) we have

g(a_2)/f(a_2) < g(b_2)/f(b_2).

A bit of algebra shows that the above inequality is equivalent to

(11)  (b_2 g(a_2) − a_2 g(b_2))/(g(a_2) − g(b_2)) < (b_2 f(a_2) − a_2 f(b_2))/(f(a_2) − f(b_2)).

Choose a c such that

(12)  (b_2 g(a_2) − a_2 g(b_2))/(g(a_2) − g(b_2)) < c < (b_2 f(a_2) − a_2 f(b_2))/(f(a_2) − f(b_2)).

Now

a_2 ≤ a_2 + (b_2 − a_2) · g(a_2)/(g(a_2) − g(b_2)) = (b_2 g(a_2) − a_2 g(b_2))/(g(a_2) − g(b_2)) < c

by (12). Also, writing

b_1 = ((b_2 − b_1)/(b_2 − a_2)) · a_2 + ((b_1 − a_2)/(b_2 − a_2)) · b_2

and using concavity of f, we get

0 = f(b_1) ≥ ((b_2 − b_1)/(b_2 − a_2)) f(a_2) + ((b_1 − a_2)/(b_2 − a_2)) f(b_2),

which simplifies to

(b_2 f(a_2) − a_2 f(b_2))/(f(a_2) − f(b_2)) ≤ b_1,

so that c < b_1 by (12). We therefore have a_1 < a_2 < c < b_1 < b_2 with

(b_2 g(a_2) − a_2 g(b_2))/(g(a_2) − g(b_2)) < c   and   c < (b_2 f(a_2) − a_2 f(b_2))/(f(a_2) − f(b_2)),

which rearranges to

(13)  ((b_2 − c)/(b_2 − a_2)) g(a_2) + ((c − a_2)/(b_2 − a_2)) g(b_2) < 0   and   ((b_2 − c)/(b_2 − a_2)) f(a_2) + ((c − a_2)/(b_2 − a_2)) f(b_2) > 0.

Recalling that f(a_1) = f(b_1) = g(a_1) = g(b_1) = 0, we have that (13) becomes

((b_2 − c)/(b_2 − a_2)) g(a_2) + ((c − a_2)/(b_2 − a_2)) g(b_2) < ((b_1 − c)/(b_1 − a_1)) g(a_1) + ((c − a_1)/(b_1 − a_1)) g(b_1)

and

((b_2 − c)/(b_2 − a_2)) f(a_2) + ((c − a_2)/(b_2 − a_2)) f(b_2) > ((b_1 − c)/(b_1 − a_1)) f(a_1) + ((c − a_1)/(b_1 − a_1)) f(b_1).

Taking S = {(a_1, b_1), (a_2, b_2)}, we therefore have

argmin_{b : (a,b) ∈ S} ( ((b − c)/(b − a)) f(a) + ((c − a)/(b − a)) f(b) ) = b_1 < b_2 = argmin_{b : (a,b) ∈ S} ( ((b − c)/(b − a)) g(a) + ((c − a)/(b − a)) g(b) ),

so that f does not split more positively purely than g. □

Theorem 12 has a corresponding analogue, stated below, for one preimpurity function splitting more negatively purely than another; the proof is very similar and hence omitted.
Theorem 13.
Let f, g be preimpurity functions. Then g splits more negatively purely than f if and only if g''/f'' is decreasing on (0, 1).

Theorems 12 and 13 immediately establish the relationship between splitting more positively purely and splitting more negatively purely:

Corollary 14.
Let f, g be preimpurity functions. Then f splits more positively purely than g if and only if g splits more negatively purely than f.

Remark 15.
In Definition 9, in (4) we broke ties by using max (i.e., by choosing the optimal split with highest right-child positive prevalence). In fact, we could just as well have broken ties by using min, and Theorem 12 would still hold; the only modification necessary to the proof would be to replace all inequalities in (6), (7), (8), and (9) with strict inequalities. A similar remark of course holds for (5).

Corollary 14 implies a special case of the following general fact, alluded to in Section 3 when discussing PPV versus NPV: an impurity function cannot produce an optimal split with both a higher right-child positive prevalence and a lower left-child positive prevalence than an optimal split produced by another impurity function (assuming, of course, that both impurity functions are optimizing over the same set of splits). In other words, to improve purity in one class, one must sacrifice purity in the other class. Proposition 16 makes this precise.
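This trade-off can likewise be observed numerically. In the sketch below (our own illustration; the helper names and the two particular preimpurity functions are arbitrary choices), whenever the optimal split for f over a two-element split set has a strictly larger right-child prevalence than the optimal split for g, it also has a strictly larger left-child prevalence:

```python
import random

def imp(f, a, b, c):
    """Impurity of the split (a, b) of a unit-weight node with prevalence c."""
    return (b - c) / (b - a) * f(a) + (c - a) / (b - a) * f(b)

def optimal(f, splits, c):
    """Optimal split, ties broken toward the larger right-child prevalence."""
    return min(splits, key=lambda ab: (imp(f, *ab, c), -ab[1]))

f = lambda p: p - p ** 3   # biased toward positively pure splits
g = lambda p: p * (1 - p)  # Gini impurity

random.seed(1)
for _ in range(1000):
    c = random.uniform(0.1, 0.9)
    S = [(random.uniform(0, c), random.uniform(c, 1)) for _ in range(2)]
    (a1, b1) = optimal(g, S, c)  # optimal for g
    (a2, b2) = optimal(f, S, c)  # optimal for f
    if b2 > b1:          # a purer right child for f ...
        assert a2 > a1   # ... forces a less pure left child
```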
Proposition 16.
Let f, g be preimpurity functions, and suppose that {(a₁, b₁), (a₂, b₂)} is the set of possible splits of some node with positive prevalence c. Suppose further that (a₁, b₁) is optimal for g, and (a₂, b₂) is optimal for f. If b₂ > b₁, then a₂ > a₁.

Proof. Suppose for contradiction that a₂ ≤ a₁. By Lemma 8, we may suppose without loss of generality that g(a₂) = g(b₂) = 0. Then since a₂ ≤ a₁ ≤ c ≤ b₂, we have g(a₁) ≥ 0. Since b₂ > b₁ ≥ c, we must also have a₁ < c, so that a₂ ≤ a₁ < c ≤ b₁ < b₂, giving g(b₁), g(c) > 0. Then

((b₁ − c)/(b₁ − a₁)) g(a₁) + ((c − a₁)/(b₁ − a₁)) g(b₁) > 0 = ((b₂ − c)/(b₂ − a₂)) g(a₂) + ((c − a₂)/(b₂ − a₂)) g(b₂),

so that (a₁, b₁) is not optimal with respect to g, a contradiction. □

Remark 17.
For any split of a node with unit weight we can use the Fundamental Theorem of Calculus and integration by parts to write the total reduction in impurity with respect to f as

(14) f(c) − ((b − c)/(b − a) f(a) + (c − a)/(b − a) f(b)) = (b − c)/(b − a) ∫_a^c (−f''(t))(t − a) dt + (c − a)/(b − a) ∫_c^b (−f''(t))(b − t) dt.

From this equation we make a few observations. Firstly, the reduction in impurity depends only on f'' and not on the initial values of f or f'; this is essentially a restatement of Lemma 8. Secondly, the right-hand side of (14) roughly tells us that if the mass of −f'' concentrates more toward the right side of the unit interval than does the mass of some other function −g'', then an increase in b gives a proportionally larger reduction in impurity with respect to f than with respect to g. This is a loose restatement of the backward implication in Theorem 12. In general, one achieves a greater reduction in impurity with respect to f by capturing a larger proportion of the mass under −f'' between a and b, or by making a and b farther away from c.

Remark 18.
We suspect Theorem 12 holds in more generality. In particular, suppose f and g are only assumed to be continuous and concave, but not necessarily differentiable or strictly concave. Then f'' and g'' exist in the distributional sense as non-positive measures [13]. We then conjecture that f splits more positively purely than g if and only if f'' is absolutely continuous with respect to g'' and the Radon–Nikodym derivative of f'' with respect to g'' is increasing. Because the proof of this claim (if true) would likely be more involved than the proofs of Lemma 11 and Theorem 12 without offering much additional insight into the nature of Definition 9, we do not pursue it.

Theorem 12 immediately gives us a few corollaries regarding equivalence of preimpurity and impurity functions.

Corollary 19.
Let f, g be preimpurity functions. Then the following are equivalent:

(1) f is equivalent to g.
(2) f'' = Ag'' for some constant A > 0.
(3) There exist constants A, B, C ∈ ℝ with A > 0 such that f(x) = Ag(x) + Bx + C.

Proof. (1) ⇒ (2): Suppose f and g are equivalent. Then f splits more positively purely than g, and vice versa. So both f''/g'' and g''/f'' are increasing by Theorem 12. So f''/g'' is constant and, by strict concavity of f and g, positive. So f'' = Ag'' for some positive A.

(2) ⇒ (3): This follows from the Fundamental Theorem of Calculus.

(3) ⇒ (1): This is Lemma 8. □

Corollary 20. Let f be a preimpurity function. Then there exists a unique (up to positive constant scaling) impurity function f̃ such that f is equivalent to f̃.

Proof. Let f̃(x) = f(x) + (f(0) − f(1))x − f(0). Then f̃ is an impurity function, and is equivalent to f by Lemma 8.

To establish uniqueness, suppose f̃₁ and f̃₂ are impurity functions equivalent to f. Then they are equivalent to each other. So by Corollary 19, f̃₁(x) = Af̃₂(x) + Bx + C for some A, B, C ∈ ℝ, A > 0. The boundary conditions f̃₁(0) = f̃₁(1) = f̃₂(0) = f̃₂(1) = 0 imply B = C = 0, so f̃₁ = Af̃₂. □

Corollary 21.
Let f, g be impurity functions. Then f is equivalent to g if and only if f = Ag for some constant A > 0.

Proof. This follows from Corollary 20. □
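The implication (3) ⇒ (1) of Corollary 19 is easy to see in code: adding Bx + C changes every split's impurity by the split-independent constant Bc + C, and scaling by A > 0 preserves the argmin. The following sketch (our own, with an arbitrarily chosen affine change; not from the paper) checks this identity numerically for the Gini impurity:

```python
import random

def imp(f, a, b, c):
    """Impurity of the split (a, b) of a unit-weight node with prevalence c."""
    return (b - c) / (b - a) * f(a) + (c - a) / (b - a) * f(b)

f = lambda p: p * (1 - p)              # Gini impurity
A, B, C = 2.5, -0.7, 0.3               # arbitrary affine change with A > 0
fhat = lambda p: A * f(p) + B * p + C  # an equivalent preimpurity function

random.seed(2)
for _ in range(1000):
    c = random.uniform(0.1, 0.9)
    a, b = random.uniform(0, c), random.uniform(c, 1)
    # The affine part contributes only the split-independent constant B*c + C,
    # so comparisons between splits (and hence optimal splits) are unchanged.
    assert abs(imp(fhat, a, b, c) - (A * imp(f, a, b, c) + B * c + C)) < 1e-9
```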
Recall the family h_m of impurity functions in (1) given in the introduction. In light of Theorem 12, a direct computation shows that h_{m₁} splits more positively purely than h_{m₂} if and only if m₁ ≥ m₂. (We will revisit this family in more detail in the next section.) For this particular family, moving the “hump” (i.e., the maximizer) of the function to the right is equivalent to making the function split more positively purely. The next corollary shows that for arbitrary impurity functions, this is partially the case.

Corollary 22.
Let f, g be impurity functions, and suppose f splits more positively purely than g. Then the maximizer of f is greater than or equal to the maximizer of g.

Proof. Let m_f, m_g ∈ (0, 1) be the maximizers of f and g, respectively (these maximizers are unique by strict concavity). Scaling f, g by positive constants, we may assume without loss of generality that f(m_f) = g(m_g) = 1. Let k = g − f, so k(m_f) ≤ 0 and k(m_g) ≥ 0. If k(m_g) = 0 then g(m_g) = f(m_g) = 1, so m_g is also a maximizer of f and hence m_f = m_g, and we are done. So suppose k(m_g) > 0.

Claim: k > 0 on (0, m_g).

Proof of claim: Suppose for contradiction that k(x₀) ≤ 0 for some x₀ ∈ (0, m_g). By Theorem 12, there exists an increasing h such that f'' = hg'', so k'' = (1 − h)g''. Now k(m_f) ≤ 0 < k(m_g), so by the Intermediate Value Theorem there exists some c between m_f and m_g such that k(c) = 0. In particular, c ∈ (0, 1) and k has at least three zeroes (since also k(0) = k(1) = 0). Applying Rolle's Theorem to k and then to k', we get that k' has at least two zeroes, and k'' has at least one zero d. Since h is increasing and g'' < 0, we have h(d) = 1 and therefore

(15) k'' ≤ 0 on (0, d) and k'' ≥ 0 on (d, 1).

By the Intermediate Value Theorem there exists an x₁ ∈ (x₀, m_g) such that k(x₁) = k(m_g)/2. Then 0 < x₀ < x₁ and

k(x₀) ≤ 0 < ((x₁ − x₀)/x₁) k(0) + (x₀/x₁) k(x₁),

so that k cannot be concave on (0, x₁). Hence k'' takes on a positive value at some point in (0, x₁). Therefore by (15) we have x₁ ≥ d and hence k'' ≥ 0 on (x₁, 1). By the Mean Value Theorem there exists an x₂ ∈ (x₁, m_g) such that

k'(x₂) = (k(m_g) − k(x₁))/(m_g − x₁) = k(m_g)/(2(m_g − x₁)) > 0.

Since k'' ≥ 0 on (x₁, 1), we have k' ≥ k'(x₂) > 0 on [x₂, 1), and therefore k is increasing on [m_g, 1], so k(m_g) ≤ k(1) = 0, contradicting k(m_g) > 0 and proving our claim.

Finally, since k > 0 on (0, m_g) and k(m_f) ≤ 0, we must have m_f ≥ m_g, as desired. □

Remark 23.
The converse to Corollary 22 is false, as can be seen by taking, for example, f(p) = p − 2p³ + p⁴ and g(p) = p − p²: both functions are maximized at 1/2, but f''/g'' = 6p(1 − p) is not increasing, so by Theorem 12 f does not split more positively purely than g.

5. Equivalence of Class Weighting to Transformation of the Impurity Function
As mentioned in the introduction, a common way to bias a tree's construction toward performance on a specific class is class weighting. As the previous section shows, another way to do this is to choose an asymmetric impurity function to determine optimal splits. In this section we will see that class weighting gives rise to exactly the same optimal splits as those obtained by transforming the impurity function in a specific way. We will also see exactly how and when class weighting relates to the preceding section.

Definition 24. For w > 0, define φ_w : [0, 1] → [0, 1] by

φ_w(p) := wp/(1 + (w − 1)p).

Suppose we have a node n with positive prevalence c and total weight W. Then n has Class 0 weight equal to W(1 − c) and Class 1 weight equal to Wc. If we transform n into ñ by scaling the weights of all Class 1 points in n by a factor of w, then this transformed node ñ still has Class 0 weight equal to W(1 − c) but now has Class 1 weight equal to Wwc, giving ñ an overall weight of W(1 − c) + Wwc = W(1 + (w − 1)c). The positive prevalence of ñ is therefore equal to Wwc/(W(1 + (w − 1)c)) = φ_w(c). Now if the original unweighted node n has a split into children with positive prevalences a and b, then similar reasoning shows that the children of the transformed node ñ under the same split will have positive prevalences equal to φ_w(a) and φ_w(b). If we use a preimpurity function f to determine node impurity, then this split of ñ has total impurity equal to

W(1 + (w − 1)c) · ((φ_w(b) − φ_w(c))/(φ_w(b) − φ_w(a)) · f(φ_w(a)) + (φ_w(c) − φ_w(a))/(φ_w(b) − φ_w(a)) · f(φ_w(b)))

by Proposition 3. Therefore, given a node n with positive prevalence c, together with a collection S of possible splits and a weighting factor w, the optimal split of the weighted node ñ is given by

argmin_{(a,b) ∈ S} ( W(1 + (w − 1)c) · ((φ_w(b) − φ_w(c))/(φ_w(b) − φ_w(a)) · f(φ_w(a)) + (φ_w(c) − φ_w(a))/(φ_w(b) − φ_w(a)) · f(φ_w(b))) )
= argmin_{(a,b) ∈ S} ((φ_w(b) − φ_w(c))/(φ_w(b) − φ_w(a)) · f(φ_w(a)) + (φ_w(c) − φ_w(a))/(φ_w(b) − φ_w(a)) · f(φ_w(b))).

Definition 25.
Let w > 0. Define the transformation T_w on the set of functions f on [0, 1] by

(T_w f)(p) = (1 + (w − 1)p) · (f ∘ φ_w)(p).

The preceding definitions and discussion put us in a position to quickly prove the first main theorem of this section:
Theorem 26.
Let f be a preimpurity function and w > 0. Let n be a node, and let ñ be the node obtained from n by scaling the weights of the Class 1 points by w. Then the optimal split of ñ with respect to f is the same as the optimal split of n with respect to T_w f. In other words: for every preimpurity function f, every w > 0, every c ∈ (0, 1), and every S ⊆ ([0, c) × (c, 1]) ∪ {(c, c)} we have

argmin_{(a,b) ∈ S} ((φ_w(b) − φ_w(c))/(φ_w(b) − φ_w(a)) · f(φ_w(a)) + (φ_w(c) − φ_w(a))/(φ_w(b) − φ_w(a)) · f(φ_w(b)))
= argmin_{(a,b) ∈ S} ((b − c)/(b − a) T_w f(a) + (c − a)/(b − a) T_w f(b)).

Proof. Fix f, w, c, S as above. Then a direct computation shows that for all (a, b) ∈ S we have

(1 + (w − 1)c) · ((φ_w(b) − φ_w(c))/(φ_w(b) − φ_w(a)) f(φ_w(a)) + (φ_w(c) − φ_w(a))/(φ_w(b) − φ_w(a)) f(φ_w(b)))
= (1 + (w − 1)c) · ((b − c)/(b − a) · ((1 + (w − 1)a)/(1 + (w − 1)c)) · f(φ_w(a)) + (c − a)/(b − a) · ((1 + (w − 1)b)/(1 + (w − 1)c)) · f(φ_w(b)))
= (b − c)/(b − a) · (1 + (w − 1)a) · f(φ_w(a)) + (c − a)/(b − a) · (1 + (w − 1)b) · f(φ_w(b))
= (b − c)/(b − a) T_w f(a) + (c − a)/(b − a) T_w f(b). □

Remark 27.
The proof of Theorem 26 shows that not only are the optimal splits with respect to T_w f the same as the optimal weighted splits with respect to f, but in fact, multiplying all of the above equations by the total weight W of n, we see that for every split the value of the impurity of the split with respect to T_w f is equal to the value of the impurity of the weighted split with respect to f.

We now list some properties of T_w.

Proposition 28.
Let f, g be preimpurity functions, and let w, w₁, w₂ > 0. Then:

(1) T_w f is a preimpurity function.
(2) T_{w₁} T_{w₂} = T_{w₁w₂}.
(3) T₁ = id and (T_w)⁻¹ = T_{1/w}.
(4) f splits more positively purely than g if and only if T_w f splits more positively purely than T_w g.

Proof. (1) Firstly, note that smoothness of f is preserved, since T_w f is a precomposition and product of f with smooth functions. Secondly, a direct computation shows

(16) (T_w f)''(p) = (w²/(1 + (w − 1)p)³) · (f'' ∘ φ_w)(p),

which is negative for p ∈ (0, 1) since f'' < 0, so strict concavity is preserved. So T_w f is a preimpurity function.

(2), (3) These are direct computations and are left as an exercise to the reader.

(4) (⇒) Suppose f splits more positively purely than g. Then f''/g'' is increasing by Theorem 12. Equation (16) above then gives

(T_w f)''(p)/(T_w g)''(p) = ((w²/(1 + (w − 1)p)³) · (f'' ∘ φ_w)(p))/((w²/(1 + (w − 1)p)³) · (g'' ∘ φ_w)(p)) = ((f''/g'') ∘ φ_w)(p),

which is increasing since f''/g'' and φ_w are increasing. So T_w f splits more positively purely than T_w g by Theorem 12.

(⇐) Suppose T_w f splits more positively purely than T_w g. Apply the forward implication of Part (4) to T_w f and T_w g, using T_{1/w} and Part (3). □

Remark 29.
As it turns out, the family h_m of functions in (1) given in the introduction can be expressed in the form T_w f (up to constant scaling) for some f. Specifically,

h_m = (1/(2(1 − m)²)) · T_w g,

where w = (m⁻¹ − 1)² and g is the Gini impurity. In other words, the tree produced by using the impurity function h_m is the same as the tree produced by first weighting the Class 1 points by (m⁻¹ − 1)² and then growing the tree using the Gini impurity.

Not every asymmetric impurity function f is of the form T_w g for some symmetric g. For example, let f(p) = p − p³. If f were of the form T_w g for some symmetric g, then we would have T_{1/w} f = g, so that T_{1/w} f is symmetric, implying (T_{1/w} f)'' is symmetric. But this is never the case for any w > 0, since (T_{1/w} f)''(0) = 0 and (T_{1/w} f)''(1) < 0.

Definition 30.
Let f be a preimpurity function. We say f respects class weighting if for all w₁, w₂ > 0,

w₁ ≤ w₂ ⇒ T_{w₁} f splits more positively purely than T_{w₂} f.

The above condition can be rather messy to check, as it potentially requires verifying that the inequality

((T_{w₁} f)''/(T_{w₂} f)'')'(p) ≥ 0

holds for all p ∈ (0, 1) and all 0 < w₁ ≤ w₂. The following lemma allows us to reduce some of the computational messiness by eliminating one of the wᵢ.

Lemma 31.
Let f be a preimpurity function. Then f respects class weighting if and only if for all w,

(17) w ≥ 1 ⇒ f splits more positively purely than T_w f.

Proof. (⇒) Let w₁ = 1 and w₂ = w in Definition 30.

(⇐) Let 0 < w₁ ≤ w₂. Letting w = w₂/w₁ ≥ 1, we have by (17) that f splits more positively purely than T_{w₂/w₁} f. Applying Proposition 28, Parts (2) and (4), using T_{w₁}, we get that T_{w₁} f splits more positively purely than T_{w₁}(T_{w₂/w₁} f) = T_{w₂} f, as desired. □

In fact, we can fully characterize all preimpurity functions that respect class weighting (though we will need to impose an additional order of smoothness). This is the second main theorem of this section, and it ties together Sections 4 and 5. To facilitate the presentation of the proof, we first list several equations whose proofs are direct computations and therefore omitted.

Lemma 32.
For all w > 0 and p ∈ (0, 1) we have

1 + (w − 1)p = w/(w − (w − 1)φ_w(p)),

∂φ_w(p)/∂w = φ_w(p)(1 − φ_w(p))/w,

φ'_w(p) = (w − (w − 1)φ_w(p))²/w, and

∂φ'_w(p)/∂w = (1 − 2φ_w(p))(w − (w − 1)φ_w(p))²/w².

Theorem 33.
Let f be a preimpurity function, and suppose f is C⁴ on (0, 1). Let H = (log(−f''))' = f'''/f'', and define G on (0, 1) by

G(p) := p(p − 1)H'(p) + (2p − 1)H(p) + 3.

Then f respects class weighting if and only if G ≥ 0.

Proof. First, observe that by Lemma 31 and Theorem 12 we have

f respects class weighting
⟺ for all w ≥ 1, f splits more positively purely than T_w f
⟺ for all w ≥ 1, f''/(T_w f)'' is increasing on (0, 1)
⟺ for all w ≥ 1, log(f''/(T_w f)'') is increasing on (0, 1)
⟺ for all w ≥ 1 and all p ∈ (0, 1), (log(f''/(T_w f)''))'(p) ≥ 0.

Define the function F on [1, ∞) × (0, 1) by

F(w, p) := (log(f''/(T_w f)''))'(p)
= (log((1 + (w − 1)p)³ · f''(p)/(w² · (f'' ∘ φ_w)(p))))'(p)
= f'''(p)/f''(p) + 3(w − 1)/(1 + (w − 1)p) − ((f''' ∘ φ_w)(p) · φ'_w(p))/((f'' ∘ φ_w)(p)),

where we used (16) for the second equality. We therefore want to show that F ≥ 0 if and only if G ≥ 0. Note that F is C¹ by our hypothesis on f. We compute the partial derivative of F with respect to w, applying the quotient rule to the last term and simplifying using Lemma 32:

∂F/∂w (w, p) = 3/(1 + (w − 1)p)² − ∂φ_w(p)/∂w · φ'_w(p) · (f'''/f'')'(φ_w(p)) − ∂φ'_w(p)/∂w · (f'''/f'')(φ_w(p))
= 3(w − (w − 1)φ_w(p))²/w² − (φ_w(p)(1 − φ_w(p))/w) · ((w − (w − 1)φ_w(p))²/w) · H'(φ_w(p)) − ((1 − 2φ_w(p))(w − (w − 1)φ_w(p))²/w²) · H(φ_w(p))
= ((w − (w − 1)φ_w(p))²/w²) · (3 − φ_w(p)(1 − φ_w(p)) H'(φ_w(p)) − (1 − 2φ_w(p)) H(φ_w(p)))
= ((w − (w − 1)φ_w(p))²/w²) · G(φ_w(p)).

In particular, evaluating at w = 1 we get

∂F/∂w (1, p) = G(p).
Note also that F(1, p) = 0 for all p ∈ (0, 1).

(⇒) Suppose F ≥ 0. Then for every fixed p ∈ (0, 1) we have

for all w ≥ 1, F(w, p) ≥ 0
⇒ for all w > 1, (F(w, p) − F(1, p))/(w − 1) ≥ 0
⇒ lim_{w→1⁺} (F(w, p) − F(1, p))/(w − 1) ≥ 0
⇒ G(p) = (∂F/∂w)|_{w=1} ≥ 0.

(⇐) Now suppose G ≥ 0. Then for all p, w we apply the Fundamental Theorem of Calculus and integrate over w to get

F(w, p) = F(1, p) + ∫₁ʷ ∂F/∂w (t, p) dt = 0 + ∫₁ʷ ((t − (t − 1)φ_t(p))²/t²) · G(φ_t(p)) dt ≥ 0. □

Corollary 34.
Let f be either the entropy or the Gini impurity. Then f respects class weighting.

Proof. For the entropy and the Gini impurity, we apply Theorem 33 and compute G ≡ 1 and G ≡ 3, respectively. □

Remark 35.
Both of the cases of the entropy and the Gini impurity respecting class weighting follow just as easily without Theorem 33, using Lemma 31, Theorem 12, and (16). Nevertheless, despite the condition in Theorem 33 being somewhat messy, it is still an improvement over Lemma 31 in the sense that Theorem 33 reduces verification of Definition 30 to verification of nonnegativity of a univariate function on the unit interval.
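The function G of Theorem 33 is also easy to compute symbolically. The following sketch (our own, using sympy; we take the Gini impurity to be p(1 − p) and the entropy with natural logarithm) computes G for the entropy, the Gini impurity, and the member f(p) = p − p⁴ of the family p − p^α:

```python
import sympy as sp

p = sp.symbols('p', positive=True)

def G_of(f):
    """G(p) = p(p-1)H'(p) + (2p-1)H(p) + 3 with H = f'''/f'' (Theorem 33)."""
    H = sp.diff(f, p, 3) / sp.diff(f, p, 2)
    return sp.simplify(p * (p - 1) * sp.diff(H, p) + (2 * p - 1) * H + 3)

assert G_of(p * (1 - p)) == 3                               # Gini impurity
assert G_of(-p * sp.log(p) - (1 - p) * sp.log(1 - p)) == 1  # entropy
assert G_of(p - p**4) == 5                                  # p - p^alpha with alpha = 4
```

In each case G is a nonnegative constant, so each of these functions respects class weighting.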
Remark 36.
The impurity function f(p) = p − p³ that we have been using in examples throughout this paper also respects class weighting, as do f(p) = p − p^α for α > 1 and f(p) = p^α − p for 0 < α < 1. In these cases, we apply Theorem 33 and compute G ≡ α + 1.

For an example of a preimpurity function that does not respect class weighting, consider the preimpurity function (in fact, symmetric impurity function) f(p) = 1 − ½(2p − 1)² − ½(2p − 1)⁴. Then using Theorem 33 we check that G(1/2) < 0. Alternatively, one can directly show that f fails to split more positively purely than T_w f for some w > 1 using Theorem 12.

Another very noteworthy example of an impurity function that respects class weighting is f(p) = √(p(1 − p)), considered in [9] and shown there to satisfy certain error bounds. It was also shown in [4] to be cost-insensitive, i.e., insensitive to class weighting. For this particular f, we compute T_w f = √w · f, so that T_{w₁} f is equivalent to T_{w₂} f for all w₁, w₂. In other words, class weighting does not change the optimal splits at all when using this impurity function. This is indeed in agreement with [4].

In fact, we can revisit the proof of Theorem 33 to also characterize all cost-insensitive impurity functions. First, let us define cost-insensitivity in terms of the framework we have built so far:

Definition 37.
Let f be a preimpurity function. We say f is cost-insensitive if f is equivalent to T_w f for all w > 0.

By Corollary 19, f is cost-insensitive if and only if for all w > 0, f''/(T_w f)'' is constant. Revisiting the definition of F in the proof of Theorem 33, we see that this is equivalent to F ≡ 0, and hence to G ≡ 0. In other words, f is cost-insensitive if and only if f satisfies the ODE

p(p − 1)H'(p) + (2p − 1)H(p) + 3 = 0,

where we recall H = f'''/f''. Now the solution to the above ODE is

H(p) = (3p + C₁)/(p(1 − p)), p ∈ (0, 1),

where C₁ is a constant. Since H = f'''/f'' = (log(−f''))', we integrate and exponentiate both sides of the above equality to obtain

f''(p) = C₂ exp(C₁ log p − (C₁ + 3) log(1 − p)) = C₂ · p^{C₁}(1 − p)^{−C₁−3}.

Integrating twice more and absorbing and relabeling constants, we get

f(p) = C₂ · p^{C₁+2}(1 − p)^{−C₁−1} + C₃p + C₄.

Requiring that our preimpurity function be continuous on the closed interval [0, 1] gives −2 < C₁ < −1, and requiring f(0) = f(1) = 0 gives C₃ = C₄ = 0. Finally, letting C₂ = 1 and α = C₁ + 2 we get

f(p) = p^α(1 − p)^{1−α}, 0 < α < 1.

We have just proved the third and final main theorem of this section:

Theorem 38.
Let f be an impurity function, and suppose f is C⁴ on (0, 1). Then f is cost-insensitive if and only if f is a positive scalar multiple of one of the functions in the family {f_α} given by

f_α(p) = p^α(1 − p)^{1−α}, α ∈ (0, 1).

Remark 39. A direct computation shows that for the above family we have T_w f_α = w^α · f_α, which is consistent with the backward implication in Theorem 38. Also, observe that for α, β ∈ (0, 1) we compute

(f_α''/f_β'')(p) = (α(1 − α))/(β(1 − β)) · (p/(1 − p))^{α−β},

so that f_α splits more positively purely than f_β if and only if α ≥ β.

6. Some Remarks on the Axioms of Impurity Functions
We conclude by summarizing some remarks made earlier in this paper on the axioms of an impurity function as typically given in the literature, stated at the top of Section 4. Recall those axioms:

(1) f(p) is maximized only at p = 1/2.
(2) f(p) is minimized only at the endpoints p = 0, 1.
(3) f is symmetric, i.e., f(p) = f(1 − p).

As Corollary 20 shows, Axiom 2 is not necessary for good splitting behavior, although there is no loss of generality in assuming Axiom 2. Furthermore, even under the assumption that f(0) = f(1) = 0, Axioms 1 and 3 are still not necessary for good splitting behavior; indeed, Theorem 26 shows that asymmetric impurity functions are, in many cases, equivalent to symmetric impurity functions under class weighting.

The one property we did emphasize in our definition of impurity function is concavity. Indeed, while concavity is not explicitly stated as one of the axioms of an impurity function above, strict concavity is typically additionally imposed upon (or implicitly satisfied by) the impurity functions under consideration. The reason for this is to ensure that total impurity is decreased by splitting a node [1]. For completeness, we present a full argument below.

Consider the following example. Let f(p) = p⁴(1 − p)⁴. Then f satisfies Axioms 1–3 but is not concave. Now place two points of Class 0 and one point of Class 1, each with unit weight, on the real line in the order ‘010’. Then the impurity of this set is 3f(1/3) = 16/2187 ≈ 0.0073, but the two possible splits {‘01’, ‘0’} and {‘0’, ‘10’} each have impurity equal to 1·f(0) + 2·f(1/2) = 1/128 ≈ 0.0078. Thus every possible split results in an increase in impurity, causing our node to become “stuck” and unable to split.

A property that an impurity function ought to have is that making a split should never increase total impurity; or, using the entropy/information-gain heuristic, one should never lose information by splitting a node. We state this precisely below:

Definition 40.
We say a function f on [0, 1] is proper if for every node n and every split of n, the total impurity of that split with respect to f is less than or equal to the impurity of n with respect to f. In other words, f is proper if for all c ∈ (0, 1) and all (a, b) ∈ ([0, c) × (c, 1]) ∪ {(c, c)} we have

(b − c)/(b − a) f(a) + (c − a)/(b − a) f(b) ≤ f(c).

With this definition it is easy to see that the property of being proper is just a slight rephrasing of concavity, making the following proposition immediate:
Proposition 41. f is proper if and only if f is concave.

One usually also desires that the impurity function be nondegenerate in the sense that impurity should strictly decrease (i.e., information gain should be positive) if the split is nontrivial, i.e., a < c < b. This is easily seen to be equivalent to strict concavity of f.

References

[1] Breiman, L.; Friedman, J.; Olshen, R.; Stone, C.: Classification and Regression Trees. Wadsworth, 1984.
[2] Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; Kegelmeyer, W. P.: SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, (2002), 321–357.
[3] Chawla, N. V.: Data mining for imbalanced datasets: An overview. Data Mining and Knowledge Discovery Handbook. Springer, Boston, 2009, 875–886.
[4] Drummond, C.; Holte, R. C.: Exploiting the Cost (In)sensitivity of Decision Tree Splitting Criteria. International Conference on Machine Learning, (2000).
[5] Elkan, C.: The foundations of cost-sensitive learning. International Joint Conference on Artificial Intelligence, (2001), 973–978.
[6] He, H.; Garcia, E. A.: Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, (2008), 1263–1284.
[7] He, H.; Bai, Y.; Garcia, E. A.; Li, S.: ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In IEEE International Joint Conference on Neural Networks (2008), 1322–1328.
[8] Japkowicz, N.; Stephen, S.: The class imbalance problem: A systematic study. Intelligent Data Analysis, (2002), no. 5, 429–449.
[9] Kearns, M.; Mansour, Y.: On the boosting ability of top-down decision tree learning algorithms. In Proceedings of the Annual ACM Symposium on the Theory of Computing. ACM Press (1996), 459–468.
[10] Lomax, S.; Vadera, S.: A survey of cost-sensitive decision tree induction algorithms. ACM Computing Surveys, 45 (2013), 1–35.
[11] Marcellin, S.; Zighed, D. A.; Ritschard, G.: An asymmetric entropy measure for decision trees.
[12] Evaluating decision trees grown with asymmetric entropies. In: Foundations of Intelligent Systems. Springer (2008), 58–67.
[13] Schwartz, L.: Théorie des Distributions. Hermann, Paris, 1966.
E-mail address : [email protected]@gmail.com