Making \emph{ordinary least squares} linear classifiers more robust
Babatunde M. Ayeni
Department of Physics & Astronomy, Macquarie University, NSW 2109, Australia
(Dated: August 29, 2018)

In the field of statistics and machine learning, the sum-of-squares cost function, commonly referred to as ordinary least squares, is a convenient choice because of its many nice analytical properties, though it is not always the best choice. It has long been known that ordinary least squares is not robust to outliers. Several attempts to resolve this problem led to the creation of alternative methods that either did not fully resolve the outlier problem or were computationally difficult. In this paper, we provide a very simple modification that makes ordinary least squares less sensitive to outliers in data classification: scaling the augmented input vector by its length. We give a mathematical exposition of the outlier problem using approximations and geometrical techniques, and we present numerical results to support the efficacy of our method.
I. INTRODUCTION
Machine learning and computational statistics are two different but closely related fields with usually overlapping methods. For example, regression analysis and data classification are two problems that are common to both fields. On the one hand, regression analysis involves finding a mathematical model from a given set of training data such that one can predict the output for a new input. On the other hand, data classification, from a supervised learning perspective, involves finding a model that learns the assignment of data into different classes during training with the aid of the accompanying labels, so that it is able to classify any other given data without knowledge of its label. The two problems are similar. The major difference is that the label in classification is discrete while the "label" in regression (i.e. the dependent variable) is continuous.

To obtain the model, the common practice is to define a cost function that minimizes the distance between the data and the model. An often easy choice is the sum-of-squares cost function, commonly referred to as least squares. This cost function is best applied to data with a normal distribution, though it is not uncommon to see it employed automatically on data which may not be normal. The choice is motivated by the many analytical properties the least squares method enjoys and the ease of its implementation. However, it has been known since the dawn of statistics as a field (in regression analysis), and also later in machine learning (as in data classification), that least squares is not robust against outliers. Several notable attempts have been made at finding a solution to this problem; see Ref. 1 (and references therein) for historical facts. This led to the creation of other methods that are referred to as "least squares alternatives" to resolve the outlier problem, though they are either computationally inefficient or have other limitations. Nonetheless, the outlier problem with least squares was not resolved, and for the sake of distinction, it is now tagged as ordinary least squares (oLS) to differentiate it from the other "least squares alternatives."

There is no mathematically rigorous measure of what deserves to be called an outlier in a dataset. A common idea is that a data point is called an outlier if it does not follow the pattern of the remaining data points. Loosely speaking, if some set of data points is unusually "far away" from the expected region of the majority of the data points, that set of "strayed" data points may be called outliers. Outliers can arise for many reasons, including experimental errors due to faulty equipment, imprecise setup, or environmental conditions; human error; or forging of results. Except where outliers are due to a known cause, they should be retained in the data, and one should rather use statistics that are robust against outliers.

In this paper, we show how a simple idea can help make the least squares cost function, in the setting of data classification, less sensitive to outliers. We do not attempt to provide a method of identifying outliers in a dataset; rather, our method uses both the normal data and the outlier data to determine the optimal decision boundary. The paper is structured as follows: In Sec. II we briefly review the binary classification problem and recall the derivation of the weight vector. In Sec. III, we give some exposition of the outlier problem using geometrical techniques and approximations. In Sec. IV, we present our solution to the outlier problem.
In Sec. V, we present numerical results to support the efficacy of our solution. We use a synthetic dataset in two dimensions for binary classification, and also one "real-world" dataset, the MNIST dataset for handwriting recognition, for the multiple classification problem. In Sec. VI, we end with some conclusions and state some future research directions. On a final note, the reader who is only interested in the solution is advised to jump directly to Sec. IV and Sec. V.
II. LINEAR CLASSIFIER USING LEAST SQUARES ERROR
A linear classifier is a tool in machine learning that is often used for quick data exploration because it is fast to train and very easy to implement, albeit at the expense of accuracy. It becomes a competitive method if the dimensionality of the input space is very high. One common example problem is document classification. See Ref. 2 for a review of linear classifiers.

The aim of statistical classification, in general, is to classify a given data point into one of many classes. More formally, a data point is represented as $(\mathbf{x}, t)$, where $\mathbf{x}$ is the vector representation of the data and $t$ is the corresponding label. The goal is to classify $\mathbf{x}$ into one of $K$ classes $\mathcal{C}_k$, where $k = 1, \ldots, K$, using its label $t$ (in the case where each data point belongs to one and only one class). In cases where the data are linearly separable, a linear classifier is sufficient; otherwise one should have recourse to one of the nonlinear methods, such as neural networks.

A. Binary classification
In this work, we exclusively use binary classification (i.e. $K = 2$) as a fruitful ground to show our proposed solution to the outlier problem, without loss of generality to multiple classification.

The usual starting point is to construct a linear discriminant
\[
y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0, \tag{1}
\]
where $\mathbf{w}$ is the weight vector, $w_0$ is the bias (i.e. the negative of the threshold), and $\mathbf{x}$ is the input data vector. Hereafter, $y(\mathbf{x})$ will be called the ordinary linear discriminant (oLD) in order to differentiate it from the scaled linear discriminant (sLD) that we introduce later. It is convenient to write Eq. (1) as
\[
y(\mathbf{x}') = \mathbf{w}'^T \mathbf{x}', \tag{2}
\]
where $\mathbf{w}' = (w_0, \mathbf{w}^T)^T$ and $\mathbf{x}' = (x_0, \mathbf{x}^T)^T$, with $x_0 = 1$ (a "dummy" variable). The new $\mathbf{w}'$ and $\mathbf{x}'$ are commonly referred to as the augmented weight vector and augmented input vector, respectively.

Let the two classes of data be $\mathcal{C}_1$ and $\mathcal{C}_2$. The input vector $\mathbf{x}$ is assigned to class $\mathcal{C}_1$ if the discriminant $y(\mathbf{x}) > 0$, and to class $\mathcal{C}_2$ if $y(\mathbf{x}) < 0$, while if $y(\mathbf{x}) = 0$ it lies exactly on the decision boundary.

The aim is to learn the $\mathbf{w}'$ that performs this classification from the available labelled training data, $S = \{\mathbf{x}_n, t_n\}$, where $n \in \{1, \ldots, N\}$, $N$ is the total number of training data, $\mathbf{x}_n$ is the input data vector, and $t_n$ is the target binary variable whose value is $+1$ or $-1$ depending on whether $\mathbf{x}_n$ belongs to class $\mathcal{C}_1$ or $\mathcal{C}_2$. We employ least squares as the cost function, defined as
\[
C(\mathbf{w}') = \frac{1}{2} \sum_{n=1}^{N} \left( y(\mathbf{x}'_n) - t_n \right)^2, \tag{3}
\]
which minimizes the distance between the given target value $t_n$ and the model's prediction $y(\mathbf{x}'_n)$. It is known that the minimization of this cost function is equivalent to the maximization of the log-likelihood of a Gaussian probability distribution with respect to the weight and bias. The least squares approach therefore implies the data have an assumed Gaussian distribution, which may not be so, as the distribution is not known a priori. This is one of the main reasons often cited against the use of least squares on data that are not normally distributed.

FIG. 1. Graphical illustration of data with two classes, "pluses" and "crosses," represented here as ideal circular clouds of data. Let the big clouds represent the "normal data" and the small clouds some possible outliers. If there is no outlier, the number of points in the small clouds becomes zero. The set of "plus" signs belongs to class $\mathcal{C}^{(1)}$ and the set of "cross" signs belongs to class $\mathcal{C}^{(2)}$. The other variables are explained in the main text.

To derive the expression for the augmented weight vector $\mathbf{w}'$, take the derivative of Eq. (3) with respect to $\mathbf{w}'$ and set it to zero to give
\[
\mathbf{w}' = \left( \sum_n \mathbf{x}'_n \mathbf{x}'^T_n \right)^{-1} \left( \sum_n t_n \mathbf{x}'_n \right). \tag{4}
\]
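As an illustration of Eqs. (3) and (4), the following is a minimal NumPy sketch of the ordinary least-squares binary classifier described above; the function names and the synthetic Gaussian clouds are our own illustrative choices, not part of the paper.

\begin{verbatim}
import numpy as np

def fit_ols_classifier(X, t):
    """Least-squares weight vector of Eq. (4).

    X : (N, D) array of input vectors.
    t : (N,) array of targets in {+1, -1}.
    Returns the augmented weight vector w' of length D + 1.
    """
    N = X.shape[0]
    # Augment each input with a leading "dummy" 1 (the bias component).
    X_aug = np.hstack([np.ones((N, 1)), X])
    # w' = (sum_n x'_n x'_n^T)^{-1} (sum_n t_n x'_n), solved as a linear system.
    return np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ t)

def predict(w_aug, X):
    """Assign class +1 if y(x) > 0, else -1 (Sec. II A)."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.sign(X_aug @ w_aug)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two illustrative Gaussian clouds in two dimensions.
    X1 = rng.normal(loc=[1.0, 2.0], scale=0.5, size=(100, 2))
    X2 = rng.normal(loc=[3.0, 0.0], scale=0.5, size=(100, 2))
    X = np.vstack([X1, X2])
    t = np.hstack([np.ones(100), -np.ones(100)])
    w = fit_ols_classifier(X, t)
    print("training accuracy:", np.mean(predict(w, X) == t))
\end{verbatim}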
III. EXPOSITION OF THE OUTLIER PROBLEM

In this section, we show approximately, using geometrical techniques, how the outlier problem manifests in least squares linear classifiers for binary classification. For ease of visualization, we depict the problem in two dimensions, see Fig. 1, but this does not detract from the generality of our conclusion; the same conclusion holds true in higher-dimensional input spaces (as we never use any feature peculiar to the dimensionality of the input space, except as an aid for visualization). In addition, we believe that our exposition and the proposed solution apply to multiple classification, as supported by our numerical result on a multiple classification problem.

We start with some basic assumptions and notations. We assume there are two classes, $\mathcal{C}^{(1)}$ and $\mathcal{C}^{(2)}$. The data in each class is split into "normal" data and possible outlier data. Let the number of "normal" data points in each class be $N^{(1)}$ in $\mathcal{C}^{(1)}$ and $N^{(2)}$ in $\mathcal{C}^{(2)}$, and the number of outliers be $N^{(1+)}$ in $\mathcal{C}^{(1)}$ and $N^{(2+)}$ in $\mathcal{C}^{(2)}$. In the "normal" case, when there is no outlier in any class, $N^{(1+)} = N^{(2+)} = 0$.

The weight vector is determined from
\[
\mathbf{w}' = \left( \sum_n \mathbf{x}'_n \mathbf{x}'^T_n \right)^{-1} \left( \sum_n t_n \mathbf{x}'_n \right), \tag{5}
\]
where $\mathbf{x}'_n = (1, \mathbf{x}_n^T)^T$ is the augmented input vector for each input vector $\mathbf{x}_n$ and $t_n = \pm 1$ is the target variable.
We will drop the primes and write $\mathbf{x}_n$ for the augmented vector for the sake of typographical convenience.

We know the class that each input vector belongs to during training. As such, rather than indexing the input vectors with a single index $n$, we use a two-index notation, $(k, u_k)$, where $k$ indexes the class and $u_k$ indexes the sample that belongs to that class. If there is an outlier in class $\mathcal{C}^{(k)}$, we index it with $m_k$. We split the input data in each cloud into a reference vector—which is taken as the mean vector, $\bar{\mathbf{x}}$—and a residual vector $\boldsymbol{\varepsilon}$. As in Fig. 1, let $\bar{\mathbf{x}}^{(1)}$ and $\bar{\mathbf{x}}^{(2)}$ be the mean vectors of the big clouds, $\bar{\mathbf{x}}^{(1+)}$ and $\bar{\mathbf{x}}^{(2+)}$ be the mean vectors of the small outlying clouds, and $\boldsymbol{\Delta}^{(1)}$, $\boldsymbol{\Delta}^{(2)}$ be the displacement vectors of the small outlying clouds from the big clouds. Therefore, the "normal" input data and the outliers can be written respectively as
\[
\mathbf{x}^{(k)}_{u_k} = \bar{\mathbf{x}}^{(k)} + \boldsymbol{\varepsilon}^{(k)}_{u_k}, \tag{6}
\]
\[
\mathbf{x}^{(k+)}_{m_k} = \bar{\mathbf{x}}^{(k)} + \boldsymbol{\Delta}^{(k)} + \boldsymbol{\varepsilon}^{(k+)}_{m_k}, \tag{7}
\]
where the vector $\mathbf{x}^{(k+)}_{m_k}$ can also be written as $\mathbf{x}^{(k+)}_{m_k} = \bar{\mathbf{x}}^{(k+)} + \boldsymbol{\varepsilon}^{(k+)}_{m_k}$. For convenience, we split Eq. (5) into parts as $\mathbf{w} = IS$, where
\[
I = \left( \sum_n \mathbf{x}_n \mathbf{x}^T_n \right)^{-1} \tag{8}
\]
and
\[
S = \sum_n t_n \mathbf{x}_n. \tag{9}
\]
We derive approximate expressions for these terms. Starting with $S$,
\[
S = \sum_k \left( \sum_{u_k} t^{(k)}_{u_k} \mathbf{x}^{(k)}_{u_k} + \sum_{m_k} t^{(k+)}_{m_k} \mathbf{x}^{(k+)}_{m_k} \right), \tag{10}
\]
which is split into a sum over the "normal" data and the possible outlying data. But as the target variable $t$ only depends on the class, $t^{(k)}_{u_k} = t^{(k+)}_{m_k} = t^{(k)}$ for all $u_k$ and $m_k$, so
\[
S = \sum_k t^{(k)} \left( \sum_{u_k} \mathbf{x}^{(k)}_{u_k} + \sum_{m_k} \mathbf{x}^{(k+)}_{m_k} \right). \tag{11}
\]
Substituting Eqs. (6) and (7) into the above equation gives
\[
S = \sum_k t^{(k)} \left[ \sum_{u_k} \left( \bar{\mathbf{x}}^{(k)} + \boldsymbol{\varepsilon}^{(k)}_{u_k} \right) + \sum_{m_k} \left( \bar{\mathbf{x}}^{(k)} + \boldsymbol{\Delta}^{(k)} + \boldsymbol{\varepsilon}^{(k+)}_{m_k} \right) \right]. \tag{12}
\]
For the assumed symmetric clouds (i.e. the ideal case), the sums over the residual vectors vanish, $\sum_{u_k} \boldsymbol{\varepsilon}^{(k)}_{u_k} = 0$, in all cases. Therefore
\[
S = \sum_k t^{(k)} \left[ N^{(k)} \bar{\mathbf{x}}^{(k)} + N^{(k+)} \bar{\mathbf{x}}^{(k)} + N^{(k+)} \boldsymbol{\Delta}^{(k)} \right]. \tag{13}
\]
We now consider the $I$ term,
\[
I = \left[ \sum_k \left( \sum_{u_k} \mathbf{x}^{(k)}_{u_k} \mathbf{x}^{(k)T}_{u_k} + \sum_{m_k} \mathbf{x}^{(k+)}_{m_k} \mathbf{x}^{(k+)T}_{m_k} \right) \right]^{-1}. \tag{14}
\]
Again, we substitute Eqs. (6) and (7) into the above equation (the cross terms between the mean and residual vectors vanish upon summation) to get
\[
I = \left\{ \sum_k \left[ \sum_{u_k} \left( \bar{\mathbf{x}}^{(k)} \bar{\mathbf{x}}^{(k)T} + \boldsymbol{\varepsilon}^{(k)}_{u_k} \boldsymbol{\varepsilon}^{(k)T}_{u_k} \right) + \sum_{m_k} \left( \bar{\mathbf{x}}^{(k+)} \bar{\mathbf{x}}^{(k+)T} + \boldsymbol{\varepsilon}^{(k+)}_{m_k} \boldsymbol{\varepsilon}^{(k+)T}_{m_k} \right) \right] \right\}^{-1}, \tag{15}
\]
which simplifies to
\[
I = \left[ \sum_k \left( N^{(k)} \bar{\mathbf{x}}^{(k)} \bar{\mathbf{x}}^{(k)T} + N^{(k+)} \bar{\mathbf{x}}^{(k+)} \bar{\mathbf{x}}^{(k+)T} \right) + \sum_k \left( \sum_{u_k} \boldsymbol{\varepsilon}^{(k)}_{u_k} \boldsymbol{\varepsilon}^{(k)T}_{u_k} + \sum_{m_k} \boldsymbol{\varepsilon}^{(k+)}_{m_k} \boldsymbol{\varepsilon}^{(k+)T}_{m_k} \right) \right]^{-1}. \tag{16}
\]
We shall let
\[
M = \sum_k \left( N^{(k)} \bar{\mathbf{x}}^{(k)} \bar{\mathbf{x}}^{(k)T} + N^{(k+)} \bar{\mathbf{x}}^{(k+)} \bar{\mathbf{x}}^{(k+)T} \right), \tag{17}
\]
and
\[
E = \sum_k \left( \sum_{u_k} \boldsymbol{\varepsilon}^{(k)}_{u_k} \boldsymbol{\varepsilon}^{(k)T}_{u_k} + \sum_{m_k} \boldsymbol{\varepsilon}^{(k+)}_{m_k} \boldsymbol{\varepsilon}^{(k+)T}_{m_k} \right). \tag{18}
\]
Therefore
\[
I = (M + E)^{-1}, \tag{19}
\]
which can be expanded into the series
\[
I = M^{-1} \left[ \mathbb{1} - \left( E M^{-1} \right) + \left( E M^{-1} \right)^2 - \ldots \right]. \tag{20}
\]
If the lengths of the residual vectors are small enough, the convergence of this series can be guaranteed. In any case, in order to show the effect of an outlier, it is sufficient to consider the leading term. Therefore
\[
I \approx M^{-1}, \tag{21}
\]
which involves only the mean vectors of the two classes. Written explicitly,
\[
I \approx \left[ \sum_k \left( N^{(k)} \bar{\mathbf{x}}^{(k)} \bar{\mathbf{x}}^{(k)T} + N^{(k+)} \bar{\mathbf{x}}^{(k+)} \bar{\mathbf{x}}^{(k+)T} \right) \right]^{-1}. \tag{22}
\]
The expressions for $S$ and $I$ can be written more explicitly by using the values $t^{(1)} = 1$ and $t^{(2)} = -1$:
\[
S = N^{(1)} \bar{\mathbf{x}}^{(1)} + N^{(1+)} \bar{\mathbf{x}}^{(1+)} - N^{(2)} \bar{\mathbf{x}}^{(2)} - N^{(2+)} \bar{\mathbf{x}}^{(2+)}, \tag{23}
\]
\[
I \approx \left( N^{(1)} \bar{\mathbf{x}}^{(1)} \bar{\mathbf{x}}^{(1)T} + N^{(1+)} \bar{\mathbf{x}}^{(1+)} \bar{\mathbf{x}}^{(1+)T} + N^{(2)} \bar{\mathbf{x}}^{(2)} \bar{\mathbf{x}}^{(2)T} + N^{(2+)} \bar{\mathbf{x}}^{(2+)} \bar{\mathbf{x}}^{(2+)T} \right)^{-1}. \tag{24}
\]
We define the total number of data points as $N = N^{(1)} + N^{(2)} + N^{(1+)} + N^{(2+)}$, and the density of each cloud of data as
\[
\rho^{(1)} = \frac{N^{(1)}}{N}, \quad \rho^{(2)} = \frac{N^{(2)}}{N}, \quad \rho^{(1+)} = \frac{N^{(1+)}}{N}, \quad \rho^{(2+)} = \frac{N^{(2+)}}{N}, \tag{25}
\]
such that $\rho^{(1)} + \rho^{(2)} + \rho^{(1+)} + \rho^{(2+)} = 1$. Therefore, $S$ and $I$ can be expressed in terms of densities as
\[
S = N \left( \rho^{(1)} \bar{\mathbf{x}}^{(1)} + \rho^{(1+)} \bar{\mathbf{x}}^{(1+)} - \rho^{(2)} \bar{\mathbf{x}}^{(2)} - \rho^{(2+)} \bar{\mathbf{x}}^{(2+)} \right), \tag{26}
\]
\[
I \approx N^{-1} \left( \rho^{(1)} \bar{\mathbf{x}}^{(1)} \bar{\mathbf{x}}^{(1)T} + \rho^{(1+)} \bar{\mathbf{x}}^{(1+)} \bar{\mathbf{x}}^{(1+)T} + \rho^{(2)} \bar{\mathbf{x}}^{(2)} \bar{\mathbf{x}}^{(2)T} + \rho^{(2+)} \bar{\mathbf{x}}^{(2+)} \bar{\mathbf{x}}^{(2+)T} \right)^{-1}. \tag{27}
\]
From these equations, we explore the following special cases:

1. When there are no outliers, i.e. $\rho^{(1+)} = 0$ and $\rho^{(2+)} = 0$, and there is an equal number of data points in the "big" clouds, $N^{(1)} = N^{(2)}$, i.e. $\rho^{(1)} = \rho^{(2)} = 1/2$. The expressions derived from this case will be used to "benchmark" the remaining cases.

2. When there is still no outlier, $\rho^{(1+)} = 0$ and $\rho^{(2+)} = 0$, but the densities of data in the remaining clouds differ. We choose, for instance, $\rho^{(1)} = \epsilon$ and $\rho^{(2)} = 1 - \epsilon$, where $\epsilon$ is a small number.

3. When there is an outlier, e.g. in class $\mathcal{C}^{(2)}$, so that $\rho^{(1+)} = 0$. We consider $\rho^{(1)} = 1/2$ and $\rho^{(2+)} = 1/2 - \rho^{(2)}$.

4. The last case is when none of $\rho^{(1)}$, $\rho^{(2)}$, $\rho^{(1+)}$, and $\rho^{(2+)}$ is zero. We do not regard this as an outlier problem, as the two classes have large variances, and hence have "equal advantage" in "competing" for the decision boundary. If, however, one class has a much larger variance than the other, then this case becomes an example of an outlier problem like Case 3. The solution we propose for the outlier problem also works for it. This case is not developed further.
1. Case 1
In this case, $\rho^{(1+)} = 0$, $\rho^{(2+)} = 0$, $\rho^{(1)} = \rho^{(2)} = 1/2$. The expressions for $S$ and $I$ become
\[
S = N \left( \frac{1}{2} \bar{\mathbf{x}}^{(1)} - \frac{1}{2} \bar{\mathbf{x}}^{(2)} \right), \tag{28}
\]
\[
I \approx N^{-1} \left( \frac{1}{2} \bar{\mathbf{x}}^{(1)} \bar{\mathbf{x}}^{(1)T} + \frac{1}{2} \bar{\mathbf{x}}^{(2)} \bar{\mathbf{x}}^{(2)T} \right)^{-1}. \tag{29}
\]
Just as in a two-body problem in classical mechanics, we can express both $S$ and $I$ in terms of two new vectors: $\mathbf{R}_0 = (\bar{\mathbf{x}}^{(1)} + \bar{\mathbf{x}}^{(2)})/2$, the centroid of the two classes, and $\mathbf{r}_0 = (\bar{\mathbf{x}}^{(1)} - \bar{\mathbf{x}}^{(2)})/2$, the corresponding relative vector of the two classes. Therefore,
\[
S = N \mathbf{r}_0, \tag{30}
\]
\[
I \approx N^{-1} \left( \mathbf{R}_0 \mathbf{R}_0^T + \mathbf{r}_0 \mathbf{r}_0^T \right)^{-1}. \tag{31}
\]
From this, one can expect the decision boundary to pass through the centroid, as illustrated in Fig. 2. If we include more terms in the series expansion of $I$, the decision boundary may vary to allow for more statistical variations in the input data.

While it is not possible to know the values of the components of the weight vector $\mathbf{w}$ unless we specify the coordinate values of the input vectors, we will use the expressions for $S$, $I$, $\mathbf{R}_0$, and $\mathbf{r}_0$ as "benchmarks" when considering the other cases.
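Indeed, Eq. (31) follows from Eq. (29) by a short check using the definitions of $\mathbf{R}_0$ and $\mathbf{r}_0$:
\[
\mathbf{R}_0 \mathbf{R}_0^T + \mathbf{r}_0 \mathbf{r}_0^T
= \tfrac{1}{4} \left[ \left( \bar{\mathbf{x}}^{(1)} + \bar{\mathbf{x}}^{(2)} \right) \left( \bar{\mathbf{x}}^{(1)} + \bar{\mathbf{x}}^{(2)} \right)^T + \left( \bar{\mathbf{x}}^{(1)} - \bar{\mathbf{x}}^{(2)} \right) \left( \bar{\mathbf{x}}^{(1)} - \bar{\mathbf{x}}^{(2)} \right)^T \right]
= \tfrac{1}{2} \bar{\mathbf{x}}^{(1)} \bar{\mathbf{x}}^{(1)T} + \tfrac{1}{2} \bar{\mathbf{x}}^{(2)} \bar{\mathbf{x}}^{(2)T},
\]
since the cross terms cancel.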
2. Case 2
When $\rho^{(1+)} = \rho^{(2+)} = 0$, and $\rho^{(1)} = \epsilon$ and $\rho^{(2)} = 1 - \epsilon$, where $\epsilon$ is a small number, $S$ and $I$ become
\[
S = N \left( \epsilon \bar{\mathbf{x}}^{(1)} - (1 - \epsilon) \bar{\mathbf{x}}^{(2)} \right), \tag{32}
\]
\[
I \approx N^{-1} \left( \epsilon \bar{\mathbf{x}}^{(1)} \bar{\mathbf{x}}^{(1)T} + (1 - \epsilon) \bar{\mathbf{x}}^{(2)} \bar{\mathbf{x}}^{(2)T} \right)^{-1}. \tag{33}
\]
We cast $S$ and $I$ into the same form as in Case 1 by defining the centroid $\mathbf{R}$ and relative vector $\mathbf{r}$ as
\[
\mathbf{R} = \frac{1}{\sqrt{2}} \left( \sqrt{\epsilon}\, \bar{\mathbf{x}}^{(1)} + \sqrt{1 - \epsilon}\, \bar{\mathbf{x}}^{(2)} \right), \tag{34}
\]
\[
\mathbf{r} = \frac{1}{\sqrt{2}} \left( \sqrt{\epsilon}\, \bar{\mathbf{x}}^{(1)} - \sqrt{1 - \epsilon}\, \bar{\mathbf{x}}^{(2)} \right). \tag{35}
\]

FIG. 2. When there is no outlier and the numbers of data points in both classes are equal, the decision boundary, i.e. the solid black line, is expected (up to leading order, in the ideal case) to bisect the distance $2\mathbf{r}_0$ between the two classes. The decision boundary passes through the centroid $\mathbf{R}_0 = (\bar{\mathbf{x}}^{(1)} + \bar{\mathbf{x}}^{(2)})/2$. The distance between the two classes, using the class means, is $2\mathbf{r}_0 = \bar{\mathbf{x}}^{(1)} - \bar{\mathbf{x}}^{(2)}$.

Therefore, $I$ can be expressed as
\[
I \approx N^{-1} \left( \mathbf{R} \mathbf{R}^T + \mathbf{r} \mathbf{r}^T \right)^{-1}. \tag{36}
\]
For $S$, we make some approximations. From Eq. (35), we see that for small $\epsilon$, $\bar{\mathbf{x}}^{(2)} \approx -\sqrt{2}\, \mathbf{r}$. Therefore,
\[
S \approx N \sqrt{2}\, \mathbf{r}. \tag{37}
\]
The expressions for $S$ and $I$ in this case can be compared with those of Case 1. Here, it can be noticed that $\bar{\mathbf{x}}^{(2)}$ contributes more in determining $S$ and $I$, and hence the decision boundary is biased towards the second class $\mathcal{C}^{(2)}$. In addition, we already know that the centroid $\mathbf{R}$ determines the point through which the decision boundary passes. As such, since $\mathbf{R}$ is closer to $\bar{\mathbf{x}}^{(2)}$, the boundary is closer to the data in class $\mathcal{C}^{(2)}$ than to class $\mathcal{C}^{(1)}$, and will hence misclassify data from the lower plane (as it will "cut" across it). This is hardly insightful, as an unequal number of data points in the two classes will bias the boundary towards the denser class and give a poorer accuracy, although this is not a problem of any outlier. Therefore, we will no longer consider unequal densities, whether in the presence of an outlier or not.
3. Case 3
We now consider the case when there is an outlier, e.g. in class $\mathcal{C}^{(2)}$. [The same conclusion holds true if we instead choose the outlier to be in class $\mathcal{C}^{(1)}$.] To that end, we let $\rho^{(1+)} = 0$, $\rho^{(1)} = 1/2$, $\rho^{(2)} = \gamma$, $\rho^{(2+)} = 1/2 - \gamma$, where $0 < \gamma \leq 1/2$. Under these assumptions, $S$ and $I$ become
\[
S = N \left[ \frac{1}{2} \bar{\mathbf{x}}^{(1)} - \gamma \bar{\mathbf{x}}^{(2)} - \left( \frac{1}{2} - \gamma \right) \bar{\mathbf{x}}^{(2+)} \right], \tag{38}
\]
\[
I \approx N^{-1} \left[ \frac{1}{2} \bar{\mathbf{x}}^{(1)} \bar{\mathbf{x}}^{(1)T} + \gamma \bar{\mathbf{x}}^{(2)} \bar{\mathbf{x}}^{(2)T} + \left( \frac{1}{2} - \gamma \right) \bar{\mathbf{x}}^{(2+)} \bar{\mathbf{x}}^{(2+)T} \right]^{-1}. \tag{39}
\]
Even though the densities of both classes are balanced, the density $(1/2 - \gamma)$ of the outlier (at some value of $\gamma$) and its coordinate values $\bar{\mathbf{x}}^{(2+)}$ can still exert a "pulling weight" on the boundary, "pulling" it more towards the second class.

For completeness, we again express $S$ and $I$ in terms of centroids and relative vectors, given as
\[
\mathbf{R} = \frac{1}{2} \left( \frac{1}{\sqrt{2}} \bar{\mathbf{x}}^{(1)} + \sqrt{\gamma}\, \bar{\mathbf{x}}^{(2)} + \sqrt{1/2 - \gamma}\, \bar{\mathbf{x}}^{(2+)} \right), \tag{40}
\]
\[
\mathbf{r} = \frac{1}{2} \left( \frac{1}{\sqrt{2}} \bar{\mathbf{x}}^{(1)} - \sqrt{\gamma}\, \bar{\mathbf{x}}^{(2)} - \sqrt{1/2 - \gamma}\, \bar{\mathbf{x}}^{(2+)} \right), \tag{41}
\]
\[
\tilde{\mathbf{R}} = \frac{1}{2} \left( \frac{1}{\sqrt{2}} \bar{\mathbf{x}}^{(1)} + \sqrt{\gamma}\, \bar{\mathbf{x}}^{(2)} - \sqrt{1/2 - \gamma}\, \bar{\mathbf{x}}^{(2+)} \right), \tag{42}
\]
\[
\tilde{\mathbf{r}} = \frac{1}{2} \left( \frac{1}{\sqrt{2}} \bar{\mathbf{x}}^{(1)} - \sqrt{\gamma}\, \bar{\mathbf{x}}^{(2)} + \sqrt{1/2 - \gamma}\, \bar{\mathbf{x}}^{(2+)} \right). \tag{43}
\]
Hence,
\[
S = N \left[ \frac{\mathbf{R} + \mathbf{r}}{\sqrt{2}} - \sqrt{\gamma} \left( \mathbf{R} - \tilde{\mathbf{r}} \right) - \sqrt{1/2 - \gamma} \left( \mathbf{R} - \tilde{\mathbf{R}} \right) \right], \tag{44}
\]
\[
I \approx N^{-1} \left( \mathbf{R} \mathbf{R}^T + \mathbf{r} \mathbf{r}^T + \tilde{\mathbf{R}} \tilde{\mathbf{R}}^T + \tilde{\mathbf{r}} \tilde{\mathbf{r}}^T \right)^{-1}, \tag{45}
\]
where we recover Eqs. (30) and (31) for both $S$ and $I$ in the limit $\gamma \to 1/2$.

Depending on the coordinates of the input samples, one of either $\mathbf{R}$ or $\tilde{\mathbf{R}}$, and one of either $\mathbf{r}$ or $\tilde{\mathbf{r}}$, will be more important than the other, and will (approximately) determine the point through which the decision boundary passes. In the particular illustration of the positive-positive quarter-plane of Fig. 1, $\mathbf{R}$ dominates over $\tilde{\mathbf{R}}$ and hence determines the point through which the decision boundary passes. It can be expected that at some value of $\gamma < 1/2$, the centroid $\mathbf{R}$ will move closer to the outlier and the decision boundary will therefore "cut across" some of the data in class $\mathcal{C}^{(2)}$. This is an instance of the famed outlier problem. It leads to lower accuracy in learning algorithms that employ least squares.

The denser or farther the outlier is from the "normal" data, controlled either through $\gamma$ or the position vector $\bar{\mathbf{x}}^{(2+)}$, the more the boundary line is "pulled" towards the half-plane of the outlier, and thereby the poorer the result on classification. This effect can easily be seen in any of the above equations. For example, in $\mathbf{R}$, the outlying data enters the equation as $\sqrt{1/2 - \gamma}\, \bar{\mathbf{x}}^{(2+)}$, i.e. as a product of the square root of its density $(1/2 - \gamma)$ and its coordinate vector $\bar{\mathbf{x}}^{(2+)}$. Therefore, increasing either the density or the position of the outlier can accentuate its effect. We will assume, as in a real scenario, that the density of the outlier is less than the density of the "normal" data—otherwise the outlier should rather be considered the "normal" data. In that case, we are still left with its position vector $\bar{\mathbf{x}}^{(2+)}$, which, if "very far" from the "normal" data, affects the decision boundary. In the next section, we provide a simple solution to this problem, and present numerical proofs of its efficacy.
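To make the "pulling" effect concrete, the following is a small NumPy illustration (ours; the cloud locations, sizes, and labels are arbitrary illustrative choices) comparing the exact least-squares boundary of Eq. (4) fitted with and without a small outlying cloud attached to class $\mathcal{C}^{(2)}$.

\begin{verbatim}
import numpy as np

def ls_weights(X, t):
    # Exact least-squares weight vector of Eq. (4) on augmented inputs.
    X_aug = np.hstack([np.ones((len(X), 1)), X])
    return np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ t)

rng = np.random.default_rng(1)
# "Normal" clouds of the two classes (coordinates are illustrative only).
x1 = rng.normal([1.0, 3.0], 0.4, size=(100, 2))      # class C(1), t = +1
x2 = rng.normal([3.0, 1.0], 0.4, size=(70, 2))       # class C(2), t = -1
x2_out = rng.normal([9.0, -6.0], 0.4, size=(30, 2))  # small outlying cloud, also class C(2)

X_no, t_no = np.vstack([x1, x2]), np.r_[np.ones(100), -np.ones(70)]
X_out, t_out = np.vstack([x1, x2, x2_out]), np.r_[np.ones(100), -np.ones(100)]

for name, X, t in [("no outlier", X_no, t_no), ("with outlier", X_out, t_out)]:
    w = ls_weights(X, t)
    # Discriminant evaluated on the "normal" C(2) cloud; points with y > 0 are cut off.
    y = np.hstack([np.ones((70, 1)), x2]) @ w
    print(name, "fraction of normal C(2) points misclassified:", np.mean(y > 0))
\end{verbatim}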
IV. MAKING OUTLIERS LESS SIGNIFICANT
Having reviewed how outliers pose a problem for the least squares error function, we now provide a simple means by which it can be corrected: by applying a length scale to the discriminant function.

We start with the realization that in binary classification, the classification criterion is the sign of the discriminant function
\[
y(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + w_0, \tag{46}
\]
where $y(\mathbf{x}) < 0$ if $\mathbf{x}$ is on the lower plane, $y(\mathbf{x}) = 0$ if $\mathbf{x}$ is on the decision boundary, and $y(\mathbf{x}) > 0$ if $\mathbf{x}$ is on the upper plane. In multiple classification using linear discriminants, the classification criterion is the maximum value of the linear discriminants of the different classes. What is common to both classification problems is that the classification criterion is "scale invariant." Basically, this means that if we divide $y(\mathbf{x})$ by some length scale $S$, the classification criterion of the discriminant does not change, though its value does; in binary classification, the sign of $y(\mathbf{x})$ is unchanged by a length scale, and in multiple classification, all the entries of $y(\mathbf{x})$ are scaled uniformly, so that the highest entry remains the highest.

For reasons that will be clear later, we work with the augmented version of the above equation,
\[
y(\mathbf{x}') = \mathbf{w}'^T \mathbf{x}', \tag{47}
\]
where $\mathbf{x}'$ is the augmented input vector, with length $\| \mathbf{x}' \| = \sqrt{1 + \| \mathbf{x} \|^2}$. We define the scaled version of the above equation as
\[
Y(\mathbf{x}') = \frac{y(\mathbf{x}')}{\| \mathbf{x}' \|} = \frac{\mathbf{w}'^T \mathbf{x}'}{\| \mathbf{x}' \|} = \mathbf{w}'^T \mathbf{X}', \tag{48}
\]
where $\mathbf{X}' = \mathbf{x}' / \| \mathbf{x}' \|$ is the scaled augmented input vector. Under this scaled version, all the machinery of the least squares linear classifier, namely the cost function and the expression for the weight vector, remains the same, where the input vectors are now scaled.

We posit that using this scaled version "cures" the outlier problem. To show this, we recall the expressions of Eqs. (38) and (39) derived for Case 3 of Sec. III. Their scaled versions are given as
\[
S = N \left[ \frac{1}{2} \bar{\mathbf{X}}^{(1)} - \gamma \bar{\mathbf{X}}^{(2)} - \left( \frac{1}{2} - \gamma \right) \bar{\mathbf{X}}^{(2+)} \right], \tag{49}
\]
\[
I \approx N^{-1} \left[ \frac{1}{2} \bar{\mathbf{X}}^{(1)} \bar{\mathbf{X}}^{(1)T} + \gamma \bar{\mathbf{X}}^{(2)} \bar{\mathbf{X}}^{(2)T} + \left( \frac{1}{2} - \gamma \right) \bar{\mathbf{X}}^{(2+)} \bar{\mathbf{X}}^{(2+)T} \right]^{-1}, \tag{50}
\]
where for typographical convenience we have also dropped the primes on the augmented vectors.

We now simplify $S$ and $I$. Starting with $S$,
\[
S = N \left[ \frac{1}{2} \frac{\bar{\mathbf{x}}^{(1)}}{\| \bar{\mathbf{x}}^{(1)} \|} - \gamma \frac{\bar{\mathbf{x}}^{(2)}}{\| \bar{\mathbf{x}}^{(2)} \|} - \left( \frac{1}{2} - \gamma \right) \frac{\bar{\mathbf{x}}^{(2+)}}{\| \bar{\mathbf{x}}^{(2+)} \|} \right]. \tag{51}
\]
We assume the length of the outlier, $\| \bar{\mathbf{x}}^{(2+)} \|$, is larger than those of the non-outlier vectors, and hence factor it out. Therefore,
\[
S = \frac{N}{\| \bar{\mathbf{x}}^{(2+)} \|} \left[ \frac{1}{2} \frac{\| \bar{\mathbf{x}}^{(2+)} \|}{\| \bar{\mathbf{x}}^{(1)} \|} \bar{\mathbf{x}}^{(1)} - \gamma \frac{\| \bar{\mathbf{x}}^{(2+)} \|}{\| \bar{\mathbf{x}}^{(2)} \|} \bar{\mathbf{x}}^{(2)} - \left( \frac{1}{2} - \gamma \right) \bar{\mathbf{x}}^{(2+)} \right]. \tag{52}
\]
Similarly, we simplify $I$,
\[
I \approx N^{-1} \left[ \frac{1}{2} \frac{\bar{\mathbf{x}}^{(1)} \bar{\mathbf{x}}^{(1)T}}{\| \bar{\mathbf{x}}^{(1)} \|^2} + \gamma \frac{\bar{\mathbf{x}}^{(2)} \bar{\mathbf{x}}^{(2)T}}{\| \bar{\mathbf{x}}^{(2)} \|^2} + \left( \frac{1}{2} - \gamma \right) \frac{\bar{\mathbf{x}}^{(2+)} \bar{\mathbf{x}}^{(2+)T}}{\| \bar{\mathbf{x}}^{(2+)} \|^2} \right]^{-1}, \tag{53}
\]
which we write as
\[
I \approx \| \bar{\mathbf{x}}^{(2+)} \| \, N^{-1} \left[ \frac{1}{2} \frac{\| \bar{\mathbf{x}}^{(2+)} \|}{\| \bar{\mathbf{x}}^{(1)} \|^2} \bar{\mathbf{x}}^{(1)} \bar{\mathbf{x}}^{(1)T} + \gamma \frac{\| \bar{\mathbf{x}}^{(2+)} \|}{\| \bar{\mathbf{x}}^{(2)} \|^2} \bar{\mathbf{x}}^{(2)} \bar{\mathbf{x}}^{(2)T} + \left( \frac{1}{2} - \gamma \right) \frac{\bar{\mathbf{x}}^{(2+)} \bar{\mathbf{x}}^{(2+)T}}{\| \bar{\mathbf{x}}^{(2+)} \|} \right]^{-1}. \tag{54}
\]
The weight vector $\mathbf{w}$ is then
\[
\mathbf{w} \approx IS = \tilde{I} \tilde{S}, \tag{55}
\]
where
\[
\tilde{S} = \frac{1}{2} \frac{\| \bar{\mathbf{x}}^{(2+)} \|}{\| \bar{\mathbf{x}}^{(1)} \|} \bar{\mathbf{x}}^{(1)} - \gamma \frac{\| \bar{\mathbf{x}}^{(2+)} \|}{\| \bar{\mathbf{x}}^{(2)} \|} \bar{\mathbf{x}}^{(2)} - \left( \frac{1}{2} - \gamma \right) \bar{\mathbf{x}}^{(2+)}, \tag{56}
\]
and
\[
\tilde{I} = \left[ \frac{1}{2} \frac{\| \bar{\mathbf{x}}^{(2+)} \|}{\| \bar{\mathbf{x}}^{(1)} \|^2} \bar{\mathbf{x}}^{(1)} \bar{\mathbf{x}}^{(1)T} + \gamma \frac{\| \bar{\mathbf{x}}^{(2+)} \|}{\| \bar{\mathbf{x}}^{(2)} \|^2} \bar{\mathbf{x}}^{(2)} \bar{\mathbf{x}}^{(2)T} + \left( \frac{1}{2} - \gamma \right) \frac{\bar{\mathbf{x}}^{(2+)} \bar{\mathbf{x}}^{(2+)T}}{\| \bar{\mathbf{x}}^{(2+)} \|} \right]^{-1}. \tag{57}
\]
From the above equations, we see that the vectors of the "normal" data in both $\tilde{S}$ and $\tilde{I}$ have been "scaled up" by the length of the outlier, thereby making them as important as, or even more important than, the outlier in determining the decision boundary. The farther the outlier, the lesser its effect on the boundary. We can compare these equations with Eqs. (38) and (39), when there is an outlier (using the conventional approach), and with Eqs. (28) and (29), when there is no outlier. Unlike other methods of handling outliers, it is noteworthy that the outlier is still present in $\tilde{S}$ and $\tilde{I}$ (which is good, since it is originally part of the input data), but the decision boundary is now less sensitive to it. That is, we have made the outliers less significant.

We have shown how to improve statistical classification using least squares in the presence of outliers, by scaling the augmented input data vector by its length. Not only is this true approximately, but also in the exact case, as the residuals would also have been scaled and will not sum to zero in "real-world" data, unlike in the ideal case we considered.

We present numerical proofs in Sec. V. In our numerical implementation, we do not use the approximate equations but the exact equation for determining the weight vector, as in Eq. (4), only now scaling the input data vectors according to our prescription above.
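The prescription above amounts to a one-line change in the classifier of Sec. II: normalize each augmented training vector by its length before solving for the weights. A minimal NumPy sketch follows (ours; the function names are illustrative).

\begin{verbatim}
import numpy as np

def fit_sld_classifier(X, t):
    """Least-squares weights with scaled augmented inputs (the sLD of Sec. IV).

    X : (N, D) array of input vectors.
    t : (N,) array of targets in {+1, -1}.
    """
    N = X.shape[0]
    X_aug = np.hstack([np.ones((N, 1)), X])                          # x' = (1, x)
    X_scaled = X_aug / np.linalg.norm(X_aug, axis=1, keepdims=True)  # X' = x'/||x'||
    # Same normal equations as Eq. (4), applied to the scaled vectors.
    return np.linalg.solve(X_scaled.T @ X_scaled, X_scaled.T @ t)

def predict(w_aug, X):
    """Classification by the sign of w'^T x'; no scaling is needed at test time,
    since dividing by ||x'|| > 0 cannot change the sign."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.sign(X_aug @ w_aug)
\end{verbatim}

The scaled vectors are used only for fitting; at test time the computed weight vector is applied directly to the augmented inputs, as described in Sec. V.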
A. Potential pitfall of the proposed solution and its solutions

In our prescribed solution to the outlier problem, we proposed that the length scale to be used should be that of the augmented input vector $\mathbf{x}'$, without giving any justification. We give one now. It is tempting to want to use the length $\| \mathbf{x} \|$ of the original input vector $\mathbf{x}$. However, this has a serious potential pitfall. For definiteness and simplicity, we consider the input data to be in two dimensions. Assuming that we scale with $\| \mathbf{x} \|$, the scaled augmented input vector becomes
\[
\mathbf{X}' = \frac{\mathbf{x}'}{\| \mathbf{x} \|}, \tag{58}
\]
where $\mathbf{x}' = (1, x, y)^T$, with 1 the "dummy" number and $x$ and $y$ the coordinates of the input data vector $\mathbf{x}$. Therefore, the coordinates undergo the scaling transformation
\[
x \to \frac{x}{\sqrt{x^2 + y^2}}, \tag{59}
\]
\[
y \to \frac{y}{\sqrt{x^2 + y^2}}, \tag{60}
\]
which maps a point $(x, y)$ on the Cartesian plane to a point on the unit circle, as the scaled coordinates satisfy $x^2 + y^2 = 1$. This is dangerous if two different data points in two different classes are related by a scale factor. Under the above mapping, they map to the same point on the unit circle, and hence become non-separable. Mathematically, let $\mathbf{x}_1$ and $\mathbf{x}_2$ be two distinct input data vectors that belong to two different classes in the original input space, related as $\mathbf{x}_2 = \lambda \mathbf{x}_1$, where $\lambda$ is some scale factor. We have that (for the scaled versions of the original input vectors)
\[
\mathbf{X}_2 = \frac{\mathbf{x}_2}{\| \mathbf{x}_2 \|} = \frac{\lambda \mathbf{x}_1}{\| \lambda \mathbf{x}_1 \|} = \frac{\mathbf{x}_1}{\| \mathbf{x}_1 \|} = \mathbf{X}_1, \tag{61}
\]
where $\| \mathbf{x}_2 \| = \lambda \| \mathbf{x}_1 \|$. Therefore, different points on the plane now map to the same point on the unit circle. In the (augmented) weight expression for the linear classifier, however, we use the augmented vectors. The scaled augmented versions of $\mathbf{x}_1$ and $\mathbf{x}_2$ are
\[
\mathbf{X}'_1 = \left( \frac{1}{\sqrt{x_1^2 + y_1^2}}, \frac{x_1}{\sqrt{x_1^2 + y_1^2}}, \frac{y_1}{\sqrt{x_1^2 + y_1^2}} \right)^T, \qquad
\mathbf{X}'_2 = \left( \frac{1}{\lambda \sqrt{x_1^2 + y_1^2}}, \frac{x_1}{\sqrt{x_1^2 + y_1^2}}, \frac{y_1}{\sqrt{x_1^2 + y_1^2}} \right)^T. \tag{62}
\]
With this, the usual assumed contiguity hypothesis in linearly separable data may not be maintained, as a data point may "jump" from a region of one class to a region of another class under this transformation, and the data might not be linearly separable anymore, though perhaps still separable using nonlinear classifiers.

A solution that potentially resolves the above problem is to use the length of the augmented input vector $\mathbf{x}'$ as the length scale, with $\| \mathbf{x}' \| = \sqrt{1 + x^2 + y^2}$. It is obvious that if $\mathbf{x}_2 = \lambda \mathbf{x}_1$, then $\mathbf{x}'_2 = (1, \lambda \mathbf{x}_1^T)^T$ and $\| \mathbf{x}'_2 \| \neq \lambda \| \mathbf{x}'_1 \|$, and hence the scaled input vectors $\mathbf{X}_1$ and $\mathbf{X}_2$, using the length of the augmented input vector as the length scale, are not equal. Furthermore, the scaled augmented vectors are
\[
\mathbf{X}'_1 = \left( \frac{1}{\sqrt{1 + x_1^2 + y_1^2}}, \frac{x_1}{\sqrt{1 + x_1^2 + y_1^2}}, \frac{y_1}{\sqrt{1 + x_1^2 + y_1^2}} \right)^T, \qquad
\mathbf{X}'_2 = \left( \frac{1}{\sqrt{1 + \lambda^2 (x_1^2 + y_1^2)}}, \frac{\lambda x_1}{\sqrt{1 + \lambda^2 (x_1^2 + y_1^2)}}, \frac{\lambda y_1}{\sqrt{1 + \lambda^2 (x_1^2 + y_1^2)}} \right)^T, \tag{63}
\]
which cannot be equated by any factor. Therefore, the contiguity hypothesis is maintained.

However, the above solution also breaks down in the limit $|x|, |y| \gg 1$. There are two possible solutions to this:

1. One can apply some uniform transformation to the input space, e.g. scaling or translation, to change the coordinate values to a range where the addition of 1 is significant.

2. An alternative solution is to use a length scale such as $\sqrt{c + x^2 + y^2}$, where $c$ is arbitrary, chosen such that $\sqrt{c + x^2 + y^2}$ differs appreciably from $\sqrt{x^2 + y^2}$.

Overall, our proposed "rule of thumb" is: use the length scale $\sqrt{c + x^2 + y^2}$, where $c$ is chosen to be 1 if the coordinate values of the input space are not too large; otherwise use a higher value of $c$ (a code sketch of this scaling appears at the end of this subsection).

FIG. 3. Random synthetic data in two dimensions with an outlier. There are two classes: "red crosses" and "blue circles." The magenta line is the decision boundary obtained using the ordinary linear discriminant (oLD), i.e. the conventional method, while the black line is the decision boundary obtained using our proposed method of the scaled linear discriminant (sLD).
The above treatment generalizes to higher-dimensional input spaces. The scaling transformation maps the coordinates of $D$-dimensional data points in the (assumed) Cartesian coordinate system to the surface of a hypersphere in $D$ dimensions.
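A small sketch of the rule of thumb above, together with the proportional-points pitfall it avoids, is given below (ours; the constant $c$ and the sample coordinates are illustrative assumptions).

\begin{verbatim}
import numpy as np

def scaled_augmented(X, c=1.0):
    """Scale augmented vectors x' = (1, x) by sqrt(c + ||x||^2) (Sec. IV A).

    c = 1 reproduces the length of the augmented vector ||x'||; a larger c can
    be used when the raw coordinates are so large that the added 1 would be
    negligible.
    """
    N = X.shape[0]
    X_aug = np.hstack([np.ones((N, 1)), X])
    scale = np.sqrt(c + np.sum(X**2, axis=1, keepdims=True))
    return X_aug / scale

if __name__ == "__main__":
    # Two proportional points from different classes, x2 = 3 * x1.
    X = np.array([[1.0, 2.0], [3.0, 6.0]])
    # Scaling by ||x|| alone collapses them onto the same point of the unit circle...
    print(X / np.linalg.norm(X, axis=1, keepdims=True))
    # ...whereas the augmented length scale keeps them distinct, cf. Eq. (63).
    print(scaled_augmented(X))
\end{verbatim}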
V. NUMERICAL RESULTS
We now provide numerical evidence that our proposed solution works efficiently. We test it both on synthetic data and on "real-world" data, comparing the solutions to those obtained using the ordinary linear discriminant. The real-world data is the MNIST dataset for handwriting recognition. The synthetic data are randomly generated in two dimensions and made to look similar to Figure 4.4, p. 186 of Ref. 3. The data contain two classes, "red crosses" and "blue circles," with and without an outlier.

First, we present results when the data populations of the two classes are equal. In Fig. 3, there are 100 data points in both classes, i.e. 100 "red crosses," 70 "blue circles," and 30 outlying "blue circles." The magenta line is the solution using the ordinary linear discriminant (oLD), i.e. the conventional method, which, as is already known and also shown here again, fails in the presence of the outlier. The black line is our solution using the scaled linear discriminant (sLD), which remarkably gets the right boundary. While it is already known that oLD is sensitive to outliers, that is, not robust against them, our method of sLD is less sensitive to outliers, that is, robust against them, and moreover uses the same weight expression as the oLD but with scaled augmented input data vectors rather than ordinary augmented input data vectors. We then also test the method in the "normal" case when there are no outliers, with 100 data points in each class. The method is as good as, or perhaps better than, the conventional method, as shown in Fig. 4.

FIG. 4. Random synthetic data in two dimensions without an outlier. The magenta line is the decision boundary obtained using oLD, while the black line is the decision boundary obtained using sLD. It can be seen that the result obtained using sLD is perhaps better than the result of the oLD method.

Secondly, we test our method on data populations that are not equal, namely, when there is more data in one class than in the other. In this case, we considered 100 "red crosses," 100 "blue circles," and 30 extra outlying "blue circles." In Fig. 5, we present the solutions of the oLD and sLD, with the outlier still at the same position as in Fig. 3, and in Fig. 6, we vary the position of the outlier, displacing it to a different position. While the oLD misclassifies data from both classes, it can be seen that the sLD is optimal, giving a much better decision boundary. In both Figs. 5 and 6, we see that while the oLD method is very sensitive to the outlier and does poorer with a change in the position of the outlier, the sLD method is very robust against outliers and also gives superior results when the position of the outlier is varied.

FIG. 5. Unequal-density data populations with an outlier. Read the text for more details.

FIG. 6. Unequal-density data populations, with the outlier displaced to a different position. Read the text for more details.

Lastly, we show that our method of the scaled linear discriminant does not only have advantages over the ordinary linear discriminant when using least squares on binary classification problems and/or synthetic data, but that it is also very competitive for multiple classification and on "real-world" data. To this end, we tested it on the MNIST dataset for handwriting recognition. The MNIST dataset consists of 60000 training images and 10000 test images. Linear classifiers are generally poor on the image recognition problem, as the patterns are nonlinear. Using least squares, the accuracy of learning using the ordinary linear discriminant is 85.77% on the test data, while using our method we obtained a comparable accuracy of about 85%; the statistical error in the accuracy of classification using sLD relative to oLD is small. The additional computational cost of our method is $\sim O(ND)$ on top of the cost of the oLD method, where $D$ is the dimensionality of the input space and $N$ is the total number of samples. The formula for determining the weight vector (or matrix in multiple classification) is the same. When testing, the computed weight vector (or matrix in multiple classification) is multiplied directly with the input vectors, without any further scaling applied to the input vectors.
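For the multi-class case described here, the least-squares weights form a matrix obtained from one-hot target vectors; a minimal sketch of the training and test procedure is given below (ours; the MNIST loading step is omitted and the array names are illustrative).

\begin{verbatim}
import numpy as np

def fit_sld_multiclass(X, labels, K):
    """Least-squares weight matrix with scaled augmented inputs.

    X : (N, D) training images flattened to vectors.
    labels : (N,) integer class labels in {0, ..., K-1}.
    Returns W of shape (D + 1, K).
    """
    N = X.shape[0]
    T = np.eye(K)[labels]                 # one-hot target matrix of shape (N, K)
    X_aug = np.hstack([np.ones((N, 1)), X])
    X_scaled = X_aug / np.linalg.norm(X_aug, axis=1, keepdims=True)
    return np.linalg.lstsq(X_scaled, T, rcond=None)[0]

def predict_multiclass(W, X):
    """At test time the weight matrix multiplies the (unscaled) augmented inputs;
    the predicted class is the index of the largest discriminant."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X_aug @ W, axis=1)
\end{verbatim}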
VI. CONCLUSION

In this work, we presented a simple, effective way of improving the accuracy of linear classifiers that employ least squares in the presence of outliers, by defining a "scale-invariant" linear discriminant. We presented numerical results that support our proposition. The method also works when there are no outliers, making our method more versatile than the conventional approach.

Our consideration in this paper has been data classification, whose labels take discrete values. The method presented here can be adapted to regression analysis, where the "labels" (or dependent variables) take continuous values. This could help provide a solution to the outlier problem of regression analysis when using least squares.

[1] Wikipedia article, "Robust regression" (accessed 19.08.2018).
[2] G. Yuan, C. Ho, and C. Lin, Proceedings of the IEEE 100, 2584 (2012).
[3] C. Bishop, Pattern Recognition and Machine Learning, Information Science and Statistics (Springer, New York, 2006).