Optimization of distributions differences for classification
Mohammad Reza Bonyadi, Quang M. Tieng, David C. Reutens

All authors are with the Centre for Advanced Imaging (CAI), The University of Queensland, Brisbane, QLD 4072, Australia. M. R. Bonyadi ([email protected], [email protected]) is also with the Optimisation and Logistics Group, The University of Adelaide, Adelaide 5005, Australia.
Abstract—In this paper we introduce a new classification algorithm called Optimization of Distributions Differences (ODD). The algorithm aims to find a transformation from the feature space to a new space in which instances of the same class are as close as possible to one another, while the gravity centers of the classes are as far as possible from one another. This aim is formulated as a multiobjective optimization problem that is solved by a hybrid of an evolutionary strategy and the quasi-Newton method. The choice of the transformation function is flexible and could be any continuous space function. We experiment with a linear and a non-linear transformation in this paper. We show that the algorithm can outperform six other state-of-the-art classification methods, namely naive Bayes, support vector machines, linear discriminant analysis, multi-layer perceptrons, decision trees, and k-nearest neighbors, on 12 standard classification datasets. Our results show that the method is less sensitive to an imbalanced number of instances compared with these methods. We also show that ODD maintains its performance better than the other classification methods on these datasets and hence offers better generalization ability.
1 INTRODUCTION

The ultimate goal of a supervised classification method is to identify to which class a given instance belongs, based on a given set of correctly labeled instances. A classifier in this paper is defined [1] as follows:

Definition 1 (Discriminative classifier). Let $S_1, S_2, \ldots, S_c$ be sets of instances with $|S_i| = m_i$, $S_i \cap S_j = \emptyset$ for all $i, j \in \{1, \ldots, c\}$, $i \neq j$, and $\cup_{i=1}^{c} S_i = S$. A classifier $\psi_\kappa(x)$, $\psi_\kappa: S \to \mathbb{R}$, aims to guarantee $\forall i \in \{1, \ldots, c\}, \forall x \in S_i, P(\psi_\kappa(x) = i \mid x) = 1$, where $\kappa$ is a set of configurations for the procedure $\psi_\kappa(x)$ and $P$ is the probability measure.

In this paper, we consider a special case of classification problems where all members of $S$ are in $\mathbb{R}^n$ (the so-called feature space). We also assume that the feasible values for $x_i$ (called a variable throughout this paper), the $i$th element of the instance $\vec{x}$, are ordered by the operator "$\leq$" (i.e., $x_i$ is not categorical).

In reality, only a subset of $S$ is given (the training set) for which the classes are known. It is then critical to find the best $\kappa$ such that $\psi_\kappa(\vec{x})$ is the true class of any $\vec{x}$ in the training set and, ideally, of all possible instances in $S$. However, the distribution of instances in each class $S_i$ is unknown, which makes the best estimation of $\kappa$ challenging.

Classification is required in many real-world problems. Although many classification methods have been proposed to date, such as multi-layer perceptrons [2], decision trees, support vector machines [3], and extreme learning machines [4], there are still limitations associated with many of these methods. Some methods are sensitive to an imbalanced number of instances in each class [5]. Also, non-linear classification methods may outperform linear classification methods on the training set yet end up performing worse when applied to new instances, an issue known as overfitting [6].

We propose a new classification algorithm called optimization of distributions differences (ODD) in this paper. ODD aims to optimize the distributions of instances in different classes to ensure they do not overlap. ODD finds a transformation $F: \mathbb{R}^n \to \mathbb{R}^p$, where $n$ is the number of dimensions (features) of each instance and $p$ is a positive integer, such that the distance between the gravity centers of the instances in different classes is maximized while the spread of the instances within the same class is minimized. If such a transformation exists, instances can be assigned to classes based on their distances to the centers of the classes. We formulate the optimization of this transformation as a multiobjective optimization problem with two sets of objectives. The first set of objectives ensures that the centers of different classes are as far as possible from one another, defined by $c(c-1)/2$ objectives, where $c$ is the number of classes. The second set of objectives ensures that the spread of instances within each class is minimized, defined by the norm of the eigenvalues of the covariance matrix of the instances in each class, which adds $c$ extra objectives to the system. We solve this problem using a combination of evolutionary algorithms [7] and quasi-Newton methods [8]. We experiment with linear and non-linear transformations and show that the method can outperform existing classification methods when applied to 12 standard classification benchmark problems and 4 artificial classification problems.
We also show that the algorithm is not sensitive to an imbalanced number of instances in the classes, assuming that the instances given for training represent the distribution parameters of the classes to some extent. Our experiments indicate that the method outperforms both non-linear and linear classifiers in terms of generalization ability.

We structure the paper as follows: Section 2 outlines background information on the classification methods and optimization algorithms used in this paper. Section 3 provides details of our proposed method, including the model, definitions, and optimization. Section 4 reports and discusses comparative results among multiple classification methods on 12 standard benchmark classification problems; sensitivity to an imbalanced number of instances in each class and overfitting are also discussed in that section. Section 5 concludes the paper and discusses potential future directions.

2 BACKGROUND
This section provides some background information about existing classification and optimization methods.
2.1 Classification methods

In this section, we describe the classification methods that we use for comparison purposes.
2.1.1 K-nearest neighbors (KNN)

K-nearest neighbors (KNN) [9] works on the assumption that instances of each class are surrounded mostly by instances from the same class. Hence, given a set of training instances in the feature space and a scalar $k$, a given unlabeled instance is classified by assigning the label that is most frequent among the $k$ training samples nearest to that instance. Among the many measures used for the distance between instances, the Euclidean distance is the most frequently used.

2.1.2 Naive Bayes (NBY)

NBY is a classification algorithm based on Bayes' theorem [10]. The aim of the algorithm is to find the probability that a given instance $\vec{x}$ belongs to a class $c$, i.e., $P(c \mid \vec{x})$. To calculate this, NBY uses Bayes' theorem as $P(c \mid \vec{x}) = \frac{P(\vec{x} \mid c) P(c)}{P(\vec{x})}$. The values of $P(c)$, $P(\vec{x})$, and $P(\vec{x} \mid c)$ can all be estimated from the given instances in the training set. As the instance $\vec{x}$ is in fact a vector that contains multiple variables, $P(\vec{x} \mid c)$ is estimated by $P(x_1 \mid c) \times P(x_2 \mid c) \times \ldots \times P(x_n \mid c)$, which ignores the dependency among variables, a "naive" assumption (a short code sketch of this factorization is given a few paragraphs below).

2.1.3 Support vector machines (SVM)

The aim of SVM [3] is to find a hyperplane, defined by its normal vector $\vec{\omega}$, that separates two classes of instances [3]. The separation is determined by the sign of $\vec{\omega}\vec{x}^T + r$, which indicates to which side of the hyperplane the instance $\vec{x}$ belongs. In other words, given a set of instances and their classes (supervised learning), the algorithm outputs an optimal hyperplane that categorizes instances. One way to extend this algorithm to multiclass classification problems is to use the one-vs-all or one-vs-one strategies proposed in [11].

2.1.4 Multi-layer perceptron (MLP)

MLP aims to optimize the parameters of a mapping from a set of input instances to their provided outputs in order to estimate the Bayes optimal discriminant [12]. The mapping can be linear or non-linear and can be arranged in multiple layers. The algorithm minimizes the mean squared error between the generated outputs and the expected outputs for each instance. One of the most frequently used optimization methods in MLPs is Levenberg-Marquardt [13], [14], which is also used in this paper. See [15] for more details.
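To make the NBY factorization above concrete, the following is a minimal sketch. The per-variable Gaussian density is our assumption here (the text does not state which density estimator its NBY implementation uses), and all function names are hypothetical:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Fit class priors and per-variable normal densities (assumed Gaussian)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),        # prior P(c)
                     Xc.mean(axis=0),         # per-variable means
                     Xc.var(axis=0) + 1e-9)   # per-variable variances
    return params

def predict_gaussian_nb(params, x):
    """Return the class maximizing log P(c) + sum_i log P(x_i | c)."""
    def log_posterior(c):
        prior, mu, var = params[c]
        return np.log(prior) - 0.5 * np.sum(
            np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max(params, key=log_posterior)
```

The sum of per-variable log-densities is exactly the "naive" product assumption, computed in log space for numerical stability.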
2.1.5 Linear discriminant analysis (LDA)

The aim of LDA is to calculate $\vec{\omega}$ for which $\vec{\omega}\vec{x}^T > k$ if the instance $\vec{x}$ belongs to the second class. Assuming that the conditional probabilities $p(\vec{x} \mid y = 0)$ and $p(\vec{x} \mid y = 1)$ ($y$ is the label of $\vec{x}$) are both normally distributed with mean and covariance parameters $(\vec{\mu}_0, \Sigma_0)$ and $(\vec{\mu}_1, \Sigma_1)$, Fisher [16] proved that $\vec{\omega} = (\Sigma_0 + \Sigma_1)^{-1}(\vec{\mu}_1 - \vec{\mu}_0)$ and $k = \frac{1}{2}\vec{\mu}_1^T \Sigma^{-1} \vec{\mu}_1 - \frac{1}{2}\vec{\mu}_0^T \Sigma^{-1} \vec{\mu}_0$, with $\Sigma = \Sigma_0 + \Sigma_1$, can distinguish between the two classes. Considering $S_W = \Sigma_0 + \Sigma_1$ as a measure of the within-class spread and $S_B = (\vec{\mu}_1 - \vec{\mu}_0)(\vec{\mu}_1 - \vec{\mu}_0)^T$ ($T$ is the transpose operator) as a measure of the between-class spread, Fisher's value for $\vec{\omega}$ ensures that $\frac{W^T S_B W}{W^T S_W W}$ is maximized. $\vec{\omega}$ is the normal of a hyperplane that discriminates the two classes and $k$ is the shift that places this hyperplane between the two classes.

If $\Sigma_0$ and $\Sigma_1$ are small, then $(\Sigma_0 + \Sigma_1)$ becomes nearly singular and its inverse dominates, which makes the impact of $(\vec{\mu}_1 - \vec{\mu}_0)$ vanish; i.e., a solution that leads to a singular $(\Sigma_0 + \Sigma_1)$ dominates all other solutions, no matter the distance between the centers of the classes [17]. This, however, is not desirable, as the class centers must be far apart from one another for the classes to be distinguishable. This scenario occurs particularly when the number of instances in a class is smaller than the number of dimensions $n$. Also, the threshold $k$ is effective only if the distributions of the classes are similar, which might not be the case in many datasets. One way to extend this algorithm to multiclass classification problems is to use the one-vs-all or one-vs-one strategies proposed in [11]. Direct-LDA is another version of LDA that can handle multiple classes.

2.1.6 Direct-LDA

Direct-LDA is a variant of LDA that handles multiple classes [18]. The between-class spread for Direct-LDA is formulated as

$$S_B = \frac{1}{c} \sum_{i=1}^{c} (\mu_i - \bar{\mu})(\mu_i - \bar{\mu})^T \quad (1)$$

where $\mu_i$ is the average of the instances in class $i$ and $\bar{\mu}$ is the average of all $\mu_i$s. The within-class spread is defined by

$$S_W = \sum_{i=1}^{c} Cov(X_i) \quad (2)$$

where $X_i$ is an $m_i \times n$ matrix, $m_i$ is the number of instances in class $i$, each row is an instance of class $i$, and $Cov(.)$ is the covariance operator. In the multiclass case, the optimum value for $\omega$ (which is not a vector anymore but an $n \times (c-1)$ matrix) is then the first $c-1$ eigenvectors corresponding to the $c-1$ largest eigenvalues of $S_W^{-1} S_B$ (a code sketch is given a few paragraphs below). If the number of dimensions is smaller than the number of classes, the algorithm might not find an effective $\omega$ to distinguish between the classes [17]. Another limitation of this formulation is that, if the number of instances in one of the classes is smaller than the number of dimensions, the covariance matrix for that class becomes rank deficient. In addition, if the distance between two classes is large, it may dominate the spread of the classes ($S_B$) and lead to an ineffective transformation [19].

2.1.7 Decision tree (DTR)

A decision tree (DTR) is a tree structure in which each interior node corresponds to one of the input variables and each leaf represents a class label. The outgoing edges from each interior node represent the decision made for the variable values at that node. For a given instance, a path from the root of the tree that follows the values of each variable leads to the class label for that instance. The tree is trained (e.g., by the method proposed in [20]) according to the given instances in the training set.
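Returning to Direct-LDA above, a minimal NumPy sketch of Eqs. (1) and (2) follows. This is our transcription of the definitions, not a reference implementation; the pseudo-inverse is our own guard against the rank-deficiency discussed above:

```python
import numpy as np

def direct_lda(Xs):
    """Direct-LDA projection from a list of per-class instance matrices.

    S_B is built from the class means (Eq. 1) and S_W is the sum of
    per-class covariances (Eq. 2); the projection consists of the c-1
    leading eigenvectors of pinv(S_W) @ S_B.
    """
    c = len(Xs)
    mus = np.array([X.mean(axis=0) for X in Xs])
    mu = mus.mean(axis=0)
    S_B = sum(np.outer(m - mu, m - mu) for m in mus) / c
    S_W = sum(np.cov(X, rowvar=False) for X in Xs)
    vals, vecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(-vals.real)
    return vecs[:, order[:c - 1]].real   # n x (c-1) projection matrix
```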
2.2 Optimization methods

In this section, we provide brief background information about the optimization algorithms we use in this paper.
2.2.1 Quasi-Newton method (QN)

The aim of QN is to find a point in the search space at which the gradient is zero. The method assumes that the objective function can be approximated by a quadratic function around the local optimum and finds the root of the first derivative of the objective function by generalizing the secant method. We use Broyden-Fletcher-Goldfarb-Shanno (BFGS) [8] in this article to constrain the solutions of the secant equation, as this method is frequently used in the literature and provides acceptable practical performance. A finite-difference gradient approximation is usually used for objective functions whose gradient cannot be calculated analytically.
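As an illustration of BFGS with finite-difference gradients, a small SciPy sketch follows; the toy objective below is a stand-in (the paper's actual objective, defined in Section 3, would be plugged in the same way):

```python
import numpy as np
from scipy.optimize import minimize

def objective(w):
    # toy non-convex objective; a quadratic bowl plus a sinusoidal term
    return np.sum((w - np.array([1.0, -2.0, 0.5])) ** 2) + np.sin(w[0])

# With no analytic gradient supplied, SciPy's BFGS falls back to a
# finite-difference approximation, mirroring the setup described above.
result = minimize(objective, x0=np.zeros(3), method="BFGS",
                  options={"gtol": 1e-8})   # stop when the gradient is small
print(result.x, result.fun)
```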
2.2.2 Evolutionary algorithms

Evolutionary algorithms work on a population of candidate solutions that are evolved according to some rules until they converge to an optimum solution. Examples include particle swarm optimization [21] and evolutionary strategies [7], each of which has specific properties that make it advantageous or disadvantageous on various types of problems. The aim of these methods is to use the information coded in each individual of the population (of size $\lambda$) and to update the individuals to find better solutions. For example, an evolutionary strategy (ES) generates new individuals using a normal distribution with its mean at the current location of the individual and an adaptive variance, calculated based on the distribution of "good" solutions. Covariance matrix adaptation evolutionary strategy (CMAES) employs a similar idea but updates the covariance matrix of the normal distribution (rather than the variance alone) to generate new instances, which accelerates convergence to local optima. This idea takes into account the non-separability of dimensions during the optimization process and hence is more successful when the variables are interdependent. See [7] for details of these methods.
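A bare-bones sketch of the ES loop just described, assuming a simple (mu, lambda) selection scheme; real implementations, and CMAES in particular, use far more careful step-size and covariance adaptation, so this is illustrative only:

```python
import numpy as np

def evolution_strategy(f, dim, lam=50, mu=10, iters=500, seed=0):
    """Minimize f with a crude (mu, lambda) evolution strategy sketch."""
    rng = np.random.default_rng(seed)
    mean, sigma = np.zeros(dim), 1.0
    for _ in range(iters):
        # sample lambda offspring around the current mean
        pop = mean + sigma * rng.standard_normal((lam, dim))
        # keep the mu best ("good" solutions) by objective value
        best = pop[np.argsort([f(x) for x in pop])[:mu]]
        # recombine and crudely adapt the step size from their spread
        mean = best.mean(axis=0)
        sigma = 0.9 * sigma + 0.1 * best.std()
    return mean
```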
3 PROPOSED ALGORITHM

We define a classifier $\psi_\kappa$ by a tuple $\langle F, f, \Omega \rangle$: a surjective transformation $F$, a discriminator $f$, and an optimization problem $\Omega$ that aims to find the best $F$ such that $\forall i \in \{1, \ldots, c\}, \forall \vec{x} \in S_i, f(F(\vec{x})) = i$, where $f$ denotes the class index of the transformed instance $\vec{x}$. Although only $F$ and $f$ are required to classify a set of instances, the optimization problem $\Omega$ is also a very important component to ensure the efficiency of the model and the discriminator.

In this paper, we consider $F: \mathbb{R}^n \to \mathbb{R}^p$ and $f: \mathbb{R}^p \to \mathbb{R}$. It is usually assumed that the discriminator function $f$ is constant, while the parameters of the transformation $F$ are formulated into $\Omega$ and optimized through an optimization procedure. In SVM with a linear kernel, for example, the function $F$ is defined by $F(\vec{x}) = \vec{x}\vec{\omega}^T - b$, $\vec{x} \in \mathbb{R}^n$, with $f(F(\vec{x})) = sign(F(\vec{x}))$. MLP with no hidden layer assumes that $F(\vec{x}) = t(\vec{x}M_{n \times p} + \vec{b})$ and $f(F(\vec{x})) = F(\vec{x})$, where $t$ is the activation function (usually the tan or log sigmoid), and $\Omega$ is to minimize the average of $(f(F(\vec{x}_i)) - G(\vec{x}_i))^2$ over all given instances $i$, where $G(\vec{x}_i)$ is the class of $\vec{x}_i$.

Let us assume that the instances in each class $S_i$ are random variables that follow a distribution with specific moments. The optimization problem $\Omega$ for the optimization of distributions differences (ODD) algorithm is to find a transformation $F: \mathbb{R}^n \to \mathbb{R}^p$ such that the gravity centers of the instances transformed by $F$ that are in different classes are as far as possible from one another, while the distance among transformed instances that are in the same class is minimized. After optimization of $F$, a discriminator $f$ can simply be defined by the distance between the given instances and the centers of the classes. We formulate $F$, $f$, and $\Omega$ for ODD in the remainder of this section.

3.1 $\Omega$ for ODD

Let $X^{(k)}_{m_k \times n}$ include all given instances of class $k$ (a subset of $S_k$), where each row corresponds to one instance. We transform each row of this matrix by the function $F$ to form $Y^{(k)}_{m_k \times p}$. We define $\vec{a}_k$, a $p$-dimensional vector, as the center of gravity of all $m_k$ instances in $Y^{(k)}_{m_k \times p}$:

$$\vec{a}_k = \frac{1}{m_k} \sum_{i=1}^{m_k} \vec{y}_i \quad (3)$$

where $\vec{y}_i$ is the $i$th row of $Y^{(k)}_{m_k \times p}$. We also define the scalar $v_k$ as the norm of the eigenvalues of the covariance matrix of $Y^{(k)}_{m_k \times p}$:

$$v_k = \| Eig(Cov(Y^{(k)}_{m_k \times p})) \| \quad (4)$$

where $Cov(.)$ is the covariance operator and $Eig(.)$ calculates the eigenvalues of its input matrix. The value of $v_k$ indicates how the instances in class $k$ are spread around their center along their most important directions (eigenvectors). The aim of ODD is to adapt the transformation $F$ such that $v_k$ is minimized for all $k$ while the distances among all possible gravity centers are maximized. This can be formulated as a multiobjective optimization problem:

$$\Omega = \begin{cases} \text{maximize } \|\vec{a}_i - \vec{a}_j\| & \text{for all } j > i \\ \text{minimize } v_i & \text{for all } i \end{cases} \quad (5)$$

where $i, j \in \{1, \ldots, c\}$. The problem contains $c$ minimization and $c(c-1)/2$ maximization objectives. We use the following remarks to convert this multiobjective problem to a single objective problem:

Remark 1.
Let us assume that $A_i(x)$ is a function and $A_i(x) > 0$ for all $i$ and $x$. A solution that minimizes $\sum_i A_i(x)$ is on the true Pareto front of the multiobjective optimization problem "minimize $A_i(x)$ for all $i$".

Remark 2.
Let us assume that $A_i(x)$ is a function and $A_i(x) > 0$ for all $i$ and $x$. A solution that maximizes $\prod_i A_i(x)$ is on the true Pareto front of the optimization problem "maximize $A_i(x)$ for all $i$".

The proofs of both remarks are elementary and can be done by contradiction: if, say, a minimizer $x^*$ of $\sum_i A_i(x)$ were not on the Pareto front, some $x'$ would satisfy $A_i(x') \leq A_i(x^*)$ for all $i$, with strict inequality for at least one $i$, so that $\sum_i A_i(x') < \sum_i A_i(x^*)$, a contradiction.

Using Remark 1 and Remark 2, the multiobjective optimization problem defined in Eq. 5 can be transformed into the single objective optimization problem defined by:
$$\Omega = \text{minimize } \frac{\gamma + \sum_{k=1}^{c} v_k}{\left( \prod_{i=1}^{c} \prod_{j=i+1}^{c} \|\vec{a}_i - \vec{a}_j\| \right)^{\frac{2}{c(c-1)}}} \quad (6)$$

where $\gamma$ is a positive constant (set to 1 in our experiments) to ensure that, among all possible solutions for which $\sum_{k=1}^{c} v_k = 0$, the one that maximizes the distances between centers ($\prod_{i=1}^{c} \prod_{j=i+1}^{c} \|\vec{a}_i - \vec{a}_j\|$) is preferred. Because the growth rate of the denominator is factorial in $c$, we use a regulator exponent (the geometric mean over the $c(c-1)/2$ pairwise distances) to balance the growth rates of the numerator and the denominator. This balances the importance of keeping the gravity centers distinct against minimizing the spread of instances within each class. The product (rather than a simple weighted summation) forces the optimizer to find solutions that impose scattered centers for all classes. This point is extremely important as, otherwise, the optimization algorithm may find a solution that maps some of the centers close to one another while moving the remaining centers far from the others, which is not desirable.

Note that the transformation of the multiobjective optimization problem (Eq. 5) into its single objective form (Eq. 6) is effective only if the objectives are assumed to be equally important, which is the case here.

3.2 $f$ for ODD

After solving $\Omega$ (Eq. 6), we need to identify to which class a given vector $\vec{y}$ belongs. We define the function $f = \langle f_1, f_2, \ldots, f_c \rangle$ as follows:

$$f_k(\vec{y}) = \frac{D_k}{\sum_{j=1}^{c} D_j} \quad (7)$$

where $D_k = \|\vec{y} - \vec{a}_k\|$ is the distance between $\vec{y}$ and the center of class $k$. The smaller the value of $f_k$, the more likely it is that the vector $\vec{y}$ belongs to class $k$. One can calculate $1 - f_k(\vec{y})$ and then normalize the results to get probabilities of $\vec{y} \in S_k$. This leads us to the following formula for $f_k$:

$$f_k(\vec{y}) = \frac{1 - \frac{D_k}{\sum_{j=1}^{c} D_j}}{\sum_{i=1}^{c} \left(1 - \frac{D_i}{\sum_{j=1}^{c} D_j}\right)} = \frac{1}{c-1} - \frac{D_k}{(c-1)\sum_{i=1}^{c} D_i} \quad (8)$$

Clearly, $f_k(\vec{y}) \in [0, \frac{1}{c-1}]$, which can be interpreted as a measure of the probability of $\vec{y} \in S_k$. In order to convert the results to a categorical value (i.e., converting generative results to discriminative results), we use the threshold found on the training set that maximizes the area under the curve (AUC) and use that threshold to discriminate the final results in test cases. This thresholding strategy is used for all generative methods in this article (Direct-LDA, ODD, and MLP) unless specified otherwise.

3.3 $F$ for ODD

We consider the linear case for $F$ in this paper, where $F(\vec{x}) = \vec{x} \times M_{n \times p} + \vec{r}$ and $\vec{x}$ is an $n$-dimensional vector. Accordingly, $\Omega$ in Eq. 6 depends only on $M_{n \times p}$ and $\vec{r}$, which introduces $p(n+1)$ variables. In the rest of this paper, we combine $M_{n \times p}$ and $\vec{r}$ and denote the result $M_{n' \times p}$, where $n' = n + 1$. This, of course, assumes that an instance $\vec{x}_i$ is presented as $\langle x_{i1}, \ldots, x_{in}, 1 \rangle$. One can see that ODD with this setting is very similar to MLP with no hidden layer but uses a different energy function: MLP uses the mean squared error of the outputs of instances and expected classes independently, while ODD uses the idea of centrality of the instances that are in the same class.
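Putting Eqs. (3), (4), (6), and (8) together, a NumPy sketch of the objective and the discriminator follows. This is our reading of the definitions above, not the authors' Matlab implementation; the exponent mirrors the geometric-mean regulator discussed under Eq. (6):

```python
import numpy as np

def odd_objective(M, Xs, gamma=1.0, g=None):
    """Single-objective form of Eq. (6) for a candidate matrix M (n' x p).

    Xs is a list of per-class instance matrices, each row augmented with a
    trailing 1, so F(x) = x M, optionally followed by a nonlinearity g.
    """
    Ys = [X @ M if g is None else g(X @ M) for X in Xs]
    centers = [Y.mean(axis=0) for Y in Ys]                        # Eq. (3)
    v = sum(np.linalg.norm(np.linalg.eigvals(
            np.atleast_2d(np.cov(Y, rowvar=False)))) for Y in Ys)  # Eq. (4)
    c = len(Xs)
    pairs = [np.linalg.norm(centers[i] - centers[j])
             for i in range(c) for j in range(i + 1, c)]
    # geometric mean over the c(c-1)/2 pairwise center distances
    denom = np.prod(pairs) ** (2.0 / (c * (c - 1)))
    return (gamma + v) / denom

def odd_discriminator(y, centers):
    """Eq. (8): membership scores in [0, 1/(c-1)]; larger = more likely."""
    D = np.array([np.linalg.norm(y - a) for a in centers])
    return (1.0 - D / D.sum()) / (len(centers) - 1.0)
```

Passing `g=np.tanh` yields the non-linear variant tested in the experiments.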
Not all problems can be effectively transformed by a linear function. Hence, one may add nonlinear flexibility to $F$ by introducing a function $g: \mathbb{R}^p \to \mathbb{R}^p$, where $F(\vec{x}) = g(\vec{x} \times M_{n' \times p})$. If the function $g$ is nonlinear, the final model can classify instances that are nonlinearly separable. The choice of the function $g$ is problem dependent and can be made through a trial and error procedure. We will test $g(\vec{x}) = tanh(\vec{x})$ in the experiment section as a candidate to introduce non-linearity to our algorithm; this function has been frequently used for this purpose in the MLP literature. Note also that, unlike other classification methods that require specific functions for their kernel, the transformation $F$ for ODD can be defined in a more generic form, as the optimization problem $\Omega$ for ODD is solved by a derivative-free optimization method.

3.4 Solving $\Omega$

The optimization problem $\Omega$ introduced in Eq. 6 is nonlinear. It is also difficult to calculate the gradient and Hessian for this equation, which leaves us with methods that are either derivative-free or that approximate the gradient and Hessian. In addition, it is not clear whether this optimization problem is unimodal or multimodal, which makes the solution even more challenging. In this paper, we use three methods to solve this optimization problem: ES, CMAES, and QN. We do not use PSO as it does not take into account the dependency among variables [21], which is required for the purposes of this paper. The first two methods are stochastic and population-based and have a better exploration ability than the last. However, QN can converge to a local optimum faster than the other methods [22]. In this section we compare the computational complexity of these methods.

Let $F(\vec{x}) = \vec{x}M_{n \times p}$, let $n_v = np$ be the number of variables to optimize, and let $m = kn_v$, where $k$ is a constant. In order to demonstrate the time complexities in practice, we designed a random dataset in which $m = knp$ instances were sampled uniformly at random in an $n$-dimensional space and assigned to two classes randomly. We set $p = 2$ in all examples and applied ES, CMAES, and QN to the objective function of ODD, each method for 5 iterations, 50 times. The function $F$ was set as specified earlier (linear function), and $\lambda$ was set to 50 for ES and CMAES. Figure 1(a) shows the results when $k$ (the ratio $m/n_v$) was varied with $n = 100$. The figure shows that the average computation time for all methods is linear w.r.t. $k$. However, the calculations included much larger constant multipliers for QN, which makes that algorithm significantly slower than the other methods. Figure 1(b) shows the results when $n_v$ was varied with $m = 2000$. The figure shows that the time required by ES is less than that of the other methods.

Fig. 1. The average computation time (in milliseconds) for different methods at each iteration: (a) when $k$ is changed (which changes the number of instances), (b) when $n_v$ is changed.

Another important factor that contributes to the performance of the optimization method is the convexity of the search space. As there is no reason to assume that the optimization problem defined in Eq. 6 is convex, the exploration ability of the algorithm becomes important. While methods like QN are very efficient in convex spaces, they have difficulty finding good solutions in non-convex problems. In contrast, ES and CMAES have a better exploration ability, which enables them to offer better final solutions in multimodal optimization problems. It is hence beneficial to use a hybrid of ES and CMAES with QN to ensure effective exploration at the beginning of the search and better exploitation at the later stages. Hence, in all of our implementations, we used CMAES in combination with QN for small problems and ES alone for large problems (the threshold on $n_v$ was set experimentally). Although the number of iterations for these methods could be set according to $n_v$, our experiments showed that 100 iterations for CMAES, 100 for QN, and 500 for ES work efficiently for our test cases. Also, ES and CMAES are terminated if their performance has not improved by at least 0.001 over the last 20 iterations. For QN, the algorithm is terminated if the gradient value is smaller than 1e-8.

3.5 An illustrative example

Let us give an example to clarify how ODD works. Assume we are given the dataset presented in Fig. 2, which contains 140 points in two classes (70 data points in each class), where each instance $\vec{x}_i$ is 4-dimensional. (This dataset is in fact a part of the crab gender dataset [23]; only 4 of the attributes were used for this example. The dataset is used in its complete form in our experimental results.)

Fig. 2. Illustration of the instances from the crab gender dataset in four dimensions.

We set the transformation $F$ as $F(\vec{x}') = \vec{x}' M_{5 \times 2}$, where $\vec{x}'_i = \langle x_{i1}, x_{i2}, x_{i3}, x_{i4}, 1 \rangle$ and $x_{ik}$ is the $k$th element of the instance $\vec{x}_i$. We then solve the optimization problem $\Omega$ (Eq. 6) to find the matrix $M$. The transformed dataset before and after optimization of $M$ is shown in Fig. 3(a) and (b) (filled circles), respectively. The figures also indicate the centers of the distributions of the transformed instances (crosses) as well as the misclassified instances when the value of $f$ was thresholded by the arbitrary value 0.5. With this threshold, the algorithm classified 133 instances (out of 140) correctly. With an optimized threshold, this improved to 137 correctly classified cases.

Fig. 3. Application of ODD to the crab gender dataset (4 dimensions) (a) before and (b) after optimization of the parameters of $F$.

3.6 Comparison with Direct-LDA on artificial datasets

In this section we compare ODD with Direct-LDA on some artificially generated datasets to demonstrate the differences between the methods.

- Db1: contains two sets of time series of identical length (300 samples). Both series are of the form $a \sin(2\pi z t) + 10r \cos(20\pi t) + 10r \cos(400\pi t) + N(0, 30)$, where $N(0, 30)$ generates random values from a normal distribution with mean zero and standard deviation 30. The value of $a$ is picked uniformly at random from $[15, \cdot]$ for both series. The frequency $z$ is picked uniformly at random from $[65, \cdot]$ for 50 instances, while it is picked uniformly at random from $[15, \cdot]$ for 500 instances. The first 50 instances are labeled as 1 and the rest are labeled as 2. A similar set is used for testing purposes but with 500 instances from each class.

- Db2: similar to the first dataset, but this time we have 4 different frequency ranges, $[10, \cdot]$, $[30, \cdot]$, $[50, \cdot]$, and $[70, \cdot]$, in 4 classes. We place 50, 250, 500, and 10 instances from each time series for training and the same number of instances for testing.

- Db3: two sets of time series, 100 instances in the first and 1000 in the second. Each sample of each series is generated by $N(a, b)$. For the time series in the first group, either $a = 40$ and $b \in [95, \cdot]$ (uniform distribution) or $b = 80$ and $a \in [49, \cdot]$ (uniform distribution). For the time series in the second group, either $a = 40$ and $b \in [55, \cdot]$ (uniform distribution) or $b = 80$ and $a \in [29, \cdot]$ (uniform distribution). Clearly, these two sets are not distinguishable by their variance or their mean alone, but only by both at the same time (a generator sketch is given at the end of this subsection).

- Db4: was taken from "Figure 3" of [24]. It includes 788 instances in 7 classes in 2 dimensions. (The data is available online at https://cs.joensuu.fi/sipu/datasets/.)

We applied ODD and Direct-LDA to these four datasets; the results are reported in Fig. 4. For both methods we used the distance between the instances and the closest distribution center without any thresholding (function $f$, see Section 3.2). Clearly, ODD outperforms Direct-LDA on all training sets. The most important common denominator in all of these sets is that either the number of dimensions is smaller than the number of classes (e.g., Db4) or the number of instances in at least one of the classes is smaller than the number of dimensions (e.g., Db3). Both of these scenarios cause Direct-LDA to fail (the method could not find any solution for Db2 and Db4), while ODD performs fine in these scenarios. Note that the second scenario is very common in time series, i.e., a large number of samples, each representing one dimension, and a small number of instances for each class.
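The generator sketch referenced in the Db3 description follows. The interval upper bounds below are placeholders we chose for illustration (those endpoints did not survive extraction), and carrying the 300-sample length over from Db1 is also an assumption:

```python
import numpy as np

def make_db3(n_first=100, n_second=1000, length=300, seed=0):
    """Generate the two Db3 groups: each sample is N(a, b), with (a, b)
    drawn so that neither mean nor variance alone separates the groups."""
    rng = np.random.default_rng(seed)

    def series(a_fixed, b_range, a_range, b_fixed, n):
        out = []
        for _ in range(n):
            if rng.random() < 0.5:
                a, b = a_fixed, rng.uniform(*b_range)   # fixed mean branch
            else:
                a, b = rng.uniform(*a_range), b_fixed   # fixed std branch
            out.append(rng.normal(a, b, size=length))
        return np.array(out)

    g1 = series(40, (95, 120), (49, 60), 80, n_first)   # ranges assumed
    g2 = series(40, (55, 80), (29, 40), 80, n_second)   # ranges assumed
    return g1, g2
```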
4 EXPERIMENTAL RESULTS

In this section we provide comparative results of the algorithm and other state-of-the-art classification methods. We apply ODD together with the other methods to 12 standard classification problems.
Fig. 4. The accuracy (area under the curve) of Direct-LDA in comparison to ODD. A star indicates the significance of the comparison (0.05 confidence, t-test). Direct-LDA could not find any solution for Db2 and Db4.
4.1 Experimental settings

In this subsection we introduce the datasets, pre-processing, and algorithm settings used for the comparisons.
We use 12 datasets for comparison among the different classifiers, namely, Breast cancer (BC), Crab gender (CG), Glass chemical (GC), Parkinson (PR) [25], Seizure detection (SD) [26], Iris (IR), Italian wines (IW), Thyroid function (TF), Yeast dataset (YD) [27], Red wine quality (RQ) [28], White wine quality (WQ) [28], and Handwritten digits (HD). The main characteristics of these datasets are reported in Table 1. These datasets have been used frequently in previous classification studies as standard benchmarks. We used the one-vs-rest presentation for the classes; hence, the class of each instance was represented by a binary vector of length $c$.

The SD dataset includes intracranial electroencephalogram (iEEG) recordings of 12 patients (4 dogs and 8 humans) with variable numbers of channels (Table 2 shows the details of this dataset) [26]. There are 2 classes in the dataset, namely seizure (ictal) and no-seizure (interictal), with varying numbers of instances and iEEG channels for each patient. While each seizure event might take up to 60 seconds, each instance of ictal or interictal in the dataset represents only 1 second of an event from all iEEG channels. As the properties of the signals that belong to the same ictal event can be similar, including different segments of a single event in both the training and testing sets would simplify the problem. Hence, we placed all ictal segments that belonged to the same seizure event in either the testing or the training set, but not both. The segment index is available with the dataset and can be used to reconstruct signals from the same event. This procedure is commonly used for cross-validation in the seizure detection and prediction literature [26].
TABLE 1
The datasets used for comparison purposes in this paper. $n$ is the number of variables and $c$ is the number of classes in each dataset. The number of instances in each class is reported in the last column.

Dataset name | n | c | Instances per class
BC | 9 | 2 | < , >
CG | 6 | 2 | < , >
GC | 9 | 2 | < , >
PR | 22 | 2 | < , >
SD* | ? | 2 | ?
IR | 4 | 3 | < , , >
IW | 13 | 3 | < , , >
TF | 21 | 3 | < , , >
YD | 8 | 10 | < , , , , , , , , , >
RQ | 11 | 6 | < , , , , , >
WQ | 11 | 7 | < , , , , , , >
HD | 784 | 10 | < , , , , , , , , , >

*The seizure detection dataset includes 12 patients, each of whom has their own number of variables and instances in different classes. See Table 2.

TABLE 2
Details of the seizure detection (SD) dataset. The first value in the last column is the number of instances of seizure (one second each) and the second number is the number of instances of non-seizure (one second each). The seizure instances are one-second segments from different seizure events.
Patient | Channels | Sampling rate (Hz) | Instances
Subject 1 | 16 | 400 | < , >
Subject 2 | 16 | 400 | < , >
Subject 3 | 16 | 400 | < , >
Subject 4 | 16 | 400 | < , >
Subject 5 | 68 | 500 | < , >
Subject 6 | 16 | 5000 | < , >
Subject 7 | 55 | 5000 | < , >
Subject 8 | 72 | 5000 | < , >
Subject 9 | 64 | 5000 | < , >
Subject 10 | 30 | 5000 | < , >
Subject 11 | 36 | 5000 | < , >
Subject 12 | 16 | 5000 | < , >

We preprocessed the instances in the SD dataset by calculating the fast Fourier transform (FFT) of each channel and concatenating these transformed signals to generate one large signal (the FFT of the channels one after another). The length of this signal is a function of the number of channels of the iEEG device. We used frequencies from 1 to 50 Hz only, as this provides a sufficiently accurate presentation of seizure [26]. For subject 1, for example, the preprocessed signal was $16 \times 50 = 800$ samples long (16 channels, 1 Hz to 50 Hz FFT). The classifiers were trained for each patient independently. The training set was generated by selecting 70% of the seizure events randomly; the rest of the seizure events were left for testing. We also selected 70% of the non-seizure signals randomly for training and left the rest for testing. For each selected set of training signals we trained KNN, MLP, SVM, DTR, LDA, and ODD models and tested their performances on the remaining instances. This was done for 50 independent runs to decrease the impact of a possibly biased selection of the training and test sets. As the number of instances in each class of the SD dataset is imbalanced (the ratio of ictal to interictal is almost 2:19 on average), this dataset forms a good test for the models.
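A sketch of the preprocessing just described, assuming a (channels x samples) one-second clip; keeping the magnitude spectrum is our assumption, as the text does not specify which component of the complex FFT was retained:

```python
import numpy as np

def preprocess_ieeg(segment, fs):
    """Concatenate per-channel FFT content from 1 to 50 Hz.

    segment: array of shape (channels, samples) for a one-second clip
    sampled at fs Hz, so the FFT bins are spaced 1 Hz apart and the
    result has channels * 50 values (e.g., 16 * 50 = 800 for subject 1).
    """
    spectrum = np.fft.rfft(segment, axis=1)
    freqs = np.fft.rfftfreq(segment.shape[1], d=1.0 / fs)
    band = (freqs >= 1) & (freqs <= 50)
    return np.abs(spectrum[:, band]).ravel()
```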
We compare the results of ODD with NBY, SVM (we usedone-vs-all strategy to enable SVM to deal with multipleclasses), MLP (with the number of neurons in the hiddenlayer equal to p (the dimension of the transformation F in ODD) with Levenberg-Marquardt for backpropagationof weights errors), LDA (we used one-vs-all strategy toenable LDA to deal with multiple classes), DTR, and KNN(5 neighbors model). We used Matlab 2016b for imple-mentations and tests. For SVM and LDA we used linearkernels. We tested three settings for ODD that are ODD for which p = 1 and F ( (cid:126)x ) = (cid:126)xM n (cid:48) × p , ODD l for which p = c and F ( (cid:126)x ) = (cid:126)xM n (cid:48) × p , and ODD n for which p = c and F ( (cid:126)x ) = tanh ( (cid:126)xM n (cid:48) × p ) . Note that all derivatives ofODD could handle multiclass classification with no needfor external strategies.Stopping criteria for MLP was set to gradient < ODD and ODD l , the stopping criteriawas 100 iterations (constant) of CMAES and then 100 itera-tions of QN to ensure efficient exploration and exploitation.For CMAES, if the performance was not improving thenthe algorithm was terminated (improvement for the last 20iterations was smaller than 0.0001). For QN, if gradient wassmaller than 1e-8 then the algorithm was terminated. For large problems in these tests (HD and SD), we only used ESfor 500 iterations. Table 3 shows comparative results of tested methods forall datasets except for SD. The value in the row i column j shows P i,j − G i,j where P i,j is the number of datasetsfor which the ODD type indicated in column j performssignificantly (based on t-test, confidence 0.05) better thanthe method indicated in the row i and G i,j is the numberof datasets for which the method indicated in the row i performs significantly better than the ODD type indicatedin column j , for the performance measure indicated inthe ”Measure” column. For example, the value 3 in therow 3 (LDA), column 2 ( ODD n ) for the measure ”Test”indicates that the number of datasets for which ODD n performs significantly better than LDA is 3 datasets (over11) more than number of datasets for which LDA performssignificantly better than ODD n .Clearly the running time of all derivatives of ODD is sig-nificantly longer than the running time of NBY, SVM, LDA,DTR, and KNN. In comparison to MLP, however,
ODD l requires significantly less time in majority of datasets. Thiswas not as good for ODD n mainly because of the overheadof the non-linear function. For ODD , this running time wassignificantly less than MLP’s in majority of datasets.The performance of ODD l was significantly better thanall other methods in majority of datasets in the test sets (allvalues in that column are positive). The algorithm also per-formed significantly better than other methods in majorityof datasets in the training set except in comparison to MLPand DTR.The performance of ODD was significantly better thanall other methods in majority of datasets for test sets exceptin comparison to MLP and LDA. For training sets, ODD performs significantly better than KNN, LDA, SVM, andNBY in majority of datasets while DTR and MLP performbetter than ODD in majority of datasets for training set.The performance of ODD n was significantly better thanall other methods in majority of datasets in the test setsexcept in comparison to MLP, where there is a draw (for5 dataset MLP performs better while for 5 ODD n per-forms better). ODD n also performed significantly betterthan other methods in the training set in majority of datasetsexcept in comparison to MLP and DTR. Table 4 shows comparative results of tested methods for SDdataset. ODD l and ODD performed significantly betterthan all other methods in majority of subjects for the test set.For the training set, however, ODD l and ODD performedbetter than all methods except LDA and SVM for majority ofsubjects. ODD n performed better than all other methods fortraining set for majority of subjects except in comparison toSVM, LDA, and MLP. For test sets, the algorithm performssignificantly better than all other methods in majority ofsubjects except in comparison to LDA.
3. Details of this experiment are available in Appendix A.
4. Details of these experiments are available in Appendix B.
TABLE 3
Comparison results among different classification algorithms. Each column group indicates the comparative results with one type of ODD in terms of time, training set performance, and test set performance. Positive values indicate better performance (in terms of the different measures) of the ODD types.

Measure | ODD_l: Time Train Test | ODD_n: Time Train Test | ODD_1: Time Train Test
NBY | -11 11 7 | -11 10 4 | -11 7 5
SVM | -11 11 8 | -11 11 7 | -11 7 5
LDA | -11 9 2 | -11 9 3 | -11 5 -1
MLP | | |
DTR | -11 -5 7 | -11 -4 7 | -11 -7 5
KNN | -11 6 3 | -11 7 3 | -11 3 1
TABLE 4
Comparative results among different classification algorithms when they were applied to the SD dataset. The values in the table are read in the same way as in Table 3.

Measure | ODD_l: Time Train Test | ODD_n: Time Train Test | ODD_1: Time Train Test
NBY | -12 12 11 | -12 12 10 | -12 12 11
SVM | -12 -6 12 | -12 -9 12 | -12 -6 12
LDA | -12 -1 6 | -12 -5 -1 | -12 0 5
MLP | -12 3 12 | -12 -1 11 | -12 3 12
DTR | -12 10 10 | -12 6 9 | -12 10 10
KNN | -12 12 12 | -12 12 10 | -12 12 12
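For concreteness, the $P_{i,j} - G_{i,j}$ entries of Tables 3 and 4 can be tallied as below, assuming per-dataset arrays of the 50 per-run AUCs for an ODD variant and a baseline (hypothetical inputs):

```python
import numpy as np
from scipy.stats import ttest_ind

def tally(auc_odd_by_dataset, auc_other_by_dataset, alpha=0.05):
    """Count significant wins minus significant losses over datasets,
    comparing the 50 per-run AUCs of an ODD variant and a baseline."""
    p_wins = g_wins = 0
    for auc_odd, auc_other in zip(auc_odd_by_dataset, auc_other_by_dataset):
        _, p = ttest_ind(auc_odd, auc_other)
        if p < alpha:
            if np.mean(auc_odd) > np.mean(auc_other):
                p_wins += 1     # contributes to P
            else:
                g_wins += 1     # contributes to G
    return p_wins - g_wins
```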
In terms of running time, all types of ODD were significantly slower than the other methods.
4.3 Sensitivity to imbalanced classes

ODD is not sensitive to the number of instances in each class. The reason is that ODD optimizes characteristics (mean and spread) of a class that are independent of the number of instances in that class. It is easy to see that, assuming the instances in each class represent the exact mean and spread of that class, the performance of ODD is independent of the number of instances in the classes. Although in reality the instances in each class might not reflect the exact mean and spread of the class, it is expected that these parameters are preserved in the given data to some extent to enable the algorithm to perform well. A similar assumption is required by any other classification algorithm (i.e., the training set must reflect the characteristics of the data in each class); otherwise, it is impossible to design any effective classifier.

We test the between-class imbalance sensitivity of ODD using the CG dataset. We divide this dataset into two subsets, one for training and the other for testing, with 70% of the instances from the first class always present in the training set. The ratio of the instances from the second class was $r$, $r \in \{0.1, 0.2, \ldots, 0.7\}$. As the number of instances in each class is equal to 100 in this dataset, changes in $r$ change the balance of the number of instances in the training set. We ran NBY, SVM, LDA, MLP, DTR, KNN, and ODD for training and predicted the class of the remaining instances. The average performance (area under the curve, AUC) of the methods over 50 runs is reported in Fig. 5. For ODD, $p$ was set to 1 to ensure a fair comparison with SVM; hence,
$F(\vec{x}) = \vec{x}M_{n' \times 1}$, where $n' = 7$, as the number of variables in the dataset is 6.

Fig. 5. Comparative results in terms of the sensitivity to the number of instances in each class. The dashed lines are the test results, while the solid lines are the training results. ODD and MLP are affected only slightly by an imbalanced between-class number of instances. The CG dataset was used for this test.

Figure 5 shows that the performance of ODD and MLP changed only slightly when the class ratio was changed from 0.1 to 0.7. This, however, is not the case for SVM, NBY, LDA, DTR, and KNN; these methods were affected significantly by changing the imbalance ratio.

4.4 Overfitting

Overfitting is a very common issue among classification methods. It is usually measured by the extent to which the performance of a classification method generalizes to unseen instances. To quantify overfitting for a classification method, we calculate the performance extension index defined by:

$$G_{indx} = \frac{\text{Test performance}}{\text{Train performance}} \times \text{Test performance} \quad (9)$$

This index indicates to what extent the performance of the classifier on the training data generalizes to unseen data (the fraction), while also taking into account how well the algorithm performs on the test set (the second term). The second term is used in this formula to penalize algorithms that have a very low performance on both the training and test sets (e.g., 50 percent accuracy on both), even though their poor performance is extendable.

The larger the value of $G_{indx}$, the better the algorithm can generalize its performance to unseen data, taking into account how good the algorithm's performance is on the test set. We calculated $G_{indx}$ for all methods based on their results in Section 4.2. Figure 6 shows the average ranking of the algorithms based on their $G_{indx}$ value over all 11 datasets (the smaller the ranking, the better). The figure indicates that ODD$_l$ has the best ranking among these methods in terms of its $G_{indx}$. LDA is in second place with a small margin and NBY is in third place. Also, ODD$_n$ has better generalization ability compared with all the non-linear classification methods tested in this article (i.e., MLP, DTR, and KNN).

Fig. 6. Comparison results among different types of ODD and other classifiers in terms of $G_{indx}$. The vertical axis indicates the average rank of the methods in terms of their $G_{indx}$. The smaller the rank, the more successful the algorithm was in generalizing its performance according to the $G_{indx}$ measure.
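Eq. (9) transcribes directly to code; the worked numbers in the comment below are our own illustration:

```python
def generalization_index(train_perf, test_perf):
    """Eq. (9): (test / train) * test. Both arguments are AUC values in
    [0, 1]; larger G_indx means the training performance carries over
    better while the test performance itself is also high."""
    return (test_perf / train_perf) * test_perf

# e.g. train AUC 0.99 / test AUC 0.90 gives ~0.82, while the less
# overfit train AUC 0.92 / test AUC 0.90 gives ~0.88
```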
5 CONCLUSION AND FUTURE WORKS

This paper introduced a new method based on optimization of distributions differences (ODD) for classification. The aim of the algorithm is to find a transformation that minimizes the distance between instances in the same class while maximizing the distance between the gravity centers of the classes. The definition of ODD allows the use of any transformation; however, this paper only tested a linear transformation and a particular non-linear transformation. The algorithm was applied to 12 benchmark classification problems and the results were compared with state-of-the-art classification algorithms. The comparisons showed that the proposed algorithm outperforms previous methods in most datasets. The results also showed that the method is less sensitive to between-class imbalance and has a better ability to maintain its performance on instances that were not used in the training process.

This paper only experimented with reducing the number of dimensions to a value equal to the number of classes. However, the design of ODD allows flexible dimensionality reduction or increase, which could be beneficial in some datasets. Hence, as future work, we are going to design an adaptive method that gradually increases the number of dimensions in the linear transformation ($p$) to find the best size for the transformation matrix. A small transformation matrix could be beneficial especially for hardware implementations. In addition, the method could be extended to make use of multiple layers in which each layer reduces the number of dimensions, which could improve the performance of the algorithm for some classification tasks. Another interesting direction would be to change the stopping criteria and test their impact on overfitting, i.e., stop the algorithm as soon as the instances are separable into different classes.

APPENDIX A

Table 5 shows the average results of 50 independent runs for all algorithms. Each value in the table is followed by three characters. The characters in positions one, two, and three after each value at a specific row and column indicate the results of the statistical test (t-test with confidence 0.05) between the method indicated in that column and
ODD$_l$, ODD$_n$, and ODD$_1$, respectively, for the dataset in that row. "*", "-", and "+" at a position indicate that the result is statistically worse than, the same as, or better than ODD$_l$, ODD$_n$, or ODD$_1$, depending on the position of the character. For example, the value "96.37*-*" in the row "BC", measure "Test", column NBY indicates that the average performance (AUC) of the method NBY was 96.37 on the test set, which was significantly worse than ODD$_l$ and ODD$_1$ and statistically the same as ODD$_n$.

APPENDIX B

Table 6 shows the average results of 50 independent runs for all algorithms when they were applied to SD. The values in the table are followed by three characters that have the same definitions as in Table 5.

REFERENCES

[1] A. Y. Ng and M. I. Jordan, "On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes," Advances in Neural Information Processing Systems, vol. 2, pp. 841-848, 2002.
[2] S. Haykin and N. Network, "A comprehensive foundation," Neural Networks, vol. 2, no. 2004, p. 41, 2004.
[3] J. A. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, no. 3, pp. 293-300, 1999.
[4] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: theory and applications," Neurocomputing, vol. 70, no. 1, pp. 489-501, 2006.
[5] P. Branco, L. Torgo, and R. P. Ribeiro, "A survey of predictive modeling on imbalanced domains," ACM Computing Surveys (CSUR), vol. 49, no. 2, p. 31, 2016.
[6] D. M. Hawkins, "The problem of overfitting," Journal of Chemical Information and Computer Sciences, vol. 44, no. 1, pp. 1-12, 2004, PMID: 14741005.
[7] H.-G. Beyer and H.-P. Schwefel, "Evolution strategies: a comprehensive introduction," Natural Computing, vol. 1, no. 1, pp. 3-52, 2002.
[8] R. Fletcher, Practical Methods of Optimization. John Wiley & Sons, 2013.
[9] N. S. Altman, "An introduction to kernel and nearest-neighbor nonparametric regression," The American Statistician, vol. 46, no. 3, pp. 175-185, 1992.
[10] D. J. Hand and K. Yu, "Idiot's Bayes: not so stupid after all?" International Statistical Review, vol. 69, no. 3, pp. 385-398, 2001.
[11] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multiclass support vector machines," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 415-425, Mar 2002.
[12] D. W. Ruck, S. K. Rogers, M. Kabrisky, M. E. Oxley, and B. W. Suter, "The multilayer perceptron as an approximation to a Bayes optimal discriminant function," IEEE Transactions on Neural Networks, vol. 1, no. 4, pp. 296-298, 1990.
[13] K. Levenberg, "A method for the solution of certain non-linear problems in least squares," Quarterly of Applied Mathematics, vol. 2, no. 2, pp. 164-168, 1944.
[14] D. W. Marquardt, "An algorithm for least-squares estimation of nonlinear parameters," Journal of the Society for Industrial and Applied Mathematics, vol. 11, no. 2, pp. 431-441, 1963.
[15] S. K. Pal and S. Mitra, "Multilayer perceptron, fuzzy sets, and classification," IEEE Transactions on Neural Networks, vol. 3, no. 5, pp. 683-697, 1992.
TABLE 5
Comparison results among 6 well-known classification methods and ODD.
ODD$_l$ refers to a setting where $p = c$ ($c$ is the number of classes) and a linear kernel is used for $F$. ODD$_n$ uses the same value of $p$ with $F(\vec{x}) = tanh(\vec{x}M_{n' \times p})$. ODD$_1$ uses a linear $F$ with $p = 1$. The values in each row are averages over 50 runs. The row "Time" reports the average time in milliseconds over 50 runs.

Dataset | Measure
NBY | SVM | LDA | MLP | DTR | KNN | ODD_l | ODD_n | ODD_1
BC Time 20.4 12.6 10.5 991.9 8.8 8.4 99 423.7 63.2
Train 96.37*** 96.9*** 95.26*** +++ 97.96+-+ 97.55*** 97.77 97.91 97.77
Test 96.37*-* 96.49*-* 95.13*** 95.42*** 93.98*** 96.71— +++ 96.16*** 93.03*** 96.7 97.46 96.93
Test 69.93*** 94.13— +++ 94.33— 82.07*** 86.93*** 94.08 94.07 94.25
GC Time 7.9 5.2 5.2 527.8 4.8 4.5 124.1 366.1 84.2
Train 87.61*** 89.86*** 89.58*** +++ 96.93+++ 93.08*** 95.35 96.07 95.36
Test 86.58*** 87.71*** 85.7*** 86.74*** 88.73*** 87.04*** 90.44 +++ 96+++ 93.05+-+ 88.03 93.86 87.6
Test 76.64— 74.83-*- 77.64–+ 75.83-*- 78.24–+ +++ 76.29 78.35 75.12
IR Time 6.2 9.6 6 255.2 4.7 4.6 138.9 421.2 86.1
Train 97.06*** 80.7*** 98.53*** +++ 98.47*** 97.64*** 98.69 99.32 98.69
Test 96.43— 78.73*** +++ 95.7*** 96.2–* 97.03— 96.88 96.59 96.86
IW Time 10.4 9.9 5.9 83.7 4.9 4.1 316.2 719.8 136.4
Train 98.94**+ 99.31**+ 99.99+++ +++ 98.85**+ 98.26**+ 99.78 99.76 97.33
Test 97.94-++ 98.04-++ +++ 95.58**+ 91.88*** 97.48-++ 97.56 96.83 93.83
TF Time 0 229.2 10.8 8431.5 8.4 5.2 685.1 4261.6 275.6
Train N/A*** 64.35*** 62.57*** 98.3+++ +++ 70.03*** 78.4 83.49 76.92
Test N/A*** 63.8*** 61.99*** 96.5+++ +++ 64.83*** 76.71 81.72 75.5
YD Time 0 100.7 18.3 15303.7 11.1 5.6 2601.5 6299.7 454.5
Train N/A*** 64.07*** 76.74**+ +++ 82.52+++ 80.27–+ 80.19 80.43 73.87
Test N/A*** 63.61*** 76.23+-+ +++ 67.1*** 75.34-*+ 75.26 75.92 69.71
RW Time 13.1 83.4 10.2 6335.4 13.5 4.3 1386.2 3871.9 395.9
Train 64.82*** 57.58*** 62.13*** 79.36+++ +++ 66.3*** 77.75 78.06 75.62
Test 60.57*** 57.37*** 60.42*** 70.67-+- 60.16*** 58.92*** +++ 67.58*** 71.8 70.67 68.85
Test 59.33*** 52.01*** 57.66*** +++ 60.78*** 59.18*** 65.8 61.99 65.26
HD Time 92 2099.2 33.1 3222.5 127.9 7.7 20570.5 28141.2 941.2
Train 93.1*-+ 90.63**+ 92.73**+ 96.39+++ 95.95+++ +++ 93.26 93.13 78.08
Test 92.34-++ 89.23**+ 92.06**+ 92.46-++ 85.11**+ +++ 92.33 92.19 77.04

[16] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179-188, 1936.
[17] H. Gao and J. W. Davis, "Why direct LDA is not equivalent to LDA," Pattern Recognition, vol. 39, no. 5, pp. 1002-1006, 2006.
[18] H. Yu and J. Yang, "A direct LDA algorithm for high-dimensional data with application to face recognition," Pattern Recognition, vol. 34, no. 10, pp. 2067-2070, 2001.
[19] T. Li, S. Zhu, and M. Ogihara, "Using discriminant analysis for multi-class classification: an experimental investigation," Knowledge and Information Systems, vol. 10, no. 4, pp. 453-472, 2006.
[20] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986.
[21] M. R. Bonyadi and Z. Michalewicz, "Particle swarm optimization for single objective continuous space problems: a review," Evolutionary Computation, 2016.
[22] I. Loshchilov, "LM-CMA: an alternative to L-BFGS for large-scale black box optimization," Evolutionary Computation, 2015.
[23] A. Asuncion and D. Newman, "UCI machine learning repository," 2007.
[24] A. Gionis, H. Mannila, and P. Tsaparas, "Clustering aggregation," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 1, no. 1, p. 4, 2007.
[25] M. A. Little, P. E. McSharry, E. J. Hunter, J. Spielman, L. O. Ramig et al., "Suitability of dysphonia measurements for telemonitoring of Parkinson's disease," IEEE Transactions on Biomedical Engineering, vol. 56, no. 4, pp. 1015-1022, 2009.
[26] A. Temko, A. Sarkar, and G. Lightbody, "Detection of seizures in intracranial EEG: UPenn and Mayo Clinic's seizure detection challenge," in Engineering in Medicine and Biology Society (EMBC), 2015 37th Annual International Conference of the IEEE. IEEE, 2015, pp. 6582-6585.
[27] P. Horton and K. Nakai, "A probabilistic classification system for predicting the cellular localization sites of proteins," in ISMB, vol. 4, 1996, pp. 109-115.
[28] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis, "Modeling wine preferences by data mining from physicochemical properties," Decision Support Systems, vol. 47, no. 4, pp. 547-553, 2009.
[29] H. Hotelling, "Analysis of a complex of statistical variables into principal components," Journal of Educational Psychology, vol. 24, no. 6, p. 417, 1933.
[30] J. A. Hanley and B. J. McNeil, "The meaning and use of the area under a receiver operating characteristic (ROC) curve," Radiology, vol. 143, no. 1, pp. 29-36, 1982.

TABLE 6
Results for the seizure detection dataset.
Dataset | Measure
NBY | SVM | MLP | LDA | DTR | KNN | ODD_l | ODD_n | ODD_1
Subject 1 Time 371.2 24.1 1924.4 94.8 68.1 7.6 6730 7652.3 3354.1
Train 83.73*** +++ 100+++ 100+++ 99.72*** 97.35*** 100 99.83 100
Test 87.11*** 95.15*** 96.72*-* 82.81*** 89.77*** 90.72*** 97.2 96.76
Subject 2 Time 410.5 48.2 3087.3 170.1 302.2 5 10848.9 13009.4 6720.9
Train 90.58*** — 100— 100— 99*** 93.6*** 100 100 100
Test 69.65*** 79.44*** +++ 77.54*** 72.36*** 87.98*** 92.1 91.78 91.93
Subject 3 Time 827.9 791.2 8542.1 452.5 2258.6 7.5 47136.3 52129.2 33319.5
Train 91.53*** +++ 99.88++- 96.47*** 98.78*** 96.54*** 99.81 99.38 99.83
Test 91.07*** 89.44*** 95.3*-* 91*** 89.06*** 89.43*** +++ 100+++ 95.35*** 98.53*** 96.09*** 99.79 99.65 99.81
Test 83.5*** 77.13*** 73.07*** 80.52*** 76.39*** 82.61*** — 99.01— 100— 100— 98.32*** 100 100 100
Test 86.85*** 84.09*** — 79.54*** 79.91*** 87.01*** 94.74 96.43 94.66
Subject 6 Time 636.6 192.5 5614.7 294.8 680.4 6.1 25360.2 29460.1 18817.2
Train 92.65*** +++ 99.81*+* 97.9*** 97.65*** 94.34*** 99.97 99.6 99.97
Test 94.9*** 95.03*** 98.7*** 96.7*** 90.53*** 91.02*** 99.34 99.07
Subject 7 Time 1413 388.7 5461.3 600.4 367.2 5.9 31242.9 32758.9 25726.8
Train 82.76*** -+- 98.16— 100-+- 99.71*-* 90.92*** 100 99.67 100
Test 60.84*** 61.44*** +-+ 60.87*** 71.98+-+ 55.91*** 69.13 72.69 67.12
Subject 8 Time 1621.4 15.1 1619.4 127.2 42.9 4.7 11072.7 11317.2 7080.7
Train 99.47*** -+- 99.04— 100-+- 100-+- 97.5*** 100 99.86 100
Test 52*** 53.75*** 64.31*+* 56.21*-* 46.44*** 57.78*+* -+- 99.98-+- 100-+- 99.96*+* 94.44*** 100 99.65 100
Test 91.81-+- 85*** 89.66*-* 68.11*** 77.67*** 86.11*** +-+ 100+-+ 100+-+ 99.76*** 90.89*** 100 99.99 100
Test 85.87*** 85.84*** 93.31*+* 86.96*** 89.29*+* 75*** -+- 84.12*** 99.98*+* 99.42*** 96.41*** 100 99.96 100
Test 98.6*** 99.03*** 83.43*** 98.38*** 96.14*** 99.21*** +++ 99.74*+* 98.89*** 98.33*** 97.22*** 99.98 99.05 99.98
Test 90.04*** 91.77*** 97.95