Scalable Nonlinear AUC Maximization Methods
Majdi Khalid, Indrakshi Ray, and Hamidreza Chitsaz
Computer Science Department, Colorado State University, Fort Collins, USA

Abstract.
The area under the ROC curve (AUC) is a widely used measure for evaluating classification performance on heavily imbalanced data. Kernelized AUC maximization machines have established a superior generalization ability compared to linear AUC machines because of their capability in modeling the complex nonlinear structures underlying most real-world data. However, the high training complexity renders the kernelized AUC machines infeasible for large-scale data. In this paper, we present two nonlinear AUC maximization algorithms that optimize linear classifiers over a finite-dimensional feature space constructed via the k-means Nyström approximation. Our first algorithm maximizes the AUC metric by optimizing a pairwise squared hinge loss function using the truncated Newton method. However, the second-order batch AUC maximization method becomes expensive to optimize for extremely massive datasets. This motivates us to develop a first-order stochastic AUC maximization algorithm that incorporates a scheduled regularization update and scheduled averaging to accelerate the convergence of the classifier. Experiments on several benchmark datasets demonstrate that the proposed AUC classifiers are more efficient than kernelized AUC machines, while they are able to surpass or at least match the AUC performance of the kernelized AUC machines. We also show experimentally that the proposed stochastic AUC classifier is able to reach the optimal solution, while the other state-of-the-art online and stochastic AUC maximization methods are prone to suboptimal convergence.
1 Introduction

The area under the ROC curve (AUC) [11] has a wide range of applications in machine learning and data mining, such as recommender systems, information retrieval, bioinformatics, and anomaly detection [5,22,25,1,26]. Unlike the error rate, the AUC metric does not consider the class distribution when assessing the performance of classifiers. This property renders the AUC a reliable measure for evaluating classification performance on heavily imbalanced datasets [7], which are not uncommon in real-world applications.

The optimization of the AUC metric aims to learn a score function that scores a random positive instance higher than any negative instance. Therefore, the AUC metric is a threshold-independent measure. In fact, it evaluates a
classifier over all possible thresholds, hence eliminating the effect of the imbalanced class distribution. The objective function maximizing the AUC metric optimizes a sum of pairwise losses. This objective function can be solved by learning a binary classifier on pairs of positive and negative instances that constitute the difference space. Intuitively, the complexity of such algorithms increases linearly with respect to the number of pairs. However, linear ranking algorithms like RankSVM [4,21], which can optimize the AUC directly, have shown a learning complexity independent of the number of pairs.

However, the kernelized versions of RankSVM [13,4,20] are superior to linear ranking machines in terms of producing higher AUC classification accuracy. This is due to their ability to model the complex nonlinear structures that underlie most real-world data. Analogous to kernel SVM, the kernelized RankSVM machines entail computing and storing a kernel matrix, which grows quadratically with the number of instances. This hinders the efficiency of kernelized RankSVM machines for learning on large datasets.

Recent approaches attempt to scale up learning for AUC maximization from different perspectives. The first approach adopts online learning techniques to optimize the AUC on large datasets [18,32,10,9,17]. However, online methods result in inferior classification accuracy compared to batch learning algorithms. The authors of [15] develop a sparse batch nonlinear AUC maximization algorithm, which can scale to large datasets, to overcome the low generalization capability of online AUC maximization methods. However, sparse algorithms are prone to under-fitting due to the sparsity of the model, especially for large datasets. The work in [28] imputes the low generalization capability of online AUC maximization methods to the optimization of the surrogate loss function on a limited hypothesis space. Therefore, it devises a nonparametric algorithm to maximize the real AUC loss function. However, learning such a nonparametric algorithm on a high-dimensional space is not reliable.

In this paper, we address the inefficiency of learning nonlinear kernel machines for AUC maximization. We propose two learning algorithms that learn linear classifiers on a feature space constructed via the k-means Nyström approximation [31]. The first algorithm employs a linear batch classifier [4] that optimizes the AUC metric. The batch classifier is a Newton-based algorithm that requires the computation of all gradients and the Hessian-vector product in each iteration. While this learning algorithm is applicable to large datasets, it becomes expensive for training enormous datasets embedded in a large-dimensional feature space. This motivates us to develop a first-order stochastic learning algorithm that incorporates a scheduled regularization update [3] and scheduled averaging [23] to accelerate the convergence of the classifier. The integration of these acceleration techniques allows the proposed stochastic method to enjoy the low complexity of classical first-order stochastic gradient algorithms and the fast convergence rate of second-order batch methods.

The remainder of this paper is organized as follows. We begin by reviewing closely related work in Section 2. In Section 3, we define the AUC problem and present related background. The proposed methods are presented in Section 4.
The experimental results are shown in Section 5. Finally, we conclude the paper and point out future work in Section 6.
2 Related Work

The maximization of the AUC metric is a bipartite ranking problem, a special type of ranking problem. Hence, most ranking algorithms can be used to solve the AUC maximization problem. The large-scale kernel RankSVM is proposed in [20] to address the high complexity of learning kernel ranking machines. However, this method still depends quadratically on the number of instances, which hampers its efficiency. Linear RankSVM [27,4,21,2,14] is more amenable to scaling up in comparison to the kernelized variations. However, linear methods are limited to linearly separable problems. A recent study [6] explores the Nyström approximation to speed up the training of a nonlinear kernel ranking function. This work does not address the AUC maximization problem; it also does not consider the k-means Nyström method and only uses a batch ranking algorithm. Another method [15] attempts to speed up the training of nonlinear AUC classifiers by learning a sparse model constructed incrementally based on chosen criteria [16]. However, the sparsity can deteriorate the generalization ability of the classifier.

Another line of research proposes using online learning methods to reduce the training time required to optimize the AUC objective function [18,32,10,9,17]. The work in [32] addresses the complexity of pairwise learning by deploying a first-order online algorithm that maintains a buffer of fixed size for positive and negative instances. The work in [17] proposes a second-order online AUC maximization algorithm with a fixed-size buffer. The work in [10] maintains the first-order and second-order statistics for each instance instead of the buffering mechanism. Recently, the work in [30] formulated the AUC maximization problem as a convex-concave saddle point problem. The proposed algorithm in [30] solves a pairwise squared hinge loss function without the need to access buffered instances or second-order information. Therefore, it shows linear space and time complexities per iteration with respect to the number of features.

The work in [12] proposes a budget online kernel method for nonlinear AUC maximization. For massive datasets, however, the size of the budget needs to be large to reduce the variance of the model and to achieve an acceptable accuracy, which in turn increases the training time complexity. The work in [8] attempts to address the scalability problem of kernelized online AUC maximization by learning a mini-batch linear classifier on an embedded feature space. The authors explore both the Nyström approximation and random Fourier features to construct an embedding in an online setting. Despite their superior efficiency, online linear and nonlinear AUC maximization algorithms are susceptible to suboptimal convergence, which leads to inferior AUC classification accuracy.

Instead of maximizing a surrogate loss function, the authors of [28] attempt to optimize the real AUC loss function using a nonparametric learning algorithm.
However, learning the nonparametric algorithm on high-dimensional datasets is not reliable.
3 Preliminaries

We are given a training dataset $S = \{(x_i, y_i)\}_{i=1}^{n} \in \mathbb{R}^{n \times d}$, where $n$ denotes the number of instances and $d$ refers to the dimension of the data, generated from an unknown distribution $\mathcal{D}$. The label of the data is a binary class label $y \in \{-1, +1\}$. We use $n_+$ and $n_-$ to denote the number of positive and negative instances, respectively. The maximization of the AUC metric is equivalent to the minimization of the following loss function:

$$L(f; S) = \frac{1}{n_+ n_-} \sum_{i=1}^{n_+} \sum_{j=1}^{n_-} I\big(f(x_i^+) \leq f(x_j^-)\big), \qquad (1)$$

for a linear classifier $f(x) = w^T x$, where $I(\cdot)$ is an indicator function that outputs 1 if its argument is true, and 0 otherwise. The discontinuous nature of the indicator function makes the pairwise minimization problem (1) hard to optimize. It is common to replace the indicator function with a convex surrogate function as follows,

$$L(f; S) = \frac{1}{n_+ n_-} \sum_{i=1}^{n_+} \sum_{j=1}^{n_-} \ell\big(f(x_i^+) - f(x_j^-)\big)^p. \qquad (2)$$

This pairwise loss function $\ell(f(x_i^+) - f(x_j^-))$ is convex in $w$, and it upper bounds the indicator function. The pairwise loss function is the hinge loss when $p = 1$ and the squared hinge loss when $p = 2$. The optimal linear classifier $w$ for maximizing the AUC metric can be obtained by minimizing the following objective function:

$$\min_{w} \; \|w\|^2 + C \sum_{i=1}^{n_+} \sum_{j=1}^{n_-} \max\big(0,\, 1 - w^T(x_i^+ - x_j^-)\big)^p, \qquad (3)$$

where $\|w\|$ is the Euclidean norm and $C$ is the regularization hyper-parameter. Notice that the weight vector $w$ is trained on the pairs of instances $(x^+ - x^-)$ that form the difference space. This linear classifier is efficient in dealing with large-scale applications, but its modeling capability is limited to linear decision boundaries.

The kernelized AUC maximization can also be formulated as an unconstrained objective function [20,4]:

$$\min_{\beta \in \mathbb{R}^n} \; \beta^T K \beta + C \sum_{(i,j) \in A} \max\big(0,\, 1 - ((K\beta)_i - (K\beta)_j)\big)^p, \qquad (4)$$

where $K$ is the kernel matrix and $A$ is a sparse matrix that contains all possible pairs, $A \equiv \{(i, j) \mid y_i > y_j\}$. In the batch setting, the computation of the kernel costs $O(n^2 d)$ operations, while storing the kernel matrix requires $O(n^2)$ memory. Moreover, the summation over pairs costs $O(n \log n)$ [20]. These complexities make kernel machines costly to train compared to the linear model, which has linear complexity with respect to the number of instances.
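To make (1) concrete, the following sketch (hypothetical Python/NumPy code, not part of the paper; names are illustrative) computes the empirical AUC of a scoring function directly as the fraction of correctly ordered positive-negative pairs:

```python
import numpy as np

def pairwise_auc(scores_pos, scores_neg):
    """Empirical AUC: fraction of (positive, negative) pairs ranked correctly,
    i.e., 1 minus the pairwise loss in Eq. (1); ties count as errors here,
    matching the use of <= inside the indicator."""
    # n_+ x n_- matrix of score differences f(x_i^+) - f(x_j^-).
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float((diff > 0).mean())
```

Scanning all $n_+ n_-$ pairs explicitly is only feasible for small data; the algorithms discussed below avoid this quadratic cost.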
The Nyström approximation [19,31] is a popular approach to approximate the feature maps of linear and nonlinear kernels. Given a kernel function $\kappa(\cdot, \cdot)$ and landmark points $\{u_l\}_{l=1}^{v}$ generated from or randomly chosen among the input space $S$, the Nyström method approximates a kernel matrix $G$ as follows,

$$G \approx \bar{G} = E W^{-} E^T,$$

where $W_{ij} = \kappa(u_i, u_j)$ is a kernel matrix computed on the landmark points and $W^{-}$ is its pseudo-inverse. The matrix $E$, with $E_{ij} = \kappa(x_i, u_j)$, is a kernel matrix representing the intersection between the input space and the landmark points. The matrix $W$ is factorized using singular value decomposition or eigenvalue decomposition as $W = U \Sigma U^T$, where the columns of the matrix $U$ hold the orthonormal eigenvectors while the diagonal matrix $\Sigma$ holds the eigenvalues of $W$ in descending order. The Nyström approximation can be utilized to transform kernel machines into linear machines by nonlinearly embedding the input space in a finite-dimensional feature space. The nonlinear embedding for an instance $x$ is defined as follows,

$$\phi(x) = \Sigma_r^{-1/2} U_r^T \varphi(x), \quad \text{where } \varphi(x) = [\kappa(x, u_1), \ldots, \kappa(x, u_v)]^T,$$

the diagonal matrix $\Sigma_r$ holds the top $r$ eigenvalues, and $U_r$ holds the corresponding eigenvectors. The rank $r$, $r \leq v$, yields the best rank-$r$ approximation of $W$. We use the k-means algorithm to generate the landmark points [31]. This method has shown a low approximation error compared to the standard method, which selects the landmark points by uniform sampling without replacement from the input space. The complexity of the k-means algorithm is linear, $O(nvd)$, while the complexity of the singular value decomposition or eigenvalue decomposition of $W$ is $O(v^3)$. Therefore, the complexity of the k-means Nyström approximation is linear in the size of the input space.
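The construction can be summarized in a few lines of code. The following is a minimal sketch (Python with NumPy and scikit-learn assumed; the Gaussian kernel, the flooring of tiny eigenvalues, and all names are illustrative choices, not the authors' MATLAB/C implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf(A, B, gamma):
    """Gaussian kernel matrix: exp(-gamma * ||a - b||^2) for all pairs of rows."""
    sq = (A * A).sum(1)[:, None] + (B * B).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.clip(sq, 0.0, None))

def kmeans_nystrom(X, v=1600, r=None, gamma=1.0, seed=0):
    """k-means Nystrom embedding: returns the embedded data (n x r) and the
    quantities needed to embed new points."""
    U_pts = KMeans(n_clusters=v, n_init=3, random_state=seed).fit(X).cluster_centers_
    W = rbf(U_pts, U_pts, gamma)                 # v x v kernel on landmark points
    E = rbf(X, U_pts, gamma)                     # n x v kernel between data and landmarks
    vals, vecs = np.linalg.eigh(W)               # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][: (r or v)]     # keep the top-r eigenpairs
    S_r, U_r = np.maximum(vals[idx], 1e-12), vecs[:, idx]
    X_emb = E @ U_r / np.sqrt(S_r)               # row i is Sigma_r^{-1/2} U_r^T phi_bar(x_i)
    return X_emb, (U_pts, U_r, S_r, gamma)

def embed(x_new, params):
    """Map new points with the same transformation (the prediction-time mapping)."""
    U_pts, U_r, S_r, gamma = params
    return rbf(np.atleast_2d(x_new), U_pts, gamma) @ U_r / np.sqrt(S_r)
```

With this mapping in hand, any linear AUC solver can be trained on the embedded rows instead of the original kernel machine.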
4 Proposed Methods

In this section, we present the two nonlinear algorithms that maximize the AUC metric over a finite-dimensional feature space constructed using the k-means Nyström approximation [31]. First, we solve the pairwise squared hinge loss function in a batch learning mode using the truncated Newton solver [4]. For the second method, we present a stochastic learning algorithm that minimizes the pairwise hinge loss function.

Algorithm 1:
Nonlinear AUC Maximization
Embedding steps:
  Compute the centroid points $\{u_l\}_{l=1}^{v}$
  Form the matrix $W$: $W_{ij} = \kappa(u_i, u_j)$
  Compute the eigenvalue decomposition: $W = U \Sigma U^T$
  Form the matrix $E$: $E_i = \varphi(x_i) = [\kappa(x_i, u_1), \ldots, \kappa(x_i, u_v)]$
  Construct the feature space: $\phi(X) = \Sigma_r^{-1/2} U_r^T E^T$
Training:
  Learn the batch model described in Algorithm 2 or the stochastic model detailed in Algorithm 3
Prediction:
  Map a test point $x$: $\phi(x) = \Sigma_r^{-1/2} U_r^T \varphi(x)$
  Score value: $w^T \phi(x)$

The main steps of the proposed nonlinear AUC maximization methods are shown in Algorithm 1. In the embedding steps, we construct the nonlinear mapping (embedding) based on a given kernel function and landmark points. The landmark points are computed by the k-means clustering algorithm applied to the input space. Once the landmark points are obtained, the matrix $W$ and its decomposition are computed. The original input space is then mapped nonlinearly to a finite-dimensional feature space in which the nonlinear problem can be solved using linear machines.

The AUC optimization (3) can be solved for $w$ in the embedded space as follows,

$$\min_{w} \; \|w\|^2 + C \sum_{i=1}^{n_+} \sum_{j=1}^{n_-} \max\big(0,\, 1 - w^T(\phi(x_i^+) - \phi(x_j^-))\big)^p, \qquad (5)$$

where $\phi(x)$ is the nonlinear feature mapping for $x$. The minimization of (5) can be solved using truncated Newton methods [4], as shown in Algorithm 2. The matrix $A$ in Algorithm 2 is a sparse matrix of size $r \times n$, where $r$ here denotes the number of pairs. The matrix $A$ holds all possible pairs, and each row of $A$ has only two nonzero values: for each pair $(i, j)$ with $y_i > y_j$, the matrix $A$ has a $k$-th row such that $A_{ki} = 1$ and $A_{kj} = -1$. However, the complexity of this batch Newton learning depends on the number of pairs. The authors of [4] also proposed the PRSVM+ algorithm, which avoids the direct computation of pairs by reformulating the pairwise loss function in such a way that the calculations of the gradient and the Hessian-vector product are accelerated.
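To illustrate what the pairwise construction computes, the sketch below (hypothetical Python/NumPy code; it materializes every pair explicitly, which is exactly what PRSVM+ avoids, so it is only meant for small problems) evaluates the objective of (5) and its gradient on embedded data:

```python
import numpy as np

def pairwise_squared_hinge(w, X_emb, y, C=1.0):
    """Objective ||w||^2 + C * sum over pairs max(0, 1 - w^T(phi_i^+ - phi_j^-))^2
    and its gradient, using an explicit pairwise difference matrix."""
    pos, neg = np.where(y == 1)[0], np.where(y == -1)[0]
    # Rows of `diffs` play the role of A @ X_emb: one difference per (pos, neg) pair.
    diffs = X_emb[pos][:, None, :] - X_emb[neg][None, :, :]
    diffs = diffs.reshape(-1, X_emb.shape[1])
    d = np.maximum(0.0, 1.0 - diffs @ w)          # squared-hinge slack of every pair
    obj = w @ w + C * (d ** 2).sum()
    grad = 2.0 * w - 2.0 * C * diffs.T @ d
    return obj, grad
```

Such an objective/gradient pair could be handed to any off-the-shelf Newton-type or quasi-Newton solver; the point of Algorithm 2 and of PRSVM+ is to obtain the same quantities without forming the $n_+ n_-$ differences explicitly.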
Algorithm 2:
Batch Nonlinear AUC Maximization
Input: embedded data $\tilde{X}$
Output: the ranking model $w$
Initialize $w \leftarrow 0$
while stopping criterion is not satisfied do
  $D = \max(0,\, 1 - A\tilde{X}w)$
  Compute the gradient $g = w - (C D^T A \tilde{X})^T$
  Compute a search direction $s_k$ by applying conjugate gradient to solve $\nabla^2 F(w_k)\, s = -\nabla F(w_k)$
  Update $w_{k+1} = w_k + s_k$
end while

Nevertheless, the optimization of PRSVM+ to maximize the AUC metric still requires $O(n\hat{d} + 2n + \hat{d})$ operations to compute each of the gradient and the Hessian-vector product in each iteration, where $\hat{d}$ is the dimension of the embedded space. This makes the training of PRSVM+ expensive for massive datasets embedded using a large number of landmark points. A large set of landmark points is desirable to improve the approximation of the feature maps, hence boosting the generalization ability of the resulting classifier.

To address this complexity, we present a first-order stochastic method to maximize the AUC metric on the embedded space. Specifically, we optimize a pairwise hinge loss function using stochastic gradient descent accelerated by scheduling both the regularization update and the averaging step. The proposed stochastic algorithm can be seen as an averaging variant of the SVMSGD2 method proposed in [3]. Algorithm 3 describes the proposed stochastic AUC maximization method. The algorithm randomly selects a positive and a negative instance and updates the model in each iteration as follows,

$$w_{t+1} = w_t + \frac{1}{\lambda(t + t_0)}\, \ell'(w_t^T x_t)\, x_t,$$

where $\ell'(z)$ is a subgradient of the hinge loss function, the vector $x_t$ holds the difference $\phi(x_i^+) - \phi(x_j^-)$, $w_t$ is the solution after $t$ iterations, and $\frac{1}{\lambda(t + t_0)}$ is the learning rate, which decreases in each iteration. The hyper-parameter $\lambda$ can be tuned on a validation set. The positive constant $t_0$ is set experimentally, and it is utilized to prevent large steps in the first few iterations [3]. The model is regularized every $rskip$ iterations to accelerate its convergence. We also foster the acceleration of the model by implementing an averaging technique [23,29]. The intuitive idea behind the averaging step is to reduce the variance of the model that stems from its stochastic nature. We schedule the regularization update and the averaging step to be performed every $rskip$ and $askip$ iterations, respectively, as follows,

$$w_{t+1} = w_{t+1} - rskip\,(t + t_0)^{-1}\, w_{t+1},$$

$$\tilde{w}_{q+1} = \frac{q\,\tilde{w}_q + w_{t+1}}{q + 1},$$

where $\tilde{w}$ is the averaged solution after $q$ averaging steps (with respect to $askip$). The advantage of scheduling the averaging step is to reduce the per-iteration complexity while effectively accelerating the convergence.

The presented first-order stochastic AUC maximization requires $O(\hat{d}a)$ operations per iteration, in addition to the $O(\hat{d})$ operations needed for each of the regularization update and averaging steps, which occur every $rskip$ and $askip$ iterations respectively, where $a$ denotes the average number of nonzero coordinates in the embedded difference vector $x_t$.

Algorithm 3:
Stochastic Nonlinear AUC Maximization
Input: embedded data $\tilde{X}$, $\lambda$, $t_0$, $T$, $rskip$, $askip$
Output: the ranking model $w$
$w \leftarrow 0$, $\tilde{w} \leftarrow 0$, $rcount = rskip$, $acount = askip$, $q = 0$
for $t = 1, \ldots, T$ do
  Randomly pick a pair $i_t \in \{1, \ldots, n_+\}$, $j_t \in \{1, \ldots, n_-\}$
  $x_t = \tilde{x}_{i_t} - \tilde{x}_{j_t}$
  $w_{t+1} = w_t + \frac{1}{\lambda(t + t_0)}\, \ell'(w_t^T x_t)\, x_t$
  $rcount = rcount - 1$
  if $rcount \leq 0$ then
    $w_{t+1} = w_{t+1} - rskip\,(t + t_0)^{-1}\, w_{t+1}$
    $rcount = rskip$
  end if
  $acount = acount - 1$
  if $acount \leq 0$ then
    $\tilde{w}_{q+1} = \frac{q\,\tilde{w}_q + w_{t+1}}{q + 1}$
    $q = q + 1$
    $acount = askip$
  end if
end for
set $w = \tilde{w}_q$
return $w$
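A compact Python rendering of Algorithm 3 might look as follows. This is a sketch, not the authors' MATLAB/C implementation: it assumes the hinge loss $\ell(z) = \max(0, 1 - z)$, so the gradient step fires only when a sampled pair violates the margin, and all names and defaults are illustrative.

```python
import numpy as np

def nsauc(X_pos, X_neg, lam=1e-4, t0=100, T=100_000, rskip=16, askip=16, seed=0):
    """Stochastic AUC maximization with scheduled regularization and averaging,
    following Algorithm 3 on already-embedded positive/negative instances."""
    rng = np.random.default_rng(seed)
    d = X_pos.shape[1]
    w, w_avg = np.zeros(d), np.zeros(d)
    rcount, acount, q = rskip, askip, 0
    for t in range(1, T + 1):
        i, j = rng.integers(len(X_pos)), rng.integers(len(X_neg))
        x_t = X_pos[i] - X_neg[j]                 # embedded difference vector
        eta = 1.0 / (lam * (t + t0))              # decreasing learning rate
        if w @ x_t < 1.0:                         # hinge subgradient is nonzero
            w = w + eta * x_t                     # step that increases the pair's margin
        rcount -= 1
        if rcount <= 0:                           # scheduled regularization update
            w = w - rskip / (t + t0) * w
            rcount = rskip
        acount -= 1
        if acount <= 0:                           # scheduled averaging
            w_avg = (q * w_avg + w) / (q + 1)
            q += 1
            acount = askip
    return w_avg
```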
5 Experiments

In this section, we evaluate the proposed methods on several benchmark datasets and compare them with a kernelized AUC algorithm and other state-of-the-art online AUC maximization algorithms. The experiments are implemented in MATLAB, while the learning algorithms are written in the C language via MEX files. The experiments were performed on a computer equipped with an Intel 4 GHz processor and 32 GB of RAM.

The datasets we use in our experiments can be downloaded from the LibSVM website or the UCI repository (http://archive.ics.uci.edu/ml/index.php). The datasets that are not already split into training and test sets (i.e., spambase, magic04, connect-4, skin, and covtype) are randomly divided into 80% for training and 20% for testing. The features of each dataset are standardized to have zero mean and unit variance. The multi-class datasets (e.g., covtype and usps) are converted into class-imbalanced binary data by grouping the instances into two sets, where each set has the same number of class labels. To speed up the experiments that include the kernelized AUC algorithm, we train all the compared methods on 80k instances, randomly selected from the training set. The other experiments are performed on the entire training data. The characteristics of the datasets along with their imbalance ratios are shown in Table 1.

Table 1: Benchmark datasets.
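For concreteness, the preprocessing just described could be carried out along the following lines (a hypothetical Python sketch using scikit-learn utilities, not the paper's MATLAB pipeline; the exact grouping of the original classes into the two label sets is not specified in the paper and is chosen arbitrarily here):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def prepare_dataset(X, y, seed=0):
    """Standardize features, binarize multi-class labels into an imbalanced
    two-class problem, and split the data 80%/20% into training and test sets."""
    classes = np.unique(y)
    half = len(classes) // 2
    # Group the original class labels into two sets; the resulting binary
    # problem is imbalanced in terms of the number of instances per class.
    y_bin = np.where(np.isin(y, classes[:half]), 1, -1)
    X_std = StandardScaler().fit_transform(X)     # zero mean, unit variance per feature
    return train_test_split(X_std, y_bin, test_size=0.2, random_state=seed)
```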
We compare the proposed methods with kernel RankSVM and linear RankSVM, which can be used to solve the AUC maximization problem. We also include two state-of-the-art online AUC maximization algorithms. The random Fourier method, which approximates the kernel function, is also included in the experiments, where the resulting classifier is solved by linear RankSVM.

1. RBF-RankSVM:
This is the nonlinear kernel RankSVM [20]. We use the Gaussian kernel $K(x, y) = \exp(-\gamma\|x - y\|^2)$ to model the nonlinearity of the data. The best kernel width $\gamma$ is chosen by 3-fold cross validation on the training set via searching in the grid {−, . . . , −}. The regularization hyper-parameter $C$ is also tuned by 3-fold cross validation by searching in the grid {−, . . . , }. The searching grids are selected based on [20]. We also train RBF-RankSVM on a 1/ fraction of the training instances, randomly subsampled (reported as RBF-RankSVM (subsample) in the tables).
2. Linear RankSVM (PRSVM+):
This is the linear RankSVM that optimizes the squared hinge loss function using truncated Newton [4]. The best regularization hyper-parameter $C$ is chosen from the grid {−, . . . , } via 3-fold cross validation.
3. RFAUC:
This uses random Fourier features [24] to approximate the kernel function. We use PRSVM+ to solve the AUC maximization problem on the projected space. The hyper-parameters $C$ and $\gamma$ are selected via 3-fold cross validation by searching on the grids {−, . . . , } and { , , }, respectively.
4. NOAM:
This is the sequential variant of online AUC maximization [32], trained on a feature space constructed via the k-means Nyström approximation. The hyper-parameters are chosen as suggested by [32] via 3-fold cross validation. The sizes of the positive and negative buffers are set to 100.
5. NSOLAM:
This is the stochastic online AUC maximization algorithm [30], trained on a feature space constructed via the k-means Nyström approximation. The hyper-parameters of the algorithm (i.e., the learning rate and the bound on the weight vector) are selected via 3-fold cross validation by searching in the grids { } and {−, . . . , }, respectively. The number of epochs is set to 15.
6. NBAUC:
This is the proposed batch AUC maximization algorithm trained on the embedded space. We solve it using the PRSVM+ algorithm [4]. The hyper-parameter $C$ is tuned similarly to that of the linear RankSVM (PRSVM+).
7. NSAUC:
This is the proposed stochastic AUC maximization algorithm trained on the embedded space. The hyper-parameter $\lambda$ is chosen from the grid {−, . . . , −} via 3-fold cross validation.

For those algorithms that involve the k-means Nyström approximation (i.e., our proposed methods, NOAM, and NSOLAM), we compute 1600 landmark points using the k-means clustering algorithm, which is implemented in the C language. We select a Gaussian kernel function to be used with the k-means Nyström approximation. The bandwidth of the Gaussian function is set to the average squared distance between the first 80k instances and the mean computed over these 80k instances. For a fair comparison, we also set the number of random Fourier features to 1600.

The comparison of batch AUC maximization methods in terms of AUC classification accuracy on the test set is shown in Table 2, while Table 3 compares these batch methods in terms of training time. For the connect-4 dataset, the results of RBF-RankSVM are not reported because the training ran for over five days.

We observe that the proposed NBAUC outperforms the competing batch methods in terms of AUC classification accuracy. The AUC performance of RBF-RankSVM might be improved for some datasets if the best hyper-parameters were selected on a more restricted grid of values. Nevertheless, the training of NBAUC is several orders of magnitude faster than that of RBF-RankSVM. The fast training of NBAUC is clearly demonstrated on the large datasets.

The proposed NBAUC shows a robust AUC performance compared to RFAUC on most datasets. This can be attributed to the robust capability of the k-means Nyström method in approximating complex nonlinear structures. It also indicates that better generalization can be attained by capitalizing on the data to construct the feature maps, which is the main characteristic of the Nyström approximation, while random Fourier features are oblivious to the data.

We also observe that the AUC performance of both RBF-RankSVM and its variant applied to random subsamples outperforms linear RankSVM, except on the protein dataset. However, the RBF-RankSVM methods require longer training, especially for large datasets. We see that linear RankSVM performs better than the kernel AUC machines on the protein dataset, which implies that the protein dataset is linearly separable. However, the AUC performance of the proposed method NBAUC is even better than that of linear RankSVM on this dataset.
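The kernel bandwidth heuristic used for the Nyström-based methods above (the average squared distance between the first 80k instances and their mean) can be computed as in the following sketch; treating this quantity as the denominator inside the Gaussian kernel is an assumption about the exact parameterization, which the paper does not spell out.

```python
import numpy as np

def bandwidth_heuristic(X, m=80_000):
    """Average squared Euclidean distance between the first m instances and their mean."""
    Xm = X[:m]
    mu = Xm.mean(axis=0)
    return ((Xm - mu) ** 2).sum(axis=1).mean()

# One possible use: kappa(x, y) = exp(-||x - y||^2 / sigma2), with sigma2 from the heuristic.
```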
Table 2: Comparison of AUC performance for batch classifiers on the benchmark datasets.

Data        RBF-RankSVM   RBF-RankSVM (subsample)   Linear RankSVM   RFAUC    NBAUC
spambase    98.00         96.02                     97.47            97.75    98.04
usps        99.08         98.54                     90.27            97.42    99.24
magic04     92.18         91.34                     84.47            92.83    93.06
protein     80.97         77.60                     83.30            58.43    84.33
ijcnn1      99.68         99.35                     91.56            98.86    99.57
connect-4   -             91.32                     88.20            91.10    94.09
acoustic    93.60         93.02                     87.38            91.82    94.14
skin        99.92         99.92                     94.81            100      99.98
cod-rna     99.07         99.07                     98.85            99.12    99.12
covtype     93.94         94.05                     87.75            95.99    96.03
Table 3: Comparison of training time (in seconds) for batch classifiers on the benchmark datasets.
Data        RBF-RankSVM   RBF-RankSVM (subsample)   Linear RankSVM   RFAUC    NBAUC
spambase    3.08          0.10                      0.13             3.59     7.71
usps        492.30        0.83                      1.42             6.77     27.68
magic04     518.04        3.71                      0.08             21.51    25.46
protein     2614.7        4.81                      4.47             14.20    73.81
ijcnn1      15,434        282                       0.57             80.17    88.87
connect-4   -             12,701                    3.42             62.60    164.48
acoustic    134,030       5,610                     1.88             92.74    151.78
skin        2037.30       78.20                     0.20             73.18    23.71
cod-rna     5,715         255.4                     0.44             83.01    113.66
covtype     133,270       11,670                    2.54             273.67   220.90
We now compare our stochastic algorithm NSAUC with the state-of-the-art online AUC maximization methods, NOAM and NSOLAM. We also include the results of the proposed batch algorithm NBAUC for reference. The k-means Nyström approximation is implemented separately for each algorithm as introduced in Section 4. We experiment on the following large datasets: ijcnn1, connect-4, acoustic, skin, cod-rna, and covtype. Table 4 shows the comparison of the proposed methods with the online AUC maximization algorithms. Notice that the training time reported in Table 4 covers only the learning steps and excludes the embedding steps.

We can see that the proposed NSAUC achieves a competitive AUC performance compared to the proposed NBAUC, but with less training time. On the largest dataset, covtype, the AUC performance of NSAUC is on par with NBAUC, while it only requires 49.17 seconds for training compared to more than 18 minutes required by NBAUC. In contrast to the online methods, the proposed NSAUC is able to converge to the optimal solution obtained by the batch method NBAUC. We attribute the robust performance of NSAUC to the effectiveness of scheduling both the regularization update and the averaging.

We observe that the proposed NSAUC requires a longer training time on some datasets (e.g., connect-4 and acoustic) compared to the online methods; however, the difference in the training time is not significant. In addition, we see that NSOLAM performs better than NOAM in terms of AUC classification accuracy. This implies the advantage of optimizing the pairwise squared hinge loss function, performed by NSOLAM, over the pairwise hinge loss function, carried out by NOAM, for one-pass AUC maximization.
Table 4: Comparison of AUC classification accuracy and training time (in seconds) for the proposed algorithms with other online AUC maximization algorithms. The training time does not include the embedding steps.
Data        Metric          NOAM    NSOLAM   NSAUC   NBAUC
ijcnn1      AUC             98.16   98.86    99.69   99.57
            Training time   6.24    6.88     4.80    40.70
connect-4   AUC             85.96   90.60    94.04   94.08
            Training time   6.97    7.39     10.74   36.96
acoustic    AUC             89.90   91.00    94.04   94.14
            Training time   10.80   10.82    23.80   59.34
skin        AUC             99.98   99.01    99.98   99.98
            Training time   6.26    5.66     6.60    10.32
cod-rna     AUC             98.29   99.10    99.19   99.18
            Training time   42.09   47.06    34.23   148.46
covtype     AUC             91.29   92.25    96.00   96.60
            Training time   61.75   63.59    49.17   1110.44
We investigate the convergence of NSAUC and its counterpart NSOLAM with respect to the number of epochs. We also include the NSVMSGD2 algorithm [3], which minimizes the pairwise hinge loss function on a feature space constructed via the k-means Nyström approximation, as described in Section 4. The algorithm NSVMSGD2 is analogous to the proposed algorithm NSAUC, but with no averaging step. The AUC performances of these stochastic methods upon varying the number of epochs are depicted in Figure 1. We vary the number of epochs according to the grid { , , , , , , , , , , , }, and run the stochastic algorithms using the same setup described in the previous subsection. In all subfigures, the x-axis represents the number of epochs, while the y-axis is the AUC classification accuracy on the test data.

The results show that the proposed NSAUC converges to the optimal solution on all datasets. We can also see that the AUC performance of NSAUC outperforms its non-averaging variant NSVMSGD2 on four datasets (i.e., ijcnn1, cod-rna, acoustic, and connect-4), while its training time is on par with that of NSVMSGD2. This indicates the effectiveness of incorporating the scheduled averaging technique. Furthermore, the AUC performance of NSAUC does not fluctuate when varying the number of epochs on any dataset. This implies that choosing the best number of epochs is easy.

In addition, we can observe that the AUC performance of NSOLAM does not show significant improvement after the first epoch. The reason is that NSOLAM reaches a local minimum (i.e., a saddle point) in a single pass and gets stuck there.

[Figure 1 panels, legend values given as (mean training time, standard deviation) in seconds: (a) ijcnn1: NSAUC (9.27, 13.77), NSVMSGD2 (9.18, 13.67), NSOLAM (41.30, 61.52); (b) connect-4: NSAUC (10.45, 15.64), NSVMSGD2 (10.37, 15.45), NSOLAM (44.92, 66.90); (c) acoustic: NSAUC (16.07, 24.12), NSVMSGD2 (15.68, 23.31), NSOLAM (68.14, 100.70); (d) skin: NSAUC (12.00, 17.86), NSVMSGD2 (12.52, 18.60), NSOLAM (38.88, 57.91); (e) cod-rna: NSAUC (62.89, 94.42), NSVMSGD2 (62.47, 92.36), NSOLAM (277.18, 412.44); (f) covtype: NSAUC (94.75, 133.70), NSVMSGD2 (88.87, 131.98), NSOLAM (393.18, 582.65). Axes: number of epochs vs. AUC.]
Fig. 1: AUC classification accuracy of stochastic AUC algorithms with respect to the number of epochs. We randomly pick a positive and a negative instance for each iteration in NSAUC and NSVMSGD2, where n iterations correspond to one epoch. The values in parentheses denote the averaged training time (in seconds) along with the standard deviation over all epochs. The training time excludes the computational time of the embedding steps. The x-axis is displayed in log-scale.

6 Conclusion

In this paper, we have proposed scalable batch and stochastic nonlinear AUC maximization algorithms. The proposed algorithms optimize linear classifiers on a finite-dimensional feature space constructed via the k-means Nyström approximation. We solve the proposed batch AUC maximization algorithm using truncated Newton optimization, which minimizes the pairwise squared hinge loss function. The proposed stochastic AUC maximization algorithm is solved using first-order gradient descent that implements a scheduled regularization update and scheduled averaging to accelerate the convergence of the classifier. We show via experiments on several benchmark datasets that the proposed AUC maximization algorithms are more efficient than the nonlinear kernel AUC machines, while their AUC performances are comparable to or even better than those of the nonlinear kernel
AUC machines. Moreover, we show experimentally that the proposed stochastic AUC maximization algorithm outperforms the state-of-the-art online AUC maximization methods in terms of AUC classification accuracy, with a marginal increase in the training time for some datasets. We demonstrate empirically that the proposed stochastic AUC algorithm converges to the optimal solution in a few epochs, while other online AUC maximization algorithms are susceptible to suboptimal convergence. In the future, we plan to use the proposed algorithms to solve large-scale multiple-instance learning problems.
References
1. Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., Roth, D.: Generalization bounds for the area under the ROC curve. Journal of Machine Learning Research (Apr), 393–425 (2005)
2. Airola, A., Pahikkala, T., Salakoski, T.: Training linear ranking SVMs in linearithmic time using red–black trees. Pattern Recognition Letters (9), 1328–1336 (2011)
3. Bordes, A., Bottou, L., Gallinari, P.: SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research (Jul), 1737–1754 (2009)
4. Chapelle, O., Keerthi, S.S.: Efficient algorithms for ranking with SVMs. Information Retrieval (3), 201–215 (2010)
5. Chaudhuri, S., Theocharous, G., Ghavamzadeh, M.: Recommending advertisements using ranking functions (Jan 18 2016), US Patent App. 14/997,987
6. Chen, K., Li, R., Dou, Y., Liang, Z., Lv, Q.: Ranking support vector machine with kernel approximation. Computational Intelligence and Neuroscience, 4629534 (2017)
7. Cortes, C., Mohri, M.: AUC optimization vs. error rate minimization. Advances in Neural Information Processing Systems (16), 313–320 (2004)
8. Ding, Y., Liu, C., Zhao, P., Hoi, S.C.: Large scale kernel methods for online AUC maximization. In: Data Mining (ICDM), 2017 IEEE International Conference on. pp. 91–100. IEEE (2017)
9. Ding, Y., Zhao, P., Hoi, S.C., Ong, Y.S.: An adaptive gradient method for online AUC maximization. In: AAAI. pp. 2568–2574 (2015)
10. Gao, W., Jin, R., Zhu, S., Zhou, Z.H.: One-pass AUC optimization. In: ICML (3). pp. 906–914 (2013)
11. Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology (1), 29–36 (1982)
12. Hu, J., Yang, H., King, I., Lyu, M.R., So, A.M.C.: Kernelized online imbalanced learning with fixed budgets. In: AAAI. pp. 2666–2672 (2015)
13. Joachims, T.: A support vector method for multivariate performance measures. In: Proceedings of the 22nd International Conference on Machine Learning. pp. 377–384. ACM (2005)
14. Joachims, T.: Training linear SVMs in linear time. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 217–226. ACM (2006)
15. Kakkar, V., Shevade, S., Sundararajan, S., Garg, D.: A sparse nonlinear classifier design using AUC optimization. In: Proceedings of the 2017 SIAM International Conference on Data Mining. pp. 291–299. SIAM (2017)
16. Keerthi, S.S., Chapelle, O., DeCoste, D.: Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research (Jul), 1493–1515 (2006)
17. Khalid, M., Ray, I., Chitsaz, H.: Confidence-weighted bipartite ranking. In: Advanced Data Mining and Applications: 12th International Conference, ADMA 2016, Gold Coast, QLD, Australia, December 12-15, 2016, Proceedings 12. pp. 35–49. Springer (2016)
18. Kotlowski, W., Dembczynski, K.J., Huellermeier, E.: Bipartite ranking through minimization of univariate loss. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11). pp. 1113–1120 (2011)
19. Kumar, S., Mohri, M., Talwalkar, A.: Ensemble Nyström method. In: Advances in Neural Information Processing Systems. pp. 1060–1068 (2009)
20. Kuo, T.M., Lee, C.P., Lin, C.J.: Large-scale kernel RankSVM. In: Proceedings of the 2014 SIAM International Conference on Data Mining. pp. 812–820. SIAM (2014)
21. Lee, C.P., Lin, C.J.: Large-scale linear RankSVM. Neural Computation (4), 781–817 (2014)
22. Liu, T.Y.: Learning to rank for information retrieval. Foundations and Trends in Information Retrieval (3), 225–331 (2009)
23. Polyak, B.T., Juditsky, A.B.: Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30