arXiv preprint [cs.AI]

Ordering as privileged information

Tom Vacek
University of Minnesota, Minneapolis
[email protected]
June 2, 2018
Abstract
We propose to accelerate the rate of convergence of the pattern recognition task by directly minimizing the variance diameters of certain hypothesis spaces, which are critical quantities in fast-convergence results. We show that the variance diameters can be controlled by dividing hypothesis spaces into metric balls based on a new order metric. This order metric can be minimized as an ordinal regression problem, leading to a LUPI application where we take the privileged information as some desired ordering, and construct a faster-converging hypothesis space by empirically restricting some larger hypothesis space according to that ordering. We give a risk analysis of the approach. We discuss the difficulties with model selection and give an innovative technique for selecting multiple model parameters. Finally, we provide some data experiments.
Learning Using Privileged Information (LUPI, first proposed by Vapnik et al. [1]) seeks to bring in privileged information to assist the learner. This information is privileged because the learner may use it to choose a hypothesis, but it will be unavailable when making decisions based on the hypothesis. This paper proposes a LUPI method that directly minimizes the variance diameters of the hypothesis spaces under consideration, an essential quantity in the fast-convergence literature [2, 3]. The approach applies to discriminant-based hypothesis spaces where predictions are derived from thresholding the discriminant value, which should be totally orderable. We show that the discriminant ordering defines equivalence classes for the elements of the space, and these classes are directly related to the variance diameters we seek to control. If we could restrict the hypothesis space to just the good equivalence classes, we would reduce the variance diameters and improve the speed of convergence.

Selecting a good equivalence class requires some external definition of desirable order. This is privileged information. This raises a natural question:
What is a desirable order?
If ordering information were provided by an oracle according to some true distribution, then any ordering would provide desirable variance diameters. However, since an empirical ordering is the best we can hope for, a good ordering is one which has favorable convergence properties in ordinal regression. Low ordinal loss is sufficient, while more general characterizations may be possible. From another perspective, orderings according to the conditional probability P(Y | X) have great appeal, as models that provide good estimates of conditional probability allow broader application than ones that do not. We would want the ordering to correspond to a good hypothesis for the pattern recognition problem, though this is irrelevant to the rate of convergence.
We cannot hope to provide any kind of overview of the field of order statistics; web search has made this a lucrative and popular field. Nevertheless, we believe our starting point is novel:
Definition 1.
Order is any property of a set of real numbers that is invariant under any invertible increasing transformation.
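Definition 1 can be illustrated numerically: the ranks of a sample are one such order property, and applying any invertible increasing transformation leaves them unchanged. The particular transform below is our own illustrative choice, not one from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)

# Rank of each element (0 = smallest).
ranks = np.argsort(np.argsort(x))

# An invertible, strictly increasing transformation.
transformed = np.exp(3.0 * x) - 1.0
ranks_t = np.argsort(np.argsort(transformed))

# The ranks, an "order" property in the sense of Definition 1, survive.
assert np.array_equal(ranks, ranks_t)
```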
Ordering naturally defines equivalence classes on hypothesis spaces. Suppose some distribution P generates feature vectors X ∈ R^d, and suppose h_1 and h_2 are hypotheses in a space H : R^d → R. If an increasing function m exists so that for all X ∼ P, m(h_1(X)) = h_2(X), then h_1 and h_2 are in the same equivalence class.

The thread which connects order to the pattern recognition task is the growth function, which measures the number of possible labelings of a set of points of size n by a 0/1-hypothesis set H_{0/1}. We assume that H_{0/1} is defined by characteristic functions of real-valued functions, as in H_{0/1} = { 1(h(X) − t > 0) : h ∈ H : R^d → R, t ∈ R }. We observe that if Ĥ contains a restricted number of order equivalence classes, then the growth function of Ĥ is also restricted. We propose that the relationship can be made precise by the machinery of variance-based risk bounds.

As a thought experiment, consider the variance diameter of h_1 and h_2 when combined with an appropriate 0/1 loss function, and restricted to one class. More formally, suppose now that P generates feature vectors and labels Y ∈ {±1} jointly.

Definition 2.
Zero-one loss is:

l^{01}(ŷ, y) = { 1 if ŷ ≤ 0 and y > 0; 1 if ŷ > 0 and y ≤ 0; 0 otherwise }   (1)

Then for any h_1 and h_2 in the same order equivalence class, and for any t_1 and t_2, it is easy to show that

E_{P↾Y=1} | l^{01}(h_1(X) − t_1, Y) − l^{01}(h_2(X) − t_2, Y) |   (2)
≤ | E_{P↾Y=1} [ l^{01}(h_1(X) − t_1, Y) − l^{01}(h_2(X) − t_2, Y) ] |   (3)

This is exactly the relationship that is needed for fast rates.

There are two tasks required to extend the thought experiment to a real learning formulation. First, the relationship needs to apply in both classes simultaneously. The complicating factor in simply adding the two per-class relationships is that the absolute-value arguments in the two respective right-hand sides could have opposite signs and cancel out. This can happen when the decision threshold falls in very different places relative to the ordering. We fix this by requiring, as a constraint on valid solutions, that the loss in the two classes be balanced. In effect, this is a constraint on the location of the decision boundary in the equivalence class definition, preventing the situations where there is cancellation.

More significantly, there is no access to the equivalence classes based only on empirical information. This occupies the majority of the analysis in the remainder of the paper. In short, we relax the notion of the equivalence class to metric balls, and then we bound the deviation of empirical balls from their true diameter.

Footnote: The equivalence-class equality might be relaxed to holding on a set of full measure. As a matter of boilerplate, we use loss functions as commonly defined in the machine learning literature: a loss function takes two arguments, a prediction and a label, though we may omit these where obvious to avoid clutter. Loss functions are uniquely identified by a superscript, unless intended to be taken generally. The expectation of the loss E_{X,Y∼P}[l(h(X), Y)] is denoted L(h), and the empirical risk on a finite set of size n is denoted L_n(h). There are numerous further measurability and existence assumptions that we will not cover here.
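The mechanism behind (2)-(3) can be checked numerically: when h_2 = m(h_1) for an increasing m, the 0/1 loss difference on the positive class is one-signed, so the mean absolute difference coincides with the absolute mean difference. The thresholds and the monotone map below are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
h1 = rng.normal(size=1000)   # discriminant values on the class Y = +1
h2 = np.tanh(h1)             # m = tanh is increasing: same equivalence class

t1, t2 = 0.3, np.tanh(0.8)   # arbitrary thresholds for each hypothesis

# 0/1 loss on positives: an error occurs when the discriminant is below t.
loss1 = (h1 <= t1).astype(float)
loss2 = (h2 <= t2).astype(float)

diff = loss1 - loss2
# One-signed difference: E|diff| and |E diff| agree, as (2)-(3) require.
assert np.isclose(np.mean(np.abs(diff)), abs(np.mean(diff)))
```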
We define a metric on orderings so that two hypotheses are in the same equivalence class if their metric distance is 0. The metric, when shown to have favorable properties, allows us to create balls of restricted variance diameter based on a finite sample using ordinary empirical risk minimization.
Definition 3.
Let M be the set of all increasing continuous functions.

Definition 4.
Let P generate vectors X ∈ R^d, and consider any two functions h_1, h_2 : R^d → R. Then the order distance between h_1 and h_2 is

D(h_1, h_2) = sup_{t∈R} inf_{m∈M} E_P [ l^{01}( m ∘ h_1(X) − t, h_2(X) − t ) ]

The metric axioms (up to equivalence class elements) are not hard to check; the invertibility of m is indispensable here. While many definitions would satisfy the metric axioms, we chose this one for two reasons. First, it is a direct extension of the result we saw for equivalence classes: if D(h_1, h_2) ≤ d, then (3) holds with d added to the right-hand side. Second, the fact that 0/1 loss is an underlying component allows us to borrow a great deal from the standard results in machine learning.

Since we have no access to true probabilities in a statistical learning setting, to proceed we have to derive a way to get access to D. We accomplish this by extending D to measure the order distance of a function h to some ground-truth ordering instead of another function. The key observation is that functions within D of the ground-truth ordering are within 2D of each other by the triangle inequality, giving an empirical version of the equivalence classes we set out to find. We will redefine this extension of D as L^iso for clarity.

Footnote: Fast-converging bounds require the following. Denote by h′ ∈ H the minimizer of the true risk over H. The following must hold uniformly for all h ∈ H:

Var[ l(h(X), Y) − l(h′(X), Y) ] ≤ E[ l(h(X), Y) − l(h′(X), Y) ]   (4)

Our presentation is slightly different because the variance can be upper bounded by the expectation of the absolute value, and we have not specialized to h′. Note that many presentations of fast convergence can lead an inattentive reader to believe that h′ must be the Bayes rule. This is true if one uses the Mammen-Tsybakov noise conditions to establish the desired variance relationship, but not necessary if the relationship can be established another way, as we do here.

Definition 5.
Suppose P jointly generates feature vectors X and real-valued labels Y. Let H : R^d → R be some hypothesis space. Then

L^iso(h) = sup_{t∈R} inf_{m∈M} E_P [ l^{01}( m ∘ h(X) − t, Y − t ) ]

The task at hand is to analyze risk bounds for L^iso that allow us, with high probability, to identify functions where L^iso(h) ≤ D based on a finite sample. The key observation is that the infimum (m ∈ M) and supremum (t ∈ R) in the definition of L^iso can be handled by a uniform bound on the deviations of a loss function over the joint space H × M × R. We define that loss function to be l^thr:

Definition 6. l^thr(ŷ, y, t) = l^{01}(ŷ − t, y − t)

We will show that if H has bounded level VC-dimension, then the loss class for l^thr over H × M × R also has bounded VC-dimension, so we can derive uniform risk bounds for empirical risk minimization. We observe that level VC-dimension already satisfies an order invariance; that is, the inclusion of M does not have any effect on the estimated growth function.

Lemma 1.
Let MH = { m ∘ h : h ∈ H, m ∈ M }. Then the level VC-dimension of MH is the same as for H.

The proof is a simple shattering argument. Obviously,
MH ⊇ H, so the VC-dimension is not less. It is not more because any rule m ∘ h − t > 0 can be replicated by h − t′ > 0 for an appropriately chosen t′. These two observations give the following theorem:

Theorem 1.
Suppose H has level VC-dimension V. There exists a universal constant C such that uniformly for all h ∈ H the following holds with probability at least 1 − δ:

L^iso(h) ≤ sup_{t∈R} inf_{m∈M} L^thr_n(m ∘ h, t) + C sqrt( (V log(n) + log(1/δ)) / n )

The bound is a straightforward application of uniform risk bounds, provided that the VC-dimension of the loss class can be found. This is a straightforward shattering argument. Suppose H has level VC-dimension V, and suppose there exists a set {(x_i, y_i)}_{i=1}^{2V+1} and some t such that { l^thr(h(x_i), y_i, t) }_{h∈H} can attain any labeling. Considering the two sets {i : y_i > t} and {i : y_i ≤ t}, one of these has size at least V + 1 by the pigeonhole principle; call that set S. Then { h(x_i) − t > 0 }_{h∈H, i∈S} is shattered, which contradicts our assumption about the level VC-dimension of H. Hence, the VC-dimension of the loss class is bounded. Extending the argument to allow separate orderings in each class is similar; there are now four pigeonholes because we have to consider a separate threshold for each ordering, so 4V + 1 points will ensure a contradiction.

The ordinal regression bound is the substantial piece of the puzzle, but to make use of it we have to enforce class balance. Suppose that the user provides some parameter w as a weight to favor or discourage loss in one class over another.

Footnote: Level VC-dimension is an extension of VC-dimension to real-valued functions. It is the VC-dimension of { h − t : h ∈ H, t ∈ R }.

Definition 7. Loss balance, given a specified parameter w > 0, is measured by

l^B(ŷ, y; w) = { max(w, 1) if ŷ ≤ 0 and y > 0; −max(1, 1/w) if ŷ > 0 and y ≤ 0; 0 otherwise }

The VC-dimension of this loss class can be analyzed in terms of the VC-dimension of the underlying hypothesis space just as in the ordinal regression application, giving a similar bound. With this in hand, we are ready for the main theorem:
Theorem 2.
Suppose some predefined ordering is provided. Consider the subset of H that satisfies the predefined ordering up to loss d, and assume that the user provides a loss balance parameter that is empirically satisfied:

Ĥ_d = { h ∈ H : L^iso_n(h) ≤ d, L^B_n(h; w) = 0 }.

Assume that this set is not empty. Let ĥ_n be the empirical minimizer (of L_n) and ĥ′ the true minimizer (of L) over Ĥ_d. Then there exists a constant C such that the following holds uniformly for all h ∈ Ĥ_d and for all φ ≤ 1 with probability at least 1 − δ_1 − δ_2 − δ_3:

max( L(ĥ_n) − L(ĥ′), L_n(ĥ′) − L_n(ĥ_n) ) ≤ (1/(nφ)) ( C V log n + (1 + 2φ) log(1/δ_1) ) + 128 φ d + C sqrt( (V log(n) + log(1/δ_2)) / n ) + 128 φ C max(w, 1/w) sqrt( (V log(n) + log(1/δ_3)) / n )

The key to interpreting this bound is to note that one tunes φ to optimize the value in n. While the first term on the right-hand side dominates, let φ = 1; when it drops below the subsequent terms, φ can be decreased, giving an overall convergence like n^{−1/2} down to a constant times d. Obviously, a small d is important; that is to say, the privileged orderings should fit well to some h ∈ H under L^iso. However, d could be much smaller under favorable circumstances; we believe it would be possible to characterize these by extending the Mammen-Tsybakov noise conditions to L^iso. A proof is given in the appendix.

Minimizing L^iso_n

We now study methods to find inf_{h ∈ H, m ∈ M} L^iso_n(m ∘ h), assuming a linear or RKHS hypothesis space with defined level VC-dimension. We note that for a fixed h, the optimal m can be found by means of a dynamic program. A method to construct real-valued functions with defined level VC-dimension was given by Vapnik [4, p. 359]. The technique requires the model to separate each successive example by a minimum margin.
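For a fixed h, the inner infimum over m is directly computable: since m is increasing, m ∘ h(X) − t > 0 exactly when h(X) exceeds some scalar threshold s, so the infimum reduces to a minimum over a single threshold. A brute-force empirical sketch follows; the function name and grid choices are ours, not the paper's.

```python
import numpy as np

def empirical_L_iso(h_scores, y):
    """Empirical estimate of L_iso (Definition 5) for a fixed h.

    For each cut point t of the targets, the inner infimum over
    increasing m collapses to the best single threshold s on the raw
    scores; we take the worst such binary 0/1 error over cut points.
    """
    h_scores = np.asarray(h_scores, dtype=float)
    y = np.asarray(y, dtype=float)
    t_grid = np.unique(y)                                  # sup over t
    s_grid = np.concatenate(([-np.inf], np.sort(h_scores)))  # inf over s
    worst = 0.0
    for t in t_grid:
        labels = y > t                 # binary problem induced by t
        best = min(np.mean((h_scores > s) != labels) for s in s_grid)
        worst = max(worst, best)
    return worst

# Perfectly ordered scores incur zero loss at every cut point.
assert empirical_L_iso([0.1, 0.5, 0.7, 0.9], [1.0, 2.0, 3.0, 4.0]) == 0.0
```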
As is common with zero-one loss, we relax it to hinge loss to obtain a convex model. Assume that examples are sorted increasing in y_i and that there are no duplicates (duplicates require extra attention). Define

y_ij = { −1 if y_j ≤ y_i; +1 if y_j > y_i }
In the following formulation, C is a user-defined capacity control parameter and ρ = 1 can be assumed:

min_{w,ξ,ζ,l}  ||w||² + C max_i(l_i)   (5)
s.t. for i = 1 … n:   (6)
  l_i = Σ_j max( ρ − y_ij (w · x_j − ξ_i), 0 )   (7)
for i = 2 … n:   (8)
  ξ_{i−1} + ρ ≤ ξ_i   (9)

This is a convex quadratic programming problem, though it regrettably requires n² (where n is the number of examples) dummy variables to compute the max in line (7). This makes the program intractable (by machine learning standards), at least without a special solver.

The regression problem can be made more tractable by relaxing it to alternative ordinal regression formulations, such as the one proposed by Shashua & Levin [5]. Their formulation uses optimization variables to define ordered slots according to the sort order of the targets y_i, and a training example is penalized if it does not project into its slot. It can be shown that the relaxation loss is an upper bound on the unrelaxed loss. The relaxed formulation is a quadratic program with just n constraints.

The Global-Order SVM (GO-SVM) is the name for the formulation we propose. It is simply the usual SVM hypothesis space (thick hyperplanes) and loss (hinge), but it is simultaneously optimized with the Shashua & Levin [5] ordinal regression relaxation, with the SVM discriminant w constrained to be the same as the ordering hypothesis w. This constraint implements the restriction from H (the SVM hypothesis space) to Ĥ (hypotheses that satisfy an ordinal condition), as defined in Theorem 2. Loss and capacity control are traded off between the bi-objectives by means of user-selectable weights.

Capacity control in both SVM and the ordinal regression formulations is attained by the relationship between the squared norm of the predictor w and the size of the margin. In either formulation by itself, one can fix the margin size and place all capacity control in the squared norm of w, trading it off with loss.
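The ordered-slot idea of the Shashua & Levin relaxation can be sketched as a loss computation: each example should project into the interval belonging to its rank, shrunk by a margin, and is penalized by how far outside it falls. The function, variable names, and exact penalty form are our illustration, not the paper's quadratic program.

```python
import numpy as np

def ordinal_slot_loss(scores, ranks, boundaries, margin=1.0):
    """Hinge-style penalty for the ordered-slot ordinal relaxation.

    scores:     projections w . x_i of each example
    ranks:      slot index of each example (0-based)
    boundaries: sorted interval boundaries; slot k is
                (boundaries[k], boundaries[k+1])
    """
    scores = np.asarray(scores, dtype=float)
    lo = boundaries[ranks] + margin        # lower edge of each slot
    hi = boundaries[ranks + 1] - margin    # upper edge of each slot
    # Penalize falling below the lower edge or above the upper edge.
    return np.maximum(0.0, lo - scores) + np.maximum(0.0, scores - hi)

b = np.array([-np.inf, 0.0, 2.0, np.inf])   # 3 slots
s = np.array([-1.0, 1.0, 3.5])              # one score per slot
r = np.array([0, 1, 2])
loss = ordinal_slot_loss(s, r, b)           # all examples in-slot: zero loss
```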
However, w serves a two-fold role in this formulation; therefore implementing different capacities for the two learning objectives requires setting the margins. Noting that the ν-SVM formulations [6] have the margin as an optimization variable, we extended the approach so that the usual tradeoff between loss and capacity is preserved. The formulation is

min_{w,b,g,ξ≥0,ζ,ρ_b,ρ_o}  w^T w + α ( −ν_b ρ_b + (1/n) Σ_{i=1}^n ξ_i ) + (1 − α) ( −ν_o ρ_o + (1/n*) Σ_{i=1}^n |ζ_i| )   (10)
s.t. ∀i, y_i (w · x_i + b) ≥ ρ_b − ξ_i   (11)
∀i, g_{I(i)} + ρ_o ≤ w · x_i + ζ_i ≤ g_{I(i)+1} − ρ_o   (12)

Here, I is an index function that returns an in-order, unique index for each distinct oracle value in each class, and g is a vector of interval boundaries. Ordering is enforced because there are no empty intervals. The within-class ordering variant in principle requires two ordinal regressions, but in practice it can be done with a trick using the index function I by creating an empty interval.

Variable w is the linear predictor, b is a bias term, ξ is the hinge loss for the classification problem, and |ζ| is the hinge loss for the ordinal problem. Constant n* is defined to control the feasible range of ν_o; it is n − … if there are no ties in the ordering. Parameter ν_b controls the VC-dimension of the 0/1 loss class, and ν_o controls the maximum VC-dimension of each subproblem in L^thr. Finally, parameter α is related to choosing the size of d in Theorem 2, expressed in terms of the permissiveness of the ordinal loss compared to the 0/1 loss. We do not attempt to enforce loss balance, as theory tells us we should; from any computed solution, we can still apply the bound as if we had constrained the balance to the value achieved at the optimum. Moreover, we have never known the unconstrained optimum to have unreasonable loss balance, and we have no reason to expect otherwise.

Like ν-SVM [6], the optimization problem can be characterized in terms of ν_b, ν_o, and the training data.
It can be proved that the problem is primal and dual feasible for ν_o ∈ [0, 1], α ∈ [0, 1], and ν_b ∈ [0, 2 min(#positive examples, #negative examples)/n); and primal unbounded/dual infeasible otherwise. The Representer Theorem [7] holds for GO-SVM, so the solution can be expressed in terms of the dual variables, and kernels can be used. We used Matlab's interior-point convex quadratic programming solver. The baseline ν-SVM formulation was also implemented using the same solver, so that differences in numerical accuracy could not arise.

The goal of evaluation is to demonstrate that the order oracle hypothesis space allows faster convergence than a learning formulation which considers only the labels. Since GO-SVM is an extension of standard SVM, it is a logical baseline, and we compare only to that. However, these experiments are similar to other common experiments in the LUPI literature, and we will point these out to the reader where appropriate. Moreover, because of the construction of the GO-SVM hypothesis spaces, it cannot outperform SVM by virtue of a richer hypothesis space; faster convergence is the only alternative explanation. The evaluation is not intended to be a statement about the fitness of the hypothesis spaces for the learning task, but only about the ability of the learner to select the best element.

The experimental setup is to hold out a testing set and sample the remaining examples for 12 random realizations of training and validation sets. The validation set is used for a set of model selection experiments, and results are reported on the test set, which is used for all experiments. Testing sets contained at least 1800 examples. The formulations have a fixed, auto-scaling parameter ν, and we use structural risk minimization to choose from a fixed set of parameters ν = [ . …, . ]. The RBF kernel width (where used) is chosen from the [ . , . , . ]-quantiles of the pairwise distances of training points.
The kernel parameter was chosen by a hold-out validation on the SVM experiment and reused in the GO-SVM formulation to cut down the size of the model search. The α parameter in the GO-SVM method was chosen from [ . , . , . ].

The first evaluation is up/down prediction of the Mackey-Glass synthetic timeseries [8]. It was used in the LUPI setting (SVM+) in [1], where the authors used a 4-dimensional embedding (x_{t−3}, x_{t−2}, x_{t−1}, x_t) in order to predict x_{t+5} > x_t. Privileged information was a 4-dimensional embedding around the target: (x_{t+3}, x_{t+4}, x_{t+6}, x_{t+7}). The authors compared SVM+ to SVM. We were not able to replicate their results for either SVM or SVM+, which we suspect arises from the parameters used to generate the timeseries. (We used an integration step size of . , with points created every 10, delay constant τ = 17, and initial value . .) We use |x_{t+5} − x_t| as the order oracle. We use an RBF kernel for all experiments with this dataset.

The second evaluation is predicting binary survival at a fixed time from onset. We create synthetic datasets using the same procedure as Shiao & Cherkassky [9, personal communication], with noise level . and no censoring (in their procedure, censoring times follow an exponential distribution with parameter 1/. ). While censored data are an inherent aspect of survival studies, we avoid them in this case because the ordinal model can be modified to accommodate the partial information that censored examples contain; thus, it is an experiment for another day. They compare SVM, SVM+, and the Cox proportional hazards model. Privileged information for SVM+ was related to the patient's overall survival time and whether the event time is right censored (only known to be greater than some value). We use the (absolute) difference between the fixed prediction horizon and the event time for the order oracle, and we ignore whether an example is censored.
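A Mackey-Glass series of the kind described above can be generated with a few lines. The delay constant τ = 17 and sampling every 10 steps are stated in the text; the equation constants (0.2, 0.1, exponent 10) are the standard Mackey-Glass values, while the Euler scheme, step size dt, and initial value x0 below are our assumptions, since those details are elided.

```python
import numpy as np

def mackey_glass(n_points, tau=17.0, dt=0.1, x0=1.2, sample_every=10):
    """Euler integration of the Mackey-Glass delay equation
    dx/dt = 0.2 x(t - tau) / (1 + x(t - tau)^10) - 0.1 x(t).

    dt and x0 are assumed values; the paper does not state them.
    """
    delay = int(round(tau / dt))
    buf = [x0] * delay            # constant initial history on [-tau, 0]
    x = x0
    out = []
    for step in range(n_points * sample_every):
        x_tau = buf[0]            # value tau time units in the past
        x = x + dt * (0.2 * x_tau / (1.0 + x_tau ** 10) - 0.1 * x)
        buf.pop(0)
        buf.append(x)
        if (step + 1) % sample_every == 0:
            out.append(x)         # keep every 10th integration point
    return np.array(out)

series = mackey_glass(500)
```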
We consider only linear models.

The last evaluation is handwritten digit recognition, which was used by Vapnik and Vashist [1] for SVM+ and slightly adapted by Lapin et al. for their proposed LUPI method [10]. The task is to classify downsampled MNIST images based on pixel values. Lapin et al. added human-annotated confidence scores to training examples (available for download). We repeat the experiment using their data preparation and using their annotators' confidence scores as the order oracle. These experiments use an RBF kernel.

The GO-SVM formulation considers a model space of 3 dimensions: a parameter to control the complexity of the classification problem, a parameter to control the complexity of the ordinal regression problem, and a parameter that balances the loss between these two problems. All of the parameters are chosen from fixed lists, which were detailed supra. The most basic form of model selection requires choosing the best node of the grid.

We found that traditional hold-out model selection strategies are more difficult with GO-SVM. The trouble appears to be that the assumptions of structural risk minimization [4] no longer hold. In traditional SRM, the nesting of hypothesis spaces ensures that the loss expectation of the empirical risk minimizer (as a function of complexity) is coercive, making the selection of a minimum considerably more reliable than if it occurred at random. In our framework, there is no total ordering of hypothesis complexity. Some hypothesis spaces (defined by parameters) have good convergence, while others do not. The task is to differentiate them.

We analyzed loss surfaces with respect to various dimensions of the parameter selection grid. We used synthetic datasets where we could, to generate a great number of examples. We observed that, although the surfaces were not coercive, they tended to be smooth.
Since the parameters of the models have a direct interpretation, parameters that are similar should have similar performance in expectation when trained on the same set. Thus, we decided to try a Gaussian filter to smooth the validation results, and then select the minimum in a grid search. The filter was constructed a priori and was used for all the datasets in the evaluation.

In each experiment, we computed the loss on the test set that would have been found by each of three methods:

1. A standard holdout, with the held-out validation set the same size as the training set. One might use cross-validation in practice. This is called 'unsmooth.'
2. The Gaussian smoothing technique using the same holdout. This is called 'smoothed.'
3. A very large 'oracle' validation set which reveals how good the best hypothesis space is. This is called 'extended.'

Table 1: Results (error rate) for all experiments. Winners are reported excluding the extended validation experiments.

experiment               | ν-SVM std   | ν-SVM ext   | GO-SVM non-smooth | GO-SVM smoothed | GO-SVM extended
Mackey-Glass (20, 4000)  | .289 (.096) | .220 (.065) | .156 (.105)       | .163 (.118)     | .110 (.098)
Mackey-Glass (50, 4000)  | .117 (.040) | .096 (.032) | .058 (.024)       | .045 (.016)     | .035 (.014)
Survival (20, 1000)      | .409 (.030) | .399 (.206) | .391 (.035)       | .397 (.030)     | .357 (.030)
Survival (40, 1000)      | .322 (.032) | .311 (.029) | .300 (.027)       | .287 (.024)     | .275 (.026)
Survival (100, 1000)     | .243 (.020) | .235 (.014) | .221 (.023)       | .220 (.015)     | .207 (.009)
Digits (60, 2500)        | .113 (.022) | .108 (.020) | .109 (.020)       | .108 (.022)     | .102 (.021)
Digits (80, 2500)        | .093 (.013) | .089 (.007) | .094 (.017)       | .089 (.008)     | .082 (.007)

The Gaussian filter was of size 5x5x3, with the smallest dimension corresponding to the α parameter. (In experiments using a kernel parameter, we used one found via the SVM model search, so this was not a grid search parameter.) It has the property that any projection along the coordinate directions of the filter is Gaussian.
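A minimal sketch of such separable Gaussian smoothing of a validation-score tensor, using replicate padding so the tensor keeps its shape, follows. The filter width σ and the helper names are our assumptions; the paper specifies only the 5x5x3 filter size.

```python
import numpy as np

def gauss1d(size, sigma):
    """Normalized 1-D Gaussian kernel of odd length `size`."""
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth_validation_tensor(t, sizes=(5, 5, 3), sigma=1.0):
    """Separable Gaussian smoothing with replicate ('edge') padding,
    so the output tensor has the same shape as the input."""
    out = np.asarray(t, dtype=float)
    for axis, size in enumerate(sizes):
        k = gauss1d(size, sigma)
        pad = [(0, 0)] * out.ndim
        pad[axis] = (size // 2, size // 2)
        padded = np.pad(out, pad, mode="edge")   # constant extrapolation
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="valid"), axis, padded)
    return out

# Smooth a (toy) 6x6x3 grid of validation errors, then pick the minimum.
scores = np.random.default_rng(2).random((6, 6, 3))
smoothed = smooth_validation_tensor(scores)
best = np.unravel_index(np.argmin(smoothed), smoothed.shape)
```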
This filter was convolved with the tensor of validation scores using zero-degree smooth extrapolation; that is, the tensor is padded out with constants equal to the nearest true element of the tensor. The convolution gives a new tensor with the same dimensions as the un-smoothed tensor.

We also considered two alternative validation scenarios: first, selecting a model based on the un-smoothed tensor, and second, investigating the effect of having a much larger validation set available. The large validation set is intended to point out the gap between the best hypothesis spaces that can be created using the ordinal constraint technique and the one which can in practice be selected. We point out that many LUPI research papers require validation sets that would not ordinarily be a reasonable split between training and testing data (for example [11, 1]). Each row of the table gives the size of the training and test sets. The columns give the model selection procedure.

A table of results is given in Table 1. As a reminder, sizes for training and testing are given in parentheses with the experiment name. The std, non-smooth, and smoothed methods used a validation set the same size as the training set. We note first that the gap between the extended validation model selection and the performance of the typical technique is larger for GO-SVM than for standard SVM. This is a blessing in that we have the opportunity to find a better model, but also a curse in that the variance is higher. It appears that this strain of LUPI methods is bound by model selection issues. The Gaussian smoothing approach seems to have been effective on the Digits dataset, and certainly did not hinder performance significantly where the un-smoothed model selection turned out to be superior.

The Mackey-Glass dataset is the only one which has strongly significant results.
Although the gains in the other datasets are small, the fact that they are supported by theory implies they should not be overlooked. Moreover, they are consistent with results reported by other authors. There are comparisons with other works for the Mackey-Glass and Digits experiments. The Mackey-Glass experiment appeared in the original SVM+ paper [1]. They report, based on a training set of 100 examples, that SVM had an error rate of .052, whereas the best SVM+ formulation was at .048. Furthermore, this was based on a validation set of size 500. We have reached that level of performance with considerably less data. The Digits experiment is intended to replicate one in [11]. We specifically replicated the experiment in which conditional probability weights were created by humans with an intent to help a machine. This task is well-suited to the order invariance that GO-SVM is built on, as humans have a fundamentally ordinal notion of confidence. In that study, at a sample size of 80, the difference between the best and worst methods under study was about .01, that is, .073 compared to .083 (approximately). The size of the validation set they used would be comparable to our extended experiment. Their best method, however, did not use human weights. In their experiment, the human weights information improved over SVM by about .003, whereas our gain is .006.

In conclusion, the fact that the formulation can find faster-converging models than formulations which do not consider order information supports the underlying theory. It appears likely that order information is helpful in scenarios where the prediction task discretizes some continuous attribute, such as in the timeseries and survival prediction tasks.
The original SVM+ paper [1] touched off a fair amount of research in the area. Most research, with limited exceptions, has focused on developing and evaluating formulations [12, 13, 14, 15] rather than attempting to develop theory to understand when and why such a technique might be useful.

Pechyony et al. [16] analyze the SVM+ algorithm in terms of variance bounds. While it shares with this work a major emphasis on variance bounds, that work considers the SVM+ loss function as given and derives bounds for it, whereas this paper works the other way, attempting to derive a formulation based on the bound.

Lapin et al. [11] propose weighting examples based on class conditional probability, which is most directly similar to the ideas proposed here. Intuitively, the method encourages a learner to prioritize performance on the easy examples over the hard examples. Unfortunately, the theoretical motivation for departing from empirical risk minimization takes a tenuous path through SVM+ [1, 16], namely that SVM+ is reducible to weighted learning. The heart of their method is a loss function based on weights interpreted as conditional probability; however, a theoretical analysis is not provided. Ours is somewhat more general in allowing order invariances.
References

[1] Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22:544–557, July 2009.
[2] Enno Mammen and Alexandre B. Tsybakov. Smooth discrimination analysis. Ann. Statist., 27(6):1808–1829, 1999.
[3] Alexander B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32(1):135–166, 2004.
[4] V. N. Vapnik. Statistical Learning Theory. Adaptive and learning systems for signal processing, communications, and control. Wiley, 1998.
[5] Amnon Shashua and Anat Levin. Ranking with large margin principle: Two approaches. In Advances in Neural Information Processing Systems 15 (NIPS 2002), pages 937–944, 2002.
[6] Bernhard Schölkopf, Alex J. Smola, Robert C. Williamson, and Peter L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207–1245, May 2000.
[7] Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In David Helmbold and Bob Williamson, editors, Computational Learning Theory, volume 2111 of Lecture Notes in Computer Science, pages 416–426. Springer Berlin Heidelberg, 2001.
[8] S. Mukherjee, E. Osuna, and F. Girosi. Nonlinear prediction of chaotic time series using support vector machines. In Neural Networks for Signal Processing VII: Proceedings of the 1997 IEEE Workshop, pages 511–520, Sep 1997.
[9] Han-Tai Shiao and Vladimir Cherkassky. Learning using privileged information (LUPI) for modeling survival data. In , pages 1042–1049, 2014.
[10] Maksim Lapin, Matthias Hein, and Bernt Schiele. Learning using privileged information: SVM+ and weighted SVM. Neural Networks, 53:95–108, 2014.
[11] Maksim Lapin, Matthias Hein, and Bernt Schiele. Learning using privileged information: SVM+ and weighted SVM. Neural Networks, 53:95–108, 2014.
[12] Jixu Chen, Xiaoming Liu, and Siwei Lyu. Boosting with side information. In Computer Vision – ACCV 2012: 11th Asian Conference on Computer Vision, Daejeon, Korea, November 5-9, 2012, Revised Selected Papers, Part I, pages 563–577, 2012.
[13] Ziheng Wang, Tian Gao, and Qiang Ji. Learning with hidden information using a max-margin latent variable model. In Pattern Recognition (ICPR), 2014 22nd International Conference on, pages 1389–1394, Aug 2014.
[14] Shereen Fouad, Peter Tino, Somak Raychaudhury, and Petra Schneider. Learning using privileged information in prototype based models. In Alessandro E. P. Villa, Włodzisław Duch, Péter Érdi, Francesco Masulli, and Günther Palm, editors, Artificial Neural Networks and Machine Learning – ICANN 2012, volume 7553 of Lecture Notes in Computer Science, pages 322–329. Springer Berlin Heidelberg, 2012.
[15] Ziheng Wang and Qiang Ji. Classifier learning with hidden information. In Computer Vision and Pattern Recognition, IEEE Conference on, 2015.
[16] D. Pechyony and V. Vapnik. On the theory of learning with privileged information. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, 2010.
[17] Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
[18] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Ann. Statist., 33(4):1497–1537, 2005.
The first result is a bridge between the variance conditions and the metric balls of hypotheses we can actually define. It shows that the uniform variance conditions can be relaxed by a small additive constant, denoted here $d(n)$, at the expense of the rate of convergence when the bound is small.

Theorem 3.
Suppose there is a loss class $\mathcal{F} = \{\, l \circ h : h \in \mathcal{H} \,\}$ with VC-dimension $V$, and let $f'$ attain $\inf_{f \in \mathcal{F}} \mathbb{E}[f]$. There is a uniform constant $C$ such that if the following holds uniformly for all $f \in \mathcal{F}$,
$$\operatorname{Var}[f - f'] \le \frac{1}{h}\, \mathbb{E}[f - f'] + d(n),$$
then for any $\phi \le h \le 1$, the following holds uniformly with probability $1 - \delta$:
$$\max\left( \mathbb{E}[f_n - f'],\; \mathbb{E}_n[f' - f_n] \right) \le \frac{1}{n\phi} \left( C V \log n + (1 + 2\phi) \log \frac{1}{\delta} \right) + 32\, \phi\, d(n).$$

Proof.
We follow the definitions and notation of Boucheron et al. [17, Theorem 5.5]. In their notation it is straightforward to show that $w(r) \le \sqrt{r/\phi + d(n)}$. For VC classes, it can be proved that $\psi(x) \le C x \sqrt{(V/n) \log n}$ [18]. The risk bound depends on the solution of a fixed-point equation: let $\epsilon^*$ be the solution of $r = \psi(w(r))$, and let
$$\epsilon' = \frac{C^2 V \log n}{n \phi} + \phi\, d(n).$$
The following analysis shows that $\epsilon' \ge \psi(w(\epsilon'))$, which implies $\epsilon^* \le \epsilon'$:
\begin{align}
\psi(w(\epsilon')) &\le C \left( \frac{C^2}{\phi^2} \frac{V \log n}{n} + 2 d(n) \right)^{1/2} \left( \frac{V \log n}{n} \right)^{1/2} \tag{13} \\
&\le C \left[ \left( \frac{C^2}{\phi^2} \frac{V \log n}{n} \right)^{1/2} + \frac{1}{2} \left( \frac{C^2}{\phi^2} \frac{V \log n}{n} \right)^{-1/2} 2 d(n) \right] \left( \frac{V \log n}{n} \right)^{1/2} \tag{14} \\
&= \frac{C^2 V \log n}{n \phi} + \phi\, d(n) = \epsilon'. \tag{15}
\end{align}
At step 14 we used that the first-order approximation of the square root about its first term is an upper bound, by concavity. The bound $\epsilon'$ can be substituted wholesale into the bound given by Boucheron et al. in the statement of their theorem, which gives the theorem.

The next step is to show that the combination of the ordinal constraint and the balance constraint is sufficient to bound the variance diameter of the subset of a hypothesis space that satisfies those conditions.

Lemma 2.
Suppose that an ordering and loss-balance parameter $w$ are provided and that $f, g \in \hat{\mathcal{H}}$; that is, $L_{\mathrm{iso}}(f) \le D$, $L_{\mathrm{iso}}(g) \le D$, $L_B(f) \le B$, and $L_B(g) \le B$. Finally, suppose $L(g) \le L(f)$. Then
$$\mathbb{E} \left| l(f(X), Y) - l(g(X), Y) \right| \le \mathbb{E}\left[ l(f(X), Y) - l(g(X), Y) \right] + 4(D + B).$$

Proof. For the purposes of the proof, we decompose the expectation by class. Let $P_P = P(X \mid Y = 1)$ and $P_N = P(X \mid Y = -1)$, and let $p_P = P(Y = 1)$ and $p_N = P(Y = -1)$. Similarly, let $L_P$ and $L_N$ be loss functions defined by conditional expectations. The outline of the argument is shown below; we expand each line subsequently.
\begin{align}
& \mathbb{E} \left| l(f(X), Y) - l(g(X), Y) \right| \tag{16} \\
&= \mathbb{E}_{P_P} \left| l(f(X), 1) - l(g(X), 1) \right| p_P + \mathbb{E}_{P_N} \left| l(f(X), -1) - l(g(X), -1) \right| p_N \tag{17} \\
&\le \left| \mathbb{E}_{P_P} [ l(f(X), 1) - l(g(X), 1) ] \right| p_P + \left| \mathbb{E}_{P_N} [ l(f(X), -1) - l(g(X), -1) ] \right| p_N + 4D \tag{18} \\
&\le \mathbb{E}_{P_P} [ l(f(X), 1) - l(g(X), 1) ]\, p_P + \mathbb{E}_{P_N} [ l(f(X), -1) - l(g(X), -1) ]\, p_N + 4(D + B) \tag{19} \\
&= L(f) - L(g) + 4(D + B) \tag{20}
\end{align}
We begin by proving the inequality at line 18. Consider only the positive class for a moment. Let $\delta_f$ be the decision (or margin, as a simple extension) boundary for $f$ and $\delta_g$ the boundary for $g$. Then $l$ is $1$ for $f(X) \le \delta_f$ and $0$ otherwise.

We have noted the triangle-inequality relationship between $D$ and $L_{\mathrm{iso}}$: if $L_{\mathrm{iso}}(f) \le D$ and $L_{\mathrm{iso}}(g) \le D$, then $D(f, g) \le 2D$. Let $m_f$ and $m_g$ be monotone functions as defined (implicitly) in $L_{\mathrm{iso}}$. Then $m_{fg} = m_g^{-1} \circ m_f$ is a continuous monotone function which makes the metric relationship true. We will first show
\begin{align}
& \mathbb{E}_{P_P} \left| \mathbb{1}[f(X) \le \delta_f] - \mathbb{1}[g(X) \le \delta_g] \right| \le \left| \mathbb{E}_{P_P} \left[ \mathbb{1}[f(X) \le \delta_f] - \mathbb{1}[g(X) \le \delta_g] \right] \right| + 2D \tag{21} \\
\Leftrightarrow\; & \mathbb{E}_{P_P}\, \mathbb{1}[f(X) \le \delta_f \wedge g(X) > \delta_g] + \mathbb{E}_{P_P}\, \mathbb{1}[f(X) > \delta_f \wedge g(X) \le \delta_g] \tag{22} \\
& \le \left| \mathbb{E}_{P_P}\, \mathbb{1}[f(X) \le \delta_f \wedge g(X) > \delta_g] - \mathbb{E}_{P_P}\, \mathbb{1}[f(X) > \delta_f \wedge g(X) \le \delta_g] \right| + 2D \tag{23}
\end{align}
Rewriting concisely, we wish to show $a + b \le |a - b| + 2D$.
In fact, this holds because either $a \le D$ or $b \le D$, which can be proved in the following way. Expanding $a$, we have
\begin{align}
\mathbb{E}\, \mathbb{1}[f(X) \le \delta_f \wedge g(X) > \delta_g] &= \mathbb{E}\, \mathbb{1}[f \le \delta_f \wedge g > \delta_g \wedge g > m_{fg}(\delta_f)] + \mathbb{E}\, \mathbb{1}[f \le \delta_f \wedge g > \delta_g \wedge g \le m_{fg}(\delta_f)] \tag{24} \\
&\le D + \mathbb{E}\, \mathbb{1}[g > \delta_g \wedge g \le m_{fg}(\delta_f)] \tag{25}
\end{align}
The expectation term in line 25 is $0$ if $m_{fg}(\delta_f) \le \delta_g$. Repeating the procedure for $b$ shows that the corresponding term is $0$ if $m_{fg}(\delta_f) \ge \delta_g$. Since one of those conditions must be true, at least one of $a$ or $b$ is bounded by $D$, which proves line 21. A bound for the negative class is identical. This proves inequality 18.

To prove inequality 19, the assumptions on class balance are needed. Since we assumed that $L(f) > L(g)$, either (or both) $L_P(f) > L_P(g)$ or $L_N(f) > L_N(g)$. If both are true, then the desired inequality (19) is trivial. Suppose that $L_P(f) > L_P(g)$ and $L_N(f) \le L_N(g)$, and abbreviate $a = L_P(f)\, p_P$, $b = L_P(g)\, p_P$, $c = L_N(f)\, p_N$, and $d = L_N(g)\, p_N$. Then
\begin{align}
& |L_P(f) - L_P(g)|\, p_P + |L_N(f) - L_N(g)|\, p_N \tag{26} \\
&= |a - b| + |c - d| \tag{27} \\
&= a - b + d - c \tag{28} \\
&= a - b + c - d + 2(d - c) \tag{29} \\
&\le a - b + c - d + 2w(b - a) + 4B \tag{30} \\
&\le (L_P(f) - L_P(g))\, p_P + (L_N(f) - L_N(g))\, p_N + 4B \tag{31}
\end{align}
where we used that $b - a < 0$, $|wa - c| \le B$, $|wb - d| \le B$, and $w > 0$. The proof when $L_N(f) > L_N(g)$ and $L_P(f) \le L_P(g)$ is identical, using $|a - w^{-1} c| \le B$ and $|b - w^{-1} d| \le B$.
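The two elementary steps that drive these proofs can be sanity-checked numerically. The sketch below assumes the forms $w(r) = \sqrt{r/\phi + d}$ and $\psi(x) = Cx\sqrt{(V/n)\log n}$ used in the proof of Theorem 3 and checks that $\epsilon'$ dominates $\psi(w(\epsilon'))$ over a grid of illustrative constants; it then checks the counting inequality $a + b \le |a - b| + 2D$ used in Lemma 2 whenever $\min(a, b) \le D$. All numerical values are assumptions chosen for illustration.

```python
import math
import random

# Check (i): the fixed-point step in the proof of Theorem 3. Using
# w(r) = sqrt(r/phi + d) and psi(x) = C * x * sqrt(V * log(n) / n),
# verify that eps = C^2 * V * log(n) / (n * phi) + phi * d satisfies
# psi(w(eps)) <= eps, so the fixed point eps* is at most eps.
def fixed_point_holds(C, V, n, phi, d):
    a = V * math.log(n) / n
    eps = C ** 2 * a / phi + phi * d
    w = math.sqrt(eps / phi + d)   # variance modulus at radius eps
    psi = C * w * math.sqrt(a)     # sub-root upper bound for VC classes
    return psi <= eps + 1e-12

assert all(
    fixed_point_holds(C, 10, 10_000, phi, d)
    for C in (1.0, 2.0, 5.0)
    for phi in (0.1, 0.5, 1.0)
    for d in (0.0, 0.01, 0.1)
)

# Check (ii): the elementary step behind line 21 of Lemma 2's proof:
# a + b <= |a - b| + 2 * D whenever min(a, b) <= D.
random.seed(0)
for _ in range(1000):
    a, b = random.random(), random.random()
    D = min(a, b) + 0.5 * random.random()  # guarantees min(a, b) <= D
    assert a + b <= abs(a - b) + 2 * D + 1e-12
```

Check (i) succeeds for every grid point because the first-order expansion of the square root used at step 14 is an exact algebraic upper bound, not an approximation that could fail for particular constants.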