Robust Optimization over Multiple Domains
Qi Qian, Shenghuo Zhu, Jiasheng Tang, Rong Jin, Baigui Sun, Hao Li
Alibaba Group, Bellevue, WA, 98004, USA
{qi.qian, shenghuo.zhu, jiasheng.tjs, jinrong.jr, baigui.sbg, lihao.lh}@alibaba-inc.com

Abstract
In this work, we study the problem of learning a single model for multiple domains. Unlike the conventional machine learning scenario where each domain has its own model, multiple domains (i.e., applications/users) may have to share the same machine learning model due to the maintenance load in cloud computing services. For example, a digit-recognition model should be applicable to handwritten digits, house numbers, car plates, etc. Therefore, an ideal model for cloud computing has to perform well on every applicable domain. To address this new challenge from cloud computing, we develop a framework of robust optimization over multiple domains. In lieu of minimizing the empirical risk, we aim to learn a model optimized for the adversarial distribution over the multiple domains. Hence, we propose to learn the model and the adversarial distribution simultaneously with a stochastic algorithm for efficiency. Theoretically, we analyze the convergence rate for convex and non-convex models. To the best of our knowledge, this is the first study of the convergence rate of learning a robust non-convex model with a practical algorithm. Furthermore, we demonstrate that the robustness of the framework and the convergence rate can be further enhanced by appropriate regularizers over the adversarial distribution. The empirical study on real-world fine-grained visual categorization and digit recognition tasks verifies the effectiveness and efficiency of the proposed framework.
Introduction
Learning a single model for multiple domains has become a fundamental problem in machine learning and has found applications in cloud computing services. Cloud computing has witnessed the rapid development of machine learning in recent years. Users of these cloud computing services can benefit from sophisticated models provided by the service carrier, e.g., Aliyun. However, the robustness of deployed models becomes a challenge due to the explosive popularity of cloud computing services. Specifically, to maintain the scalability of the service, only a single model exists in the cloud for the same problem across different domains. For example, given a model for digit recognition in the cloud, some users may call it to identify handwritten digits while others may try to recognize printed digits (e.g., house numbers).
Figure 1: Illustration of optimizing over multiple domains. In this example, a digit-recognition model provided by the cloud service carrier should be applicable to multiple domains, e.g., handwritten digits and printed digits.

A satisfactory model has to deal with both domains (i.e., handwritten digits and printed digits) well in the modern architecture of cloud computing services. This problem is illustrated in Fig. 1. Note that the problem is different from multi-task learning (Zhang and Yang 2017), which aims to learn different models (i.e., multiple models) for different tasks by exploiting the shared information between related tasks.
In a conventional learning procedure, an algorithm may mix the data from multiple domains by assigning an ad-hoc weight to each example, and then learn a model accordingly. The weight is pre-defined and can be uniform over examples, which is known as empirical risk minimization (ERM). Consequently, the learned model can handle certain domains well but perform arbitrarily poorly on the others. The unsatisfactory performance on certain domains will result in business interruption for those users. Moreover, assigning even weights to all examples can suffer from the data imbalance problem when the examples from certain domains dominate.
Recently, distributionally robust optimization has attracted much attention (Chen et al. 2017; Namkoong and Duchi 2016; Shalev-Shwartz and Wexler 2016). Unlike the conventional strategy with a uniform distribution, it aims to optimize the performance of the model under the worst-case distribution over examples. The learned model is explicitly more robust since it focuses on the hard examples. To learn a robust model, much existing work applies convex loss functions, while the state-of-the-art performance for several important practical problems is reported from methods with non-convex loss functions, e.g., deep neural networks (He et al. 2016; Krizhevsky, Sutskever, and Hinton 2012; Szegedy et al. 2015).
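As a toy numeric illustration of the contrast between ERM with even weights and the worst-case objective (all domain names and loss values below are invented for exposition, not results from the paper):

```python
# Per-domain empirical losses of a single shared model (illustrative numbers).
losses = {"handwritten": 0.1, "house_numbers": 1.5, "car_plates": 0.2}

# ERM with even weights averages the losses and can hide a poorly served domain.
avg_loss = sum(losses.values()) / len(losses)   # approximately 0.6

# The robust objective evaluates the worst-case (one-hot adversarial) weighting.
worst_loss = max(losses.values())               # 1.5, exposing the weak domain
```

Here the average suggests the model is acceptable, while the worst-case value reveals that one domain is served poorly, which is exactly the failure mode the proposed framework targets.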
(Chen et al. 2017) proposed an algorithm to solve the non-convex problem, but their analysis relies on a near-optimal oracle for the non-convex subproblem, which is not feasible for most non-convex problems in real tasks. Besides, their algorithm has to go through the whole data set at least once to update the parameters at every iteration, which makes it too expensive for large-scale data sets.
In this work, we propose a framework to learn a robust model over multiple domains rather than over examples. By learning the model and the adversarial distribution simultaneously, the algorithm can balance the performance between different domains adaptively. Compared with the previous work, the empirical data distribution within each domain remains unchanged, and our framework only learns the distribution over the multiple domains. Therefore, the learned model will not be potentially misled by an adversarial distribution over examples. Our framework is also comparatively efficient due to the adoption of stochastic gradient descent (SGD) for optimization. More importantly, we first prove that the proposed method converges at a rate of O(1/T^{3/8}) without any dependency on an oracle. To further improve the robustness of the framework, we introduce a regularizer for the adversarial distribution. We find that an appropriate regularizer not only prevents the model from a trivial solution but also accelerates the convergence rate to O(\sqrt{\log(T)/T}). The detailed theoretical results are summarized in Table 1. The empirical study on pets categorization and digits recognition demonstrates the effectiveness and efficiency of the proposed method.

Table 1: Convergence rate for the non-convex model and adversarial distribution ("Adv-Dist") under different settings.

Setting (Model / Adv-Dist)   | Convergence (Model)     | Convergence (Adv-Dist)
Smooth / Concave             | O(1/T^{3/8})            | O(1/T^{1/4})
Smooth / Strongly Concave    | O(\sqrt{\log(T)/T})     | O(\log(T)/T)

Related Work
Robust optimization has been extensively studied in the past decades (Bertsimas, Brown, and Caramanis 2011). Recently, it has been investigated to improve the performance of the model under the worst-case data distribution, which can be interpreted as regularizing the variance (Duchi, Glynn, and Namkoong 2016). For a set of convex loss functions (e.g., a single data set), (Namkoong and Duchi 2016) and (Shalev-Shwartz and Wexler 2016) proposed to optimize the maximal loss, which is equivalent to minimizing the loss under the worst-case distribution generated from the empirical distribution of the data. (Namkoong and Duchi 2016) showed that under an f-divergence constraint, a standard stochastic mirror descent algorithm converges at a rate of O(1/\sqrt{T}) for convex losses. In (Shalev-Shwartz and Wexler 2016), the analysis indicates that minimizing the maximal loss can improve the generalization performance. In contrast to a single data set, we focus on dealing with multiple data sets and propose to learn a non-convex model in this work.
To tackle non-convex losses, (Chen et al. 2017) proposed to apply a near-optimal oracle. At each iteration, the oracle is called to return a near-optimal model for the given distribution. After that, the adversarial distribution over examples is updated according to the model from the oracle. With an α-optimal oracle, the authors proved that the algorithm converges to an α-optimal solution at a rate of O(1/\sqrt{T}), where T is the number of iterations. The limitation is that even if a near-optimal oracle is accessible for the non-convex problem, the algorithm is too expensive for real-world applications, because it has to enumerate the whole data set to update the parameters at each iteration. Without a near-optimal oracle, we prove that the proposed method converges at a rate of O(\sqrt{\log(T)/T}) with an appropriate regularizer, and its computational cost is much cheaper.
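As a concrete preview of this style of stochastic minimax optimization, the following is a minimal self-contained sketch that alternates an SGD step on the model with a multiplicative-weights ascent step on the distribution over domains (cf. Eqn. 2 and Eqn. 3 in the next section). Toy least-squares domains stand in for the paper's deep networks, and all function and variable names are illustrative, not from the authors' code:

```python
import numpy as np

def robust_sgd(domains, T=500, m=20, eta_w=0.05, eta_p=0.1, seed=0):
    """Alternate SGD descent on the model W with a multiplicative-weights
    ascent on the distribution p over K domains (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    K = len(domains)
    d = domains[0][0].shape[1]
    W = np.zeros(d)                      # linear model as a toy stand-in
    p = np.full(K, 1.0 / K)              # uniform initial distribution
    W_sum, p_sum = np.zeros(d), np.zeros(K)
    for _ in range(T):
        g = np.zeros(d)
        losses = np.zeros(K)
        for k, (X, y) in enumerate(domains):
            idx = rng.integers(0, X.shape[0], size=m)  # mini-batch per domain
            r = X[idx] @ W - y[idx]
            losses[k] = 0.5 * np.mean(r ** 2)          # stochastic domain loss
            g += p[k] * (X[idx].T @ r) / m             # p-weighted gradient
        W = W - eta_w * g                              # descent step on W
        p = p * np.exp(eta_p * losses)                 # multiplicative ascent on p
        p = p / p.sum()                                # renormalize onto simplex
        W_sum += W
        p_sum += p
    return W_sum / T, p_sum / T                        # averaged iterates
```

On two synthetic domains (one pure noise, one noiseless linear), the returned distribution stays on the simplex and the learnable domain is fit well; in the paper's setting the same loop runs with deep models and cross-entropy losses.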
Robust Optimization over Multiple Domains
Given K domains, we denote the data sets as \{S_1, \cdots, S_K\}. For the k-th domain, S_k = \{x_i^k, y_i^k\}, where x_i^k is an example (e.g., an image) and y_i^k is the corresponding label. We aim to learn a model that performs well over all domains. This can be cast as a robust optimization problem:

\min_W \epsilon \quad \text{s.t.} \quad \forall k, \; f_k(W) \le \epsilon

where W is the parameter of a prediction model and f_k(\cdot) is the empirical risk of the k-th domain:

f_k(W) = \frac{1}{|S_k|} \sum_{i: x_i^k \in S_k} \ell(x_i^k, y_i^k; W)

where \ell(\cdot) can be any non-negative loss function. Since the cross entropy loss is popular in deep learning, we adopt it in the experiments.
The problem is equivalent to the following minimax problem

\min_W \max_{p \in \Delta} L(p, W) = p^\top f(W) \quad (1)

where f(W) = [f_1(W), \cdots, f_K(W)]^\top, p is an adversarial distribution over the multiple domains, and \Delta is the simplex \Delta = \{p \in \mathbb{R}^K \,|\, \sum_{k=1}^K p_k = 1; \; \forall k, \, p_k \ge 0\}.
It is a game between the prediction model and the adversarial distribution. The minimax problem can be solved in an alternating manner, which applies gradient descent to learn the model and gradient ascent to update the adversarial distribution. Considering the large number of examples in each data set, we adopt SGD to obtain an unbiased estimate of the gradient at each iteration, which avoids enumerating the whole data set. Specifically, at the t-th iteration, a mini-batch of size m is randomly sampled from each domain. The loss of the mini-batch from the k-th domain is

\hat{f}_k^t(W) = \frac{1}{m} \sum_{i=1}^m \ell(\hat{x}_{i:t}^k, \hat{y}_{i:t}^k; W)

It is apparent that E[\hat{f}_k^t(W)] = f_k(W) and E[\nabla \hat{f}_k^t(W)] = \nabla f_k(W).

Algorithm 1
Stochastic Algorithm for Robust Optimization
Input: data sets \{S_1, \cdots, S_K\}, size of mini-batch m, step-sizes \eta_w, \eta_p
Initialize p_1 = [1/K, \cdots, 1/K]
for t = 1 to T do
  Randomly sample m examples from each domain
  Update W_{t+1} as in Eqn. 2
  Update p_{t+1} as in Eqn. 3
end for
return \bar{W} = \frac{1}{T}\sum_t W_t, \; \bar{p} = \frac{1}{T}\sum_t p_t

After sampling, we first update the model by gradient descent as

W_{t+1} = W_t - \eta_w \hat{g}_t, \quad \text{where} \quad \hat{g}_t = \sum_k p_k^t \nabla \hat{f}_k^t(W_t) \quad (2)

Then, the distribution p is updated in an adversarial way. Since p lies in the simplex, we can adopt the multiplicative updating criterion (Arora, Hazan, and Kale 2012):

p_{t+1}^k = \frac{p_t^k \exp(\eta_p \hat{f}_k^t(W_t))}{Z_t}, \quad \text{where} \quad Z_t = \sum_k p_t^k \exp(\eta_p \hat{f}_k^t(W_t)) \quad (3)

Alg. 1 summarizes the main steps of the approach. For convex loss functions the convergence rate is well known (Nemirovski et al. 2009), and we provide a high-probability bound for completeness. All detailed proofs of this work can be found in the appendix.

Lemma 1.
Assume the gradients and function values are bounded, i.e., \forall t, \|\nabla \hat{f}_k^t(W_t)\|_F \le \sigma, \|\hat{f}^t(W_t)\|_\infty \le \gamma, and \forall W, \|W\|_F \le R. Let (\bar{W}, \bar{p}) denote the results returned by Alg. 1 after T iterations. Set the step-sizes as \eta_w = \frac{R}{\sigma\sqrt{T}} and \eta_p = \frac{\sqrt{\log K}}{\gamma\sqrt{T}}. Then, with probability 1 - \delta, we have

\max_p L(p, \bar{W}) - \min_W L(\bar{p}, W) \le \frac{c_1}{\sqrt{T}} + \frac{2 c_2 \sqrt{\log(2/\delta)}}{\sqrt{T}}

where c_1 = O(\sqrt{\log K}) and c_2 is a constant.

Lemma 1 shows that the proposed method with a convex loss converges to the saddle point at a rate of O(1/\sqrt{T}) with high probability, which is a stronger result than the expectation bound in (Namkoong and Duchi 2016). Note that setting \eta_w = O(1/\sqrt{T}) and \eta_p = O(\sqrt{\log(K)/T}) does not change the order of the convergence rate, which means \sigma, \gamma and R are not required for the implementation.

Non-convexity
Despite the extensive studies of the convex loss, there is little research on the minimax problem with a non-convex loss. To provide the convergence rate for the non-convex problem, we first have the following lemma.
Lemma 2.
With the same assumptions as in Lemma 1, if \ell(\cdot) is non-convex but L-smooth, we have

\sum_t E[\|\nabla_{W_t} L(p_t, W_t)\|_F^2] \le \frac{L(p_1, W_1)}{\eta_w} + \frac{\eta_p T \gamma^2}{\eta_w} + T L \eta_w \sigma^2

\sum_t E[L(p_t, W_t)] \ge \max_{p \in \Delta} \sum_t E[L(p, W_t)] - \left(\frac{\log(K)}{\eta_p} + T \eta_p \gamma^2\right)

Since the loss is non-convex, the convergence is measured by the norm of the gradient (i.e., convergence to a stationary point), which is a standard criterion for the analysis of non-convex problems (Ghadimi and Lan 2013). Lemma 2 indicates that W can converge to a stationary point at which p_t is a qualified adversary, by setting the step-sizes elaborately. Furthermore, it demonstrates that the convergence rate of W is influenced by the convergence rate of p via \eta_p.
With Lemma 2, we have the convergence analysis of the non-convex minimax problem as follows.

Theorem 1.
With the same assumptions as in Lemma 2, if we set the step-sizes as \eta_w = \frac{\sqrt{\gamma\sqrt{\log K}}}{\sigma\sqrt{L}} T^{-3/8} and \eta_p = \frac{\sqrt{\log K}}{\gamma} T^{-3/4}, we have

E\left[\frac{1}{T}\sum_t \|\nabla_{W_t} L(p_t, W_t)\|_F^2\right] \le \left(\frac{L(p_1, W_1)}{\sqrt{\gamma\sqrt{\log K}}} + 2\sqrt{\gamma\sqrt{\log K}}\right) \sigma\sqrt{L}\, T^{-3/8}

E\left[\frac{1}{T}\sum_t L(p_t, W_t)\right] \ge E\left[\max_{p \in \Delta} \frac{1}{T}\sum_t L(p, W_t)\right] - 2\gamma\sqrt{\log(K)}\, T^{-1/4}

Remark.
Compared with the convex case in Lemma 1, the convergence rate for the non-convex problem degrades from O(1/\sqrt{T}) to O(1/T^{3/8}). It is well known that the convergence rate for general minimization problems with a smooth non-convex loss can be up to O(1/\sqrt{T}) (Ghadimi and Lan 2013). Our results further demonstrate that minimax problems with a non-convex loss are usually harder than non-convex minimization problems.
Different step-sizes lead to different convergence rates. For example, if the step-size for updating p is increased to \eta_p = 1/\sqrt{T} and that for the model is set to \eta_w = 1/T^{1/4}, the convergence rate of p is accelerated to O(1/\sqrt{T}) while the convergence rate of W degenerates to O(1/T^{1/4}). Therefore, if a sufficiently small step-size is applicable for p, the convergence rate of W can be significantly improved. We exploit this observation to enhance the convergence rate in the next subsection.

Regularized Non-convex Optimization
A critical problem in minimax optimization is that the formulation is very sensitive to outliers. For example, if there is a domain with significantly worse performance than the others, it will dominate the learning procedure according to Eqn. 1 (i.e., p becomes one-hot). Besides the issue of robustness, it is prevalent in real-world applications that the importance of domains differs according to their budgets, popularity, etc. Incorporating this side information into the formulation is essential for success in practice. Given a prior distribution, the problem can be written as

\min_W \max_{p \in \Delta} p^\top f(W) \quad \text{s.t.} \quad D(p\|q) \le \tau

where q is the prior distribution, which can be a distribution defined from the side information or a uniform distribution for robustness. D(\cdot\|\cdot) measures the distance between two distributions, e.g., the L_2 distance or the KL-divergence:

D_{L_2}(p\|q) = \|p - q\|_2^2; \quad D_{KL}(p\|q) = \sum_k p_k \log(p_k/q_k)

Since the KL-divergence cannot handle a prior distribution with zero elements, the optimal transportation (OT) distance has recently become popular to overcome this drawback:

D_{OT}(p\|q) = \min_{P \in U(p,q)} \langle P, M \rangle
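A small sketch of these three distances for discrete distributions follows. Since the exact OT distance requires solving a linear program, the sketch approximates it with the entropy-regularized Sinkhorn iteration discussed next (Cuturi 2013); all function names and parameter values are illustrative:

```python
import numpy as np

def d_l2(p, q):
    """Squared L2 distance between two discrete distributions."""
    return float(np.sum((p - q) ** 2))

def d_kl(p, q):
    """KL-divergence; requires q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def d_ot(p, q, M, nu=50.0, iters=500):
    """Entropy-regularized OT cost via Sinkhorn iterations (Cuturi, 2013).
    M is the ground cost matrix; nu controls the regularization strength."""
    Kmat = np.exp(-nu * M)              # Gibbs kernel from the cost matrix
    u = np.ones_like(p)
    for _ in range(iters):
        v = q / (Kmat.T @ u)            # scale to match column marginal q
        u = p / (Kmat @ v)              # scale to match row marginal p
    P = u[:, None] * Kmat * v[None, :]  # approximate transport plan
    return float((P * M).sum())
```

Note that d_kl is finite only when the prior q has no zero entries on the support of p, which is exactly the drawback that motivates the OT distance in the text.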
For computational efficiency, we use the version with an entropy regularizer (Cuturi 2013) and we have
Proposition 1.
Define the OT regularizer as

D_{OT}(p\|q) = \max_{\alpha,\beta} \min_P \frac{1}{\nu}\sum_{i,j} P_{i,j}\log(P_{i,j}) + \sum_{i,j} P_{i,j} M_{i,j} + \alpha^\top(P\mathbf{1}_K - p) + \beta^\top(P^\top\mathbf{1}_K - q) \quad (4)

and it is convex in p.

According to duality theory (Boyd and Vandenberghe 2004), for each \tau we have an equivalent problem with a specified \lambda:

\min_W \max_{p \in \Delta} \hat{L}(p, W) = p^\top f(W) - \lambda D(p\|q) \quad (5)

Compared with the formulation in Eqn. 1, we introduce a regularizer for the adversarial distribution. If D(p\|q) is convex in p, a convergence rate similar to that in Theorem 1 can be obtained with the same analysis. Moreover, according to research on SGD, strong convexity is the key to achieving the optimal convergence rate (Rakhlin, Shamir, and Sridharan 2012). Hence, we adopt a strongly convex regularizer, i.e., the L_2 regularizer, for the distribution. The convergence rate for other strongly convex regularizers can be obtained with a similar analysis by defining smoothness and strong convexity with the corresponding norm.
Equipped with the L_2 regularizer, the problem in Eqn. 5 can be solved with a projected first-order algorithm. We adopt projected gradient ascent to update the adversarial distribution as

p_{t+1} = P_\Delta(p_t + \eta_p^t \hat{h}_t), \quad \text{where} \quad \hat{h}_t = \hat{f}^t - \lambda(p_t - q)

P_\Delta(p) projects the vector p onto the simplex. The projection algorithm, which is based on the K.K.T. conditions, can be found in (Duchi et al. 2008). We also provide the gradient of the OT regularizer in the appendix.
Since the regularizer (i.e., -L_2) is strongly concave, the convergence of p can be accelerated dramatically, which leads to a better convergence rate for the minimax problem. The theoretical result is as follows.

Theorem 2.
With the same assumptions as in Theorem 1, if we additionally assume \forall t, \|\hat{h}_t\|_2 \le \mu and set the step-sizes as \eta_w = \frac{\mu\sqrt{\log(T)}}{\sigma\sqrt{\lambda L T}} and \eta_p^t = \frac{1}{\lambda t}, we have

E\left[\frac{1}{T}\sum_t \|\nabla_{W_t} \hat{L}(p_t, W_t)\|_F^2\right] \le \left(\frac{\hat{L}(p_1, W_1)\sigma\sqrt{\lambda L}}{\mu\sqrt{\log(T)}} + \frac{\mu\pi^2\sigma\sqrt{\lambda L}}{12} + 2\mu\sigma\sqrt{\lambda L \log(T)}\right)\frac{1}{\sqrt{T}}

E\left[\frac{1}{T}\sum_t \hat{L}(p_t, W_t)\right] \ge E\left[\max_{p \in \Delta} \frac{1}{T}\sum_t \hat{L}(p, W_t)\right] - \frac{\mu^2\log(T)}{\lambda T}

Remark.
With the strongly concave regularizer, it is not surprising to obtain the O(\log(T)/T) convergence rate for p. As discussed after Lemma 2, a faster convergence rate for p improves that of W: in Theorem 2, the convergence rate of W is improved from O(1/T^{3/8}) to O(\sqrt{\log(T)/T}). It shows that the applied regularizer not only improves the robustness of the proposed framework but also accelerates the learning procedure.
Moreover, the step-size for the adversarial distribution provides a trade-off between the bias and the variance of the gradient. Therefore, the convergence rate can be further improved by reducing the variance. We shrink the gradient with a factor depending on a constant c and update the distribution as p_{t+1} = P_\Delta(p_t + \eta_p^t \frac{t}{t+c} \hat{h}_t). When taking \eta_p^t = \frac{1}{\lambda t}, the update becomes

p_{t+1} = P_\Delta\left(p_t + \frac{1}{\lambda(t+c)} \hat{h}_t\right) \quad (6)

With an analysis similar to Theorem 2, we have

Theorem 3.
With the same assumptions as in Theorem 2, if we set the step-size \eta_p^t = \frac{1}{\lambda(t+c)}, we have

E\left[\frac{1}{T}\sum_t \hat{L}(p_t, W_t)\right] \ge E\left[\max_{p \in \Delta} \frac{1}{T}\sum_t \hat{L}(p, W_t)\right] - \left(\lambda c + \frac{\mu^2}{\lambda}\ln\left(\frac{T}{c}+1\right) + \frac{\mu^2}{\lambda}\right)\frac{1}{T}

It shows that the constant c controls the trade-off between the bias (i.e., \lambda c) and the variance (i.e., \frac{\mu^2}{\lambda}\ln(\frac{T}{c}+1)). By setting the constant appropriately, we have the following corollary.

Corollary 1.
When setting c = \frac{\mu}{\lambda}\left(1 + \sqrt{\frac{\mu T}{\lambda}}\right), the RHS in Theorem 3 is maximized. The optimality comes from the fact that the RHS is concave in c; a detailed discussion can be found in the appendix.
The algorithm for robust optimization with the regularizer is summarized in Alg. 2.

Algorithm 2 Stochastic Regularized Robust Optimization
Input: data sets \{S_1, \cdots, S_K\}, size of mini-batch m, step-sizes \eta_w, \eta_p
Initialize p_1 = [1/K, \cdots, 1/K]
Compute the constant c as in Corollary 1
for t = 1 to T do
  Randomly sample m examples from each domain
  Update W_{t+1} with gradient descent
  (Optional) Solve the problem in Eqn. 4 if applying D_{OT}(p_t\|q)
  Update p_{t+1} with gradient ascent
  Project p_{t+1} onto the simplex
end for

Trade Efficiency for Convergence
In this subsection, we study whether we can recover the optimal convergence rate for the general non-convex problem as in (Ghadimi and Lan 2013). Note that (Chen et al. 2017) applies a near-optimal oracle to achieve the O(1/\sqrt{T}) convergence rate. Given a distribution, it is hard to obtain such an oracle for the non-convex model. In contrast, obtaining a near-optimal adversarial distribution with a fixed model is feasible. For the original problem in Eqn. 1, the solution is trivial: return the index of the domain with the largest empirical loss. For the problem with the regularizer in Eqn. 5, a near-optimal p can be obtained efficiently by any first-order method (Boyd and Vandenberghe 2004). Therefore, we can change the updating criterion for the distribution at the t-th iteration to

\text{Obtain } p_{t+1} \text{ such that } \|p_{t+1} - p^*_{t+1}\| \le \xi_{t+1}, \text{ where } p^*_{t+1} = \arg\max_{p \in \Delta} L(p, W_t) \quad (7)

With the new updating criterion and letting F(W) = \max_p L(p, W), we have a better convergence rate as follows.

Theorem 4.
With the same assumptions as in Theorem 1, if we update p as in Eqn. 7, where \xi_t = 1/\sqrt{t}, and set the step-size as \eta_w = \frac{1}{\sigma\sqrt{LT}}, we have

E\left[\frac{1}{T}\sum_t \|\nabla F(W_t)\|_F^2\right] \le \frac{(F(W_1)+1)\sqrt{L}\sigma}{\sqrt{T}} + \frac{2\sigma^2}{\sqrt{T}}

For the problem in Eqn. 1, \xi_t = 0 can be achieved by a single pass through the whole data set. It shows that with an expensive but feasible operator as in Eqn. 7, the proposed method can recover the optimal convergence rate for the non-convex problem.

Experiments
We conduct the experiments on training deep neural networks over multiple domains. The methods in the comparison are summarized as follows.
• Individual: learns the model from an individual domain.
• Mixture-Even: learns the model from multiple domains with even weights, which is equivalent to fixing p as a uniform distribution.
• Mixture-Opt: implements the approach proposed in Alg. 2, which learns the model and the adversarial distribution over multiple domains simultaneously.
We adopt the popular cross entropy loss as the loss function \ell(\cdot) in this work. Deep models are trained with SGD and the size of each mini-batch is set to 200. For the methods learning with multiple domains, the numbers of examples from different domains are the same within a mini-batch, i.e., m = 200/K. Compared with the strategy that samples examples according to the learned distribution, this strategy is deterministic and does not introduce extra noise. The methods are evaluated by their worst-case performance among the multiple domains: the worst-case accuracy is defined as Acc_w = \min_k\{Acc_1, \cdots, Acc_K\} and the worst-case loss is defined as f_w(W) = \max_k\{f_1(W), \cdots, f_K(W)\}. All experiments are implemented on an NVIDIA Tesla P100 GPU.

Pets Categorization
First, we compare the methods on a fine-grained visual categorization task. Given the data sets of VGG cats&dogs (Parkhi et al. 2012) and ImageNet (Russakovsky et al. 2015), we extract the shared labels between them and then generate the subsets with the desired labels from each of them. The resulting data set consists of 24 classes, and the task is to assign an image of a pet to one of these classes. For ImageNet, each class contains about 1,300 images for training, while a class in VGG only has about 200. Therefore, we apply data augmentation by flipping (horizontal+vertical) and rotating by a set of fixed angles for VGG to avoid overfitting. After that, the number of images in VGG is similar to that of ImageNet. Some exemplar images from these data sets are illustrated in Fig. 3. We can find that the task in ImageNet is more challenging than that in VGG due to complex backgrounds.
We adopt ResNet18 (He et al. 2016) as the base model in this experiment. It is initialized with the parameters learned from ILSVRC2012 (Russakovsky et al. 2015), and we set a small learning rate \eta_w for fine-tuning. Considering the small size of the data sets, we also include the method of (Chen et al. 2017) in the comparison, denoted as Mixture-Oracle. Since the near-optimal oracle is infeasible for Mixture-Oracle, we apply the model obtained after a fixed number of SGD iterations instead, as suggested in (Chen et al. 2017). The prior distribution in the regularizer is set to the uniform distribution.
Fig. 2 summarizes the worst-case training loss among the multiple domains for the methods in the comparison. Since the performance of models learned from multiple domains is significantly better than that of models learned from an individual set, we illustrate the results in separate figures. Fig. 2 (a) compares the proposed method to those trained on an individual data set. It is evident that the proposed method has superior performance and that learning with an individual domain cannot handle the data from other domains well. Fig. 2 (b) shows
Figure 2: Illustration of worst case training loss. (a) Pets categorization: Individual-ImageNet, Individual-VGG, Mixture-Opt. (b) Pets categorization: Mixture-Even, Mixture-Oracle, Mixture-Opt. (c) Digits recognition: Individual-MNIST, Individual-SVHN, Mixture-Opt. (d) Digits recognition: Mixture-Even, Mixture-Opt. (x-axis: number of mini-batches)

Table 2: Comparison on pets categorization. We report the loss and accuracy (%) on each data set.

Methods             | ImageNet: Loss^Tr / Acc^Tr / Acc^Te | VGG: Loss^Tr / Acc^Tr / Acc^Te | Acc^Tr_w / Acc^Te_w
Individual-ImageNet | 0.07 / 98.95 / 89.92 | 0.85 / 74.56 / 80.44 | 74.56 / 80.44
Individual-VGG      | 0.90 / 75.47 / 77.92 | 0.02 / 100.00 / 86.85 | 75.47 / 77.92
Mixture-Even        | 0.17 / 95.56 / 88.50 | 0.05 / 99.58 / 89.85 | 95.56 / 88.50
Mixture-Oracle      | 0.15 / 96.04 / 88.92 | 0.06 / 99.41 / 89.99 | 96.04 / 88.92
Mixture-Opt         | 0.12 / 97.36 / 89.42 | 0.11 / 97.72 / n/a | 97.36 / n/a
Figure 3: Exemplar images from ImageNet and VGG.
Figure 4: Comparison of discrepancy in losses (Mixture-Even, Mixture-Oracle, Mixture-Opt; x-axis: number of mini-batches).

Figure 5: Comparison of running time (seconds) for Mixture-Even, Mixture-Opt, Mixture-OT, and Mixture-Oracle.
Figure 6: Illustration of best and worst training loss on ImageNet with Gaussian noise N(0, \sigma^2), for two sets of noise levels \sigma (panels (a) and (b); curves: Mixture-Even-Best, Mixture-Even-Worst, Mixture-Opt-Best, Mixture-Opt-Worst).

the results of the methods learning with multiple data sets. First, we find that both Mixture-Oracle and Mixture-
Opt can achieve a lower worst-case loss than Mixture-Even, which confirms the effectiveness of the robust optimization. Second, Mixture-Opt performs best among all of these methods, which demonstrates that the proposed method can optimize the performance over the adversarial distribution. To investigate the discrepancy between the performances on the two domains, we illustrate the result in Fig. 4. The discrepancy is measured by the difference between the empirical losses, f_ImageNet - f_VGG. We can find that f_ImageNet is smaller than f_VGG at the beginning, but f_VGG decreases faster than f_ImageNet. This is because the model is initialized with parameters pre-trained on ImageNet; however, the task in VGG is easier than that in ImageNet, so f_VGG drops faster after a few iterations. Compared with the benchmark methods, the discrepancy of the proposed method is an order of magnitude smaller throughout the learning procedure. It verifies the robustness of Mixture-Opt and also shows that the proposed method can handle the drifting between multiple domains well. Finally, to compare the performance explicitly, we include the detailed results in Table 2. Compared with Mixture-Even, we observe that Mixture-Opt pays more attention to ImageNet than to VGG and trades the performance between them.
To further demonstrate that Mixture-Opt can trade the performance effectively, we conduct experiments with noisy data. We simulate each individual domain by adding random Gaussian noise from N(0, \sigma^2) to each pixel of the images from ImageNet pets. We vary the variance to generate different domains and obtain two tasks, each with four domains (a smaller and a larger set of noise levels \sigma, respectively). Fig. 6 compares the gap between the best and worst performance on different domains for Mixture-Even and Mixture-Opt. First, we can find that the proposed method improves the worst-case performance significantly while keeping the best performance almost the same. Besides, the domains can achieve similar performance for the simple
Number of Mini-batches -0.08-0.06-0.04-0.0200.020.040.06 D i ff e r en c e i n T r a i n i ng Lo ss =0.1 =0.05 =0.01 (a) D L ( p || q )
500 1000 1500 2000 2500 3000 3500 4000 4500 5000
Number of Mini-batches -0.6-0.4-0.200.20.40.60.8 D i ff e r en c e i n D i s t r i bu t i on =0.1 =0.05 =0.01 (b) D L ( p || q )
500 1000 1500 2000 2500 3000 3500 4000 4500 5000
Number of Mini-batches -0.2-0.15-0.1-0.0500.050.10.15 D i ff e r en c e i n T r a i n i ng Lo ss =0.001 =0.0005 =0.0001 (c) D OT ( p || q )
500 1000 1500 2000 2500 3000 3500 4000 4500 5000
Number of Mini-batches -1-0.8-0.6-0.4-0.200.20.40.6 D i ff e r en c e i n D i s t r i bu t i on =0.001 =0.0005 =0.0001 (d) D OT ( p || q ) Figure 7: Illustration of the influence of the regularizer.Table 3: Comparison on digits recognition.
Methods          | MNIST: Loss^Tr / Acc^Tr / Acc^Te | SVHN: Loss^Tr / Acc^Tr / Acc^Te | Acc^Tr_w / Acc^Te_w
Individual-MNIST | 0.001 / 100.00 / 98.81 | 4.01 / 30.80 / 29.58 | 30.80 / 29.58
Individual-SVHN  | 0.91 / 66.66 / 68.25 | 0.10 / 97.11 / 91.84 | 66.66 / 68.25
Mixture-Even     | 0.001 / 100.00 / 98.74 | 0.14 / 96.20 / 91.33 | 96.20 / 91.33
Mixture-Opt      | 0.03 / 99.03 / 98.13 | 0.11 / 97.05 / n/a | 97.05 / n/a

task with the smaller noise variances. For the hard task that includes an extreme domain with the largest noise, the best performance is not sacrificed much due to the appropriate regularizer in Mixture-Opt.
After the comparison of performance, we illustrate the influence of the parameter \lambda in Fig. 7. The parameter appears in Eqn. 5 and constrains the distance of the adversarial distribution to the prior distribution. Besides the L_2 regularizer applied in Mixture-Opt, we also include the results of the OT regularizer defined in Proposition 1; this method is denoted as Mixture-OT. Fig. 7 (a) and (c) compare the discrepancy between the losses as in the previous experiments. It is obvious that the smaller the \lambda, the smaller the gap between the two domains. Fig. 7 (b) and (d) summarize the drift in the distribution, which is defined as p_ImageNet - p_VGG. Evidently, the learned adversarial distribution switches adaptively according to the performance of the current model, and the importance of the multiple domains can be constrained well by setting \lambda appropriately.
Finally, we compare the running time in Fig. 5. Due to the lightweight update of the adversarial distribution, Mixture-Opt and Mixture-OT have almost the same running time as Mixture-Even. Mixture-Oracle has to enumerate the whole data set after every block of SGD iterations to update the current distribution; hence, its running time with only a few complete iterations is substantially longer than that of the proposed method with 5,000 iterations on these small data sets.

Digits Recognition
In this experiment, we examine the methods on the task of digits recognition, which is to identify the 10 digits (i.e., 0-9) from images. There are two benchmark data sets for this task: MNIST and SVHN. MNIST (LeCun et al. 1998) is collected for recognizing handwritten digits. It contains 60,000 images for training and 10,000 images for test. SVHN (Netzer et al. 2011) is for identifying house numbers from Google Street View images, and it consists of 73,257 training images and 26,032 test images. Note that the examples in MNIST are 28×28 gray images while those in SVHN are 32×32 color images. To make the format consistent, we resize the images in MNIST to 32×32 and repeat the gray channel in the RGB channels to generate color images. Considering that the task is more straightforward than pets categorization, we apply AlexNet (Krizhevsky, Sutskever, and Hinton 2012) as the base model in this experiment and set the learning rate η_w accordingly. By adopting a different deep model, we also demonstrate that the proposed framework can work with various deep architectures. Fig. 2(c) and (d) show the comparison of the worst-case training loss, and Table 3 summarizes the detailed results. We observe a similar conclusion as in the experiments on pets categorization: Mixture_Even can achieve good performance on these simple domains, while the proposed method further improves the worst-case performance and provides a more reliable model over multiple domains.
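The format unification described above (28×28 grayscale to 32×32 RGB) can be sketched as follows. This is a minimal illustration, not the paper's pipeline; the interpolation scheme is not specified in the text, so nearest-neighbor resizing is assumed here.

```python
import numpy as np

def mnist_to_svhn_format(gray_batch):
    """Convert a batch of 28x28 grayscale digits to 32x32 RGB.

    gray_batch: float array of shape (N, 28, 28).
    Returns an array of shape (N, 32, 32, 3).
    """
    n, h, w = gray_batch.shape
    # Nearest-neighbor index maps from the 32x32 grid back to 28x28.
    rows = (np.arange(32) * h / 32).astype(int)
    cols = (np.arange(32) * w / 32).astype(int)
    resized = gray_batch[:, rows[:, None], cols[None, :]]   # (N, 32, 32)
    # Repeat the single gray channel in all three RGB channels.
    return np.repeat(resized[..., None], 3, axis=-1)

batch = np.random.rand(4, 28, 28)
out = mnist_to_svhn_format(batch)
print(out.shape)  # (4, 32, 32, 3)
```

Zero-padding the 28×28 image to 32×32 would be an equally simple alternative; resizing is used here because that is the operation the text names.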
Conclusion
In this work, we propose a framework to learn a robust model over multiple domains, which is essential for cloud computing services. The introduced algorithm learns the model and the adversarial distribution simultaneously, and we provide a theoretical guarantee on its convergence rate. The empirical study on real-world applications confirms that the proposed method can obtain a robust non-convex model. In the future, we plan to examine the performance of the method in more applications. Besides, extending the framework to multiple domains with partially overlapping labels is also important for real-world applications.
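As a concrete illustration of learning the model and the adversarial distribution simultaneously, here is a minimal full-gradient sketch on synthetic least-squares domains. All names, constants, and the toy data are hypothetical; the exponentiated update on p mirrors the update analyzed in the appendix, but this is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: K = 3 domains sharing one least-squares model.
K, d, n = 3, 5, 50
beta = rng.normal(size=d)
X = [rng.normal(size=(n, d)) for _ in range(K)]
Y = [x @ beta + 0.1 * rng.normal(size=n) for x in X]

def loss_and_grad(w, k):
    """Average squared error and its gradient on domain k."""
    r = X[k] @ w - Y[k]
    return (r @ r) / n, 2.0 * X[k].T @ r / n

w = np.zeros(d)                  # model parameters
p = np.full(K, 1.0 / K)          # adversarial distribution over domains
eta_w, eta_p = 0.05, 0.1
for t in range(200):
    losses = np.empty(K)
    grad = np.zeros(d)
    for k in range(K):
        losses[k], g = loss_and_grad(w, k)
        grad += p[k] * g         # gradient of the p-weighted loss
    w -= eta_w * grad            # descent step on the model
    p *= np.exp(eta_p * losses)  # exponentiated ascent step on p
    p /= p.sum()
print(losses.max() < 1.0)  # True: the worst-domain loss shrinks
```

The distribution p drifts toward whichever domain currently has the largest loss, so the descent step on w keeps targeting the worst case.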
Acknowledgments
We would like to thank Dr. Juhua Hu from University of Washington Tacoma and the anonymous reviewers for their valuable suggestions that helped to improve this work.

References

[Arora, Hazan, and Kale 2012] Arora, S.; Hazan, E.; and Kale, S. 2012. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing 8(1):121-164.
[Bertsimas, Brown, and Caramanis 2011] Bertsimas, D.; Brown, D. B.; and Caramanis, C. 2011. Theory and applications of robust optimization. SIAM Review 53(3):464-501.
[Boyd and Vandenberghe 2004] Boyd, S., and Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.
[Cesa-Bianchi and Lugosi 2006] Cesa-Bianchi, N., and Lugosi, G. 2006. Prediction, Learning, and Games. Cambridge University Press.
[Chen et al. 2017] Chen, R. S.; Lucier, B.; Singer, Y.; and Syrgkanis, V. 2017. Robust optimization for non-convex objectives. In NIPS, 4708-4717.
[Cuturi 2013] Cuturi, M. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, 2292-2300.
[Duchi et al. 2008] Duchi, J. C.; Shalev-Shwartz, S.; Singer, Y.; and Chandra, T. 2008. Efficient projections onto the l1-ball for learning in high dimensions. In ICML, 272-279.
[Duchi, Glynn, and Namkoong 2016] Duchi, J. C.; Glynn, P.; and Namkoong, H. 2016. Statistics of robust optimization: A generalized empirical likelihood approach. ArXiv e-prints.
[Ghadimi and Lan 2013] Ghadimi, S., and Lan, G. 2013. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23(4):2341-2368.
[He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770-778.
[Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS, 1106-1114.
[LeCun et al. 1998] LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278-2324.
[Namkoong and Duchi 2016] Namkoong, H., and Duchi, J. C. 2016. Stochastic gradient methods for distributionally robust optimization with f-divergences. In NIPS, 2208-2216.
[Nemirovski et al. 2009] Nemirovski, A.; Juditsky, A.; Lan, G.; and Shapiro, A. 2009. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization 19(4):1574-1609.
[Netzer et al. 2011] Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
[Parkhi et al. 2012] Parkhi, O. M.; Vedaldi, A.; Zisserman, A.; and Jawahar, C. V. 2012. Cats and dogs. In CVPR.
[Rakhlin, Shamir, and Sridharan 2012] Rakhlin, A.; Shamir, O.; and Sridharan, K. 2012. Making gradient descent optimal for strongly convex stochastic optimization. In ICML.
[Russakovsky et al. 2015] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2015. ImageNet large scale visual recognition challenge. IJCV 115(3):211-252.
[Shalev-Shwartz and Wexler 2016] Shalev-Shwartz, S., and Wexler, Y. 2016. Minimizing the maximal loss: How and why. In ICML, 793-801.
[Szegedy et al. 2015] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S. E.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In CVPR, 1-9.
[Zhang and Yang 2017] Zhang, Y., and Yang, Q. 2017. A survey on multi-task learning. CoRR abs/1707.08114.
Appendix
Proof of Lemma 1
Proof.
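An aside before the derivation: the elementary inequality log(1 − a(1 − e^x)) ≤ ax + x², invoked below in the form used in this reconstruction, can be sanity-checked numerically over a ∈ [0, 1] and x ∈ [0, 1].

```python
import numpy as np

# Numerical check of log(1 - a*(1 - exp(x))) <= a*x + x**2
# over a grid of a in [0, 1] and x in [0, 1].
a = np.linspace(0.0, 1.0, 101)[:, None]
x = np.linspace(0.0, 1.0, 101)[None, :]
lhs = np.log(1.0 - a * (1.0 - np.exp(x)))
rhs = a * x + x**2
print(bool(np.all(lhs <= rhs + 1e-12)))  # True
```

The bound follows analytically from log(1 + a(e^x − 1)) ≤ a(e^x − 1) and e^x − 1 ≤ x + x² for x ∈ [0, 1].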
According to the updating criterion, we have
\[
D_{KL}(\mathbf{p}\,\|\,\mathbf{p}_{t+1}) - D_{KL}(\mathbf{p}\,\|\,\mathbf{p}_t) = -\eta_p \mathbf{p}^\top \hat{f}_t + \log(Z_t) \tag{8}
\]
where $D_{KL}(\mathbf{p}\|\mathbf{q})$ denotes the KL-divergence between the distributions $\mathbf{p}$ and $\mathbf{q}$. Note that for $a\in[0,1]$ and $x\in[0,1]$ we have
\[
\log(1 - a(1-\exp(x))) \le ax + x^2.
\]
Therefore
\[
\sum_k \log\big(1 - p_t^k(1-\exp(\eta_p \hat{f}_t^k))\big) \le \eta_p \mathbf{p}_t^\top \hat{f}_t + \eta_p^2 \|\hat{f}_t\|^2.
\]
Since $\hat{f}_t^k \ge 0$, we have $-p_t^k(1-\exp(\eta_p \hat{f}_t^k)) \ge 0$ and
\[
\log(Z_t) = \log\Big(1 - \sum_k p_t^k(1-\exp(\eta_p \hat{f}_t^k))\Big) \le \sum_k \log\big(1 - p_t^k(1-\exp(\eta_p \hat{f}_t^k))\big) \le \eta_p \mathbf{p}_t^\top \hat{f}_t + \eta_p^2 \|\hat{f}_t\|^2.
\]
Taking this back to Eqn. 8, we have
\[
(\mathbf{p}-\mathbf{p}_t)^\top \hat{f}_t \le \frac{D_{KL}(\mathbf{p}\|\mathbf{p}_t) - D_{KL}(\mathbf{p}\|\mathbf{p}_{t+1})}{\eta_p} + \eta_p \gamma^2 \tag{9}
\]
Therefore, for an arbitrary distribution $\mathbf{p}$, we have
\[
\mathcal{L}(\mathbf{p}, W_t) - \mathcal{L}(\mathbf{p}_t, W_t) = (\mathbf{p}-\mathbf{p}_t)^\top f(W_t) = (\mathbf{p}-\mathbf{p}_t)^\top \hat{f}_t + (\mathbf{p}-\mathbf{p}_t)^\top (f - \hat{f}_t)
\le \frac{D_{KL}(\mathbf{p}\|\mathbf{p}_t) - D_{KL}(\mathbf{p}\|\mathbf{p}_{t+1})}{\eta_p} + \eta_p\gamma^2 + (\mathbf{p}-\mathbf{p}_t)^\top (f - \hat{f}_t) \tag{10}
\]
On the other hand, due to the convexity of the loss function, we have, for an arbitrary model $W$,
\[
\mathcal{L}(\mathbf{p}_t, W_t) \le \mathcal{L}(\mathbf{p}_t, W) + \langle g_t, W_t - W\rangle = \mathcal{L}(\mathbf{p}_t, W) + \langle \hat{g}_t, W_t - W\rangle + \langle g_t - \hat{g}_t, W_t - W\rangle
\le \mathcal{L}(\mathbf{p}_t, W) + \frac{\|W - W_t\|_F^2 - \|W - W_{t+1}\|_F^2}{2\eta_w} + \frac{\eta_w\sigma^2}{2} + \langle g_t - \hat{g}_t, W_t - W\rangle \tag{11}
\]
Combining Eqn. 10 and Eqn. 11 and summing over $t$ from 1 to $T$,
\[
\sum_t \mathcal{L}(\mathbf{p}, W_t) - \mathcal{L}(\mathbf{p}_t, W) \le \frac{\log(K)}{\eta_p} + \frac{\|W - W_1\|_F^2}{2\eta_w} + T\eta_p\gamma^2 + \frac{T\eta_w\sigma^2}{2} + \sum_t (\mathbf{p}-\mathbf{p}_t)^\top(f-\hat{f}_t) + \sum_t \langle g_t - \hat{g}_t, W_t - W\rangle,
\]
where we use $D_{KL}(\mathbf{p}\|\mathbf{p}_1) \le \log(K)$, which follows from the fact that $\mathbf{p}_1$ is the uniform distribution.

Note that for all $t$ we have $\mathbb{E}[(\mathbf{p}-\mathbf{p}_t)^\top(f-\hat{f}_t)] = 0$ and $|(\mathbf{p}-\mathbf{p}_t)^\top(f-\hat{f}_t)| \le \|f-\hat{f}_t\|\,\|\mathbf{p}-\mathbf{p}_t\| \le 2\sqrt{2}\,\gamma$. According to the Hoeffding-Azuma inequality for martingale difference sequences (Cesa-Bianchi and Lugosi 2006), with probability at least $1-\delta$ we have
\[
\sum_t (\mathbf{p}-\mathbf{p}_t)^\top(f-\hat{f}_t) \le 4\gamma\sqrt{T\log(1/\delta)}.
\]
By a similar analysis, with probability at least $1-\delta$ we have
\[
\sum_t \langle g_t - \hat{g}_t, W_t - W\rangle \le 4\sqrt{2}\,\sigma R\sqrt{T\log(1/\delta)}.
\]
Therefore, when setting $\eta_w = 2R/(\sigma\sqrt{T})$ and $\eta_p = \sqrt{\log(K)}/(\gamma\sqrt{T})$, with probability at least $1-\delta$ we have
\[
\sum_t \mathcal{L}(\mathbf{p}, W_t) - \mathcal{L}(\mathbf{p}_t, W) \le c_1\sqrt{T} + c_2\sqrt{T\log(2/\delta)},
\]
where
\[
c_1 = 2\gamma\sqrt{\log(K)} + 2\sigma R; \qquad c_2 = 4\gamma + 4\sqrt{2}\,\sigma R.
\]
Due to the convexity of $\mathcal{L}(\cdot,\cdot)$ in $W$ and its concavity in $\mathbf{p}$, with probability at least $1-\delta$ we have
\[
\mathcal{L}(\mathbf{p}, \bar{W}) - \mathcal{L}(\bar{\mathbf{p}}, W) \le \frac{1}{T}\sum_t \big(\mathcal{L}(\mathbf{p}, W_t) - \mathcal{L}(\mathbf{p}_t, W)\big) \le \frac{c_1}{\sqrt{T}} + \frac{c_2\sqrt{\log(2/\delta)}}{\sqrt{T}}.
\]
We finish the proof by taking the desired $(\mathbf{p}, W)$ into the inequality.

Proof of Lemma 2
Proof.
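One ingredient of the argument below is Pinsker's inequality, ‖p − q‖₁ ≤ √(2 D_KL(p‖q)), used to bound the drift of the adversarial distribution. A quick numerical sanity check on random distributions:

```python
import numpy as np

rng = np.random.default_rng(1)
# Check ||p - q||_1 <= sqrt(2 * KL(p||q)) on random simplex points.
ok = True
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    l1 = np.abs(p - q).sum()
    kl = np.sum(p * np.log(p / q))
    ok &= l1 <= np.sqrt(2.0 * kl) + 1e-9
print(bool(ok))  # True
```

Dirichlet samples are strictly positive almost surely, so the KL term is finite in every trial.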
We first present some necessary definitions.
Definition 1.
A function $F$ is called $L$-smooth w.r.t. a norm $\|\cdot\|$ if there is a constant $L$ such that, for any $W$ and $W'$, it holds that
\[
F(W') \le F(W) + \langle\nabla F(W), W' - W\rangle + \frac{L}{2}\|W' - W\|^2.
\]

Definition 2. A function $F$ is called $\lambda$-strongly convex w.r.t. a norm $\|\cdot\|$ if there is a constant $\lambda$ such that, for any $W$ and $W'$, it holds that
\[
F(W') \ge F(W) + \langle\nabla F(W), W' - W\rangle + \frac{\lambda}{2}\|W' - W\|^2.
\]

According to the $L$-smoothness of the loss function, we have
\[
\mathbb{E}[\mathcal{L}(\mathbf{p}_t, W_{t+1})] \le \mathbb{E}\Big[\mathcal{L}(\mathbf{p}_t, W_t) + \langle g_t, W_{t+1} - W_t\rangle + \frac{L}{2}\|W_{t+1} - W_t\|_F^2\Big]
\le \mathbb{E}\Big[\mathcal{L}(\mathbf{p}_t, W_t) - \eta_w\langle g_t, \hat{g}_t\rangle + \frac{L\eta_w^2}{2}\|\hat{g}_t\|_F^2\Big]
\le \mathbb{E}[\mathcal{L}(\mathbf{p}_t, W_t)] - \eta_w\mathbb{E}[\|g_t\|_F^2] + \frac{L\eta_w^2\sigma^2}{2}.
\]
So we have
\[
\mathbb{E}[\|g_t\|_F^2] \le \frac{\mathbb{E}[\mathcal{L}(\mathbf{p}_t, W_t) - \mathcal{L}(\mathbf{p}_t, W_{t+1})]}{\eta_w} + \frac{L\eta_w\sigma^2}{2}
= \frac{\mathbb{E}[\mathcal{L}(\mathbf{p}_t, W_t) - \mathcal{L}(\mathbf{p}_{t+1}, W_{t+1})]}{\eta_w} + \frac{\mathbb{E}[\mathcal{L}(\mathbf{p}_{t+1}, W_{t+1}) - \mathcal{L}(\mathbf{p}_t, W_{t+1})]}{\eta_w} + \frac{L\eta_w\sigma^2}{2} \tag{12}
\]
Now we bound the difference between $\mathcal{L}(\mathbf{p}_{t+1}, W_{t+1})$ and $\mathcal{L}(\mathbf{p}_t, W_{t+1})$:
\[
\mathbb{E}[\mathcal{L}(\mathbf{p}_{t+1}, W_{t+1}) - \mathcal{L}(\mathbf{p}_t, W_{t+1})] = \mathbb{E}[(\mathbf{p}_{t+1} - \mathbf{p}_t)^\top f(W_{t+1})]
\le \mathbb{E}[\|\mathbf{p}_{t+1} - \mathbf{p}_t\|_1\,\|f(W_{t+1})\|_\infty] \le \gamma\,\mathbb{E}[\|\mathbf{p}_{t+1} - \mathbf{p}_t\|_1]
\]
\[
\le \gamma\,\mathbb{E}\big[\sqrt{2 D_{KL}(\mathbf{p}_t\|\mathbf{p}_{t+1})}\big] \tag{13}
\]
\[
\le \sqrt{2}\,\eta_p\gamma^2 \tag{14}
\]
Eqn. 13 follows from Pinsker's inequality, and Eqn. 14 follows from the inequality in Eqn. 9 by letting $\mathbf{p} = \mathbf{p}_t$.

Summing Eqn. 12 over $t$ from 1 to $T$ and applying Eqn. 14, we have
\[
\sum_t \mathbb{E}[\|g_t\|_F^2] \le \frac{\mathcal{L}(\mathbf{p}_1, W_1)}{\eta_w} + \frac{\sqrt{2}\,\eta_p T\gamma^2}{\eta_w} + \frac{T L\eta_w\sigma^2}{2}.
\]
On the other hand, by an analysis similar to Eqn. 10, we have
\[
\sum_t \mathbb{E}[\mathcal{L}(\mathbf{p}, W_t)] - \mathbb{E}[\mathcal{L}(\mathbf{p}_t, W_t)] \le \frac{\log(K)}{\eta_p} + T\eta_p\gamma^2.
\]

Proof of Theorem 2
Proof.
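The step analyzed in this proof is projected gradient ascent on the simplex with step size η_p^t = 1/(λt). A minimal sketch, using the sort-based Euclidean projection of Duchi et al. (2008), which is cited in the references; variable names are illustrative only.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex
    (the sort-based algorithm of Duchi et al. 2008)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = idx[u - css / idx > 0][-1]
    theta = css[rho - 1] / rho
    return np.maximum(v - theta, 0.0)

def ascent_step(p, h_hat, lam, t):
    """Projected gradient ascent with step size eta_t = 1/(lam * t),
    as in the analysis above."""
    return project_to_simplex(p + h_hat / (lam * t))

p = np.array([0.5, 0.3, 0.2])
p = ascent_step(p, h_hat=np.array([1.0, 0.1, 0.1]), lam=1.0, t=1)
print(p.sum())  # 1.0
```

The projection keeps every iterate a valid distribution, which is what the telescoping argument in the proof relies on.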
Since $\mathcal{L}(\mathbf{p}, W)$ is $\lambda$-strongly concave in $\mathbf{p}$, we have
\[
\mathbb{E}[\mathcal{L}(\mathbf{p}, W_t) - \mathcal{L}(\mathbf{p}_t, W_t)] \le \mathbb{E}\Big[(\mathbf{p} - \mathbf{p}_t)^\top h_t - \frac{\lambda}{2}\|\mathbf{p} - \mathbf{p}_t\|^2\Big]
= \mathbb{E}\Big[(\mathbf{p} - \mathbf{p}_t)^\top \hat{h}_t - \frac{\lambda}{2}\|\mathbf{p} - \mathbf{p}_t\|^2\Big]
\le \frac{\eta_p^t\mu^2}{2} + \frac{\mathbb{E}[\|\mathbf{p} - \mathbf{p}_t\|^2] - \mathbb{E}[\|\mathbf{p} - \mathbf{p}_{t+1}\|^2]}{2\eta_p^t} - \frac{\lambda}{2}\mathbb{E}[\|\mathbf{p} - \mathbf{p}_t\|^2].
\]
Taking $\eta_p^t = 1/(\lambda t)$ and summing from $t = 1$ to $T$, the terms telescope and we have
\[
\mathbb{E}\Big[\sum_t \mathcal{L}(\mathbf{p}, W_t) - \mathcal{L}(\mathbf{p}_t, W_t)\Big] \le \frac{\mu^2}{2\lambda}\sum_t \frac{1}{t} \le \frac{\mu^2(\log(T) + 1)}{2\lambda}.
\]
On the other hand, we have
\[
\mathbb{E}[\mathcal{L}(\mathbf{p}_{t+1}, W_{t+1}) - \mathcal{L}(\mathbf{p}_t, W_{t+1})] \le \mathbb{E}[(\mathbf{p}_{t+1} - \mathbf{p}_t)^\top \nabla_{\mathbf{p}}\mathcal{L}(\mathbf{p}_t, W_{t+1})]
= \mathbb{E}[(\mathbf{p}_{t+1} - \mathbf{p}_t)^\top \hat{h}_{t+1}] + \lambda\,\mathbb{E}[\|\mathbf{p}_{t+1} - \mathbf{p}_t\|^2]
\le \eta_p^t\mu^2 + \lambda(\eta_p^t)^2\mu^2.
\]
Taking this back to Eqn. 12 and summing over $t$ from 1 to $T$, we have
\[
\sum_t \mathbb{E}[\|g_t\|_F^2] \le \frac{\mathcal{L}(\mathbf{p}_1, W_1)}{\eta_w} + \frac{\sum_t \big(\eta_p^t\mu^2 + \lambda(\eta_p^t)^2\mu^2\big)}{\eta_w} + \frac{T L\eta_w\sigma^2}{2}
\le \frac{\mathcal{L}(\mathbf{p}_1, W_1)}{\eta_w} + \frac{\mu^2(\log(T) + 1 + \pi^2/6)}{\lambda\eta_w} + \frac{T L\eta_w\sigma^2}{2}.
\]
We finish the proof by letting $\eta_w = \mu\sqrt{\log(T)}/(\sigma\sqrt{\lambda L T})$.

Proof of Theorem 4
Proof.
According to the $L$-smoothness of the loss function, we have
\[
\mathbb{E}[\mathcal{F}(W_{t+1})] \le \mathbb{E}\Big[\mathcal{F}(W_t) + \langle\nabla\mathcal{F}(W_t), W_{t+1} - W_t\rangle + \frac{L}{2}\|W_{t+1} - W_t\|_F^2\Big]
\le \mathbb{E}\Big[\mathcal{F}(W_t) - \eta_w\langle\nabla\mathcal{F}(W_t), \hat{g}_t\rangle + \frac{L\eta_w^2}{2}\|\hat{g}_t\|_F^2\Big]
\]
\[
\le \mathbb{E}[\mathcal{F}(W_t)] - \eta_w\mathbb{E}[\|\nabla\mathcal{F}(W_t)\|_F^2] + \eta_w\mathbb{E}[\langle\nabla\mathcal{F}(W_t), \nabla\mathcal{F}(W_t) - \hat{g}_t\rangle] + \frac{L\eta_w^2\sigma^2}{2}
\le \mathbb{E}[\mathcal{F}(W_t)] - \eta_w\mathbb{E}[\|\nabla\mathcal{F}(W_t)\|_F^2] + \eta_w\sigma^2\,\mathbb{E}[\|\mathbf{p}_t^* - \mathbf{p}_t\|_1] + \frac{L\eta_w^2\sigma^2}{2}.
\]
So we have
\[
\mathbb{E}[\|\nabla\mathcal{F}(W_t)\|_F^2] \le \frac{\mathbb{E}[\mathcal{F}(W_t) - \mathcal{F}(W_{t+1})]}{\eta_w} + \sigma^2\xi_t + \frac{L\eta_w\sigma^2}{2},
\]
where $\xi_t$ bounds $\mathbb{E}[\|\mathbf{p}_t^* - \mathbf{p}_t\|_1]$. Summing the inequalities from $t = 1$ to $T$ and applying $\sum_t \xi_t \le 2\sqrt{T}$, we have
\[
\sum_t \mathbb{E}[\|\nabla\mathcal{F}(W_t)\|_F^2] \le \frac{\mathcal{F}(W_1)}{\eta_w} + 2\sigma^2\sqrt{T} + \frac{T L\eta_w\sigma^2}{2}.
\]
We complete the proof by setting $\eta_w = \sqrt{2}/(\sigma\sqrt{LT})$.

Proof of Proposition 1
By taking the closed-form solution for $P$, we have
\[
\mathcal{D}_{OT}(\mathbf{p}\|\mathbf{q}) = \max_{\alpha,\beta}\; \alpha^\top\mathbf{p} + \beta^\top\mathbf{q} - \sum_{i,j}\frac{1}{\lambda}\exp\big(-1 - \lambda(m_{ij} - \alpha_i - \beta_j)\big).
\]
Given two distributions $\mathbf{p}_x$ and $\mathbf{p}_y$ and $t \in [0,1]$, we have
\[
\mathcal{D}_{OT}(t\mathbf{p}_x + (1-t)\mathbf{p}_y\|\mathbf{q})
= \max_{\alpha,\beta}\; \alpha^\top(t\mathbf{p}_x + (1-t)\mathbf{p}_y) + \beta^\top\mathbf{q} - \sum_{i,j}\frac{1}{\lambda}\exp\big(-1 - \lambda(m_{ij} - \alpha_i - \beta_j)\big)
\]
\[
= \max_{\alpha,\beta}\; t\Big[\alpha^\top\mathbf{p}_x + \beta^\top\mathbf{q} - \sum_{i,j}\frac{1}{\lambda}\exp\big(-1 - \lambda(m_{ij} - \alpha_i - \beta_j)\big)\Big] + (1-t)\Big[\alpha^\top\mathbf{p}_y + \beta^\top\mathbf{q} - \sum_{i,j}\frac{1}{\lambda}\exp\big(-1 - \lambda(m_{ij} - \alpha_i - \beta_j)\big)\Big]
\]
\[
\le t\,\mathcal{D}_{OT}(\mathbf{p}_x\|\mathbf{q}) + (1-t)\,\mathcal{D}_{OT}(\mathbf{p}_y\|\mathbf{q}).
\]
Therefore $\mathcal{D}_{OT}(\mathbf{p}\|\mathbf{q})$ is convex in $\mathbf{p}$, and $\nabla_{\mathbf{p}}\mathcal{D}_{OT}(\mathbf{p}\|\mathbf{q}) = \alpha^*$, where $\alpha^*$ is the optimal solution of the maximization problem with the prior distribution $\mathbf{q}$. It can be obtained efficiently by the Sinkhorn-Knopp fixed-point iteration (Cuturi 2013).

Figure 8: Illustration of the improvement from the step-size.

Proof of Corollary 1
Proof.
First, we show that the behavior of the RHS of Theorem 3 in $c$ is determined by the function
\[
f(c) = \lambda c + \frac{\mu^2}{2\lambda}\ln\Big(\frac{T}{c} + 1\Big) + \frac{\mu^2}{2\lambda}.
\]
It is a convex function when $c > 0$, because
\[
f''(c) = \frac{\mu^2(T^2 + 2Tc)}{2\lambda(Tc + c^2)^2} \ge 0.
\]
Therefore $-f(c)$ is concave, and the optimal value can be obtained by setting the gradient to zero:
\[
f'(c) = \lambda - \frac{\mu^2 T}{2\lambda(Tc + c^2)} = 0.
\]
Hence $c$ has the closed-form solution
\[
c = \frac{\sqrt{T^2 + 2\mu^2 T/\lambda^2} - T}{2} = \frac{\mu^2}{\lambda^2\big(1 + \sqrt{1 + 2\mu^2/(\lambda^2 T)}\big)}.
\]
To illustrate the influence of $c$, we show an example in Fig. 8 (with $\lambda = 1$). First, we define the regret of the algorithm as
\[
\mathrm{Regret} = \max_{\mathbf{p}\in\Delta}\sum_t \hat{\mathcal{L}}(\mathbf{p}, W_t) - \sum_t \hat{\mathcal{L}}(\mathbf{p}_t, W_t).
\]
The baseline is the regret of the conventional step size $\eta_p^t = 1/(\lambda t)$, which is $\mu^2(\ln(T) + 1)/(2\lambda)$. The regret of the proposed step size is denoted by the red line in Fig. 8, and it shows that the regret can be significantly reduced by setting the constant $c$ as above.
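The closed-form constant can be evaluated directly. The sketch below uses the form of f(c) reconstructed in this proof, f(c) = λc + (μ²/2λ)ln(T/c + 1) (plus a constant), and checks that its gradient vanishes at the returned c; the numeric values of T, μ, and λ are hypothetical, chosen for illustration only.

```python
import numpy as np

def optimal_c(T, mu, lam):
    """Closed-form minimizer of f(c) = lam*c + mu^2/(2*lam)*ln(T/c + 1),
    in the form reconstructed in the proof of Corollary 1."""
    return mu**2 / (lam**2 * (1.0 + np.sqrt(1.0 + 2.0 * mu**2 / (lam**2 * T))))

# Hypothetical values chosen for illustration.
T, mu, lam = 10_000, 10.0, 1.0
c = optimal_c(T, mu, lam)
# The gradient f'(c) = lam - mu^2*T / (2*lam*(T*c + c^2)) vanishes at c.
grad = lam - mu**2 * T / (2.0 * lam * (T * c + c**2))
print(abs(grad) < 1e-3)  # True
```

For large T the square root tends to 1, so c approaches μ²/(2λ²), a constant independent of the horizon.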