Margin-Based Transfer Bounds for Meta Learning with Deep Feature Embedding
Jiechao Guan, Zhiwu Lu*, Tao Xiang, Timothy Hospedales
Beijing Key Laboratory of Big Data Management and Analysis Methods, Gaoling School of Artificial Intelligence, Renmin University of China
Department of Electrical and Electronic Engineering, University of Surrey
School of Informatics, The University of Edinburgh
[email protected], [email protected]
*Corresponding author
Abstract
By transferring knowledge learned from seen/previous tasks, meta learning aims to generalize well to unseen/future tasks. Existing meta-learning approaches have shown promising empirical performance on various multiclass classification problems, but few provide theoretical analysis on the classifiers' generalization ability on future tasks. In this paper, under the assumption that all classification tasks are sampled from the same meta-distribution, we leverage margin theory and statistical learning theory to establish three margin-based transfer bounds for meta-learning based multiclass classification (MLMC). These bounds reveal that the expected error of a given classification algorithm for a future task can be estimated with the average empirical error on a finite number of previous tasks, uniformly over a class of preprocessing feature maps/deep neural networks (i.e., deep feature embeddings). To validate these bounds, instead of the commonly-used cross-entropy loss, a multi-margin loss is employed to train a number of representative MLMC models. Experiments on three benchmarks show that these margin-based models still achieve competitive performance, validating the practical value of our margin-based theoretical analysis.
Introduction
Inspired by humans' ability to recognize an unseen/new object category, meta-learning based multiclass classification (MLMC), one instantiation of which is k-way s-shot classification (e.g., k = 5 and s = 10) (Vinyals et al. 2016), has been studied intensively in the past few years. It is often cast into a meta-learning scenario (Sebastian and Lorien 1998), in which a meta-learner learns prior knowledge from several training tasks and then facilitates a base-learner to generalize well on unseen/future tasks. Recent meta-learning based classification models normally employ a deep convolutional neural network to learn each task. As a typical example, the meta-learner learns a prior which is often in the form of a feature extractor (i.e., the lower layers of the network, often called a deep feature embedding), and the base-learner relies on the prior (with the feature extractor frozen) to update the fully connected layer for classification given a new task. The goal of MLMC is thus to learn an optimal deep feature embedding from observed tasks and train a new classifier which performs well on a future task.
Existing meta-learning based classification approaches (Vinyals et al. 2016; Snell, Swersky, and Zemel 2017; Sung et al. 2018; Finn, Abbeel, and Levine 2017) have shown promising empirical performance on several benchmarks, but few provide theoretical analysis on the expected performance of their learned classifiers on new/unseen tasks, which may have different data distributions from those of previous training tasks. To overcome this limitation, the focus of this paper is thus on providing theoretical guarantees on how well a meta-learning based classification model can generalize to new/unseen tasks.

The central assumption of our theoretical results is that the learner is embedded within a distribution of related learning tasks. Since all learning tasks, either from seen classes or unseen ones, share some similarities (e.g., all tasks on the CUB dataset (Wah et al. 2011) are about recognizing fine-grained bird species), it is reasonable to assume that these tasks are sampled from a common meta-distribution, which is referred to as an environment in (Baxter 2000). Given that each task represents a data distribution, an environment can be considered as a distribution of distributions. Starting with a stream of datasets drawn from different training tasks in this environment, we aim to learn a multiclass classifier which minimizes the transfer risk (Baxter 2000; Maurer 2009; Maurer, Pontil, and Romera-Paredes 2016) for a new task randomly sampled from the same environment.

Under this assumption, we leverage margin theory (Vapnik 1982) and statistical learning theory (Valiant 1984) to derive a high probability bound on the transfer risk. Margin theory has been utilized to study the behaviours of many machine learning models (Schapire et al. 1997; Koltchinskii and Panchenko 2002; Bartlett, Foster, and Telgarsky 2017), and is a standard tool to analyze multiclass classification problems (see Chapter 9 in (Mohri, Rostamizadeh, and Talwalkar 2012)). Under the probably approximately correct (PAC) learning framework, we establish a margin-based transfer bound with Gaussian complexity (Bartlett and Mendelson 2002) for MLMC (see Theorem 2). Bounding the Gaussian complexity with chaining (Dudley 1967) and other statistical learning techniques, we further provide a margin-based transfer bound with covering number (Anthony and Bartlett 2002) (in Theorem 6) and a transfer bound based on the VC-dimension (van der Vaart and Wellner 1996) of the given scoring function class (in Theorem 1), respectively. These results demonstrate that, for any fixed preprocessing deep neural network (meta-learner) and any given classification algorithm (base-learner), the expected error for a future MLMC task can be controlled by the average empirical margin loss on the training tasks. This theoretical analysis is applicable to any MLMC method which does not update the parameters of the meta-learner/feature extractor when performing classification in a new/unseen task. It thus covers a quite wide range of methods (see examples of such MLMC variants in Sect. Experiments).

Our main contributions are three-fold: (1) Our main theorem (i.e., Theorem 1) gives a rigorous theoretical statement that an MLMC model's transfer risk on an unseen task can be bounded by the empirical error on previous tasks plus a complexity part.
This transfer bound guarantees that under certain constraints (e.g., the scoring function class's VC-dimension v is finite and the task number n becomes large), the average empirical margin loss is a proper estimate of the expected loss on a new task. Further, our main theorem also reveals that the obtained transfer bound admits only a linear dependency on the number k of classification categories. To obtain this transfer bound, we first provide the margin-based transfer bounds for MLMC with Gaussian complexity and covering number, in Theorem 2 and Theorem 6, respectively. (2) Our transfer bounds (e.g., Theorem 1) are all dimension free when deep feature embedding is used for MLMC. Importantly, for a meta-learning problem in which a meta-learner learns a neural network from previous tasks and a base-learner learns a new classifier to generalize on a future task, our theoretical results reveal that the significant step is to learn a proper deep feature embedding function (combined with any multiclass classification algorithm) which can induce a large family of classifiers containing a good solution for any task sampled from the same environment. In addition, for meta learning with deep feature embedding, the sample efficiency per task can actually be guaranteed by our theoretical results (see the more detailed discussion at the end of the related work section).
(3) We adopt the multi-margin loss (a surrogate of the margin loss), rather than the cross-entropy loss commonly used in multiclass classification problems, to train existing typical MLMC models. The experimental results (see Tables 1–3 in Sect. Experiments) demonstrate that the models trained with the margin loss still achieve competitive performance on three benchmark datasets (miniImageNet (Vinyals et al. 2016), CUB (Wah et al. 2011) and miniImageNet → CUB). This clearly validates the practical value of our margin-based theoretical analysis for MLMC.

The remainder of this paper is organized as follows. Sect. Preliminary provides the background information and notations. Sect. Theoretical Results presents our main theoretical results on the transfer bound of MLMC, followed by the empirical experiments in Sect. Experiments. The differences between our theoretical results and closely-related works are discussed in Sect. Related Work. Sect. Conclusion and Future Work draws conclusions and points out future research directions.
Preliminary
Learning Setup
Tasks and Samples. In the meta-learning based multiclass classification (MLMC) problem, a task is a probability measure $\mu \in \mathcal{M}(\mathcal{X} \times \mathcal{Y})$, where $\mathcal{X}$ is an input space, $\mathcal{Y}$ is an output space (which is $\{1, \dots, k\}$ in multiclass classification), and $\mathcal{M}(\mathcal{H})$ generally denotes the set of probability measures on a space $\mathcal{H}$. Information about the task $\mu$ is obtained by independently sampling a finite number $m$ of training examples $(x_i, y_i) \sim \mu$. Such an $m$-tuple $((x_1, y_1), \dots, (x_m, y_m)) \sim \mu^m$ is called a sample, which is also denoted as $(\mathbf{x}, \mathbf{y}) = ((x_1, y_1), \dots, (x_m, y_m))$ with $\mathbf{x} = (x_1, \dots, x_m)$ and $\mathbf{y} = (y_1, \dots, y_m)$.

Algorithms. We consider the classification problem with the hypothesis space $\mathcal{F}$ of scoring functions. A meta-learning classification algorithm (e.g., based on SVM) is a function $f: (\mathcal{X} \times \mathcal{Y})^m \to \mathcal{F}$. Let $\mathcal{D}$ be a space of alternative feature maps/neural networks; for deep learning based methods, we need to choose a feature map $\phi (\in \mathcal{D})$ to preprocess the input data. This induces a new MLMC classification algorithm $f_\phi: (\mathcal{X} \times \mathcal{Y})^m \to \mathcal{F}$. We can regard $\phi$ and $f$ as the meta-learner and the base-learner, respectively. From the training sample $(\mathbf{x}, \mathbf{y})$, this algorithm learns a scoring function $f_\phi(\mathbf{x},\mathbf{y}) \in \mathcal{F}$: for each input-output pair $(x_i, y_i)$, the scoring function outputs the prediction score $f_\phi(\mathbf{x},\mathbf{y})(x_i, y)$, i.e., the probability of $x_i$ belonging to the class label $y$ ($y \in \mathcal{Y}$). Without loss of generality, we assume that there exists a positive number $b > 0$ such that $|f_\phi(\mathbf{x},\mathbf{y})(x, y)| \le b$, $\forall f_\phi(\mathbf{x},\mathbf{y}) \in \mathcal{F}$, $(x, y) \sim \mu$.

Environments. The encounter with a task $\mu$ is itself a random event, corresponding to a draw $\mu \sim \varepsilon$, where $\varepsilon$ is a probability measure on the set of tasks, i.e., $\varepsilon \in \mathcal{M}(\mathcal{M}(\mathcal{X} \times \mathcal{Y}))$. In this work, such probability measures are called environments as in (Baxter 2000; Maurer 2009). Information about the environment $\varepsilon$ is obtained by independently drawing a finite number $n$ of tasks $\{\mu^l : \mu^l \sim \varepsilon, l = 1, \dots, n\}$: each task $\mu^l$ is represented by a sample $(\mathbf{x}^l, \mathbf{y}^l) \sim (\mu^l)^m$, $(\mathbf{x}^l, \mathbf{y}^l) = ((x^l_1, y^l_1), \dots, (x^l_m, y^l_m))$, with the understanding that $\mathbf{x}^l = (x^l_1, \dots, x^l_m)$ and $\mathbf{y}^l = (y^l_1, \dots, y^l_m)$. Under the MLMC setting, the size $m$ of each task is kept the same to facilitate the analysis. Let $(\mathbf{X}, \mathbf{Y}) = ((\mathbf{x}^1, \mathbf{y}^1), \dots, (\mathbf{x}^n, \mathbf{y}^n))$ be the training data generated in this manner. Further, we define a probability measure $\hat{\varepsilon}$ on the set of samples $(\mathcal{X} \times \mathcal{Y})^m$ by letting the expectation $\mathbb{E}_{\hat{\varepsilon}}(g) = \mathbb{E}_{\mu \sim \varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y}) \sim \mu^m} g(\mathbf{x}, \mathbf{y})$ for every Borel measurable function $g$ on $(\mathcal{X} \times \mathcal{Y})^m$. The entire training data $(\mathbf{X}, \mathbf{Y})$ can thus be considered to be generated in $n$ independent draws from $\hat{\varepsilon}$, that is, $(\mathbf{X}, \mathbf{Y}) = ((\mathbf{x}^1, \mathbf{y}^1), \dots, (\mathbf{x}^n, \mathbf{y}^n)) \sim \hat{\varepsilon}^n$.
Margin Loss and Transfer Risk

The margin of a scoring function $f_\phi(\mathbf{x},\mathbf{y})$ (trained from the sample $(\mathbf{x}, \mathbf{y})$) at a labeled data point $(x_i, y_i)$ is
$$\rho_{f_\phi(\mathbf{x},\mathbf{y})}(x_i, y_i) \triangleq f_\phi(\mathbf{x},\mathbf{y})(x_i, y_i) - \max_{y \neq y_i} f_\phi(\mathbf{x},\mathbf{y})(x_i, y).$$
A real-valued function associated with any algorithm $f_\phi$ on the training sample $(\mathbf{x}, \mathbf{y})$ is its empirical loss $\hat{\ell}_{f_\phi}: (\mathcal{X} \times \mathcal{Y})^m \to \mathbb{R}_+$, defined by
$$\hat{\ell}_{f_\phi}(\mathbf{x}, \mathbf{y}) = \frac{1}{m}\sum_{i=1}^m \ell_{f_\phi(\mathbf{x},\mathbf{y})}(x_i, y_i),$$
where $\ell_{f_\phi(\mathbf{x},\mathbf{y})}(x_i, y_i) = \Phi_\rho \circ \rho_{f_\phi(\mathbf{x},\mathbf{y})}(x_i, y_i)$, $\circ$ denotes function composition, and for any $\rho > 0$ the margin loss is $\Phi_\rho(x) = \min\big(1, \max(0, 1 - \frac{x}{\rho})\big)$. To evaluate the performance of an MLMC algorithm $f_\phi$ ($\phi \in \mathcal{D}$) in an environment $\varepsilon$, the following steps are taken: (i) make a random choice of a task $\mu \sim \varepsilon$, (ii) draw a training sample $(\mathbf{x}, \mathbf{y}) \sim \mu^m$, (iii) select a test pair $(x, y) \sim \mu$, (iv) run the algorithm $f_\phi$ to obtain the scoring function $f_\phi(\mathbf{x},\mathbf{y})$, (v) return the loss $\ell_{f_\phi(\mathbf{x},\mathbf{y})}(x, y)$. The expected output of this procedure can be used to measure the generalization ability of the MLMC algorithm in the given environment. This motivates the following definition of the expected transfer risk associated with the learning algorithm $f_\phi$:
$$R_\varepsilon(f_\phi) = \mathbb{E}_{\mu \sim \varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y}) \sim \mu^m}\mathbb{E}_{(x,y) \sim \mu}\,\ell_{f_\phi(\mathbf{x},\mathbf{y})}(x, y). \quad (1)$$
Given all training data $(\mathbf{X}, \mathbf{Y}) \sim \hat{\varepsilon}^n$, we utilize $(\mathbf{X}, \mathbf{Y})$ to select a $\phi(\mathbf{X}, \mathbf{Y}) \in \mathcal{D}$ and fix it on the future task, so that the expected transfer risk $R_\varepsilon(f_{\phi(\mathbf{X},\mathbf{Y})})$ of the modified algorithm $f_{\phi(\mathbf{X},\mathbf{Y})}$ is minimal or near minimal. The conceptually simplest way is to select $\phi(\mathbf{X}, \mathbf{Y}) = \arg\min_{\phi \in \mathcal{D}}\frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l, \mathbf{y}^l)$, which minimizes the average empirical risk on the available training data. In this paper, we give a high probability bound on $R_\varepsilon(f_\phi)$ in terms of $\frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l, \mathbf{y}^l)$, and such a bound uniformly holds for all $\phi \in \mathcal{D}$, not just for $\phi(\mathbf{X}, \mathbf{Y})$ (see Theorem 1).
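For concreteness, the margin $\rho_{f}(x_i, y_i)$ and the ramp loss $\Phi_\rho$ defined above can be computed directly from a matrix of class scores. The following is a minimal PyTorch sketch; the function names and the toy scores are our own illustration, not part of the paper's implementation:

    import torch

    def multiclass_margin(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        """rho_f(x_i, y_i) = f(x_i, y_i) - max_{y != y_i} f(x_i, y), per example."""
        true_scores = scores.gather(1, labels.unsqueeze(1)).squeeze(1)
        masked = scores.clone()
        masked.scatter_(1, labels.unsqueeze(1), float("-inf"))  # hide the true class
        return true_scores - masked.max(dim=1).values

    def ramp_loss(margins: torch.Tensor, rho: float) -> torch.Tensor:
        """Phi_rho(x) = min(1, max(0, 1 - x/rho))."""
        return torch.clamp(1.0 - margins / rho, min=0.0, max=1.0)

    # toy usage: m = 4 examples, k = 5 classes
    scores = torch.randn(4, 5)
    labels = torch.randint(0, 5, (4,))
    empirical_margin_loss = ramp_loss(multiclass_margin(scores, labels), rho=1.0).mean()

The empirical loss $\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y})$ is then exactly the mean of the ramp losses over the $m$ training pairs.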
VC-dimension and Gaussian Complexity

Definition 0.1 (VC-dimension, 2.6.1 in (van der Vaart and Wellner 1996)). Let $\mathcal{C}$ be a collection of subsets of a set $\mathcal{X}$. $\mathcal{C}$ is said to shatter $\{x_1, \dots, x_n\}$ if each of its $2^n$ subsets can be expressed in the form $C \cap \{x_1, \dots, x_n\}$ for a $C$ in $\mathcal{C}$. The VC-dimension of the class $\mathcal{C}$ is the largest $n$ for which a set of size $n$ is shattered by $\mathcal{C}$.

Definition 0.2 (VC-dimension of a Real-Valued Function Class). The subgraph of a function $f (\in \mathcal{F}): \mathcal{X} \to \mathbb{R}$ is the subset of $\mathcal{X} \times \mathbb{R}$ given by $\{(x, t) : t < f(x)\}$. Then the VC-dimension of the function class $\mathcal{F}$ is defined as the VC-dimension of the set of subgraphs of functions in $\mathcal{F}$.

Definition 0.3 (Gaussian Complexity (Bartlett and Mendelson 2002)). For a subset $A \subseteq \mathbb{R}^m$, the Gaussian complexity of $A$ is defined as $\Gamma(A) = \mathbb{E}_\gamma \sup_{\mathbf{x} \in A}\frac{2}{m}\sum_{i=1}^m \gamma_i x_i$, where $\{\gamma_i\}_{i \ge 1}$ is a sequence of independent standard Gaussian variables (i.e., $\gamma_i \sim \mathcal{N}(0, 1)$). If $\mathcal{F}$ is a class of real-valued functions on the space $\mathcal{X}$ and $\mathbf{x} = (x_1, \dots, x_m) \in \mathcal{X}^m$, we define $\mathcal{F}(\mathbf{x}) = \mathcal{F}(x_1, \dots, x_m) = \{(f(x_1), \dots, f(x_m)) : f \in \mathcal{F}\} \subseteq \mathbb{R}^m$. The empirical Gaussian complexity of $\mathcal{F}$ on $\mathbf{x}$ is $\Gamma(\mathcal{F}(\mathbf{x}))$. Let $\mu \in \mathcal{M}(\mathcal{X})$ be a probability measure on $\mathcal{X}$; the corresponding expected complexity is $\mathbb{E}_{(\mathbf{x},\mathbf{y}) \sim \mu^m}\Gamma(\mathcal{F}(\mathbf{x}))$. $\mathcal{F}(\mathbf{X})$ and its expected Gaussian complexity $\mathbb{E}_{(\mathbf{X},\mathbf{Y}) \sim \hat{\varepsilon}^n}\Gamma(\mathcal{F}(\mathbf{X}))$ can be defined in a similar way.
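The empirical Gaussian complexity in Definition 0.3 admits a direct Monte Carlo estimate: draw standard Gaussian vectors $\gamma$, take the supremum of $\frac{2}{m}\sum_i \gamma_i f(x_i)$, and average over draws. A minimal NumPy sketch, in which a finite pool of evaluated functions stands in for the (generally infinite) class $\mathcal{F}$:

    import numpy as np

    def empirical_gaussian_complexity(fx: np.ndarray, n_draws: int = 2000,
                                      seed: int = 0) -> float:
        """Estimate Gamma(A) = E_gamma sup_{a in A} (2/m) sum_i gamma_i a_i.

        fx has shape (|A|, m): each row is (f(x_1), ..., f(x_m)) for one f in a
        finite function pool (an illustrative stand-in for the class F).
        """
        rng = np.random.default_rng(seed)
        num_f, m = fx.shape
        gammas = rng.standard_normal((n_draws, m))   # gamma_i ~ N(0, 1)
        sups = (gammas @ fx.T).max(axis=1)           # sup over the pool, per draw
        return float(2.0 / m * sups.mean())

    # toy usage: 50 candidate score vectors evaluated on m = 100 points
    rng = np.random.default_rng(1)
    fx = rng.uniform(-1.0, 1.0, size=(50, 100))
    print(empirical_gaussian_complexity(fx))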
Theoretical Results

In this section, we present our theoretical results. The most important one is Theorem 1, which reveals that the learning bound of an algorithm on new MLMC tasks (which may have different data distributions from those of previous tasks) can be controlled by the empirical loss on previous tasks plus a complexity term.
All detailed proofs for our theoretical results can be found in the supplementary material.

Theorem 1 (Margin-Based Transfer Bound for MLMC with VC-dimension). Assume that the VC-dimension of the real-valued function class $\Pi_\mathcal{F} = \{x \mapsto g(x, y) \mid g \in \mathcal{F}, y \in \mathcal{Y}, |\mathcal{Y}| = k\}$ is $v$, and $\Pi_\mathcal{F}$ is uniformly bounded by $b > 0$. Given a classification algorithm $f$ and a margin parameter $\rho > 0$, for any environment $\varepsilon \in \mathcal{M}(\mathcal{M}(\mathcal{X} \times \mathcal{Y}))$ and for any $\delta > 0$, with probability at least $1 - \delta$ on the data $(\mathbf{X}, \mathbf{Y}) \sim \hat{\varepsilon}^n$, we have for all feature maps $\phi \in \mathcal{D}$ that
$$R_\varepsilon(f_\phi) \le \frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l, \mathbf{y}^l) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \Big(\frac{k}{\rho\sqrt{m}} + \frac{k}{\rho\sqrt{n}}\Big)(C_1\sqrt{v} + C_2),$$
where the constants $C_1 = 24\sqrt{2\pi}\,b\,(1 + \sqrt{\log(16e)} + 2\sqrt{2})$ and $C_2 = 24\sqrt{2\pi}\,b\,(\sqrt{\log C} + \sqrt{\log(16e)})$, and $C$ is the uniform constant defined in Theorem 7.

The main proof is based on Theorem 6. To obtain Theorem 6, we first give the Gaussian complexity transfer bound in Theorem 2, which is accomplished by using Slepian's Lemma (Ledoux and Talagrand 1991) to bound the function class $G_\phi = \{(\mathbf{x}, \mathbf{y}) \mapsto \hat{\ell}_{f_\phi}(\mathbf{x}, \mathbf{y})\}$. This is not straightforward and thus constitutes our major technical novelty.
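To get a feel for how Theorem 1 behaves, its complexity term $(\frac{k}{\rho\sqrt{m}} + \frac{k}{\rho\sqrt{n}})(C_1\sqrt{v} + C_2)$ can be evaluated for a grid of $(m, n)$. A small sketch; since the uniform constant $C$ (and hence $C_1$, $C_2$) is unspecified, we set $C_1 = C_2 = 1$ purely to visualize the $O(k(\sqrt{v/m} + \sqrt{v/n}))$ decay — all numeric values here are illustrative placeholders:

    import math

    def complexity_term(k: int, rho: float, m: int, n: int, v: float,
                        c1: float = 1.0, c2: float = 1.0) -> float:
        """(k/(rho*sqrt(m)) + k/(rho*sqrt(n))) * (C1*sqrt(v) + C2), as in Theorem 1.
        C1, C2 depend on an unspecified uniform constant; set to 1 for illustration."""
        return (k / (rho * math.sqrt(m)) + k / (rho * math.sqrt(n))) \
            * (c1 * math.sqrt(v) + c2)

    # the term shrinks as either the per-task sample size m or the task number n grows
    for m, n in [(25, 100), (100, 100), (100, 10000)]:
        print(m, n, round(complexity_term(k=5, rho=1.0, m=m, n=n, v=50.0), 3))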
Gaussian Complexity Transfer Bound for MLMC

Theorem 2 (Margin-Based Transfer Bound for MLMC with Gaussian Complexity). Let $\mathcal{F}$ be a hypothesis space of scoring functions. Given a classification algorithm $f$ and a margin parameter $\rho > 0$, for any environment $\varepsilon \in \mathcal{M}(\mathcal{M}(\mathcal{X} \times \mathcal{Y}))$ and for any $\delta > 0$, with probability at least $1 - \delta$ on the data $(\mathbf{X}, \mathbf{Y}) \sim \hat{\varepsilon}^n$, we have for all feature maps $\phi \in \mathcal{D}$ that
$$R_\varepsilon(f_\phi) \le \frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l, \mathbf{y}^l) + \frac{k\sqrt{2m\pi}}{\rho}\,\mathbb{E}_{(\mathbf{X},\mathbf{Y})\sim\hat{\varepsilon}^n}\Gamma(\Pi_\mathcal{F}(\mathbf{X})) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \frac{k\sqrt{2\pi}}{\rho}\,\mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m}\Gamma(\Pi_\mathcal{F}(\mathbf{x})),$$
where $\Pi_\mathcal{F}(\mathbf{X}) = \{(f_\phi(\mathbf{X},\mathbf{Y})(x^1_1, y), \dots, f_\phi(\mathbf{X},\mathbf{Y})(x^1_m, y), \dots, f_\phi(\mathbf{X},\mathbf{Y})(x^n_1, y), \dots, f_\phi(\mathbf{X},\mathbf{Y})(x^n_m, y)) : y \in \mathcal{Y}, \phi \in \mathcal{D}\}$, $\Pi_\mathcal{F}(\mathbf{x}) = \{(f_\phi(\mathbf{x},\mathbf{y})(x_1, y), \dots, f_\phi(\mathbf{x},\mathbf{y})(x_m, y)) : y \in \mathcal{Y}, \phi \in \mathcal{D}\}$, and the scoring function $f_\phi(\mathbf{X},\mathbf{Y})$ is defined by $f_\phi(\mathbf{X},\mathbf{Y})(x^l_i, y) = f_\phi(\mathbf{x}^l,\mathbf{y}^l)(x^l_i, y)$, $\forall i \in [m], l \in [n]$.

The main proof strategy is to rewrite $R_\varepsilon(f_\phi) - \frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l, \mathbf{y}^l)$ in the following form:
$$\Big(R_\varepsilon(f_\phi) - \mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\hat{\varepsilon}}\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y})\Big) + \Big(\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\hat{\varepsilon}}\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y}) - \frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l,\mathbf{y}^l)\Big), \quad (2)$$
and then bound the two terms separately. The first term is the estimation difference expected for the future task (Maurer 2009). To bound it, we need to utilize the Lipschitz property of the margin loss and the properties of Gaussian complexity. The second part of Eq. (2) is the estimation difference between the expected empirical error of the MLMC classifier's output on a new task and the average empirical errors on the data of the past tasks. We choose to use PAC learning techniques and the Gaussian contraction inequality (Wainwright 2019) to obtain this term's upper bound. The obtained results are shown in Theorem 3 and Theorem 4, respectively.

Theorem 3. Let $\mathcal{F}$ and $\Pi_\mathcal{F}(\mathbf{x})$ be the same as in previous theorems. For $\rho > 0$, we have
$$R_\varepsilon(f_\phi) \le \mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\hat{\varepsilon}}\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y}) + \frac{k\sqrt{2\pi}}{\rho}\,\mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m}\Gamma(\Pi_\mathcal{F}(\mathbf{x})).$$

Theorem 4. Let $\mathcal{F}$ and $\Pi_\mathcal{F}(\mathbf{X})$ be the same as in previous theorems. For any $\delta > 0$, we have, with probability at least $1-\delta$ on the draw of the sample $((\mathbf{x}^1,\mathbf{y}^1), \dots, (\mathbf{x}^n,\mathbf{y}^n))$,
$$\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\hat{\varepsilon}}\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y}) \le \frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l,\mathbf{y}^l) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \frac{k\sqrt{2m\pi}}{\rho}\,\mathbb{E}_{(\mathbf{X},\mathbf{Y})\sim\hat{\varepsilon}^n}\Gamma(\Pi_\mathcal{F}(\mathbf{X})).$$

With these two theorems, we obtain the Gaussian complexity transfer bound for MLMC (Theorem 2).
Covering Number Transfer Bound for MLMC
Although we have shown a margin-based transfer bound in Theorem 2, the value of the Gaussian complexity is still implicit. Moreover, we are more interested in the variation of $\Gamma(\Pi_\mathcal{F}(\mathbf{X}))$ with the growth of $m$ and $n$. To this end, we need another more intuitive indicator to measure the complexity of the hypothesis space $\Pi_\mathcal{F}(\mathbf{X})$. In this work, we use the following covering number (Zhou 2002).

Definition 0.4 (Covering Number). Let $(M, d)$ be a metric space. A subset $\hat{T}$ is called an $\epsilon$-cover of $T \subseteq M$ if $\forall t \in T, \exists t' \in \hat{T}$ such that $d(t, t') \le \epsilon$. The covering number of $T$ is the cardinality of the smallest $\epsilon$-cover of $T$, that is, $\mathcal{N}(\epsilon, T, d) \triangleq \min\{|\hat{T}| \,:\, \hat{T} \text{ is an } \epsilon\text{-cover of } T\}$.

Let $(\mathcal{F}_{x_1,\dots,x_m}, L_2(\hat{D}))$ be the data-dependent $L_2$ metric space given by the metric $d(f, \hat{f}) \triangleq \|f - \hat{f}\|_2 = \sqrt{\frac{1}{m}\sum_{i=1}^m (f(x_i) - \hat{f}(x_i))^2}$, where $\mathbf{x} = (x_1, \dots, x_m)$ is a sample from the space $\mathcal{X}$ and $\mathcal{F}_{x_1,\dots,x_m}$ represents the restriction of the real-valued function class $\mathcal{F}$ to that sample. $\mathcal{N}(\epsilon, \mathcal{F}, L_2(\mathbf{X}))$ can be defined in a similar way with the data-dependent metric $d(f, \hat{f}) = \sqrt{\frac{1}{mn}\sum_{i=1}^n\sum_{j=1}^m (f(x_{ij}) - \hat{f}(x_{ij}))^2}$ ($f, \hat{f} \in \mathcal{F}$). Using the chaining technique (Talagrand 2014), the following refined theorem reveals the relationship between the Gaussian complexity and the covering number.

Theorem 5 (Refined Dudley Entropy Bound). For any real-valued function class $\mathcal{F}$ containing functions $f: \mathcal{X} \to \mathbb{R}$, assume that $\sup_{f\in\mathcal{F}}\|f\|_2$ is bounded under the $L_2(\mathbf{x})$ and $L_2(\mathbf{X})$ metrics respectively. Then
$$\Gamma(\mathcal{F}(\mathbf{x})) \le \frac{24}{\sqrt{m}}\int_0^{\sup_{f\in\mathcal{F}}\|f\|_2}\sqrt{\log\mathcal{N}(\tau, \mathcal{F}, L_2(\mathbf{x}))}\,\mathrm{d}\tau, \qquad \Gamma(\mathcal{F}(\mathbf{X})) \le \frac{24}{\sqrt{nm}}\int_0^{\sup_{f\in\mathcal{F}}\|f\|_2}\sqrt{\log\mathcal{N}(\tau, \mathcal{F}, L_2(\mathbf{X}))}\,\mathrm{d}\tau.$$

Bounding the Gaussian complexity in Theorem 2 with the Dudley integral in Theorem 5, we obtain the following covering number transfer bound for MLMC.

Theorem 6 (Margin-Based Transfer Bound for MLMC with Covering Number). Let $\mathcal{F}$ and $\Pi_\mathcal{F}$ be the same as in previous theorems. Given a classification algorithm $f$ and a margin parameter $\rho > 0$, let $L = \sup_{f\in\Pi_\mathcal{F}}\|f\|_2$; then for any environment $\varepsilon \in \mathcal{M}(\mathcal{M}(\mathcal{X}\times\mathcal{Y}))$ and for any $\delta > 0$, with probability at least $1-\delta$ on the data $(\mathbf{X},\mathbf{Y})\sim\hat{\varepsilon}^n$, we have for all feature maps $\phi\in\mathcal{D}$ that
$$R_\varepsilon(f_\phi) \le \frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l,\mathbf{y}^l) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \frac{24\sqrt{2\pi}\,k}{\rho\sqrt{n}}\,\mathbb{E}_{(\mathbf{X},\mathbf{Y})\sim\hat{\varepsilon}^n}\int_0^L\sqrt{\log\mathcal{N}(\tau,\Pi_\mathcal{F},L_2(\mathbf{X}))}\,\mathrm{d}\tau + \frac{24\sqrt{2\pi}\,k}{\rho\sqrt{m}}\,\mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m}\int_0^L\sqrt{\log\mathcal{N}(\tau,\Pi_\mathcal{F},L_2(\mathbf{x}))}\,\mathrm{d}\tau.$$

Theorem 7 (Theorem 2.6.7 in (van der Vaart and Wellner 1996)). Let $\mathcal{F}$ be a real-valued function class on $\mathcal{X}$ with VC-dimension $v$. Assume that $\mathcal{F}$ is uniformly bounded by $b > 0$. Then, for any probability distribution $Q$ on $\mathcal{X}$,
$$\mathcal{N}(\tau, \mathcal{F}, \|\cdot\|_{L_p(Q)}) \le C(v+1)(16e)^{v+1}\Big(\frac{b}{\tau}\Big)^{pv},$$
where $C > 0$ is a uniform constant, and for any $f, g\in\mathcal{F}$, $\|f-g\|_{L_p(Q)} = (\int|f-g|^p\,\mathrm{d}Q)^{1/p}$, $p \ge 1$. Further, since $|f(x_i, y)| \le b$, we have $\sup_{f\in\Pi_\mathcal{F}}\|f\|_2 = \sup_{f\in\Pi_\mathcal{F}}\sqrt{\frac{1}{m}\sum_{i=1}^m f(x_i,y)^2} \le \sqrt{\frac{1}{m}\sum_{i=1}^m b^2} = b$. Combining Theorem 6 and Theorem 7, we then obtain our most important theoretical result, Theorem 1.

From Theory to Implementation
In practical implementation, a multi-margin loss (Paszke et al. 2017) is preferable for the multiclass classification problem, because of the convexity of the loss function (Mohri, Rostamizadeh, and Talwalkar 2012). The multi-margin loss on an input-output pair $(x_i, y_i)$ is
$$\Psi(x_i, y_i) = \frac{1}{k-1}\sum_{y\neq y_i}^k \max\big(0,\, 1 - (f_\phi(x_i, y_i) - f_\phi(x_i, y))/\rho\big).$$
Define the empirical multi-margin loss on one training task as $\tilde{\ell}_{f_\phi}(\mathbf{x}^l, \mathbf{y}^l) = \frac{1}{m}\sum_{i=1}^m \Psi(x^l_i, y^l_i)$. Due to the relationship between the margin loss and the multi-margin loss, $\Phi_\rho(\rho_{f_\phi}(x_i, y_i)) \le (k-1)\Psi(x_i, y_i)$, we can replace the empirical margin loss $\hat{\ell}_{f_\phi}(\mathbf{x}^l, \mathbf{y}^l)$ in Theorem 1 with $(k-1)\tilde{\ell}_{f_\phi}(\mathbf{x}^l, \mathbf{y}^l)$:
$$R_\varepsilon(f_\phi) \le \frac{k-1}{n}\sum_{l=1}^n \tilde{\ell}_{f_\phi}(\mathbf{x}^l, \mathbf{y}^l) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \Big(\frac{k}{\rho\sqrt{m}} + \frac{k}{\rho\sqrt{n}}\Big)(C_1\sqrt{v} + C_2).$$
The multi-margin loss can be considered as a surrogate of the margin loss for easier model optimization. Next, we will use the multi-margin loss to train MLMC models.

[Table 1: The 5-way s-shot classification results on the miniImageNet dataset. We report the average accuracy (%, top-1) as well as the 95% confidence interval over all 600 test episodes, comparing the Multi-Margin loss with the Cross-Entropy loss under the 5-shot, 10-shot and 20-shot settings. We also give the results of Baseline++ (Chen et al. 2019) and MAML (Finn, Abbeel, and Levine 2017) (marked with ‡), to which our current theoretical analysis is nevertheless not applicable, for comprehensive performance comparison. Rows: Baseline++‡, MAML‡, MatchingNet (Vinyals et al. 2016), ProtoNet (Snell, Swersky, and Zemel 2017), RelationNet (Sung et al. 2018), MetaOptNet (Lee et al. 2019). The numeric entries were lost in extraction.]
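In PyTorch (Paszke et al. 2017), this swap amounts to replacing nn.CrossEntropyLoss with the built-in nn.MultiMarginLoss, whose margin argument plays the role of $\rho$. Note that PyTorch divides the sum over the $k-1$ wrong classes by $k$ rather than $k-1$ and does not rescale by $1/\rho$, which only changes the objective by a constant factor. A minimal sketch, with placeholder logits standing in for the output of a frozen feature extractor plus a linear head:

    import torch
    import torch.nn as nn

    # placeholder episode: k = 5 classes, a batch of 75 encoded query images
    logits = torch.randn(75, 5, requires_grad=True)
    targets = torch.randint(0, 5, (75,))

    rho = 1.0
    criterion = nn.MultiMarginLoss(margin=rho)  # replaces nn.CrossEntropyLoss()
    loss = criterion(logits, targets)
    loss.backward()  # gradients flow back to the scoring model as usual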
Experiments
In this section, we conduct experiments on three benchmarks to evaluate the performance of existing MLMC methods when the commonly-used cross-entropy loss is replaced by the multi-margin loss. Our main goal is to validate the practical value of our margin-based theoretical analysis for meta-learning based multiclass classification, since in some cases the margin loss or hinge loss may cause the problem of vanishing gradients in stochastic gradient descent.
Experiment Setup
Datasets and Backbone. (1) miniImageNet. This dataset is widely used for the conventional MLMC setting, and consists of 100 classes selected from ILSVRC-2012 (Russakovsky et al. 2015). Each class has 600 images. We use 64/16/20 classes for training/validation/test. (2) CUB. Under the fine-grained MLMC setting, we choose the CUB-200-2011 (CUB) dataset (Wah et al. 2011), which has 200 classes and 11,788 images of birds. As in (Chen et al. 2019), we use 100/50/50 classes for training/validation/test. (3) miniImageNet → CUB. Under the cross-domain MLMC setting, we use 100 classes from miniImageNet for training, and 50/50 classes from CUB for validation/test, as in (Chen et al. 2019). For all experiments, we use a four-layer convolutional neural network (Conv-4) (Vinyals et al. 2016) as the backbone with an input size of 84 × 84.

Episode Sampling. Under the standard MLMC setting (Vinyals et al. 2016; Snell, Swersky, and Zemel 2017), an episode is actually a training sample drawn from one task in an environment. A k-way s-shot q-query episode contains k(s + q) images; MLMC is thus instantiated as k-way s-shot classification (see the code sketch below). We set k = 5, q = 15 and s = 5/10/20 during both training and test stages. In particular, we train Baseline++ (Chen et al. 2019) (mini-batch-training based) with 400 epochs (batch size = 256), and train meta-learning based methods with 40,000 episodes. When applied to analyze the k-way s-shot setting, our transfer bound becomes $(\sqrt{k}/(\rho\sqrt{s+q}) + k/(\rho\sqrt{n}))(C_1\sqrt{v} + C_2)$ due to $m = k(s+q)$.

Evaluation Protocols. We evaluate performance on the test set under the 5-way 5-shot, 5-way 10-shot and 5-way 20-shot settings (15 queries for each test episode). Concretely, we randomly sample 600 episodes from the test set, and then report the average accuracy (%, top-1) as well as the 95% confidence interval over all the test episodes.
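A k-way s-shot q-query episode can be sampled as follows. This is a minimal sketch assuming the dataset is given as a mapping from class labels to lists of image indices; the function and variable names are our own, not from the paper's code:

    import random

    def sample_episode(class_to_images, k=5, s=5, q=15, seed=None):
        """Sample a k-way s-shot q-query episode: k classes, s support and q
        query images per class, i.e. k*(s+q) images in total."""
        rng = random.Random(seed)
        classes = rng.sample(sorted(class_to_images), k)
        support, query = [], []
        for new_label, cls in enumerate(classes):
            images = rng.sample(class_to_images[cls], s + q)
            support += [(img, new_label) for img in images[:s]]
            query += [(img, new_label) for img in images[s:]]
        return support, query

    # toy usage: 64 training classes with 600 images each (as in miniImageNet)
    class_to_images = {c: list(range(c * 600, (c + 1) * 600)) for c in range(64)}
    support, query = sample_episode(class_to_images, k=5, s=5, q=15, seed=0)
    assert len(support) == 25 and len(query) == 75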
Baselines for Comparison. We select six representative baselines for k-way s-shot classification: (i) a mini-batch-training method: Baseline++ (Chen et al. 2019), which first learns the classifier with the standard supervised training strategy, and then finetunes it on each task in the test stage; (ii) meta-learning based methods: three metric-learning based methods (MatchingNet (Vinyals et al. 2016), ProtoNet (Snell, Swersky, and Zemel 2017), RelationNet (Sung et al. 2018)), which just run a forward pass to obtain the test accuracies during the test stage, without updating the parameters of the feature extractor; one classifier-learning based method (MetaOptNet (Lee et al. 2019)), which fixes the parameters of the feature extractor to extract image features and trains a new SVM classifier in each new task; and one gradient based method (MAML (Finn, Abbeel, and Levine 2017)), which updates the parameters of both the feature extractor and the fully connected layer in each new task. Note that our theoretical analysis only holds for metric-learning and classifier-learning based models. For performance comparison, both the cross-entropy loss and the multi-margin loss are used for the above six representative baselines.

Implementation Details. Our implementation is based on PyTorch (Paszke et al. 2017). We train all models from scratch and use the Adam optimizer with an initial learning rate of $10^{-3}$. We select other hyperparameters (including ρ) by performing validation on the validation set.

Main Results
Performance of Multi-Margin Loss. We focus on comparing the multi-margin loss with the cross-entropy loss when both are used for k-way s-shot classification. From Tables 1–3, we have the following observations: (i) In most cases, the classification performance obtained with the multi-margin loss is comparable to that obtained with the cross-entropy loss. (ii) On the CUB dataset, the inferior performance of Baseline++ with the multi-margin loss suggests that the margin loss may be unsuitable for mini-batch-training model optimization in fine-grained classification. (iii) For meta-learning based classification methods, the multi-margin loss generally leads to competitive results (w.r.t. the cross-entropy loss), which is consistent with our margin-based theoretical analysis.

Influence of Hyperparameters. We further conduct experiments to study the influence of three hyperparameters – the number of training tasks n, the sample size per episode m (= ks) and the margin parameter ρ – on the performance of meta-learning based classification. We select ProtoNet as the baseline and show the 5-way test accuracies (i.e., k = 5) on miniImageNet in Figure 1. We can find that: (i) Meta-learning based classification with the multi-margin loss can achieve higher test accuracies with the growth of n and m, but the improvements are incremental when n or m is large. (ii) The performance of meta-learning based classification with the multi-margin loss is not so sensitive to the variations of ρ.

[Table 2: The 5-way s-shot classification results on the CUB dataset. We report the average accuracy (%, top-1) as well as the 95% confidence interval over all 600 test episodes, for the same six baselines and loss functions as Table 1. The numeric entries were lost in extraction.]

[Figure 1: (a) The 5-way test accuracies of ProtoNet with different numbers of training tasks on miniImageNet (ρ = 1). (b) The 5-way test accuracies of ProtoNet with different choices of ρ on miniImageNet. Plot data lost in extraction.]

[Table 3: Comparative results on the cross-domain miniImageNet → CUB dataset. Average 5-way 5-shot classification accuracies (%) with 95% confidence intervals are computed here; more s-shot classification results can be found in the supplementary material. Rows: Baseline++, MAML, MatchingNet, ProtoNet, RelationNet, MetaOptNet; columns: Cross-Entropy vs. Multi-Margin. The numeric entries were lost in extraction.]

Related Work
The pioneering work (Baxter 2000) on meta-learning theory differs from our results in several aspects: (1) (Baxter 2000) is not focused on the multiclass classification problem and does not explicitly reveal the relationship between the transfer bound and the number k of classification categories, whereas we show that our transfer bounds admit only a linear dependency on k. (2) Its transfer bound with a neural network feature map (Theorem 8 in (Baxter 2000)) depends on the feature dimension, while our transfer bounds (e.g., Theorem 1) are dimension free. (3) The theoretical results about feature maps in (Baxter 2000) pay more attention to simple two-layer neural networks and mainly use the neural network for feature dimension reduction. In contrast, this paper focuses on the role of deep feature embedding in image feature extraction, which is the modern development of feature engineering in the machine learning community.

Following (Baxter 2000), recent theoretical works can be divided into two main groups: one group explores PAC-Bayes theory (Pentina and Lampert 2014; Dziugaite and Roy 2017; Amit and Meir 2018) for meta learning, and the other utilizes conventional PAC learning techniques (i.e., without prior assumptions) to give transfer bounds for different models (e.g., regression learners (Maurer 2009; Maurer, Pontil, and Romera-Paredes 2016; Denevi et al. 2018)) or via algorithmic stability (Maurer 2005). Below, we detail our differences from the two most related works (Amit and Meir 2018; Maurer 2009).

PAC-Bayes Meta Learning Theory. It assumes a prior distribution over priors, a 'hyper-prior' $\mathcal{P}(P)$, and after training outputs a distribution over priors, a 'hyper-posterior' $\mathcal{Q}(P)$. Let $er(\mathcal{Q}, \varepsilon) = \mathbb{E}_{P\sim\mathcal{Q}}\,er(P, \varepsilon)$ be the expected loss measuring the quality of the hyper-posterior $\mathcal{Q}$, where $er(P, \varepsilon) \triangleq \mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{S\sim\mu^m}\mathbb{E}_{f\sim Q(S,P)}\mathbb{E}_{(x,y)\sim\mu}\,l(f, x, y)$ ($l$ is the loss function). $S_i = (\mathbf{x}^i, \mathbf{y}^i)$ is a sample and $\hat{er}(Q, S_i) = \mathbb{E}_{f\sim Q}\frac{1}{m}\sum_{j=1}^m l(f, x^i_j, y^i_j)$ is the empirical loss. $D(Q\|P) = \mathbb{E}_{f\sim Q}\ln\frac{Q(f)}{P(f)}$ denotes the Kullback-Leibler divergence between two distributions $Q$ and $P$.

Theorem 8 (Theorem 2 in (Amit and Meir 2018)). Let $Q: S \times \mathcal{F} \to \mathcal{F}$ be a base-learner and $Q_i \triangleq Q(S_i, P)$. For any hyper-posterior $\mathcal{Q}$ and for any $\delta > 0$, with probability at least $1-\delta$, the following inequality holds:
$$er(\mathcal{Q}, \varepsilon) \le \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{P\sim\mathcal{Q}}\,\hat{er}_i(Q_i, S_i) + \sqrt{\frac{D(\mathcal{Q}\|\mathcal{P}) + \log\frac{2n}{\delta}}{2(n-1)}} + \frac{1}{n}\sum_{i=1}^n\sqrt{\frac{D(\mathcal{Q}\|\mathcal{P}) + \mathbb{E}_{P\sim\mathcal{Q}}D(Q_i\|P) + \log\frac{2nm}{\delta}}{2(m-1)}}.$$

The transfer bound in Theorem 8 differs from ours in Theorem 1 in two aspects: (i) the PAC-Bayes bound is expressed for an average over multiple hypotheses (weighted by a posterior distribution), whereas our bound is expressed for any single hypothesis. (ii) Even ignoring the KL-divergence term between two distinct distributions, the PAC-Bayes bound still has a complexity part $O\big(\frac{\sqrt{\log(nm)}}{\sqrt{m}} + \frac{\sqrt{\log n}}{\sqrt{n}}\big)$, while we derive a bound $O\big(k(\sqrt{\frac{v}{m}} + \sqrt{\frac{v}{n}})\big)$, which is generally lower.

Transfer Bounds for Linear Regression. Another closely related work is (Maurer 2009), which focuses on linear regression problems. Below, we show the transfer bound for the regression learners in (Maurer 2009), and compare it with our Theorem 1. Considering regularized least squares regression, we have the weight vectors $\omega(\mathbf{x},\mathbf{y}) = \arg\min_{\omega\in H}\big(\frac{1}{m}\sum_{i=1}^m(\langle\omega, x_i\rangle - y_i)^2 + \lambda\|\omega\|^2\big)$. The empirical error is $\hat{\ell}_\omega(\mathbf{x},\mathbf{y}) = \frac{1}{m}\sum_{i=1}^m(\langle\omega(\mathbf{x},\mathbf{y}), x_i\rangle - y_i)^2$. $\mathcal{P}_d$ is the set of orthogonal projections $P$ with $d$-dimensional range, and $\hat{\ell}_{\omega_{\lambda^{-1}}P}(\mathbf{x}^l,\mathbf{y}^l) = \hat{\ell}_{\omega(\lambda^{-1/2}P^{1/2}\mathbf{x}^l,\mathbf{y}^l)}$ ($\lambda > 0$ is the regularization parameter). Let $\|C\|_\infty$ be the largest eigenvalue of the covariance operator $C$ for the total input distribution.

Theorem 9 (Theorem 1 in (Maurer 2009)). For any $\delta > 0$, we have with probability at least $1-\delta$ on the data $(\mathbf{X},\mathbf{Y}) = ((\mathbf{x}^1,\mathbf{y}^1), \dots, (\mathbf{x}^n,\mathbf{y}^n))$ that for all feature maps $P\in\mathcal{P}_d$,
$$R_\varepsilon(\omega_{\lambda^{-1}}P) \le \frac{1}{n}\sum_{l=1}^n \hat{\ell}_{\omega_{\lambda^{-1}}P}(\mathbf{x}^l,\mathbf{y}^l) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \frac{\sqrt{2\pi}\,d}{\lambda}\Big(\sqrt{\frac{\|C\|_\infty}{m}} + \sqrt{\frac{1}{n}}\Big).$$

Note that the transfer bounds for the regression and classification problems have a similar form $O(\frac{1}{\sqrt{m}} + \frac{1}{\sqrt{n}})$, if we ignore the dependence on the confidence parameter $\delta$ in Theorem 9 & Theorem 1 and assume that the VC-dimension $v$ in Theorem 1 is finite. That is, although our work is quite different from (Maurer 2009) (i.e., nonlinear classification vs. linear regression), our transfer bounds are somewhat similar to the transfer bound of (Maurer 2009), indirectly supporting the correctness of our derivation.

Finally, we stress that the sample efficiency per task can also be guaranteed by our transfer bounds for meta learning with deep feature embedding. Specifically, given the accuracy $\epsilon$, to make the inequality $R_\varepsilon(f_\phi) \le \frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l,\mathbf{y}^l) + \epsilon$ hold, which means our transfer bound $O\big(k(\sqrt{\frac{v}{m}} + \sqrt{\frac{v}{n}})\big) = ak(\sqrt{\frac{v}{m}} + \sqrt{\frac{v}{n}}) \le \epsilon$ ($a$ is a constant), we must let the number of examples per task required for good generalization obey $m \ge \frac{a^2k^2v}{(\epsilon - ak\sqrt{v/n})^2}\ (\triangleq \varphi(n))$. Since $\varphi(n)$ is a monotonically decreasing function (w.r.t. n), the number m of examples per task required for good generalization decreases as the number n of tasks increases. Therefore, our theoretical results guarantee the sample efficiency per task in meta learning. Similarly, such sample efficiency per task can also be guaranteed by (Maurer 2009), though it focuses on the regression problem instead.

Conclusion and Future Work
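The monotonicity of $\varphi(n)$ is easy to check numerically. A small sketch, where the constant a, the VC-dimension v and the accuracy $\epsilon$ take illustrative placeholder values:

    import math

    def phi(n: int, a: float = 0.1, k: int = 5, v: float = 50.0,
            eps: float = 1.0) -> float:
        """phi(n) = a^2 k^2 v / (eps - a k sqrt(v/n))^2, the per-task sample size
        needed for transfer risk within eps; defined once eps > a*k*sqrt(v/n)."""
        slack = eps - a * k * math.sqrt(v / n)
        if slack <= 0:
            raise ValueError("need more tasks: eps <= a*k*sqrt(v/n)")
        return a**2 * k**2 * v / slack**2

    # phi(n) decreases monotonically as the number of tasks n grows
    for n in [20, 100, 1000, 100000]:
        print(n, round(phi(n), 1))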
We have derived margin-based transfer bounds for meta-learning based multiclass classification, showing that its expected error on a future task can be properly estimated by its empirical error on previous tasks. We show that our transfer bounds only admit a linear dependency on the number of classification categories, and point out the importance of the choice of deep feature embedding in meta-learning. The experimental results demonstrate the practical significance of our margin-based theoretical analysis. Our ongoing research includes: (i) The cross-entropy loss is the most common choice for multiclass classification, and performs better than the multi-margin loss in some cases. One research direction is to explore whether we can use the cross-entropy loss to obtain similar theoretical results. (ii) Our most important theoretical result is obtained by using Gaussian complexity and Slepian's Lemma for Gaussian processes. Is it possible to obtain a similar or tighter transfer bound via more concise theoretical analysis?

References
Amit, R.; and Meir, R. 2018. Meta-Learning by Adjusting Priors Based on Extended PAC-Bayes Theory. In ICML, volume 80, 205–214.
Anthony, M.; and Bartlett, P. L. 2002. Neural Network Learning: Theoretical Foundations. Cambridge University Press.
Bartlett, P. L.; Foster, D. J.; and Telgarsky, M. 2017. Spectrally-normalized margin bounds for neural networks. In NeurIPS, 6240–6249.
Bartlett, P. L.; and Mendelson, S. 2002. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research 3: 463–482.
Baxter, J. 2000. A Model of Inductive Bias Learning. Journal of Artificial Intelligence Research 12: 149–198.
Chen, W.; Liu, Y.; Kira, Z.; Wang, Y. F.; and Huang, J. 2019. A Closer Look at Few-shot Classification. In ICLR.
Denevi, G.; Ciliberto, C.; Stamos, D.; and Pontil, M. 2018. Learning To Learn Around A Common Mean. In NeurIPS, 10190–10200.
Dudley, R. 1967. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis 1(3): 290–330.
Dziugaite, G. K.; and Roy, D. M. 2017. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. In UAI.
Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In ICML, 1126–1135.
Koltchinskii, V.; and Panchenko, D. 2002. Empirical Margin Distributions and Bounding the Generalization Error of Combined Classifiers. The Annals of Statistics 30(1): 1–50.
Ledoux, M.; and Talagrand, M. 1991. Probability in Banach Spaces. Berlin: Springer.
Lee, K.; Maji, S.; Ravichandran, A.; and Soatto, S. 2019. Meta-Learning With Differentiable Convex Optimization. In CVPR, 10657–10665.
Maurer, A. 2005. Algorithmic Stability and Meta-Learning. Journal of Machine Learning Research 6: 967–994.
Maurer, A. 2009. Transfer bounds for linear feature learning. Machine Learning 75(3): 327–350.
Maurer, A.; Pontil, M.; and Romera-Paredes, B. 2016. The Benefit of Multitask Representation Learning. Journal of Machine Learning Research 17: 81:1–81:32.
Mohri, M.; Rostamizadeh, A.; and Talwalkar, A. 2012. Foundations of Machine Learning. MIT Press.
Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch. In NeurIPS Workshop.
Pentina, A.; and Lampert, C. H. 2014. A PAC-Bayesian bound for Lifelong Learning. In ICML, volume 32, 991–999.
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115(3): 211–252.
Schapire, R. E.; Freund, Y.; Bartlett, P.; and Lee, W. S. 1997. Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. In ICML, 322–330.
Sebastian, T.; and Lorien, P. 1998. Learning to Learn. Springer.
Snell, J.; Swersky, K.; and Zemel, R. 2017. Prototypical Networks for Few-shot Learning. In NeurIPS, 4077–4087.
Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P. H.; and Hospedales, T. M. 2018. Learning to Compare: Relation Network for Few-Shot Learning. In CVPR, 1199–1208.
Talagrand, M. 2014. Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems. Springer.
Valiant, L. G. 1984. A Theory of the Learnable. Communications of the ACM 27(11): 1134–1142.
van der Vaart, A. W.; and Wellner, J. A. 1996. Weak Convergence and Empirical Processes: With Applications to Statistics. Berlin: Springer.
Vapnik, V. 1982. Estimation of Dependences Based on Empirical Data. Springer-Verlag New York.
Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; and Wierstra, D. 2016. Matching Networks for One Shot Learning. In NeurIPS, 3630–3638.
Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The Caltech-UCSD Birds-200-2011 Dataset. Technical report.
Wainwright, M. J. 2019. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.
Zhou, D. 2002. The covering number in learning theory. Journal of Complexity 18(3): 739–767.
SUPPLEMENTARY MATERIAL
In the supplementary material, Section Auxiliary Results provides auxiliary results which help prove our main theorems for meta-learning based multiclass classification (MLMC). Section Gaussian Complexity Transfer Bound and Section Covering Number Transfer Bound show our technical proofs for the transfer bounds with Gaussian complexity and covering number, respectively. Section VC-dimension Transfer Bound gives the proof of our most important result, the margin-based transfer bound with VC-dimension (i.e., Theorem 1 in the main paper). Section Cross-Domain Experiment Results gives the comprehensive experiment results on the cross-domain dataset miniImageNet → CUB.

Auxiliary Results
We use $\{\sigma_i : i \in \mathbb{N}\}$ to denote a sequence of independent Bernoulli variables (i.e., $\sigma_i \in \{-1, +1\}$) and $\{\gamma_i : i \in \mathbb{N}\}$ a sequence of independent standard Gaussian variables (i.e., $\gamma_i \sim \mathcal{N}(0,1)$), which are also independent of $\{\sigma_i\}$. For $A \subseteq \mathbb{R}^m$ we define the Rademacher complexity and Gaussian complexity of $A$ as
$$R(A) = \mathbb{E}_\sigma\sup_{\mathbf{x}\in A}\frac{2}{m}\sum_{i=1}^m \sigma_i x_i, \qquad \Gamma(A) = \mathbb{E}_\gamma\sup_{\mathbf{x}\in A}\frac{2}{m}\sum_{i=1}^m \gamma_i x_i.$$
Jensen's inequality implies the following relationship between these two kinds of complexities:
$$\Gamma(A) = \mathbb{E}\sup_{\mathbf{x}\in A}\frac{2}{m}\sum_{i=1}^m \gamma_i x_i = \mathbb{E}\sup_{\mathbf{x}\in A}\frac{2}{m}\sum_{i=1}^m |\gamma_i|\sigma_i x_i \ge \mathbb{E}\sup_{\mathbf{x}\in A}\frac{2}{m}\sum_{i=1}^m \mathbb{E}|\gamma_i|\,\sigma_i x_i = \sqrt{\frac{2}{\pi}}\,R(A). \quad (3)$$
The following theorem is fundamental to deriving our results (e.g., Theorems 14, 15 and 16). For the readers' benefit we present it with a sketch of the proof; the detailed proof can be found in (van der Vaart and Wellner 1996) for (i) and (Koltchinskii and Panchenko 2002) for (ii).

Theorem 10. Let $\mathcal{F}$ be a real-valued function class on a space $\mathcal{X}$ and $\mu \in \mathcal{M}(\mathcal{X})$. For $\mathbf{x} = (x_1, \dots, x_m) \in \mathcal{X}^m$ define
$$\Phi(\mathbf{x}) = \sup_{f\in\mathcal{F}}\Big(\mathbb{E}_{x\sim\mu}[f(x)] - \frac{1}{m}\sum_{i=1}^m f(x_i)\Big).$$
(i) $\mathbb{E}_{\mathbf{x}\sim\mu^m}[\Phi(\mathbf{x})] \le \mathbb{E}_{\mathbf{x}\sim\mu^m}R(\mathcal{F}(\mathbf{x}))$.
(ii) If $\mathcal{F}$ is [0,1]-valued then $\forall\delta > 0$ we have with probability greater than $1-\delta$ in $\mathbf{x}\sim\mu^m$ that
$$\Phi(\mathbf{x}) \le \mathbb{E}_{\mathbf{x}\sim\mu^m}R(\mathcal{F}(\mathbf{x})) + \sqrt{\frac{\ln(1/\delta)}{2m}}.$$
(iii) $R(\mathcal{F}(\mathbf{x}))$ can be replaced by $\sqrt{\pi/2}\,\Gamma(\mathcal{F}(\mathbf{x}))$ in (i) and (ii).

Proof. For any Rademacher variables $\sigma = \{\sigma_i\}_{i=1}^m$,
$$\mathbb{E}_{\mathbf{x}\sim\mu^m}[\Phi(\mathbf{x})] = \mathbb{E}_{\mathbf{x}\sim\mu^m}\sup_{f\in\mathcal{F}}\frac{1}{m}\mathbb{E}_{\mathbf{x}'\sim\mu^m}\sum_{i=1}^m\big(f(x'_i) - f(x_i)\big) \le \mathbb{E}_{\mathbf{x},\mathbf{x}'\sim\mu^m\times\mu^m}\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^m \sigma_i\big(f(x'_i) - f(x_i)\big).$$
The last inequality holds due to the symmetry of the measure $\mu^m\times\mu^m$ and the interchangeability between $x_i$ and $x'_i$. Taking the expectation over $\sigma$ and using the triangle inequality, we obtain (i). Then, applying the McDiarmid concentration inequality to $\Phi(\mathbf{x})$, we have with probability at least $1-\delta$ that $\Phi(\mathbf{x}) \le \mathbb{E}_{\mathbf{x}\sim\mu^m}[\Phi(\mathbf{x})] + \sqrt{\frac{\ln(1/\delta)}{2m}}$, and recalling (i) we have (ii). Finally, Eq. (3) gives (iii). □

The following theorems (Theorems 11, 12 and 13) about Gaussian complexity and Gaussian processes are needed to obtain our results (e.g., Theorems 15 and 16).
Theorem 11 (Slepian's Lemma (Ledoux and Talagrand 1991)). Let $\Omega$ and $\Xi$ be mean zero, separable Gaussian processes indexed by a common set $T$, such that
$$\mathbb{E}(\Omega_s - \Omega_t)^2 \le \mathbb{E}(\Xi_s - \Xi_t)^2 \quad \forall s, t \in T.$$
Then $\mathbb{E}\sup_{t\in T}\Omega_t \le \mathbb{E}\sup_{t\in T}\Xi_t$.

Theorem 12 (Gaussian Contraction Inequality, Exercise 5.12 in (Wainwright 2019)). Consider a bounded subset $T \subseteq \mathbb{R}^m$, and let $\{\gamma_i\}_{i\ge1}$ be independent $\mathcal{N}(0,1)$ random variables. Let $\Phi_i: \mathbb{R}\to\mathbb{R}$ be $\ell$-Lipschitz contractions, i.e., $\forall x, y \in \mathbb{R}$, $|\Phi_i(x) - \Phi_i(y)| \le \ell|x - y|$. Then we have
$$\mathbb{E}\sup_{t\in T}\sum_{i=1}^m \gamma_i\Phi_i(t_i) \le \ell\,\mathbb{E}\sup_{t\in T}\sum_{i=1}^m \gamma_i t_i.$$
Proof. Define Gaussian processes $\{\Omega_t\}_{t\in T}$, $\{\Xi_t\}_{t\in T}$, where $\Omega_t = \sum_{i=1}^m\gamma_i\Phi_i(t_i) = \langle\gamma, \Phi(t)\rangle$, $\Phi(t) = (\Phi_1(t_1), \dots, \Phi_m(t_m))^T$ and $\Xi_t = \sum_{i=1}^m \ell\gamma_i t_i$. We have
$$\mathbb{E}(\Omega_s - \Omega_t)^2 = \mathbb{E}\big(\langle\gamma, \Phi(s) - \Phi(t)\rangle\big)^2 = \big(\Phi(s)-\Phi(t)\big)^T\,\mathbb{E}[\gamma\gamma^T]\,\big(\Phi(s)-\Phi(t)\big) = \sum_{i=1}^m\big(\Phi_i(s_i)-\Phi_i(t_i)\big)^2 \le \sum_{i=1}^m \ell^2(s_i - t_i)^2 = \mathbb{E}(\Xi_s - \Xi_t)^2,$$
where we use $\mathbb{E}[\gamma\gamma^T] = I$ and the Lipschitz property. From Theorem 11 we have $\mathbb{E}\sup_{t\in T}\Omega_t \le \mathbb{E}\sup_{t\in T}\Xi_t$. □

Recall that $\Gamma(\mathcal{F}(\mathbf{x})) = \frac{2}{m}\mathbb{E}\sup_{f\in\mathcal{F}}\sum_{i=1}^m\gamma_i f(x_i)$. If the $\ell$-Lipschitz contractions satisfy $\Phi_i = \Phi$ for all $i\in[m]$, letting $\Phi\circ\mathcal{F}(\mathbf{x}) = \{(\Phi\circ f(x_1), \dots, \Phi\circ f(x_m)) : f\in\mathcal{F}\}$, then from Theorem 12 we have
$$\Gamma(\Phi\circ\mathcal{F}(\mathbf{x})) \le \ell\,\Gamma(\mathcal{F}(\mathbf{x})). \quad (4)$$

Theorem 13. Let $\mathcal{F}_1, \dots, \mathcal{F}_l$ be $l$ hypothesis sets in $\mathbb{R}^\mathcal{X}$, $l \ge 1$, and let $\mathcal{G} = \{\max\{f_1, \dots, f_l\} : f_i \in \mathcal{F}_i, i\in[l]\}$. Then for any sample $\mathbf{x}$ of size $m$, we have
$$\Gamma(\mathcal{G}(\mathbf{x})) \le \sum_{i=1}^l \Gamma(\mathcal{F}_i(\mathbf{x})).$$
Proof. The main idea is to notice that $\max\{f_1, f_2\} = (f_1 + f_2 + |f_1 - f_2|)/2$ and to use the sub-additivity of the sup function. The proof is similar to that of Lemma 9.1 in (Mohri, Rostamizadeh, and Talwalkar 2012) (a Rademacher complexity version of this result). The only difference is that for the Gaussian complexity we use the Gaussian contraction inequality, proved in Theorem 12. The detailed demonstration is left to readers. □

To demonstrate the refined Dudley entropy bound (Talagrand 2014) for the Gaussian complexity (see Theorem 17 below), we need the following refined Massart lemma, which bounds a finite set's Gaussian complexity via the set's cardinality.
Lemma 1 (Refined Massart Lemma). Let $A = \{\mathbf{a}_1, \dots, \mathbf{a}_N\}$ be a finite set of vectors in $\mathbb{R}^m$. Define $\bar{\mathbf{a}} = \frac{1}{N}\sum_{i=1}^N \mathbf{a}_i$. Then we have
$$\Gamma(A) \le \frac{2\max_{\mathbf{a}\in A}\|\mathbf{a} - \bar{\mathbf{a}}\|\sqrt{2\log N}}{m}.$$
Proof. Without loss of generality, we assume that $\bar{\mathbf{a}} = \mathbf{0}$. $\forall\lambda > 0$, let $A' = \{\lambda\mathbf{a}_1, \dots, \lambda\mathbf{a}_N\}$; then
$$\frac{m}{2}\Gamma(A') = \mathbb{E}_\gamma\sup_{\mathbf{a}\in A'}\langle\mathbf{a},\gamma\rangle = \mathbb{E}_\gamma\log\max_{\mathbf{a}\in A'}e^{\langle\mathbf{a},\gamma\rangle} \le \log\mathbb{E}_\gamma\max_{\mathbf{a}\in A'}e^{\langle\mathbf{a},\gamma\rangle} \quad \text{(Jensen)}$$
$$\le \log\mathbb{E}_\gamma\sum_{\mathbf{a}\in A'}e^{\langle\mathbf{a},\gamma\rangle} = \log\sum_{\mathbf{a}\in A'}\prod_{i=1}^m\mathbb{E}_{\gamma_i}e^{a_i\gamma_i} = \log\sum_{\mathbf{a}\in A'}\prod_{i=1}^m e^{\frac{a_i^2}{2}} \quad \Big(\int_\mathbb{R}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}e^{ax}\,\mathrm{d}x = e^{\frac{a^2}{2}}\Big)$$
$$= \log\sum_{\mathbf{a}\in A'}e^{\frac{\|\mathbf{a}\|^2}{2}} \le \log\Big(N\max_{\mathbf{a}\in A'}e^{\frac{\|\mathbf{a}\|^2}{2}}\Big) = \log N + \max_{\mathbf{a}\in A'}\frac{\|\mathbf{a}\|^2}{2}. \quad (5)$$
Let $L = \max_{\mathbf{a}\in A}\|\mathbf{a}\|$; then we have, $\forall\lambda > 0$,
$$\Gamma(A) = \frac{\Gamma(A')}{\lambda} \le \frac{2\log N}{m\lambda} + \frac{\lambda L^2}{m} \quad \text{by Eq. (5)}.$$
Plugging $\lambda = \sqrt{2\log N}/L$ into the above inequality, we have $\Gamma(A) \le \frac{2L\sqrt{2\log N}}{m}$. □
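As a sanity check, Lemma 1 can be verified numerically on random finite sets. A short NumPy sketch comparing a Monte Carlo estimate of $\Gamma(A)$ against the bound $2\max_{\mathbf{a}}\|\mathbf{a}-\bar{\mathbf{a}}\|\sqrt{2\log N}/m$ (as reconstructed above; all parameter values are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    N, m = 40, 200
    A = rng.normal(size=(N, m))                    # a finite set of N vectors in R^m

    gammas = rng.standard_normal((5000, m))
    gamma_cx = 2.0 / m * (gammas @ A.T).max(axis=1).mean()   # Monte Carlo Gamma(A)

    L = np.linalg.norm(A - A.mean(axis=0), axis=1).max()     # max ||a - a_bar||
    bound = 2.0 * L * np.sqrt(2.0 * np.log(N)) / m
    print(f"Gamma(A) ~ {gamma_cx:.3f} <= bound {bound:.3f}")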
Gaussian Complexity Transfer Bound

Theorem 14 (Margin-Based Transfer Bound for MLMC with Gaussian Complexity; Theorem 2 in the main paper). Let $\mathcal{F}$ be a hypothesis space of scoring functions. Given a classification algorithm $f$ and a margin parameter $\rho > 0$, for any environment $\varepsilon \in \mathcal{M}(\mathcal{M}(\mathcal{X}\times\mathcal{Y}))$ and for any $\delta > 0$, with probability at least $1-\delta$ on the data $(\mathbf{X},\mathbf{Y})\sim\hat{\varepsilon}^n$, we have for all feature maps $\phi\in\mathcal{D}$ that
$$R_\varepsilon(f_\phi) \le \frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l,\mathbf{y}^l) + \frac{k\sqrt{2m\pi}}{\rho}\,\mathbb{E}_{(\mathbf{X},\mathbf{Y})\sim\hat{\varepsilon}^n}\Gamma(\Pi_\mathcal{F}(\mathbf{X})) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \frac{k\sqrt{2\pi}}{\rho}\,\mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m}\Gamma(\Pi_\mathcal{F}(\mathbf{x})),$$
where $\Pi_\mathcal{F}(\mathbf{X}) = \{(f_\phi(\mathbf{X},\mathbf{Y})(x^1_1,y), \dots, f_\phi(\mathbf{X},\mathbf{Y})(x^1_m,y), \dots, f_\phi(\mathbf{X},\mathbf{Y})(x^n_1,y), \dots, f_\phi(\mathbf{X},\mathbf{Y})(x^n_m,y)) : y\in\mathcal{Y},\phi\in\mathcal{D}\}$, $\Pi_\mathcal{F}(\mathbf{x}) = \{(f_\phi(\mathbf{x},\mathbf{y})(x_1,y), \dots, f_\phi(\mathbf{x},\mathbf{y})(x_m,y)) : y\in\mathcal{Y},\phi\in\mathcal{D}\}$, and the scoring function $f_\phi(\mathbf{X},\mathbf{Y})$ is defined by $f_\phi(\mathbf{X},\mathbf{Y})(x^l_i,y) = f_\phi(\mathbf{x}^l,\mathbf{y}^l)(x^l_i,y)$, $\forall i\in[m], l\in[n]$.

The proof strategy is to rewrite $R_\varepsilon(f_\phi) - \frac{1}{n}\sum_{l=1}^n\hat{\ell}_{f_\phi}(\mathbf{x}^l,\mathbf{y}^l)$ in the following form:
$$\Big(R_\varepsilon(f_\phi) - \mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\hat{\varepsilon}}\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y})\Big) + \Big(\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\hat{\varepsilon}}\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y}) - \frac{1}{n}\sum_{l=1}^n\hat{\ell}_{f_\phi}(\mathbf{x}^l,\mathbf{y}^l)\Big), \quad (6)$$
and to bound the two terms with Theorem 15 and Theorem 16, respectively. We first give Lemma 2 to prove Theorem 15.
Lemma 2. A random variable $\sigma$ obeys a Bernoulli distribution, where $\mathbb{P}\{\sigma = +1\} = p$, $\mathbb{P}\{\sigma = -1\} = q$ ($p + q = 1$). Another independent random variable $\gamma$ obeys a standard Gaussian distribution, $\gamma \sim \mathcal{N}(0,1)$. Then the product $\xi = \sigma\gamma$ still obeys a standard Gaussian distribution.

Proof. $\forall z\in\mathbb{R}$,
$$\mathbb{P}\{\xi \le z\} = \mathbb{P}\{\sigma\gamma \le z\} = \mathbb{P}\{\sigma > 0, \gamma \le z/\sigma\} + \mathbb{P}\{\sigma < 0, \gamma \ge z/\sigma\} = \mathbb{P}\{\sigma = +1, \gamma \le z\} + \mathbb{P}\{\sigma = -1, \gamma \ge -z\}$$
$$\overset{(i)}{=} \mathbb{P}\{\sigma = +1\}\mathbb{P}\{\gamma \le z\} + \mathbb{P}\{\sigma = -1\}\mathbb{P}\{\gamma \ge -z\} = p\int_{-\infty}^z\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\,\mathrm{d}x + q\int_{-z}^\infty\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\,\mathrm{d}x,$$
where (i) holds due to the independence of $\sigma$ and $\gamma$. The density function of $\xi$ is
$$f_\xi(z) = \frac{\mathrm{d}\mathbb{P}\{\xi\le z\}}{\mathrm{d}z} = p\,\frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}} + q\,\frac{1}{\sqrt{2\pi}}e^{-\frac{(-z)^2}{2}} = \frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}}.$$
Therefore, $\xi$ obeys a standard Gaussian distribution. □

Theorem 15 (Theorem 3 in the main paper). Let $\mathcal{F}$ and $\Pi_\mathcal{F}(\mathbf{x})$ be the same as in previous theorems. For $\rho > 0$, we have
$$R_\varepsilon(f_\phi) \le \mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\hat{\varepsilon}}\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y}) + \frac{k\sqrt{2\pi}}{\rho}\,\mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m}\Gamma(\Pi_\mathcal{F}(\mathbf{x})).$$

Proof.
Define the vector spaces
$$\mathcal{F}_\rho(\mathbf{x},\mathbf{y}) = \{(\rho_{f_\phi(\mathbf{x},\mathbf{y})}(x_1,y_1), \dots, \rho_{f_\phi(\mathbf{x},\mathbf{y})}(x_m,y_m)) : \phi\in\mathcal{D}\},$$
$$\Phi_\rho\circ\mathcal{F}_\rho(\mathbf{x},\mathbf{y}) = \{(\Phi_\rho\circ\rho_{f_\phi(\mathbf{x},\mathbf{y})}(x_1,y_1), \dots, \Phi_\rho\circ\rho_{f_\phi(\mathbf{x},\mathbf{y})}(x_m,y_m)) : \phi\in\mathcal{D}\}.$$
We then have
$$R_\varepsilon(f_\phi) - \mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\hat{\varepsilon}}\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y}) = \mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m}\big[\mathbb{E}_{(x,y)\sim\mu}\ell_{f_\phi(\mathbf{x},\mathbf{y})}(x,y) - \hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y})\big]$$
$$\le \mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m}\Big[\sup_{\phi\in\mathcal{D}}\Big(\mathbb{E}_{(x,y)\sim\mu}\Phi_\rho\circ\rho_{f_\phi(\mathbf{x},\mathbf{y})}(x,y) - \frac{1}{m}\sum_{i=1}^m\Phi_\rho\circ\rho_{f_\phi(\mathbf{x},\mathbf{y})}(x_i,y_i)\Big)\Big]$$
$$\overset{(i)}{\le} \sqrt{\frac{\pi}{2}}\,\mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m}\Gamma(\Phi_\rho\circ\mathcal{F}_\rho(\mathbf{x},\mathbf{y})) \overset{(ii)}{\le} \sqrt{\frac{\pi}{2}}\,\frac{1}{\rho}\,\mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m}\Gamma(\mathcal{F}_\rho(\mathbf{x},\mathbf{y})).$$
Inequality (i) holds due to Theorem 10, (i) and (iii), and inequality (ii) uses Eq. (4) with the $\frac{1}{\rho}$-Lipschitz loss $\Phi_\rho$. Using the sub-additivity of the sup function, we can bound the Gaussian complexity
$$\Gamma(\mathcal{F}_\rho(\mathbf{x},\mathbf{y})) = \mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m\big(f_\phi(\mathbf{x},\mathbf{y})(x_i,y_i) - \max_{y\neq y_i}f_\phi(\mathbf{x},\mathbf{y})(x_i,y)\big)\gamma_i$$
$$\le \mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m f_\phi(\mathbf{x},\mathbf{y})(x_i,y_i)\gamma_i + \mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m\big(-\max_{y\neq y_i}f_\phi(\mathbf{x},\mathbf{y})(x_i,y)\big)\gamma_i. \quad (7)$$
With the sub-additivity of sup, the first term above can be bounded by
$$\mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m f_\phi(\mathbf{x},\mathbf{y})(x_i,y_i)\gamma_i = \mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m\sum_{y\in\mathcal{Y}}\gamma_i f_\phi(\mathbf{x},\mathbf{y})(x_i,y)\,\mathbb{1}_{y=y_i} \overset{(i)}{\le} \sum_{y\in\mathcal{Y}}\mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m\gamma_i f_\phi(\mathbf{x},\mathbf{y})(x_i,y)\Big(\frac{\epsilon_i+1}{2}\Big)$$
$$\le \frac{1}{2}\sum_{y\in\mathcal{Y}}\mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m f_\phi(\mathbf{x},\mathbf{y})(x_i,y)\epsilon_i\gamma_i + \frac{1}{2}\sum_{y\in\mathcal{Y}}\mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m f_\phi(\mathbf{x},\mathbf{y})(x_i,y)\gamma_i = k\,\Gamma(\Pi_\mathcal{F}(\mathbf{x})). \quad (8)$$
Inequality (i) uses the fact that $\epsilon_i = 2\cdot\mathbb{1}_{y=y_i} - 1 \in \{-1,+1\}$. The last equality holds because, from Lemma 2, $\epsilon_i\gamma_i$ and $\gamma_i$ admit the same distribution, and $|\mathcal{Y}| = k$. Similarly, we can obtain the upper bound of the second term on the r.h.s. of Eq. (7):
$$\mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m\big(-\max_{y\neq y_i}f_\phi(\mathbf{x},\mathbf{y})(x_i,y)\big)\gamma_i \le \mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m\max_{y\in\mathcal{Y}}f_\phi(\mathbf{x},\mathbf{y})(x_i,y)\gamma_i \le \sum_{y\in\mathcal{Y}}\mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m f_\phi(\mathbf{x},\mathbf{y})(x_i,y)\gamma_i = k\,\Gamma(\Pi_\mathcal{F}(\mathbf{x})), \quad (9)$$
where the second inequality applies Theorem 13. Combining Eqs. (7)-(9), we derive the expected result. □

Theorem 16 (Theorem 4 in the main paper). Let $\mathcal{F}$ and $\Pi_\mathcal{F}(\mathbf{X})$ be the same as in previous theorems. For any $\delta > 0$, we have, with probability at least $1-\delta$ on the draw of the sample $((\mathbf{x}^1,\mathbf{y}^1), \dots, (\mathbf{x}^n,\mathbf{y}^n))$,
$$\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\hat{\varepsilon}}\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y}) \le \frac{1}{n}\sum_{l=1}^n\hat{\ell}_{f_\phi}(\mathbf{x}^l,\mathbf{y}^l) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \frac{k\sqrt{2m\pi}}{\rho}\,\mathbb{E}_{(\mathbf{X},\mathbf{Y})\sim\hat{\varepsilon}^n}\Gamma(\Pi_\mathcal{F}(\mathbf{X})).$$

Proof.
Fix a meta-sample $(\mathbf{X},\mathbf{Y}) = ((\mathbf{x}^1,\mathbf{y}^1), \dots, (\mathbf{x}^n,\mathbf{y}^n))$. Define Gaussian processes $\Omega_\phi$ and $\Xi_\phi$ indexed by $\phi$ as follows:
$$\Omega_\phi = \sum_{l=1}^n\gamma^l\,\hat{\ell}_{f_\phi}(\mathbf{x}^l,\mathbf{y}^l), \qquad \Xi_\phi = \sum_{l=1}^n\sum_{i=1}^m\frac{\gamma^l_i}{\sqrt{m}\,\rho}\,\rho_{f_\phi(\mathbf{x}^l,\mathbf{y}^l)}(x^l_i,y^l_i),$$
where the $\gamma^l$ and $\gamma^l_i$ are mutually independent standard Gaussian variables. Define the function class $G_\phi = \{(\mathbf{x},\mathbf{y})\mapsto\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y})\}$ and observe that $\frac{2}{n}\mathbb{E}\sup_{\phi\in\mathcal{D}}\Omega_\phi = \Gamma(G_\phi(\mathbf{X},\mathbf{Y}))$. Then, $\forall\phi_1,\phi_2\in\mathcal{D}$, by using the orthogonality of the $\gamma^l$ we have
$$\mathbb{E}(\Omega_{\phi_1} - \Omega_{\phi_2})^2 = \mathbb{E}_\gamma\Big(\sum_{l=1}^n\gamma^l\big(\hat{\ell}_{f_{\phi_1}}(\mathbf{x}^l,\mathbf{y}^l) - \hat{\ell}_{f_{\phi_2}}(\mathbf{x}^l,\mathbf{y}^l)\big)\Big)^2 = \sum_{l=1}^n\Big(\frac{1}{m}\sum_{i=1}^m\Phi_\rho(\rho_{f_{\phi_1}(\mathbf{x}^l,\mathbf{y}^l)}(x^l_i,y^l_i)) - \Phi_\rho(\rho_{f_{\phi_2}(\mathbf{x}^l,\mathbf{y}^l)}(x^l_i,y^l_i))\Big)^2$$
$$\overset{(i)}{\le} \sum_{l=1}^n\frac{1}{m^2\rho^2}\Big(\sum_{i=1}^m\big|\rho_{f_{\phi_1}(\mathbf{x}^l,\mathbf{y}^l)}(x^l_i,y^l_i) - \rho_{f_{\phi_2}(\mathbf{x}^l,\mathbf{y}^l)}(x^l_i,y^l_i)\big|\Big)^2 \overset{(ii)}{\le} \sum_{l=1}^n\frac{1}{m\rho^2}\sum_{i=1}^m\big(\rho_{f_{\phi_1}(\mathbf{x}^l,\mathbf{y}^l)}(x^l_i,y^l_i) - \rho_{f_{\phi_2}(\mathbf{x}^l,\mathbf{y}^l)}(x^l_i,y^l_i)\big)^2 = \mathbb{E}(\Xi_{\phi_1} - \Xi_{\phi_2})^2.$$
Inequality (i) uses the Lipschitz property of the margin loss, and inequality (ii) applies the mean inequality $(\frac{1}{m}\sum_i|a_i|)^2 \le \frac{1}{m}\sum_i a_i^2$. Then from Theorem 11 we have $\mathbb{E}\sup_{\phi\in\mathcal{D}}\Omega_\phi \le \mathbb{E}\sup_{\phi\in\mathcal{D}}\Xi_\phi$. Multiplying by $2/n$ this becomes
$$\Gamma(G_\phi(\mathbf{X},\mathbf{Y})) \le \frac{2}{n}\,\mathbb{E}\sup_{\phi\in\mathcal{D}}\sum_{l=1}^n\sum_{i=1}^m\frac{\gamma^l_i}{\sqrt{m}\,\rho}\,\rho_{f_\phi(\mathbf{x}^l,\mathbf{y}^l)}(x^l_i,y^l_i) = \frac{\sqrt{m}}{\rho}\,\Gamma(\mathcal{F}_\rho(\mathbf{X},\mathbf{Y})),$$
where $\mathcal{F}_\rho(\mathbf{X},\mathbf{Y}) = \{(\rho_{f_\phi(\mathbf{X},\mathbf{Y})}(x^1_1,y^1_1), \dots, \rho_{f_\phi(\mathbf{X},\mathbf{Y})}(x^1_m,y^1_m), \dots, \rho_{f_\phi(\mathbf{X},\mathbf{Y})}(x^n_1,y^n_1), \dots, \rho_{f_\phi(\mathbf{X},\mathbf{Y})}(x^n_m,y^n_m)) : \phi\in\mathcal{D}\}$ and we define $\rho_{f_\phi(\mathbf{X},\mathbf{Y})}(x^l_i,y^l_i) = \rho_{f_\phi(\mathbf{x}^l,\mathbf{y}^l)}(x^l_i,y^l_i)$, $\forall i\in[m], l\in[n]$. Analogous to the process used to bound the Gaussian complexity $\Gamma(\mathcal{F}_\rho(\mathbf{x},\mathbf{y}))$ in Theorem 15, we can bound $\Gamma(\mathcal{F}_\rho(\mathbf{X},\mathbf{Y}))$ by $2k\,\Gamma(\Pi_\mathcal{F}(\mathbf{X}))$ and draw the conclusion
$$\Gamma(G_\phi(\mathbf{X},\mathbf{Y})) \le \frac{2k\sqrt{m}}{\rho}\,\Gamma(\Pi_\mathcal{F}(\mathbf{X})). \quad (10)$$
Combining Eq. (10) and Theorem 10 (ii), (iii) completes the proof. □

Covering Number Transfer Bound
Theorem 17 (Refined Dudley Entropy Bound; Theorem 5 in the main paper). For any real-valued function class $\mathcal{F}$ containing functions $f:\mathcal{X}\to\mathbb{R}$, assume that $\sup_{f\in\mathcal{F}}\|f\|_2$ is bounded under the $L_2(\mathbf{x})$ and $L_2(\mathbf{X})$ metrics respectively. Then
$$\Gamma(\mathcal{F}(\mathbf{x})) \le \frac{24}{\sqrt{m}}\int_0^{\sup_{f\in\mathcal{F}}\|f\|_2}\sqrt{\log\mathcal{N}(\tau,\mathcal{F},L_2(\mathbf{x}))}\,\mathrm{d}\tau, \qquad \Gamma(\mathcal{F}(\mathbf{X})) \le \frac{24}{\sqrt{nm}}\int_0^{\sup_{f\in\mathcal{F}}\|f\|_2}\sqrt{\log\mathcal{N}(\tau,\mathcal{F},L_2(\mathbf{X}))}\,\mathrm{d}\tau.$$

Proof.
We prove only the first part; the main idea is the generic chaining technique. Let $\alpha_0 = \sup_{f\in\mathcal{F}}\|f\|$ and $\alpha_i = 2^{-i} \sup_{f\in\mathcal{F}}\|f\|$ for $i \ge 1$. Let $T_0 = \{0\}$, which is an $\alpha_0$-cover of $\mathcal{F}$, and let $T_i$ ($i \ge 1$) be an $\alpha_i$-cover of $\mathcal{F}$ with the smallest cardinality. Then for every $f$ and every $i \ge 1$ we can choose $\hat f_i \in T_i$ such that $\|f - \hat f_i\| \le \alpha_i$, and rewrite $f$ as a "chain": $f = f - \hat f_N + \sum_{i=1}^N (\hat f_i - \hat f_{i-1})$. Thus
\begin{align*}
\Gamma(\mathcal{F}(\mathbf{x})) &= \frac{2}{m}\,\mathbb{E}_\gamma \sup_{f\in\mathcal{F}} \sum_{i=1}^m \gamma_i f(x_i) = \frac{2}{m}\,\mathbb{E}_\gamma \sup_{f\in\mathcal{F}} \langle \gamma, f(\mathbf{x}) \rangle
= \frac{2}{m}\,\mathbb{E}_\gamma \sup_{f\in\mathcal{F}} \Big( \langle \gamma, f - \hat f_N \rangle + \Big\langle \gamma, \sum_{i=1}^N (\hat f_i - \hat f_{i-1}) \Big\rangle \Big) \\
&\le \frac{2}{m}\,\mathbb{E}_\gamma \sup_{f\in\mathcal{F}} \langle \gamma, f - \hat f_N \rangle + \sum_{i=1}^N \frac{2}{m}\,\mathbb{E}_\gamma \sup_{f\in\mathcal{F}} \langle \gamma, \hat f_i - \hat f_{i-1} \rangle. \tag{11}
\end{align*}
To bound the first term of the above equation, we have
\begin{align*}
\frac{2}{m}\,\mathbb{E}_\gamma \sup_{f\in\mathcal{F}} \langle \gamma, f - \hat f_N \rangle
&= \frac{2}{m}\,\mathbb{E}_\gamma \sup_{f\in\mathcal{F}} \sum_{i=1}^m \gamma_i \big( f(x_i) - \hat f_N(x_i) \big) \\
&\overset{(i)}{\le} \frac{2}{m}\,\mathbb{E}_\gamma \sup_{f\in\mathcal{F}} \Big( \sum_{i=1}^m \gamma_i^2 \Big)^{1/2} \Big( \sum_{i=1}^m \big( f(x_i) - \hat f_N(x_i) \big)^2 \Big)^{1/2} \\
&= 2\,\big( \mathbb{E}_\gamma \|\gamma\| \big)\big( \sup_{f\in\mathcal{F}} \|f - \hat f_N\| \big)
\overset{(ii)}{\le} 2\,\sqrt{\mathbb{E}_\gamma \|\gamma\|^2}\,\big( \sup_{f\in\mathcal{F}} \|f - \hat f_N\| \big) \le 2\alpha_N. \tag{12}
\end{align*}
Here (i) and (ii) use the Cauchy–Schwarz and Jensen inequalities respectively. The last inequality of Eq. (12) holds because, with $\|\cdot\|$ the normalized $L_2(\mathbf{x})$ norm, $\mathbb{E}_\gamma \|\gamma\|^2 = \mathbb{E}\,\frac{1}{m}\sum_{i=1}^m \gamma_i^2 = 1$ and $T_N$ is an $\alpha_N$-cover of $\mathcal{F}$. To bound the second term in the r.h.s. of Eq. (11), with the triangle inequality we have $\|\hat f_i - \hat f_{i-1}\| \le \|\hat f_i - f\| + \|f - \hat f_{i-1}\| \le \alpha_i + \alpha_{i-1} = 3\alpha_i$. Define the function class $\hat{\mathcal{F}}_i = \{\hat f_i - \hat f_{i-1} : \hat f_i \in T_i, \hat f_{i-1} \in T_{i-1}\}$; then
\begin{align*}
\sum_{i=1}^N \frac{2}{m}\,\mathbb{E}_\gamma \sup_{f\in\mathcal{F}} \langle \gamma, \hat f_i - \hat f_{i-1} \rangle
&= \sum_{i=1}^N \Gamma(\hat{\mathcal{F}}_i)
\le \sum_{i=1}^N \frac{6\alpha_i}{\sqrt{m}} \sqrt{2\log\big(|T_i|\cdot|T_{i-1}|\big)} \quad\text{(Lemma 1)} \\
&\le \sum_{i=1}^N \frac{12\alpha_i}{\sqrt{m}} \sqrt{\log|T_i|}
= \frac{24}{\sqrt{m}} \sum_{i=1}^N (\alpha_i - \alpha_{i+1}) \sqrt{\log|T_i|} \\
&\le \frac{24}{\sqrt{m}} \sum_{i=1}^N (\alpha_i - \alpha_{i+1}) \sqrt{\log \mathcal{N}(\tau,\mathcal{F},L_2(\mathbf{x}))} \quad (\tau \le \alpha_i) \\
&\le \frac{24}{\sqrt{m}} \sum_{i=1}^N \int_{\alpha_{i+1}}^{\alpha_i} \sqrt{\log \mathcal{N}(\tau,\mathcal{F},L_2(\mathbf{x}))}\,\mathrm{d}\tau
= \frac{24}{\sqrt{m}} \int_{\alpha_{N+1}}^{\alpha_0} \sqrt{\log \mathcal{N}(\tau,\mathcal{F},L_2(\mathbf{x}))}\,\mathrm{d}\tau. \tag{13}
\end{align*}
For any $\epsilon > 0$, let $N = \sup\{i : \alpha_i > 2\epsilon\}$. Then we have $\epsilon < \alpha_{N+1} \le 2\epsilon$ and $\alpha_N = 2\alpha_{N+1} \le 4\epsilon$. Combining Eqs. (11)-(13) and recalling that $\alpha_0 = \sup_{f\in\mathcal{F}}\|f\|$, we have
\[
\Gamma(\mathcal{F}(\mathbf{x})) \le 8\epsilon + \frac{24}{\sqrt{m}} \int_{\epsilon}^{\sup_{f\in\mathcal{F}}\|f\|} \sqrt{\log \mathcal{N}(\tau,\mathcal{F},L_2(\mathbf{x}))}\,\mathrm{d}\tau. \tag{14}
\]
Letting $\epsilon \to 0$ on the right-hand side of Eq. (14), we obtain the final result. $\blacksquare$
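Theorem 17 can be illustrated numerically on a small finite class. In the sketch below — entirely our own construction — covering numbers in the normalized $L_2(\mathbf{x})$ metric are over-estimated with a greedy cover (greedy cover sizes upper-bound the minimal $\mathcal{N}$, so the computed right-hand side is itself a valid, if loose, upper bound), and the left-hand side $\Gamma(\mathcal{F}(\mathbf{x}))$ is estimated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(2)
m, trials = 30, 20000

# Finite toy class: each row is the vector (f(x_1), ..., f(x_m)) of one f.
F = rng.uniform(-1.0, 1.0, size=(200, m))
sup_norm = np.sqrt((F ** 2).mean(axis=1)).max()   # sup_f ||f|| (normalized L2)

gam = rng.normal(size=(trials, m))
gamma_lhs = (2.0 / m) * (gam @ F.T).max(axis=1).mean()

def covering_number(tau):
    """Greedy tau-cover size in the normalized L2(x) metric (upper bounds N)."""
    centers = []
    for f in F:
        if not centers or min(np.sqrt(((f - c) ** 2).mean()) for c in centers) > tau:
            centers.append(f)
    return len(centers)

taus = np.linspace(1e-3, sup_norm, 60)
entropy = np.sqrt(np.log([covering_number(t) for t in taus]))
dudley_rhs = 24.0 / np.sqrt(m) * np.trapz(entropy, taus)
print(gamma_lhs, dudley_rhs)    # the Dudley integral bound dominates the complexity
```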
Combining Theorem 14 and Theorem 17, we immediately obtain the following margin-based covering number bound for few-shot learning.

Theorem 18 (Margin-based Transfer Bound for MLMC with Covering Number, Theorem 6 in main paper). Let $\mathcal{F}$ and $\Pi_1\mathcal{F}$ be the same as in previous theorems. Given a classification algorithm $f$ and a margin parameter $\rho > 0$, for any environment $\varepsilon \in \mathcal{M}(\mathcal{M}(\mathcal{X}\times\mathcal{Y}))$ and for any $\delta > 0$, with probability at least $1-\delta$ on the data $(X,Y) \sim \hat\varepsilon^n$, we have for all feature maps $\phi \in \mathcal{D}$ that
\begin{align*}
R_\varepsilon(f_\phi) \le \frac{1}{n}\sum_{l=1}^n \hat\ell_{f_\phi}(\mathbf{x}_l,\mathbf{y}_l) + \sqrt{\frac{\ln(1/\delta)}{2n}}
&+ \frac{24k\sqrt{2\pi}}{\rho\sqrt{n}}\,\mathbb{E}_{(X,Y)\sim\hat\varepsilon^n} \int_0^{\sup_{f\in\Pi_1\mathcal{F}}\|f\|} \sqrt{\log \mathcal{N}(\tau,\Pi_1\mathcal{F},L_2(X))}\,\mathrm{d}\tau \\
&+ \frac{24k\sqrt{2\pi}}{\rho\sqrt{m}}\,\mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m} \int_0^{\sup_{f\in\Pi_1\mathcal{F}}\|f\|} \sqrt{\log \mathcal{N}(\tau,\Pi_1\mathcal{F},L_2(\mathbf{x}))}\,\mathrm{d}\tau.
\end{align*}
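To make the structure of this bound concrete, a plug-in evaluator is sketched below. It is only an illustration of how the terms combine: the constant $24k\sqrt{2\pi}/\rho$ follows our reconstruction above, and the two entropy integrals are supplied as numbers (e.g. Monte Carlo estimates such as the one in the previous snippet); all names and example values are hypothetical.

```python
import numpy as np

def covering_number_transfer_bound(avg_margin_loss, n, m, k, rho, delta,
                                   entropy_integral_X, entropy_integral_x):
    """Right-hand side of the covering-number bound (Theorem 18 shape).

    entropy_integral_X / entropy_integral_x stand in for the two expected
    Dudley integrals over L2(X) and L2(x); they must be estimated separately.
    """
    confidence = np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    coef = 24.0 * k * np.sqrt(2.0 * np.pi) / rho
    return (avg_margin_loss + confidence
            + coef / np.sqrt(n) * entropy_integral_X
            + coef / np.sqrt(m) * entropy_integral_x)

# e.g. 5-way 5-shot: n training tasks, m = 25 support points per task
print(covering_number_transfer_bound(0.15, n=10_000, m=25, k=5, rho=1.0,
                                     delta=0.01, entropy_integral_X=0.5,
                                     entropy_integral_x=0.5))
```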
VC-dimension Transfer Bound

In this section, we bound the covering numbers $\mathcal{N}(\tau,\Pi_1\mathcal{F},L_2(X))$ and $\mathcal{N}(\tau,\Pi_1\mathcal{F},L_2(\mathbf{x}))$ in Theorem 18 via the VC-dimension of the hypothesis space $\Pi_1\mathcal{F}$. This yields our most important theoretical result, Theorem 20.

Theorem 19 (Theorem 2.6.7 in (van der Vaart and Wellner 1996)). Let $\mathcal{F}$ be a real-valued function class on $\mathcal{X}$ with VC-dimension $v$. Assume that $\mathcal{F}$ is uniformly bounded by $b > 0$. Then, for any probability distribution $Q$ on $\mathcal{X}$ and for any $p \ge 1$,
\[
\mathcal{N}\big(\tau,\mathcal{F},\|\cdot\|_{L_p(Q)}\big) \le C\,(v+1)\,(16e)^{v+1}\Big(\frac{b}{\tau}\Big)^{pv}
\]
for some universal constant $C > 0$, where for any $f, g \in \mathcal{F}$, $\|f - g\|_{L_p(Q)} = \big(\int |f-g|^p\,\mathrm{d}Q\big)^{1/p}$.

Theorem 20 (Margin-based Transfer Bound for MLMC with VC-dimension, Theorem 1 in main paper). Let the VC-dimension of $\Pi_1\mathcal{F}$ (defined in previous theorems) be $v$, and let $\Pi_1\mathcal{F}$ be uniformly bounded by $b > 0$. Given a classification algorithm $f$ and a margin parameter $\rho > 0$, for any environment $\varepsilon \in \mathcal{M}(\mathcal{M}(\mathcal{X}\times\mathcal{Y}))$ and for any $\delta > 0$, with probability at least $1-\delta$ on the data $(X,Y) \sim \hat\varepsilon^n$, we have for all feature maps $\phi \in \mathcal{D}$ that
\[
R_\varepsilon(f_\phi) \le \frac{1}{n}\sum_{l=1}^n \hat\ell_{f_\phi}(\mathbf{x}_l,\mathbf{y}_l) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \Big(\frac{k}{\rho\sqrt{m}} + \frac{k}{\rho\sqrt{n}}\Big)\big(C_1\sqrt{v} + C_2\big),
\]
where the constants are $C_1 = 24\sqrt{2\pi}\,b\,\big(1 + \sqrt{\log(16e)} + 2\sqrt{2}\big)$ and $C_2 = 24\sqrt{2\pi}\,b\,\big(\sqrt{\log C} + \sqrt{\log(16e)}\big)$, and $C$ is the universal constant in Theorem 19.

Proof. Notice that for all $y \in \mathcal{Y}$,
\[
\sup_{f\in\Pi_1\mathcal{F}} \|f\| = \sup_{f\in\Pi_1\mathcal{F}} \sqrt{\frac{1}{m}\sum_{i=1}^m f(x_i,y)^2} \le \sqrt{\frac{1}{m}\sum_{i=1}^m b^2} = b.
\]
From Theorem 19, we know there exists a universal constant $C$ such that $\mathcal{N}(\tau,\Pi_1\mathcal{F},L_2(\mathbf{x})) \le C(v+1)(16e)^{v+1}(b/\tau)^{2v}$. Then the integral of the square root of the log covering number in Theorem 18 can be bounded by
\begin{align*}
I &= \int_0^{\sup_{f\in\Pi_1\mathcal{F}}\|f\|} \sqrt{\log \mathcal{N}(\tau,\Pi_1\mathcal{F},L_2(\mathbf{x}))}\,\mathrm{d}\tau
\le \int_0^b \sqrt{\log\Big( C(v+1)(16e)^{v+1}\Big(\frac{b}{\tau}\Big)^{2v} \Big)}\,\mathrm{d}\tau \\
&= \int_0^b \sqrt{\log C + \log(v+1) + (v+1)\log(16e) + 2v\log\frac{b}{\tau}}\,\mathrm{d}\tau \\
&\overset{(i)}{\le} \int_0^b \sqrt{\log C + v + (v+1)\log(16e) + \frac{2vb}{\tau}}\,\mathrm{d}\tau
\overset{(ii)}{\le} \int_0^b \Big( \sqrt{\log C + v + (v+1)\log(16e)} + \sqrt{\frac{2vb}{\tau}} \Big)\,\mathrm{d}\tau \\
&\le \alpha\sqrt{v} + \beta,
\end{align*}
where (i) and (ii) hold due to the basic inequalities $\ln(x+1) \le x$ (so that $\log(v+1) \le v$ and $\log(b/\tau) \le b/\tau$) and $\sqrt{x+y} \le \sqrt{x} + \sqrt{y}$. Further, $\alpha = \big(1 + \sqrt{\log(16e)} + 2\sqrt{2}\big)\,b$ and $\beta = \big(\sqrt{\log C} + \sqrt{\log(16e)}\big)\,b$. Similarly, we can give the same bound for the integral $\int_0^{\sup_{f\in\Pi_1\mathcal{F}}\|f\|} \sqrt{\log \mathcal{N}(\tau,\Pi_1\mathcal{F},L_2(X))}\,\mathrm{d}\tau \le \alpha\sqrt{v} + \beta$. Then, combining with Theorem 18, we have with probability at least $1-\delta$ on the data $(X,Y) \sim \hat\varepsilon^n$,
\[
R_\varepsilon(f_\phi) \le \frac{1}{n}\sum_{l=1}^n \hat\ell_{f_\phi}(\mathbf{x}_l,\mathbf{y}_l) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \Big(\frac{k}{\rho\sqrt{m}} + \frac{k}{\rho\sqrt{n}}\Big)\big(C_1\sqrt{v} + C_2\big),
\]
where $C_1 = 24\sqrt{2\pi}\,\alpha = 24\sqrt{2\pi}\,b\,\big(1 + \sqrt{\log(16e)} + 2\sqrt{2}\big)$ and $C_2 = 24\sqrt{2\pi}\,\beta = 24\sqrt{2\pi}\,b\,\big(\sqrt{\log C} + \sqrt{\log(16e)}\big)$. $\blacksquare$
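The practical content of Theorem 20 is the $O(k(\rho^{-1}m^{-1/2} + \rho^{-1}n^{-1/2})\sqrt{v})$ decay of the complexity term in the number of tasks $n$ and shots $m$. A small evaluator of our own, under the illustrative assumption $C = 1$ for the unspecified universal constant of Theorem 19, makes this scaling visible:

```python
import numpy as np

def vc_transfer_bound(avg_margin_loss, n, m, k, rho, delta, v, b, C=1.0):
    """Evaluate the Theorem 20 bound; C = 1 is an arbitrary illustrative choice."""
    C1 = 24 * np.sqrt(2 * np.pi) * b * (1 + np.sqrt(np.log(16 * np.e)) + 2 * np.sqrt(2))
    C2 = 24 * np.sqrt(2 * np.pi) * b * (np.sqrt(np.log(C)) + np.sqrt(np.log(16 * np.e)))
    complexity = (k / (rho * np.sqrt(m)) + k / (rho * np.sqrt(n))) * (C1 * np.sqrt(v) + C2)
    confidence = np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    return avg_margin_loss + confidence + complexity

# Complexity decays like 1/sqrt(n) in tasks and 1/sqrt(m) in shots:
for n in (10**2, 10**4, 10**6):
    print(n, vc_transfer_bound(0.15, n=n, m=25, k=5, rho=1.0, delta=0.01, v=50, b=1.0))
```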
Cross-Domain Experiment Results

We provide further experimental results on the cross-domain dataset miniImageNet → CUB in Table 4, where different 5-way classification settings are considered. We can still see that the results obtained with the multi-margin loss are comparable to those obtained with the cross-entropy loss in all cases. All of our experiments are based on the code released at https://github.com/wyharveychen/CloserLookFewShot and https://github.com/kjunelee/MetaOptNet; a minimal sketch of the loss substitution involved is given after Table 4.

Table 4: The 5-way s-shot classification results on the miniImageNet → CUB dataset. We report the average accuracy (%, top-1) as well as the 95% confidence interval over all 600 test episodes. We compare the
multi-margin loss with the cross-entropy loss. Rows: Baseline++ (Chen et al. 2019); MAML (Finn, Abbeel, and Levine 2017); MatchingNet (Vinyals et al. 2016); ProtoNet (Snell, Swersky, and Zemel 2017); RelationNet (Sung et al. 2018); MetaOptNet (Lee et al. 2019). [The per-setting accuracy entries of Table 4 are not recoverable from this copy and are omitted.]
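The loss substitution behind these experiments amounts to swapping the criterion applied to the classification logits. A minimal sketch is given below using PyTorch's built-in `torch.nn.MultiMarginLoss` (a hinge-style multiclass margin loss); the feature dimension, batch, and `head` module are illustrative stand-ins of our own, not excerpts from the released code, and the `margin` argument is the hinge margin of that criterion rather than the theoretical parameter $\rho$ itself.

```python
import torch
import torch.nn as nn

# Minimal sketch: replace cross-entropy with the multi-margin loss
# in the k-way classification head of a few-shot model.
k, feat_dim = 5, 512
head = nn.Linear(feat_dim, k)                 # linear head on a frozen embedding
criterion = nn.MultiMarginLoss(margin=1.0)    # multiclass hinge/margin loss
# criterion = nn.CrossEntropyLoss()           # the commonly-used alternative

features = torch.randn(32, feat_dim)          # embeddings of a support batch
labels = torch.randint(0, k, (32,))
loss = criterion(head(features), labels)
loss.backward()
print(loss.item())
```

Because only the criterion changes, any of the meta-learning variants in Table 4 can be trained this way without touching the embedding or episode pipeline.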