Margin-Based Transfer Bounds for Meta Learning with Deep Feature Embedding
Jiechao Guan, Zhiwu Lu*, Tao Xiang, Timothy Hospedales
Beijing Key Laboratory of Big Data Management and Analysis Methods, Gaoling School of Artificial Intelligence, Renmin University of China
Department of Electrical and Electronic Engineering, University of Surrey
School of Informatics, The University of Edinburgh
[email protected], [email protected]
*Corresponding author
Abstract
By transferring knowledge learned from seen/previous tasks, meta learning aims to generalize well to unseen/future tasks. Existing meta-learning approaches have shown promising empirical performance on various multiclass classification problems, but few provide theoretical analysis on the classifiers' generalization ability on future tasks. In this paper, under the assumption that all classification tasks are sampled from the same meta-distribution, we leverage margin theory and statistical learning theory to establish three margin-based transfer bounds for meta-learning based multiclass classification (MLMC). These bounds reveal that the expected error of a given classification algorithm for a future task can be estimated with the average empirical error on a finite number of previous tasks, uniformly over a class of preprocessing feature maps/deep neural networks (i.e., deep feature embeddings). To validate these bounds, instead of the commonly-used cross-entropy loss, a multi-margin loss is employed to train a number of representative MLMC models. Experiments on three benchmarks show that these margin-based models still achieve competitive performance, validating the practical value of our margin-based theoretical analysis.
Introduction
Inspired by humans' ability to recognize an unseen/new object category, meta-learning based multiclass classification (MLMC), one instantiation of which is k-way s-shot classification (e.g., k = 5 and s = 10) (Vinyals et al. 2016), has been studied intensively in the past few years. It is often cast into a meta-learning scenario (Sebastian and Lorien 1998), in which a meta-learner learns prior knowledge from several training tasks and then facilitates a base-learner to generalize well on unseen/future tasks. Recent meta-learning based classification models normally employ a deep convolutional neural network to learn each task. As a typical example, the meta-learner learns a prior which is often in the form of a feature extractor (i.e., the lower layers of the network, often called a deep feature embedding), and the base-learner relies on the prior (with the feature extractor frozen) to update the fully connected layer for classification given a new task. The goal of MLMC is thus to learn an optimal deep feature embedding from observed tasks and train a new classifier which performs well on a future task.
Existing meta-learning based classification approaches (Vinyals et al. 2016; Snell, Swersky, and Zemel 2017; Sung et al. 2018; Finn, Abbeel, and Levine 2017) have shown promising empirical performance on several benchmarks, but few provide theoretical analysis on the expected performance of their learned classifiers on new/unseen tasks, which may have different data distributions from those of previous training tasks. To overcome this limitation, the focus of this paper is thus on providing theoretical guarantees on how well a meta-learning based classification model can generalize to new/unseen tasks.

The central assumption of our theoretical results is that the learner is embedded within a distribution of related learning tasks. Since all learning tasks, either from seen classes or unseen ones, share some similarities (e.g., all tasks on the CUB dataset (Wah et al. 2011) are about recognizing fine-grained bird species), it is reasonable to assume that these tasks are sampled from a common meta-distribution, which is referred to as an environment in (Baxter 2000). Given that each task represents a data distribution, an environment can be considered as a distribution of distributions. Starting with a stream of datasets drawn from different training tasks in this environment, we aim to learn a multiclass classifier which minimizes the transfer risk (Baxter 2000; Maurer 2009; Maurer, Pontil, and Romera-Paredes 2016) for a new task randomly sampled from the same environment.

Under this assumption, we leverage margin theory (Vapnik 1982) and statistical learning theory (Valiant 1984) to derive a high probability bound on the transfer risk. Margin theory has been utilized to study the behaviours of many machine learning models (Schapire et al. 1997; Koltchinskii and Panchenko 2002; Bartlett, Foster, and Telgarsky 2017), and is a standard tool to analyze multiclass classification problems (see Chapter 9 in (Mohri, Rostamizadeh, and Talwalkar 2012)). Under the probably approximately correct (PAC) learning framework, we establish a margin-based transfer bound with Gaussian complexity (Bartlett and Mendelson 2002) for MLMC (see Theorem 2). Bounding the Gaussian complexity with chaining (Dudley 1967) and other statistical learning techniques, we further provide a margin-based transfer bound with covering number (Anthony and Bartlett 2002) (in Theorem 6) and a transfer bound based on the VC-dimension (van der Vaart and Wellner 1996) of the given scoring function class (in Theorem 1), respectively. These results demonstrate that, for any fixed preprocessing deep neural network (meta-learner) and any given classification algorithm (base-learner), the expected error for a future MLMC task can be controlled by the average empirical margin loss on the training tasks. This theoretical analysis is applicable to any MLMC method which does not update the parameters of the meta-learner/feature extractor when performing classification in a new/unseen task. It thus covers a quite wide range of methods (see examples of such MLMC variants in Sect. Experiments).

Our main contributions are three-fold: (1) Our main theorem (i.e., Theorem 1) gives a rigorous theoretical statement that an MLMC model's transfer risk on an unseen task can be bounded by the empirical error on previous tasks plus a complexity part.
This transfer bound guarantees that under certain constraints (e.g., the scoring function class's VC-dimension v is finite and the task number n becomes large), the average empirical margin loss is a proper estimate of the expected loss on a new task. Further, our main theorem also reveals that the obtained transfer bound admits only a linear dependency on the number k of classification categories. To obtain this transfer bound, we first provide the margin-based transfer bounds for MLMC with Gaussian complexity and covering number, in Theorem 2 and Theorem 6, respectively. (2) Our transfer bounds (e.g., Theorem 1) are all dimension free when deep feature embedding is used for MLMC. Importantly, for a meta-learning problem in which a meta-learner learns a neural network from previous tasks and a base-learner learns a new classifier to generalize on a future task, our theoretical results reveal that the significant step is to learn a proper deep feature embedding function (combined with any multiclass classification algorithm) which can induce a large family of classifiers containing a good solution for any task sampled from the same environment. In addition, for meta learning with deep feature embedding, the sample efficiency per task can actually be guaranteed by our theoretical results (see the more detailed discussion at the end of the related work section).
(3) We adopt the multi-margin loss (a surrogate of the margin loss), rather than the cross-entropy loss commonly used in multiclass classification problems, to train existing typical MLMC models. The experimental results (see Tables 1–3 in Sect. Experiments) demonstrate that the models trained with the margin loss still achieve competitive performance on three benchmark datasets (miniImageNet (Vinyals et al. 2016), CUB (Wah et al. 2011) and miniImageNet → CUB). This clearly validates the practical value of our margin-based theoretical analysis for MLMC.

The remainder of this paper is organized as follows. Sect. Preliminary provides the background information and notations. Sect. Theoretical Results presents our main theoretical results on the transfer bound of MLMC, followed by the empirical experiments in Sect. Experiments. The differences between our theoretical results and closely-related works are discussed in Sect. Related Work. Sect. Conclusion and Future Work draws conclusions and points out future research directions.
Preliminary
Learning Setup
Tasks and Samples. In the meta-learning based multiclass classification (MLMC) problem, a task is a probability measure $\mu \in \mathcal{M}(\mathcal{X} \times \mathcal{Y})$, where $\mathcal{X}$ is an input space, $\mathcal{Y}$ is an output space (which is $\{1, \dots, k\}$ in multiclass classification), and $\mathcal{M}(\mathcal{H})$ generally denotes the set of probability measures on a space $\mathcal{H}$. Information about the task $\mu$ is obtained by independently sampling a finite number $m$ of training examples $(x_i, y_i) \sim \mu$. Such an $m$-tuple $((x_1, y_1), \dots, (x_m, y_m)) \sim \mu^m$ is called a sample, which is also denoted as $(\mathbf{x}, \mathbf{y}) = ((x_1, y_1), \dots, (x_m, y_m))$ with $\mathbf{x} = (x_1, \dots, x_m)$ and $\mathbf{y} = (y_1, \dots, y_m)$.

Algorithms. We consider the classification problem with the hypothesis space $\mathcal{F}$ of scoring functions. A meta-learning classification algorithm (e.g., based on SVM) is a function $f: (\mathcal{X} \times \mathcal{Y})^m \to \mathcal{F}$. Let $\mathcal{D}$ be a space of alternative feature maps/neural networks; for deep learning based methods, we need to choose a feature map $\phi (\in \mathcal{D})$ to preprocess the input data. This induces a new MLMC classification algorithm $f_\phi: (\mathcal{X} \times \mathcal{Y})^m \to \mathcal{F}$. We can regard $\phi$ and $f$ as the meta-learner and the base-learner, respectively. From the training sample $(\mathbf{x}, \mathbf{y})$, this algorithm learns a scoring function $f_\phi(\mathbf{x},\mathbf{y}) \in \mathcal{F}$: for each input-output pair $(x_i, y_i)$, the scoring function outputs the prediction score $f_\phi(\mathbf{x},\mathbf{y})(x_i, y)$, i.e., the probability of $x_i$ belonging to the class label $y$ ($y \in \mathcal{Y}$). Without loss of generality, we assume that there exists a positive number $b > 0$ such that $|f_\phi(\mathbf{x},\mathbf{y})(x, y)| \le b$, $\forall f_\phi(\mathbf{x},\mathbf{y}) \in \mathcal{F}$, $(x, y) \sim \mu$.

Environments. The encounter with a task $\mu$ is itself a random event, corresponding to a draw $\mu \sim \varepsilon$, where $\varepsilon$ is a probability measure on the set of tasks, i.e., $\varepsilon \in \mathcal{M}(\mathcal{M}(\mathcal{X} \times \mathcal{Y}))$. In this work, such probability measures are called environments as in (Baxter 2000; Maurer 2009). Information about the environment $\varepsilon$ is obtained by independently drawing a finite number $n$ of tasks $\{\mu^l : \mu^l \sim \varepsilon, l = 1, \dots, n\}$: each task $\mu^l$ is represented by a sample $(\mathbf{x}^l, \mathbf{y}^l) \sim (\mu^l)^m$, $(\mathbf{x}^l, \mathbf{y}^l) = ((x^l_1, y^l_1), \dots, (x^l_m, y^l_m))$, with the understanding that $\mathbf{x}^l = (x^l_1, \dots, x^l_m)$ and $\mathbf{y}^l = (y^l_1, \dots, y^l_m)$. Under the MLMC setting, the size $m$ of each task is kept the same to facilitate the analysis. Let $(\mathbf{X}, \mathbf{Y}) = ((\mathbf{x}^1, \mathbf{y}^1), \dots, (\mathbf{x}^n, \mathbf{y}^n))$ be the training data generated in this manner. Further, we define a probability measure $\hat{\varepsilon}$ on the set of samples $(\mathcal{X} \times \mathcal{Y})^m$ by letting the expectation $\mathbb{E}_{\hat{\varepsilon}}(g) = \mathbb{E}_{\mu \sim \varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y}) \sim \mu^m} g(\mathbf{x}, \mathbf{y})$ for every Borel measurable function $g$ on $(\mathcal{X} \times \mathcal{Y})^m$. The entire training data $(\mathbf{X}, \mathbf{Y})$ can thus be considered to be generated in $n$ independent draws from $\hat{\varepsilon}$, that is, $(\mathbf{X}, \mathbf{Y}) = ((\mathbf{x}^1, \mathbf{y}^1), \dots, (\mathbf{x}^n, \mathbf{y}^n)) \sim \hat{\varepsilon}^n$.
Margin Loss and Transfer Risk

The margin of a scoring function $f_\phi(\mathbf{x},\mathbf{y})$ (trained from the sample $(\mathbf{x}, \mathbf{y})$) at a labeled data point $(x_i, y_i)$ is
$$\rho_{f_\phi(\mathbf{x},\mathbf{y})}(x_i, y_i) \triangleq f_\phi(\mathbf{x},\mathbf{y})(x_i, y_i) - \max_{y \neq y_i} f_\phi(\mathbf{x},\mathbf{y})(x_i, y).$$
A real-valued function associated with any algorithm $f_\phi$ on the training sample $(\mathbf{x}, \mathbf{y})$ is its empirical loss $\hat{\ell}_{f_\phi}: (\mathcal{X} \times \mathcal{Y})^m \to \mathbb{R}_+$, defined by
$$\hat{\ell}_{f_\phi}(\mathbf{x}, \mathbf{y}) = \frac{1}{m}\sum_{i=1}^m \ell_{f_\phi(\mathbf{x},\mathbf{y})}(x_i, y_i),$$
where $\ell_{f_\phi(\mathbf{x},\mathbf{y})}(x_i, y_i) = \Phi_\rho \circ \rho_{f_\phi(\mathbf{x},\mathbf{y})}(x_i, y_i)$, $\circ$ denotes function composition, and for any $\rho > 0$ the margin loss is $\Phi_\rho(x) = \min\big(1, \max(0, 1 - \frac{x}{\rho})\big)$. To evaluate the performance of an MLMC algorithm $f_\phi$ ($\phi \in \mathcal{D}$) in an environment $\varepsilon$, the following steps are taken: (i) make a random choice of a task $\mu \sim \varepsilon$, (ii) draw a training sample $(\mathbf{x}, \mathbf{y}) \sim \mu^m$, (iii) select a test pair $(x, y) \sim \mu$, (iv) run the algorithm $f_\phi$ to obtain the scoring function $f_\phi(\mathbf{x},\mathbf{y})$, (v) return the loss $\ell_{f_\phi(\mathbf{x},\mathbf{y})}(x, y)$. The expected output of this procedure can be used to measure the generalization ability of the MLMC algorithm in the given environment. This motivates the following definition of the expected transfer risk associated with the learning algorithm $f_\phi$:
$$R_\varepsilon(f_\phi) = \mathbb{E}_{\mu \sim \varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y}) \sim \mu^m}\mathbb{E}_{(x,y) \sim \mu}\,\ell_{f_\phi(\mathbf{x},\mathbf{y})}(x, y). \quad (1)$$
Given all training data $(\mathbf{X}, \mathbf{Y}) \sim \hat{\varepsilon}^n$, we utilize $(\mathbf{X}, \mathbf{Y})$ to select a $\phi(\mathbf{X}, \mathbf{Y}) \in \mathcal{D}$ and fix it on the future task, so that the expected transfer risk $R_\varepsilon(f_{\phi(\mathbf{X},\mathbf{Y})})$ of the modified algorithm $f_{\phi(\mathbf{X},\mathbf{Y})}$ is minimal or near minimal. The conceptually simplest way is to select $\phi(\mathbf{X}, \mathbf{Y}) = \arg\min_{\phi \in \mathcal{D}}\frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l, \mathbf{y}^l)$, which minimizes the average empirical risk on the available training data. In this paper, we give a high probability bound on $R_\varepsilon(f_\phi)$ in terms of $\frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l, \mathbf{y}^l)$, and such a bound uniformly holds for all $\phi \in \mathcal{D}$, not just for $\phi(\mathbf{X}, \mathbf{Y})$ (see Theorem 1).
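For concreteness, the margin $\rho_{f}(x_i, y_i)$ and the ramp loss $\Phi_\rho$ defined above can be computed directly from a matrix of class scores. The following is a minimal PyTorch sketch; the function names and the toy scores are our own illustration, not part of the paper's implementation:

    import torch

    def multiclass_margin(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        """rho_f(x_i, y_i) = f(x_i, y_i) - max_{y != y_i} f(x_i, y), per example."""
        true_scores = scores.gather(1, labels.unsqueeze(1)).squeeze(1)
        masked = scores.clone()
        masked.scatter_(1, labels.unsqueeze(1), float("-inf"))  # hide the true class
        return true_scores - masked.max(dim=1).values

    def ramp_loss(margins: torch.Tensor, rho: float) -> torch.Tensor:
        """Phi_rho(x) = min(1, max(0, 1 - x/rho))."""
        return torch.clamp(1.0 - margins / rho, min=0.0, max=1.0)

    # toy usage: m = 4 examples, k = 5 classes
    scores = torch.randn(4, 5)
    labels = torch.randint(0, 5, (4,))
    empirical_margin_loss = ramp_loss(multiclass_margin(scores, labels), rho=1.0).mean()

The empirical loss $\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y})$ is then exactly the mean of the ramp losses over the $m$ training pairs.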
VC-dimension and Gaussian Complexity

Definition 0.1 (VC-dimension, 2.6.1 in (van der Vaart and Wellner 1996)). Let $\mathcal{C}$ be a collection of subsets of a set $\mathcal{X}$. $\mathcal{C}$ is said to shatter $\{x_1, \dots, x_n\}$ if each of its $2^n$ subsets can be expressed in the form $C \cap \{x_1, \dots, x_n\}$ for a $C$ in $\mathcal{C}$. The VC-dimension of the class $\mathcal{C}$ is the largest $n$ for which a set of size $n$ is shattered by $\mathcal{C}$.

Definition 0.2 (VC-dimension of a Real-Valued Function Class). The subgraph of a function $f (\in \mathcal{F}): \mathcal{X} \to \mathbb{R}$ is the subset of $\mathcal{X} \times \mathbb{R}$ given by $\{(x, t) : t < f(x)\}$. Then the VC-dimension of the function class $\mathcal{F}$ is defined as the VC-dimension of the set of subgraphs of functions in $\mathcal{F}$.

Definition 0.3 (Gaussian Complexity (Bartlett and Mendelson 2002)). For a subset $A \subseteq \mathbb{R}^m$, the Gaussian complexity of $A$ is defined as $\Gamma(A) = \mathbb{E}_\gamma \sup_{\mathbf{x} \in A}\frac{2}{m}\sum_{i=1}^m \gamma_i x_i$, where $\{\gamma_i\}_{i \ge 1}$ is a sequence of independent standard Gaussian variables (i.e., $\gamma_i \sim \mathcal{N}(0, 1)$). If $\mathcal{F}$ is a class of real-valued functions on the space $\mathcal{X}$ and $\mathbf{x} = (x_1, \dots, x_m) \in \mathcal{X}^m$, we define $\mathcal{F}(\mathbf{x}) = \mathcal{F}(x_1, \dots, x_m) = \{(f(x_1), \dots, f(x_m)) : f \in \mathcal{F}\} \subseteq \mathbb{R}^m$. The empirical Gaussian complexity of $\mathcal{F}$ on $\mathbf{x}$ is $\Gamma(\mathcal{F}(\mathbf{x}))$. Let $\mu \in \mathcal{M}(\mathcal{X})$ be a probability measure on $\mathcal{X}$; the corresponding expected complexity is $\mathbb{E}_{(\mathbf{x},\mathbf{y}) \sim \mu^m}\Gamma(\mathcal{F}(\mathbf{x}))$. $\mathcal{F}(\mathbf{X})$ and its expected Gaussian complexity $\mathbb{E}_{(\mathbf{X},\mathbf{Y}) \sim \hat{\varepsilon}^n}\Gamma(\mathcal{F}(\mathbf{X}))$ can be defined in a similar way.
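The empirical Gaussian complexity in Definition 0.3 admits a direct Monte Carlo estimate: draw standard Gaussian vectors $\gamma$, take the supremum of $\frac{2}{m}\sum_i \gamma_i f(x_i)$, and average over draws. A minimal NumPy sketch, in which a finite pool of evaluated functions stands in for the (generally infinite) class $\mathcal{F}$:

    import numpy as np

    def empirical_gaussian_complexity(fx: np.ndarray, n_draws: int = 2000,
                                      seed: int = 0) -> float:
        """Estimate Gamma(A) = E_gamma sup_{a in A} (2/m) sum_i gamma_i a_i.

        fx has shape (|A|, m): each row is (f(x_1), ..., f(x_m)) for one f in a
        finite function pool (an illustrative stand-in for the class F).
        """
        rng = np.random.default_rng(seed)
        num_f, m = fx.shape
        gammas = rng.standard_normal((n_draws, m))   # gamma_i ~ N(0, 1)
        sups = (gammas @ fx.T).max(axis=1)           # sup over the pool, per draw
        return float(2.0 / m * sups.mean())

    # toy usage: 50 candidate score vectors evaluated on m = 100 points
    rng = np.random.default_rng(1)
    fx = rng.uniform(-1.0, 1.0, size=(50, 100))
    print(empirical_gaussian_complexity(fx))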
Theoretical Results

In this section, we present our theoretical results. The most important one is Theorem 1, which reveals that the learning bound of an algorithm on new MLMC tasks (which may have different data distributions from those of previous tasks) can be controlled by the empirical loss on previous tasks plus a complexity term.
All detailed proofs for our theoretical results can be found in the supplementary material.

Theorem 1 (Margin-Based Transfer Bound for MLMC with VC-dimension). Assume that the VC-dimension of the real-valued function class $\Pi_\mathcal{F} = \{x \mapsto g(x, y) \mid g \in \mathcal{F}, y \in \mathcal{Y}, |\mathcal{Y}| = k\}$ is $v$, and $\Pi_\mathcal{F}$ is uniformly bounded by $b > 0$. Given a classification algorithm $f$ and a margin parameter $\rho > 0$, for any environment $\varepsilon \in \mathcal{M}(\mathcal{M}(\mathcal{X} \times \mathcal{Y}))$ and for any $\delta > 0$, with probability at least $1 - \delta$ on the data $(\mathbf{X}, \mathbf{Y}) \sim \hat{\varepsilon}^n$, we have for all feature maps $\phi \in \mathcal{D}$ that
$$R_\varepsilon(f_\phi) \le \frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l, \mathbf{y}^l) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \Big(\frac{k}{\rho\sqrt{m}} + \frac{k}{\rho\sqrt{n}}\Big)(C_1\sqrt{v} + C_2),$$
where the constants $C_1 = 24\sqrt{2\pi}\,b\,(1 + \sqrt{\log(16e)} + 2\sqrt{2})$ and $C_2 = 24\sqrt{2\pi}\,b\,(\sqrt{\log C} + \sqrt{\log(16e)})$, and $C$ is the uniform constant defined in Theorem 7.

The main proof is based on Theorem 6. To obtain Theorem 6, we first give the Gaussian complexity transfer bound in Theorem 2, which is accomplished by using Slepian's Lemma (Ledoux and Talagrand 1991) to bound the function class $G_\phi = \{(\mathbf{x}, \mathbf{y}) \mapsto \hat{\ell}_{f_\phi}(\mathbf{x}, \mathbf{y})\}$. This is not straightforward and thus constitutes our major technical novelty.
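To get a feel for how Theorem 1 behaves, its complexity term $(\frac{k}{\rho\sqrt{m}} + \frac{k}{\rho\sqrt{n}})(C_1\sqrt{v} + C_2)$ can be evaluated for a grid of $(m, n)$. A small sketch; since the uniform constant $C$ (and hence $C_1$, $C_2$) is unspecified, we set $C_1 = C_2 = 1$ purely to visualize the $O(k(\sqrt{v/m} + \sqrt{v/n}))$ decay — all numeric values here are illustrative placeholders:

    import math

    def complexity_term(k: int, rho: float, m: int, n: int, v: float,
                        c1: float = 1.0, c2: float = 1.0) -> float:
        """(k/(rho*sqrt(m)) + k/(rho*sqrt(n))) * (C1*sqrt(v) + C2), as in Theorem 1.
        C1, C2 depend on an unspecified uniform constant; set to 1 for illustration."""
        return (k / (rho * math.sqrt(m)) + k / (rho * math.sqrt(n))) \
            * (c1 * math.sqrt(v) + c2)

    # the term shrinks as either the per-task sample size m or the task number n grows
    for m, n in [(25, 100), (100, 100), (100, 10000)]:
        print(m, n, round(complexity_term(k=5, rho=1.0, m=m, n=n, v=50.0), 3))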
Gaussian Complexity Transfer Bound for MLMC

Theorem 2 (Margin-Based Transfer Bound for MLMC with Gaussian Complexity). Let $\mathcal{F}$ be a hypothesis space of scoring functions. Given a classification algorithm $f$ and a margin parameter $\rho > 0$, for any environment $\varepsilon \in \mathcal{M}(\mathcal{M}(\mathcal{X} \times \mathcal{Y}))$ and for any $\delta > 0$, with probability at least $1 - \delta$ on the data $(\mathbf{X}, \mathbf{Y}) \sim \hat{\varepsilon}^n$, we have for all feature maps $\phi \in \mathcal{D}$ that
$$R_\varepsilon(f_\phi) \le \frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l, \mathbf{y}^l) + \frac{k\sqrt{2m\pi}}{\rho}\,\mathbb{E}_{(\mathbf{X},\mathbf{Y})\sim\hat{\varepsilon}^n}\Gamma(\Pi_\mathcal{F}(\mathbf{X})) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \frac{k\sqrt{2\pi}}{\rho}\,\mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m}\Gamma(\Pi_\mathcal{F}(\mathbf{x})),$$
where $\Pi_\mathcal{F}(\mathbf{X}) = \{(f_\phi(\mathbf{X},\mathbf{Y})(x^1_1, y), \dots, f_\phi(\mathbf{X},\mathbf{Y})(x^1_m, y), \dots, f_\phi(\mathbf{X},\mathbf{Y})(x^n_1, y), \dots, f_\phi(\mathbf{X},\mathbf{Y})(x^n_m, y)) : y \in \mathcal{Y}, \phi \in \mathcal{D}\}$, $\Pi_\mathcal{F}(\mathbf{x}) = \{(f_\phi(\mathbf{x},\mathbf{y})(x_1, y), \dots, f_\phi(\mathbf{x},\mathbf{y})(x_m, y)) : y \in \mathcal{Y}, \phi \in \mathcal{D}\}$, and the scoring function $f_\phi(\mathbf{X},\mathbf{Y})$ is defined by $f_\phi(\mathbf{X},\mathbf{Y})(x^l_i, y) = f_\phi(\mathbf{x}^l,\mathbf{y}^l)(x^l_i, y)$, $\forall i \in [m], l \in [n]$.

The main proof strategy is to rewrite $R_\varepsilon(f_\phi) - \frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l, \mathbf{y}^l)$ in the following form:
$$\Big(R_\varepsilon(f_\phi) - \mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\hat{\varepsilon}}\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y})\Big) + \Big(\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\hat{\varepsilon}}\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y}) - \frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l,\mathbf{y}^l)\Big), \quad (2)$$
and then bound the two terms separately. The first term is the estimation difference expected for the future task (Maurer 2009). To bound it, we need to utilize the Lipschitz property of the margin loss and the properties of Gaussian complexity. The second part of Eq. (2) is the estimation difference between the expected empirical error of the MLMC classifier's output on a new task and the average empirical errors on the data of the past tasks. We choose to use PAC learning techniques and the Gaussian contraction inequality (Wainwright 2019) to obtain this term's upper bound. The obtained results are shown in Theorem 3 and Theorem 4, respectively.

Theorem 3. Let $\mathcal{F}$ and $\Pi_\mathcal{F}(\mathbf{x})$ be the same as in previous theorems. For $\rho > 0$, we have
$$R_\varepsilon(f_\phi) \le \mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\hat{\varepsilon}}\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y}) + \frac{k\sqrt{2\pi}}{\rho}\,\mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m}\Gamma(\Pi_\mathcal{F}(\mathbf{x})).$$

Theorem 4. Let $\mathcal{F}$ and $\Pi_\mathcal{F}(\mathbf{X})$ be the same as in previous theorems. For any $\delta > 0$, we have, with probability at least $1-\delta$ on the draw of the sample $((\mathbf{x}^1,\mathbf{y}^1), \dots, (\mathbf{x}^n,\mathbf{y}^n))$,
$$\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\hat{\varepsilon}}\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y}) \le \frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l,\mathbf{y}^l) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \frac{k\sqrt{2m\pi}}{\rho}\,\mathbb{E}_{(\mathbf{X},\mathbf{Y})\sim\hat{\varepsilon}^n}\Gamma(\Pi_\mathcal{F}(\mathbf{X})).$$

With these two theorems, we obtain the Gaussian complexity transfer bound for MLMC (Theorem 2).
Covering Number Transfer Bound for MLMC
Although we have shown a margin-based transfer bound in Theorem 2, the value of the Gaussian complexity is still implicit. Moreover, we are more interested in the variation of $\Gamma(\Pi_\mathcal{F}(\mathbf{X}))$ with the growth of $m$ and $n$. To this end, we need another more intuitive indicator to measure the complexity of the hypothesis space $\Pi_\mathcal{F}(\mathbf{X})$. In this work, we use the following covering number (Zhou 2002).

Definition 0.4 (Covering Number). Let $(M, d)$ be a metric space. A subset $\hat{T}$ is called an $\epsilon$-cover of $T \subseteq M$ if $\forall t \in T, \exists t' \in \hat{T}$ such that $d(t, t') \le \epsilon$. The covering number of $T$ is the cardinality of the smallest $\epsilon$-cover of $T$, that is, $\mathcal{N}(\epsilon, T, d) \triangleq \min\{|\hat{T}| \,:\, \hat{T} \text{ is an } \epsilon\text{-cover of } T\}$.

Let $(\mathcal{F}_{x_1,\dots,x_m}, L_2(\hat{D}))$ be the data-dependent $L_2$ metric space given by the metric $d(f, \hat{f}) \triangleq \|f - \hat{f}\|_2 = \sqrt{\frac{1}{m}\sum_{i=1}^m (f(x_i) - \hat{f}(x_i))^2}$, where $\mathbf{x} = (x_1, \dots, x_m)$ is a sample from the space $\mathcal{X}$ and $\mathcal{F}_{x_1,\dots,x_m}$ represents the restriction of the real-valued function class $\mathcal{F}$ to that sample. $\mathcal{N}(\epsilon, \mathcal{F}, L_2(\mathbf{X}))$ can be defined in a similar way with the data-dependent metric $d(f, \hat{f}) = \sqrt{\frac{1}{mn}\sum_{i=1}^n\sum_{j=1}^m (f(x_{ij}) - \hat{f}(x_{ij}))^2}$ ($f, \hat{f} \in \mathcal{F}$). Using the chaining technique (Talagrand 2014), the following refined theorem reveals the relationship between the Gaussian complexity and the covering number.

Theorem 5 (Refined Dudley Entropy Bound). For any real-valued function class $\mathcal{F}$ containing functions $f: \mathcal{X} \to \mathbb{R}$, assume that $\sup_{f\in\mathcal{F}}\|f\|_2$ is bounded under the $L_2(\mathbf{x})$ and $L_2(\mathbf{X})$ metrics respectively. Then
$$\Gamma(\mathcal{F}(\mathbf{x})) \le \frac{24}{\sqrt{m}}\int_0^{\sup_{f\in\mathcal{F}}\|f\|_2}\sqrt{\log\mathcal{N}(\tau, \mathcal{F}, L_2(\mathbf{x}))}\,\mathrm{d}\tau, \qquad \Gamma(\mathcal{F}(\mathbf{X})) \le \frac{24}{\sqrt{nm}}\int_0^{\sup_{f\in\mathcal{F}}\|f\|_2}\sqrt{\log\mathcal{N}(\tau, \mathcal{F}, L_2(\mathbf{X}))}\,\mathrm{d}\tau.$$

Bounding the Gaussian complexity in Theorem 2 with the Dudley integral in Theorem 5, we obtain the following covering number transfer bound for MLMC.

Theorem 6 (Margin-Based Transfer Bound for MLMC with Covering Number). Let $\mathcal{F}$ and $\Pi_\mathcal{F}$ be the same as in previous theorems. Given a classification algorithm $f$ and a margin parameter $\rho > 0$, let $L = \sup_{f\in\Pi_\mathcal{F}}\|f\|_2$; then for any environment $\varepsilon \in \mathcal{M}(\mathcal{M}(\mathcal{X}\times\mathcal{Y}))$ and for any $\delta > 0$, with probability at least $1-\delta$ on the data $(\mathbf{X},\mathbf{Y})\sim\hat{\varepsilon}^n$, we have for all feature maps $\phi\in\mathcal{D}$ that
$$R_\varepsilon(f_\phi) \le \frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l,\mathbf{y}^l) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \frac{24\sqrt{2\pi}\,k}{\rho\sqrt{n}}\,\mathbb{E}_{(\mathbf{X},\mathbf{Y})\sim\hat{\varepsilon}^n}\int_0^L\sqrt{\log\mathcal{N}(\tau,\Pi_\mathcal{F},L_2(\mathbf{X}))}\,\mathrm{d}\tau + \frac{24\sqrt{2\pi}\,k}{\rho\sqrt{m}}\,\mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m}\int_0^L\sqrt{\log\mathcal{N}(\tau,\Pi_\mathcal{F},L_2(\mathbf{x}))}\,\mathrm{d}\tau.$$

Theorem 7 (Theorem 2.6.7 in (van der Vaart and Wellner 1996)). Let $\mathcal{F}$ be a real-valued function class on $\mathcal{X}$ with VC-dimension $v$. Assume that $\mathcal{F}$ is uniformly bounded by $b > 0$. Then, for any probability distribution $Q$ on $\mathcal{X}$,
$$\mathcal{N}(\tau, \mathcal{F}, \|\cdot\|_{L_p(Q)}) \le C(v+1)(16e)^{v+1}\Big(\frac{b}{\tau}\Big)^{pv},$$
where $C > 0$ is a uniform constant, and for any $f, g\in\mathcal{F}$, $\|f-g\|_{L_p(Q)} = (\int|f-g|^p\,\mathrm{d}Q)^{1/p}$, $p \ge 1$. Further, since $|f(x_i, y)| \le b$, we have $\sup_{f\in\Pi_\mathcal{F}}\|f\|_2 = \sup_{f\in\Pi_\mathcal{F}}\sqrt{\frac{1}{m}\sum_{i=1}^m f(x_i,y)^2} \le \sqrt{\frac{1}{m}\sum_{i=1}^m b^2} = b$. Combining Theorem 6 and Theorem 7, we then obtain our most important theoretical result, Theorem 1.

From Theory to Implementation
In practical implementation, a multi-margin loss (Paszke et al. 2017) is preferable for the multiclass classification problem, because of the convexity of the loss function (Mohri, Rostamizadeh, and Talwalkar 2012). The multi-margin loss on an input-output pair $(x_i, y_i)$ is
$$\Psi(x_i, y_i) = \frac{1}{k-1}\sum_{y\neq y_i}^k \max\big(0,\, 1 - (f_\phi(x_i, y_i) - f_\phi(x_i, y))/\rho\big).$$
Define the empirical multi-margin loss on one training task as $\tilde{\ell}_{f_\phi}(\mathbf{x}^l, \mathbf{y}^l) = \frac{1}{m}\sum_{i=1}^m \Psi(x^l_i, y^l_i)$. Due to the relationship between the margin loss and the multi-margin loss, $\Phi_\rho(\rho_{f_\phi}(x_i, y_i)) \le (k-1)\Psi(x_i, y_i)$, we can replace the empirical margin loss $\hat{\ell}_{f_\phi}(\mathbf{x}^l, \mathbf{y}^l)$ in Theorem 1 with $(k-1)\tilde{\ell}_{f_\phi}(\mathbf{x}^l, \mathbf{y}^l)$:
$$R_\varepsilon(f_\phi) \le \frac{k-1}{n}\sum_{l=1}^n \tilde{\ell}_{f_\phi}(\mathbf{x}^l, \mathbf{y}^l) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \Big(\frac{k}{\rho\sqrt{m}} + \frac{k}{\rho\sqrt{n}}\Big)(C_1\sqrt{v} + C_2).$$
The multi-margin loss can be considered as a surrogate of the margin loss for easier model optimization. Next, we will use the multi-margin loss to train MLMC models.

[Table 1: The 5-way s-shot classification results on the miniImageNet dataset. We report the average accuracy (%, top-1) as well as the 95% confidence interval over all 600 test episodes, comparing the Multi-Margin loss with the Cross-Entropy loss under the 5-shot, 10-shot and 20-shot settings. We also give the results of Baseline++ (Chen et al. 2019) and MAML (Finn, Abbeel, and Levine 2017) (marked with ‡), to which our current theoretical analysis is nevertheless not applicable, for comprehensive performance comparison. Rows: Baseline++‡, MAML‡, MatchingNet (Vinyals et al. 2016), ProtoNet (Snell, Swersky, and Zemel 2017), RelationNet (Sung et al. 2018), MetaOptNet (Lee et al. 2019). The numeric entries were lost in extraction.]
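In PyTorch (Paszke et al. 2017), this swap amounts to replacing nn.CrossEntropyLoss with the built-in nn.MultiMarginLoss, whose margin argument plays the role of $\rho$. Note that PyTorch divides the sum over the $k-1$ wrong classes by $k$ rather than $k-1$ and does not rescale by $1/\rho$, which only changes the objective by a constant factor. A minimal sketch, with placeholder logits standing in for the output of a frozen feature extractor plus a linear head:

    import torch
    import torch.nn as nn

    # placeholder episode: k = 5 classes, a batch of 75 encoded query images
    logits = torch.randn(75, 5, requires_grad=True)
    targets = torch.randint(0, 5, (75,))

    rho = 1.0
    criterion = nn.MultiMarginLoss(margin=rho)  # replaces nn.CrossEntropyLoss()
    loss = criterion(logits, targets)
    loss.backward()  # gradients flow back to the scoring model as usual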
Experiments
In this section, we conduct experiments on three benchmarks to evaluate the performance of existing MLMC methods when the commonly-used cross-entropy loss is replaced by the multi-margin loss. Our main goal is to validate the practical value of our margin-based theoretical analysis for meta-learning based multiclass classification, since in some cases the margin loss or hinge loss may cause the problem of vanishing gradients in stochastic gradient descent.
Experiment Setup
Datasets and Backbone. (1) miniImageNet. This dataset is widely used for the conventional MLMC setting, and consists of 100 classes selected from ILSVRC-2012 (Russakovsky et al. 2015). Each class has 600 images. We use 64/16/20 classes for training/validation/test. (2) CUB. Under the fine-grained MLMC setting, we choose the CUB-200-2011 (CUB) dataset (Wah et al. 2011), which has 200 classes and 11,788 images of birds. As in (Chen et al. 2019), we use 100/50/50 classes for training/validation/test. (3) miniImageNet → CUB. Under the cross-domain MLMC setting, we use 100 classes from miniImageNet for training, and 50/50 classes from CUB for validation/test, as in (Chen et al. 2019). For all experiments, we use a four-layer convolutional neural network (Conv-4) (Vinyals et al. 2016) as the backbone with an input size of 84 × 84.

Episode Sampling. Under the standard MLMC setting (Vinyals et al. 2016; Snell, Swersky, and Zemel 2017), an episode is actually a training sample drawn from one task in an environment. A k-way s-shot q-query episode contains k(s + q) images; MLMC is thus instantiated as k-way s-shot classification (see the code sketch below). We set k = 5, q = 15 and s = 5/10/20 during both training and test stages. In particular, we train Baseline++ (Chen et al. 2019) (mini-batch-training based) with 400 epochs (batch size = 256), and train meta-learning based methods with 40,000 episodes. When applied to analyze the k-way s-shot setting, our transfer bound becomes $(\sqrt{k}/(\rho\sqrt{s+q}) + k/(\rho\sqrt{n}))(C_1\sqrt{v} + C_2)$ due to $m = k(s+q)$.

Evaluation Protocols. We evaluate performance on the test set under the 5-way 5-shot, 5-way 10-shot and 5-way 20-shot settings (15 queries for each test episode). Concretely, we randomly sample 600 episodes from the test set, and then report the average accuracy (%, top-1) as well as the 95% confidence interval over all the test episodes.
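A k-way s-shot q-query episode can be sampled as follows. This is a minimal sketch assuming the dataset is given as a mapping from class labels to lists of image indices; the function and variable names are our own, not from the paper's code:

    import random

    def sample_episode(class_to_images, k=5, s=5, q=15, seed=None):
        """Sample a k-way s-shot q-query episode: k classes, s support and q
        query images per class, i.e. k*(s+q) images in total."""
        rng = random.Random(seed)
        classes = rng.sample(sorted(class_to_images), k)
        support, query = [], []
        for new_label, cls in enumerate(classes):
            images = rng.sample(class_to_images[cls], s + q)
            support += [(img, new_label) for img in images[:s]]
            query += [(img, new_label) for img in images[s:]]
        return support, query

    # toy usage: 64 training classes with 600 images each (as in miniImageNet)
    class_to_images = {c: list(range(c * 600, (c + 1) * 600)) for c in range(64)}
    support, query = sample_episode(class_to_images, k=5, s=5, q=15, seed=0)
    assert len(support) == 25 and len(query) == 75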
Baselines for Comparison. We select six representative baselines for k-way s-shot classification: (i) a mini-batch-training method: Baseline++ (Chen et al. 2019), which first learns the classifier with the standard supervised training strategy, and then finetunes it on each task in the test stage; (ii) meta-learning based methods: three metric-learning based methods (MatchingNet (Vinyals et al. 2016), ProtoNet (Snell, Swersky, and Zemel 2017), RelationNet (Sung et al. 2018)), which just run a forward pass to obtain the test accuracies during the test stage, without updating the parameters of the feature extractor; one classifier-learning based method (MetaOptNet (Lee et al. 2019)), which fixes the parameters of the feature extractor to extract image features and trains a new SVM classifier in each new task; and one gradient based method (MAML (Finn, Abbeel, and Levine 2017)), which updates the parameters of both the feature extractor and the fully connected layer in each new task. Note that our theoretical analysis only holds for metric-learning and classifier-learning based models. For performance comparison, both the cross-entropy loss and the multi-margin loss are used for the above six representative baselines.

Implementation Details. Our implementation is based on PyTorch (Paszke et al. 2017). We train all models from scratch and use the Adam optimizer with an initial learning rate of $10^{-3}$. We select other hyperparameters (including ρ) by performing validation on the validation set.

Main Results
Performance of Multi-Margin Loss. We focus on comparing the multi-margin loss with the cross-entropy loss when both are used for k-way s-shot classification. From Tables 1–3, we have the following observations: (i) In most cases, the classification performance obtained with the multi-margin loss is comparable to that obtained with the cross-entropy loss. (ii) On the CUB dataset, the inferior performance of Baseline++ with the multi-margin loss suggests that the margin loss may be unsuitable for mini-batch-training model optimization in fine-grained classification. (iii) For meta-learning based classification methods, the multi-margin loss generally leads to competitive results (w.r.t. the cross-entropy loss), which is consistent with our margin-based theoretical analysis.

Influence of Hyperparameters. We further conduct experiments to study the influence of three hyperparameters – the number of training tasks n, the sample size per episode m (= ks) and the margin parameter ρ – on the performance of meta-learning based classification. We select ProtoNet as the baseline and show the 5-way test accuracies (i.e., k = 5) on miniImageNet in Figure 1. We can find that: (i) Meta-learning based classification with the multi-margin loss can achieve higher test accuracies with the growth of n and m, but the improvements are incremental when n or m is large. (ii) The performance of meta-learning based classification with the multi-margin loss is not so sensitive to the variations of ρ.

[Table 2: The 5-way s-shot classification results on the CUB dataset. We report the average accuracy (%, top-1) as well as the 95% confidence interval over all 600 test episodes, for the same six baselines and loss functions as Table 1. The numeric entries were lost in extraction.]

[Figure 1: (a) The 5-way test accuracies of ProtoNet with different numbers of training tasks on miniImageNet (ρ = 1). (b) The 5-way test accuracies of ProtoNet with different choices of ρ on miniImageNet. Plot data lost in extraction.]

[Table 3: Comparative results on the cross-domain miniImageNet → CUB dataset. Average 5-way 5-shot classification accuracies (%) with 95% confidence intervals are computed here; more s-shot classification results can be found in the supplementary material. Rows: Baseline++, MAML, MatchingNet, ProtoNet, RelationNet, MetaOptNet; columns: Cross-Entropy vs. Multi-Margin. The numeric entries were lost in extraction.]

Related Work
The pioneering work (Baxter 2000) on meta-learning theory differs from our results in several aspects: (1) (Baxter 2000) is not focused on the multiclass classification problem and does not explicitly reveal the relationship between the transfer bound and the number k of classification categories, whereas we show that our transfer bounds admit only a linear dependency on k. (2) Its transfer bound with a neural network feature map (Theorem 8 in (Baxter 2000)) depends on the feature dimension, while our transfer bounds (e.g., Theorem 1) are dimension free. (3) The theoretical results about feature maps in (Baxter 2000) pay more attention to simple two-layer neural networks and mainly use the neural network for feature dimension reduction. In contrast, this paper focuses on the role of deep feature embedding in image feature extraction, which is the modern development of feature engineering in the machine learning community.

Following (Baxter 2000), recent theoretical works can be divided into two main groups: one group explores PAC-Bayes theory (Pentina and Lampert 2014; Dziugaite and Roy 2017; Amit and Meir 2018) for meta learning, and the other utilizes conventional PAC learning techniques (i.e., without prior assumptions) to give transfer bounds for different models (e.g., regression learners (Maurer 2009; Maurer, Pontil, and Romera-Paredes 2016; Denevi et al. 2018)) or via algorithmic stability (Maurer 2005). Below, we detail our differences from the two most related works (Amit and Meir 2018; Maurer 2009).

PAC-Bayes Meta Learning Theory. It assumes a prior distribution over priors, a 'hyper-prior' $\mathcal{P}(P)$, and after training outputs a distribution over priors, a 'hyper-posterior' $\mathcal{Q}(P)$. Let $er(\mathcal{Q}, \varepsilon) = \mathbb{E}_{P\sim\mathcal{Q}}\,er(P, \varepsilon)$ be the expected loss measuring the quality of the hyper-posterior $\mathcal{Q}$, where $er(P, \varepsilon) \triangleq \mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{S\sim\mu^m}\mathbb{E}_{f\sim Q(S,P)}\mathbb{E}_{(x,y)\sim\mu}\,l(f, x, y)$ ($l$ is the loss function). $S_i = (\mathbf{x}^i, \mathbf{y}^i)$ is a sample and $\hat{er}(Q, S_i) = \mathbb{E}_{f\sim Q}\frac{1}{m}\sum_{j=1}^m l(f, x^i_j, y^i_j)$ is the empirical loss. $D(Q\|P) = \mathbb{E}_{f\sim Q}\ln\frac{Q(f)}{P(f)}$ denotes the Kullback-Leibler divergence between two distributions $Q$ and $P$.

Theorem 8 (Theorem 2 in (Amit and Meir 2018)). Let $Q: S \times \mathcal{F} \to \mathcal{F}$ be a base-learner and $Q_i \triangleq Q(S_i, P)$. For any hyper-posterior $\mathcal{Q}$ and for any $\delta > 0$, with probability at least $1-\delta$, the following inequality holds:
$$er(\mathcal{Q}, \varepsilon) \le \frac{1}{n}\sum_{i=1}^n \mathbb{E}_{P\sim\mathcal{Q}}\,\hat{er}_i(Q_i, S_i) + \sqrt{\frac{D(\mathcal{Q}\|\mathcal{P}) + \log\frac{2n}{\delta}}{2(n-1)}} + \frac{1}{n}\sum_{i=1}^n\sqrt{\frac{D(\mathcal{Q}\|\mathcal{P}) + \mathbb{E}_{P\sim\mathcal{Q}}D(Q_i\|P) + \log\frac{2nm}{\delta}}{2(m-1)}}.$$

The transfer bound in Theorem 8 differs from ours in Theorem 1 in two aspects: (i) the PAC-Bayes bound is expressed for an average over multiple hypotheses (weighted by a posterior distribution), whereas our bound is expressed for any single hypothesis. (ii) Even ignoring the KL-divergence term between two distinct distributions, the PAC-Bayes bound still has a complexity part $O\big(\frac{\sqrt{\log(nm)}}{\sqrt{m}} + \frac{\sqrt{\log n}}{\sqrt{n}}\big)$, while we derive a bound $O\big(k(\sqrt{\frac{v}{m}} + \sqrt{\frac{v}{n}})\big)$, which is generally lower.

Transfer Bounds for Linear Regression. Another closely related work is (Maurer 2009), which focuses on linear regression problems. Below, we show the transfer bound for the regression learners in (Maurer 2009), and compare it with our Theorem 1. Considering regularized least squares regression, we have the weight vectors $\omega(\mathbf{x},\mathbf{y}) = \arg\min_{\omega\in H}\big(\frac{1}{m}\sum_{i=1}^m(\langle\omega, x_i\rangle - y_i)^2 + \lambda\|\omega\|^2\big)$. The empirical error is $\hat{\ell}_\omega(\mathbf{x},\mathbf{y}) = \frac{1}{m}\sum_{i=1}^m(\langle\omega(\mathbf{x},\mathbf{y}), x_i\rangle - y_i)^2$. $\mathcal{P}_d$ is the set of orthogonal projections $P$ with $d$-dimensional range, and $\hat{\ell}_{\omega_{\lambda^{-1}}P}(\mathbf{x}^l,\mathbf{y}^l) = \hat{\ell}_{\omega(\lambda^{-1/2}P^{1/2}\mathbf{x}^l,\mathbf{y}^l)}$ ($\lambda > 0$ is the regularization parameter). Let $\|C\|_\infty$ be the largest eigenvalue of the covariance operator $C$ for the total input distribution.

Theorem 9 (Theorem 1 in (Maurer 2009)). For any $\delta > 0$, we have with probability at least $1-\delta$ on the data $(\mathbf{X},\mathbf{Y}) = ((\mathbf{x}^1,\mathbf{y}^1), \dots, (\mathbf{x}^n,\mathbf{y}^n))$ that for all feature maps $P\in\mathcal{P}_d$,
$$R_\varepsilon(\omega_{\lambda^{-1}}P) \le \frac{1}{n}\sum_{l=1}^n \hat{\ell}_{\omega_{\lambda^{-1}}P}(\mathbf{x}^l,\mathbf{y}^l) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \frac{\sqrt{2\pi}\,d}{\lambda}\Big(\sqrt{\frac{\|C\|_\infty}{m}} + \sqrt{\frac{1}{n}}\Big).$$

Note that the transfer bounds for the regression and classification problems have a similar form $O(\frac{1}{\sqrt{m}} + \frac{1}{\sqrt{n}})$, if we ignore the dependence on the confidence parameter $\delta$ in Theorem 9 & Theorem 1 and assume that the VC-dimension $v$ in Theorem 1 is finite. That is, although our work is quite different from (Maurer 2009) (i.e., nonlinear classification vs. linear regression), our transfer bounds are somewhat similar to the transfer bound of (Maurer 2009), indirectly supporting the correctness of our derivation.

Finally, we stress that the sample efficiency per task can also be guaranteed by our transfer bounds for meta learning with deep feature embedding. Specifically, given the accuracy $\epsilon$, to make the inequality $R_\varepsilon(f_\phi) \le \frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l,\mathbf{y}^l) + \epsilon$ hold, which means our transfer bound $O\big(k(\sqrt{\frac{v}{m}} + \sqrt{\frac{v}{n}})\big) = ak(\sqrt{\frac{v}{m}} + \sqrt{\frac{v}{n}}) \le \epsilon$ ($a$ is a constant), we must let the number of examples per task required for good generalization obey $m \ge \frac{a^2k^2v}{(\epsilon - ak\sqrt{v/n})^2}\ (\triangleq \varphi(n))$. Since $\varphi(n)$ is a monotonically decreasing function (w.r.t. n), the number m of examples per task required for good generalization decreases as the number n of tasks increases. Therefore, our theoretical results guarantee the sample efficiency per task in meta learning. Similarly, such sample efficiency per task can also be guaranteed by (Maurer 2009), though it focuses on the regression problem instead.

Conclusion and Future Work
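The monotonicity of $\varphi(n)$ is easy to check numerically. A small sketch, where the constant a, the VC-dimension v and the accuracy $\epsilon$ take illustrative placeholder values:

    import math

    def phi(n: int, a: float = 0.1, k: int = 5, v: float = 50.0,
            eps: float = 1.0) -> float:
        """phi(n) = a^2 k^2 v / (eps - a k sqrt(v/n))^2, the per-task sample size
        needed for transfer risk within eps; defined once eps > a*k*sqrt(v/n)."""
        slack = eps - a * k * math.sqrt(v / n)
        if slack <= 0:
            raise ValueError("need more tasks: eps <= a*k*sqrt(v/n)")
        return a**2 * k**2 * v / slack**2

    # phi(n) decreases monotonically as the number of tasks n grows
    for n in [20, 100, 1000, 100000]:
        print(n, round(phi(n), 1))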
We have derived margin-based transfer bounds for meta-learning based multiclass classification, showing that its expected error on a future task can be properly estimated by its empirical error on previous tasks. We show that our transfer bounds only admit a linear dependency on the number of classification categories, and point out the importance of the choice of deep feature embedding in meta-learning. The experimental results demonstrate the practical significance of our margin-based theoretical analysis. Our ongoing research includes: (i) The cross-entropy loss is the most common choice for multiclass classification, and performs better than the multi-margin loss in some cases. One research direction is to explore whether we can use the cross-entropy loss to obtain similar theoretical results. (ii) Our most important theoretical result is obtained by using Gaussian complexity and Slepian's Lemma for Gaussian processes. Is it possible to obtain a similar or tighter transfer bound via more concise theoretical analysis?

References
Amit, R.; and Meir, R. 2018. Meta-Learning by Adjusting Priors Based on Extended PAC-Bayes Theory. In ICML, volume 80, 205–214.
Anthony, M.; and Bartlett, P. L. 2002. Neural Network Learning: Theoretical Foundations. Cambridge University Press.
Bartlett, P. L.; Foster, D. J.; and Telgarsky, M. 2017. Spectrally-normalized margin bounds for neural networks. In NeurIPS, 6240–6249.
Bartlett, P. L.; and Mendelson, S. 2002. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research 3: 463–482.
Baxter, J. 2000. A Model of Inductive Bias Learning. Journal of Artificial Intelligence Research 12: 149–198.
Chen, W.; Liu, Y.; Kira, Z.; Wang, Y. F.; and Huang, J. 2019. A Closer Look at Few-shot Classification. In ICLR.
Denevi, G.; Ciliberto, C.; Stamos, D.; and Pontil, M. 2018. Learning To Learn Around A Common Mean. In NeurIPS, 10190–10200.
Dudley, R. 1967. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis 1(3): 290–330.
Dziugaite, G. K.; and Roy, D. M. 2017. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. In UAI.
Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In ICML, 1126–1135.
Koltchinskii, V.; and Panchenko, D. 2002. Empirical Margin Distributions and Bounding the Generalization Error of Combined Classifiers. The Annals of Statistics 30(1): 1–50.
Ledoux, M.; and Talagrand, M. 1991. Probability in Banach Spaces. Berlin: Springer.
Lee, K.; Maji, S.; Ravichandran, A.; and Soatto, S. 2019. Meta-Learning With Differentiable Convex Optimization. In CVPR, 10657–10665.
Maurer, A. 2005. Algorithmic Stability and Meta-Learning. Journal of Machine Learning Research 6: 967–994.
Maurer, A. 2009. Transfer bounds for linear feature learning. Machine Learning 75(3): 327–350.
Maurer, A.; Pontil, M.; and Romera-Paredes, B. 2016. The Benefit of Multitask Representation Learning. Journal of Machine Learning Research 17: 81:1–81:32.
Mohri, M.; Rostamizadeh, A.; and Talwalkar, A. 2012. Foundations of Machine Learning. MIT Press.
Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch. In NeurIPS Workshop.
Pentina, A.; and Lampert, C. H. 2014. A PAC-Bayesian bound for Lifelong Learning. In ICML, volume 32, 991–999.
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115(3): 211–252.
Schapire, R. E.; Freund, Y.; Bartlett, P.; and Lee, W. S. 1997. Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods. In ICML, 322–330.
Sebastian, T.; and Lorien, P. 1998. Learning to Learn. Springer.
Snell, J.; Swersky, K.; and Zemel, R. 2017. Prototypical Networks for Few-shot Learning. In NeurIPS, 4077–4087.
Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P. H.; and Hospedales, T. M. 2018. Learning to Compare: Relation Network for Few-Shot Learning. In CVPR, 1199–1208.
Talagrand, M. 2014. Upper and Lower Bounds for Stochastic Processes: Modern Methods and Classical Problems. Springer.
Valiant, L. G. 1984. A Theory of the Learnable. Communications of the ACM 27(11): 1134–1142.
van der Vaart, A. W.; and Wellner, J. A. 1996. Weak Convergence and Empirical Processes: With Applications to Statistics. Berlin: Springer.
Vapnik, V. 1982. Estimation of Dependences Based on Empirical Data. Springer-Verlag New York.
Vinyals, O.; Blundell, C.; Lillicrap, T.; Kavukcuoglu, K.; and Wierstra, D. 2016. Matching Networks for One Shot Learning. In NeurIPS, 3630–3638.
Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The Caltech-UCSD Birds-200-2011 Dataset. Technical report.
Wainwright, M. J. 2019. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.
Zhou, D. 2002. The covering number in learning theory. Journal of Complexity 18(3): 739–767.
SUPPLEMENTARY MATERIAL
In the supplementary material, Section Auxiliary Results provides auxiliary results which help prove our main theorems for meta-learning based multiclass classification (MLMC). Section Gaussian Complexity Transfer Bound and Section Covering Number Transfer Bound show our technical proofs for the transfer bounds with Gaussian complexity and covering number, respectively. Section VC-dimension Transfer Bound gives the proof of our most important result, the margin-based transfer bound with VC-dimension (i.e., Theorem 1 in the main paper). Section Cross-Domain Experiment Results gives the comprehensive experiment results on the cross-domain dataset miniImageNet → CUB.

Auxiliary Results
We use $\{\sigma_i : i \in \mathbb{N}\}$ to denote a sequence of independent Bernoulli variables (i.e., $\sigma_i \in \{-1, +1\}$) and $\{\gamma_i : i \in \mathbb{N}\}$ a sequence of independent standard Gaussian variables (i.e., $\gamma_i \sim \mathcal{N}(0,1)$), which are also independent of $\{\sigma_i\}$. For $A \subseteq \mathbb{R}^m$ we define the Rademacher complexity and Gaussian complexity of $A$ as
$$R(A) = \mathbb{E}_\sigma\sup_{\mathbf{x}\in A}\frac{2}{m}\sum_{i=1}^m \sigma_i x_i, \qquad \Gamma(A) = \mathbb{E}_\gamma\sup_{\mathbf{x}\in A}\frac{2}{m}\sum_{i=1}^m \gamma_i x_i.$$
Jensen's inequality implies the following relationship between these two kinds of complexities:
$$\Gamma(A) = \mathbb{E}\sup_{\mathbf{x}\in A}\frac{2}{m}\sum_{i=1}^m \gamma_i x_i = \mathbb{E}\sup_{\mathbf{x}\in A}\frac{2}{m}\sum_{i=1}^m |\gamma_i|\sigma_i x_i \ge \mathbb{E}\sup_{\mathbf{x}\in A}\frac{2}{m}\sum_{i=1}^m \mathbb{E}|\gamma_i|\,\sigma_i x_i = \sqrt{\frac{2}{\pi}}\,R(A). \quad (3)$$
The following theorem is fundamental to deriving our results (e.g., Theorems 14, 15 and 16). For the readers' benefit we present it with a sketch of the proof; the detailed proof can be found in (van der Vaart and Wellner 1996) for (i) and (Koltchinskii and Panchenko 2002) for (ii).

Theorem 10. Let $\mathcal{F}$ be a real-valued function class on a space $\mathcal{X}$ and $\mu \in \mathcal{M}(\mathcal{X})$. For $\mathbf{x} = (x_1, \dots, x_m) \in \mathcal{X}^m$ define
$$\Phi(\mathbf{x}) = \sup_{f\in\mathcal{F}}\Big(\mathbb{E}_{x\sim\mu}[f(x)] - \frac{1}{m}\sum_{i=1}^m f(x_i)\Big).$$
(i) $\mathbb{E}_{\mathbf{x}\sim\mu^m}[\Phi(\mathbf{x})] \le \mathbb{E}_{\mathbf{x}\sim\mu^m}R(\mathcal{F}(\mathbf{x}))$.
(ii) If $\mathcal{F}$ is [0,1]-valued then $\forall\delta > 0$ we have with probability greater than $1-\delta$ in $\mathbf{x}\sim\mu^m$ that
$$\Phi(\mathbf{x}) \le \mathbb{E}_{\mathbf{x}\sim\mu^m}R(\mathcal{F}(\mathbf{x})) + \sqrt{\frac{\ln(1/\delta)}{2m}}.$$
(iii) $R(\mathcal{F}(\mathbf{x}))$ can be replaced by $\sqrt{\pi/2}\,\Gamma(\mathcal{F}(\mathbf{x}))$ in (i) and (ii).

Proof. For any Rademacher variables $\sigma = \{\sigma_i\}_{i=1}^m$,
$$\mathbb{E}_{\mathbf{x}\sim\mu^m}[\Phi(\mathbf{x})] = \mathbb{E}_{\mathbf{x}\sim\mu^m}\sup_{f\in\mathcal{F}}\frac{1}{m}\mathbb{E}_{\mathbf{x}'\sim\mu^m}\sum_{i=1}^m\big(f(x'_i) - f(x_i)\big) \le \mathbb{E}_{\mathbf{x},\mathbf{x}'\sim\mu^m\times\mu^m}\sup_{f\in\mathcal{F}}\frac{1}{m}\sum_{i=1}^m \sigma_i\big(f(x'_i) - f(x_i)\big).$$
The last inequality holds due to the symmetry of the measure $\mu^m\times\mu^m$ and the interchangeability between $x_i$ and $x'_i$. Taking the expectation over $\sigma$ and using the triangle inequality, we obtain (i). Then, applying the McDiarmid concentration inequality to $\Phi(\mathbf{x})$, we have with probability at least $1-\delta$ that $\Phi(\mathbf{x}) \le \mathbb{E}_{\mathbf{x}\sim\mu^m}[\Phi(\mathbf{x})] + \sqrt{\frac{\ln(1/\delta)}{2m}}$, and recalling (i) we have (ii). Finally, Eq. (3) gives (iii). □

The following theorems (Theorems 11, 12 and 13) about Gaussian complexity and Gaussian processes are needed to obtain our results (e.g., Theorems 15 and 16).
Theorem 11 (Slepian's Lemma (Ledoux and Talagrand 1991)). Let $\Omega$ and $\Xi$ be mean zero, separable Gaussian processes indexed by a common set $T$, such that
$$\mathbb{E}(\Omega_s - \Omega_t)^2 \le \mathbb{E}(\Xi_s - \Xi_t)^2 \quad \forall s, t \in T.$$
Then $\mathbb{E}\sup_{t\in T}\Omega_t \le \mathbb{E}\sup_{t\in T}\Xi_t$.

Theorem 12 (Gaussian Contraction Inequality, Exercise 5.12 in (Wainwright 2019)). Consider a bounded subset $T \subseteq \mathbb{R}^m$, and let $\{\gamma_i\}_{i\ge1}$ be independent $\mathcal{N}(0,1)$ random variables. Let $\Phi_i: \mathbb{R}\to\mathbb{R}$ be $\ell$-Lipschitz contractions, i.e., $\forall x, y \in \mathbb{R}$, $|\Phi_i(x) - \Phi_i(y)| \le \ell|x - y|$. Then we have
$$\mathbb{E}\sup_{t\in T}\sum_{i=1}^m \gamma_i\Phi_i(t_i) \le \ell\,\mathbb{E}\sup_{t\in T}\sum_{i=1}^m \gamma_i t_i.$$
Proof. Define Gaussian processes $\{\Omega_t\}_{t\in T}$, $\{\Xi_t\}_{t\in T}$, where $\Omega_t = \sum_{i=1}^m\gamma_i\Phi_i(t_i) = \langle\gamma, \Phi(t)\rangle$, $\Phi(t) = (\Phi_1(t_1), \dots, \Phi_m(t_m))^T$ and $\Xi_t = \sum_{i=1}^m \ell\gamma_i t_i$. We have
$$\mathbb{E}(\Omega_s - \Omega_t)^2 = \mathbb{E}\big(\langle\gamma, \Phi(s) - \Phi(t)\rangle\big)^2 = \big(\Phi(s)-\Phi(t)\big)^T\,\mathbb{E}[\gamma\gamma^T]\,\big(\Phi(s)-\Phi(t)\big) = \sum_{i=1}^m\big(\Phi_i(s_i)-\Phi_i(t_i)\big)^2 \le \sum_{i=1}^m \ell^2(s_i - t_i)^2 = \mathbb{E}(\Xi_s - \Xi_t)^2,$$
where we use $\mathbb{E}[\gamma\gamma^T] = I$ and the Lipschitz property. From Theorem 11 we have $\mathbb{E}\sup_{t\in T}\Omega_t \le \mathbb{E}\sup_{t\in T}\Xi_t$. □

Recall that $\Gamma(\mathcal{F}(\mathbf{x})) = \frac{2}{m}\mathbb{E}\sup_{f\in\mathcal{F}}\sum_{i=1}^m\gamma_i f(x_i)$. If the $\ell$-Lipschitz contractions satisfy $\Phi_i = \Phi$ for all $i\in[m]$, letting $\Phi\circ\mathcal{F}(\mathbf{x}) = \{(\Phi\circ f(x_1), \dots, \Phi\circ f(x_m)) : f\in\mathcal{F}\}$, then from Theorem 12 we have
$$\Gamma(\Phi\circ\mathcal{F}(\mathbf{x})) \le \ell\,\Gamma(\mathcal{F}(\mathbf{x})). \quad (4)$$

Theorem 13. Let $\mathcal{F}_1, \dots, \mathcal{F}_l$ be $l$ hypothesis sets in $\mathbb{R}^\mathcal{X}$, $l \ge 1$, and let $\mathcal{G} = \{\max\{f_1, \dots, f_l\} : f_i \in \mathcal{F}_i, i\in[l]\}$. Then for any sample $\mathbf{x}$ of size $m$, we have
$$\Gamma(\mathcal{G}(\mathbf{x})) \le \sum_{i=1}^l \Gamma(\mathcal{F}_i(\mathbf{x})).$$
Proof. The main idea is to notice that $\max\{f_1, f_2\} = (f_1 + f_2 + |f_1 - f_2|)/2$ and to use the sub-additivity of the sup function. The proof is similar to that of Lemma 9.1 in (Mohri, Rostamizadeh, and Talwalkar 2012) (a Rademacher complexity version of this result). The only difference is that for the Gaussian complexity we use the Gaussian contraction inequality, proved in Theorem 12. The detailed demonstration is left to readers. □

To demonstrate the refined Dudley entropy bound (Talagrand 2014) for the Gaussian complexity (see Theorem 17 below), we need the following refined Massart lemma, which bounds a finite set's Gaussian complexity via the set's cardinality.
Lemma 1 (Refined Massart Lemma). Let $A = \{\mathbf{a}_1, \dots, \mathbf{a}_N\}$ be a finite set of vectors in $\mathbb{R}^m$. Define $\bar{\mathbf{a}} = \frac{1}{N}\sum_{i=1}^N \mathbf{a}_i$. Then we have
$$\Gamma(A) \le \frac{2\max_{\mathbf{a}\in A}\|\mathbf{a} - \bar{\mathbf{a}}\|\sqrt{2\log N}}{m}.$$
Proof. Without loss of generality, we assume that $\bar{\mathbf{a}} = \mathbf{0}$. $\forall\lambda > 0$, let $A' = \{\lambda\mathbf{a}_1, \dots, \lambda\mathbf{a}_N\}$; then
$$\frac{m}{2}\Gamma(A') = \mathbb{E}_\gamma\sup_{\mathbf{a}\in A'}\langle\mathbf{a},\gamma\rangle = \mathbb{E}_\gamma\log\max_{\mathbf{a}\in A'}e^{\langle\mathbf{a},\gamma\rangle} \le \log\mathbb{E}_\gamma\max_{\mathbf{a}\in A'}e^{\langle\mathbf{a},\gamma\rangle} \quad \text{(Jensen)}$$
$$\le \log\mathbb{E}_\gamma\sum_{\mathbf{a}\in A'}e^{\langle\mathbf{a},\gamma\rangle} = \log\sum_{\mathbf{a}\in A'}\prod_{i=1}^m\mathbb{E}_{\gamma_i}e^{a_i\gamma_i} = \log\sum_{\mathbf{a}\in A'}\prod_{i=1}^m e^{\frac{a_i^2}{2}} \quad \Big(\int_\mathbb{R}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}e^{ax}\,\mathrm{d}x = e^{\frac{a^2}{2}}\Big)$$
$$= \log\sum_{\mathbf{a}\in A'}e^{\frac{\|\mathbf{a}\|^2}{2}} \le \log\Big(N\max_{\mathbf{a}\in A'}e^{\frac{\|\mathbf{a}\|^2}{2}}\Big) = \log N + \max_{\mathbf{a}\in A'}\frac{\|\mathbf{a}\|^2}{2}. \quad (5)$$
Let $L = \max_{\mathbf{a}\in A}\|\mathbf{a}\|$; then we have, $\forall\lambda > 0$,
$$\Gamma(A) = \frac{\Gamma(A')}{\lambda} \le \frac{2\log N}{m\lambda} + \frac{\lambda L^2}{m} \quad \text{by Eq. (5)}.$$
Plugging $\lambda = \sqrt{2\log N}/L$ into the above inequality, we have $\Gamma(A) \le \frac{2L\sqrt{2\log N}}{m}$. □
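As a sanity check, Lemma 1 can be verified numerically on random finite sets. A short NumPy sketch comparing a Monte Carlo estimate of $\Gamma(A)$ against the bound $2\max_{\mathbf{a}}\|\mathbf{a}-\bar{\mathbf{a}}\|\sqrt{2\log N}/m$ (as reconstructed above; all parameter values are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    N, m = 40, 200
    A = rng.normal(size=(N, m))                    # a finite set of N vectors in R^m

    gammas = rng.standard_normal((5000, m))
    gamma_cx = 2.0 / m * (gammas @ A.T).max(axis=1).mean()   # Monte Carlo Gamma(A)

    L = np.linalg.norm(A - A.mean(axis=0), axis=1).max()     # max ||a - a_bar||
    bound = 2.0 * L * np.sqrt(2.0 * np.log(N)) / m
    print(f"Gamma(A) ~ {gamma_cx:.3f} <= bound {bound:.3f}")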
Gaussian Complexity Transfer Bound

Theorem 14 (Margin-Based Transfer Bound for MLMC with Gaussian Complexity; Theorem 2 in the main paper). Let $\mathcal{F}$ be a hypothesis space of scoring functions. Given a classification algorithm $f$ and a margin parameter $\rho > 0$, for any environment $\varepsilon \in \mathcal{M}(\mathcal{M}(\mathcal{X}\times\mathcal{Y}))$ and for any $\delta > 0$, with probability at least $1-\delta$ on the data $(\mathbf{X},\mathbf{Y})\sim\hat{\varepsilon}^n$, we have for all feature maps $\phi\in\mathcal{D}$ that
$$R_\varepsilon(f_\phi) \le \frac{1}{n}\sum_{l=1}^n \hat{\ell}_{f_\phi}(\mathbf{x}^l,\mathbf{y}^l) + \frac{k\sqrt{2m\pi}}{\rho}\,\mathbb{E}_{(\mathbf{X},\mathbf{Y})\sim\hat{\varepsilon}^n}\Gamma(\Pi_\mathcal{F}(\mathbf{X})) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \frac{k\sqrt{2\pi}}{\rho}\,\mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m}\Gamma(\Pi_\mathcal{F}(\mathbf{x})),$$
where $\Pi_\mathcal{F}(\mathbf{X}) = \{(f_\phi(\mathbf{X},\mathbf{Y})(x^1_1,y), \dots, f_\phi(\mathbf{X},\mathbf{Y})(x^1_m,y), \dots, f_\phi(\mathbf{X},\mathbf{Y})(x^n_1,y), \dots, f_\phi(\mathbf{X},\mathbf{Y})(x^n_m,y)) : y\in\mathcal{Y},\phi\in\mathcal{D}\}$, $\Pi_\mathcal{F}(\mathbf{x}) = \{(f_\phi(\mathbf{x},\mathbf{y})(x_1,y), \dots, f_\phi(\mathbf{x},\mathbf{y})(x_m,y)) : y\in\mathcal{Y},\phi\in\mathcal{D}\}$, and the scoring function $f_\phi(\mathbf{X},\mathbf{Y})$ is defined by $f_\phi(\mathbf{X},\mathbf{Y})(x^l_i,y) = f_\phi(\mathbf{x}^l,\mathbf{y}^l)(x^l_i,y)$, $\forall i\in[m], l\in[n]$.

The proof strategy is to rewrite $R_\varepsilon(f_\phi) - \frac{1}{n}\sum_{l=1}^n\hat{\ell}_{f_\phi}(\mathbf{x}^l,\mathbf{y}^l)$ in the following form:
$$\Big(R_\varepsilon(f_\phi) - \mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\hat{\varepsilon}}\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y})\Big) + \Big(\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\hat{\varepsilon}}\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y}) - \frac{1}{n}\sum_{l=1}^n\hat{\ell}_{f_\phi}(\mathbf{x}^l,\mathbf{y}^l)\Big), \quad (6)$$
and to bound the two terms with Theorem 15 and Theorem 16, respectively. We first give Lemma 2 to prove Theorem 15.
Lemma 2. A random variable $\sigma$ obeys a Bernoulli distribution, where $\mathbb{P}\{\sigma = +1\} = p$, $\mathbb{P}\{\sigma = -1\} = q$ ($p + q = 1$). Another independent random variable $\gamma$ obeys a standard Gaussian distribution, $\gamma \sim \mathcal{N}(0,1)$. Then the product $\xi = \sigma\gamma$ still obeys a standard Gaussian distribution.

Proof. $\forall z\in\mathbb{R}$,
$$\mathbb{P}\{\xi \le z\} = \mathbb{P}\{\sigma\gamma \le z\} = \mathbb{P}\{\sigma > 0, \gamma \le z/\sigma\} + \mathbb{P}\{\sigma < 0, \gamma \ge z/\sigma\} = \mathbb{P}\{\sigma = +1, \gamma \le z\} + \mathbb{P}\{\sigma = -1, \gamma \ge -z\}$$
$$\overset{(i)}{=} \mathbb{P}\{\sigma = +1\}\mathbb{P}\{\gamma \le z\} + \mathbb{P}\{\sigma = -1\}\mathbb{P}\{\gamma \ge -z\} = p\int_{-\infty}^z\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\,\mathrm{d}x + q\int_{-z}^\infty\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\,\mathrm{d}x,$$
where (i) holds due to the independence of $\sigma$ and $\gamma$. The density function of $\xi$ is
$$f_\xi(z) = \frac{\mathrm{d}\mathbb{P}\{\xi\le z\}}{\mathrm{d}z} = p\,\frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}} + q\,\frac{1}{\sqrt{2\pi}}e^{-\frac{(-z)^2}{2}} = \frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}}.$$
Therefore, $\xi$ obeys a standard Gaussian distribution. □

Theorem 15 (Theorem 3 in the main paper). Let $\mathcal{F}$ and $\Pi_\mathcal{F}(\mathbf{x})$ be the same as in previous theorems. For $\rho > 0$, we have
$$R_\varepsilon(f_\phi) \le \mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\hat{\varepsilon}}\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y}) + \frac{k\sqrt{2\pi}}{\rho}\,\mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m}\Gamma(\Pi_\mathcal{F}(\mathbf{x})).$$

Proof.
Define the vector spaces
$$\mathcal{F}_\rho(\mathbf{x},\mathbf{y}) = \{(\rho_{f_\phi(\mathbf{x},\mathbf{y})}(x_1,y_1), \dots, \rho_{f_\phi(\mathbf{x},\mathbf{y})}(x_m,y_m)) : \phi\in\mathcal{D}\},$$
$$\Phi_\rho\circ\mathcal{F}_\rho(\mathbf{x},\mathbf{y}) = \{(\Phi_\rho\circ\rho_{f_\phi(\mathbf{x},\mathbf{y})}(x_1,y_1), \dots, \Phi_\rho\circ\rho_{f_\phi(\mathbf{x},\mathbf{y})}(x_m,y_m)) : \phi\in\mathcal{D}\}.$$
We then have
$$R_\varepsilon(f_\phi) - \mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\hat{\varepsilon}}\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y}) = \mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m}\big[\mathbb{E}_{(x,y)\sim\mu}\ell_{f_\phi(\mathbf{x},\mathbf{y})}(x,y) - \hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y})\big]$$
$$\le \mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m}\Big[\sup_{\phi\in\mathcal{D}}\Big(\mathbb{E}_{(x,y)\sim\mu}\Phi_\rho\circ\rho_{f_\phi(\mathbf{x},\mathbf{y})}(x,y) - \frac{1}{m}\sum_{i=1}^m\Phi_\rho\circ\rho_{f_\phi(\mathbf{x},\mathbf{y})}(x_i,y_i)\Big)\Big]$$
$$\overset{(i)}{\le} \sqrt{\frac{\pi}{2}}\,\mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m}\Gamma(\Phi_\rho\circ\mathcal{F}_\rho(\mathbf{x},\mathbf{y})) \overset{(ii)}{\le} \sqrt{\frac{\pi}{2}}\,\frac{1}{\rho}\,\mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m}\Gamma(\mathcal{F}_\rho(\mathbf{x},\mathbf{y})).$$
Inequality (i) holds due to Theorem 10, (i) and (iii), and inequality (ii) uses Eq. (4) with the $\frac{1}{\rho}$-Lipschitz loss $\Phi_\rho$. Using the sub-additivity of the sup function, we can bound the Gaussian complexity
$$\Gamma(\mathcal{F}_\rho(\mathbf{x},\mathbf{y})) = \mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m\big(f_\phi(\mathbf{x},\mathbf{y})(x_i,y_i) - \max_{y\neq y_i}f_\phi(\mathbf{x},\mathbf{y})(x_i,y)\big)\gamma_i$$
$$\le \mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m f_\phi(\mathbf{x},\mathbf{y})(x_i,y_i)\gamma_i + \mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m\big(-\max_{y\neq y_i}f_\phi(\mathbf{x},\mathbf{y})(x_i,y)\big)\gamma_i. \quad (7)$$
With the sub-additivity of sup, the first term above can be bounded by
$$\mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m f_\phi(\mathbf{x},\mathbf{y})(x_i,y_i)\gamma_i = \mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m\sum_{y\in\mathcal{Y}}\gamma_i f_\phi(\mathbf{x},\mathbf{y})(x_i,y)\,\mathbb{1}_{y=y_i} \overset{(i)}{\le} \sum_{y\in\mathcal{Y}}\mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m\gamma_i f_\phi(\mathbf{x},\mathbf{y})(x_i,y)\Big(\frac{\epsilon_i+1}{2}\Big)$$
$$\le \frac{1}{2}\sum_{y\in\mathcal{Y}}\mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m f_\phi(\mathbf{x},\mathbf{y})(x_i,y)\epsilon_i\gamma_i + \frac{1}{2}\sum_{y\in\mathcal{Y}}\mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m f_\phi(\mathbf{x},\mathbf{y})(x_i,y)\gamma_i = k\,\Gamma(\Pi_\mathcal{F}(\mathbf{x})). \quad (8)$$
Inequality (i) uses the fact that $\epsilon_i = 2\cdot\mathbb{1}_{y=y_i} - 1 \in \{-1,+1\}$. The last equality holds because, from Lemma 2, $\epsilon_i\gamma_i$ and $\gamma_i$ admit the same distribution, and $|\mathcal{Y}| = k$. Similarly, we can obtain the upper bound of the second term on the r.h.s. of Eq. (7):
$$\mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m\big(-\max_{y\neq y_i}f_\phi(\mathbf{x},\mathbf{y})(x_i,y)\big)\gamma_i \le \mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m\max_{y\in\mathcal{Y}}f_\phi(\mathbf{x},\mathbf{y})(x_i,y)\gamma_i \le \sum_{y\in\mathcal{Y}}\mathbb{E}_\gamma\sup_{\phi\in\mathcal{D}}\frac{2}{m}\sum_{i=1}^m f_\phi(\mathbf{x},\mathbf{y})(x_i,y)\gamma_i = k\,\Gamma(\Pi_\mathcal{F}(\mathbf{x})), \quad (9)$$
where the second inequality applies Theorem 13. Combining Eqs. (7)-(9), we derive the expected result. □

Theorem 16 (Theorem 4 in the main paper). Let $\mathcal{F}$ and $\Pi_\mathcal{F}(\mathbf{X})$ be the same as in previous theorems. For any $\delta > 0$, we have, with probability at least $1-\delta$ on the draw of the sample $((\mathbf{x}^1,\mathbf{y}^1), \dots, (\mathbf{x}^n,\mathbf{y}^n))$,
$$\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\hat{\varepsilon}}\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y}) \le \frac{1}{n}\sum_{l=1}^n\hat{\ell}_{f_\phi}(\mathbf{x}^l,\mathbf{y}^l) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \frac{k\sqrt{2m\pi}}{\rho}\,\mathbb{E}_{(\mathbf{X},\mathbf{Y})\sim\hat{\varepsilon}^n}\Gamma(\Pi_\mathcal{F}(\mathbf{X})).$$

Proof.
Fix a meta-sample $(\mathbf{X},\mathbf{Y}) = ((\mathbf{x}^1,\mathbf{y}^1), \dots, (\mathbf{x}^n,\mathbf{y}^n))$. Define Gaussian processes $\Omega_\phi$ and $\Xi_\phi$ indexed by $\phi$ as follows:
$$\Omega_\phi = \sum_{l=1}^n\gamma^l\,\hat{\ell}_{f_\phi}(\mathbf{x}^l,\mathbf{y}^l), \qquad \Xi_\phi = \sum_{l=1}^n\sum_{i=1}^m\frac{\gamma^l_i}{\sqrt{m}\,\rho}\,\rho_{f_\phi(\mathbf{x}^l,\mathbf{y}^l)}(x^l_i,y^l_i),$$
where the $\gamma^l$ and $\gamma^l_i$ are mutually independent standard Gaussian variables. Define the function class $G_\phi = \{(\mathbf{x},\mathbf{y})\mapsto\hat{\ell}_{f_\phi}(\mathbf{x},\mathbf{y})\}$ and observe that $\frac{2}{n}\mathbb{E}\sup_{\phi\in\mathcal{D}}\Omega_\phi = \Gamma(G_\phi(\mathbf{X},\mathbf{Y}))$. Then, $\forall\phi_1,\phi_2\in\mathcal{D}$, by using the orthogonality of the $\gamma^l$ we have
$$\mathbb{E}(\Omega_{\phi_1} - \Omega_{\phi_2})^2 = \mathbb{E}_\gamma\Big(\sum_{l=1}^n\gamma^l\big(\hat{\ell}_{f_{\phi_1}}(\mathbf{x}^l,\mathbf{y}^l) - \hat{\ell}_{f_{\phi_2}}(\mathbf{x}^l,\mathbf{y}^l)\big)\Big)^2 = \sum_{l=1}^n\Big(\frac{1}{m}\sum_{i=1}^m\Phi_\rho(\rho_{f_{\phi_1}(\mathbf{x}^l,\mathbf{y}^l)}(x^l_i,y^l_i)) - \Phi_\rho(\rho_{f_{\phi_2}(\mathbf{x}^l,\mathbf{y}^l)}(x^l_i,y^l_i))\Big)^2$$
$$\overset{(i)}{\le} \sum_{l=1}^n\frac{1}{m^2\rho^2}\Big(\sum_{i=1}^m\big|\rho_{f_{\phi_1}(\mathbf{x}^l,\mathbf{y}^l)}(x^l_i,y^l_i) - \rho_{f_{\phi_2}(\mathbf{x}^l,\mathbf{y}^l)}(x^l_i,y^l_i)\big|\Big)^2 \overset{(ii)}{\le} \sum_{l=1}^n\frac{1}{m\rho^2}\sum_{i=1}^m\big(\rho_{f_{\phi_1}(\mathbf{x}^l,\mathbf{y}^l)}(x^l_i,y^l_i) - \rho_{f_{\phi_2}(\mathbf{x}^l,\mathbf{y}^l)}(x^l_i,y^l_i)\big)^2 = \mathbb{E}(\Xi_{\phi_1} - \Xi_{\phi_2})^2.$$
Inequality (i) uses the Lipschitz property of the margin loss, and inequality (ii) applies the mean inequality $(\frac{1}{m}\sum_i|a_i|)^2 \le \frac{1}{m}\sum_i a_i^2$. Then from Theorem 11 we have $\mathbb{E}\sup_{\phi\in\mathcal{D}}\Omega_\phi \le \mathbb{E}\sup_{\phi\in\mathcal{D}}\Xi_\phi$. Multiplying by $2/n$ this becomes
$$\Gamma(G_\phi(\mathbf{X},\mathbf{Y})) \le \frac{2}{n}\,\mathbb{E}\sup_{\phi\in\mathcal{D}}\sum_{l=1}^n\sum_{i=1}^m\frac{\gamma^l_i}{\sqrt{m}\,\rho}\,\rho_{f_\phi(\mathbf{x}^l,\mathbf{y}^l)}(x^l_i,y^l_i) = \frac{\sqrt{m}}{\rho}\,\Gamma(\mathcal{F}_\rho(\mathbf{X},\mathbf{Y})),$$
where $\mathcal{F}_\rho(\mathbf{X},\mathbf{Y}) = \{(\rho_{f_\phi(\mathbf{X},\mathbf{Y})}(x^1_1,y^1_1), \dots, \rho_{f_\phi(\mathbf{X},\mathbf{Y})}(x^1_m,y^1_m), \dots, \rho_{f_\phi(\mathbf{X},\mathbf{Y})}(x^n_1,y^n_1), \dots, \rho_{f_\phi(\mathbf{X},\mathbf{Y})}(x^n_m,y^n_m)) : \phi\in\mathcal{D}\}$ and we define $\rho_{f_\phi(\mathbf{X},\mathbf{Y})}(x^l_i,y^l_i) = \rho_{f_\phi(\mathbf{x}^l,\mathbf{y}^l)}(x^l_i,y^l_i)$, $\forall i\in[m], l\in[n]$. Analogous to the process used to bound the Gaussian complexity $\Gamma(\mathcal{F}_\rho(\mathbf{x},\mathbf{y}))$ in Theorem 15, we can bound $\Gamma(\mathcal{F}_\rho(\mathbf{X},\mathbf{Y}))$ by $2k\,\Gamma(\Pi_\mathcal{F}(\mathbf{X}))$ and draw the conclusion
$$\Gamma(G_\phi(\mathbf{X},\mathbf{Y})) \le \frac{2k\sqrt{m}}{\rho}\,\Gamma(\Pi_\mathcal{F}(\mathbf{X})). \quad (10)$$
Combining Eq. (10) and Theorem 10 (ii), (iii) completes the proof. □

Covering Number Transfer Bound
Theorem 17 (Refined Dudley Entropy Bound; Theorem 5 in the main paper). For any real-valued function class $\mathcal{F}$ containing functions $f:\mathcal{X}\to\mathbb{R}$, assume that $\sup_{f\in\mathcal{F}}\|f\|_2$ is bounded under the $L_2(\mathbf{x})$ and $L_2(\mathbf{X})$ metrics respectively. Then
$$\Gamma(\mathcal{F}(\mathbf{x})) \le \frac{24}{\sqrt{m}}\int_0^{\sup_{f\in\mathcal{F}}\|f\|_2}\sqrt{\log\mathcal{N}(\tau,\mathcal{F},L_2(\mathbf{x}))}\,\mathrm{d}\tau, \qquad \Gamma(\mathcal{F}(\mathbf{X})) \le \frac{24}{\sqrt{nm}}\int_0^{\sup_{f\in\mathcal{F}}\|f\|_2}\sqrt{\log\mathcal{N}(\tau,\mathcal{F},L_2(\mathbf{X}))}\,\mathrm{d}\tau.$$

Proof.
We prove only the first part; the main idea is the generic chaining technique. Let $\alpha_0 = \sup_{f\in\mathcal{F}}\|f\|$ and $\alpha_i = 2^{-i} \sup_{f\in\mathcal{F}}\|f\|$ for $i \ge 1$. Let $T_0 = \{0\}$, which is an $\alpha_0$-cover of $\mathcal{F}$, and let $T_i$ ($i \ge 1$) be an $\alpha_i$-cover of $\mathcal{F}$ with the smallest cardinality. Then for every $f$ and every $i \ge 1$ we can choose $\hat f_i \in T_i$ such that $\|f - \hat f_i\| \le \alpha_i$, and rewrite $f$ as a "chain": $f = f - \hat f_N + \sum_{i=1}^N (\hat f_i - \hat f_{i-1})$. Thus
\begin{align*}
\Gamma(\mathcal{F}(\mathbf{x})) &= \frac{2}{m}\,\mathbb{E}_\gamma \sup_{f\in\mathcal{F}} \sum_{i=1}^m \gamma_i f(x_i) = \frac{2}{m}\,\mathbb{E}_\gamma \sup_{f\in\mathcal{F}} \langle \gamma, f(\mathbf{x}) \rangle
= \frac{2}{m}\,\mathbb{E}_\gamma \sup_{f\in\mathcal{F}} \Big( \langle \gamma, f - \hat f_N \rangle + \Big\langle \gamma, \sum_{i=1}^N (\hat f_i - \hat f_{i-1}) \Big\rangle \Big) \\
&\le \frac{2}{m}\,\mathbb{E}_\gamma \sup_{f\in\mathcal{F}} \langle \gamma, f - \hat f_N \rangle + \sum_{i=1}^N \frac{2}{m}\,\mathbb{E}_\gamma \sup_{f\in\mathcal{F}} \langle \gamma, \hat f_i - \hat f_{i-1} \rangle. \tag{11}
\end{align*}
To bound the first term of the above equation, we have
\begin{align*}
\frac{2}{m}\,\mathbb{E}_\gamma \sup_{f\in\mathcal{F}} \langle \gamma, f - \hat f_N \rangle
&= \frac{2}{m}\,\mathbb{E}_\gamma \sup_{f\in\mathcal{F}} \sum_{i=1}^m \gamma_i \big( f(x_i) - \hat f_N(x_i) \big) \\
&\overset{(i)}{\le} \frac{2}{m}\,\mathbb{E}_\gamma \sup_{f\in\mathcal{F}} \Big( \sum_{i=1}^m \gamma_i^2 \Big)^{1/2} \Big( \sum_{i=1}^m \big( f(x_i) - \hat f_N(x_i) \big)^2 \Big)^{1/2} \\
&= 2\,\big( \mathbb{E}_\gamma \|\gamma\| \big)\big( \sup_{f\in\mathcal{F}} \|f - \hat f_N\| \big)
\overset{(ii)}{\le} 2\,\sqrt{\mathbb{E}_\gamma \|\gamma\|^2}\,\big( \sup_{f\in\mathcal{F}} \|f - \hat f_N\| \big) \le 2\alpha_N. \tag{12}
\end{align*}
Here (i) and (ii) use the Cauchy–Schwarz and Jensen inequalities respectively. The last inequality of Eq. (12) holds because, with $\|\cdot\|$ the normalized $L_2(\mathbf{x})$ norm, $\mathbb{E}_\gamma \|\gamma\|^2 = \mathbb{E}\,\frac{1}{m}\sum_{i=1}^m \gamma_i^2 = 1$ and $T_N$ is an $\alpha_N$-cover of $\mathcal{F}$. To bound the second term in the r.h.s. of Eq. (11), with the triangle inequality we have $\|\hat f_i - \hat f_{i-1}\| \le \|\hat f_i - f\| + \|f - \hat f_{i-1}\| \le \alpha_i + \alpha_{i-1} = 3\alpha_i$. Define the function class $\hat{\mathcal{F}}_i = \{\hat f_i - \hat f_{i-1} : \hat f_i \in T_i, \hat f_{i-1} \in T_{i-1}\}$; then
\begin{align*}
\sum_{i=1}^N \frac{2}{m}\,\mathbb{E}_\gamma \sup_{f\in\mathcal{F}} \langle \gamma, \hat f_i - \hat f_{i-1} \rangle
&= \sum_{i=1}^N \Gamma(\hat{\mathcal{F}}_i)
\le \sum_{i=1}^N \frac{6\alpha_i}{\sqrt{m}} \sqrt{2\log\big(|T_i|\cdot|T_{i-1}|\big)} \quad\text{(Lemma 1)} \\
&\le \sum_{i=1}^N \frac{12\alpha_i}{\sqrt{m}} \sqrt{\log|T_i|}
= \frac{24}{\sqrt{m}} \sum_{i=1}^N (\alpha_i - \alpha_{i+1}) \sqrt{\log|T_i|} \\
&\le \frac{24}{\sqrt{m}} \sum_{i=1}^N (\alpha_i - \alpha_{i+1}) \sqrt{\log \mathcal{N}(\tau,\mathcal{F},L_2(\mathbf{x}))} \quad (\tau \le \alpha_i) \\
&\le \frac{24}{\sqrt{m}} \sum_{i=1}^N \int_{\alpha_{i+1}}^{\alpha_i} \sqrt{\log \mathcal{N}(\tau,\mathcal{F},L_2(\mathbf{x}))}\,\mathrm{d}\tau
= \frac{24}{\sqrt{m}} \int_{\alpha_{N+1}}^{\alpha_0} \sqrt{\log \mathcal{N}(\tau,\mathcal{F},L_2(\mathbf{x}))}\,\mathrm{d}\tau. \tag{13}
\end{align*}
For any $\epsilon > 0$, let $N = \sup\{i : \alpha_i > 2\epsilon\}$. Then we have $\epsilon < \alpha_{N+1} \le 2\epsilon$ and $\alpha_N = 2\alpha_{N+1} \le 4\epsilon$. Combining Eqs. (11)-(13) and recalling that $\alpha_0 = \sup_{f\in\mathcal{F}}\|f\|$, we have
\[
\Gamma(\mathcal{F}(\mathbf{x})) \le 8\epsilon + \frac{24}{\sqrt{m}} \int_{\epsilon}^{\sup_{f\in\mathcal{F}}\|f\|} \sqrt{\log \mathcal{N}(\tau,\mathcal{F},L_2(\mathbf{x}))}\,\mathrm{d}\tau. \tag{14}
\]
Letting $\epsilon \to 0$ on the right-hand side of Eq. (14), we obtain the final result. $\blacksquare$
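Theorem 17 can be illustrated numerically on a small finite class. In the sketch below — entirely our own construction — covering numbers in the normalized $L_2(\mathbf{x})$ metric are over-estimated with a greedy cover (greedy cover sizes upper-bound the minimal $\mathcal{N}$, so the computed right-hand side is itself a valid, if loose, upper bound), and the left-hand side $\Gamma(\mathcal{F}(\mathbf{x}))$ is estimated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(2)
m, trials = 30, 20000

# Finite toy class: each row is the vector (f(x_1), ..., f(x_m)) of one f.
F = rng.uniform(-1.0, 1.0, size=(200, m))
sup_norm = np.sqrt((F ** 2).mean(axis=1)).max()   # sup_f ||f|| (normalized L2)

gam = rng.normal(size=(trials, m))
gamma_lhs = (2.0 / m) * (gam @ F.T).max(axis=1).mean()

def covering_number(tau):
    """Greedy tau-cover size in the normalized L2(x) metric (upper bounds N)."""
    centers = []
    for f in F:
        if not centers or min(np.sqrt(((f - c) ** 2).mean()) for c in centers) > tau:
            centers.append(f)
    return len(centers)

taus = np.linspace(1e-3, sup_norm, 60)
entropy = np.sqrt(np.log([covering_number(t) for t in taus]))
dudley_rhs = 24.0 / np.sqrt(m) * np.trapz(entropy, taus)
print(gamma_lhs, dudley_rhs)    # the Dudley integral bound dominates the complexity
```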
Combining Theorem 14 and Theorem 17, we immediately obtain the following margin-based covering number bound for few-shot learning.

Theorem 18 (Margin-based Transfer Bound for MLMC with Covering Number, Theorem 6 in main paper). Let $\mathcal{F}$ and $\Pi_1\mathcal{F}$ be the same as in previous theorems. Given a classification algorithm $f$ and a margin parameter $\rho > 0$, for any environment $\varepsilon \in \mathcal{M}(\mathcal{M}(\mathcal{X}\times\mathcal{Y}))$ and for any $\delta > 0$, with probability at least $1-\delta$ on the data $(X,Y) \sim \hat\varepsilon^n$, we have for all feature maps $\phi \in \mathcal{D}$ that
\begin{align*}
R_\varepsilon(f_\phi) \le \frac{1}{n}\sum_{l=1}^n \hat\ell_{f_\phi}(\mathbf{x}_l,\mathbf{y}_l) + \sqrt{\frac{\ln(1/\delta)}{2n}}
&+ \frac{24k\sqrt{2\pi}}{\rho\sqrt{n}}\,\mathbb{E}_{(X,Y)\sim\hat\varepsilon^n} \int_0^{\sup_{f\in\Pi_1\mathcal{F}}\|f\|} \sqrt{\log \mathcal{N}(\tau,\Pi_1\mathcal{F},L_2(X))}\,\mathrm{d}\tau \\
&+ \frac{24k\sqrt{2\pi}}{\rho\sqrt{m}}\,\mathbb{E}_{\mu\sim\varepsilon}\mathbb{E}_{(\mathbf{x},\mathbf{y})\sim\mu^m} \int_0^{\sup_{f\in\Pi_1\mathcal{F}}\|f\|} \sqrt{\log \mathcal{N}(\tau,\Pi_1\mathcal{F},L_2(\mathbf{x}))}\,\mathrm{d}\tau.
\end{align*}
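To make the structure of this bound concrete, a plug-in evaluator is sketched below. It is only an illustration of how the terms combine: the constant $24k\sqrt{2\pi}/\rho$ follows our reconstruction above, and the two entropy integrals are supplied as numbers (e.g. Monte Carlo estimates such as the one in the previous snippet); all names and example values are hypothetical.

```python
import numpy as np

def covering_number_transfer_bound(avg_margin_loss, n, m, k, rho, delta,
                                   entropy_integral_X, entropy_integral_x):
    """Right-hand side of the covering-number bound (Theorem 18 shape).

    entropy_integral_X / entropy_integral_x stand in for the two expected
    Dudley integrals over L2(X) and L2(x); they must be estimated separately.
    """
    confidence = np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    coef = 24.0 * k * np.sqrt(2.0 * np.pi) / rho
    return (avg_margin_loss + confidence
            + coef / np.sqrt(n) * entropy_integral_X
            + coef / np.sqrt(m) * entropy_integral_x)

# e.g. 5-way 5-shot: n training tasks, m = 25 support points per task
print(covering_number_transfer_bound(0.15, n=10_000, m=25, k=5, rho=1.0,
                                     delta=0.01, entropy_integral_X=0.5,
                                     entropy_integral_x=0.5))
```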
VC-dimension Transfer Bound

In this section, we bound the covering numbers $\mathcal{N}(\tau,\Pi_1\mathcal{F},L_2(X))$ and $\mathcal{N}(\tau,\Pi_1\mathcal{F},L_2(\mathbf{x}))$ in Theorem 18 via the VC-dimension of the hypothesis space $\Pi_1\mathcal{F}$. This yields our most important theoretical result, Theorem 20.

Theorem 19 (Theorem 2.6.7 in (van der Vaart and Wellner 1996)). Let $\mathcal{F}$ be a real-valued function class on $\mathcal{X}$ with VC-dimension $v$. Assume that $\mathcal{F}$ is uniformly bounded by $b > 0$. Then, for any probability distribution $Q$ on $\mathcal{X}$ and for any $p \ge 1$,
\[
\mathcal{N}\big(\tau,\mathcal{F},\|\cdot\|_{L_p(Q)}\big) \le C\,(v+1)\,(16e)^{v+1}\Big(\frac{b}{\tau}\Big)^{pv}
\]
for some universal constant $C > 0$, where for any $f, g \in \mathcal{F}$, $\|f - g\|_{L_p(Q)} = \big(\int |f-g|^p\,\mathrm{d}Q\big)^{1/p}$.

Theorem 20 (Margin-based Transfer Bound for MLMC with VC-dimension, Theorem 1 in main paper). Let the VC-dimension of $\Pi_1\mathcal{F}$ (defined in previous theorems) be $v$, and let $\Pi_1\mathcal{F}$ be uniformly bounded by $b > 0$. Given a classification algorithm $f$ and a margin parameter $\rho > 0$, for any environment $\varepsilon \in \mathcal{M}(\mathcal{M}(\mathcal{X}\times\mathcal{Y}))$ and for any $\delta > 0$, with probability at least $1-\delta$ on the data $(X,Y) \sim \hat\varepsilon^n$, we have for all feature maps $\phi \in \mathcal{D}$ that
\[
R_\varepsilon(f_\phi) \le \frac{1}{n}\sum_{l=1}^n \hat\ell_{f_\phi}(\mathbf{x}_l,\mathbf{y}_l) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \Big(\frac{k}{\rho\sqrt{m}} + \frac{k}{\rho\sqrt{n}}\Big)\big(C_1\sqrt{v} + C_2\big),
\]
where the constants are $C_1 = 24\sqrt{2\pi}\,b\,\big(1 + \sqrt{\log(16e)} + 2\sqrt{2}\big)$ and $C_2 = 24\sqrt{2\pi}\,b\,\big(\sqrt{\log C} + \sqrt{\log(16e)}\big)$, and $C$ is the universal constant in Theorem 19.

Proof. Notice that for all $y \in \mathcal{Y}$,
\[
\sup_{f\in\Pi_1\mathcal{F}} \|f\| = \sup_{f\in\Pi_1\mathcal{F}} \sqrt{\frac{1}{m}\sum_{i=1}^m f(x_i,y)^2} \le \sqrt{\frac{1}{m}\sum_{i=1}^m b^2} = b.
\]
From Theorem 19, we know there exists a universal constant $C$ such that $\mathcal{N}(\tau,\Pi_1\mathcal{F},L_2(\mathbf{x})) \le C(v+1)(16e)^{v+1}(b/\tau)^{2v}$. Then the integral of the square root of the log covering number in Theorem 18 can be bounded by
\begin{align*}
I &= \int_0^{\sup_{f\in\Pi_1\mathcal{F}}\|f\|} \sqrt{\log \mathcal{N}(\tau,\Pi_1\mathcal{F},L_2(\mathbf{x}))}\,\mathrm{d}\tau
\le \int_0^b \sqrt{\log\Big( C(v+1)(16e)^{v+1}\Big(\frac{b}{\tau}\Big)^{2v} \Big)}\,\mathrm{d}\tau \\
&= \int_0^b \sqrt{\log C + \log(v+1) + (v+1)\log(16e) + 2v\log\frac{b}{\tau}}\,\mathrm{d}\tau \\
&\overset{(i)}{\le} \int_0^b \sqrt{\log C + v + (v+1)\log(16e) + \frac{2vb}{\tau}}\,\mathrm{d}\tau
\overset{(ii)}{\le} \int_0^b \Big( \sqrt{\log C + v + (v+1)\log(16e)} + \sqrt{\frac{2vb}{\tau}} \Big)\,\mathrm{d}\tau \\
&\le \alpha\sqrt{v} + \beta,
\end{align*}
where (i) and (ii) hold due to the basic inequalities $\ln(x+1) \le x$ (so that $\log(v+1) \le v$ and $\log(b/\tau) \le b/\tau$) and $\sqrt{x+y} \le \sqrt{x} + \sqrt{y}$. Further, $\alpha = \big(1 + \sqrt{\log(16e)} + 2\sqrt{2}\big)\,b$ and $\beta = \big(\sqrt{\log C} + \sqrt{\log(16e)}\big)\,b$. Similarly, we can give the same bound for the integral $\int_0^{\sup_{f\in\Pi_1\mathcal{F}}\|f\|} \sqrt{\log \mathcal{N}(\tau,\Pi_1\mathcal{F},L_2(X))}\,\mathrm{d}\tau \le \alpha\sqrt{v} + \beta$. Then, combining with Theorem 18, we have with probability at least $1-\delta$ on the data $(X,Y) \sim \hat\varepsilon^n$,
\[
R_\varepsilon(f_\phi) \le \frac{1}{n}\sum_{l=1}^n \hat\ell_{f_\phi}(\mathbf{x}_l,\mathbf{y}_l) + \sqrt{\frac{\ln(1/\delta)}{2n}} + \Big(\frac{k}{\rho\sqrt{m}} + \frac{k}{\rho\sqrt{n}}\Big)\big(C_1\sqrt{v} + C_2\big),
\]
where $C_1 = 24\sqrt{2\pi}\,\alpha = 24\sqrt{2\pi}\,b\,\big(1 + \sqrt{\log(16e)} + 2\sqrt{2}\big)$ and $C_2 = 24\sqrt{2\pi}\,\beta = 24\sqrt{2\pi}\,b\,\big(\sqrt{\log C} + \sqrt{\log(16e)}\big)$. $\blacksquare$
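The practical content of Theorem 20 is the $O(k(\rho^{-1}m^{-1/2} + \rho^{-1}n^{-1/2})\sqrt{v})$ decay of the complexity term in the number of tasks $n$ and shots $m$. A small evaluator of our own, under the illustrative assumption $C = 1$ for the unspecified universal constant of Theorem 19, makes this scaling visible:

```python
import numpy as np

def vc_transfer_bound(avg_margin_loss, n, m, k, rho, delta, v, b, C=1.0):
    """Evaluate the Theorem 20 bound; C = 1 is an arbitrary illustrative choice."""
    C1 = 24 * np.sqrt(2 * np.pi) * b * (1 + np.sqrt(np.log(16 * np.e)) + 2 * np.sqrt(2))
    C2 = 24 * np.sqrt(2 * np.pi) * b * (np.sqrt(np.log(C)) + np.sqrt(np.log(16 * np.e)))
    complexity = (k / (rho * np.sqrt(m)) + k / (rho * np.sqrt(n))) * (C1 * np.sqrt(v) + C2)
    confidence = np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    return avg_margin_loss + confidence + complexity

# Complexity decays like 1/sqrt(n) in tasks and 1/sqrt(m) in shots:
for n in (10**2, 10**4, 10**6):
    print(n, vc_transfer_bound(0.15, n=n, m=25, k=5, rho=1.0, delta=0.01, v=50, b=1.0))
```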
Cross-Domain Experiment Results

We provide further experimental results on the cross-domain dataset miniImageNet → CUB in Table 4, where different 5-way classification settings are considered. We can still see that the results obtained with the multi-margin loss are comparable to those obtained with the cross-entropy loss in all cases. All of our experiments are based on the code released at https://github.com/wyharveychen/CloserLookFewShot and https://github.com/kjunelee/MetaOptNet; a minimal sketch of the loss substitution involved is given after Table 4.

Table 4: The 5-way s-shot classification results on the miniImageNet → CUB dataset. We report the average accuracy (%, top-1) as well as the 95% confidence interval over all 600 test episodes. We compare the
multi-margin loss with the cross-entropy loss. Rows: Baseline++ (Chen et al. 2019); MAML (Finn, Abbeel, and Levine 2017); MatchingNet (Vinyals et al. 2016); ProtoNet (Snell, Swersky, and Zemel 2017); RelationNet (Sung et al. 2018); MetaOptNet (Lee et al. 2019). [The per-setting accuracy entries of Table 4 are not recoverable from this copy and are omitted.]
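The loss substitution behind these experiments amounts to swapping the criterion applied to the classification logits. A minimal sketch is given below using PyTorch's built-in `torch.nn.MultiMarginLoss` (a hinge-style multiclass margin loss); the feature dimension, batch, and `head` module are illustrative stand-ins of our own, not excerpts from the released code, and the `margin` argument is the hinge margin of that criterion rather than the theoretical parameter $\rho$ itself.

```python
import torch
import torch.nn as nn

# Minimal sketch: replace cross-entropy with the multi-margin loss
# in the k-way classification head of a few-shot model.
k, feat_dim = 5, 512
head = nn.Linear(feat_dim, k)                 # linear head on a frozen embedding
criterion = nn.MultiMarginLoss(margin=1.0)    # multiclass hinge/margin loss
# criterion = nn.CrossEntropyLoss()           # the commonly-used alternative

features = torch.randn(32, feat_dim)          # embeddings of a support batch
labels = torch.randint(0, k, (32,))
loss = criterion(head(features), labels)
loss.backward()
print(loss.item())
```

Because only the criterion changes, any of the meta-learning variants in Table 4 can be trained this way without touching the embedding or episode pipeline.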