Few-Shot Classification with Feature Map Reconstruction Networks
Davis Wertheimer*    Luming Tang*    Bharath Hariharan
Cornell University
{dww78, lt453, bh497}@cornell.edu
*Equal contribution

Abstract
In this paper we reformulate few-shot classification as a reconstruction problem in latent space. The ability of the network to reconstruct a query feature map from support features of a given class predicts membership of the query in that class. We introduce a novel mechanism for few-shot classification by regressing directly from support features to query features in closed form, without introducing any new modules or large-scale learnable parameters. The resulting Feature Map Reconstruction Networks are both more performant and computationally efficient than previous approaches. We demonstrate consistent and significant accuracy gains on four fine-grained benchmarks with varying neural architectures. Our model is also competitive on the non-fine-grained mini-ImageNet benchmark with minimal bells and whistles.
1. Introduction
Convolutional neural classifiers have achieved excellent performance in a wide range of settings and benchmarks, but this performance is achieved through large quantities of labelled images from the relevant classes. In practice, such a large quantity of human-annotated images may not always be available for the categories of interest. Producing a performant classifier in these settings requires a neural network that can rapidly adapt to novel, possibly unseen classes, using a small number of representative images.

This challenge is formalized in the few-shot classification problem, where networks are evaluated on individual episodes drawn from a task distribution. Each episode has associated classes of interest, with images in each class partitioned into a small support set and a larger query set. Using the ground truth class labels provided for the support images, the classifier must correctly classify the queries.

Figure 1. Visual intuition for FRN: we reconstruct each query image as a weighted sum of components from the support images. Reconstructions from the same class are better than reconstructions from different classes, enabling classification. FRN performs the reconstruction in latent space, as opposed to the image space shown here.

A particularly promising approach to few-shot classification is the family of metric learning techniques, where the standard parametric linear classifier head is replaced with a class-agnostic distance function. Membership in each class is determined by distance in latent space from a point or points known to belong to that class. Simple distance functions such as cosine distance [7, 4] and Euclidean distance [20] lead to surprisingly powerful classifiers, though more complex [19], non-Euclidean [10], and even learned parametric options [21] are possible, and yield significant gains.

One overarching problem common to all these techniques is the fact that the convolutional feature extractors used to learn the metric spaces produce feature maps characterizing appearance at a grid of spatial locations, whereas the chosen distance functions require a single vectorial representation for the entire image. The researcher must decide how to convert the feature map into a vector representation. Global average-pooling, the standard solution for parametric softmax classifiers, averages together appearance information from disparate parts of the image, completely discarding spatial details that might be necessary for fine distinctions. Flattening the feature map tensor into a single long vector preserves the individual feature vectors from each location [20, 21]. However, it also preserves the location of each feature vector, which is nuisance information: permuting the locations of the feature map completely alters the flattened embedding, even if the underlying semantic content is unchanged. The only way to remove this nuisance information is to increase both the size and sensitivity of the receptive fields, to the point where the feature vectors at all locations are the same. However, this also destroys granularity, and leads to overfitting to specious cues [5]. Optimally, we would like to preserve spatial detail while disentangling it from location at the same time.

We introduce Feature Map Reconstruction Networks (FRN), which accomplish this by framing class membership as a problem of reconstructing feature maps. Given a set of images all belonging to a single class, we produce the associated feature maps and collect the component feature vectors across locations and images into a single pool of support features. For each query image, we then attempt to reconstruct every location in the feature map as a weighted sum of support features, and the negative average squared reconstruction error is used as the class score. Images from the same class should be easier to reconstruct, since their feature maps contain similar embeddings, while images from different classes will be more difficult and produce larger reconstruction errors. By evaluating the reconstruction of the full feature map, FRN preserves the spatial details of appearance.
But by allowing this reconstruction to use feature vectors from any location in the support images, FRN explicitly discards nuisance location information.

While prior methods based on feature map reconstruction exist, these methods either rely on constrained iterative procedures [31] or large learned attention modules [5, 9]. Instead, we frame feature map reconstruction as a ridge regression problem, allowing us to rapidly calculate a solution in closed form with only a single learned, soft constraint. The resulting reconstructions from FRN are high quality and semantically informative, making FRN both simpler and more powerful than prior reconstruction-based approaches. We validate these claims by demonstrating across-the-board superiority on four fine-grained few-shot classification datasets (CUB [26], meta-iNat and tiered meta-iNat [29], and FGVC Aircraft [14]) and one general few-shot recognition benchmark (mini-ImageNet [25]). These results hold for both shallow and deep network architectures (Conv-4 [20, 11] and ResNet-12 [8, 23]).
2. Background and Related Work
The few-shot learning setup:
Standard few-shot training and evaluation involves sampling task episodes from an overarching task distribution, typically by repeatedly selecting small subsets from a larger set of classes. The number of classes per episode is referred to as the way, while the number of support images per class is the shot, so that episodes with five classes and one labelled image per class form a "5-way, 1-shot" classification problem. Few-shot classifiers are trained on a large, disjoint set of classes with many labelled images, typically using this same episodic scheme for each batched iteration of SGD. Optimizing the few-shot classifier over the task distribution teaches it to generalize to new tasks from a similar distribution. The classifier learns to learn new tasks; thus few-shot training is also referred to as "meta-learning" or "meta-training".
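To make this episodic protocol concrete, the minimal sketch below samples a single n-way, k-shot episode from a dictionary mapping class names to image lists. The function and variable names are our own illustration, not code from the paper.

import random

def sample_episode(images_by_class, way=5, shot=1, queries=15):
    # Draw the episode's classes, then split each class's images into a
    # small labelled support set and a larger query set to be classified.
    classes = random.sample(sorted(images_by_class), way)
    support, query = [], []
    for label, cls in enumerate(classes):
        imgs = random.sample(images_by_class[cls], shot + queries)
        support += [(img, label) for img in imgs[:shot]]
        query += [(img, label) for img in imgs[shot:]]
    return support, query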
Prior work in few-shot learning:
Existing approaches to few-shot learning can be loosely organized into two main families. Optimization-based methods [6, 18, 16] aim to learn a good parameter initialization for the classifier. These learned weights can then be quickly adapted to novel classes using gradient-based optimization on only a few labeled samples. Metric-based methods, on the other hand, aim to learn a completely task-independent embedding that can generalize to novel categories under a chosen distance metric, such as Euclidean distance [20], cosine distance [7], hyperbolic distance [10], or a distance parameterized by a neural network [21].

As an alternative to the standard meta-learning framework, many recent papers [3, 23, 27] study the performance of standard end-to-end pre-trained classifiers on few-shot tasks. Given minimal modification, these classifiers are actually competitive with, or even outperform, episodic meta-training methods. Therefore some recent works [31, 30, 4] take advantage of both, and utilize meta-learning after pre-training, further boosting performance.
Few-shot classification through reconstruction:
We are not the first to use feature map reconstruction as a proxy for few-shot classification. DeepEMD [31] formulates latent reconstruction as an optimal transport problem, solved using external iterative constrained convex optimization tools. This formulation is sophisticated and powerful, but training and inference come with significant computational cost compared to other methods, due to the reliance on iterative solvers and test-time SGD. CrossTransformer [5] and CrossAttention [9] add attention modules that project query features into the space of support features (or vice versa), and compare the class-conditioned projections to the target to predict class membership. These attention-based approaches introduce many additional learned parameters over and above the network backbone, and place largely arbitrary constraints on the projection matrix (weights are non-negative and rows must sum to 1). In contrast, FRN efficiently calculates least-squares-optimal reconstructions in closed form using only a single learnable constraint.
Closed-form solvers in few-shot learning:
Figure 2. Overview of FRN classification for a k-shot problem. Support images are converted into feature maps (left), which are aggregated into class-conditional pools (middle). The best-fit reconstruction of the query feature map is calculated for each category, and the closest candidate yields the predicted class (right). h, w is the feature map resolution and d is the number of channels.

The use of closed-form solvers for few-shot classification is also not entirely new, though to our knowledge they have not been applied in the explicit context of feature reconstruction. [2] uses ridge regression to map features directly to classification labels, while [11] accomplishes the same mapping with differentiable SVMs. Deep Subspace Networks [19] use the closed-form projection distance from query embeddings to subspaces spanned by support points as the similarity measure. In contrast, FRN uses closed-form ridge regression to reconstruct entire feature maps, rather than performing direct comparisons between points in any one particular latent space, or regressing directly to class label targets.
3. Method
Feature Map Reconstruction Networks use the quality of query feature map reconstructions from support features as a proxy for class membership. The pool of features associated with each class in the episode is used to calculate a candidate reconstruction, with a better reconstruction indicating higher confidence for the associated class. In this section we describe the reconstruction mechanism of FRN in detail, and derive the closed-form solution used to calculate the reconstruction error and resulting class score. An overview is provided in Fig. 2. We also discuss memory-efficient implementations and an optional pre-training scheme, and draw comparisons to prior reconstruction-based approaches.
Let X_s denote the set of support images with corresponding class labels in an n-way, k-shot episode. We wish to predict a class label y_q for a single input query image x_q. The output of the convolutional feature extractor for x_q is a feature map Q ∈ R^{r×d}, where r represents the spatial resolution (height times width) of the feature map, and d the number of channels. For each class c ∈ C, we pool all of the features across the k support image feature maps into a single matrix of support features S_c ∈ R^{kr×d}. We then attempt to reconstruct Q as a weighted sum of values in S_c by finding the matrix W ∈ R^{r×kr} such that W S_c ≈ Q. Finding the optimal W̄ amounts to solving the linear least-squares problem:

\bar{W} = \arg\min_W \|Q - W S_c\|^2 + \lambda \|W\|^2    (1)

where ||·|| is the Frobenius norm and λ weights the ridge regression penalty term used to ensure tractability when the linear problem is over- or under-constrained (kr ≠ d).

The foremost benefit of the ridge regression formulation is that it admits a widely-known closed-form solution for W̄ and the optimal reconstruction Q̄_c, as follows:

\bar{W} = Q S_c^T (S_c S_c^T + \lambda I)^{-1}    (2)

\bar{Q}_c = \bar{W} S_c    (3)

A similarity measure for Q and Q̄_c is given by the mean squared Euclidean distance over all feature map locations:

\langle Q, \bar{Q}_c \rangle = \frac{1}{r} \|Q - \bar{Q}_c\|^2    (4)

For a given class c, we use the negative similarity −⟨Q, Q̄_c⟩ as the probability logit. We also incorporate a learnable temperature factor γ, following the findings of [4, 7, 30] that temperature scaling improves few-shot training. The final predicted probability is thus given by:

P(y_q = c \mid x_q) = \frac{e^{-\gamma \langle Q, \bar{Q}_c \rangle}}{\sum_{c' \in C} e^{-\gamma \langle Q, \bar{Q}_{c'} \rangle}}    (5)

We optimize our network by sending the predicted class probabilities for the query images in each episode through a cross-entropy loss, as in standard episodic meta-training. An overview of this process can be found in Fig. 2.

It is not immediately clear how one should set the regularization parameter λ. Instead of choosing heuristically, we allow the network to learn λ through meta-learning. This is significant, as it allows the network to meta-learn the appropriate amount of regularization so that the reconstruction is discriminative, rather than strictly least-squares optimal.

Changing λ can have multiple effects. A large λ discourages sparse weights in W, but also reduces the norm of the reconstruction, increasing reconstruction error and limiting its discriminative power. We therefore disentangle the degree of regularization from the magnitude of Q̄_c by introducing a learned recalibration term ρ:

\bar{Q}_c = \rho \bar{W} S_c    (6)

By increasing ρ alongside λ, the network gains the ability to penalize large, sparse weights without sending all reconstructions to the origin at the same time.
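For completeness, the closed form in Eq. 2 follows from the standard ridge regression derivation, setting the gradient of the objective in Eq. 1 to zero:

\nabla_W \left( \|Q - W S_c\|^2 + \lambda \|W\|^2 \right) = -2 (Q - W S_c) S_c^T + 2 \lambda W = 0

\Rightarrow \; \bar{W} (S_c S_c^T + \lambda I) = Q S_c^T \;\Rightarrow\; \bar{W} = Q S_c^T (S_c S_c^T + \lambda I)^{-1}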
Parametrizing λ and ρ: Note that in Eq. 1, the objective is the sum of the squared Frobenius norms of two different matrices, a residual error matrix and the weight matrix. These two matrices can have very different sizes: the first is br × d while the second is br × kr (for a batch of b query images, as described below). Thus when kr is much greater or less than d, one term in the objective can easily overwhelm the other. To ensure a balanced objective and stable training, we rescale the regularization term λ by a factor of d/kr. λ and ρ are parametrized as e^α and e^β to ensure non-negativity, with α and β initialized to zero. Thus, all together, our final prediction is given by:

\lambda = \frac{d}{kr} e^{\alpha}, \qquad \rho = e^{\beta}    (7)

\bar{Q}_c = \rho \bar{W} S_c = \rho Q S_c^T (S_c S_c^T + \lambda I)^{-1} S_c    (8)

P(y_q = c \mid x_q) = \frac{e^{-\gamma \langle Q, \bar{Q}_c \rangle}}{\sum_{c' \in C} e^{-\gamma \langle Q, \bar{Q}_{c'} \rangle}}    (9)

The model is meta-trained in a similar manner to prior work: sample episodes from a labeled base class dataset and minimize cross entropy on the predicted query labels [20]. Our approach introduces only three learned parameters: α, β and γ. The temperature γ is also used by prior work [7, 4, 30]. Ablations on α and β can be found in Sec. 5.1.

While we have described our approach as finding reconstructions for a single query image, it is relatively straightforward to find the reconstructions for an entire batch of query images. We are already calculating the optimal reconstruction for each of the r feature vectors in Q independently; all we need to do for a batch of b images is pool the features into a larger matrix Q′ ∈ R^{br×d} and run the algorithm as written. Thus for an n-way episode we will only ever need to run the algorithm n times, once for each support matrix S_c, regardless of the quantity or arrangement of queries. These n runs can also be parallelized, given parallel, highly optimized implementations of matrix multiplication and inversion.

The formula for Q̄_c in Eq. 8 is efficient to compute when d ≫ kr, as the most expensive step is inverting a kr × kr matrix that does not grow with d. Additionally, computing the matrix multiplications from left to right ensures that the network need never store a potentially large d × d matrix in memory. However, if the feature maps are large or the shot number is particularly high (kr ≫ d), Eq. 8 may quickly become infeasible to compute. In this case an alternative formulation for Q̄_c exists, which swaps d for kr in terms of computational requirements. This formulation is owed to the Woodbury identity as applied in [2]:

\bar{Q}_c = \rho \bar{W} S_c = \rho Q (S_c^T S_c + \lambda I)^{-1} S_c^T S_c    (10)

In this case, the most expensive step is inverting a d × d matrix, and by computing the matrix multiplications from right to left, we ensure that no large kr × kr or br × kr matrices need ever be stored in memory. Since r and d are determined in advance by the network architecture, the researcher is free to employ either formulation depending on the value of k. The network can also decide on the fly at test time. In terms of classifier performance the two formulations are algebraically equivalent, and the choice is redundant. We make the arbitrary decision to employ Eq. 10 in our experiments rather than Eq. 8. Pseudo-code for this formulation is provided in the supplementary.

In addition to the classification loss, we employ an auxiliary loss that encourages support features from different classes to span the latent space [19]:

L_{aux} = \sum_{i \in C} \sum_{j \in C, j \neq i} \|\hat{S}_i \hat{S}_j^T\|^2    (11)

where Ŝ is row-normalized, with features projected to the unit sphere. This loss encourages orthogonality between features from different classes. Similar to [19], we downscale this loss by a small constant factor. We use L_aux as the auxiliary loss in our subspace network implementation [19], and it replaces the SimCLR episodes in our CrossTransformer implementation [5]. We include it in our own model for consistency, and include an ablation study in Sec. 5.1.
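As an illustration, a minimal sketch of Eq. 11 follows. The function name, loop structure, and the scale argument are our own; the downscaling constant should be treated as a hyperparameter.

import torch
import torch.nn.functional as F

def aux_loss(S, scale):
    # S: (way, k*r, d) support pools. Eq. 11 sums squared Frobenius norms
    # of cross-class Gram matrices after projecting rows to the unit sphere.
    S_hat = F.normalize(S, dim=-1)
    total = S.new_zeros(())
    for i in range(S.shape[0]):
        for j in range(S.shape[0]):
            if i != j:  # cross-class terms only
                total = total + (S_hat[i] @ S_hat[j].t()).pow(2).sum()
    return scale * total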
Prior work [4, 30] has demonstrated that few-shot classifiers can benefit greatly from non-episodic pre-training. For traditional metric learning-based approaches, the feature extractor is initially trained as a linear classifier with global average-pooling on the full set of training classes. The linear layer is subsequently discarded, and the feature extractor is fine-tuned episodically.

This pre-training does not work out-of-the-box for FRN due to the novel way it performs classification. Because the linear classifier uses average-pooling, the feature extractor does not learn spatially distinct feature maps in the way that FRN requires. Episodic training subsequently diverges.

We therefore devise a new pre-training scheme for FRN. To keep the classifier consistent with FRN meta-training, we continue to use a feature reconstruction error as the predicted class logit. Similar to [31], the classification layer is parametrized as a set of class-specific dummy features. Thus in addition to the network backbone, we also have a learnable matrix M_c ∈ R^{r×d} for each category c, which acts as a proxy for S_c. Following Eq. 10, for a sample x_q with feature map Q ∈ R^{r×d}, the category prediction is then:

\bar{Q}_c = \rho Q (M_c^T M_c + \lambda I)^{-1} M_c^T M_c    (12)

P(y_q = c \mid x_q) = \frac{e^{-\gamma \langle Q, \bar{Q}_c \rangle}}{\sum_{c' \in C} e^{-\gamma \langle Q, \bar{Q}_{c'} \rangle}}    (13)

It should be noted that C in this setting is no longer the sampled subset of episode categories, but rather the entire set of training classes (for mini-ImageNet, |C| = 64). We then use this output probability distribution to calculate the standard cross-entropy classification loss. During the pre-training stage, we fix ρ = λ = 1 but keep γ a learnable parameter. After pre-training is finished, all learned matrices {M_c | c ∈ C} are discarded (similar to the pre-trained MLP classifier in [23, 30, 27, 3, 4]). The pre-trained model size is thus the same as when trained from scratch. We then load the pre-trained backbone parameters and γ into the meta-training pipeline, and train episodically as before.

While pre-training is broadly applicable and generally boosts performance, for the sake of fairness we do not pre-train any of our fine-grained experiments, as baseline methods do not consistently pre-train in these settings.
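A minimal sketch of this pre-training head (Eq. 12-13), with class-wise dummy feature banks M_c as learnable parameters, is given below; module and argument names are our own assumptions, not the paper's released code.

import torch
import torch.nn as nn

class FRNPretrainHead(nn.Module):
    """Logits over all base classes via reconstruction from dummy features."""
    def __init__(self, num_classes, r, d):
        super().__init__()
        # one learnable r x d feature bank M_c per class, a proxy for S_c
        self.M = nn.Parameter(torch.randn(num_classes, r, d) * d ** -0.5)
        self.gamma = nn.Parameter(torch.ones(()))  # learnable temperature

    def forward(self, Q, lam=1.0, rho=1.0):  # rho = lam = 1 during pre-training
        # Q: (b, r, d) query feature maps; returns (b, num_classes) logits
        b, r, d = Q.shape
        mtm = self.M.transpose(-2, -1) @ self.M              # (C, d, d)
        hat = torch.linalg.solve(mtm + lam * torch.eye(d, device=Q.device), mtm)
        Qp = Q.reshape(1, b * r, d)                          # pool query features
        recon = rho * (Qp @ hat)                             # Eq. 12, (C, b*r, d)
        dist = (Qp - recon).pow(2).reshape(-1, b, r * d).sum(-1) / r
        return -self.gamma * dist.t()                        # (b, C) logits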
CrossTransformer [5]: While FRN is not the first attempt at building a few-shot classifier based on feature map reconstruction, it is the first to do so explicitly in closed form. Some prior approaches instead approximate W̄ using attention and extra learned projection layers. CrossTransformer is one such approach, which we re-implement in our experiments as a baseline (CTX). Using learned linear layers, CrossTransformer reprojects the feature pools S_c and Q into two different "key" and "value" subspaces, yielding S_k, Q_k and S_v, Q_v. The reconstruction of Q_v is given by:

\bar{Q}_c = \sigma\!\left(\frac{\gamma}{\sqrt{d}} Q_k S_k^T\right) S_v    (14)

where σ(·) denotes a row-wise softmax and γ is the same temperature scaling parameter. While Eq. 14 is loosely analogous to Eq. 8, with the √d-scaled softmax replacing the inverted matrix term, we find that performance differs in practice. The CrossTransformer layer is also somewhat unwieldy: the two reprojection layers introduce extra parameters into the network, and during training it is necessary to store the br × kr matrix of attention weights for back-propagation. This led to a noticeable memory footprint in our experiments.
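For concreteness, a minimal sketch of the attention-based reconstruction in Eq. 14 follows; the projection dimensions and module structure are our assumptions, and [5] should be consulted for the full CrossTransformer.

import torch
import torch.nn as nn

class AttentionReconstruction(nn.Module):
    def __init__(self, d, d_k, d_v):
        super().__init__()
        self.key = nn.Linear(d, d_k, bias=False)    # shared key projection
        self.value = nn.Linear(d, d_v, bias=False)  # shared value projection

    def forward(self, Q, S_c, gamma=1.0):
        # Q: (r, d) query feature map; S_c: (k*r, d) class support pool
        scale = gamma / self.key.out_features ** 0.5
        attn = (scale * self.key(Q) @ self.key(S_c).t()).softmax(dim=-1)
        # attn is the r x (k*r) analogue of the reconstruction weights:
        # entries are non-negative and each row sums to 1
        return attn @ self.value(S_c)  # reconstruction of value-projected Q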
DeepEMD [31]: Similar to the above approaches, DeepEMD solves for a br × kr reconstruction matrix W̄ and uses reconstruction quality (measured as transport cost) as a proxy for class membership. This technique is more sophisticated and powerful than ridge regression, but also highly constrained. As a transport matrix, W̄ must contain strictly nonnegative values, with rows and columns that sum to 1. More importantly, W̄ cannot be calculated in closed form, and instead requires an iterative procedure which can be slow in practice and does not scale well beyond pairs of individual images. DeepEMD also requires finetuning via back-propagation at test time, whereas our approach scales out of the box to a range of values for k, r, d.
Deep Subspace Networks [19]: Subspace networks predict class membership by calculating the distance between the query point and its projections onto the latent subspaces formed by the support images for each class. This is almost exactly analogous to our approach with r = 1, with average-pooling performing the necessary spatial reduction. The crucial difference is that subspace networks assume (accurately) that d ≫ k, whereas in our setting it is not always the case that d ≫ kr. In fact, for many of our models S_c spans the latent space, so the projection interpretation falls apart and we instead rely on the ridge regression regularizer to keep the problem well-posed. We re-implement this approach as a baseline in our experiments, and include the original published numbers where available.
4. Experiments
Because Feature Map Reconstruction Networks focus on spatial details without overfitting to pose, we find that they are particularly powerful in the fine-grained few-shot recognition setting, where details are important and pose is not discriminative. We demonstrate clear superiority on four such fine-grained benchmarks. For general few-shot learning, pre-trained FRN achieves highly competitive results without additional bells and whistles.
Implementation details: We conduct experiments on two widely used backbones: a 4-layer ConvNet (Conv-4) and ResNet-12. Same as [30, 11], Conv-4 consists of 4 consecutive 64-channel convolution blocks that each downsample by a factor of 2. The shape of the output feature maps for input images of size 84 × 84 is thus 64 × 5 × 5. For ResNet-12, we use the same implementation as [30, 23, 31, 11]. The input image size is the same as Conv-4 and the output feature map shape is 640 × 5 × 5. During training, we use the standard data augmentation as in [30, 31, 27, 3], which includes random crop, right-left flip and color jitter.
Model | Conv-4 1-shot | Conv-4 5-shot | ResNet-12 1-shot | ResNet-12 5-shot
MatchNet ♭ [25, 30, 31] | 67.73 | 79.00 | 71.87 | 85.08
ProtoNet ♭ [20, 30, 31] | 63.73 | 81.50 | 66.09 | 82.50
Hyperbolic [10] | 64.02 | 82.53 | - | -
FEAT ♭ [30] | 68.87 | 82.90 | - | -
DeepEMD ♭ [31] | - | - | 75.65 | 88.69
ICI ♭ [28] | - | - | 76.16 | 90.32
ProtoNet † [20] | 63.42 | 83.01 | 79.09 | 90.59
DSN † [19] | 65.66 | 84.54 | 79.42 | 90.34
CTX † [5] | 69.91 | 86.83 | 77.38 | 89.95
FRN (ours) 1-shot | 68.76 | - | 82.62 | -
FRN (ours) 5-shot | | | |

Table 1. Performance on CUB using bounding-box cropped images as input, 5-way 1/5-shot. † denotes our own implementations. ♭ denotes the use of non-episodic pre-training.

For all experiments, we include results for 1-shot and 5-shot settings. Surprisingly, we found that FRN trained with 5-shot episodes consistently outperformed FRN trained with 1-shot episodes, even on 1-shot evaluation. For consistency, we still include the 1-shot model results, which are competitive in their own right.

For our fine-grained experiments, we re-implement three baselines: Prototypical Networks (ProtoNet) [20], CrossTransformer (CTX) [5], and Deep Subspace Networks (DSN) [19]. Evaluation is performed on the standard 5-way, 5-shot and 1-shot settings. We average over 6,000 episodes to obtain our accuracy scores and confidence intervals where appropriate. For fair comparison, we do not use pre-training on any of our fine-grained benchmarks, and attempt to keep hyperparameters as close as possible to the standard values for prototypical networks on each dataset. Further details can be found in the supplementary.
Caltech-UCSD Birds-200-2011 [26] (CUB) consists of 11,788 images from 200 classes. Following [3], we randomly split categories into 100 classes for training, 50 for validation and 50 for evaluation. Our split is identical to [22]. Prior work on this benchmark pre-processes the data in different ways: [3] uses raw images as input, while [30, 31] crop each image to a human-annotated bounding box. We conduct experiments on both settings for a fair comparison. Results for the cropped setting can be found in Table 1; results for uncropped are in Table 2. FRN is superior across the board, with a notable 3-point jump in accuracy over the nearest baseline in every single 1-shot setting. These results are achieved without any pre-training.

Note that our re-implemented baselines in Table 1 are competitive with (and in some cases beat outright) prior published numbers. This shows that in subsequent experiments without prior published numbers, our implemented baselines still provide fair competition. We do not give FRN an unfair edge; if anything, our baselines are more competitive, not less.
Model | Backbone | 1-shot | 5-shot
Baseline++ ♭ [3] | ResNet-34 | 68.00 | 
LaplacianShot ♭ [32] | ResNet-18 | 80.96 | 88.68
S2M2 ♭ [15] | WRN-28-10 | 80.68 | 
Neg-Margin ♭ [12] | ResNet-18 | 72.66 | 
Afrasiyabi et al. ♭ [1] | ResNet-18 | 74.22 | 
FRN (ours) | ResNet-12 | | 

Table 2. Performance on CUB using raw images as input, 5-way 1/5-shot. ♭ denotes the use of non-episodic pre-training.
Model | Conv-4 1-shot | Conv-4 5-shot | ResNet-12 1-shot | ResNet-12 5-shot
ProtoNet † [20] | 47.72 | 69.42 | 66.57 | 82.37
DSN † [19] | 47.12 | 66.36 | 68.16 | 81.85
CTX † [5] | 50.27 | 67.30 | 60.77 | 76.36
FRN (ours) 1-shot | 50.25 | - | 69.40 | -
FRN (ours) 5-shot | | | |

Table 3. Performance on Aircraft, 5-way 1/5-shot. † denotes our own implementations.

FGVC-Aircraft [14] contains 10,000 images spanning 100 airplane models. Following the same ratio as CUB, we split the classes into 50 train, 25 validation and 25 test. The random split is identical to [22]. The images are pre-cropped to the provided bounding box. Results for this benchmark are provided in Table 3, where FRN once again outperforms baseline methods in all settings.
Meta-iNat and Tiered Meta-iNat [29, 24] are benchmarks of animal species in the wild. These benchmarks are particularly difficult, as class distinctions are fine-grained, and images are not cropped or centered, and may contain multiple instances of the animal in question. We follow the class splits proposed by [29]: of 1135 classes with between 50 and 1000 images, one fifth (227) are randomly assigned to evaluation and the rest are used for training. While [29] originally propose a full 227-way, k-shot evaluation scheme, we instead perform standard 5-way, 1-shot and 5-shot evaluation, and leave extension to higher shot for future work. We report mean accuracy only, as per-class accuracy was not meaningfully different.

Tiered meta-iNat represents a more difficult version of meta-iNat where a large domain gap is introduced between the train and test categories. The 354 test classes are populated by insects and arachnids, while the remaining 781 classes (mammals, birds, reptiles, etc.) form the training set. Training and evaluation is otherwise the same as for standard meta-iNat. Results for both benchmarks are provided in Table 4, with FRN again providing the best performance. We conclude that FRN is broadly effective at fine-grained few-shot classification.

Model | meta-iNat 1-shot | meta-iNat 5-shot | tiered 1-shot | tiered 5-shot
ProtoNet † [20] | 55.06 | 76.31 | 34.20 | 57.16
Covar. pool † [29] | 56.93 | 77.08 | 36.03 | 57.63
DSN † [19] | 58.02 | 77.27 | 36.81 | 60.01
CTX † [5] | 59.69 | 78.60 | 36.80 | 61.01
FRN (ours) 1-shot | 62.36 | - | 41.60 | -
FRN (ours) 5-shot | | | |

Table 4. Performance on meta-iNat and tiered meta-iNat, 5-way 1/5-shot. All methods use Conv-4 as backbone network. † denotes our own implementations.

Mini-ImageNet [25] is a subset of ImageNet containing 100 classes in total, with 600 examples per class. Following [17], we split categories into 64 classes for training, 16 for validation and 20 for test. Compared to direct episodic meta-training from scratch, recent works [4, 23] gain a large advantage from pre-training on all the training data and labels, followed by episodic fine-tuning. We follow the framework of [30, 31] and pre-train our model on the entire training set as described in Sec. 3.6. Details are provided in the supplementary.

Compared with recent state-of-the-art results in Table 5, FRN achieves highly competitive performance. FRN leverages pre-training, but no other extra techniques or tricks. FRN also requires no gradient-based finetuning at inference time, which makes it more efficient than many existing baselines in practice. We analyze the impact of pre-training on few-shot performance in Sec. 5.2.
5. Analysis
We perform an ablation study on our added regularization parameters and auxiliary loss, and analyze the pre-training scheme on mini-ImageNet. Finally, we verify qualitatively that the latent space reconstructions learned by our classifier are superior for images of the same class, and inferior for images of a different class.
We perform our ablation study on CUB, using both Conv-4 and ResNet-12. Results are given in Table 6 for the cropped data setting. In this case, 1-shot results come from models trained in a 1-shot manner. We find that the impact of the auxiliary loss is mixed: it helps in the 1-shot setting, but hurts 5-shot performance slightly. We suspect that this is because the pools of support features in the 1-shot setting are information-deficient, and so explicitly encouraging full utilization of the feature space is helpful. The support feature pools in the 5-shot setting are not so deficient, and so do not benefit from the auxiliary loss.
Model | Backbone | 1-shot | 5-shot
Meta-Baseline ♭ [4] | ResNet-12 | 63.17 | 
MetaOptNet ♯ [11] | ResNet-12 | 62.64 | 
DSN ‡ [19] | ResNet-12 | 62.64 | 
CAN ♥♭ [9] | ResNet-12 | 63.85 | 
MatchNet ♥♭ [25, 30] | ResNet-12 | | 
ProtoNet ♭ [20, 30] | ResNet-12 | 62.39 | 
SimpleShot ♭ [27] | ResNet-18 | 62.85 | 
Afrasiyabi et al. ♥♦♭ [1] | ResNet-18 | 59.88 | 
Neg-Margin ♦♭ [12] | WRN-28-10 | 61.72 | 
E3BM ♥♦♭ [13] | ResNet-25 | 64.3 | 81.0
FEAT ♥♭ [30] | ResNet-12 | | 
DeepEMD ♦♭ [31] | ResNet-12 | | 
RFS-Distill ‡♯♭ [23] | ResNet-12 | 64.82 | 
FRN (ours) 1-shot ♭ | ResNet-12 | | -
FRN (ours) 5-shot ♭ | ResNet-12 | | 

Table 5. Performance of selected competitive few-shot models on mini-ImageNet. ‡ denotes use of data augmentation during evaluation. ♯ denotes use of label smoothing or knowledge distillation. ♥ denotes modules with many additional learnable parameters. ♦ denotes use of SGD during evaluation. ♭ denotes non-episodic pre-training or classifier losses. Bold numbers denote 1-shot accuracy over 65 or 5-shot over 82. FRN numbers are averaged over 10,000 episodes with 95% confidence intervals.

Model | Conv-4 1-shot | Conv-4 5-shot | ResNet-12 1-shot | ResNet-12 5-shot
no Aux | 67.33 | | | 
fixed λ | | | | 
fixed ρ | | | | 
fixed λ, ρ | | | | 
whole model | | | | 

Table 6. Ablation study on regularization parameters and auxiliary loss. FRN models are trained under different ablation settings on cropped CUB.

We analyze the contribution of the learned regularization terms λ and ρ by disabling their learnability, setting one or both equal to one over the course of training. The impact of these terms is also mixed. The 4-layer network clearly benefits from learning these terms, but the ResNet-12 architecture benefits from removing them. It seems that a more powerful network is able to overcome regularization problems by massaging the feature space in a more elegant way than the individual λ, ρ terms can provide. Overall, FRN is not particularly sensitive to these components of the training scheme.

We find that pre-training is crucial for competitive mini-ImageNet performance, especially when compared to baselines also utilizing pre-training. However, pre-training alone is not sufficient to produce a competitive classifier. An FRN trained from scratch outperforms a pre-trained FRN evaluated naively (Table 7). The two-round process of pre-training followed by episodic fine-tuning appears to be crucial. This finding is in line with prior work [30, 4].

Figure 3. Decoder outputs on CUB (left) and mini-ImageNet (right). Images are regenerated from ground-truth feature maps (row 2), and reconstructions from same-class (row 3) and different-class support images (rows 4, 5). The same-class reconstructions are clearly more faithful to the original. Best viewed digitally.

training setting | 1-shot | 5-shot
from scratch 1-shot | 61.75 | 
from scratch 5-shot | | 
pre-train only | | 
pre-train + finetune 1-shot | | 
pre-train + finetune 5-shot | | 

Table 7. Impact of pre-training on mini-ImageNet. Both pre-training and episodic finetuning are important.
While our accuracy scores suggest that FRN learns to produce better reconstructions from same-class support images than from those of a different class, we would still like to confirm this intuition visually. In particular, it is not obvious that a better reconstruction in FRN latent space should also be a better reconstruction semantically. To verify this we train an image re-generator for the 5-shot ResNet-12 FRN on CUB and mini-ImageNet. Using an inverted ResNet-12 architecture, this decoder network is trained to take the feature maps of the FRN and map them back to the original corresponding image. Training details for the decoder can be found in the supplementary. Results are reported on validation images from each dataset.

If it is the case that same-class feature map reconstructions are more semantically faithful than different-class ones, we should be able to observe a corresponding difference in image quality when we pass each feature map reconstruction through the decoder. Table 8 shows the pixel error of regenerated images from the 5-shot reconstructed feature maps relative to the ground-truth feature map. Our intuition holds: while both reconstructed feature maps produce jumps in pixel error relative to the ground truth, the increase is smaller when the feature map is reconstructed from images of the same class.
Input | CUB | mini-IN
ground-truth feature map | .208 | .177
same-class reconstruction | .343 | .307
diff-class reconstruction | .385 | .337

Table 8. L2 pixel error between original images and regenerated ones from different latent inputs on CUB and mini-ImageNet validation sets. Results are averaged over 1,000 trials and 95% confidence intervals are below 1e-3.

Sample outputs from the decoder for CUB and mini-ImageNet can be found in Fig. 3. The reconstructions from ground-truth feature maps are not particularly good, as classifier embeddings are designed to cluster same-class images tightly and discard all other details. Nevertheless, visual quality is high enough that the difference between regenerated same-class reconstructions (row 3) and different-class reconstructions (rows 4, 5) is readily apparent. Additional visualizations are provided in the supplementary. We conclude that FRN is doing what we intend: learning reconstructions that are semantically faithful for same-class support images and less faithful otherwise.
6. Conclusion
We introduce Feature Map Reconstruction Networks, a novel approach to few-shot classification based on reconstructing query features in latent space. Solving the reconstruction problem in closed form produces a classifier that is both straightforward and powerful, incorporating fine spatial details without overfitting to position or pose. We demonstrate state-of-the-art performance on four fine-grained few-shot classification benchmarks, and competitive performance in the general setting.

Acknowledgements
This work was funded by the DARPA Learning with Less Labels program (HR001118S0044). The authors would like to thank Cheng Perng Phoo for valuable suggestions and computing cluster assistance.
References

[1] Arman Afrasiyabi, Jean-François Lalonde, and Christian Gagné. Associative alignment for few-shot image classification. In Proceedings of the European Conference on Computer Vision (ECCV), August 2020.
[2] Luca Bertinetto, Joao F. Henriques, Philip Torr, and Andrea Vedaldi. Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations, 2019.
[3] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Wang, and Jia-Bin Huang. A closer look at few-shot classification. In International Conference on Learning Representations, 2019.
[4] Yinbo Chen, Xiaolong Wang, Zhuang Liu, Huijuan Xu, and Trevor Darrell. A new meta-baseline for few-shot learning, 2020.
[5] Carl Doersch, Ankush Gupta, and Andrew Zisserman. CrossTransformers: spatially-aware few-shot transfer. Advances in Neural Information Processing Systems, 33, 2020.
[6] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
[7] Spyros Gidaris and Nikos Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367-4375, 2018.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[9] Ruibing Hou, Hong Chang, MA Bingpeng, Shiguang Shan, and Xilin Chen. Cross attention network for few-shot classification. In Advances in Neural Information Processing Systems, pages 4003-4014, 2019.
[10] Valentin Khrulkov, Leyla Mirvakhabova, Evgeniya Ustinova, Ivan Oseledets, and Victor Lempitsky. Hyperbolic image embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6418-6428, 2020.
[11] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-learning with differentiable convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10657-10665, 2019.
[12] Bin Liu, Yue Cao, Yutong Lin, Qi Li, Zheng Zhang, Mingsheng Long, and Han Hu. Negative margin matters: Understanding margin in few-shot classification. arXiv preprint arXiv:2003.12060, 2020.
[13] Yaoyao Liu, Bernt Schiele, and Qianru Sun. An ensemble of epoch-wise empirical bayes for few-shot learning. In European Conference on Computer Vision (ECCV), 2020.
[14] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013.
[15] Puneet Mangla, Nupur Kumari, Abhishek Sinha, Mayank Singh, Balaji Krishnamurthy, and Vineeth N Balasubramanian. Charting the right manifold: Manifold mixup for few-shot learning. In The IEEE Winter Conference on Applications of Computer Vision, pages 2218-2227, 2020.
[16] Alex Nichol, Joshua Achiam, and John Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
[17] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.
[18] Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.
[19] Christian Simon, Piotr Koniusz, Richard Nock, and Mehrtash Harandi. Adaptive subspaces for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4136-4145, 2020.
[20] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077-4087, 2017.
[21] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199-1208, 2018.
[22] Luming Tang, Davis Wertheimer, and Bharath Hariharan. Revisiting pose-normalization for fine-grained few-shot recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14352-14361, 2020.
[23] Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: a good embedding is all you need? arXiv preprint arXiv:2003.11539, 2020.
[24] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8769-8778, 2018.
[25] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630-3638, 2016.
[26] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[27] Yan Wang, Wei-Lun Chao, Kilian Q. Weinberger, and Laurens van der Maaten. SimpleShot: Revisiting nearest-neighbor classification for few-shot learning. arXiv preprint arXiv:1911.04623, 2019.
[28] Yikai Wang, Chengming Xu, Chen Liu, Li Zhang, and Yanwei Fu. Instance credibility inference for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12836-12845, 2020.
[29] Davis Wertheimer and Bharath Hariharan. Few-shot learning with localization in realistic settings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6558-6567, 2019.
[30] Han-Jia Ye, Hexiang Hu, De-Chuan Zhan, and Fei Sha. Few-shot learning via embedding adaptation with set-to-set functions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8808-8817, 2020.
[31] Chi Zhang, Yujun Cai, Guosheng Lin, and Chunhua Shen. DeepEMD: Few-shot image classification with differentiable earth mover's distance and structured classifiers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[32] Imtiaz Masud Ziko, Jose Dolz, Eric Granger, and Ismail Ben Ayed. Laplacian regularized few-shot learning. arXiv preprint arXiv:2006.15486, 2020.

Supplementary Materials
7. Pseudo-Code for Eq. 10
Equation 10:

\bar{Q}_c = \rho \bar{W} S_c = \rho Q (S_c^T S_c + \lambda I)^{-1} S_c^T S_c

Listing 1. PyTorch pseudo-code for feature map reconstruction in Eq. 10 for a single meta-training episode. The whole calculation can be performed in parallel via batched matrix multiplication and inversion.
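The original listing did not survive extraction; the following minimal sketch re-implements the same computation under our own naming conventions (it is not the authors' released code):

import torch

def frn_logits(Q, S, alpha, beta, gamma):
    # Q: (b, r, d) query feature maps for b query images
    # S: (way, k*r, d) per-class support feature pools
    # alpha, beta, gamma: learnable scalar tensors (Eq. 7: lam = (d/kr)e^a)
    way, kr, d = S.shape
    b, r, _ = Q.shape
    lam = d / kr * torch.exp(alpha)
    rho = torch.exp(beta)
    sts = S.transpose(-2, -1) @ S                      # S_c^T S_c, (way, d, d)
    hat = torch.linalg.solve(sts + lam * torch.eye(d, device=S.device), sts)
    Qp = Q.reshape(1, b * r, d)                        # pool all query features
    recon = rho * (Qp @ hat)                           # Eq. 10 for every class
    dist = (Qp - recon).pow(2).reshape(way, b, r * d).sum(-1) / r  # Eq. 4
    return -gamma * dist.t()                           # (b, way) logits, Eq. 9

Cross entropy between these logits and the query labels then gives the meta-training loss for the episode.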
8. Training Details
We use two neural architectures in our experiments: Conv-4 and ResNet-12. Conv-4 consists of four convolutional layers with kernel size 3 × 3 and output size 64, followed by BatchNorm, ReLU, and 2 × 2 max-pooling. ResNet-12 consists of four residual blocks, each with three convolutional layers, with leaky ReLU (0.1) and 2 × 2 max-pooling on the main stem. We use drop-block as in the original implementation [23, 30, 31, 11]. Output channel sizes for the four residual blocks are 64, 160, 320, and 640. Due to the large output dimensionality, we found it necessary for stable training to normalize the ResNet-12 outputs by downscaling by a constant square-root factor.

Unless otherwise stated, all models are trained with a weight decay of 5e-4. Training details for each individual benchmark follow.
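For reference, a minimal sketch of the Conv-4 backbone described above (padding and initialization choices are our assumptions):

import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),  # each block downsamples by a factor of 2
    )

# four consecutive 64-channel blocks: an 84 x 84 input yields a 64 x 5 x 5 map
conv4 = nn.Sequential(*[conv_block(3 if i == 0 else 64, 64) for i in range(4)])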
CUB: Our implementation follows [7, 11]. We train all Conv-4 models for 800 epochs using SGD with Nesterov momentum of 0.9 and an initial step size of 0.1. The step size decreases by a factor of 10 at epoch 400. ResNet-12 models train for 1200 epochs and scale down the step size at epochs 400 and 800. We use the validation set to select the best-performing model over the course of training, and validate every 20 epochs.

Conv-4 models are trained with the standard episodic setup: 20-way 5-shot for 5-shot models, and 30-way 1-shot for 1-shot models. We use 15 query images per class in both settings. In order to save memory we cut these values in half for ResNet-12 models: 5-shot models train on 10-way episodes, while 1-shot models train on 15-way episodes.

We temperature scale the output probability logits of all our models. We also normalize the logits for our baseline models by a constant divisor, with one value for Conv-4 and another for ResNet-12. FRN did not benefit from this normalization.
Aircraft: Our implementation for the Aircraft dataset roughly follows the CUB setup. We found, however, that the 1-shot ResNet-12 FRN is highly unstable at the very beginning of training and frequently collapses to the suboptimal uniform solution. We found it necessary to train this model using Adam with an initial step size of 1e-3, and no weight decay. All other hyperparameters remain the same as for CUB.

(Reference implementations: https://github.com/WangYueFt/rfs, https://github.com/Sha-Lab/FEAT, https://github.com/icoz69/DeepEMD, https://github.com/kjunelee/MetaOptNet)

Decoder network:
Layer | Output size
Input | 640 × 5 × 5
ResBlock | 320 × 10 × 10
ResBlock | 160 × 20 × 20
ResBlock | 80 × 40 × 40
ResBlock | 40 × 80 × 80
Upsample | 40 × 84 × 84
Conv3x3 | 3 × 84 × 84
Tanh | 3 × 84 × 84

Residual block:
Shortcut branch | Main branch
TranspConv4x4, stride 2 | Conv3x3
BatchNorm | BatchNorm
 | ReLU
 | TranspConv4x4, stride 2
 | BatchNorm
 | ReLU
 | Conv3x3
 | BatchNorm
Addition, then ELU

Table 9. Neural architecture for the image decoder network (top) and residual block (bottom). Transposed convolution layers have half as many output channels as input channels.
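A sketch of the decoder residual block as we read Table 9, with a strided transposed-convolution shortcut and a Conv-BN-ReLU-TranspConv-BN-ReLU-Conv-BN main branch; exact layer ordering and padding are our assumptions.

import torch.nn as nn

class DecoderResBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        out_ch = in_ch // 2  # transposed convs halve the channel count
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        self.act = nn.ELU()

    def forward(self, x):
        # addition of main and shortcut branches, followed by ELU (Table 9)
        return self.act(self.main(x) + self.shortcut(x))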
meta-iNat and Tiered meta-iNat: Our implementation follows [29], except that we train for twice as long, as we found that the loss curves frequently failed to stabilize before each decrease in step size. We therefore train our models for 100 epochs using Adam with initial step size 1e-3. We cut the step size by a factor of two every twenty epochs. Because there is no validation set available, we simply use the final model at the end of training. Episode setup is as in CUB and Aircraft.
mini-ImageNet: For non-episodic FRN pre-training, we run 350 epochs using SGD with initial step size 0.1 and Nesterov momentum 0.9. We cut the step size by a factor of 10 at epochs 200 and 300. Batch size is 128. Subsequent episodic fine-tuning uses the same optimizer for 150 epochs, but with initial step size 1e-3, decreased by a factor of 10 at epochs 70 and 120. For the 5-shot FRN model, we use the standard 20-way episodes. The 1-shot model uses 25-way episodes.

FRN models trained from scratch (for the ablation study) use the same optimizer for 300 epochs, with initial step size 0.1, decreased by a factor of ten at epochs 160 and 250. Similar to CUB, we cut the way of the training episodes in half in order to reduce memory footprint: 10-way for 5-shot models, and 15-way for 1-shot models.

We continue to use the validation set to select the best-performing model during episodic training, validating every 10 epochs.
Decoder: The decoder network for the visualizations in Section 5.3 takes the form of an inverted ResNet-12, with four residual blocks and a final projection layer to three channels. Upsampling is performed using strided transposed convolution in both the residual stem and the main stem of each residual block. These transposed convolution layers also halve the number of channels. The final residual block outputs of size 80 × 80 are rescaled to the original 84 × 84 resolution using bilinear sampling before the final projection layer. The full architecture is described in Table 9.

The decoder network is trained with an L1 reconstruction loss using Adam with an initial step size of 0.01 and batch size of 200. The CUB decoder is trained for 500 epochs, with step size decreasing by a factor of 4 every 100 epochs. Mini-ImageNet is a larger dataset than CUB, so the mini-ImageNet decoder is trained for 200 epochs, with step size decreasing every 40 epochs.

Over the course of training, we found that the FRN classifier learns to regularize the reconstruction problem heavily. This is problematic in that the regularized reconstructions all fall off the input manifold for the decoder network, producing uniformly flat grey-brown images. To prevent this, we removed the square-root normalization on the classifier latent embeddings, multiplying the features back up by the same factor. This was sufficient to eliminate the negative impact of regularization and produce meaningful image reconstructions.
9. Additional Results on CUB
Unlike mini-ImageNet, for which researchers tend to use the same class split as [17], CUB doesn't have an official split. For our CUB experiments, we use the same random train/validation/test split as Tang et al. [22]. However, many open-sourced baselines [32, 12, 15] use the dataset-generating code from Chen et al. [3] (https://github.com/wyharveychen/CloserLookFewShot/blob/master/filelists/CUB/write_CUB_filelist.py), using raw, non-cropped images as input.

Model | class split | Backbone | 1-shot | 5-shot
FRN (ours) 1-shot | Tang et al. [22] | ResNet-12 | 82.02 | -
FRN (ours) 5-shot | Tang et al. [22] | ResNet-12 | | 
FRN (ours) 1-shot | Chen et al. [3] | ResNet-12 | 82.23 | -
FRN (ours) 5-shot | Chen et al. [3] | ResNet-12 | | 
Baseline++ ♭ [3] | Chen et al. [3] | ResNet-34 | 68.00 | 
– | Chen et al. [3] | ResNet-34 | 72.94 | 
LaplacianShot ♭ [32] | Chen et al. [3] | ResNet-18 | 80.96 | 88.68
S2M2 ♭ [15] | Chen et al. [3] | WRN-28-10 | 80.68 | 
Neg-Margin ♭ [12] | Chen et al. [3] | ResNet-18 | 72.66 | 

Table 10. Performance comparison for FRN under two different class split settings. We include baselines using the same split as Chen et al. [3] for reference. Results of FRN are averaged over 10,000 trials with 95% confidence intervals. ♭ denotes the use of non-episodic pre-training.

training setting | 1-shot | 5-shot
from scratch 1-shot | 82.02 | 
from scratch 5-shot | | 
pre-train + finetune 1-shot | | 
pre-train + finetune 5-shot | 83.23 | 

Table 11. Impact of pre-training for FRN on CUB using raw images as input. Pre-training can slightly boost the performance compared to training from scratch.
Therefore, we re-run our method on CUB under the uncropped setting using the same class split as [3]. As shown in Table 10, FRN performance remains basically the same.
We introduce the pre-training technique for FRN in Sec. 3.6 and apply it to mini-ImageNet in Table 5. Here, we apply it to CUB under the un-cropped setting with class splits as in [22]. For non-episodic FRN pre-training, we run 1200 epochs using SGD with initial step size 0.1 and Nesterov momentum 0.9. We cut the step size by a factor of 10 at epochs 600 and 900. Batch size is 128. Subsequent episodic fine-tuning uses the same optimizer for 600 epochs, but with initial step size 1e-3, decreased by a factor of 10 at epochs 300 and 500. For the 5-shot FRN model, we use the standard 20-way episodes. The 1-shot model uses 25-way episodes.

As shown in Table 11, pre-training still improves the final accuracy, but the gain compared to training from scratch is much smaller than for mini-ImageNet (see Table 7). This is reasonable, as CUB is only a fraction of the size of mini-ImageNet. It is thus comparatively easier in this setting for the FRN trained episodically from scratch to find a good optimum, and more difficult for pre-training to improve.
10. Additional Visualizations
Additional image reconstruction trials as in Fig. 3 of the main paper begin on the following page.

Figure 4. Additional image reconstruction visualizations for CUB. Formatting follows Fig. 3: support images are given on the left while target images and reconstructions are on the right. First row images are targets, second row images are autoencoded, third row images are reconstructed from the same class, and fourth and fifth row images are reconstructed from different classes. Same-class reconstructions are clearly superior to those from different classes.

Figure 5. Additional image reconstruction visualizations for mini-ImageNet. Formatting follows Fig. 3: support images are given on the left while target images and reconstructions are on the right. First row images are targets, second row images are autoencoded, third row images are reconstructed from the same class, and fourth and fifth row images are reconstructed from different classes. Same-class reconstructions tend to gray out or darken the colors, but are much more faithful shape-wise than those from different classes.