Flexible Dataset Distillation: Learn Labels Instead of Images
Ondrej Bohdal, Yongxin Yang, Timothy Hospedales
School of Informatics, The University of Edinburgh
{ondrej.bohdal, yongxin.yang, t.hospedales}@ed.ac.uk
Abstract
We study the problem of dataset distillation – creating a small set of synthetic examples capable of training a good model. In particular, we study the problem of label distillation – creating synthetic labels for a small set of real images, and show it to be more effective than the prior image-based approach to dataset distillation. Interestingly, label distillation can be applied across datasets, for example enabling learning Japanese character recognition by training only on synthetically labeled English letters. Methodologically, we introduce a more robust and flexible meta-learning algorithm for distillation, as well as an effective first-order strategy based on convex optimization layers. Distilling labels with our new algorithm leads to improved results over prior image-based distillation. More importantly, it leads to clear improvements in flexibility of the distilled dataset in terms of compatibility with off-the-shelf optimizers and diverse neural architectures.
Distillation is a topical area of neural network research that initially began with the goal of extracting the knowledge of a large pre-trained model and compiling it into a smaller model, while retaining similar performance [12]. The notion of distillation has since found numerous applications and uses, including the possibility of dataset distillation [31]: extracting the knowledge of a large dataset and compiling it into a small set of carefully crafted examples, such that a model trained on the small dataset alone achieves good performance. This is of scientific interest as a tool to study neural network generalisation under small sample conditions. More practically, it has the potential to address the large and growing logistical and energy hurdle of neural network training, if adequate neural networks can be quickly trained on small distilled datasets rather than massive raw datasets.
Nevertheless, progress towards the vision of dataset distillation has been limited, as the performance of existing methods [31, 27] trained from random initialization is far from that of full dataset supervised learning. More fundamentally, existing approaches are relatively inflexible in that the distilled data is over-fitted to the training conditions under which it was generated. While there is some robustness to the choice of initialization weights [31], the distilled dataset is largely specific to the architecture used to train it (thus preventing its use to accelerate NAS, for example), and must use a highly-customized learner (a specific image visitation sequence, a specific sequence of carefully chosen meta-learned learning rates, and a specific number of learning steps). Altogether these constraints mean that existing distilled datasets are not general purpose enough to be useful in practice, e.g. with off-the-shelf learning algorithms. We propose a more flexible approach to dataset distillation underpinned by both algorithmic improvements and changes to the problem definition. Rather than creating synthetic images [31] for arbitrary labels, or a combination of synthetic images and soft labels [27], we focus on crafting synthetic labels for arbitrarily chosen standard images. Compared to these prior approaches focused on synthetic images, label distillation benefits from
exploiting the data statistics of natural images and the lower dimensionality of labels compared to images as parameters for meta-learning. Practically, this leads to improved performance compared to prior image distillation approaches. This set-up also uniquely enables a new kind of cross-dataset knowledge distillation (Figure 1). One can learn solely on a source dataset (such as English characters) with synthetic distilled labels, and apply the learned model to recognise concepts in a disjoint target dataset (such as Japanese characters). Surprisingly, it turns out that models can make good progress on learning to recognise Japanese only through exposure to English characters.
Figure 1: Label distillation enables training a model that can classify Japanese characters after being trained only on English letters and synthetic labels (actual examples shown).
Methodologically, we define a new meta-learning algorithm for distillation that does not require costly evaluation of multiple inner-loop (model-training) steps for each iteration of distillation. More importantly, our algorithm leads to a more flexible distilled dataset that is better transferrable across optimisers, architectures, learning iterations, etc. Furthermore, where existing dataset distillation algorithms rely on second-order gradients, we introduce an alternative learning strategy based on convex optimization layers that avoids high-order gradients and provides better optimization, thus improving the quality of the distilled dataset.
In summary, we contribute: (1) A dataset distillation method that produces flexible distilled datasets that exhibit transferability across learning algorithms. This brings us one step closer to producing useful general purpose distilled datasets. (2) Our distilled datasets can be used to train higher performance models than those of prior work. (3) We introduce the novel problem of cross-dataset distillation, and demonstrate proofs of concept, such as English → Japanese letter recognition.
Dataset distillation
Most closely related to our work are Dataset Distillation [31] and Soft-Label Dataset Distillation [27]. They focus on the problem of distilling a dataset or model [20] into a small number of example images, which are then used to train a new model. This can be seen as solving a meta-learning problem with respect to the model's training data [13]. The common approach is to initialise the distilled dataset randomly, use the distilled data to train a model, and then backpropagate through the model and its training updates to take gradient steps on the dataset. Since the 'inner' model training algorithm is gradient-based, this leads to high-order gradients. To make this process tractable, the original Dataset Distillation [31] uses only a few gradient steps in its inner loop (as per other famous meta-learners [11]). To ensure that sufficient learning progress is made with few updates, it also meta-learns a fixed set of optimal learning rates to use at each step. This balances tractability and efficacy, but causes the distilled dataset to be 'locked in' to the customized optimiser, rather than serve as a general purpose dataset. In this work we define an online meta-learning procedure that simultaneously learns the dataset and the base model. This enables us to tractably take more gradient steps, and ultimately to produce a performant yet flexible general purpose distilled dataset.
There are various motivations for dataset distillation, but the most practically intriguing is to summarize a dataset in compressed form to accelerate model training. In this sense it is related to other endeavours such as dataset pruning [2, 10, 16], core-set construction [29, 3, 26] and instance selection [22] that focus on dataset summarization through a small number of examples. The key difference is that summarization methods select a relatively large part of the data (e.g. at least 10%), while distillation extends down to using 10 images per category (≈ 0.2% of CIFAR-10 data) through example synthesis. We retain original data (like summarization methods), but synthesize labels (like distillation). This leads to another novel application – synthesizing labels for a few fixed examples so that a model trained on these examples can directly (without any fine-tuning) solve a different problem with a different label space (Figure 1).
Wang et al. [31] have theoretically analysed a simple linear case and derived a lower bound on the amount of distilled data needed for a classifier to generalize as well as a classifier trained on the whole dataset. Their analysis showed that the lower bound on the amount of distilled data is the same as the dimensionality of the data.
Meta-learning
Meta-learning algorithms can often be grouped [13] into offline approaches (e.g. [31, 11]) that solve an inner optimisation at every iteration of an outer optimisation; and online approaches that solve the base and meta-learning problems simultaneously (e.g. [4, 19]). Online approaches are typically faster, but optimize meta-parameters for a single problem. Offline approaches are slower and typically limit the length of the inner optimisation for tractability, but can often find meta-parameters that solve a distribution of tasks (as different tasks are drawn in each outer-loop iteration). In dataset distillation, the notion of 'distribution over tasks' corresponds to finding a dataset that can successfully train a network in many settings, such as different initial conditions [31]. Our distillation algorithm is a novel hybrid of these two families. We efficiently solve the base and meta-tasks simultaneously as per the online approaches, and are thus able to use more inner-loop steps. However, we also learn to solve many 'tasks' by detecting meta-overfitting and sampling a new 'task' when this occurs. This leads to an excellent combination of efficacy and efficiency.
Finally, most gradient-based meta-learning algorithms rely on costly and often unstable higher-order gradients [13, 11, 31], or else make simple shortest-path first-order approximations [21]. We are inspired by recent approaches in few-shot learning [5, 18] that avoid this issue through use of convex optimisation layers. We introduce the notion of a pseudo-gradient that enables this idea to scale beyond the few-shot setting to general meta-learning problems such as dataset distillation.
We aim to meta-learn synthetic labels for a fixed set of base examples that can be used to train a randomly initialized model. This corresponds to an objective such as:

\tilde{y}^* = \arg\min_{\tilde{y}} \; \ell\big( (x_t, y_t),\ \theta - \beta \nabla_\theta \ell((\tilde{x}, \tilde{y}), \theta) \big), \qquad (1)

where \tilde{y} are the distilled labels for base examples \tilde{x}, (x_t, y_t) are the target set examples and labels, and \theta is the neural network model learned with rate \beta and loss \ell. We learn soft labels, so the dimensionality of each label is the number of classes C to distill. One gradient step is shown above, but in general there may be multiple steps.
One option to achieve the objective in (Eq. 1) would be to follow [31] and simulate the whole training procedure for multiple gradient steps within the inner loop. However, this requires back-propagating through a long inner loop, and ultimately requires a fixed training schedule with optimized learning rates. We aim to produce a dataset that can be used in a standard training pipeline downstream (e.g. the Adam optimizer with the default parameters).
Our first modification to the standard pipeline is to perform gradient descent iteratively on the model and the distilled labels, rather than performing many inner (model) steps for each outer (dataset) step. This increases efficiency significantly due to a shorter compute graph for backpropagation. Nevertheless, when there are very few training examples, the model converges quickly to an over-fitted local minimum, likely within a few hundred iterations. To manage this, our innovation is to detect overfitting when it occurs, reset the model to a new random initialization and keep training. Specifically, we measure the moving average of target problem accuracy, and when it has not improved for a set number of iterations, we reset the model. This periodic reset of the model after a varying number of iterations is helpful for learning labels that are useful for all stages of training and thus less sensitive to the number of iterations used. To ensure scalability to any number of examples, we sample a minibatch of base examples and synthetic labels and use those to update the model, which also better aligns with standard training practice. Once label distillation is done, we train a new model from scratch using random initial weights, given the base examples and learned synthetic labels.

Algorithm 1 Label distillation with ridge regression (RR)
Input: x̃: N unlabeled examples from the base dataset; (x_t, y_t): labeled examples from the target dataset; β: step size; α: pseudo-gradient step size; n_o, n_i: outer (resp. inner) loop batch size; C: number of classes in the target dataset
Output: distilled labels ỹ and a reasonable number of training steps T_i
  ỹ ← 1/C                        // Uniformly initialize synthetic labels
  θ ∼ p(θ)                       // Randomly initialize feature extractor parameters
  w ← 0                          // Initialize global RR classifier weights
  while ỹ not converged do
    (x̃′, ỹ′) ∼ (x̃, ỹ)            // Sample a minibatch of n_i base examples
    (x′_t, y′_t) ∼ (x_t, y_t)      // Sample a minibatch of n_o target examples
    w_l = argmin_{w_l} L(f_{w_l}(f_θ(x̃′)), ỹ′)   // Solve RR problem
    w ← (1 − α)w + αw_l           // Update global RR classifier weights
    ỹ ← ỹ − β ∇_ỹ L(f_w(f_θ(x′_t)), y′_t)        // Update synthetic labels with target loss
    θ ← θ − β ∇_θ L(f_w(f_θ(x̃′)), ỹ′)            // Update feature extractor
    if L(f_w(f_θ(x_t)), y_t) did not improve then
      θ ∼ p(θ)                   // Reset feature extractor
      w ← 0                      // Reset global RR classifier weights
      T_i ← iterations since previous reset      // Record time to overfit
    end if
  end while

We propose and evaluate two label distillation algorithms: a second-order version that performs one update step within the inner loop; and a first-order version that uses a closed form solution of ridge regression to find optimal classifier weights for the base examples.
Second-order version
The training procedure includes both an inner and an outer loop. The inner loop consists of one update of the model, θ′ = θ − α ∇_θ L(f_θ(x̃′), ỹ′), through which we then backpropagate to update the synthetic labels ỹ. We perform a standard update of the model after updating the synthetic labels; this is more of a technical detail about how the method is implemented rather than a theoretical necessity, as it could be combined with the inner-loop update. The method is otherwise similar to our first-order ridge regression method, which is summarized in Algorithm 1, together with our notation. Note that examples from the target dataset are used when updating the synthetic labels ỹ, but for updating the model θ we only use synthetic labels and the base examples. After each update of the labels, we normalize them so that they represent a valid probability distribution. This makes them interpretable and has led to improvements compared to unnormalized labels.
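To make the second-order update concrete, the following is a minimal PyTorch sketch of one meta-iteration. A plain linear classifier on pre-computed features stands in for the full network, and the tensor names, step sizes and the particular renormalization scheme are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D, C = 10, 64, 10                                   # base examples, feature dim, classes
base_feats = torch.randn(N, D)                         # features of base examples x~
target_feats = torch.randn(128, D)                     # features of a target minibatch x_t
target_y = torch.randint(0, C, (128,))                 # true target labels y_t

syn_labels = torch.full((N, C), 1.0 / C).requires_grad_()   # y~, uniform initialization
W = (0.01 * torch.randn(D, C)).requires_grad_()              # toy "model" parameters theta
inner_lr, label_lr = 0.01, 0.001                             # illustrative step sizes

# Inner step: one model update on (x~, y~), kept in the graph (create_graph=True).
inner_loss = -(syn_labels * F.log_softmax(base_feats @ W, dim=1)).sum(dim=1).mean()
(grad_W,) = torch.autograd.grad(inner_loss, W, create_graph=True)
W_fast = W - inner_lr * grad_W                         # theta' = theta - alpha * grad

# Outer step: target loss of the adapted model, differentiated w.r.t. the labels y~.
target_loss = F.cross_entropy(target_feats @ W_fast, target_y)
(grad_labels,) = torch.autograd.grad(target_loss, syn_labels)

with torch.no_grad():                                  # gradient step on the labels
    syn_labels -= label_lr * grad_labels
    syn_labels.clamp_(min=0)                           # renormalize to a valid distribution
    syn_labels /= syn_labels.sum(dim=1, keepdim=True)
```

In the full method, Adam rather than plain SGD is used for both the label and model updates, and a standard model update on the base examples follows the label update, as described above.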
First-order version with ridge regression
To avoid second-order gradients, we propose a first-order version that uses a pseudo-gradient generated via a closed form solution to ridge regression. Ridge regression layers have previously been used for few-shot learning [5] within a minibatch. However, we extend them to learn global weights that persist across minibatches. Ridge regression can be defined and solved as:

W_l = \arg\min_W \|XW - Y\|^2 + \lambda \|W\|^2 = \left(X^T X + \lambda I\right)^{-1} X^T Y, \qquad (2)

where X are the image features, Y are the labels, W are the ridge regression weights and \lambda is the regularization parameter. Following Bertinetto et al. [5], we use the Woodbury formula [24]:

W_l = X^T \left(X X^T + \lambda I\right)^{-1} Y, \qquad (3)

which allows us to use the matrix XX^T, whose dimensionality depends on the square of the number of inputs (minibatch size) rather than the square of the input embedding size. The embedding size is typically significantly larger than the size of the minibatch, so it would make matrix inversion costly. While ridge regression (RR) is obviously oriented at regression problems, it has been shown [5] to work well for classification when regressing label vectors.
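A minimal sketch of the Woodbury-form solve (Eq. 3), followed by the pseudo-gradient update on the global classifier weights from Algorithm 1. The shapes, regularization strength and step size here are illustrative assumptions, and a reasonably recent PyTorch with `torch.linalg` is assumed.

```python
import torch

def ridge_regression_weights(X, Y, lam=1.0):
    """Closed-form RR weights via the Woodbury identity (Eq. 3).

    X: (n, d) minibatch embeddings, Y: (n, C) soft-label targets.
    Inverts an (n, n) matrix instead of a (d, d) one, which is cheaper
    when the minibatch size n is much smaller than the embedding size d.
    """
    n = X.shape[0]
    A = X @ X.t() + lam * torch.eye(n, device=X.device)   # (n, n) system
    return X.t() @ torch.linalg.solve(A, Y)               # W_l: (d, C)

# Illustrative shapes: a minibatch of 50 embeddings of size 192, 10 classes.
X = torch.randn(50, 192)
Y = torch.full((50, 10), 0.1)
W_local = ridge_regression_weights(X, Y)

# Pseudo-gradient step on the global weights (Algorithm 1):
# w <- (1 - alpha) * w + alpha * w_l, equivalently w <- w - alpha * (w - w_l).
alpha = 0.01
w_global = torch.zeros(192, 10)
w_global = (1 - alpha) * w_global + alpha * W_local
print(W_local.shape, w_global.shape)   # torch.Size([192, 10]) torch.Size([192, 10])
```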
Ridge regression with pseudo-gradients
We use an RR layer as the final output layer of our base network. This solves for the optimal (local) weights w_l that classify the features of the current minibatch examples. We exploit this local minibatch solution by taking a pseudo-gradient step that updates the global weights w as w ← (1 − α)w + αw_l, with α being the pseudo-gradient step size. We can understand this as a pseudo-gradient because it corresponds to the step w ← w − α(w − w_l). We can then update the synthetic labels by back-propagating through the local weights w_l. Subsequent feature extractor updates on θ avoid second-order gradients. The full process is summarised in Algorithm 1.

We perform two main types of experiments: (1) within-dataset distillation, where the base examples come from the target dataset, and (2) cross-dataset distillation, where the base examples come from a different but related dataset. The dataset should be related because if there is a large shift in the domain (e.g. from characters to photos), then the feature extractor trained on the base examples would generalize poorly to the target dataset. We use MNIST, CIFAR-10 and CIFAR-100 for the task of within-dataset distillation, while for cross-dataset distillation we use EMNIST ("English letters"), KMNIST, Kuzushiji-49 (both "Japanese letters"), MNIST (digits), CUB (birds) and CIFAR-10 (general objects). Details of these datasets are in the supplementary material.
We use parameters N_M, N_W to specify over how many iterations to calculate the moving average and how many iterations to wait before a reset after the best moving average value. We select N_M = N_W and use a value of 50 steps in most cases, while we use 100 for CIFAR-100 and Kuzushiji-49, and 200 for other larger-scale experiments (more than 100 base examples).
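A sketch of this moving-average reset criterion, assuming equal window sizes N_M = N_W as in our experiments; the helper names in the usage comment are hypothetical.

```python
from collections import deque
from statistics import fmean

class OverfitDetector:
    """Signal a model reset when the moving average of target accuracy
    has not improved for `window` iterations (N_M = N_W = window)."""

    def __init__(self, window=50):
        self.window = window
        self.recent = deque(maxlen=window)      # last N_M accuracy values
        self.best_avg = float("-inf")
        self.since_best = 0

    def should_reset(self, target_accuracy):
        self.recent.append(target_accuracy)
        avg = fmean(self.recent)
        if avg > self.best_avg:
            self.best_avg, self.since_best = avg, 0
        else:
            self.since_best += 1
        return self.since_best >= self.window   # waited N_W steps with no improvement

# Usage inside the distillation loop (sketch):
# if detector.should_reset(acc):
#     T_i = step - last_reset_step             # record time-to-overfit
#     model.apply(reinit_weights)              # hypothetical re-initialization helper
#     detector = OverfitDetector(window=50)
```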
Early stopping for learning synthetic labels
We update the synthetic labels for a given number of epochs and then select the best labels to use based on the validation performance. To estimate the validation performance, we train a new model from scratch using the current distilled labels and the associated base examples and then evaluate the model on the validation part of the target set. We randomly set aside about 10-15% (depending on the dataset) of the training data for validation.
Models
We use LeNet [17] for MNIST and similar experiments, and AlexNet [15] for CIFAR-10, CIFAR-100 and CUB. Both models are identical to the ones used in [31]. In a fully supervised setting they achieve about 99% and 80% test accuracy on MNIST and CIFAR-10 respectively.
Selection of base examples
The base examples are selected randomly, using a shared random seed for consistency across scenarios. Our baseline models use the same random seed as the distillation models, so they share base examples for fair comparison. For within-dataset label distillation, we create a balanced set of base examples, so each class has the same number of base examples. For cross-dataset label distillation, we do not consider the original classes of base examples. The size of the label space and the labels are different in the source and the target problem.
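For instance, a class-balanced random draw of base examples with a shared seed could be implemented as below; the toy label tensor and per-class count are illustrative only.

```python
import torch

def sample_balanced_base_examples(labels, per_class, seed=0):
    """Return indices of a class-balanced random subset (a sketch).

    labels: (N,) tensor of integer class labels of the source dataset.
    """
    g = torch.Generator().manual_seed(seed)            # shared seed for consistency
    picked = []
    for c in labels.unique().tolist():
        idx = (labels == c).nonzero(as_tuple=True)[0]   # indices of class c
        perm = torch.randperm(idx.numel(), generator=g)[:per_class]
        picked.append(idx[perm])
    return torch.cat(picked)

# Example: 10 base examples per class from random toy labels.
labels = torch.randint(0, 10, (60000,))
base_idx = sample_balanced_base_examples(labels, per_class=10)
print(base_idx.shape)   # torch.Size([100])
```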
Further details
The outer loop minibatch size is n_o = 1024 examples, while the inner minibatch size n_i depends on the number of base examples used. When using 100 or more base examples, we select a minibatch of 50 examples, except for CIFAR-100, for which we use 100 examples. For 10, 20 and 50 base examples our minibatch sizes are 10, 10 and 25. We optimize the synthetic labels and the model itself using the Adam optimizer with the default parameters. Most of our models are trained for 400 epochs, while larger-scale models (more than 100 base examples and CIFAR-100) are trained for 800 epochs. Smaller-scale Kuzushiji-49 experiments are trained for 100 epochs, while larger-scale ones use 200 epochs. In the second-order version, we perform one inner-loop step update with learning rate α. We back-propagate through the inner-loop update when updating the synthetic labels (meta-knowledge), but not when subsequently updating the model θ with the Adam optimizer. In the first-order version, we use a pseudo-gradient step size of 0.01 and λ of 1.0. We calibrate the regression weights by scaling them with a value learned during training with the specific set of base examples and distilled labels. Our tables report the mean test accuracy and standard deviation (%) across 20 models trained from scratch using the base examples and synthetic labels.

We compare our label distillation (LD) to previous dataset distillation (DD) and soft-label dataset distillation (SLDD) on MNIST and CIFAR-10. We also establish new baselines that take true labels
from the target dataset and are otherwise trained in the same way as label distillation models. RR baselines use RR and the pseudo-gradient for consistency with LD RR (the overall network architecture remains the same as in the second-order approach). In addition, we include baselines that use label smoothing (LS) [28] with a smoothing parameter of 0.1 as suggested in [23]. The results in Table 1 show that LD significantly outperforms previous work on MNIST. This is in part due to LD enabling the use of more steps (LD estimates T_i of a few hundred steps vs a fixed 3 epochs of 10 steps in DD and SLDD). Our improved baselines (with the number of steps chosen using the validation set, between 50 and 1000 steps) are also competitive, and outperform the prior baselines in [31] due to taking more steps. However, since [31, 27] are difficult to scale to more steps, this is a reasonable comparison. The label smoothing baseline works well on MNIST for a large number of base examples, where the problem anyway approaches one of conventional supervised learning. However, this strategy has not proven effective enough for CIFAR-10, where synthetic labels are the best. Importantly, in the most intensively distilled regime of 10 examples, LD clearly outperforms all competitors. We provide an analysis of the labels learned by our method in Section 4.5. For CIFAR-10 our results also improve on the original DD result. In this experiment, our second-order algorithm performs similarly to our RR pseudo-gradient strategy.

Table 1: Within-dataset distillation recognition accuracy (%). Our label distillation (LD) outperforms prior Dataset Distillation (DD) [31] and SLDD [27], and scales to synthesizing more examples.
Table 2: CIFAR-100 within-dataset distillation. One example per class.

For cross-dataset distillation, we considered four scenarios: from EMNIST letters to MNIST digits, from EMNIST letters ("English") to Kuzushiji-MNIST or Kuzushiji-49 characters ("Japanese"), and from CUB bird species to CIFAR-10 general categories. The results in Table 3 show that we are able to distill labels on examples of a different source dataset and achieve surprisingly good performance on the target problem, given that no target data is used when training these models. In contrast, directly applying a trained source-task model to the target without distillation unsurprisingly leads to chance performance (about 2% test accuracy for Kuzushiji-49 and 10% for all other cases). These results show that we can indeed distill the knowledge of one dataset into base examples from a different but related dataset through crafting synthetic labels. Furthermore, our RR approach surpasses the second-order method in most cases, confirming its value.

Table 3: Cross-dataset distillation recognition accuracy (%). Datasets: E = EMNIST, M = MNIST, K = KMNIST, B = CUB, C = CIFAR-10, K-49 = Kuzushiji-49.
Figure 2: Datasets distilled with LD are more flexible than those distilled with DD: (a) sensitivity to the number of optimization steps, (b) sensitivity to optimization parameters, (c) cross-architecture transfer.
We conduct experiments to verify the flexibility and general applicability of our label-distilled dataset compared to the image-distilled alternative of [31]. Namely, we looked at: (1) how the number of steps used during meta-testing affects the performance of learning with the distilled data, and in particular sensitivity to deviation from the number of steps used during meta-training of DD and LD; (2) sensitivity of the models to changes in optimization parameters between meta-training and meta-testing; (3) how well the distilled datasets transfer to neural architectures different to those used for training. We used the DD implementation by the authors of [31] for fair evaluation. Figure 2 summarizes the key points, while tables with details are in the supplementary material.
Sensitivity to the number of meta-testing steps (Figure 2a)
A key feature of our method is that it is relatively insensitive to the number of steps used to train a model on the distilled data. Our results from Table 8 show that even if we do 50 steps fewer or 100 steps more than the number estimated during training, the testing performance does not change significantly. This is in contrast with previous DD and SLDD methods that must be trained for a specific number of steps with optimized learning rates, making them rather inflexible. If the number of steps changes even by as little as 20%, they incur a significant cost in accuracy. Table 9 provides a further analysis of the sensitivity of DD.

Sensitivity to optimization parameters (Figure 2b)
DD [31] uses step-wise meta-optimized learning rates to maximize accuracy. Our analysis in Table 10 shows that if we use an average of the optimized learning rates rather than the specific value in each step (a more general optimizer), the performance decreases significantly. Further, original DD relies on training the distilled data in a fixed sequence, and our analysis in Table 11 shows that changing the order of examples leads to a large decrease in accuracy. Our LD method by design does not depend on the specific order of examples, and it can be used with off-the-shelf optimizers such as Adam to train the model, rather than with optimizers whose learning rates are specific to the example and step.
Transferability of labels across architectures (Figure 2c)
We study the impact of distilling the labels using AlexNet and then training AlexNet, LeNet and ResNet-18 using these distilled labels. We establish new baselines for these additional architectures, using a number of steps chosen based on the validation set. Our results in Tables 12 and 13 suggest our labels are helpful in both within- and across-dataset distillation scenarios. This suggests that even if we relabel a dataset with one label space to solve a different problem, the labels generalize to a new architecture and lead to improvements over the baselines. We have done experiments with original DD to find how robust it is to the change of architecture (Table 14). As expected, there is a large decrease in accuracy, but the results are better than the baselines they report [31], so their synthetic images are somewhat transferable when we train the model with the specific order of examples and optimized learning rates. However, our LD method incurs a smaller decrease in accuracy, suggesting better transferability across architectures.
In Table 4 we compare training times of LD and the original DD approach (using the same settings and hardware). Our LD involves meta-learning fewer parameters, so for a fair comparison we have also trained a version of our approach that creates synthetic images rather than labels (our DD). The results show that our online approach significantly accelerates training. Our version of DD only takes a few more minutes to run than LD. However, our paper focuses on LD because our version of DD was relatively unstable and led to significantly worse performance than LD, perhaps because learning synthetic images is more complex than learning synthetic labels.

Table 4: Comparison of training times of DD [31] and our LD (minutes).
            MNIST   CIFAR-10
DD          116     205
LD          61      86
LD (RR)     65      98
Our DD      67      96
Our DD (RR) 72      90
Analysis of synthetic labels
We have analysed to what extent the synthetic labels learn the true label and how this differs between our second-order and RR methods (Figures 8 and 9 in the supplementary material). The results for a MNIST experiment with 100 base examples show the second-order method recovers the true value to about 84% on average, so it could be viewed as non-uniform label smoothing with meta-learned weights on labels specific to the examples. For the same scenario, RR recovers the true value to about 63% on average. Both methods achieved similar test accuracy, which suggests our RR method leads to distilled labels with more non-trivial information. Moreover, visually similar digits such as 4 and 9 receive more weight compared to other combinations. Such multiplexing of labels presumably underpins the improved performance compared to the baseline. For CIFAR-10, the labels are significantly more complex, indicating non-trivial information is embedded into the synthetic labels. For cross-dataset LD (EMNIST to KMNIST), in many cases a letter had similar labels across various base examples, but in a large number of cases the pattern was less clear. Examples are shown in the supplementary material.
Figure 3: RR-based LD reduces synthetic label meta-gradient variance Var[∇_ỹ L] (log variance over the number of steps, second-order vs pseudo-gradient RR).

Pseudo-gradient analysis
To understand the superior performance of our RR method, we analysed the variances of meta-gradients obtained by our second-order and pseudo-gradient methods. Results in Figure 3 show that our pseudo-gradient RR method obtains a significantly lower variance of meta-knowledge gradients, which underpins its stable and effective training.
Discussion
Our LD algorithm provides a more effective and more flexible distillation approach than prior work. This brings us one step closer to the vision of leveraging distilled data to accelerate model training or design – such as architecture search [9]. Currently we only explicitly randomize over network initializations during training. In future work we believe our strategy of multi-step training with reset at convergence could be used with other factors, such as randomly selected network architectures, to further improve cross-network generalisation performance.
We have introduced a new label distillation algorithm for distilling the knowledge of a large dataset into synthetic labels of a few base examples from the same or a different dataset. Our method improves on prior dataset distillation results, scales better to larger problems, and enables novel settings such as cross-dataset distillation. Most importantly, it is significantly more flexible in terms of distilling general purpose datasets that can be used downstream with off-the-shelf optimizers.

Broader Impact
We propose a flexible and efficient distillation scheme that gets us closer to the goal of practically useful dataset distillation. Dataset distillation could ultimately lead to a beneficial impact in terms of researchers' time efficiency by enabling faster experimentation when training on small distilled datasets. Perhaps more importantly, it could reduce the environmental impact of AI research and development by reducing energy costs [25]. However, our results are not strong enough yet; for this goal to be realised, better distillation methods leading to more accurate downstream models need to be developed.
In future, label distillation could speculatively provide a useful tool for privacy-preserving learning [1], e.g. in situations where organizations want to learn from users' data. For example, a company could provide a small set of public (source) data to a user, who performs cross-dataset distillation using their private (target) data to train a model. The user could then return the distilled labels on the public data, which would allow the company to re-create the user's model. In this way the knowledge from the user's training could be obtained by the company in the form of distilled labels – without directly sending any private data, or a trained model that could be a vector for memorization attacks [6].
On the negative side, detecting and understanding the impact of bias in datasets is an important yet already very challenging issue for machine learning. The impact of dataset distillation on any underlying biases in the data is completely unclear. If people were to train models on distilled datasets in future, it would be important to understand the impact of distillation on data biases.
Source Code
We provide a PyTorch implementation of our approach at https://github.com/ondrejbohdal/label-distillation.

Acknowledgments and Disclosure of Funding
This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.
References
[1] Al-Rubaie, M. and Chang, J. M. (2019). Privacy-preserving machine learning: threats and solutions. IEEE Security and Privacy, 17(2):49–58.
[2] Angelova, A., Abu-Mostafa, Y., and Perona, P. (2005). Pruning training sets for learning of object categories. In CVPR.
[3] Bachem, O., Lucic, M., and Krause, A. (2017). Practical coreset constructions for machine learning. In arXiv.
[4] Balaji, Y., Sankaranarayanan, S., and Chellappa, R. (2018). MetaReg: towards domain generalization using meta-regularization. In NeurIPS.
[5] Bertinetto, L., Henriques, J., Torr, P. H. S., and Vedaldi, A. (2019). Meta-learning with differentiable closed-form solvers. In ICLR.
[6] Carlini, N., Liu, C., Erlingsson, U., Kos, J., and Song, D. (2019). The secret sharer: evaluating and testing unintended memorization in neural networks. In USENIX Security Symposium.
[7] Clanuwat, T., Kitamoto, A., Lamb, A., Yamamoto, K., and Ha, D. (2018). Deep learning for classical Japanese literature. In NeurIPS Workshop on Machine Learning for Creativity and Design.
[8] Cohen, G., Afshar, S., Tapson, J., and van Schaik, A. (2017). EMNIST: an extension of MNIST to handwritten letters. In arXiv.
[9] Elsken, T., Metzen, J. H., and Hutter, F. (2019). Neural architecture search: a survey. Journal of Machine Learning Research, 20:1–21.
[10] Felzenszwalb, P. F., Girshick, R. B., McAllester, D., and Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645.
[11] Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In ICML.
[12] Hinton, G., Vinyals, O., and Dean, J. (2014). Distilling the knowledge in a neural network. In NIPS.
[13] Hospedales, T., Antoniou, A., Micaelli, P., and Storkey, A. (2020). Meta-learning in neural networks: a survey. In arXiv.
[14] Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report.
[15] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In NIPS.
[16] Lapedriza, A., Pirsiavash, H., Bylinskii, Z., and Torralba, A. (2013). Are all training examples equally valuable? In arXiv.
[17] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
[18] Lee, K., Maji, S., Ravichandran, A., and Soatto, S. (2019). Meta-learning with differentiable convex optimization. In CVPR.
[19] Li, Y., Yang, Y., Zhou, W., and Hospedales, T. M. (2019). Feature-critic networks for heterogeneous domain generalization. In ICML.
[20] Micaelli, P. and Storkey, A. (2019). Zero-shot knowledge transfer via adversarial belief matching. In NeurIPS.
[21] Nichol, A., Achiam, J., and Schulman, J. (2018). On first-order meta-learning algorithms. In arXiv.
[22] Olvera-López, J. A., Carrasco-Ochoa, J. A., Martínez-Trinidad, J. F., and Kittler, J. (2010). A review of instance selection methods. Artificial Intelligence Review, 34(2):133–143.
[23] Pereyra, G., Tucker, G., Chorowski, J., Kaiser, L., and Hinton, G. (2017). Regularizing neural networks by penalizing confident output distributions. In arXiv.
[24] Petersen, K. B. and Pedersen, M. S. (2012). The matrix cookbook. Technical report.
[25] Schwartz, R., Dodge, J., Smith, N. A., and Etzioni, O. (2019). Green AI. In arXiv.
[26] Sener, O. and Savarese, S. (2018). Active learning for convolutional neural networks: a core-set approach. In ICLR.
[27] Sucholutsky, I. and Schonlau, M. (2019). Soft-label dataset distillation and text dataset distillation. In arXiv.
[28] Szegedy, C., Vanhoucke, V., Ioffe, S., and Shlens, J. (2016). Rethinking the Inception architecture for computer vision. In CVPR.
[29] Tsang, I. W., Kwok, J. T., and Cheung, P.-M. (2005). Core vector machines: fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363–392.
[30] Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 dataset. Technical report.
[31] Wang, T., Zhu, J.-Y., Torralba, A., and Efros, A. A. (2018). Dataset distillation. In arXiv.
A Datasets
We use the MNIST [17], EMNIST [8], KMNIST and Kuzushiji-49 [7], CIFAR-10 and CIFAR-100 [14], and CUB [30] datasets. Example images are shown in Figure 4. MNIST includes images of 70000 handwritten digits that belong to 10 classes. The EMNIST dataset includes various characters, but we choose the EMNIST letters split that includes only letters. Lowercase and uppercase letters are combined together into 26 balanced classes (145600 examples in total). KMNIST (Kuzushiji-MNIST) is a dataset that includes images of 10 classes of cursive Japanese (Kuzushiji) characters and is of the same size as MNIST. Kuzushiji-49 is a larger version of KMNIST with 270912 examples and 49 classes. CIFAR-10 includes 60000 colour images of various general objects, for example airplanes, frogs or ships. As the name indicates, there are 10 classes. CIFAR-100 is like CIFAR-10, but has 100 classes with 600 images for each of them. Every class belongs to one of 20 superclasses which represent more general concepts. CUB includes colour images of 200 bird species. The number of images is relatively small, only 11788. All datasets except Kuzushiji-49 are balanced or almost balanced.
Figure 4: Example images from the different datasets that we use (MNIST, EMNIST, Kuzushiji, CIFAR-10, CUB).
B Additional experimental details
Normalization
We normalize greyscale images using the standard normalization for MNIST (mean of 0.1307 and standard deviation of 0.3081). All our greyscale images are of size 28 × 28. Colour images are normalized using CIFAR-10 normalization (means of about 0.4914, 0.4822, 0.4465, and standard deviations of about 0.247, 0.243, 0.261 across channels). All colour images are reshaped to be of size 32 × 32.
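A minimal torchvision sketch of this preprocessing, assuming standard transforms and the image sizes stated above; the released code may organize the pipeline differently.

```python
from torchvision import transforms

# Greyscale datasets: MNIST-style statistics, 28x28 inputs.
greyscale_transform = transforms.Compose([
    transforms.Resize((28, 28)),
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

# Colour datasets: CIFAR-10 statistics, images reshaped to 32x32.
colour_transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261)),
])
```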
Computational resources
Each experiment was done on a single GPU, in almost all cases an NVIDIA 2080 Ti. Shorter (400 epochs) experiments took about 1 or 2 hours to run, while longer (800 epochs) experiments took between 2 and 4 hours.
In addition, Figure 5 illustrates the difference between a standard model used for second-order label distillation and a model that uses global ridge regression classifier weights (used for first-order RR label distillation). The two models are almost identical – only the final linear layer is different.
C Additional experiments
Stability and dependence on choice of base examples
To evaluate the consistency of our results, we repeat the entire pipeline and report the results in Table 5. In the previous experiments, we used one randomly chosen but fixed set of base examples per source task. We investigate the impact of base example choice by drawing further random base example sets. The results in Table 6 suggest that the impact of base example choice is slightly larger than that of the variability due to the distillation process, but still small overall. Note that the ± standard deviations in all cases quantify the impact of retraining from different random initializations at meta-test, given a fixed base set and completed distillation. It is likely that if the base examples were not selected randomly, the impact of using specific base examples would be larger. In fact, future work could investigate how to choose base examples so that learning synthetic labels for them improves the result further.

Figure 5: Comparison of a standard model used for second-order label distillation and a model that uses global ridge regression classifier weights (used for first-order RR label distillation). Both consist of a CNN feature extractor followed by linear layers with ReLU non-linearities; the standard model ends with a final linear layer, while the RR model ends with the global RR classifier weights.

Dependence on target dataset size
Our experiments use a relatively large target dataset (about 50000 examples) for meta-learning. We study the impact of reducing the amount of target data for distillation in Table 7. Using 5000 or more examples (about 10% of the original size) is enough to achieve comparable performance.
Transferability of RR synthetic labels to standard model training
When using RR, we train the validation and test models with RR and global classifier weights obtained using the pseudo-gradient. In this experiment we study what happens if we create synthetic labels with RR, but do validation and testing with standard models trained from scratch without RR. For a fair comparison, we use the same synthetic labels for training a new RR model and a new standard model. Validation for early stopping is done with a standardly trained model. The results in Table 15 suggest RR labels are largely transferable (even in cross-dataset scenarios), but there is some decrease in performance. Consequently, it is better to learn the synthetic labels using the second-order approach if we want to train a standard model without RR during testing (comparing with the results in Tables 1, 2 and 3).
Results of analysis
Our tables report the mean test accuracy and standard deviation (%) across 20 models trained from scratch using the base examples and synthetic labels. When analysing original DD, 200 randomly initialized models are used.

Table 5: Repeatability. Label distillation is quite repeatable: the performance change from repeating the whole distillation learning and subsequent re-training is small. We used 100 base examples for these experiments. Datasets: E = EMNIST, M = MNIST.

Table 6: Dependence on the choice of base examples (five randomly drawn sets).

Table 7: Dependence on target set size. Around 5000 examples (≈ 10% of all data) is sufficient. As before, we used 100 base examples; using all examples means using 50000 examples.

Table 8: Sensitivity to the number of training steps at meta-testing. We re-train the model with different numbers of steps than estimated during meta-training. The results show our method is relatively insensitive to the number of steps. The default number of steps T_i (the "+ 0" column) was estimated as 278 for MNIST (LD), 217 for MNIST (LD RR), 364 for E → M (LD) and 311 for E → M (LD RR). The scenario with 100 base examples is reported.
Table 9: Sensitivity of original DD to the number of steps. DD is very sensitive to using the specific number of steps. We take the first N steps, keep their original learning rates, and assign learning rates of 0 to the remaining steps. When we do 5 more steps than the original (30), we perform the final 5 steps with an average learning rate.
Table 13: Transferability of distilled labels across different architectures (RR method). The upper part of the table shows the performance of various test models when trained on distilled labels synthesised with AlexNet only. The middle part shows the baseline performance of training models with different architectures on true labels. The lower part shows that distilled labels work even in the cross-dataset scenario (labels trained with AlexNet only). The results clearly suggest the distilled labels generalize across different architectures. Note that lower RR results for ResNet-18 may be caused by the significantly lower dimensionality of the RR layer (64 features + 1 for bias), while AlexNet and LeNet have 192 features + 1 for bias in the RR layer.
Figure 6: Examples of distilled labels for both second-order and RR label distillation. The upper part shows base examples and their distilled labels for the within-dataset MNIST scenario with 10 base examples. The lower part shows EMNIST (“English”) base examples and their synthetic labels that allow the model to learn to classify KMNIST (“Japanese”) characters – 10 base examples used.