Flexible Dataset Distillation: Learn Labels Instead of Images
Ondrej Bohdal, Yongxin Yang, Timothy Hospedales
School of Informatics, The University of Edinburgh
{ondrej.bohdal, yongxin.yang, t.hospedales}@ed.ac.uk
Abstract
We study the problem of dataset distillation – creating a small set of synthetic examples capable of training a good model. In particular, we study the problem of label distillation – creating synthetic labels for a small set of real images, and show it to be more effective than the prior image-based approach to dataset distillation. Interestingly, label distillation can be applied across datasets, for example enabling learning Japanese character recognition by training only on synthetically labeled English letters. Methodologically, we introduce a more robust and flexible meta-learning algorithm for distillation, as well as an effective first-order strategy based on convex optimization layers. Distilling labels with our new algorithm leads to improved results over prior image-based distillation. More importantly, it leads to clear improvements in flexibility of the distilled dataset in terms of compatibility with off-the-shelf optimizers and diverse neural architectures.
Distillation is a topical area of neural network research that initially began with the goal of extracting the knowledge of a large pre-trained model and compiling it into a smaller model, while retaining similar performance [12]. The notion of distillation has since found numerous applications and uses, including the possibility of dataset distillation [31]: extracting the knowledge of a large dataset and compiling it into a small set of carefully crafted examples, such that a model trained on the small dataset alone achieves good performance. This is of scientific interest as a tool to study neural network generalisation under small sample conditions. More practically, it has the potential to address the large and growing logistical and energy hurdle of neural network training, if adequate neural networks can be quickly trained on small distilled datasets rather than massive raw datasets.
Nevertheless, progress towards the vision of dataset distillation has been limited, as the performance of existing methods [31, 27] trained from random initialization is far from that of full dataset supervised learning. More fundamentally, existing approaches are relatively inflexible in that the distilled data is over-fitted to the training conditions under which it was generated. While there is some robustness to the choice of initialization weights [31], the distilled dataset is largely specific to the architecture used to train it (thus preventing its use to accelerate NAS, for example), and must use a highly-customized learner (a specific image visitation sequence, a specific sequence of carefully chosen meta-learned learning rates, and a specific number of learning steps). Altogether these constraints mean that existing distilled datasets are not general purpose enough to be useful in practice, e.g. with off-the-shelf learning algorithms. We propose a more flexible approach to dataset distillation underpinned by both algorithmic improvements and changes to the problem definition. Rather than creating synthetic images [31] for arbitrary labels, or a combination of synthetic images and soft labels [27], we focus on crafting synthetic labels for arbitrarily chosen standard images. Compared to these prior approaches focused on synthetic images, label distillation benefits from
exploiting the data statistics of natural images and the lower dimensionality of labels compared to images as parameters for meta-learning. Practically, this leads to improved performance compared to prior image distillation approaches. This set-up also uniquely enables a new kind of cross-dataset knowledge distillation (Figure 1). One can learn solely on a source dataset (such as English characters) with synthetic distilled labels, and apply the learned model to recognise concepts in a disjoint target dataset (such as Japanese characters). Surprisingly, it turns out that models can make good progress on learning to recognise Japanese only through exposure to English characters.
Figure 1: Label distillation enables training a model that can classify Japanese characters after being trained only on English letters and synthetic labels (actual examples shown).
Methodologically, we define a new meta-learning algorithm for distillation that does not require costly evaluation of multiple inner-loop (model-training) steps for each iteration of distillation. More importantly, our algorithm leads to a more flexible distilled dataset that is better transferrable across optimisers, architectures, learning iterations, etc. Furthermore, where existing dataset distillation algorithms rely on second-order gradients, we introduce an alternative learning strategy based on convex optimization layers that avoids high-order gradients and provides better optimization, thus improving the quality of the distilled dataset.
In summary, we contribute: (1) A dataset distillation method that produces flexible distilled datasets that exhibit transferability across learning algorithms. This brings us one step closer to producing useful general purpose distilled datasets. (2) Our distilled datasets can be used to train higher performance models than those of prior work. (3) We introduce the novel problem of cross-dataset distillation, and demonstrate proofs of concept, such as English → Japanese letter recognition.
Dataset distillation
Most closely related to our work are Dataset Distillation [31] and Soft-Label Dataset Distillation [27]. They focus on the problem of distilling a dataset or model [20] into a small number of example images, which are then used to train a new model. This can be seen as solving a meta-learning problem with respect to the model's training data [13]. The common approach is to initialise the distilled dataset randomly, use the distilled data to train a model, and then backpropagate through the model and its training updates to take gradient steps on the dataset. Since the 'inner' model training algorithm is gradient-based, this leads to high-order gradients. To make this process tractable, the original Dataset Distillation [31] uses only a few gradient steps in its inner loop (as per other famous meta-learners [11]). To ensure that sufficient learning progress is made with few updates, it also meta-learns a fixed set of optimal learning rates to use at each step. This balances tractability and efficacy, but causes the distilled dataset to be 'locked in' to the customized optimiser, rather than serve as a general purpose dataset. In this work we define an online meta-learning procedure that simultaneously learns the dataset and the base model. This enables us to tractably take more gradient steps, and ultimately to produce a performant yet flexible general purpose distilled dataset.
There are various motivations for dataset distillation, but the most practically intriguing is to summarize a dataset in compressed form to accelerate model training. In this sense it is related to other endeavours such as dataset pruning [2, 10, 16], core-set construction [29, 3, 26] and instance selection [22] that focus on dataset summarization through a small number of examples. The key difference is that summarization methods select a relatively large part of the data (e.g. at least 10%), while distillation extends down to using 10 images per category (≈ 0.2% of CIFAR-10 data) through example synthesis. We retain original data (like summarization methods), but synthesize labels (like distillation). This leads to another novel application – synthesizing labels for a few fixed examples so that a model trained on these examples can directly (without any fine-tuning) solve a different problem with a different label space (Figure 1).
Wang et al. [31] have theoretically analysed a simple linear case and derived a lower bound on the amount of distilled data needed for a classifier to generalize as well as a classifier trained on the whole dataset. Their analysis showed that the lower bound on the amount of distilled data is the same as the dimensionality of the data.
Meta-learning
Meta-learning algorithms can often be grouped [13] into offline approaches (e.g. [31, 11]) that solve an inner optimisation at every iteration of an outer optimisation; and online approaches that solve the base and meta-learning problems simultaneously (e.g. [4, 19]). Online approaches are typically faster, but optimize meta-parameters for a single problem. Offline approaches are slower and typically limit the length of the inner optimisation for tractability, but can often find meta-parameters that solve a distribution of tasks (as different tasks are drawn in each outer-loop iteration). In dataset distillation, the notion of 'distribution over tasks' corresponds to finding a dataset that can successfully train a network in many settings, such as different initial conditions [31]. Our distillation algorithm is a novel hybrid of these two families. We efficiently solve the base and meta-tasks simultaneously as per the online approaches, and are thus able to use more inner-loop steps. However, we also learn to solve many 'tasks' by detecting meta-overfitting and sampling a new 'task' when this occurs. This leads to an excellent combination of efficacy and efficiency.
Finally, most gradient-based meta-learning algorithms rely on costly and often unstable higher-order gradients [13, 11, 31], or else make simple shortest-path first-order approximations [21]. We are inspired by recent approaches in few-shot learning [5, 18] that avoid this issue through use of convex optimisation layers. We introduce the notion of a pseudo-gradient that enables this idea to scale beyond the few-shot setting to general meta-learning problems such as dataset distillation.
We aim to meta-learn synthetic labels for a fixed set of base examples that can be used to train a randomly initialized model. This corresponds to an objective such as:

\tilde{y}^* = \arg\min_{\tilde{y}} \; \ell\big( (x_t, y_t),\ \theta - \beta \nabla_\theta \ell((\tilde{x}, \tilde{y}), \theta) \big), \qquad (1)

where \tilde{y} are the distilled labels for base examples \tilde{x}, (x_t, y_t) are the target set examples and labels, and \theta is the neural network model learned with rate \beta and loss \ell. We learn soft labels, so the dimensionality of each label is the number of classes C to distill. One gradient step is shown above, but in general there may be multiple steps.
One option to achieve the objective in (Eq. 1) would be to follow [31] and simulate the whole training procedure for multiple gradient steps within the inner loop. However, this requires back-propagating through a long inner loop, and ultimately requires a fixed training schedule with optimized learning rates. We aim to produce a dataset that can be used in a standard training pipeline downstream (e.g. the Adam optimizer with the default parameters).
Our first modification to the standard pipeline is to perform gradient descent iteratively on the model and the distilled labels, rather than performing many inner (model) steps for each outer (dataset) step. This increases efficiency significantly due to a shorter compute graph for backpropagation. Nevertheless, when there are very few training examples, the model converges quickly to an over-fitted local minimum, likely within a few hundred iterations. To manage this, our innovation is to detect overfitting when it occurs, reset the model to a new random initialization and keep training. Specifically, we measure the moving average of target problem accuracy, and when it has not improved for a set number of iterations, we reset the model. This periodic reset of the model after a varying number of iterations is helpful for learning labels that are useful for all stages of training and thus less sensitive to the number of iterations used. To ensure scalability to any number of examples, we sample a minibatch of base examples and synthetic labels and use those to update the model, which also better aligns with standard training practice. Once label distillation is done, we train a new model from scratch using random initial weights, given the base examples and learned synthetic labels.

Algorithm 1 Label distillation with ridge regression (RR)
Input: x̃: N unlabeled examples from the base dataset; (x_t, y_t): labeled examples from the target dataset; β: step size; α: pseudo-gradient step size; n_o, n_i: outer (resp. inner) loop batch size; C: number of classes in the target dataset
Output: distilled labels ỹ and a reasonable number of training steps T_i
  ỹ ← 1/C                        // Uniformly initialize synthetic labels
  θ ∼ p(θ)                       // Randomly initialize feature extractor parameters
  w ← 0                          // Initialize global RR classifier weights
  while ỹ not converged do
    (x̃′, ỹ′) ∼ (x̃, ỹ)            // Sample a minibatch of n_i base examples
    (x′_t, y′_t) ∼ (x_t, y_t)      // Sample a minibatch of n_o target examples
    w_l = argmin_{w_l} L(f_{w_l}(f_θ(x̃′)), ỹ′)   // Solve RR problem
    w ← (1 − α)w + αw_l           // Update global RR classifier weights
    ỹ ← ỹ − β ∇_ỹ L(f_w(f_θ(x′_t)), y′_t)        // Update synthetic labels with target loss
    θ ← θ − β ∇_θ L(f_w(f_θ(x̃′)), ỹ′)            // Update feature extractor
    if L(f_w(f_θ(x_t)), y_t) did not improve then
      θ ∼ p(θ)                   // Reset feature extractor
      w ← 0                      // Reset global RR classifier weights
      T_i ← iterations since previous reset      // Record time to overfit
    end if
  end while

We propose and evaluate two label distillation algorithms: a second-order version that performs one update step within the inner loop; and a first-order version that uses a closed form solution of ridge regression to find optimal classifier weights for the base examples.
Second-order version
The training procedure includes both an inner and an outer loop. The inner loop consists of one update of the model, θ′ = θ − α ∇_θ L(f_θ(x̃′), ỹ′), through which we then backpropagate to update the synthetic labels ỹ. We perform a standard update of the model after updating the synthetic labels; this is more of a technical detail about how the method is implemented rather than a theoretical necessity, as it could be combined with the inner-loop update. The method is otherwise similar to our first-order ridge regression method, which is summarized in Algorithm 1, together with our notation. Note that examples from the target dataset are used when updating the synthetic labels ỹ, but for updating the model θ we only use synthetic labels and the base examples. After each update of the labels, we normalize them so that they represent a valid probability distribution. This makes them interpretable and has led to improvements compared to unnormalized labels.
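To make the second-order update concrete, the following is a minimal PyTorch sketch of one meta-iteration. A plain linear classifier on pre-computed features stands in for the full network, and the tensor names, step sizes and the particular renormalization scheme are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D, C = 10, 64, 10                                   # base examples, feature dim, classes
base_feats = torch.randn(N, D)                         # features of base examples x~
target_feats = torch.randn(128, D)                     # features of a target minibatch x_t
target_y = torch.randint(0, C, (128,))                 # true target labels y_t

syn_labels = torch.full((N, C), 1.0 / C).requires_grad_()   # y~, uniform initialization
W = (0.01 * torch.randn(D, C)).requires_grad_()              # toy "model" parameters theta
inner_lr, label_lr = 0.01, 0.001                             # illustrative step sizes

# Inner step: one model update on (x~, y~), kept in the graph (create_graph=True).
inner_loss = -(syn_labels * F.log_softmax(base_feats @ W, dim=1)).sum(dim=1).mean()
(grad_W,) = torch.autograd.grad(inner_loss, W, create_graph=True)
W_fast = W - inner_lr * grad_W                         # theta' = theta - alpha * grad

# Outer step: target loss of the adapted model, differentiated w.r.t. the labels y~.
target_loss = F.cross_entropy(target_feats @ W_fast, target_y)
(grad_labels,) = torch.autograd.grad(target_loss, syn_labels)

with torch.no_grad():                                  # gradient step on the labels
    syn_labels -= label_lr * grad_labels
    syn_labels.clamp_(min=0)                           # renormalize to a valid distribution
    syn_labels /= syn_labels.sum(dim=1, keepdim=True)
```

In the full method, Adam rather than plain SGD is used for both the label and model updates, and a standard model update on the base examples follows the label update, as described above.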
First-order version with ridge regression
To avoid second-order gradients, we propose a first-order version that uses a pseudo-gradient generated via a closed form solution to ridge regression. Ridge regression layers have previously been used for few-shot learning [5] within a minibatch. However, we extend them to learn global weights that persist across minibatches. Ridge regression can be defined and solved as:

W_l = \arg\min_W \|XW - Y\|^2 + \lambda \|W\|^2 = \left(X^T X + \lambda I\right)^{-1} X^T Y, \qquad (2)

where X are the image features, Y are the labels, W are the ridge regression weights and \lambda is the regularization parameter. Following Bertinetto et al. [5], we use the Woodbury formula [24]:

W_l = X^T \left(X X^T + \lambda I\right)^{-1} Y, \qquad (3)

which allows us to use the matrix XX^T, whose dimensionality depends on the square of the number of inputs (minibatch size) rather than the square of the input embedding size. The embedding size is typically significantly larger than the size of the minibatch, so it would make matrix inversion costly. While ridge regression (RR) is obviously oriented at regression problems, it has been shown [5] to work well for classification when regressing label vectors.
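A minimal sketch of the Woodbury-form solve (Eq. 3), followed by the pseudo-gradient update on the global classifier weights from Algorithm 1. The shapes, regularization strength and step size here are illustrative assumptions, and a reasonably recent PyTorch with `torch.linalg` is assumed.

```python
import torch

def ridge_regression_weights(X, Y, lam=1.0):
    """Closed-form RR weights via the Woodbury identity (Eq. 3).

    X: (n, d) minibatch embeddings, Y: (n, C) soft-label targets.
    Inverts an (n, n) matrix instead of a (d, d) one, which is cheaper
    when the minibatch size n is much smaller than the embedding size d.
    """
    n = X.shape[0]
    A = X @ X.t() + lam * torch.eye(n, device=X.device)   # (n, n) system
    return X.t() @ torch.linalg.solve(A, Y)               # W_l: (d, C)

# Illustrative shapes: a minibatch of 50 embeddings of size 192, 10 classes.
X = torch.randn(50, 192)
Y = torch.full((50, 10), 0.1)
W_local = ridge_regression_weights(X, Y)

# Pseudo-gradient step on the global weights (Algorithm 1):
# w <- (1 - alpha) * w + alpha * w_l, equivalently w <- w - alpha * (w - w_l).
alpha = 0.01
w_global = torch.zeros(192, 10)
w_global = (1 - alpha) * w_global + alpha * W_local
print(W_local.shape, w_global.shape)   # torch.Size([192, 10]) torch.Size([192, 10])
```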
Ridge regression with pseudo-gradients
We use an RR layer as the final output layer of our base network. This solves for the optimal (local) weights w_l that classify the features of the current minibatch examples. We exploit this local minibatch solution by taking a pseudo-gradient step that updates the global weights w as w ← (1 − α)w + αw_l, with α being the pseudo-gradient step size. We can understand this as a pseudo-gradient because it corresponds to the step w ← w − α(w − w_l). We can then update the synthetic labels by back-propagating through the local weights w_l. Subsequent feature extractor updates on θ avoid second-order gradients. The full process is summarised in Algorithm 1.

We perform two main types of experiments: (1) within-dataset distillation, where the base examples come from the target dataset, and (2) cross-dataset distillation, where the base examples come from a different but related dataset. The dataset should be related because if there is a large shift in the domain (e.g. from characters to photos), then the feature extractor trained on the base examples would generalize poorly to the target dataset. We use MNIST, CIFAR-10 and CIFAR-100 for the task of within-dataset distillation, while for cross-dataset distillation we use EMNIST ("English letters"), KMNIST, Kuzushiji-49 (both "Japanese letters"), MNIST (digits), CUB (birds) and CIFAR-10 (general objects). Details of these datasets are in the supplementary material.
We use parameters N_M, N_W to specify over how many iterations to calculate the moving average and how many iterations to wait before a reset after the best moving average value. We select N_M = N_W and use a value of 50 steps in most cases, while we use 100 for CIFAR-100 and Kuzushiji-49, and 200 for other larger-scale experiments (more than 100 base examples).
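A sketch of this moving-average reset criterion, assuming equal window sizes N_M = N_W as in our experiments; the helper names in the usage comment are hypothetical.

```python
from collections import deque
from statistics import fmean

class OverfitDetector:
    """Signal a model reset when the moving average of target accuracy
    has not improved for `window` iterations (N_M = N_W = window)."""

    def __init__(self, window=50):
        self.window = window
        self.recent = deque(maxlen=window)      # last N_M accuracy values
        self.best_avg = float("-inf")
        self.since_best = 0

    def should_reset(self, target_accuracy):
        self.recent.append(target_accuracy)
        avg = fmean(self.recent)
        if avg > self.best_avg:
            self.best_avg, self.since_best = avg, 0
        else:
            self.since_best += 1
        return self.since_best >= self.window   # waited N_W steps with no improvement

# Usage inside the distillation loop (sketch):
# if detector.should_reset(acc):
#     T_i = step - last_reset_step             # record time-to-overfit
#     model.apply(reinit_weights)              # hypothetical re-initialization helper
#     detector = OverfitDetector(window=50)
```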
Early stopping for learning synthetic labels
We update the synthetic labels for a given number of epochs and then select the best labels to use based on the validation performance. To estimate the validation performance, we train a new model from scratch using the current distilled labels and the associated base examples and then evaluate the model on the validation part of the target set. We randomly set aside about 10-15% (depending on the dataset) of the training data for validation.
Models
We use LeNet [17] for MNIST and similar experiments, and AlexNet [15] for CIFAR-10, CIFAR-100 and CUB. Both models are identical to the ones used in [31]. In a fully supervised setting they achieve about 99% and 80% test accuracy on MNIST and CIFAR-10 respectively.
Selection of base examples
The base examples are selected randomly, using a shared random seed for consistency across scenarios. Our baseline models use the same random seed as the distillation models, so they share base examples for fair comparison. For within-dataset label distillation, we create a balanced set of base examples, so each class has the same number of base examples. For cross-dataset label distillation, we do not consider the original classes of base examples. The size of the label space and the labels are different in the source and the target problem.
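For instance, a class-balanced random draw of base examples with a shared seed could be implemented as below; the toy label tensor and per-class count are illustrative only.

```python
import torch

def sample_balanced_base_examples(labels, per_class, seed=0):
    """Return indices of a class-balanced random subset (a sketch).

    labels: (N,) tensor of integer class labels of the source dataset.
    """
    g = torch.Generator().manual_seed(seed)            # shared seed for consistency
    picked = []
    for c in labels.unique().tolist():
        idx = (labels == c).nonzero(as_tuple=True)[0]   # indices of class c
        perm = torch.randperm(idx.numel(), generator=g)[:per_class]
        picked.append(idx[perm])
    return torch.cat(picked)

# Example: 10 base examples per class from random toy labels.
labels = torch.randint(0, 10, (60000,))
base_idx = sample_balanced_base_examples(labels, per_class=10)
print(base_idx.shape)   # torch.Size([100])
```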
Further details
The outer loop minibatch size is n_o = 1024 examples, while the inner minibatch size n_i depends on the number of base examples used. When using 100 or more base examples, we select a minibatch of 50 examples, except for CIFAR-100, for which we use 100 examples. For 10, 20 and 50 base examples our minibatch sizes are 10, 10 and 25. We optimize the synthetic labels and the model itself using the Adam optimizer with the default parameters. Most of our models are trained for 400 epochs, while larger-scale models (more than 100 base examples and CIFAR-100) are trained for 800 epochs. Smaller-scale Kuzushiji-49 experiments are trained for 100 epochs, while larger-scale ones use 200 epochs. In the second-order version, we perform one inner-loop step update with learning rate α. We back-propagate through the inner-loop update when updating the synthetic labels (meta-knowledge), but not when subsequently updating the model θ with the Adam optimizer. In the first-order version, we use a pseudo-gradient step size of 0.01 and λ of 1.0. We calibrate the regression weights by scaling them with a value learned during training with the specific set of base examples and distilled labels. Our tables report the mean test accuracy and standard deviation (%) across 20 models trained from scratch using the base examples and synthetic labels.

We compare our label distillation (LD) to previous dataset distillation (DD) and soft-label dataset distillation (SLDD) on MNIST and CIFAR-10. We also establish new baselines that take true labels
from the target dataset and are otherwise trained in the same way as label distillation models. RR baselines use RR and the pseudo-gradient for consistency with LD RR (the overall network architecture remains the same as in the second-order approach). In addition, we include baselines that use label smoothing (LS) [28] with a smoothing parameter of 0.1 as suggested in [23]. The results in Table 1 show that LD significantly outperforms previous work on MNIST. This is in part due to LD enabling the use of more steps (LD estimates T_i of a few hundred steps vs a fixed 3 epochs of 10 steps in DD and SLDD). Our improved baselines (with the number of steps chosen using the validation set, between 50 and 1000 steps) are also competitive, and outperform the prior baselines in [31] due to taking more steps. However, since [31, 27] are difficult to scale to more steps, this is a reasonable comparison. The label smoothing baseline works well on MNIST for a large number of base examples, where the problem anyway approaches one of conventional supervised learning. However, this strategy has not proven effective enough for CIFAR-10, where synthetic labels are the best. Importantly, in the most intensively distilled regime of 10 examples, LD clearly outperforms all competitors. We provide an analysis of the labels learned by our method in Section 4.5. For CIFAR-10 our results also improve on the original DD result. In this experiment, our second-order algorithm performs similarly to our RR pseudo-gradient strategy.

Table 1: Within-dataset distillation recognition accuracy (%). Our label distillation (LD) outperforms prior Dataset Distillation (DD) [31] and SLDD [27], and scales to synthesizing more examples.
Table 2: CIFAR-100 within-dataset distillation. One example per class.

For cross-dataset distillation, we considered four scenarios: from EMNIST letters to MNIST digits, from EMNIST letters ("English") to Kuzushiji-MNIST or Kuzushiji-49 characters ("Japanese"), and from CUB bird species to CIFAR-10 general categories. The results in Table 3 show that we are able to distill labels on examples of a different source dataset and achieve surprisingly good performance on the target problem, given that no target data is used when training these models. In contrast, directly applying a trained source-task model to the target without distillation unsurprisingly leads to chance performance (about 2% test accuracy for Kuzushiji-49 and 10% for all other cases). These results show that we can indeed distill the knowledge of one dataset into base examples from a different but related dataset through crafting synthetic labels. Furthermore, our RR approach surpasses the second-order method in most cases, confirming its value.

Table 3: Cross-dataset distillation recognition accuracy (%). Datasets: E = EMNIST, M = MNIST, K = KMNIST, B = CUB, C = CIFAR-10, K-49 = Kuzushiji-49.
Figure 2: Datasets distilled with LD are more flexible than those distilled with DD: (a) sensitivity to the number of optimization steps, (b) sensitivity to optimization parameters, (c) cross-architecture transfer.
We conduct experiments to verify the flexibility and general applicability of our label-distilled dataset compared to the image-distilled alternative of [31]. Namely, we looked at: (1) how the number of steps used during meta-testing affects the performance of learning with the distilled data, and in particular sensitivity to deviation from the number of steps used during meta-training of DD and LD; (2) sensitivity of the models to changes in optimization parameters between meta-training and meta-testing; (3) how well the distilled datasets transfer to neural architectures different to those used for training. We used the DD implementation by the authors of [31] for fair evaluation. Figure 2 summarizes the key points, while tables with details are in the supplementary material.
Sensitivity to the number of meta-testing steps (Figure 2a)
A key feature of our method is that it is relatively insensitive to the number of steps used to train a model on the distilled data. Our results from Table 8 show that even if we do 50 steps fewer or 100 steps more than the number estimated during training, the testing performance does not change significantly. This is in contrast with previous DD and SLDD methods that must be trained for a specific number of steps with optimized learning rates, making them rather inflexible. If the number of steps changes even by as little as 20%, they incur a significant cost in accuracy. Table 9 provides a further analysis of the sensitivity of DD.

Sensitivity to optimization parameters (Figure 2b)
DD [31] uses step-wise meta-optimized learning rates to maximize accuracy. Our analysis in Table 10 shows that if we use an average of the optimized learning rates rather than the specific value in each step (a more general optimizer), the performance decreases significantly. Further, original DD relies on training the distilled data in a fixed sequence, and our analysis in Table 11 shows that changing the order of examples leads to a large decrease in accuracy. Our LD method by design does not depend on the specific order of examples, and it can be used with off-the-shelf optimizers such as Adam to train the model, rather than with optimizers whose learning rates are specific to the example and step.
Transferability of labels across architectures (Figure 2c)
We study the impact of distilling the labels using AlexNet and then training AlexNet, LeNet and ResNet-18 using these distilled labels. We establish new baselines for these additional architectures, using a number of steps chosen based on the validation set. Our results in Tables 12 and 13 suggest our labels are helpful in both within- and across-dataset distillation scenarios. This suggests that even if we relabel a dataset with one label space to solve a different problem, the labels generalize to a new architecture and lead to improvements over the baselines. We have done experiments with original DD to find how robust it is to the change of architecture (Table 14). As expected, there is a large decrease in accuracy, but the results are better than the baselines they report [31], so their synthetic images are somewhat transferable when we train the model with the specific order of examples and optimized learning rates. However, our LD method incurs a smaller decrease in accuracy, suggesting better transferability across architectures.
In Table 4 we compare training times of LD and the original DD approach (using the same settings and hardware). Our LD involves meta-learning fewer parameters, so for a fair comparison we have also trained a version of our approach that creates synthetic images rather than labels (our DD). The results show that our online approach significantly accelerates training. Our version of DD only takes a few more minutes to run than LD. However, our paper focuses on LD because our version of DD was relatively unstable and led to significantly worse performance than LD, perhaps because learning synthetic images is more complex than learning synthetic labels.

Table 4: Comparison of training times of DD [31] and our LD (minutes).
            MNIST   CIFAR-10
DD          116     205
LD          61      86
LD (RR)     65      98
Our DD      67      96
Our DD (RR) 72      90
Analysis of synthetic labels
We have analysed to what extent the synthetic labels learn the true label and how this differs between our second-order and RR methods (Figures 8 and 9 in the supplementary material). The results for a MNIST experiment with 100 base examples show the second-order method recovers the true value to about 84% on average, so it could be viewed as non-uniform label smoothing with meta-learned weights on labels specific to the examples. For the same scenario, RR recovers the true value to about 63% on average. Both methods achieved similar test accuracy, which suggests our RR method leads to distilled labels with more non-trivial information. Moreover, visually similar digits such as 4 and 9 receive more weight compared to other combinations. Such multiplexing of labels presumably underpins the improved performance compared to the baseline. For CIFAR-10, the labels are significantly more complex, indicating non-trivial information is embedded into the synthetic labels. For cross-dataset LD (EMNIST to KMNIST), in many cases a letter had similar labels across various base examples, but in a large number of cases the pattern was less clear. Examples are shown in the supplementary material.
Figure 3: RR-based LD reduces synthetic label meta-gradient variance Var[∇_ỹ L] (log variance over the number of steps, second-order vs pseudo-gradient RR).

Pseudo-gradient analysis
To understand the superior performance of our RR method, we analysed the variances of meta-gradients obtained by our second-order and pseudo-gradient methods. Results in Figure 3 show that our pseudo-gradient RR method obtains a significantly lower variance of meta-knowledge gradients, which underpins its stable and effective training.
Discussion
Our LD algorithm provides a more effective and more flexible distillation approach than prior work. This brings us one step closer to the vision of leveraging distilled data to accelerate model training or design – such as architecture search [9]. Currently we only explicitly randomize over network initializations during training. In future work we believe our strategy of multi-step training with reset at convergence could be used with other factors, such as randomly selected network architectures, to further improve cross-network generalisation performance.
We have introduced a new label distillation algorithm for distilling the knowledge of a large dataset into synthetic labels of a few base examples from the same or a different dataset. Our method improves on prior dataset distillation results, scales better to larger problems, and enables novel settings such as cross-dataset distillation. Most importantly, it is significantly more flexible in terms of distilling general purpose datasets that can be used downstream with off-the-shelf optimizers.

Broader Impact
We propose a flexible and efficient distillation scheme that gets us closer to the goal of practically useful dataset distillation. Dataset distillation could ultimately lead to a beneficial impact in terms of researchers' time efficiency by enabling faster experimentation when training on small distilled datasets. Perhaps more importantly, it could reduce the environmental impact of AI research and development by reducing energy costs [25]. However, our results are not strong enough yet; for this goal to be realised, better distillation methods leading to more accurate downstream models need to be developed.
In future, label distillation could speculatively provide a useful tool for privacy-preserving learning [1], e.g. in situations where organizations want to learn from users' data. For example, a company could provide a small set of public (source) data to a user, who performs cross-dataset distillation using their private (target) data to train a model. The user could then return the distilled labels on the public data, which would allow the company to re-create the user's model. In this way the knowledge from the user's training could be obtained by the company in the form of distilled labels – without directly sending any private data, or a trained model that could be a vector for memorization attacks [6].
On the negative side, detecting and understanding the impact of bias in datasets is an important yet already very challenging issue for machine learning. The impact of dataset distillation on any underlying biases in the data is completely unclear. If people were to train models on distilled datasets in future, it would be important to understand the impact of distillation on data biases.
Source Code
We provide a PyTorch implementation of our approach at https://github.com/ondrejbohdal/label-distillation.

Acknowledgments and Disclosure of Funding
This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.
References
[1] Al-Rubaie, M. and Chang, J. M. (2019). Privacy-preserving machine learning: threats and solutions. IEEE Security and Privacy, 17(2):49–58.
[2] Angelova, A., Abu-Mostafa, Y., and Perona, P. (2005). Pruning training sets for learning of object categories. In CVPR.
[3] Bachem, O., Lucic, M., and Krause, A. (2017). Practical coreset constructions for machine learning. In arXiv.
[4] Balaji, Y., Sankaranarayanan, S., and Chellappa, R. (2018). MetaReg: towards domain generalization using meta-regularization. In NeurIPS.
[5] Bertinetto, L., Henriques, J., Torr, P. H. S., and Vedaldi, A. (2019). Meta-learning with differentiable closed-form solvers. In ICLR.
[6] Carlini, N., Liu, C., Erlingsson, U., Kos, J., and Song, D. (2019). The secret sharer: evaluating and testing unintended memorization in neural networks. In USENIX Security Symposium.
[7] Clanuwat, T., Kitamoto, A., Lamb, A., Yamamoto, K., and Ha, D. (2018). Deep learning for classical Japanese literature. In NeurIPS Workshop on Machine Learning for Creativity and Design.
[8] Cohen, G., Afshar, S., Tapson, J., and van Schaik, A. (2017). EMNIST: an extension of MNIST to handwritten letters. In arXiv.
[9] Elsken, T., Metzen, J. H., and Hutter, F. (2019). Neural architecture search: a survey. Journal of Machine Learning Research, 20:1–21.
[10] Felzenszwalb, P. F., Girshick, R. B., McAllester, D., and Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645.
[11] Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In ICML.
[12] Hinton, G., Vinyals, O., and Dean, J. (2014). Distilling the knowledge in a neural network. In NIPS.
[13] Hospedales, T., Antoniou, A., Micaelli, P., and Storkey, A. (2020). Meta-learning in neural networks: a survey. In arXiv.
[14] Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report.
[15] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In NIPS.
[16] Lapedriza, A., Pirsiavash, H., Bylinskii, Z., and Torralba, A. (2013). Are all training examples equally valuable? In arXiv.
[17] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
[18] Lee, K., Maji, S., Ravichandran, A., and Soatto, S. (2019). Meta-learning with differentiable convex optimization. In CVPR.
[19] Li, Y., Yang, Y., Zhou, W., and Hospedales, T. M. (2019). Feature-critic networks for heterogeneous domain generalization. In ICML.
[20] Micaelli, P. and Storkey, A. (2019). Zero-shot knowledge transfer via adversarial belief matching. In NeurIPS.
[21] Nichol, A., Achiam, J., and Schulman, J. (2018). On first-order meta-learning algorithms. In arXiv.
[22] Olvera-López, J. A., Carrasco-Ochoa, J. A., Martínez-Trinidad, J. F., and Kittler, J. (2010). A review of instance selection methods. Artificial Intelligence Review, 34(2):133–143.
[23] Pereyra, G., Tucker, G., Chorowski, J., Kaiser, L., and Hinton, G. (2017). Regularizing neural networks by penalizing confident output distributions. In arXiv.
[24] Petersen, K. B. and Pedersen, M. S. (2012). The matrix cookbook. Technical report.
[25] Schwartz, R., Dodge, J., Smith, N. A., and Etzioni, O. (2019). Green AI. In arXiv.
[26] Sener, O. and Savarese, S. (2018). Active learning for convolutional neural networks: a core-set approach. In ICLR.
[27] Sucholutsky, I. and Schonlau, M. (2019). Soft-label dataset distillation and text dataset distillation. In arXiv.
[28] Szegedy, C., Vanhoucke, V., Ioffe, S., and Shlens, J. (2016). Rethinking the Inception architecture for computer vision. In CVPR.
[29] Tsang, I. W., Kwok, J. T., and Cheung, P.-M. (2005). Core vector machines: fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363–392.
[30] Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 dataset. Technical report.
[31] Wang, T., Zhu, J.-Y., Torralba, A., and Efros, A. A. (2018). Dataset distillation. In arXiv.
A Datasets
We use the MNIST [17], EMNIST [8], KMNIST and Kuzushiji-49 [7], CIFAR-10 and CIFAR-100 [14], and CUB [30] datasets. Example images are shown in Figure 4. MNIST includes images of 70000 handwritten digits that belong to 10 classes. The EMNIST dataset includes various characters, but we choose the EMNIST letters split that includes only letters. Lowercase and uppercase letters are combined together into 26 balanced classes (145600 examples in total). KMNIST (Kuzushiji-MNIST) is a dataset that includes images of 10 classes of cursive Japanese (Kuzushiji) characters and is of the same size as MNIST. Kuzushiji-49 is a larger version of KMNIST with 270912 examples and 49 classes. CIFAR-10 includes 60000 colour images of various general objects, for example airplanes, frogs or ships. As the name indicates, there are 10 classes. CIFAR-100 is like CIFAR-10, but has 100 classes with 600 images for each of them. Every class belongs to one of 20 superclasses which represent more general concepts. CUB includes colour images of 200 bird species. The number of images is relatively small, only 11788. All datasets except Kuzushiji-49 are balanced or almost balanced.
Figure 4: Example images from the different datasets that we use (MNIST, EMNIST, Kuzushiji, CIFAR-10, CUB).
B Additional experimental details
Normalization
We normalize greyscale images using the standard normalization for MNIST (mean of 0.1307 and standard deviation of 0.3081). All our greyscale images are of size 28 × 28. Colour images are normalized using CIFAR-10 normalization (means of about 0.4914, 0.4822, 0.4465, and standard deviations of about 0.247, 0.243, 0.261 across channels). All colour images are reshaped to be of size 32 × 32.
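A minimal torchvision sketch of this preprocessing, assuming standard transforms and the image sizes stated above; the released code may organize the pipeline differently.

```python
from torchvision import transforms

# Greyscale datasets: MNIST-style statistics, 28x28 inputs.
greyscale_transform = transforms.Compose([
    transforms.Resize((28, 28)),
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

# Colour datasets: CIFAR-10 statistics, images reshaped to 32x32.
colour_transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261)),
])
```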
Computational resources
Each experiment was done on a single GPU, in almost all cases an NVIDIA 2080 Ti. Shorter (400 epochs) experiments took about 1 or 2 hours to run, while longer (800 epochs) experiments took between 2 and 4 hours.
In addition, Figure 5 illustrates the difference between a standard model used for second-order label distillation and a model that uses global ridge regression classifier weights (used for first-order RR label distillation). The two models are almost identical – only the final linear layer is different.
C Additional experiments
Stability and dependence on choice of base examples
To evaluate the consistency of our results, we repeat the entire pipeline and report the results in Table 5. In the previous experiments, we used one randomly chosen but fixed set of base examples per source task. We investigate the impact of base example choice by drawing further random base example sets. The results in Table 6 suggest that the impact of base example choice is slightly larger than that of the variability due to the distillation process, but still small overall. Note that the ± standard deviations in all cases quantify the impact of retraining from different random initializations at meta-test, given a fixed base set and completed distillation. It is likely that if the base examples were not selected randomly, the impact of using specific base examples would be larger. In fact, future work could investigate how to choose base examples so that learning synthetic labels for them improves the result further.

Figure 5: Comparison of a standard model used for second-order label distillation and a model that uses global ridge regression classifier weights (used for first-order RR label distillation). Both consist of a CNN feature extractor followed by linear layers with ReLU non-linearities; the standard model ends with a final linear layer, while the RR model ends with the global RR classifier weights.

Dependence on target dataset size
Our experiments use a relatively large target dataset (about 50000 examples) for meta-learning. We study the impact of reducing the amount of target data for distillation in Table 7. Using 5000 or more examples (about 10% of the original size) is enough to achieve comparable performance.
Transferability of RR synthetic labels to standard model training
When using RR, we train the validation and test models with RR and global classifier weights obtained using the pseudo-gradient. In this experiment we study what happens if we create synthetic labels with RR, but do validation and testing with standard models trained from scratch without RR. For a fair comparison, we use the same synthetic labels for training a new RR model and a new standard model. Validation for early stopping is done with a standardly trained model. The results in Table 15 suggest RR labels are largely transferable (even in cross-dataset scenarios), but there is some decrease in performance. Consequently, it is better to learn the synthetic labels using the second-order approach if we want to train a standard model without RR during testing (comparing with the results in Tables 1, 2 and 3).
Results of analysis
Our tables report the mean test accuracy and standard deviation (%) across 20 models trained from scratch using the base examples and synthetic labels. When analysing original DD, 200 randomly initialized models are used.

Table 5: Repeatability. Label distillation is quite repeatable: the performance change from repeating the whole distillation learning and subsequent re-training is small. We used 100 base examples for these experiments. Datasets: E = EMNIST, M = MNIST.

Table 6: Dependence on the choice of base examples (five randomly drawn sets).

Table 7: Dependence on target set size. Around 5000 examples (≈ 10% of all data) is sufficient. As before, we used 100 base examples; using all examples means using 50000 examples.

Table 8: Sensitivity to the number of training steps at meta-testing. We re-train the model with different numbers of steps than estimated during meta-training. The results show our method is relatively insensitive to the number of steps. The default number of steps T_i (the "+ 0" column) was estimated as 278 for MNIST (LD), 217 for MNIST (LD RR), 364 for E → M (LD) and 311 for E → M (LD RR). The scenario with 100 base examples is reported.
Table 9: Sensitivity of original DD to the number of steps. DD is very sensitive to using the specific number of steps. We take the first N steps, keep their original learning rates, and assign learning rates of 0 to the remaining steps. When we do 5 more steps than the original (30), we perform the final 5 steps with an average learning rate.
Table 13: Transferability of distilled labels across different architectures (RR method). The upper part of the table shows the performance of various test models when trained on distilled labels synthesised with AlexNet only. The middle part shows the baseline performance of training models with different architectures on true labels. The lower part shows that distilled labels work even in the cross-dataset scenario (labels trained with AlexNet only). The results clearly suggest the distilled labels generalize across different architectures. Note that lower RR results for ResNet-18 may be caused by the significantly lower dimensionality of the RR layer (64 features + 1 for bias), while AlexNet and LeNet have 192 features + 1 for bias in the RR layer.
Figure 6: Examples of distilled labels for both second-order and RR label distillation. The upper part shows base examples and their distilled labels for the within-dataset MNIST scenario with 10 base examples. The lower part shows EMNIST (“English”) base examples and their synthetic labels that allow the model to learn to classify KMNIST (“Japanese”) characters – 10 base examples used.