The Early Phase of Neural Network Training

Abstract

Recent studies have shown that many important aspects of neural network learning take place within the very earliest iterations or epochs of training. For example, sparse, trainable sub-networks emerge (Frankle et al., 2019), gradient descent moves into a small subspace (Gur-Ari et al., 2018), and the network undergoes a critical period (Achille et al., 2019). Here, we examine the changes that deep neural networks undergo during this early phase of training. We perform extensive measurements of the network state during these early iterations of training and leverage the framework of Frankle et al. (2019) to quantitatively probe the weight distribution and its reliance on various aspects of the dataset. We find that, within this framework, deep networks are not robust to reinitializing with random weights while maintaining signs, and that weight distributions are highly non-independent even after only a few hundred iterations. Despite this behavior, pre-training with blurred inputs or an auxiliary self-supervised task can approximate the changes in supervised networks, suggesting that these changes are not inherently label-dependent, though labels significantly accelerate this process. Together, these results help to elucidate the network changes occurring during this pivotal initial period of learning.


Published as a conference paper at ICLR 2020

Jonathan Frankle† (MIT CSAIL)

David J. Schwab (CUNY ITS; Facebook AI Research)

Ari S. Morcos (Facebook AI Research)


1 INTRODUCTION

Over the past decade, methods for successfully training big, deep neural networks have revolutionized machine learning. Yet surprisingly, the underlying reasons for the success of these approaches remain poorly understood, despite remarkable empirical performance (Santurkar et al., 2018; Zhang et al., 2017). A large body of work has focused on understanding what happens during the later stages of training (Neyshabur et al., 2019; Yaida, 2019; Chaudhuri & Soatto, 2017; Wei & Schwab, 2019), while the initial phase has been less explored. However, a number of distinct observations indicate that significant and consequential changes are occurring during the earliest stage of training. These include the presence of critical periods during training (Achille et al., 2019), the dramatic reshaping of the local loss landscape (Sagun et al., 2017; Gur-Ari et al., 2018), and the necessity of rewinding in the context of the lottery ticket hypothesis (Frankle et al., 2019). Here we perform a thorough investigation of the state of the network in this early stage.

To provide a unified framework for understanding the changes the network undergoes during the early phase, we employ the methodology of iterative magnitude pruning with rewinding (IMP), as detailed below, throughout the bulk of this work (Frankle & Carbin, 2019; Frankle et al., 2019). The initial lottery ticket hypothesis, which was validated on comparatively small networks, proposed that small, sparse sub-networks found via pruning of converged larger models could be trained to high performance provided they were initialized with the same values used in the training of the unpruned model (Frankle & Carbin, 2019). However, follow-up work found that rewinding the weights to their values at some iteration early in the training of the unpruned model, rather than to their initial values, was necessary to achieve good performance on deeper networks such as ResNets (Frankle et al., 2019). This observation suggests that the changes in the network during this initial phase are vital for the success of the training of small, sparse sub-networks. As a result, this paradigm provides a simple and quantitative scheme for measuring the importance of the weights at various points early in training within an actionable and causal framework.

† Work done while an intern at Facebook AI Research.

We make the following contributions, all evaluated across three different network architectures:

1. We provide an in-depth overview of various statistics summarizing learning over the early part of training.
2. We evaluate the impact of perturbing the state of the network in various ways during the early phase of training, finding that:
(i) counter to observations in smaller networks (Zhou et al., 2019), deeper networks are not robust to reinitialization with random weights while maintaining signs;
(ii) the distribution of weights after the early phase of training is already highly non-i.i.d., as permuting them dramatically harms performance, even when signs are maintained;
(iii) both of the above perturbations can roughly be approximated by simply adding noise to the network weights, though this effect is stronger for (ii) than (i).
3. We measure the data-dependence of the early phase of training, finding that pre-training using only p(x) can approximate the changes that occur in the early phase of training, though pre-training must last for far longer (∼32× longer) and not be fed misleading labels.

2 KNOWN PHENOMENA IN THE EARLY PHASE OF TRAINING

[Figure 1 plot: ResNet-20 (CIFAR-10); y-axis: Eval Accuracy (%); curves: Rewind to 0, 100, 250, 500, 1000, 2000]

Figure 1: Accuracy of IMP when rewinding to various iterations of the early phase for ResNet-20 sub-networks as a function of sparsity level.

Lottery ticket rewinding:

The original lottery ticket paper (Frankle & Carbin, 2019) rewound weights to initialization, i.e., k = 0, during IMP. Follow-up work on larger models demonstrated that it is necessary to rewind to a later point during training for IMP to succeed, i.e., k ≪ T, where T is the total number of training iterations (Frankle et al., 2019). Notably, the benefit of rewinding to a later point in training saturates quickly, roughly between 250 and 1000 iterations for ResNet-20 on CIFAR-10 (Figure 1). This timescale is strikingly similar to that of the changes in the Hessian described below.

Hessian eigenspectrum:

The shape of the loss landscape around the network state also appears to change rapidly during the early phase of training (Sagun et al., 2017; Gur-Ari et al., 2018). At initialization, the Hessian of the loss contains a number of large positive and negative eigenvalues. However, very rapidly the curvature is reshaped in a few marked ways: a few large eigenvalues emerge, the bulk eigenvalues are close to zero, and the negative eigenvalues become very small. Moreover, once the Hessian spectrum has reshaped, gradient descent appears to occur largely within the top subspace of the Hessian (Gur-Ari et al., 2018). These results have been largely confirmed in large-scale studies (Ghorbani et al., 2019), but note that they depend to some extent on architecture and the (absence of) batch normalization (Ioffe & Szegedy, 2015). A notable exception to this consistency is the presence of substantial L1 energy of negative eigenvalues for models trained on ImageNet.

Critical periods in deep learning:

Achille et al. (2019) found that perturbing the training process by providing corrupted data early on in training can result in irrevocable damage to the final performance of the network. Note that the timescales over which the authors find a critical period extend well beyond those we study here. However, architecture, learning rate schedule, and regularization all modify the timing of the critical period, and follow-up work found that critical periods were also present for regularization, in particular weight decay and data augmentation (Golatkar et al., 2019).
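To make the rewinding terminology under "Lottery ticket rewinding" concrete, the following is a minimal sketch of an IMP loop with rewinding on a flat list of weights. It is an illustration under stated assumptions, not the authors' implementation: `toy_train` is a hypothetical stand-in for real training (it simply nudges unmasked weights toward a fixed target), and the names `imp_with_rewinding`, `rewind_iter`, and `prune_frac` are invented here. The 20% per-round pruning fraction matches the value given in the methodology section.

```python
import random

def toy_train(weights, mask, start_iter, target, total_iters=10, lr=0.1):
    """Hypothetical stand-in for training: moves unmasked weights toward `target`.

    Returns a dict mapping each iteration to the weights held before that
    iteration's update, plus the key "final" for the weights after training.
    """
    history = {}
    w = list(weights)
    for it in range(start_iter, total_iters):
        history[it] = list(w)
        w = [m * (wi + lr * (t - wi)) for wi, t, m in zip(w, target, mask)]
    history["final"] = w
    return history

def imp_with_rewinding(init, target, rewind_iter, rounds, prune_frac=0.2):
    """Iteratively prune the smallest surviving weights and rewind to iteration k."""
    n = len(init)
    mask = [1] * n
    # Train the dense network once and remember the weights at iteration k.
    dense = toy_train(init, mask, 0, target)
    rewind_weights = dense[rewind_iter]
    final = dense["final"]
    for _ in range(rounds):
        alive = [i for i in range(n) if mask[i]]
        n_prune = int(prune_frac * len(alive))
        # Global magnitude pruning: drop the lowest-magnitude surviving weights.
        for i in sorted(alive, key=lambda j: abs(final[j]))[:n_prune]:
            mask[i] = 0
        # Rewind survivors to their iteration-k values and retrain from there.
        start = [w * m for w, m in zip(rewind_weights, mask)]
        final = toy_train(start, mask, rewind_iter, target)["final"]
    return mask, rewind_weights

rng = random.Random(0)
target = [rng.gauss(0, 1) for _ in range(100)]
init = [rng.gauss(0, 0.1) for _ in range(100)]
mask, rewind_weights = imp_with_rewinding(init, target, rewind_iter=3, rounds=4)
print(sum(mask), "of", len(mask), "weights remain")
```

Setting `rewind_iter=0` recovers the original lottery ticket setting (k = 0); the sub-network's starting point is the elementwise product of `mask` and `rewind_weights`.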

3 PRELIMINARIES AND METHODOLOGY

Networks:

Throughout this paper, we study five standard convolutional neural networks for CIFAR-10. These include the ResNet-20 and ResNet-56 architectures designed for CIFAR-10 (He et al., 2015), the ResNet-18 architecture designed for ImageNet but commonly used on CIFAR-10 (He et al., 2015), the WRN-16-8 wide residual network (Zagoruyko & Komodakis, 2016), and the VGG-13 network (Simonyan & Zisserman (2015) as adapted by Liu et al. (2019)). Throughout the main body of the paper, we show ResNet-20; in Appendix B, we present the same experiments for the other networks. Unless otherwise stated, results were qualitatively similar across all three networks. All experiments in this paper display the mean and standard deviation across five replicates with different random seeds. See Appendix A for further model details.

[Figure 2 timeline (x-axis: training iterations). Annotations: gradient magnitudes are very large, rapid motion in weight space, large sign changes (3% in first 10 iterations); gradient magnitudes reach a minimum before stabilizing 10% higher, motion in weight space slows but is still rapid, performance increases rapidly, reaching over 50% eval accuracy; gradients converge to roughly constant magnitude, weight magnitudes increase linearly, performance increase decelerates but is still rapid, rewinding begins to become highly effective; gradient magnitudes remain roughly constant, weight magnitude increase slows, performance slowly asymptotes, benefit of rewinding saturates.]

Figure 2: Rough timeline of the early phase of training for ResNet-20 on CIFAR-10.

Iterative magnitude pruning with rewinding:

In order to test the effect of various hypotheses about the state of sparse networks early in training, we use the Iterative Magnitude Pruning with rewinding (IMP) procedure of Frankle et al. (2019) to extract sub-networks from various points in training that could have learned on their own. The procedure involves training a network to completion, pruning the 20% of weights with the lowest magnitudes globally throughout the network, and rewinding the remaining weights to their values from an earlier iteration k during the initial, pre-pruning training run. This process is iterated to produce networks with high sparsity levels. As demonstrated in Frankle et al. (2019), IMP with rewinding leads to sparse sub-networks which can train to high performance even at high sparsity levels.

Figure 1 shows the results of the IMP with rewinding procedure: the accuracy of ResNet-20 at increasing sparsity when performing this procedure for several rewinding values of k. For k ≥ 500, sub-networks can match the performance of the original network with 16.8% of weights remaining. For k > 2000, essentially no further improvement is observed (not shown).

4 THE STATE OF THE NETWORK EARLY IN TRAINING

Many of the aforementioned papers refer to various points in the "early" part of training. In this section, we descriptively chart the state of ResNet-20 during the earliest phase of training to provide context for this related work and our subsequent experiments. We specifically focus on the first 4,000 iterations (10 epochs). See Figure A3 for the characterization of additional networks. We include a summary of these results for ResNet-20 as a timeline in Figure 2, and include a broader timeline including results from several previous papers for ResNet-18 in Figure A1.

As shown in Figure 3, during the earliest ten iterations, the network undergoes substantial change. It experiences large gradients that correspond to a rapid increase in distance from the initialization and a large number of sign changes of the weights. After these initial iterations, gradient magnitudes drop and the rate of change in each of the aforementioned quantities gradually slows through the remainder of the period we observe. Interestingly, gradient magnitudes reach a minimum after the first 200 iterations and subsequently increase to a stable level by iteration 500. Evaluation accuracy improves rapidly, reaching 55% by the end of the first epoch (400 iterations), more than halfway to the final 91.5%. By 2000 iterations, accuracy approaches 80%.

During the first 4000 iterations of training, we observe three sub-phases. In the first phase, lasting only the initial few iterations, gradient magnitudes are very large and, consequently, the network changes rapidly. In the second phase, lasting about 500 iterations, performance quickly improves, weight magnitudes quickly increase, sign differences from initialization quickly increase, and gradient magnitudes reach a minimum before settling at a stable level. Finally, in the third phase, all of these quantities continue to change in the same direction, but begin to decelerate.
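Several of the quantities charted in this section and in Figure 3 (sign differences from initialization, average weight magnitude, L2 distance, and cosine similarity) are simple functions of the flattened weight vector. The sketch below shows one plausible way to compute them; the function names are illustrative, and treating a weight of exactly zero as non-negative is an assumption rather than something the paper specifies.

```python
import math

def sign_diff_fraction(w_init, w_now):
    """Fraction of weights whose sign differs from initialization.

    Assumption: a weight equal to zero is counted as non-negative.
    """
    differ = sum(1 for a, b in zip(w_init, w_now) if (a < 0) != (b < 0))
    return differ / len(w_init)

def mean_magnitude(w):
    """Average absolute value of the weights."""
    return sum(abs(x) for x in w) / len(w)

def l2_distance(u, v):
    """Euclidean distance between two weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

w_init = [0.5, -0.2, 0.1, -0.4]   # weights at initialization (toy values)
w_now = [0.6, 0.1, -0.3, -0.5]    # weights at some later iteration (toy values)
print(sign_diff_fraction(w_init, w_now))  # 0.5: two of four weights flipped sign
print(l2_distance(w_init, w_now))
print(cosine_similarity(w_init, w_now))
```

Evaluating these functions against both the initial and the final weights at each logged iteration would yield curves of the kind summarized in Figure 3.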

Figure 3: Basic telemetry about the state of ResNet-20 during the first 4000 iterations (10 epochs). Top row: evaluation accuracy/loss; average weight magnitude; percentage of weights that change sign from initialization; the values of ten randomly-selected weights. Bottom row: gradient magnitude; L2 distance of weights from their initial values and final values at the end of training; cosine similarity of weights from their initial values and final values at the end of training.

[Figure 4 plots: x-axis: Percent of Weights Remaining (100.0, 64.0, 41.0, 26.2, 16.8, 10.7, 6.9, 4.4); y-axis: Eval Accuracy (%); curves for each k: init: 0 signs: 0; init: k signs: k; random init signs: k; init: 0 signs: k; init: k signs: 0]

Figure 4: Performance of an IMP-derived sub-network of ResNet-20 on CIFAR-10 initialized to the signs at iteration 0 or k and the magnitudes at iteration 0 or k. Left: k = 500. Right: k = 2000.

5 PERTURBING NEURAL NETWORKS EARLY IN TRAINING

Figure 1 shows that the changes in the network weights over the first 500 iterations of training are essential to enable high performance at high sparsity levels. What features of this weight transformation are necessary to recover increased performance? Can they be summarized by maintaining the weight signs, but discarding their magnitudes as implied by Zhou et al. (2019)? Can they be represented distributionally? In this section, we evaluate these questions by perturbing the early state of the network in various ways. Concretely, we either add noise or shuffle the weights of IMP sub-networks of ResNet-20 across different network sub-components and examine the effect on the network's ability to learn thereafter. The sub-networks derived by IMP with rewinding make it possible to understand the causal impact of perturbations on sub-networks that are as capable as the full networks but more visibly decline in performance when improperly configured. To enable comparisons between the experiments in Section 5 and provide a common frame of reference, we measure the effective standard deviation of each perturbation, i.e., stddev(w_perturb − w_orig).

5.1 ARE SIGNS ALL YOU NEED?

Zhou et al. (2019) show that, for a set of small convolutional networks, signs alone are sufficient to capture the state of lottery ticket sub-networks. However, it is unclear whether signs are still sufficient for larger networks early in training. In Figure 4, we investigate the impact of combining the magnitudes of the weights from one time-point with the signs from another.
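As a small illustration of the two operations just described, the sketch below combines signs from one weight snapshot with magnitudes from another and then measures the effective standard deviation stddev(w_perturb − w_orig). The helper names `combine` and `effective_std` are hypothetical, the toy weight values are made up, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import math
import statistics

def combine(signs_from, magnitudes_from):
    """Build weights with the sign of one snapshot and the magnitude of another."""
    return [math.copysign(abs(m), s) for s, m in zip(signs_from, magnitudes_from)]

def effective_std(w_perturb, w_orig):
    """Standard deviation of the elementwise difference w_perturb - w_orig.

    Assumption: population standard deviation (statistics.pstdev).
    """
    return statistics.pstdev([p - o for p, o in zip(w_perturb, w_orig)])

w_0 = [0.3, -0.1, 0.2, -0.4]   # toy weights at initialization
w_k = [0.5, 0.2, -0.1, -0.6]   # toy weights at rewinding iteration k

signs_k_mags_0 = combine(signs_from=w_k, magnitudes_from=w_0)
signs_0_mags_k = combine(signs_from=w_0, magnitudes_from=w_k)
print(signs_k_mags_0)  # [0.3, 0.1, -0.2, -0.4]
print(signs_0_mags_k)  # [0.5, -0.2, 0.1, -0.6]

# Size of each perturbation relative to the unperturbed weights at iteration k.
print(effective_std(signs_k_mags_0, w_k))
print(effective_std(signs_0_mags_k, w_k))
```

Reporting every perturbation on this common scale is what allows sign swaps, shuffles, and additive noise to be compared against one another.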
We found that the signs at iteration 500 paired with the magnitudes from initialization (red line) or from a separate random initialization (green line) were insufficient to maintain the performance reached by using both signs and magnitudes from iteration 500 (orange line), and performance drops to that of using both magnitudes and signs from initialization (blue line). However, when using the magnitudes from iteration 500 and the signs from initialization, performance is still substantially better than with initialization signs and magnitudes. In addition, the overall perturbation to the network from using the magnitudes at iteration 500 and signs from initialization (mean: 0.0, stddev: 0.033) is smaller than that from using the signs at iteration 500 and the magnitudes from initialization. These results suggest that the change in weight magnitudes over the first 500 iterations of training is substantially more important than the change in the signs for enabling subsequent training.

By iteration 2000, however, pairing the iteration 2000 signs with magnitudes from initialization (red line) reaches similar performance to using the signs from initialization and the magnitudes from iteration 2000 (purple line), though not as high performance as using both from iteration 2000. This result suggests that network signs undergo important changes between iterations 500 and 2000, even though only 9% of signs change during this period. Our results also suggest that, counter to the observations of Zhou et al. (2019) in shallow networks, signs are not sufficient in deeper networks.

[Figure 5 plots: ResNet-20 (CIFAR-10); x-axis: Percent of Weights Remaining; y-axis: Eval Accuracy (%); curves: Rewind to 0; Rewind to k; Shuffle Globally; Shuffle Layers; Shuffle Filters]

Figure 5: Performance of an IMP-derived ResNet-20 sub-network on CIFAR-10 initialized with the weights at iteration k permuted within various structural elements. Left: k = 500. Right: k = 2000.

[Figure 6 plots: x-axis: Percent of Weights Remaining; y-axis: Eval Accuracy (%); curves: Rewind to 0; Rewind to k; Shuffle Globally, Signs at k; Shuffle Layers, Signs at k; Shuffle Filters, Signs at k; Shuffle Layers, Signs at Init]

Figure 6: The effect of training an IMP-derived sub-network of ResNet-20 on CIFAR-10 initialized with the weights at iteration k as shuffled within various structural elements where shuffling only occurs between weights with the same sign. Left: k = 500. Right: k = 2000.

5.2 ARE WEIGHT DISTRIBUTIONS I.I.D.?

Can the changes in weights over the first k iterations be approximated distributionally? To measure this, we permuted the weights at iteration k within various structural sub-components of the network (globally, within layers, and within convolutional filters). If networks are robust to these permutations, it would suggest that the weights in such sub-components might be approximated and sampled from.
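The three permutation scopes described above can be sketched on a toy network stored as nested lists (layers containing filters containing weights). This is only an illustration: the nested-list layout and the name `shuffle_within` are assumptions made here, and the sketch ignores pruning masks, which a real sub-network experiment would need to respect.

```python
import random

def shuffle_within(network, scope, rng):
    """Permute weights within each filter, within each layer, or globally.

    `network` is a list of layers; each layer is a list of filters;
    each filter is a list of floats. The nested shape is preserved.
    """
    if scope == "filter":
        return [[rng.sample(f, len(f)) for f in layer] for layer in network]
    if scope == "layer":
        out = []
        for layer in network:
            flat = [w for f in layer for w in f]
            rng.shuffle(flat)
            it = iter(flat)
            out.append([[next(it) for _ in f] for f in layer])
        return out
    if scope == "global":
        flat = [w for layer in network for f in layer for w in f]
        rng.shuffle(flat)
        it = iter(flat)
        return [[[next(it) for _ in f] for f in layer] for layer in network]
    raise ValueError(f"unknown scope: {scope}")

net = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], [[7.0, 8.0], [9.0, 10.0]]]
for scope in ("filter", "layer", "global"):
    print(scope, shuffle_within(net, scope, random.Random(0)))
```

Each scope preserves the multiset of weights inside the chosen unit, so the marginal weight distribution of that unit is unchanged while any dependence between positions is destroyed.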
As Figure 5 shows, however, we found that performance was not robust to shuffling weights globally (green line) or within layers (red line), and drops substantially to no better than that of the original initialization (blue line) at both 500 and 2000 iterations. Shuffling within filters (purple line) performs slightly better, but results in a smaller overall perturbation than shuffling layerwise or globally, suggesting that this change in perturbation strength may simply account for the difference.

(We also considered shuffling within the incoming and outgoing weights for each neuron, but performance was equivalent to shuffling within layers. We elided these lines for readability.)

[Figure 7 plots: x-axis: Percent of Weights Remaining; y-axis: Eval Accuracy (%); curves: Rewind to 0; Rewind to k; noise levels 0.5, 1.0, 2.0, 3.0]

Figure 7: The effect of training an IMP-derived sub-network of ResNet-20 on CIFAR-10 initialized with the weights at iteration k and Gaussian noise of nσ, where σ is the standard deviation of the initialization distribution for each layer. Left: k = 500. Right: k = 2000.

Are the signs from the rewinding iteration, k, sufficient to recover the damage caused by permutation? In Figure 6, we also consider shuffling only amongst weights that have the same sign. Doing so substantially improves the performance of the filter-wise shuffle; however, it also reduces the extent of the overall perturbation. It also improves the performance of shuffling within layers slightly for k = 500 and substantially for k = 2000. We attribute the behavior for k = 2000 to the signs just as in Figure 4: when the magnitudes are similar in value (Figure 4 red line) or distribution (Figure 6 red and green lines), using the signs improves performance. Reverting back to the initial signs while shuffling magnitudes within layers (brown line), however, damages the network too severely to yield any performance improvement over random noise. These results suggest that, while the signs from initialization are not sufficient for high performance at high sparsity as shown in Section 5.1, the signs from the rewinding iteration are sufficient to recover the damage caused by permutation, at least to some extent.

5.3 IS IT ALL JUST NOISE?

Some of our previous results suggested that the impact of signs and permutations may simply reduce to adding noise to the weights. To evaluate this hypothesis, we next study the effect of simply adding Gaussian noise to the network weights at iteration k. To add noise appropriately for layers with different scales, the standard deviation of the noise added for each layer was normalized to a multiple of the standard deviation σ of the initialization distribution for that layer.
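The per-layer noise scaling described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `add_scaled_noise` is invented here, each layer is a flat list of weights, and the per-layer initialization standard deviations in `init_stds` are taken as given inputs with made-up toy values, since the text does not state how they were obtained.

```python
import random
import statistics

def add_scaled_noise(layers, init_stds, n, rng):
    """Add zero-mean Gaussian noise with standard deviation n * sigma_l to layer l."""
    return [
        [w + rng.gauss(0.0, n * sigma) for w in layer]
        for layer, sigma in zip(layers, init_stds)
    ]

rng = random.Random(0)
layers = [[0.0] * 10000, [1.0] * 10000]   # two toy layers of weights
init_stds = [0.1, 0.02]                   # assumed per-layer initialization sigmas
noisy = add_scaled_noise(layers, init_stds, n=2.0, rng=rng)

# Empirical standard deviation of the added noise per layer,
# which should be close to n * sigma for each layer.
measured = [
    statistics.pstdev([a - b for a, b in zip(noisy_layer, layer)])
    for noisy_layer, layer in zip(noisy, layers)
]
print(measured)
```

Scaling by each layer's own σ keeps the relative size of the perturbation comparable across layers whose weights live on different scales.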
In Figure 7, we see that for iteration k = 500, sub-networks can tolerate 0.5σ to 1σ of noise before performance degrades back to that of the original initialization at higher levels of noise. For iteration k = 2000, networks are surprisingly robust to noise up to 1σ, and even 2σ exhibits nontrivial performance.

In Figure 8, we plot the performance of each network at a fixed sparsity level as a function of the effective standard deviation of the noise imposed by each of the aforementioned perturbations. We find that the standard deviation of the effective noise explains the resultant performance fairly well (the correlation coefficient r is negative for both k = 500 and k = 2000). As expected, perturbations that preserved the performance of the network generally resulted in smaller changes to the state of the network at iteration k. Interestingly, experiments that mixed signs and magnitudes from different points in training (green points) aligned least well with this pattern: the standard deviation of the perturbation is roughly similar among all of these experiments, but the accuracy of the resulting networks changes substantially. This result suggests that although the standard deviation of the noise is certainly indicative of lower accuracy, there are still specific perturbations that, while small in overall magnitude, can have a large effect on the network's ability to learn, suggesting that the observed perturbation effects are not, in fact, just a consequence of noise.

6 THE DATA-DEPENDENCE OF NEURAL NETWORKS EARLY IN TRAINING

Section 5 suggests that the change in network behavior by iteration k is not due to easily-ascertainable, distributional properties of the network weights and signs. Rather, it appears that training is required to reach these network states. The extent to which various aspects of the data distribution are necessary, however, is unclear. Mainly, is the change in weights during the early phase of training dependent on p(x) or p(y|x)? Here, we attempt to answer this question by measuring the extent to which we can re-create a favorable network state for sub-network training using restricted information from the training data and labels. In particular, we consider pre-training the network with techniques that ignore labels entirely (self-supervised rotation prediction, Section 6.2), provide misleading labels (training with random labels, Section 6.1), or eliminate information from examples (blurring training examples, Section 6.3).

We first train a randomly-initialized, unpruned network on CIFAR-10 on the pre-training task for a set number of epochs. After pre-training, we train the network normally as if the pre-trained state were the original initialization. We then use the state of the network at the end of the pre-training phase as the "initialization" to find masks for IMP. Finally, we examine the performance of the IMP-pruned sub-networks as initialized using the state after pre-training. This experiment determines the extent to which pre-training places the network in a state suitable for sub-network training as compared to using the state of the network at iteration k of training on the original task.

[Figure 8 plots: eval accuracy against effective standard deviation at 16.8% sparsity, one point per perturbation (Gaussian noise 0.5, 1.0, 2.0, 3.0; sign and magnitude mixes; shuffles globally, within layers, and within filters, with and without signs preserved). Left: Rewind to 500. Right: Rewind to 2000.]

Figure 8: The effective standard deviation of various perturbations as a function of mean evaluation accuracy (across 5 seeds) at sparsity 16.8%. The mean of each perturbation was approximately 0. Left: k = 500. Right: k = 2000. The correlation coefficient r is negative in both panels.

[Figure 9 plots: panels Random Labels; Self-Supervised Rotation; 4x Blurring; 4x Blurring + Self-Supervised Rotation; x-axis: Percent of Weights Remaining; y-axis: Eval Accuracy (%); curves: Rewind to 0; Rewind to 500; Rewind to 2000; Pretrain 2, 5, 10, and 40 Epochs]

Figure 9: The effect of pre-training ResNet-20 on CIFAR-10 with random labels, self-supervised rotation, 4x blurring, and 4x blurring and self-supervised rotation.

6.1 RANDOM LABELS

To evaluate whether this phase of training is dependent on underlying structure in the data, we drew inspiration from Zhang et al. (2017) and pre-trained networks on data with randomized labels. This experiment tests whether the input distribution of the training data is sufficient to put the network in a position from which IMP with rewinding can find a sparse, trainable sub-network despite the presence of incorrect (not just missing) labels. Figure 9 (upper left) shows that pre-training on random labels for up to 10 epochs provides no improvement above rewinding to iteration 0 and that pre-training for longer begins to hurt accuracy. This result suggests that, though it is still possible that labels may not be required for learning, the presence of incorrect labels is sufficient to prevent learning which approximates the early phase of training.

6.2 SELF-SUPERVISED ROTATION PREDICTION

What if we remove labels entirely? Is p(x) sufficient to approximate the early phase of training? Historically, neural network training often involved two steps: a self-supervised pre-training phase followed by a supervised phase on the target task (Erhan et al., 2010). Here, we consider one such self-supervised technique: rotation prediction (Gidaris et al., 2018). During the pre-training phase, the network is presented with a training image that has been randomly rotated by n degrees (where n ∈ {0, 90, 180, 270}). The network must classify examples by the value of n. If self-supervised pre-training can approximate the early phase of training, it would suggest that p(x) is sufficient on its own. Indeed, as shown in Figure 9 (upper right), this pre-training regime leads to well-trainable sub-networks, though networks must be trained for many more epochs compared to supervised training (40 compared to 1.25, or a factor of 32×). This result suggests that the labels for the ultimate task themselves are not necessary to put the network in such a state (although explicitly misleading labels are detrimental). We emphasize that the duration of the pre-training phase required is an order of magnitude larger than the original rewinding iteration, however, suggesting that labels add important information which accelerates the learning process.

6.3 BLURRING TRAINING EXAMPLES

To probe the importance of p(x) for the early phase of training, we study the extent to which the training input distribution is necessary. Namely, we pre-train using blurred training inputs with the correct labels. Following Achille et al. (2019), we blur training inputs by downsampling by 4x and then upsampling back to the full size. Figure 9 (bottom left) shows that this pre-training method succeeds: after 40 epochs of pre-training, IMP with rewinding can find sub-networks that are similar in performance to those found after training on the original task for 500 iterations (1.25 epochs).

Due to the success of the rotation and blurring pre-training tasks, we explored the effect of combining these pre-training techniques. Doing so tests the extent to which we can discard both the training labels and some information from the training inputs. Figure 9 (bottom right) shows that doing so provides the network too little information: no amount of pre-training we considered makes it possible for IMP with rewinding to find sub-networks that perform tangibly better than rewinding to iteration 0. Interestingly, however, as shown in Appendix B, trainable sub-networks are found for VGG-13 with this pre-training regime, suggesting that different network architectures have different sensitivities to the deprivation of labels and input content.

6.4 SPARSE PRETRAINING

Since sparse sub-networks are often challenging to train from scratch without the proper initializa-tion (Han et al., 2015; Liu et al., 2019; Frankle & Carbin, 2019), does pre-training make it easierfor sparse neural networks to learn? Doing so would serve as a rough form of curriculum learning(Bengio et al., 2009) for sparse neural networks. We experimented with training sparse sub-networksof ResNet-20 (IMP sub-networks, randomly reinitialized sub-networks, and randomly pruned sub-networks) ﬁrst on self-supervised rotation and then on the main task, but found no beneﬁt beyondrewinding to iteration 0 (Figure 10). Moreover, doing so when starting from a sub-network rewoundto iteration 500 actually hurts ﬁnal accuracy. This result suggests that while pre-training is sufﬁcient to approximate the early phase of supervised training with an appropriately structured mask, it is notsufﬁcient to do so with an inappropriate mask. 8ublished as a conference paper at ICLR 2020 E v a l A cc u r ac y ( % ) Resnet-20 (CIFAR-10) - Sparse PretrainRewind to 0Rewind to 500PretrainPretrain, Random ReinitPretrain, Random Reinit, Randomize Masks

Figure 10: The effect of pre-training sparse sub-networks of ResNet-20 (rewound to iteration 500) with 40 epochs of self-supervised rotation before training on CIFAR-10.
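The two ingredients of the sparse pre-training experiment in this section, a self-supervised rotation task and training under a fixed pruning mask, can be sketched roughly as follows. This is a minimal NumPy illustration rather than the implementation used for the experiments: the function names `make_rotation_batch` and `masked_sgd_step` are hypothetical, the rotation task is assumed to use the four multiples of 90 degrees in the style of Gidaris et al. (2018), images are assumed to be square arrays of shape (N, H, W, C), and plain SGD stands in for whatever optimizer settings were actually used.

```python
import numpy as np


def make_rotation_batch(images, rng):
    """Build a self-supervised rotation batch (assumed setup, not the paper's code).

    Each square image of shape (H, W, C) is rotated by k * 90 degrees,
    with k drawn uniformly from {0, 1, 2, 3}; k serves as the label.
    """
    ks = rng.integers(0, 4, size=len(images))
    rotated = np.stack(
        [np.rot90(img, k=int(k), axes=(0, 1)) for img, k in zip(images, ks)]
    )
    return rotated, ks


def masked_sgd_step(weights, grads, mask, lr):
    """One plain SGD step that keeps pruned weights (mask == 0) at zero."""
    return (weights - lr * grads) * mask


# Usage: labels come from the rotation itself, and the mask stays fixed.
rng = np.random.default_rng(0)
batch = rng.random((4, 32, 32, 3))            # stand-in for CIFAR-10 images
rotated, rotation_labels = make_rotation_batch(batch, rng)

weights = rng.normal(size=10)
mask = (rng.random(10) < 0.5).astype(weights.dtype)  # hypothetical pruning mask
weights = masked_sgd_step(weights * mask, rng.normal(size=10), mask, lr=0.1)
```

Under these assumptions, the same mask would be applied during both the rotation pre-training stage and the subsequent training on the main task, so only the surviving weights are ever updated.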

7 DISCUSSION

In this paper, we first performed extensive measurements of various statistics summarizing learning over the early part of training. Notably, we uncovered 3 sub-phases: in the very first iterations, gradient magnitudes are anomalously large and motion is rapid. Subsequently, gradients overshoot to smaller magnitudes before leveling off while performance increases rapidly. Then, learning slowly begins to decelerate. We then studied a suite of perturbations to the network state in the early phase, finding that, counter to observations in smaller networks (Zhou et al., 2019), deeper networks are not robust to reinitializing with random weights with maintained signs. We also found that the weight distribution after the early phase of training is highly non-independent. Finally, we measured the data-dependence of the early phase with the surprising result that pre-training on a self-supervised task yields equivalent performance to late rewinding with IMP.

These results have significant implications for the lottery ticket hypothesis. The seeming necessity of late rewinding calls into question certain interpretations of lottery tickets as well as the ability to identify sub-networks at initialization. Our observation that weights are highly non-independent at the rewinding point suggests that the weights at this point cannot be easily approximated, making approaches that attempt to "jump" directly to the rewinding point unlikely to succeed. However, our result that labels are not necessary to approximate the rewinding point suggests that the learning during this phase does not require task-specific information, suggesting that rewinding may not be necessary if networks are pre-trained appropriately.

REFERENCES

Alessandro Achille, Matteo Rovere, and Stefano Soatto. Critical learning periods in deep neural networks. 2019.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. ACM, 2009.

Pratik Chaudhuri and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. 2017. URL https://arxiv.org/abs/1710.11029.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.

Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M Roy, and Michael Carbin. Stabilizing the lottery ticket hypothesis. arXiv preprint arXiv:1903.01611, 2019.

Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 2232–2241, 2019. URL http://proceedings.mlr.press/v97/ghorbani19b.html.

Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1v4N2l0-.

Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Time matters in regularizing deep networks: Weight decay and data augmentation affect early learning dynamics, matter little near convergence. 2019.

Guy Gur-Ari, Daniel A Roberts, and Ethan Dyer. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018.

Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143, 2015.

K He, X Zhang, S Ren, and J Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), volume 5, pp. 6, 2015.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pp. 448–456. JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045167.

Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJlnB3C5Ym.

Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. 2019.

Levent Sagun, Utku Evci, Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the hessian of over-parametrized neural networks. 2017. URL https://arxiv.org/abs/1706.04454.

Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? In Advances in neural information processing systems, 2018.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. 2015.

Mingwei Wei and David Schwab. How noise during training affects the hessian spectrum in overparameterized neural networks. 2019.

Sho Yaida. Fluctuation-dissipation relations for stochastic gradient descent. 2019.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. 2017.

Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing lottery tickets: Zeros, signs, and the supermask. 2019.

A MODEL DETAILS

Network     Epochs   Batch Size   Learning Rate    Parameters   Eval Accuracy
ResNet-20   160      128          0.1 (Mom 0.9)    272K         91.5 ± [value missing]
[Rows for the remaining networks are missing in the source.]

B EXPERIMENTS FOR OTHER NETWORKS

[Figure A1 timeline. Axis: Training iterations. Annotations recovered from the figure:
Gradient magnitudes are very large.
Rapid motion in weight space.
Large sign changes (6% in first 10 iterations).
Gradient magnitudes reach a minimum before stabilizing 10% higher at iteration 250.
Motion in weight space slows, but is still rapid.
Performance increases rapidly.
Rewinding begins to become effective (~250 iterations).
Gradients converge to roughly constant magnitude.
Weight magnitudes decrease linearly.
Performance increase decelerates, but is still rapid.
Rewinding begins to become highly effective.
Hessian eigenspectrum separates (700 iterations; Gur-Ari et al., 2018).
Gradient magnitudes remain roughly constant.
Critical periods end (~8000 iterations; Achille et al., 2019; Golatkar et al., 2019).
Performance slowly asymptotes.
Benefit of rewinding saturates.]

Figure A1: Rough timeline of the early phase of training for ResNet-18 on CIFAR-10, including results from previous papers.

[Figure A2 plots. Panels: VGG-13, ResNet-56, ResNet-18, and WRN-16-8 (all CIFAR-10). x-axis: Percent of Weights Remaining; y-axis: Eval Accuracy (%). Legend: one curve per rewinding iteration (Rewind to 0 through Rewind to 2000, varying by panel).]

Figure A2: The effect of IMP rewinding iteration on the accuracy of sub-networks at various levels of sparsity. Accompanies Figure 1.

[Figure A3 plots. For each of VGG-13, ResNet-56, ResNet-18, and WRN-16-8, panels show the following against Training Iteration: Eval Accuracy (%), Eval Loss, Average Weight Magnitude, Sign Differences (% of signs that differ from init), Weight Trace, Gradient Magnitude, L2 Dist from Init, L2 Dist from Final, Cosine Similarity to Init, and Cosine Similarity to Final.]

Figure A3: Basic telemetry about the state of all networks in Table A1 during the first 4000 iterations of training. Accompanies Figure 3.

[Figure A4 plots. Panels (all CIFAR-10): VGG-13 at k = 250 and k = 500; ResNet-56 at k = 500 and k = 2000; ResNet-18 at k = 500 and k = 2000; WRN-16-8 at k = 50 and k = 250. x-axis: Percent of Weights Remaining; y-axis: Eval Accuracy (%). Legend: init: 0 signs: 0; init: k signs: k; random init signs: k; init: 0 signs: k; init: k signs: 0.]

Figure A4: The effect of training an IMP-derived sub-network initialized to the signs at iteration 0 or k and the magnitudes at iteration 0 or k. Accompanies Figure 4.

[Figure A5 plots. Panels (all CIFAR-10): VGG-13 at rewinding iterations 250 and 500; ResNet-56 at 500 and 2000; ResNet-18 at 500 and 2000; WRN-16-8 at 50 and 250. x-axis: Percent of Weights Remaining; y-axis: Eval Accuracy (%). Legend: Rewind to 0; Rewind to k; Shuffle Globally; Shuffle Layers; Shuffle Filters.]

Figure A5: The effect of training an IMP-derived sub-network initialized with the weights at iteration k as shuffled within various structural elements. Accompanies Figure 5.

[Figure A6 plots. Panels (all CIFAR-10): VGG-13 at rewinding iterations 250 and 500; ResNet-56 at 500 and 2000; ResNet-18 at 500 and 2000; WRN-16-8 at 50 and 250. x-axis: Percent of Weights Remaining; y-axis: Eval Accuracy (%). Legend: Rewind to 0; Rewind to k; Shuffle Globally, Signs at k; Shuffle Layers, Signs at k; Shuffle Filters, Signs at k; Shuffle Layers, Signs at Init.]

Figure A6: The effect of training an IMP-derived sub-network initialized with the weights at iteration k as shuffled within various structural elements where shuffling only occurs between weights with the same sign. Accompanies Figure 6.

[Figure A7 plots. Panels (all CIFAR-10): VGG-13 at rewinding iterations 250 and 500; ResNet-56 at 500 and 2000; ResNet-18 at 500 and 2000; WRN-16-8 at 50 and 250. x-axis: Percent of Weights Remaining; y-axis: Eval Accuracy (%). Legend: Rewind to 0; Rewind to k; 0.5; 1.0; 2.0; 3.0.]

Figure A7: The effect of training an IMP-derived sub-network initialized with the weights at iteration k and Gaussian noise of nσ, where σ is the standard deviation of the initialization distribution for each layer. Accompanies Figure 7.

[Figure A8 plots. Panels: VGG-13, Rewind to 250 and Rewind to 500 (0.8% Sparsity); ResNet-56, Rewind to 500 and Rewind to 2000 (16.8% Sparsity); ResNet-18, Rewind to 500 and Rewind to 2000 (3.5% Sparsity); WRN-16-8, Rewind to 50 and Rewind to 250 (6.9% Sparsity). y-axis: Eval Accuracy. Labeled points: Rewind to k; Gaussian 0.5, 1.0, 2.0, and 3.0; random init signs: k; init: 0 signs: k; init: k signs: 0; Shuffle Globally; Shuffle Layers; Shuffle Filters; Shuffle Globally, Signs at k; Shuffle Layers, Signs at k; Shuffle Filters, Signs at k.]

Figure A8: The effective standard deviation of each of the perturbations studied in Section 5 as a function of mean evaluation accuracy (across five seeds). Accompanies Figure 8.

[Figure A9 plots. Panels: VGG-13, ResNet-56, ResNet-18, and WRN-16-8 (all CIFAR-10), Random Labels. x-axis: Percent of Weights Remaining; y-axis: Eval Accuracy (%). Legend: Rewind to 0 and two later rewinding iterations (250 and 500 for VGG-13 and WRN-16-8; 500 and 2000 for ResNet-56 and ResNet-18); Pretrain 2, 5, 10, and 40 Epochs.]

Figure A9: The effect of pre-training CIFAR-10 with random labels. Accompanies Figure 9.

[Figure A10 plots. Panels: VGG-13, ResNet-56, ResNet-18, and WRN-16-8 (all CIFAR-10), Self-Supervised Rotation. x-axis: Percent of Weights Remaining; y-axis: Eval Accuracy (%). Legend: Rewind to 0 and two later rewinding iterations (250 and 500 for VGG-13 and WRN-16-8; 500 and 2000 for ResNet-56 and ResNet-18); Pretrain 2, 5, 10, and 40 Epochs.]

Figure A10: The effect of pre-training CIFAR-10 with self-supervised rotation. Accompanies Figure 9.

[Plots for the following figure. Panels: VGG-13, ResNet-56, ResNet-18, and WRN-16-8 (all CIFAR-10), 4x Blurring. x-axis: Percent of Weights Remaining; y-axis: Eval Accuracy (%). Legend: Rewind to 0 and two later rewinding iterations (250 and 500 for VGG-13 and WRN-16-8; 500 and 2000 for ResNet-56 and ResNet-18); Pretrain 2, 5, 10, and 40 Epochs.]
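The 4x blurring used for pre-training in Section 6.3 (downsampling by 4x and then upsampling back to the full size) can be sketched as follows. The text does not state which resampling filters were used, so this sketch assumes average pooling for the downsampling step and nearest-neighbor repetition for the upsampling step; the function name `blur_by_resampling` is hypothetical.

```python
import numpy as np


def blur_by_resampling(image, factor=4):
    """Blur an (H, W, C) image by downsampling and upsampling by `factor`.

    Assumed filters (not specified in the text): average pooling over
    factor x factor blocks, then nearest-neighbor upsampling.
    """
    h, w, c = image.shape
    if h % factor or w % factor:
        raise ValueError("height and width must be divisible by factor")
    # Downsample: average each factor x factor block.
    small = image.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))
    # Upsample: repeat each low-resolution pixel factor times along each axis.
    return np.repeat(np.repeat(small, factor, axis=0), factor, axis=1)


# Usage on a stand-in for a 32x32 RGB CIFAR-10 image.
image = np.random.default_rng(0).random((32, 32, 3))
blurred = blur_by_resampling(image)
```

With these choices the output has the same shape as the input, but every 4x4 block carries a single value per channel, so fine spatial detail is removed while the labels remain usable.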