Checkpoint Ensembles: Ensemble Methods from a Single Training Process
Hugh Chen
University of Washington, Seattle, WA
Scott Lundberg
University of Washington, Seattle, WA
Su-In Lee
University of Washington, Seattle, WA
Abstract
We present the checkpoint ensembles method, which can learn ensemble models within a single training process. Although checkpoint ensembles can be applied to any parametric iterative learning technique, here we focus on neural networks. Neural networks' composable and simple neurons make it possible to capture many individual and interaction effects among features. However, small sample sizes and sampling noise may result in patterns in the training data that are not representative of the true relationship between the features and the outcome. As a solution, regularization during training is often used (e.g., dropout). However, regularization is no panacea; it does not perfectly address overfitting. Even with methods like dropout, two further methodologies are commonly used in practice. The first is to use a validation set independent of the training set to decide when to stop training. The second is to use ensemble methods to further reduce overfitting and take advantage of local optima (i.e., averaging over the predictions of several models). In this paper, we explore checkpoint ensembles, a simple technique that combines these two ideas in one training process. Checkpoint ensembles improve performance by averaging the predictions from "checkpoints" of the best models within a single training process. Using three real-world data sets (text, image, and electronic health record data) and three prediction models (a vanilla neural network, a convolutional neural network, and a long short-term memory network), we show that checkpoint ensembles outperform existing methods: a method that selects a model by minimum validation score, and two methods that average models by their weights. Our results also show that checkpoint ensembles capture a portion of the performance gains that traditional ensembles provide.
Introduction
Ensemble methods are learning algorithms that combine multiple individual methods to create a learning algorithm that is better than any of its individual parts (Dietterich 2000). The simplest such methods are random initialization ensembles (RIE), which run the same model over the same data with different weight initializations. Ensemble methods have gained popularity because they can outperform any single learner on many datasets and machine learning tasks (Krogh and Vedelsby 1995; Dietterich 2000; Naftaly, Intrator, and Horn 1997).
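To make RIE concrete, here is a minimal sketch in Python with Keras, assuming a hypothetical build_model() factory that returns a freshly initialized copy of the same architecture; only the random seed, and hence the weight initialization, varies across runs.

import numpy as np
import tensorflow as tf

def rie_predict(build_model, x_train, y_train, x_test, k=5, epochs=20):
    # Train k identically configured models from different random
    # initializations and average their predicted probabilities.
    preds = []
    for seed in range(k):
        tf.random.set_seed(seed)  # vary only the initialization
        model = build_model()
        model.fit(x_train, y_train, epochs=epochs, verbose=0)
        preds.append(model.predict(x_test))
    return np.mean(preds, axis=0)  # the ensemble prediction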
However, when using deep learning models, ensemble methods are more challenging, simply because training deep neural networks over large datasets takes a correspondingly large amount of computation time. In the simplest case, a network is trained epoch by epoch, iterating over the training data to calculate the gradient of the loss function and approaching an optimum by shifting the model's parameters (typically weights). Eventually, after the score (e.g., loss or accuracy) on the validation set fails to improve for a number of epochs, the set of parameters that achieved the best validation performance becomes the model to be evaluated on the test set.

Figure 1: The rounded boxes going from left to right represent models at each step of a particular training process (e.g., using gradient descent). The shading represents validation score; lighter shades represent a better score. For either ensemble, we average the predictions from the best models to get the final prediction P.

Checkpoint ensembles introduce the benefits of ensemble methods within a single network's training process by leveraging the validation scores. Checkpointing refers to saving the best models, in terms of a validation score metric, across all epochs of a single training process. Then, as in Figure 1, based on the top scores we either combine the predictions of the top-scoring models (checkpoint ensembles) or combine the models themselves by averaging their weights (checkpoint smoothers). Of the single-training-process model averaging techniques considered in this paper, namely minimum validation (MV), checkpoint smoothers (CS), and last-k smoothers (LKS), checkpoint ensembles show the best performance. It is worth noting that while checkpoint ensembles are particularly suitable for neural networks, they can easily be extended to any iterative learning algorithm. Additionally, despite its performance and simplicity, the checkpoint ensemble is surprisingly unexplored. Ju, Bibaut, and van der Laan (2017) mention checkpoint ensembles in an exploration of different methods for combining predictions with neural network ensembles, and Sennrich et al. (2017) used a checkpoint ensemble based on N sequential epochs.

Related Work
For neural networks, one of the most commonly used single-training-process techniques for preventing overfitting is minimum validation model selection (MV), which selects the model with the best validation score as the final model. We therefore use MV as the baseline in our experiments (see the Experimental Results section).

The other two single-training-process methods are two different versions of what we call smoothers: last-k smoothers (LKS) and checkpoint smoothers (CS). Smoothing refers to averaging the weights of models, a natural parallel to averaging the predictions from models. Utans (1996) explains that averaging parameters can be problematic because (1) different local minima may be found, and (2) a particular solution can be represented by different permutations of hidden nodes. Despite these problems, smoothers are still worthwhile comparisons because using a single training process may alleviate these issues.

LKS takes the best model in terms of validation score and averages its weights with those of the epochs immediately preceding it. CS averages the weights of the k best models in terms of validation score. More formally, LKS and CS can be implemented as:

1. Train the neural network in the normal fashion such that at epochs 1, 2, ..., n we obtain corresponding models M = {M_1, M_2, ..., M_n} as well as validation scores V = {V_1, V_2, ..., V_n}, where each model M_i has a set of weight parameters W_i.

2. Order V to get V_o = {V_(1), V_(2), ..., V_(n)} and the models M_o = {M_(1), M_(2), ..., M_(n)} such that M_j = M_(i) whenever V_j = V_(i). Depending on the validation score, the ordering V_o may be either increasing or decreasing, such that V_(1) represents the optimal value.

3. Then impose an ordering on M that we denote M' = {M'_1, M'_2, ..., M'_k}, with weights W'_1, W'_2, ..., W'_k, where k is the number of models used for smoothing.
   (a) For LKS, set k = 5 and impose M'_1 = M_(1) = M_l. Then M'_2, ..., M'_k = M_{l-1}, ..., M_{max(1, l-(k-1))}.
   (b) For CS, set M' = {M_(1), M_(2), ..., M_(k)}, with k = min(a + 5, b, n), where a is the number of early stopping rounds, b is such that M_b = M_(1), and n is the total number of epochs. For the Experimental Results section, a = 10.

4. Return the model M_S with weights W_S = (1/k) * sum_{i=1}^{k} W'_i.

To demonstrate that checkpoint ensembles capture a portion of the effect garnered from traditional ensemble methods, we also use random initialization ensembles (RIE) for comparison. For RIE, run k models with different random initializations: M^1 = {M^1_1, ..., M^1_{n_1}}, ..., M^k = {M^k_1, ..., M^k_{n_k}}. Denote the prediction of a model on a sample point x as M(x). Then the prediction of the final model M_RIE on a sample point x_o is M_RIE(x_o) = (1/k) * sum_{i=1}^{k} M^i_(1)(x_o), where M^i_(1) is the best-scoring model in terms of validation score for training run i. For the Experimental Results section, k = 5.

Pseudocode
For the following pseudocode we assume that lower validation scores are better. Pseudocode for predicting with the minimum validation (MV) method is as follows:
Algorithm 1: Predict with MV

  models = nn.train(earlyStop, additionalParameters)

  procedure PREDMV(models, x):
      models.sort(by="val scores", order="increase")
      return models[0].predict(x)

Pseudocode for predicting with both the last-k smoother (LKS) and the checkpoint smoother (CS) is as follows:

Algorithm 2: Predict with LKS

  models = nn.train(earlyStop, additionalParameters)

  procedure PREDLKS(models, x):
      k = min(5, len(models))
      models.sort(by="epochs", order="decrease")
      models = models[len(models) - bestEpoch(models):]
      model = nn.emptyModel(additionalParameters)
      model.weights = avgWeights(models[:k])
      return model.predict(x)

Algorithm 3: Predict with CS

  models = nn.train(earlyStop, additionalParameters)

  procedure PREDCS(models, x):
      bestEpoch = bestEpoch(models)
      k = min(earlyStop + 5, bestEpoch, len(models))
      models.sort(by="val scores", order="increase")
      model = nn.emptyModel(additionalParameters)
      model.weights = avgWeights(models[:k])
      return model.predict(x)

Pseudocode for random initialization ensembles (RIE) is as follows:

Algorithm 4: Predict with RIE

  k = 5; i = 0; bestModelLst = []
  while i < k do:
      models = nn.train(earlyStop, additionalParameters)
      models.sort(by="val scores", order="increase")
      bestModelLst.append(models[0])
      i = i + 1

  procedure PREDRIE(bestModelLst, x):
      return average(bestModelLst.predict(x))
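The averaging step shared by LKS and CS can be written compactly against the Keras API. The sketch below assumes models is a list of trained Keras models with identical architectures, already ordered so that the first k are the ones to smooth; it computes W_S = (1/k) * sum_{i=1}^{k} W'_i layer by layer.

import numpy as np

def average_weights(models, k):
    # Each get_weights() call returns a list of per-layer arrays;
    # zip groups the corresponding layers across the k models.
    weight_lists = [m.get_weights() for m in models[:k]]
    return [np.mean(layer_group, axis=0) for layer_group in zip(*weight_lists)]

# Usage: load the averaged weights into a fresh model of the same architecture.
# smoothed_model.set_weights(average_weights(models, k=5))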
Checkpoint Ensembles Overview

We make two observations about the parameter space in relation to the loss function that serve as intuition. First, on top of reducing overfitting, one intuition for why checkpoint ensembles (CE) work well is that even at the end of training, the neural network may not land exactly on an optimum point in the parameter space, as depicted in Figure 2A. Even more crucially, as the network traverses the parameter space, the gradient and the learning rate may send the network around a single optimum (Figure 2A) or across multiple local optima (Figure 2B). In either case, the models characterized by the weights at each of these epochs may be particularly confident when making predictions in unique regions of the space of all possible prediction problems (which we denote as Γ).

To further elaborate, imagine model M_1 is good at prediction problems in the subspace α ⊂ Γ and mediocre in the subspace β ⊂ Γ, whereas M_2 is good at prediction problems in β and mediocre in α. If this is the case, it stands to reason that both models will have high validation accuracy and therefore be checkpoints. Then the final model M_CE, an ensemble of M_1 and M_2, will reflect the confidence M_1 has for the region α: averaging M_1's high probabilities for predictions in α with M_2's mediocre predictions results in M_CE behaving like M_1 on α. A parallel argument applies for M_2 and β. The resulting model M_CE should then outperform either individual model M_1 or M_2 (see the toy numeric sketch at the end of this overview).

Figure 2: Pictorial representation of scenarios for gradient descent: (A) when there is one optimal point and (B) when there are two local optima. The shading represents the optimum in terms of the loss function, plotted against the (two) parameters; the whiter the shade, the closer to optimal. The arrows represent gradient descent.

The second observation is that scenarios like the ones in Figure 2 are more common with moderately higher learning rates. A model with a higher learning rate explores the parameter space more adventurously, likely visiting more valleys (optima) of the parameter space than one with a slightly lower learning rate. These higher learning rates also come with another nice side effect: they generally require fewer epochs for the model to "converge." This means that checkpoint ensembles potentially perform better with higher learning rates and thus need fewer epochs to reach their full potential than a neural network chosen by minimum validation.
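The following toy calculation, with made-up probabilities, illustrates the complementary-checkpoints argument above: checkpoint m1 is confident on problems of type α, m2 on type β, and their average does well on both.

import numpy as np

m1 = np.array([0.95, 0.55])  # P(correct class): confident on alpha, mediocre on beta
m2 = np.array([0.55, 0.95])  # mediocre on alpha, confident on beta
print((m1 + m2) / 2)         # [0.75 0.75]: a better worst case than either model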
Algorithm

1. Train the neural network normally such that at epochs 1, 2, ..., n we obtain corresponding models M = {M_1, M_2, ..., M_n} as well as validation scores V = {V_1, V_2, ..., V_n}.

2. Order V to get V_o = {V_(1), V_(2), ..., V_(n)} and the models M_o = {M_(1), M_(2), ..., M_(n)} such that M_j = M_(i) whenever V_j = V_(i). Depending on the validation score, the ordering V_o may be either increasing or decreasing, such that V_(1) represents the optimal value.

3. Return the model M_CE whose prediction on a sample point x_o is M_CE(x_o) = (1/k) * sum_{i=1}^{k} M_(i)(x_o).

To select k, a good heuristic is k = min(a + 5, b, n), where a is the number of early stopping rounds, b is such that M_b = M_(1), and n is the total number of epochs. For the prediction problems in the Experimental Results section, we use a = 10. We determined this heuristic for k by testing on the operating room data sets (see the Experimental Results section).

Pseudocode
Algorithm 5: Predict with CE

  models = nn.train(earlyStop, additionalParameters)

  procedure PREDCE(models, x):
      bestEpoch = bestEpoch(models)
      k = min(earlyStop + 5, bestEpoch, len(models))
      models.sort(by="val scores", order="increase")
      return average(models[:k].predict(x))
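As a complement to the pseudocode, here is a minimal runnable sketch of CE in Keras. The k-selection follows the heuristic above, but the per-epoch checkpointing callback and its bookkeeping are our own scaffolding, not code from the paper; lower validation loss is assumed to be better.

import numpy as np
from tensorflow import keras

class EpochCheckpointer(keras.callbacks.Callback):
    # Record the weights and the validation loss after every epoch.
    def __init__(self):
        super().__init__()
        self.weights, self.val_losses = [], []

    def on_epoch_end(self, epoch, logs=None):
        self.weights.append(self.model.get_weights())
        self.val_losses.append(logs["val_loss"])

def ce_predict(model, ckpt, x, early_stop=10):
    order = np.argsort(ckpt.val_losses)              # best validation loss first
    best_epoch = int(order[0]) + 1                   # 1-based index of the best epoch
    k = min(early_stop + 5, best_epoch, len(order))  # heuristic k = min(a+5, b, n)
    preds = []
    for idx in order[:k]:
        model.set_weights(ckpt.weights[idx])         # reload each checkpoint in turn
        preds.append(model.predict(x))
    return np.mean(preds, axis=0)                    # average the top-k predictions

Training would pass the callback alongside a validation set, e.g. model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=100, callbacks=[ckpt]), after which ce_predict(model, ckpt, x_test) returns the checkpoint ensemble prediction.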
Experimental Results

Data
We consider the following three data sets:
Reuters is a popular data set containing 11,228 newswires from Reuters labeled with 46 topics; the task is text categorization.
CIFAR-10 is a popular data set containing 60,000 32x32 color images in 10 classes; the task is image classification.
Operating Room Data contains 57,000 surgeries with time series data and static summary information, obtained under appropriate Institutional Review Board (IRB) approval. After splitting surgeries into multiple time points, there are about 8,000,000 desaturation labels with about 120,000 positive examples, and about 3,000,000 hypocapnia labels with about 240,000 positive examples. Both label sets define time series binary classification problems.
Models
We consider three neural network prediction models: vanilla neural networks, CNNs, and LSTMs. Neural networks are well suited to checkpoint ensembles for two reasons: (1) training is a stochastic, iterative process, and (2) they are often applied to huge data sets where training time is expensive. We implemented our networks in Python using Keras, a package that provides a convenient frontend to TensorFlow (Chollet and others 2015; Lipton et al. 2015).
CNNs utilize convolutions and have been applied with great success to image classification (Krizhevsky, Sutskever, and Hinton 2012).
LSTMs were introduced by Hochreiter and Schmidhuber as a variant of recurrent neural networks that avoids the vanishing gradient problem (Hochreiter and Schmidhuber 1997; Hochreiter 1998).
Evaluation Metrics
Accuracy is a useful metric owing to its straightforward nature and ease of use. In multiclass prediction problems, a common approach is to predict class probabilities for a given sample point and label the point with the maximally probable class. Accuracy is then the percentage of predicted labels that match the true labels for a given data set.
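As a small illustration, the labeling rule just described is a one-line argmax; this sketch assumes probs is an array of shape (n_samples, n_classes) of predicted probabilities.

import numpy as np

def accuracy(probs, y_true):
    y_pred = np.argmax(probs, axis=1)  # pick the maximally probable class
    return np.mean(y_pred == y_true)   # fraction of labels that match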
Area under the precision-recall (PR) curve is the second evaluation metric we consider. PR curves are widely used to summarize the predictive accuracy of a model on binary classification tasks, and they are especially popular for classification problems with imbalanced labels. True positives (TP) are positive sample points classified as positive, and true negatives (TN) are negative sample points classified as negative; false positives (FP) are negative sample points classified as positive, and false negatives (FN) are positive sample points classified as negative. Precision is defined as TP / (TP + FP) and recall as TP / (TP + FN). The PR curve plots precision (y-axis) against recall (x-axis). To summarize this curve, it is conventional to use the area under the curve (AUC) as the measure of prediction performance.
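In practice the PR AUC can be computed directly with scikit-learn; the sketch below assumes binary labels y_true and predicted positive-class scores y_score as NumPy arrays.

from sklearn.metrics import auc, precision_recall_curve

def pr_auc(y_true, y_score):
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)  # area under the precision-recall curve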
Performance

Reuters
The Reuters data set serves as a simple proof of concept. We train a feedforward network structured as follows: 1,000 input nodes, a 512-node hidden layer, ReLU activation, 0.5 dropout, a 46-node layer, and a softmax. We train until there is no improvement in validation score for ten epochs. We generate five models for each learning rate in the set 10^{-3.5}, 10^{-3.45}, ..., 10^{-1.5} (a total of 205 samples). For each model we record the scores and epochs for MV, CE, CS, and LKS. Then, for each learning rate, we average across the MV predictions of the five models to get the RIE scores and add up the epochs to get the RIE epochs.

In Table 1 we see that checkpoint ensembles (CE) and random initialization ensembles (RIE) consistently perform better than the baseline, minimum validation (MV), with RIE showing the best performance, as one would expect. The smoothers do not perform quite as well, although there is a significant improvement of checkpoint smoothers (CS) over the baseline that is not present for the last-k smoother (LKS). The improvement CS displays suggests that smoothing weights is a tenable approach under certain scenarios.

Table 1: Improvement over MV (Reuters). For each method (CE, CS, LKS, RIE), the table reports the 95% confidence interval of the difference from MV and the corresponding p-value; the intervals for CE, CS, and RIE are positive, and the interval for LKS is negative. We report p-values and confidence intervals for the one-sample t-test of the null hypothesis that there is zero difference from the baseline (MV). The confidence intervals indicate the direction the performance moved relative to the baseline: positive values indicate better performance and negative values indicate worse performance.

For Figure 3, we averaged the accuracy and number of epochs over the five runs used in Table 1 for MV and CE (we exclude LKS and CS for the sake of clarity, because they performed worse than CE). For RIE, we already have a single estimate over the five runs at any particular learning rate. First, notice that CE consistently outperforms selection by MV in terms of accuracy and appears to capture some portion of the benefit RIE affords. Second, CE appears to converge at a higher learning rate than MV does.

Figure 3: Accuracy on the test set and epochs to convergence (i.e., the number of sequential epochs to the maximum validation accuracy) for different learning rates on Reuters. We fit a spline to the accuracy and draw vertical lines through the maximum point of each spline. (Panels: "Accuracy by Learning Rate (Reuters)" and "Training Time by Learning Rate (Reuters)".)

Looking at the epochs in Figure 3, it is immediately obvious that there is high variance in the epochs at the higher learning rates, because excessively high learning rates result in unreliable convergence. At lower learning rates, the number of epochs to convergence grows as the learning rate decreases, since the network moves through the parameter space slowly. Additionally, we observe that the maximum point for CE translates to approximately two epochs of training time, whereas the maximum point for MV translates to five epochs of training time. In this case one should generally prefer CE, because it requires less running time and achieves better performance than MV. Comparing to RIE, we see that RIE takes about twelve epochs to converge under its optimal learning rate, compared to CE's two epochs. This gain in convergence speed is less significant for the Reuters data set because it is small and straightforward; however, for larger training data sets a single epoch can easily take hours or days, and in those settings CE might be a better choice of ensemble. Finally, we observe that the optimal learning rate is very similar between CE and RIE. Since the optimal learning rate between MV and RIE is quite different, another practical use for CE could be to tune the optimal learning rate for RIE.
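For reference, the feedforward architecture described above translates into Keras roughly as follows; the optimizer and loss are our assumptions, since the text specifies only the layer structure, the learning-rate grid, and the ten-round early stopping.

from tensorflow import keras
from tensorflow.keras import layers

def build_reuters_model(lr):
    model = keras.Sequential([
        keras.Input(shape=(1000,)),              # 1,000 input nodes
        layers.Dense(512, activation="relu"),    # 512-node hidden layer, ReLU
        layers.Dropout(0.5),                     # 0.5 dropout
        layers.Dense(46, activation="softmax"),  # 46-way softmax output
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Stop after ten epochs without validation improvement, as in the paper.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10)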
CIFAR-10

The CIFAR-10 data set is more complicated than the Reuters data. Since the data are images, we apply CNNs to the classification problem. Our network is structured as follows: a 32-neuron (3x3) convolution layer, ReLU activation, a 32-neuron (3x3) convolution layer, ReLU activation, 2x2 max pooling, 0.25 dropout, a 64-neuron (3x3) convolution layer, ReLU activation, a 64-neuron (3x3) convolution layer, ReLU activation, 2x2 max pooling, 0.25 dropout, a flatten layer, a 512-neuron dense layer, ReLU activation, 0.5 dropout, a 10-neuron dense layer, and a softmax. We generate five models for each learning rate in a grid from roughly 10^{-4} to 10^{-2.5} (a total of 185 samples).

In Table 2 we see significant improvements from CE and RIE, as we did on the Reuters data set. Additionally, we can see that the smoothers (LKS and CS) are not good options for intra-process model averaging. In fact, for the CIFAR-10 data set checkpoint smoothing (CS) has a generally negative effect, whereas it has a generally positive effect on the Reuters data set.

Table 2: Improvement over MV (CIFAR-10). For each method (CE, CS, LKS, RIE), the table reports the 95% confidence interval of the difference from MV and the corresponding p-value; the intervals for CE and RIE are positive, and the intervals for CS and LKS are negative. We report p-values and confidence intervals for the one-sample t-test of the null hypothesis that there is zero difference from the baseline (MV). The confidence intervals indicate the direction the performance moved relative to the baseline: positive values indicate better performance and negative values indicate worse performance.

Figure 4: Accuracy on the test set and epochs to convergence (i.e., the number of sequential epochs to the maximum validation accuracy) for different learning rates on CIFAR-10. We fit a spline to the accuracy and draw vertical lines through the maximum point of each spline. (Panels: "Accuracy by Learning Rate (CIFAR-10)" and "Training Time by Learning Rate (CIFAR-10)".)

Once again, we average accuracy and number of epochs over the five runs used in Table 2 for MV and CE. In Figure 4 a similar pattern emerges in terms of accuracy: CE outperforms MV in both accuracy and epochs, allowing a bump in optimal performance as well as a reduction in the training epochs required for maximum accuracy, from 70 to 50 (Figure 4). When inference is cheap, CE should always be preferable to MV for training neural networks. Comparing to RIE, we capture a portion of the benefit RIE provides with 50 epochs rather than 280 (Figure 4). In certain settings with extremely high training times, checkpoint ensembles could be preferable to RIE as well. Finally, we observe that the optimal learning rate is once again very similar between CE and RIE, further supporting the use of CE as a cheap way to tune the optimal learning rate for RIE.
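The CNN specification above corresponds roughly to the following Keras model; padding, optimizer, and loss are our assumptions where the text does not state them.

from tensorflow import keras
from tensorflow.keras import layers

def build_cifar_model(lr):
    model = keras.Sequential([
        keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model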
Operating Room Data
In this section we examine the performance of our methods on a significantly larger data set containing both time series data and static summary information about patients in operating rooms. Since it contains time series data, we use an LSTM network for the two prediction tasks. To keep the number of features manageable, we select the most common time series features (e.g., SAO2 and ETCO2) as well as natural static features (age, gender, height, weight, and ASA code). Because this data set is so large, we only generate results for five learning rates. Additionally, we use the AUC of the PR curve to measure performance, because PR curves are often used when the positive class is more interesting than the negative class, as is the case in these prediction problems.

The first prediction task is oxygen desaturation, a medical condition that we define as the blood oxygen dropping below a fixed threshold. The current state-of-the-art method for this data set is XGBoost with pre-processed features (primarily exponential moving averages of the time series features). Running XGBoost with early stopping and a small step size on the subset of processed features yielded a baseline AUC of the PR curve on the test set.

As a comparison, we ran an LSTM structured as follows: 41 input nodes, two layers of 400 LSTM nodes with recurrent dropout, 0.5 dropout, and one output node with a sigmoid activation. Rather than generating multiple models for each learning rate, we bootstrapped the test data fifty times in order to have a distribution over the possible prediction tasks. Using these bootstrapped test sets, we calculated estimates of the test AUCs as well as standard deviations.

In Table 3, LSTMs did indeed improve over XGBoost. Furthermore, the gain in performance from using checkpoint ensembles is significantly larger than the bootstrapped standard deviation. Since this gain exceeds the AUC that the state of the art (XGBoost) achieves, it appears that checkpoint ensembles can offer significant performance gains on a fairly difficult space of prediction problems.

Table 3: Test AUC (OR Data - Desaturation). Gain reports the improvement of CE over MV, and Epoch is the epoch at which the model had the best validation score.

Learning Rate  0.01    0.005   0.001   0.0005  0.0001
MV             0.1057  0.2102  0.2333  0.2261  0.2222
MV (σ)         0.0019  0.0033  0.0030  0.0029  0.0032
CE             0.1057  0.2136  0.2363  0.2323  0.2252
CE (σ)         0.0019  0.0034  0.0030  0.0030  0.0032
Gain           0.0000  0.0033  0.0030  0.0062  0.0030
Gain (σ)       0.0000  0.0006  0.0005  0.0004  0.0008
Epoch          1       3       13      19      80

Our next prediction task is hypocapnia, a medical condition defined as end-tidal CO2 dropping below a fixed threshold (in mmHg). Using XGBoost on the processed features with early stopping and a small step size, we found a baseline AUC on the test set.

As a comparison, we ran an LSTM structured as follows: 41 input nodes, two layers of 200 LSTM nodes with recurrent dropout, 0.5 dropout, and one output node with a sigmoid activation. Once again, we bootstrapped the test data to obtain standard deviations for our performance estimates.

First, in Table 4 we see that our estimate of the gain is once again significantly larger than its bootstrapped standard deviation. Additionally, we see that LSTMs do not generally improve over XGBoost in this setting; however, utilizing checkpoint ensembles affords a performance gain that brings comparable performance at its best learning rate. Table 4 also shows that checkpoint ensembles generally perform better at learning rates on the higher side, which is consistent with the Reuters and CIFAR-10 data sets.

Table 4: Test AUC (OR Data - Hypocapnia). Gain reports the improvement of CE over MV, and Epoch is the epoch at which the model had the best validation score.

Learning Rate  0.01    0.005   0.001   0.0005  0.0001
MV             0.1398  0.4059  0.4279  0.4247  0.4256
MV (σ)         0.0011  0.0030  0.0030  0.0027  0.0029
CE             0.1398  0.4186  0.4365  0.4307  0.4283
CE (σ)         0.0011  0.0030  0.0031  0.0027  0.0027
Gain           0.0000  0.0127  0.0087  0.0060  0.0027
Gain (σ)       0.0000  0.0006  0.0005  0.0004  0.0007
Epoch          1       7       11      10      63
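The bootstrap procedure described above can be sketched as follows, assuming NumPy arrays y_true (binary labels) and y_score (predicted scores) on the held-out test set; it resamples the test set with replacement and reports the spread of the PR AUC.

import numpy as np
from sklearn.metrics import auc, precision_recall_curve

def bootstrap_pr_auc(y_true, y_score, n_boot=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample the test set with replacement
        p, r, _ = precision_recall_curve(y_true[idx], y_score[idx])
        aucs.append(auc(r, p))
    return np.mean(aucs), np.std(aucs)    # estimate and its standard deviation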
Discussion and Future Work

We present and analyze a method that captures the effects of traditional ensemble methods within a single training process. Checkpoint ensembles (CE) provide the following benefits:

1. CE attains a significant amount of the benefit of traditional ensembles with significantly fewer training epochs.
2. CE attains optimal performance before minimum validation model selection does, suggesting that fewer epochs are needed. Additionally, CE's optimal performance is higher than MV's.
3. CE affords performance gains over minimum validation for simple neural networks, convolutional neural networks, and long short-term memory networks.
4. CE offers a cheaper method to tune random initialization ensembles, as we have seen that minimum validation is generally not a good approximation of the optimal parameter settings for RIE.

The limitation of checkpoint ensembles is extra prediction time (capped at a constant a + 5 times the prediction time of a single model, where a is the number of early stopping rounds). If prediction time is not a problem, CE should always be favored over MV.

As for future work, checkpoint ensembles could be explored with more sophisticated ensembling schemes, such as the Bayes optimal classifier or the discrete Super Learner, rather than a simple unweighted average (van der Laan, Polley, and Hubbard 2007; Ju, Bibaut, and van der Laan 2017). Additionally, the utility of checkpoint ensembles could be explored in iterative learning techniques outside of neural networks.

References

[Chollet and others 2015] Chollet, F., et al. 2015. Keras. https://github.com/fchollet/keras.
[Dietterich 2000] Dietterich, T. G. 2000. Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems, MCS '00, 1-15. London, UK: Springer-Verlag.
[Hochreiter and Schmidhuber 1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735-1780.
[Hochreiter 1998] Hochreiter, S. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6(2):107-116.
[Ju, Bibaut, and van der Laan 2017] Ju, C.; Bibaut, A.; and van der Laan, M. J. 2017. The relative performance of ensemble methods with deep convolutional neural networks for image classification. CoRR abs/1704.01664.
[Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Pereira, F.; Burges, C. J. C.; Bottou, L.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 25, 1097-1105. Curran Associates, Inc.
[Krogh and Vedelsby 1995] Krogh, A., and Vedelsby, J. 1995. Neural network ensembles, cross validation, and active learning. In Advances in Neural Information Processing Systems, 231-238. MIT Press.
[Lipton et al. 2015] Lipton, Z. C.; Kale, D. C.; Elkan, C.; and Wetzel, R. C. 2015. Learning to diagnose with LSTM recurrent neural networks. CoRR abs/1511.03677.
[Naftaly, Intrator, and Horn 1997] Naftaly, U.; Intrator, N.; and Horn, D. 1997. Optimal ensemble averaging of neural networks. Network: Computation in Neural Systems 8(3):283-296.
[Sennrich et al. 2017] Sennrich, R., et al. 2017. The University of Edinburgh's neural MT systems for WMT17. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers.
[Utans 1996] Utans, J. 1996. Weight averaging for neural networks and local resampling schemes. In Proceedings of the AAAI-96 Workshop on Integrating Multiple Learned Models.
[van der Laan, Polley, and Hubbard 2007] van der Laan, M. J.; Polley, E. C.; and Hubbard, A. E. 2007. Super Learner. Statistical Applications in Genetics and Molecular Biology 6(1).