Investigating the Effect of Intraclass Variability in Temporal Ensembling

Siddharth Vohra*, Manikandan Ravikiran
University of California, San Diego
Research & Development Center, Hitachi India Pvt Ltd., Bangalore, India

* Work performed while interning at Hitachi R&D India.
Abstract
Temporal Ensembling is a semi-supervised approach that allows training deep neural network models with a small number of labeled images. In this paper, we present our preliminary study on the effect of intraclass variability on temporal ensembling, with a focus on seed size and seed type. Through our experiments we find that (a) there is a significant drop in accuracy with datasets that offer high intraclass variability, (b) more seed images offer consistently higher accuracy across the datasets, and (c) seed type indeed has an impact on the overall efficiency, where different seed choices produce a spectrum of both lower and higher accuracies. Additionally, based on our experiments, we also find KMNIST to be a competitive baseline for temporal ensembling.
1 Introduction

Deep neural networks have seen broad applications across vision, speech, and language in recent times. Yet this success is contingent on acquiring large amounts of labeled data, which is expensive and time-consuming. Further, labeling is mostly manual, done by humans, due to the meticulousness it demands. Recently, to address this concern of manual labeling, a variety of approaches have been designed, including semi-supervised learning algorithms (Gordon and Hernández-Lobato, 2017; Laine and Aila, 2017; Lee, 2013), which typically achieve strong results with a small number of labeled examples (seeds). Notable among these is Temporal Ensembling (Laine and Aila, 2017), which uses an ensemble of the earlier outputs of a neural network as an unsupervised target label and achieved high accuracy on SVHN and CIFAR-10 with just 500 and 4000 labeled samples respectively, both datasets naturally offering low intraclass variance. However, to the best of our knowledge, there is no explicit study of temporal ensembling in the context of datasets with large intraclass variability. As such, in this work, we attempt to investigate this gap by answering the following research questions.
• RQ1: Does intraclass variability impact the accuracy of temporal ensembling? Here the intention is to check (a) how accuracy varies and (b) whether there is any unique observable behavior of temporal ensembling under different intraclass variances. The assumption here is that intraclass variability is a spectrum ranging from low to high intraclass variation. To this end, we experiment on the Fashion-MNIST (Xiao et al., 2017) and KMNIST (Clanuwat et al., 2018) datasets and find that there is a sharp drop in performance when using temporal ensembling.

• RQ2: Under settings of intraclass variability, how does seed size impact temporal ensembling? Here we hypothesize and verify that increasing the seed size improves performance.

• RQ3: What is the effect of seed selection on temporal ensembling? More specifically, we examine whether the diversity of seeds has an impact on results. Preliminary experimental results show that performance is lower with some categories of seeds than with others.

The rest of the paper is organized as follows. In section 2, we review related research in semi-supervised learning. In section 3, we introduce the temporal ensembling approach in brief. In section 4, we present the dataset and experimental setup, and in section 5 we answer each of the research questions with analysis. Finally, we conclude in section 6 with possible implications for future work.

2 Related Work
Semi-supervised learning has seen an extensive assortment of works, originating with pseudo-labeling (Lee, 2013), which assigns pseudo-labels to unlabeled data using the model forecasts at each epoch and trains on both labeled and pseudo-labeled data. Then there are variational autoencoders (Kingma and Welling, 2014), which use a deep generative model (Kingma et al., 2014) for semi-supervised learning. Similarly, there are Ladder Networks (Rasmus et al., 2015), which add noise to the input of each layer in the neural network and combine it with a denoising function; the ladder network combines the denoising loss with the supervised loss during final training. More recently, there is Virtual Adversarial Training (VAT) (Miyato et al., 2015), which uses perturbations generated by adversarial learning; Temporal Ensembling (Laine and Aila, 2017), which is like pseudo-labeling except that it matches the outputs of previous models; and Mean-Teacher (Tarvainen and Valpola, 2017), which addresses the storage requirement of temporal ensembling by using an exponential moving average of model weights. There are many other notable works, including works on co-training and feature learning (Blum and Mitchell, 1998; Sindhwani et al., 2005), graph-based methods (Ng and Silva, 2018), metric-learning based methods (Yu et al., 2017), and sampling noisy labels (Vahdat, 2017; Ravikiran et al., 2020). In this work, we concentrate on temporal ensembling due to its simplicity and study it in the context of datasets with diverse intraclass variability.
3 Temporal Ensembling

Temporal Ensembling is an enhancement of the $\Pi$-model (Laine and Aila, 2017) and is based on the idea of self-ensembling, where earlier outputs of the network are ensembled into an unsupervised target (see Equations 1-3). This process of self-ensembling is analogous to that of pseudo-labels. More specifically, in temporal ensembling, training is done with the dataset under different augmentations and dropout regularization, which pushes the network to acquire noise-invariant features.

$$l_B(z, \tilde{z}, y) = \mathrm{masked\_crossentropy}(z, y) + w(t) \cdot \mathrm{MSE}(z, \tilde{z}) \tag{1}$$

$$\mathrm{masked\_crossentropy}(z, y) = -\frac{1}{|B \cap L|} \sum_{i \in (B \cap L)} \log z_i[y_i] \tag{2}$$

$$\mathrm{MSE}(z, \tilde{z}) = \frac{1}{C|B|} \sum_{i \in B} \lVert z_i - \tilde{z}_i \rVert^2 \tag{3}$$

Here, $B$ denotes the current minibatch, $L$ the set of labeled examples, $C$ the number of classes, and $w(t)$ the ramp-up weight of the unsupervised loss. This, in turn, encourages the neural network not to shift its predictions on slightly modified variants of the same inputs. After every training epoch, the ensemble outputs are refreshed by combining the current prediction with the previous ensemble prediction through the exponential moving average (EMA) shown in Equation 4. Also, to generate the training target $\tilde{z}$ as shown in Equation 5, temporal ensembling corrects the startup bias of $Z$. More details on temporal ensembling can be found in (Laine and Aila, 2017) and (Ferret, 2018).

$$Z \leftarrow \alpha Z + (1 - \alpha)\, z \tag{4}$$

$$\tilde{z} \leftarrow Z \,/\, (1 - \alpha^{t}) \tag{5}$$
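To make Equations 1-5 concrete, the following is a minimal PyTorch sketch of the per-batch loss and the per-epoch target update, written against our reading of the equations rather than the authors' code (see Ferret (2018) for a complete implementation). It assumes z holds softmax outputs of shape (batch, classes), labeled_mask marks the labeled subset B ∩ L, and w_t is the ramped-up unsupervised weight w(t); the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def masked_crossentropy(z, y, labeled_mask):
    # Eq. 2: cross-entropy averaged over the labeled subset B ∩ L only.
    if labeled_mask.sum() == 0:
        return z.new_zeros(())
    log_probs = torch.log(z[labeled_mask] + 1e-8)
    return F.nll_loss(log_probs, y[labeled_mask])

def temporal_ensembling_loss(z, z_tilde, y, labeled_mask, w_t):
    # Eq. 1: supervised term plus the weighted consistency term.
    # F.mse_loss averages over all C * |B| entries, matching Eq. 3.
    return masked_crossentropy(z, y, labeled_mask) + w_t * F.mse_loss(z, z_tilde)

def update_ensemble_targets(Z, z_epoch, epoch, alpha=0.6):
    # Eq. 4: accumulate per-epoch predictions with an exponential moving
    # average (alpha = 0.6 as in Laine and Aila (2017)).
    Z = alpha * Z + (1.0 - alpha) * z_epoch
    # Eq. 5: correct the startup bias of Z (epochs assumed 0-indexed here).
    z_tilde = Z / (1.0 - alpha ** (epoch + 1))
    return Z, z_tilde
```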
4 Experimental Setup

To answer the research questions from section 1, we compare and contrast the performance of temporal ensembling on various datasets under varying settings of labeled samples. This section provides a brief overview of the datasets (section 4.1) and parameter settings (section 4.2) used.

4.1 Dataset

For this work, we use the datasets presented in Table 1. More details of the datasets are presented in place across sections 5.1-5.3 where necessary.
Dataset         Image Size   Train Size   Test Size   Description
MNIST           28x28        60,000       10,000      Handwritten digits
KMNIST          28x28        60,000       10,000      Handwritten Kuzushiji characters
Fashion-MNIST   28x28        60,000       10,000      Clothing images from Zalando.com

Table 1: Summary of the datasets and their characteristics, used in experiments 5.1-5.3.
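The paper does not commit to a quantitative definition of intraclass variability, so the following is only an illustrative proxy we assume for exposition: the per-pixel variance of images within each class, averaged over classes, computed with torchvision's stock loaders. Higher values for KMNIST and Fashion-MNIST than for MNIST would be consistent with the ordering assumed throughout this paper.

```python
import numpy as np
from torchvision import datasets

def mean_intraclass_variance(images, labels, num_classes=10):
    # Average, over classes, of the mean per-pixel variance within a class.
    variances = []
    for c in range(num_classes):
        x = images[labels == c].reshape(-1, 28 * 28) / 255.0
        variances.append(x.var(axis=0).mean())
    return float(np.mean(variances))

for name, cls in [("MNIST", datasets.MNIST),
                  ("KMNIST", datasets.KMNIST),
                  ("Fashion-MNIST", datasets.FashionMNIST)]:
    train = cls(root="./data", train=True, download=True)
    images = train.data.numpy().astype(np.float32)
    labels = np.asarray(train.targets)
    print(f"{name}: {mean_intraclass_variance(images, labels):.4f}")
```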
4.2 Parameter Settings

The various settings employed to answer each of the RQs are as described below. All the RQs employ standard settings for most of the parameters, as shown in Table 2.

• RQ1: For the analysis of RQ1, we train and test temporal ensembling models on all three datasets, under the common parameter settings from Table 2, for 100, 300, and 500 epochs respectively.

• RQ2: For RQ2, we use the parameters from Table 2 and vary the number of labeled examples in the range 100-500.

• RQ3: For RQ3, in addition to the settings mentioned in Table 2, we run experiments with 10 different randomly sampled seeds.
Hyperparameter           Value
Dropout
Standard Deviation
Feature maps in Conv 1
Feature maps in Conv 2
Weight Normalization     True
Learning Rate
Beta
Batch Size
Alpha
Data Normalization       channel wise

Table 2: Hyperparameter settings used for experiments in sections 5.1-5.3.
5 Results

5.1 RQ1: Impact of Intraclass Variability

The first research question asks how intraclass variability influences the accuracy of temporal ensembling. To answer this, the accuracy score is reported on the datasets of MNIST, KMNIST, and Fashion-MNIST, respectively. The reason behind selecting these datasets is that all of them are grayscale with the same image size and proportion of images, but KMNIST and Fashion-MNIST offer wide intraclass variability. None of these datasets requires any size normalization, thus making results comparable across the datasets.

As mentioned in section 4.2, we trained temporal ensembling for 100, 300, and 500 epochs, respectively. The consolidated results are shown in Table 3. We also ran five episodes of training with varying seed sampling; hence the results presented in Table 3 are averaged, with the standard deviation of accuracy. Detailed episode-level results are in section 7.1.

Epochs          100   300   500
MNIST            ±     ±     ±
KMNIST           ±     ±     ±
Fashion-MNIST    ±     ±     ±

Table 3: Accuracy (mean ± standard deviation over five episodes) of each dataset at 100, 300, and 500 epochs.

Figure 1: Training loss behavior of MNIST using Temporal Ensembling. (a) Supervised Loss; (b) Unsupervised Loss.

Comparing the results of the different datasets, we can see that MNIST gives the highest accuracy, followed by Fashion-MNIST and KMNIST, respectively. Besides, MNIST shows faster convergence and stagnation in the loss (see Figure 1), while KMNIST (Figure 2) and Fashion-MNIST (Figure 3) show delayed convergence and an increase in loss values at higher epochs for both the supervised and unsupervised components. We believe such behavior is due to two reasons, namely (i) similarity (low variance) between train-test samples and among samples within the same classes for MNIST compared to KMNIST and Fashion-MNIST, and (ii) unsupervised labels getting biased towards one of the classes, with no updates in the subsequent iterations of temporal ensembling. However, more experiments are warranted to validate both (i) and (ii).

Further, from Figures 2 and 3, we can see that such behavior is consistent across episodes with different seed samples for both KMNIST and Fashion-MNIST. Besides, the results of KMNIST and Fashion-MNIST with some of the seeds differ by a large margin (see Tables 8-10 in the Appendix). Overall, the analysis of temporal ensembling here shows that it is less tuned for datasets that have higher intraclass variance and might, therefore, perform better on datasets that do not match such criteria. To summarize, our findings are as follows.

• The accuracy score is highest for MNIST (97.21%), and among the datasets that offer intraclass variance there is a sharp drop in performance, with KMNIST producing 60.66% and Fashion-MNIST 71.31%. We believe such a drop is due to intraclass variance and variance between train-test samples, both of which require further analysis.

• Unlike MNIST, both KMNIST and Fashion-MNIST present a unique behavior: a lack of convergence for the unsupervised loss, which instead increases at higher epochs. This raises the question of whether the unsupervised labels are biased towards a specific set of class(es) and are not changing across iterations. At the same time, it can also be conjectured that this could be due to the selection of the learning rate or other components of the temporal ensembling algorithm. Validating these requires a more detailed analysis of temporal ensembling and its components.

• Temporal ensembling is less suitable for datasets that offer higher intraclass variance, as evident from the results. As such, it is not directly usable despite the similarity in size and proportion of the datasets.

Figure 2: Training loss behavior of KMNIST using Temporal Ensembling. (a) Supervised Loss; (b) Unsupervised Loss.

Figure 3: Training loss behavior of Fashion-MNIST using Temporal Ensembling. (a) Supervised Loss; (b) Unsupervised Loss.
5.2 RQ2: Impact of Seed Size

Previously, in section 5.1, we analyzed the impact of intraclass variance on temporal ensembling by studying performance on the MNIST, KMNIST, and Fashion-MNIST datasets. However, for all the experiments in section 5.1, we maintained a uniform seed size of 100 labeled samples. Besides, in Figures 2 and 3, we saw that seeds indeed have an impact on the overall results. As such, in this section, we investigate the effect of seed size, i.e., the number of labeled samples. More specifically, for RQ2, we vary the seed size in the range of 100-500 and analyze the performance of models trained for 300 and 500 epochs, respectively. A sketch of how such labeled seeds might be drawn is given below.
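The paper does not state whether its seeds were class-balanced, so the stratified, per-class sampling below is our assumption; the rng_seed argument plays the role of the "seed 14", "seed 102", etc. identifiers used later.

```python
import numpy as np

def sample_labeled_seed(labels, seed_size, num_classes=10, rng_seed=14):
    # Draw a class-balanced labeled subset; everything else stays unlabeled.
    rng = np.random.default_rng(rng_seed)
    per_class = seed_size // num_classes
    chosen = []
    for c in range(num_classes):
        class_idx = np.flatnonzero(labels == c)
        chosen.extend(rng.choice(class_idx, size=per_class, replace=False))
    return np.sort(np.array(chosen))

# RQ2 sweep over seed sizes, e.g.:
# for n in (100, 200, 300, 400, 500):
#     labeled_idx = sample_labeled_seed(train_labels, n)
```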
Seed Size       100   200   300   400       500
MNIST            ±     ±     ±    ± .3285   97.37 ±
KMNIST           ±     ±     ±     ±         ±
Fashion-MNIST    ±     ±     ±     ±         ±

Table 4: Accuracy with varying seed size, trained for 300 epochs.

Seed Size       100   200   300   400   500
MNIST            ±     ±     ±     ±     ±
KMNIST           ±     ±     ±     ±     ±
Fashion-MNIST    ±     ±     ±     ±     ±

Table 5: Accuracy with varying seed size, trained for 500 epochs.

Figure 4: Training loss behavior of MNIST using Temporal Ensembling with 300 Seeds. (a) Supervised Loss; (b) Unsupervised Loss.
Figure 5: Training loss behavior of MNIST using Temporal Ensembling with 500 Seeds. (a) Supervised Loss; (b) Unsupervised Loss.
Figure 6: Training loss behavior of KMNIST using Temporal Ensembling with 300 Seeds. (a) Supervised Loss; (b) Unsupervised Loss.
Figure 7: Training loss behavior of KMNIST using Temporal Ensembling with 500 Seeds. (a) Supervised Loss; (b) Unsupervised Loss.

Figure 8: Examples of KMNIST seeds used in the experiments. (a) Examples from seed 14 used in the 300-seed experiments; (b) examples from seed 14 used in the 500-seed experiments.

For KMNIST, increasing the seed size from 100 to 500 improves the result by 25% with 300 epochs and by 14.7% with 500 epochs, to a result of 75%. Meanwhile, Figures 6 and 7 show the loss curves with 300 and 500 seeds, respectively, across the five episodes of training under each epoch setting. As we can see, with more seeds the network training behavior is indeed different, with lower values of both the supervised and unsupervised loss. Further, such behavior is constant across the various episodes of training. Examples from the 300 and 500 seeds used for KMNIST are shown in Figure 8.

Figure 9: Training loss behavior of Fashion-MNIST using Temporal Ensembling with 300 Seeds. (a) Supervised Loss; (b) Unsupervised Loss.

Figure 10: Training loss behavior of Fashion-MNIST using Temporal Ensembling with 500 Seeds. (a) Supervised Loss; (b) Unsupervised Loss.

Figure 11: Examples of Fashion-MNIST seeds used in the experiments. (a) Examples from seed 14 used in the 300-seed experiments; (b) examples from seed 14 used in the 500-seed experiments.

Fashion-MNIST again shows behavior similar to KMNIST, except with a net improvement of 8%, reaching a maximum of 81.53% with 500 seeds. The loss curves (Figures 9 & 10) and seeds (Figure 11) show behavior similar to KMNIST. What is very interesting to see is that the performance of Fashion-MNIST is very close to the two-layer convolution baseline (TLCB) of 87% (Table 6), which is trained with the complete 60k images. So even though temporal ensembling uses only 500 labeled images against the 60k used by TLCB, the performance difference is only 7%. Similarly, for KMNIST, we can see that the performance is 17% lower than the simplest K-Nearest Neighbor baseline of 92.1%, obtained without any feature engineering (Table 6).
Dataset         Approach                               Seed Size   Results
KMNIST          K-Nearest Neighbors Baseline           60k         92%
KMNIST          Temporal Ensembling (500 epochs)       500         75.35%
Fashion-MNIST   Two-Layer Convolution Neural Network   60k         87%
Fashion-MNIST   Temporal Ensembling (500 epochs)       500         81.53%

Table 6: Comparison of results from Temporal Ensembling with varying seed size (this work) against the respective baselines for KMNIST and Fashion-MNIST trained with 60k images.

These two results confirm that there is some correlation between test performance and seed size. Besides, this also shows KMNIST to be a competitive baseline for temporal ensembling compared to MNIST and Fashion-MNIST in the context of intraclass variance. Also, these results strengthen our initial argument from section 1 that there is a need for experimental studies, such as those involving intraclass variability, in semi-supervised learning approaches to find out what aspects other than the number of labeled samples impact performance. To summarize, our findings are:

• Seed size drastically impacts results under conditions of intraclass variability, where a larger seed size produces better results.

• KMNIST and Fashion-MNIST show improvements of 14.7% and 8% respectively with an increase in seed size from 100 to 500, producing competitive results against baselines trained on the complete 60k images. While the net improvement varies across the two datasets, the trend is consistent.

• MNIST is a simple dataset, and from the results we can see that it is not a strong baseline for temporal ensembling in the context of intraclass variability; this role could instead be fulfilled by KMNIST.

• From these results, we argue that there is a need for more detailed studies on the effect of intraclass variability in the context of general semi-supervised learning.
5.3 RQ3: Impact of Seed Selection

Previously, in section 5.2, we saw that with increasing seed size the accuracy improved for all the datasets, and in the case of KMNIST and Fashion-MNIST we observed substantial gains. Here we examine how the type of seed impacts the results. This is similar to the argument of generalization in neural networks, where the trained model should at least correctly classify images identical to the ones it was trained on. However, in the context of temporal ensembling, the labeled examples serve two purposes, namely improving the quality of the self-ensembled labels and, at the same time, improving accuracy. As such, one can hypothesize that starting with seeds whose distribution is closer to the test set samples may achieve higher results, and vice versa. Previous works have shown that all the datasets from section 4.1 contain examples similar to the train distribution as well as some that are profoundly different. As such, to answer this RQ, we repeat experiments in line with RQ2, except that we perform training over ten episodes with 300 seeds and 500 epochs. Also, to account for the effect of training randomness on accuracy, each of the models was tested at every epoch, and the model from the epoch with the highest accuracy was chosen (a sketch of this selection loop is given after Table 7). Consolidated results with standard deviation over the ten episodes are shown in Table 7.

Comparing Table 7 with the results from RQ2 (Table 5), we can see that with different samples of seeds we obtain both higher and lower results than those from RQ2. Similar to the findings from RQ1 and RQ2, MNIST produces results close to 97% despite random sampling of seeds across episodes; the lowest accuracy of 96.95% and the highest of 98.03% could be observed. Figure 12 shows the convergence behavior of these two models. As we can see, both models converge similarly and in line with RQ1 and RQ2. Please note that the results for RQ3 are still preliminary and require further analysis and validation.

Training Episode   MNIST   KMNIST   Fashion-MNIST

Table 7: Consolidated RQ3 results (accuracy with standard deviation over ten training episodes).
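A sketch of the per-epoch evaluation and best-model selection described above follows; the train_fn and eval_fn helpers are placeholders for the usual training and test-accuracy loops, not functions from the paper.

```python
import copy

def train_with_best_epoch_selection(model, train_fn, eval_fn, num_epochs=500):
    # Evaluate after every epoch and keep the checkpoint with the
    # highest test accuracy, as done for the RQ3 experiments.
    best_acc, best_state = 0.0, None
    for epoch in range(num_epochs):
        train_fn(model, epoch)   # one epoch of temporal ensembling
        acc = eval_fn(model)     # test accuracy at this epoch
        if acc > best_acc:
            best_acc = acc
            best_state = copy.deepcopy(model.state_dict())
    if best_state is not None:
        model.load_state_dict(best_state)
    return model, best_acc
```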
Figure 12: Training loss behavior of MNIST trained for 10 episodes with varying seeds. The highest and lowest models correspond to seed 14 and seed 102, respectively.

Meanwhile, KMNIST and Fashion-MNIST show widely varying results (see Table 7). To begin with, KMNIST produces a lowest accuracy of 66.86% and a highest of 77.40%. The lowest accuracy of KMNIST is close to the results obtained with 100 seeds, which in turn suggests that some of the selected seeds were practically useless. Similar behavior could be observed with the Fashion-MNIST dataset, with a highest accuracy of 80.05% and a lowest of 77.94%, where again the lowest result is close to the results with 200 seeds in Table 5.

The fact that the performance of some experiments with varying seeds is significantly lower than the others raises the question of whether those models overfit, lowering the results obtained. However, as mentioned earlier, this was avoided by checking the model accuracy at each epoch. If this is indeed the case, it in turn raises a new question of whether there are samples that would allow faster and better convergence. Overall, from these preliminary experiments:

Figure 13: Training loss behavior of KMNIST trained for 10 episodes with varying seeds. The highest and lowest models correspond to seed 179 and seed 92, respectively. See Figures 15 & 16 in the appendix.

• We find that seed selection indeed impacts the results across all the datasets. For KMNIST, the highest and lowest results obtained with different seeds differ by 2% and 9%, respectively. Similarly, for Fashion-MNIST, the numbers are 0.5% and 3%, respectively.

• Performance with some of the seeds is significantly lower, which suggests that some seeds contribute more to training than others; this points to the identification of optimal seed images for temporal ensembling.

• We further emphasize that the results shown here require further in-depth experimentation, analysis of results, and theoretical grounding.

6 Conclusion

This paper investigated the effect of intraclass variability on temporal ensembling. First, by analyzing different datasets, we showed that the datasets differ widely in terms of their intraclass variation: KMNIST offers the highest intraclass variability, followed by Fashion-MNIST and MNIST. From our study of RQ1, we find that, given constant parameter settings, intraclass variability indeed affects the overall performance, with KMNIST producing 60.66% and Fashion-MNIST 71.31%. Further, we also found that KMNIST and Fashion-MNIST present a problem of lack of convergence of the unsupervised loss, which instead increases at higher epochs. Overall, our study of RQ1 suggests that temporal ensembling is not directly usable for datasets with high intraclass variability.

From our review of the effect of seed size in RQ2, we see that a higher seed size results in better accuracy across all the datasets: KMNIST and Fashion-MNIST show improvements of 14.7% and 8% respectively with an increase in seed size from 100 to 500, producing competitive results against baselines trained on the complete 60k images. Further, we could see that KMNIST serves as a competitive baseline for temporal ensembling, as it accounts for intraclass variability. Finally, in RQ3, we demonstrate that different seed images yield different results.
Figure 14: Loss behavior of Fashion-MNIST trained for 10 episodes with varying seeds. The highest and lowest models correspond to seed 102 and seed 20, respectively.

Overall, across a broad range of datasets, we examined how temporal ensembling performs under intraclass variability. While there is considerable variation within the classes across the datasets, it is consistent that classes with more intraclass variability are harder to classify. This connection with intraclass variability occurs to such an extent that, in fact, temporal ensembling with 1/3 of the selected seeds offers similar accuracy (see section 5.3). However, at the same time, this result serves as a decent baseline. We believe seeds are critical, and that temporal ensembling with different seeds is not sufficiently effective at generalizing beyond the image contexts found in the training data. To close this gap and advance temporal ensembling towards practical applications, various aspects need more in-depth exploration, including (a) the effect of varying the other hyperparameters (Table 2) in temporal ensembling, (b) the relationship between seed types and data distribution, and (c) the reason for the rise in the unsupervised loss at higher epochs.

References
Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In COLT '98.

Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. 2018. Deep learning for classical Japanese literature. ArXiv, abs/1812.01718.

Johan Ferret. 2018. Blog on semi-supervised image classification via Temporal Ensembling.

Jonathan Gordon and José Miguel Hernández-Lobato. 2017. Bayesian semisupervised learning with deep generative models. arXiv: Machine Learning.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. CoRR, abs/1312.6114.

Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised learning with deep generative models. In NIPS.

Samuli Laine and Timo Aila. 2017. Temporal ensembling for semi-supervised learning. ArXiv, abs/1610.02242.

Dong-Hyun Lee. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks.

Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. 2015. Distributional smoothing with virtual adversarial training. arXiv: Machine Learning.

Yin Cheng Ng and Ricardo Silva. 2018. Bayesian semi-supervised learning with graph Gaussian processes. In NeurIPS.

Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. 2015. Semi-supervised learning with ladder networks. In NIPS.

Manikandan Ravikiran, Amin Ekant Muljibhai, Toshinori Miyoshi, Hiroaki Ozaki, Yuta Koreeda, and Sakata Masayuki. 2020. Hitachi at SemEval-2020 Task 12: Offensive language identification with noisy labels using statistical sampling and post-processing. ArXiv, abs/2005.00295.

Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. 2005. A co-regularization approach to semi-supervised learning with multiple views. In ICML 2005.

Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS.

Arash Vahdat. 2017. Toward robustness against label noise in training deep discriminative neural networks. In NIPS.

Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. ArXiv, abs/1708.07747.

Jun Yu, Xiaokang Yang, Fei Gao, and Dacheng Tao. 2017. Deep multimodal distance metric learning using click constraints for image ranking. IEEE Transactions on Cybernetics, 47:4014-4024.
Acknowledgements
We would like to thank Johan Ferret (Ferret, 2018) for his PyTorch 0.4 implementation of temporal ensembling.
Author Contributions
MR designed the study. SV carried out the primary experiments. Both authors carried out the analysis and secondary experiments and contributed to developing the outline and editing the manuscript. MR wrote the manuscript and SV reviewed it.
7 Appendix

We now present various experimental details and results for each of the RQs.
7.1 Detailed Results for RQ1

Tables 8-10 present detailed results for RQ1.

Dataset   Epochs   Experiment   Accuracy   Accuracy (Best Model)
MNIST     100      Average
          300      Average
          500      Average

Table 8: Detailed episode-level results of MNIST for RQ1.

Dataset   Epochs   Experiment   Accuracy   Accuracy (Best Model)
KMNIST    100      Average
          300      Average
          500      Average

Table 9: Detailed episode-level results of KMNIST for RQ1.

Dataset         Epochs   Experiment   Accuracy   Accuracy (Best Model)
Fashion-MNIST   100      Average
                300      Average
                500      Average

Table 10: Detailed episode-level results of Fashion-MNIST for RQ1.

7.2 Detailed Results for RQ2

Tables 11-14 show the experimental results obtained with seed sizes varying from 200 to 500, respectively.

Dataset         Epochs   Experiment   Accuracy   Accuracy (Best Model)
MNIST           100
KMNIST          100
Fashion-MNIST   100

Table 11: Results of MNIST, KMNIST and Fashion-MNIST with 200 seeds.

Dataset         Epochs   Experiment   Accuracy   Accuracy (Best Model)
MNIST           100      Average      96.24%     96.28%
                300      Average      97.85%     97.82%
                500      Average      97.65%     97.71%
KMNIST          100      Average      70.18%     68.74%
                300      Average      69.88%     70.02%
                500      Average      67.86%     70.11%
Fashion-MNIST   100      Average      78.41%     78.54%
                300      Average      78.61%     78.70%
                500      Average      79.49%     79.92%

Table 12: Results of MNIST, KMNIST and Fashion-MNIST with 300 seeds.

Dataset         Epochs   Experiment   Accuracy   Accuracy (Best Model)
MNIST           100      Average      96.36%     96.03%
                300      Average      97.43%     97.41%
                500      Average      97.74%     97.73%
KMNIST          100      Average      72.16%     71.95%
                300      Average      73.60%     72.24%
                500      Average      73.71%     74.27%
Fashion-MNIST   100      Average      79.35%     79.19%
                300      Average      79.75%     80.40%
                500      Average      79.51%     80.68%

Table 13: Results of MNIST, KMNIST and Fashion-MNIST with 400 seeds.

Dataset         Epochs   Experiment   Accuracy   Accuracy (Best Model)
MNIST           100      Average      96.27%     96.10%
                300      Average      97.24%     97.37%
                500      Average      97.31%     97.35%
KMNIST          100      Average      73.64%     72.66%
                300      Average      74.80%     75.35%
                500      Average      75.17%     75.36%
Fashion-MNIST   100      Average      80.73%     80.90%
                300      Average      80.74%     81.53%
                500      Average      80.12%     80.40%

Table 14: Results of MNIST, KMNIST and Fashion-MNIST with 500 seeds.

7.3 Seeds Used in RQ3

Figures 15-16 show the seeds that produced the highest and lowest results for the MNIST dataset in RQ3.

Figure 15: Seed seed 14 used in the MNIST experiments as part of RQ3.

Figure 16: Seed seed 102 used in the MNIST experiments as part of RQ3.