Neural Persistence: A Complexity Measure for Deep Neural Networks Using Algebraic Topology
Bastian Rieck, Matteo Togninalli, Christian Bock, Michael Moor, Max Horn, Thomas Gumbsch, Karsten Borgwardt
Published as a conference paper at ICLR 2019

Department of Biosystems Science and Engineering, ETH Zürich, Switzerland
SIB Swiss Institute of Bioinformatics, Switzerland
† These authors contributed equally

ABSTRACT
While many approaches to make neural networks more fathomable have been proposed, they are restricted to interrogating the network with input data. Measures for characterizing and monitoring structural properties, however, have not been developed. In this work, we propose neural persistence, a complexity measure for neural network architectures based on topological data analysis on weighted stratified graphs. To demonstrate the usefulness of our approach, we show that neural persistence reflects best practices developed in the deep learning community such as dropout and batch normalization. Moreover, we derive a neural persistence-based stopping criterion that shortens the training process while achieving comparable accuracies as early stopping based on validation loss.
1 INTRODUCTION
The practical successes of deep learning in various fields such as image processing (Simonyan & Zisserman, 2015; He et al., 2016; Hu et al., 2018), biomedicine (Ching et al., 2018; Rajpurkar et al., 2017; Rajkomar et al., 2018), and language translation (Bahdanau et al., 2015; Sutskever et al., 2014; Wu et al., 2016) still outpace our theoretical understanding. While hyperparameter adjustment strategies exist (Bengio, 2012), formal measures for assessing the generalization capabilities of deep neural networks have yet to be identified (Zhang et al., 2017). Previous approaches for improving theoretical and practical comprehension focus on interrogating networks with input data. These methods include i) feature visualization of deep convolutional neural networks (Zeiler & Fergus, 2014; Springenberg et al., 2015), ii) sensitivity and relevance analysis of features (Montavon et al., 2017), iii) a descriptive analysis of the training process based on information theory (Tishby & Zaslavsky, 2015; Shwartz-Ziv & Tishby, 2017; Saxe et al., 2018; Achille & Soatto, 2018), and iv) a statistical analysis of interactions of the learned weights (Tsang et al., 2018). Additionally, Raghu et al. (2017) develop a measure of expressivity of a neural network and use it to explore the empirical success of batch normalization, as well as for the definition of a new regularization method. They note that one key challenge remains, namely to provide meaningful insights while maintaining theoretical generality. This paper presents a method for elucidating neural networks in light of both aspects.

We develop neural persistence, a novel measure for characterizing neural network structural complexity. In doing so, we adopt a new perspective that integrates both network weights and connectivity while not relying on interrogating networks through input data. Neural persistence builds on computational techniques from algebraic topology, specifically topological data analysis (TDA), which was already shown to be beneficial for feature extraction in deep learning (Hofer et al., 2017) and describing the complexity of GAN sample spaces (Khrulkov & Oseledets, 2018). More precisely, we rephrase deep networks with fully-connected layers into the language of algebraic topology and develop a measure for assessing the structural complexity of i) individual layers, and ii) the entire network. In this work, we present the following contributions:

- We introduce neural persistence, a novel measure for characterizing the structural complexity of neural networks that can be efficiently computed.
- We prove its theoretical properties, such as upper and lower bounds, thereby arriving at a normalization for comparing neural networks of varying sizes.
- We demonstrate the practical utility of neural persistence in two scenarios: i) it correctly captures the benefits of dropout and batch normalization during the training process, and ii) it can be easily used as a competitive early stopping criterion that does not require validation data.

2 BACKGROUND: TOPOLOGICAL DATA ANALYSIS
Topological data analysis (TDA) recently emerged as a field that provides computational tools for analysing complex data within a rigorous mathematical framework that is based on algebraic topology. This paper uses persistent homology, a theory that was developed to understand high-dimensional manifolds (Edelsbrunner et al., 2002; Edelsbrunner & Harer, 2010), and has since been successfully employed in characterizing graphs (Sizemore et al., 2017; Rieck et al., 2018), finding relevant features in unstructured data (Lum et al., 2013), and analysing image manifolds (Carlsson et al., 2008). This section gives a brief summary of the key concepts; please refer to Edelsbrunner & Harer (2010) for an extensive introduction.
Simplicial homology
The central object in algebraic topology is a simplicial complex K, i.e. a high-dimensional generalization of a graph, which is typically used to describe complex objects such as manifolds. Various notions to describe the connectivity of K exist, one of them being simplicial homology. Briefly put, simplicial homology uses matrix reduction algorithms (Munkres, 1996) to derive a set of groups, the homology groups, for a given simplicial complex K. Homology groups describe topological features (colloquially also referred to as holes) of a certain dimension d, such as connected components (d = 0), tunnels (d = 1), and voids (d = 2). The information from the d-th homology group is summarized in a simple complexity measure, the d-th Betti number β_d, which merely counts the number of d-dimensional features: a circle, for example, has Betti numbers (1, 1), i.e. one connected component and one tunnel, while a filled circle has Betti numbers (1, 0), i.e. one connected component but no tunnel. In the context of analysing simple feedforward neural networks for two classes, Bianchini & Scarselli (2014) calculated bounds of Betti numbers of the decision region belonging to the positive class, and were thus able to show the implications of different activation functions. These ideas were extended by Guss & Salakhutdinov (2018) to obtain a measure of the topological complexity of decision boundaries.
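For graphs, i.e. simplicial complexes containing only vertices and edges, both Betti numbers can be computed directly; the following minimal Python sketch (an illustration added here, not part of the original text) obtains β_0 via union-find and β_1 via the cycle rank |E| − |V| + β_0:

    # Betti numbers of a graph (a simplicial complex of dimension 1): beta_0
    # counts connected components, and for graphs beta_1 equals the cycle rank
    # |E| - |V| + beta_0. Voids (d = 2) would need higher-dimensional simplices.
    def betti_numbers(vertices, edges):
        parent = {v: v for v in vertices}

        def find(v):                          # union-find with path compression
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v

        for u, v in edges:
            parent[find(u)] = find(v)         # merge the two components

        beta_0 = len({find(v) for v in vertices})
        beta_1 = len(edges) - len(vertices) + beta_0
        return beta_0, beta_1

    # A 4-cycle behaves like a circle: one connected component, one tunnel.
    print(betti_numbers(range(4), [(0, 1), (1, 2), (2, 3), (3, 0)]))  # (1, 1)
    # A path has one connected component and no tunnel, like a filled circle.
    print(betti_numbers(range(4), [(0, 1), (1, 2), (2, 3)]))          # (1, 0)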
Persistent homology

For the analysis of real-world data sets, however, Betti numbers turn out to be of limited use because their representation is too coarse and unstable. This prompted the development of persistent homology. Given a simplicial complex K with an additional set of weights a_0 ≤ a_1 ≤ ⋯ ≤ a_{m−1} ≤ a_m, which are commonly thought to represent the idea of a scale, it is possible to put K in a filtration, i.e. a nested sequence of simplicial complexes ∅ = K_0 ⊆ K_1 ⊆ ⋯ ⊆ K_{m−1} ⊆ K_m = K. This filtration is thought to represent the 'growth' of K as the scale is being changed. During this growth process, topological features can be created (new vertices may be added, for example, which creates a new connected component) or destroyed (two connected components may merge into one). Persistent homology tracks these changes and represents the creation and destruction of a feature as a point (a_i, a_j) ∈ ℝ² for indices i ≤ j with respect to the filtration. The collection of all points corresponding to d-dimensional topological features is called the d-th persistence diagram D_d. It can be seen as a collection of Betti numbers at multiple scales. Given a point (x, y) ∈ D_d, the quantity pers(x, y) := |y − x| is referred to as its persistence. Typically, high persistence is considered to correspond to features, while low persistence is considered to indicate noise (Edelsbrunner et al., 2002).
3 A NOVEL MEASURE FOR NEURAL NETWORK COMPLEXITY

This section details neural persistence, our novel measure for assessing the structural complexity of neural networks. By exploiting both network structure and weight information through persistent homology, our measure captures network expressiveness and goes beyond mere connectivity properties. Subsequently, we describe its calculation, provide theorems for theoretical and empirical bounds, and show the existence of neural network complexity regimes. To summarize this section, Figure 1 illustrates how our method treats a neural network.

[Figure 1; filtration at thresholds from w′ = 1 down to w′ = 0 for two layers; persistence diagram axes w′_c (creation) and w′_d (destruction).]

Figure 1: Illustrating the neural persistence calculation of a network with two layers (l_0 and l_1). Colours indicate connected components per layer. The filtration process is depicted by colouring connected components that are created or merged when the respective weights are greater than or equal to the threshold w′_i. As w′_i decreases, network connectivity increases. Creation and destruction thresholds are collected in one persistence diagram per layer (right), and summarized according to Equation 1 for calculating neural persistence.

3.1 NEURAL PERSISTENCE
Given a feedforward neural network with an arrangement of neurons and their connections E, let W refer to the set of weights. Since W is typically changing during training, we require a function ϕ: E → W that maps a specific edge to a weight. Fixing an activation function, the connections form a stratified graph.

Definition 1 (Stratified graph and layers). A stratified graph is a multipartite graph G = (V, E) satisfying V = V_0 ⊔ V_1 ⊔ …, such that if u ∈ V_i, v ∈ V_j, and (u, v) ∈ E, we have j = i + 1. Hence, edges are only permitted between adjacent vertex sets. Given k ∈ ℕ, the k-th layer of a stratified graph is the unique subgraph G_k := (V_k ⊔ V_{k+1}, E_k := E ∩ (V_k × V_{k+1})).

This enables calculating the persistent homology of G and each G_k, using the filtration induced by sorting all weights, which is common practice in topology-based network analysis (Carstens & Horadam, 2013; Horak et al., 2009), where weights often represent closeness or node similarity. However, our context requires a novel filtration because the weights arise from an incremental fitting procedure, namely the training, which could theoretically lead to unbounded values. When analysing geometrical data with persistent homology, one typically selects a filtration based on the (Euclidean) distance between data points (Bubenik, 2015). The filtration then connects points that are increasingly distant from each other, starting from points that are direct neighbours. Our network filtration aims to mimic this behaviour in the context of fully-connected neural networks. Our framework does not explicitly take activation functions into account; however, activation functions influence the evolution of weights during training.
Filtration

Given the set of weights W for one training step, let w_max := max_{w∈W} |w|. Furthermore, let W′ := { |w|/w_max : w ∈ W } be the set of transformed weights, indexed in non-ascending order, such that w′_0 ≥ w′_1 ≥ ⋯ ≥ 0. This permits us to define a filtration for the k-th layer G_k as G_k^(0) ⊆ G_k^(1) ⊆ …, where G_k^(i) := (V_k ⊔ V_{k+1}, { (u, v) | (u, v) ∈ E_k ∧ ϕ′(u, v) ≥ w′_i }) and ϕ′(u, v) ∈ W′ denotes the transformed weight of an edge. We tailored this filtration towards the analysis of neural networks, for which large (absolute) weights indicate that certain neurons exert a larger influence over the final activation of a layer. The strength of a connection is thus preserved by the filtration, and weaker weights with |w| ≈ 0 remain close to 0. Moreover, since w′ ∈ [0, 1] holds for the transformed weights, this filtration makes the network invariant to scaling, which simplifies the comparison of different networks.
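As a minimal sketch of this transformation, assuming the network is given as a list of per-layer weight matrices (the function name is our own):

    import numpy as np

    def filtration_thresholds(weight_matrices):
        """Transform all weights of a network as in Section 3.1 and return the
        thresholds w'_0 >= w'_1 >= ... that define the filtration G_k^(i)."""
        w_max = max(np.abs(W).max() for W in weight_matrices)
        transformed = [np.abs(W) / w_max for W in weight_matrices]  # in [0, 1]
        thresholds = np.sort(
            np.concatenate([W.ravel() for W in transformed]))[::-1]
        return transformed, thresholds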
Persistence diagrams

Having set up the filtration, we can calculate persistent homology for every layer G_k. As the filtration contains at most 1-simplices (edges), we capture zero-dimensional topological information, i.e. how connected components are created and merged during the filtration. This information is structurally equivalent to calculating a maximum spanning tree using the weights, or performing hierarchical clustering with a specific setup (Carlsson & Mémoli, 2010). While it would theoretically be possible to include higher-dimensional information about each layer G_k, for example in the form of cliques (Rieck et al., 2018), we focus on zero-dimensional information in this paper because of the following advantages: i) the resulting values are easily interpretable as they essentially describe the clustering of the network at multiple weight thresholds, ii) previous research (Rieck & Leitte, 2016; Hofer et al., 2017) indicates that zero-dimensional topological information already captures a large amount of information, and iii) persistent homology calculations are highly efficient in this regime (see below). We thus calculate zero-dimensional persistent homology with this filtration. The resulting persistence diagrams have a special structure: since our filtration solely sorts edges, all vertices are present at the beginning of the filtration, i.e. they are already part of G_k^(0) for each k. As a consequence, they are assigned a weight of 1, resulting in |V_k ⊔ V_{k+1}| connected components. Hence, entries in the corresponding persistence diagram D_k are of the form (1, x), with x ∈ W′, and will be situated below the diagonal, similar to superlevel set filtrations (Bubenik, 2015; Cohen-Steiner et al., 2009). Using the p-norm of a persistence diagram, as introduced by Cohen-Steiner et al. (2010), we obtain the following definition for neural persistence.

Definition 2 (Neural persistence). The neural persistence of the k-th layer G_k, denoted by NP(G_k), is the p-norm of the persistence diagram D_k resulting from our previously-introduced filtration, i.e.

    NP(G_k) := ‖D_k‖_p := ( Σ_{(c,d) ∈ D_k} pers(c, d)^p )^{1/p},    (1)

which (for p = 2) captures the Euclidean distance of points in D_k to the diagonal.

The p-norm is known to be a stable summary (Cohen-Steiner et al., 2010) of topological features in a persistence diagram. For neural persistence to be a meaningful measure of structural complexity, it should increase as a neural network is learning. We evaluate this and other properties in Section 4. Algorithm 1 provides pseudocode for the calculation process:

Algorithm 1  Neural persistence calculation
Require: Neural network with l layers and weights W
  w_max ← max_{w∈W} |w|                       ▷ Determine largest absolute weight
  W′ ← { |w|/w_max : w ∈ W }                  ▷ Transform weights for filtration
  for k ∈ {0, …, l − 1} do
      F_k ← G_k^(0) ⊆ G_k^(1) ⊆ …             ▷ Establish filtration of k-th layer
      D_k ← PERSISTENTHOMOLOGY(F_k)           ▷ Calculate persistence diagram
  end for
  return {‖D_0‖_p, …, ‖D_{l−1}‖_p}            ▷ Calculate neural persistence for each layer

It is highly efficient: the filtration (line 4) amounts to sorting all n weights of a network, which has a computational complexity of O(n log n). Calculating persistent homology of this filtration (line 5) can be realized using an algorithm based on union–find data structures (Edelsbrunner et al., 2002). This has a computational complexity of O(n · α(n)), where α(·) refers to the extremely slow-growing inverse of the Ackermann function (Cormen et al., 2009, Chapter 22). We make our implementation and experiments available under https://github.com/BorgwardtLab/Neural-Persistence.
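The following Python sketch mirrors Algorithm 1 for a single layer via union-find. It assumes the layer's weight matrix has already been divided by the network's largest absolute weight (see the transformation above); pairing components that survive the filtration with the smallest weight is an assumption of this sketch, and the repository linked above contains the authoritative implementation.

    def layer_neural_persistence(W, p=2.0):
        """Neural persistence NP(G_k) of one layer, given its transformed
        weight matrix W of shape (n_in, n_out) with entries in [-1, 1]."""
        n_in, n_out = W.shape
        parent = list(range(n_in + n_out))      # all vertices of V_k ⊔ V_{k+1}

        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]   # path compression
                v = parent[v]
            return v

        # Edge (i, j) connects input neuron i with output neuron n_in + j;
        # the filtration processes edges by non-ascending transformed weight.
        edges = sorted(((abs(W[i, j]), i, n_in + j)
                        for i in range(n_in) for j in range(n_out)),
                       reverse=True)

        deaths = []                             # diagram entries are (1, w)
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru != rv:                        # one component dies at value w
                parent[ru] = rv
                deaths.append(w)

        # Assumption of this sketch: components surviving the filtration are
        # paired with its end, i.e. the smallest transformed weight.
        n_components = len({find(v) for v in range(n_in + n_out)})
        deaths.extend([edges[-1][0]] * n_components)

        deaths = np.asarray(deaths)
        return float((((1.0 - deaths) ** p).sum()) ** (1.0 / p))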
3.2 PROPERTIES OF NEURAL PERSISTENCE

We elucidate properties of neural persistence to permit the comparison of networks with different architectures. As a first step, we derive bounds for the neural persistence of a single layer G_k.
Theorem 1. Let G_k be a layer of a neural network according to Definition 1. Furthermore, let ϕ_k: E_k → W′ denote the function that assigns each edge of G_k a transformed weight. Using the filtration from Section 3.1 to calculate persistent homology, the neural persistence NP(G_k) of the k-th layer satisfies

    0 ≤ NP(G_k) ≤ ( max_{e∈E_k} ϕ_k(e) − min_{e∈E_k} ϕ_k(e) ) · ( |V_k ⊔ V_{k+1}| − 1 )^{1/p},    (2)

where |V_k ⊔ V_{k+1}| denotes the cardinality of the vertex set, i.e. the number of neurons in the layer.

Proof. We prove this constructively and show that the bounds can be realized. For the lower bound, let G_k^− be a fully-connected layer with n := |V_k ⊔ V_{k+1}| vertices and, given θ ∈ [0, 1], let ϕ_k(e) := θ for every edge e. Since a vertex v is created before its incident edges, the filtration degenerates to a lexicographical ordering of vertices and edges, and all points in D_k will be of the form (θ, θ). Thus, NP(G_k^−) = 0. For the upper bound, let G_k^+ again be a fully-connected layer with n ≥ 2 vertices and let a, b ∈ [0, 1] with a < b. Select one edge e′ at random and define a weight function as ϕ(e′) := b and ϕ(e) := a otherwise. In the filtration, the addition of the first edge will create a pair of the form (b, b), while all other pairs will be of the form (b, a). Consequently, we have

    NP(G_k^+) = ( pers(b, b)^p + (n − 1) · pers(b, a)^p )^{1/p} = (b − a) · (n − 1)^{1/p}    (3)
              = ( max_{e∈E_k} ϕ(e) − min_{e∈E_k} ϕ(e) ) · ( |V_k ⊔ V_{k+1}| − 1 )^{1/p},    (4)

so our upper bound can be realized. To show that this term cannot be exceeded by NP(G) for any G, suppose we perturb the weight function, i.e. ϕ̃(e) := ϕ(e) + ε ∈ [0, 1]. This cannot increase NP, however, because each difference b − a in Equation 3 is maximized by max ϕ(e) − min ϕ(e).

We can use the upper bound of Theorem 1 to normalize the neural persistence of a layer, making it possible to compare layers (and neural networks) that feature different architectures, i.e. a different number of neurons.

Definition 3 (Normalized neural persistence). For a layer G_k following Definition 1, using the upper bound of Theorem 1, the normalized neural persistence ÑP(G_k) is defined as the neural persistence of G_k divided by its upper bound, i.e. ÑP(G_k) := NP(G_k) · NP(G_k^+)^{−1}.

The normalized neural persistence of a layer permits us to extend the definition to an entire network. While this is more complex than using a single filtration for a neural network, it permits us to side-step the problem of different layers having different scales.
Definition 4 (Mean normalized neural persistence). Considering a network as a stratified graph G according to Definition 1, we average the normalized neural persistence values over all layers to obtain the mean normalized neural persistence, i.e. NP(G) := (1/l) · Σ_{k=0}^{l−1} ÑP(G_k).

While Theorem 1 gives a lower and upper bound in a general setting, it is possible to obtain empirical bounds when we consider the tuples that result from the computation of a persistence diagram. Recall that our filtration ensures that the persistence diagram of a layer contains tuples of the form (1, w_i), with w_i ∈ [0, 1] being a transformed weight. Exploiting this structure permits us to obtain bounds that could be used prior to calculating the actual neural persistence value in order to make the implementation more efficient.
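Definitions 3 and 4 translate directly into code; continuing the sketch above, the upper bound follows Theorem 1, and the function names are our own:

    def normalized_neural_persistence(W, p=2.0):
        """Definition 3: NP(G_k) divided by its Theorem 1 upper bound (assumes
        the transformed weights of the layer are not all identical)."""
        n_vertices = W.shape[0] + W.shape[1]       # |V_k ⊔ V_{k+1}|
        w = np.abs(W)
        upper_bound = (w.max() - w.min()) * (n_vertices - 1) ** (1.0 / p)
        return layer_neural_persistence(W, p) / upper_bound

    def mean_normalized_neural_persistence(Ws, p=2.0):
        """Definition 4: average of the normalized values over all l layers."""
        return sum(normalized_neural_persistence(W, p) for W in Ws) / len(Ws)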
Theorem 2. Let G_k be a layer of a neural network as in Theorem 1 with n vertices and m edges whose transformed edge weights are sorted in non-descending order, i.e. w_0 ≤ w_1 ≤ ⋯ ≤ w_{m−1}. Then NP(G_k) can be empirically bounded by

    ‖1 − w_max‖_p ≤ NP(G_k) ≤ ‖1 − w_min‖_p,    (5)

where w_max = (w_{m−1}, w_{m−2}, …, w_{m−n})ᵀ and w_min = (w_0, w_1, …, w_{n−1})ᵀ are the vectors containing the n largest and n smallest weights, respectively.

Proof. See Section A.2 in the appendix.
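A sketch of these bounds, which only requires sorting the transformed weights:

    def empirical_np_bounds(W, p=2.0):
        """Theorem 2: the n largest transformed weights bound NP(G_k) from
        below, the n smallest from above, without running union-find."""
        n = W.shape[0] + W.shape[1]            # number of vertices
        w = np.sort(np.abs(W).ravel())         # w_0 <= w_1 <= ... <= w_{m-1}
        lower = np.linalg.norm(1.0 - w[-n:], ord=p)
        upper = np.linalg.norm(1.0 - w[:n], ord=p)
        return float(lower), float(upper)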
Complexity regimes in neural persistence
As an application of the two theorems, we briefly take a look at how neural persistence changes for different classes of simple neural networks. To this end, we train a perceptron on the 'MNIST' data set. Since our measure uses the weight matrix of a perceptron, we can compare its neural persistence with the neural persistence of random weight matrices, drawn from different distributions. Moreover, we can compare trained networks with respect to their initial parameters. Figure 2 depicts the neural persistence values as well as the lower bounds according to Theorem 2 for different settings. We can see that a network in which the optimizer diverges (due to improperly selected parameters) is similar to a random Gaussian matrix. Trained networks, on the other hand, are clearly distinguished from all other networks. Uniform matrices have a significantly lower neural persistence than Gaussian ones. This is in line with the intuition that the latter type of networks induces functional sparsity because few neurons have large absolute weights. For clarity, we refrain from showing the empirical upper bounds because most weight distributions are highly right-tailed; the bound will not be as tight as the lower bound. These results are in line with a previous analysis (Sizemore et al., 2017) of small weighted networks, in which persistent homology is seen to outperform traditional graph-theoretical complexity measures such as the clustering coefficient (see also Section A.1 in the appendix). For deeper networks, additional experiments discuss the relation between validation accuracy and neural persistence (Section A.5), the impact of different data distributions, as well as the variability of neural persistence for architectures of varying depth (Section A.6).

[Figure 2; x-axis: neural persistence, y-axis: jitter.]

Figure 2: Neural persistence values of trained perceptrons (green), diverging ones (yellow), random Gaussian matrices (red), and random uniform matrices (black). We performed 100 runs per category; dots indicate neural persistence while crosses indicate the predicted lower bound according to Theorem 2. The bounds according to Theorem 1 are shown as dashed lines.
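In the spirit of Figure 2, though not reproducing the paper's exact setup, random matrices of perceptron size for MNIST can be contrasted directly; sizes and seeds are illustrative assumptions:

    # Compare NP of random Gaussian versus random uniform weight matrices of
    # perceptron size (784 inputs, 10 outputs) and check the Theorem 2 bound.
    rng = np.random.default_rng(0)
    for name, W in [("gaussian", rng.normal(size=(784, 10))),
                    ("uniform", rng.uniform(size=(784, 10)))]:
        W = W / np.abs(W).max()                # transform as in Section 3.1
        lower, _ = empirical_np_bounds(W)
        print(name, layer_neural_persistence(W), lower)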
4 EXPERIMENTS
This section demonstrates the utility and relevance of neural persistence for fully connected deep neural networks. We examine how commonly used regularization techniques (batch normalization and dropout) affect the neural persistence of trained networks. Furthermore, we develop an early stopping criterion based on neural persistence and compare it to the traditional criterion based on validation loss. We used different architectures with ReLU activation functions across experiments. The brackets denote the number of units per hidden layer. In addition, the Adam optimizer with hyperparameters tuned via cross-validation was used unless noted otherwise. Please refer to Table A.1 in the appendix for further details about the experiments.

4.1 DEEP LEARNING BEST PRACTICES IN LIGHT OF NEURAL PERSISTENCE
We compare the mean normalized neural persistence (see Definition 4) of a two-layer neural network (with an architecture of [650, 650]) to two models where batch normalization (Ioffe & Szegedy, 2015) or dropout (Srivastava et al., 2014) are applied. Figure 3 shows that the networks designed according to best practices yield higher normalized neural persistence values on the 'MNIST' data set in comparison to an unmodified network. The effect of dropout on the mean normalized neural persistence is more pronounced, and this trend is directly analogous to the observed accuracy on the test set. These results are consistent with expectations if we consider dropout to be similar to ensemble learning (Hara et al., 2016). As individual parts of the network are trained independently, a higher degree of per-layer redundancy is expected, resulting in a different structural complexity. Overall, these results indicate that, for a fixed architecture, approaches targeted at increasing the neural persistence during the training process may be of particular interest.

[Figure 3; x-axis: mean normalized neural persistence, ranging from about 0.48 to 0.62.]

Figure 3: Comparison of mean normalized neural persistence for trained networks without modifications (green), with batch normalization (yellow), and with 50% of the neurons dropped out during training (red) for the 'MNIST' data set (50 runs per setting).
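For concreteness, a hedged Keras sketch of the three variants compared in Figure 3; the [650, 650] architecture and the 50% dropout rate follow the text above, whereas the remaining hyperparameters are illustrative assumptions rather than the paper's exact configuration:

    from tensorflow import keras
    from tensorflow.keras import layers

    def make_model(variant="plain"):
        """Two hidden layers of 650 ReLU units; optionally insert batch
        normalization or 50% dropout after each hidden layer."""
        model = keras.Sequential([keras.Input(shape=(784,))])
        for _ in range(2):
            model.add(layers.Dense(650, activation="relu"))
            if variant == "batchnorm":
                model.add(layers.BatchNormalization())
            elif variant == "dropout":
                model.add(layers.Dropout(0.5))
        model.add(layers.Dense(10, activation="softmax"))
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model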
4.2 EARLY STOPPING BASED ON NEURAL PERSISTENCE
Neural persistence can be used as an early stopping criterion that does not require a validation data set to prevent overfitting: if the mean normalized neural persistence does not increase by more than ∆_min during a certain number of epochs g, the training process is stopped. This procedure is called 'patience', and Algorithm 2 describes it in detail. A similar variant of this algorithm, using validation loss instead of persistence, is the state of the art for early stopping in training (Bengio, 2012; Chollet et al., 2015). To evaluate the efficacy of our measure, we compare it against validation loss in an extensive set of scenarios. More precisely, for a training process with at most G epochs, we define a G × G parameter grid consisting of the 'patience' parameter g and a burn-in rate b (both measured in epochs). b defines the number of epochs after which an early stopping criterion starts monitoring, thereby preventing underfitting. Subsequently, we set ∆_min = 0 for all measures to remain comparable and scale-invariant, as non-zero values could implicitly favour one of them due to scaling. For each data set, we perform 100 training runs of the same architecture, monitoring validation loss and mean normalized neural persistence every quarter epoch. The early stopping behaviour of both measures is simulated for each combination of b and g, and their performance over all runs is summarized in terms of median test accuracy and median stopping epoch; if a criterion is not triggered for one run, we report the test accuracy at the end of the training and the number of training epochs. This results in a scatterplot, where each point (corresponding to a single parameter combination) shows the difference in epochs and the absolute difference in test accuracy (measured in percent). The quadrants permit an intuitive explanation: Q₂, for example, contains all configurations for which our measure stops earlier, while achieving a higher accuracy. Since b and g are typically chosen to be small in an early stopping scenario, we use grey points to indicate uncommon configurations for which b or g is larger than half of the total number of epochs. Furthermore, to summarize the performance of our measure, we calculate the barycentre of all configurations (green square).

Figure 4a depicts the comparison with validation loss for the 'Fashion-MNIST' (Xiao et al., 2017) data set; please refer to Section A.3 in the appendix for more data sets. Here, we observe that most common configurations are in Q₂ or in Q₃, i.e. our criterion stops earlier. The barycentre lies at approximately (−0.5, −0.08), showing that out of 625 configurations, on average we stop half an epoch earlier than validation loss, while losing virtually no accuracy (0.08%). Figure 4c depicts detailed differences in accuracy and epoch for our measure when compared to validation loss; each cell in a heatmap corresponds to a single parameter configuration of b and g. In the heatmap of accuracy differences, blue, white, and red represent parameter combinations for which we obtain higher, equal, or lower accuracy, respectively, than with validation loss for the same parameters. Similarly, in the heatmap of epoch differences, green represents parameter combinations for which we stop earlier than validation loss. For small burn-in rates b, we stop earlier on average, while losing only a marginal amount of accuracy.
Algorithm 2  Early stopping based on mean normalized neural persistence
Require: Weighted neural network N, patience g, ∆_min
  P ← 0, G ← 0                       ▷ Initialize highest observed value and patience counter
  procedure EARLYSTOPPING(N, g, ∆_min)  ▷ Callback that monitors training at every epoch
      P′ ← NP(N)
      if P′ > P + ∆_min then         ▷ Update mean normalized neural persistence and reset counter
          P ← P′, G ← 0
      else                           ▷ Update patience counter
          G ← G + 1
      end if
      if G ≥ g then                  ▷ Patience criterion has been triggered
          return P                   ▷ Stop training and return highest observed value
      end if
  end procedure

[Figure 4; panel (a): scatterplot for 'Fashion-MNIST' of epoch difference (x-axis) versus accuracy difference (y-axis) with quadrants Q₁–Q₄; panel (b): summary table, reproduced below; panel (c): heatmaps 'Accuracy difference' and 'Epoch difference' over the g–b parameter grid; panel (d): heatmaps 'NP' and 'Validation loss' showing the number of triggers.]

    Data set        Barycentre        Final test accuracy
    Fashion-MNIST   (−0.…, −0.08)     86.… ± …
    MNIST           (+0.…, −0.06)     96.… ± …
    CIFAR-10        (−0.…, −0.13)     52.… ± …
    IMDB            (−0.…, +0.07)     87.… ± …
Figure 4: The visualizations depict the differences in accuracy and epoch for all comparison scenarios of mean normalized neural persistence versus validation loss, while the table summarizes the results on other data sets. Final test accuracies are shown irrespectively of early stopping to put the accuracy differences into context.

Finally, Figure 4d shows how often each measure is triggered. Ideally, each measure should consist of a dark green triangle, as this would indicate that each configuration stops all the time. For this data set, we observe that our method stops for more parameter combinations than validation loss, but not as frequently for all of them. To ensure comparability across scenarios, we did not use the validation data as additional training data when stopping with neural persistence; we refer to Section A.7 for additional experiments in data scarcity scenarios. We observe that our method stops earlier when overfitting can occur, and it stops later when longer training is beneficial.
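A minimal sketch of Algorithm 2 as a Keras callback, reusing mean_normalized_neural_persistence from the Section 3 sketches; a burn-in rate b could be realized by additionally ignoring the first b epochs:

    class NeuralPersistenceEarlyStopping(keras.callbacks.Callback):
        """Stop when the mean normalized neural persistence has not improved
        by at least min_delta (∆_min) for `patience` (g) consecutive epochs."""
        def __init__(self, patience=3, min_delta=0.0):
            super().__init__()
            self.patience, self.min_delta = patience, min_delta
            self.best, self.wait = -float("inf"), 0

        def on_epoch_end(self, epoch, logs=None):
            kernels = [layer.get_weights()[0] for layer in self.model.layers
                       if isinstance(layer, layers.Dense)]
            w_max = max(abs(K).max() for K in kernels)  # global weight transform
            current = mean_normalized_neural_persistence(
                [K / w_max for K in kernels])
            if current > self.best + self.min_delta:
                self.best, self.wait = current, 0       # improvement: reset G
            else:
                self.wait += 1                          # patience counter G
                if self.wait >= self.patience:          # criterion triggered
                    self.model.stop_training = True

    # Usage: model.fit(x, y, epochs=25,
    #                  callbacks=[NeuralPersistenceEarlyStopping(patience=3)])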
5 DISCUSSION
In this work, we presented neural persistence, a novel topological measure of the structural complexity of deep neural networks. We showed that this measure captures topological information that pertains to deep learning performance. Being rooted in a rich body of research, our measure is theoretically well-defined and, in contrast to previous work, generally applicable as well as computationally efficient. We showed that our measure correctly identifies networks that employ best practices such as dropout and batch normalization. Moreover, we developed an early stopping criterion that exhibits competitive performance while not relying on a separate validation data set. Thus, by saving valuable data for training, we managed to boost accuracy, which can be crucial for enabling deep learning in regimes of smaller sample sizes. Following Theorem 2, we also experimented with using the p-norm of all weights of the neural network as a proxy for neural persistence. However, this did not yield an early stopping measure because it was never triggered, thereby suggesting that neural persistence captures salient information that would otherwise be hidden among all the weights of a network. We extended our framework to convolutional neural networks (see Section A.4) by deriving a closed-form approximation, and observed that an early stopping criterion based on neural persistence for convolutional layers will require additional work. Furthermore, we conjecture that assessing dissimilarities of networks by means of persistence diagrams (making use of higher-dimensional topological features), for example, will lead to further insights regarding their generalization and learning abilities. Another interesting avenue for future research would concern the analysis of the 'function space' learned by a neural network. On a more general level, neural persistence demonstrates the great potential of topological data analysis in machine learning.

REFERENCES
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems: Simple, end-to-end, LeNet-5-like convolutional MNIST model example, 2015. URL https://github.com/tensorflow/models/blob/master/tutorials/image/mnist/convolutional.py.

Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 18:1–34, 2018.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), 2015.

Yoshua Bengio. Practical recommendations for gradient-based training of deep architectures. In Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller (eds.), Neural Networks: Tricks of the Trade, volume 7700 of Lecture Notes in Computer Science, pp. 437–478. Springer, Heidelberg, Germany, 2012.

Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553–1565, 2014.

Peter Bubenik. Statistical topological data analysis using persistence landscapes. Journal of Machine Learning Research, 16:77–102, 2015.

Gunnar Carlsson and Facundo Mémoli. Characterization, stability and convergence of hierarchical clustering methods. Journal of Machine Learning Research, 11:1425–1470, 2010.

Gunnar Carlsson, Tigran Ishkhanov, Vin de Silva, and Afra Zomorodian. On the local behavior of spaces of natural images. International Journal of Computer Vision, 76(1):1–12, 2008.

Corrie J. Carstens and Kathy J. Horadam. Persistent homology of collaboration networks. Mathematical Problems in Engineering, 2013:815035, 2013.

Travers Ching, Daniel S. Himmelstein, Brett K. Beaulieu-Jones, Alexandr A. Kalinin, Brian T. Do, Gregory P. Way, Enrico Ferrero, Paul-Michael Agapow, Michael Zietz, Michael M. Hoffman, Wei Xie, Gail L. Rosen, Benjamin J. Lengerich, Johnny Israeli, Jack Lanchantin, Stephen Woloszynek, Anne E. Carpenter, Avanti Shrikumar, Jinbo Xu, Evan M. Cofer, Christopher A. Lavender, Srinivas C. Turaga, Amr M. Alexandari, Zhiyong Lu, David J. Harris, Dave DeCaprio, Yanjun Qi, Anshul Kundaje, Yifan Peng, Laura K. Wiley, Marwin H. S. Segler, Simina M. Boca, S. Joshua Swamidass, Austin Huang, Anthony Gitter, and Casey S. Greene. Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface, 15(141):20170387, 2018.

François Chollet et al. Keras. https://keras.io, 2015.

David Cohen-Steiner, Herbert Edelsbrunner, and John Harer. Extending persistence using Poincaré and Lefschetz duality. Foundations of Computational Mathematics, 9(1):79–103, 2009.

David Cohen-Steiner, Herbert Edelsbrunner, John Harer, and Yuriy Mileyko. Lipschitz functions have L_p-stable persistence. Foundations of Computational Mathematics, 10(2):127–139, 2010.

Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, Cambridge, MA, USA, 3rd edition, 2009.

Herbert Edelsbrunner and John Harer. Computational Topology: An Introduction. American Mathematical Society, Providence, RI, USA, 2010.

Herbert Edelsbrunner, David Letscher, and Afra J. Zomorodian. Topological persistence and simplification. Discrete & Computational Geometry, 28(4):511–533, 2002.

William H. Guss and Ruslan Salakhutdinov. On characterizing the capacity of neural networks using algebraic topology. arXiv preprint arXiv:1802.04443, 2018.

Kazuyuki Hara, Daisuke Saitoh, and Hayaru Shouno. Analysis of dropout learning regarded as ensemble learning. In Alessandro E. P. Villa, Paolo Masulli, and Antonio Javier Pons Rivero (eds.), Artificial Neural Networks and Machine Learning (ICANN), number 9887 in Lecture Notes in Computer Science, pp. 72–79, Cham, Switzerland, 2016. Springer.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.

Christoph Hofer, Roland Kwitt, Marc Niethammer, and Andreas Uhl. Deep learning with topological signatures. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1633–1643, 2017.

Danijela Horak, Slobodan Maletić, and Milan Rajković. Persistent homology of complex networks. Journal of Statistical Mechanics: Theory and Experiment, 2009(03):P03034, 2009.

Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7132–7141, 2018.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 448–456. PMLR, 2015.

Valentin Khrulkov and Ivan Oseledets. Geometry score: A method for comparing generative adversarial networks. In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 2621–2629. PMLR, 2018.

Pek Y. Lum, Gurjeet Singh, Alan Lehman, Tigran Ishkanov, Mikael Vejdemo-Johansson, Muthu Alagappan, John Carlsson, and Gunnar Carlsson. Extracting insights from the shape of complex data using topology. Scientific Reports, 3:1–8, 2013.

Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 73:1–15, 2017.

James R. Munkres. Elements of Algebraic Topology. CRC Press, Boca Raton, FL, USA, 1996.

Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2847–2854. PMLR, 2017.

Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M. Dai, Nissan Hajaj, Michaela Hardt, Peter J. Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, et al. Scalable and accurate deep learning with electronic health records. npj Digital Medicine, 1(1):18, 2018.

Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, Matthew Lungren, and Andrew Y. Ng. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.

Bastian Rieck and Heike Leitte. Exploring and comparing clusterings of multivariate data sets using persistent homology. Computer Graphics Forum, 35(3):81–90, 2016.

Bastian Rieck, Ulderico Fugacci, Jonas Lukasczyk, and Heike Leitte. Clique community persistence: A topological visual analysis approach for complex networks. IEEE Transactions on Visualization and Computer Graphics, 24(1):822–831, 2018.

Andrew Michael Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan Daniel Tracey, and David Daniel Cox. On the information bottleneck theory of deep learning. In International Conference on Learning Representations (ICLR), 2018.

Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.

Ann Sizemore, Chad Giusti, and Danielle S. Bassett. Classification of weighted networks through mesoscale homological features. Journal of Complex Networks, 5(2):245–273, 2017.

Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. In Workshop Track of the International Conference on Learning Representations (ICLR), 2015.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 3104–3112, 2014.

Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In IEEE Information Theory Workshop (ITW), pp. 1–5, 2015.

Michael Tsang, Dehua Cheng, and Yan Liu. Detecting statistical interactions from neural network weights. In International Conference on Learning Representations (ICLR), 2018.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (eds.), European Conference on Computer Vision (ECCV), volume 8689 of Lecture Notes in Computer Science, pp. 818–833, Cham, Switzerland, 2014. Springer.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017.
[Figure A.1; top panel x-axis: clustering coefficient; bottom panel x-axis: neural persistence, ranging from 14 to 25.]

Figure A.1: Traditional graph measures (top), such as the clustering coefficient, fail to detect differences in the complexity of neural networks. Our novel neural persistence measure (bottom), by contrast, shows that trained networks with a well-chosen learning rate η (green), which reach high accuracy, obey a different distribution than networks trained with a learning rate that is too small for convergence (yellow), whose accuracies vary widely.

A APPENDIX

A.1 COMPARISON WITH GRAPH-THEORETICAL MEASURES
Traditional complexity and structural measures from graph theory, such as the clustering coefficient, the average shortest path length, and global/local efficiency, are already known to be insufficiently accurate to characterize different models of complex random networks (Sizemore et al., 2017). Our experiments indicate that this holds true for (deep) neural networks, too. As a brief example, we trained a perceptron on the MNIST data set with batch stochastic gradient descent, achieving a high test accuracy. Moreover, we intentionally 'sabotaged' the training by choosing a learning rate η several orders of magnitude smaller, such that SGD is unable to converge properly; this leads to networks whose accuracies vary over a wide range. A complexity measure should be capable of distinguishing both classes of networks. However, as Figure A.1 (top) shows, this is not the case for the clustering coefficient. Neural persistence (bottom), on the other hand, results in two regimes that can clearly be distinguished, with the trained networks having a significantly smaller variance.

A.2 PROOF OF THEOREM 2
Proof. We may consider the filtration from Section 3.1 to be a subset selection problem with constraints, where we select n out of m weights. The neural persistence NP(G_k) of a layer thus only depends on the selected weights that appear as tuples of the form (1, w_i) in D_k. Letting w̃ denote the vector of selected weights arising from the persistence diagram calculation, we can rewrite neural persistence as NP(G_k) = ‖1 − w̃‖_p. Furthermore, w̃ satisfies ‖w_min‖_p ≤ ‖w̃‖_p ≤ ‖w_max‖_p. Since all transformed weights are non-negative in our filtration, it follows that (note the reversal of the two terms)

    ‖1 − w_max‖_p ≤ NP(G_k) ≤ ‖1 − w_min‖_p,    (6)

and the claim follows.

A.3 ADDITIONAL VISUALIZATIONS AND ANALYSES FOR EARLY STOPPING
Due to space constraints and the large number of configurations that we investigated for our early stopping experiments, this section contains additional plots that follow the same schematic: the top row shows the differences in accuracy and epoch for our measure when compared to the commonly-used validation loss. Each cell in the heatmap corresponds to a single configuration of b and g. In the heatmap of accuracy differences, blue represents parameter combinations for which we obtain a higher accuracy than validation loss for the same parameters; white indicates combinations for which we obtain the same accuracy, while red highlights combinations in which our accuracy decreases. Similarly, in the heatmap of epoch differences, green represents parameter combinations for which we stop earlier than validation loss for the same parameters. The scatterplots in Section 4.2 show an 'unrolled' version of this heatmap, making it possible to count how many parameter combinations result in early stops while also increasing accuracy, for example. The heatmaps, by contrast, make it possible to compare the behaviour of the two measures with respect to each parameter combination. Finally, the bottom row of every plot shows how many times each measure was triggered for every parameter combination. We consider a measure to be triggered if its stopping condition is satisfied prior to the last training epoch. Due to the way the parameter grid is set up, no configuration above the diagonal can stop, because b + g would be larger than the total number of training epochs. This permits us to compare the 'slopes' of cells for each measure. Ideally, each measure should consist of a dark green triangle, as this would indicate that each parameter configuration stops all the time.
MNIST

Please refer to Figures A.2 and A.3. The colours in the difference matrix of the top row are slightly skewed because in a certain configuration, our measure loses a noticeable amount of accuracy when stopping. However, there are many other configurations in which virtually no accuracy is lost and in which we are able to stop more than four epochs earlier. The heatmaps in the bottom row again indicate that neural persistence is capable of stopping for more parameter combinations in general. We do not trigger as often for some of them, though.
CIFAR-10

Please refer to Figure A.4. In general, we observe that this data set is more sensitive with respect to the parameters for early stopping. While there are several configurations in which neural persistence stops with a clear increase in accuracy, there are also scenarios in which we cannot stop training earlier, or have to train longer. The second row of plots shows that our measure triggers reliably for more configurations than validation loss. Overall, the scatterplot of all scenarios (Figure A.5) shows that most practical configurations are again located in Q₂ and Q₃. While we may thus find certain configurations in which we reliably outperform validation loss as an early stopping criterion, we also want to point out that our measure behaves correctly for many practical configurations. Points in Q₁, where we train longer and achieve a higher accuracy, are characterized by a high patience g and a low burn-in rate b, or vice versa. This is caused by the training for CIFAR-10, which does not reliably converge for FCNs. Figure A.6 demonstrates this by showing loss curves and the mean normalized neural persistence curves of five runs over training (loss curves have been averaged over all runs; standard deviations are shown in grey; we show the first half of the training to highlight the behaviour for practical early stopping conditions). For 'Fashion-MNIST', we observe that NP exhibits clear change points during the training process, which can be exploited for early stopping. For 'CIFAR-10', we observe a rather incremental growth for some runs (with no clearly-defined maximum), making it harder to derive a generic early stopping criterion that does not depend on fine-tuned parameters. Hence, we hypothesize that neural persistence cannot be used reliably in scenarios where the architecture is incapable of learning the data set. In the future, we plan to experiment with deliberately selected 'bad' and 'good' architectures in order to evaluate to what extent our topological measure is capable of assessing their suitability for training, but this is beyond the scope of this paper.
IMDB

Please refer to Figure A.7. For this data set, we observe that most parameter configurations result in earlier stopping (up to two epochs earlier than validation loss), accompanied by small increases in accuracy. This is also shown in the scatterplot in Figure A.8. Only a single configuration, viz. g = 1 and b = 0, results in a severe loss of accuracy; we removed it from the scatterplot for reasons of clarity, as its large negative accuracy difference would skew the display of the remaining configurations too much (this is also why the legends do not include this outlier).

[Figure A.2; panels: (a) accuracy and epoch differences over the g–b grid, (b) number of triggers for NP and validation loss.]
Figure A.2: Additional visualizations for the 'MNIST' data set.

Figure A.3: Scatterplot of epoch and accuracy differences for 'MNIST'.

[Figure A.4; panels: (a) accuracy and epoch differences over the g–b grid, (b) number of triggers for NP and validation loss.]
Figure A.4: Additional visualizations for the 'CIFAR-10' data set.

Figure A.5: Scatterplot of epoch and accuracy differences for 'CIFAR-10'.

[Figure A.6; panels: (a) CIFAR-10, (b) Fashion-MNIST.]
Figure A.6: A comparison of mean normalized neural persistence curves that we obtain during the training of 'CIFAR-10' and 'Fashion-MNIST'.

[Figure A.7; panels: (a) accuracy and epoch differences over the g–b grid, (b) number of triggers for NP and validation loss.]
Figure A.7: Additional visualizations for the 'IMDB' data set.

Figure A.8: Scatterplot of epoch and accuracy differences for 'IMDB'.
A.4 NEURAL PERSISTENCE FOR CONVOLUTIONAL LAYERS
In principle, the proposed filtration process could be applied to any bipartite graph. Hence, we can directly apply our framework to convolutional layers, provided we represent them properly. Specifically, for layer l we represent the convolution of its i-th input feature map a_i^(l−1) ∈ ℝ^{h_in × w_in} with the j-th filter H_j ∈ ℝ^{p × q} as one bipartite graph G_{i,j}, parametrized by a sparse weight matrix W_{i,j}^(l) ∈ ℝ^{(h_out · w_out) × (h_in · w_in)}, which in each row contains the p · q unrolled values of H_j on the diagonal, with h_in − p zeros padded in between after each p values of vec(H_j). This way, the flattened pre-activation can be described as vec(z_{i,j}^(l)) = W_{i,j}^(l) · vec(a_i^(l−1)) + b_{i,j}^l · 1_{(h_out · w_out) × 1}. Since flattening does not change the topology of our bipartite graph, we compute the normalized neural persistence on this sparse weight matrix W_{i,j}^(l) as the unrolled analogue of the fully-connected network's weight matrix. Averaging over all filters then gives a per-layer measure, similar to the way we derived mean normalized neural persistence in the main paper.

When studying the unrolled adjacency matrix W_{i,j}^(l), it becomes clear that the edge filtration process can be approximated in closed form. Specifically, for m input and n output neurons we initialize τ = m + n connected components. When using zero padding, the additional dummy input neurons have to be included in m. For all τ tuples in the persistence diagram, the creation event is c = 1. Notably, each output neuron shares the same set of edge weights. Due to this, the destruction events (except for a few special cases) simplify to a list of length τ containing the largest filter values (each value is contained n times) in descending order until the list is filled. This simplification of the neural persistence of a convolution with one filter is shown as a closed expression in Equations 7–11, and our implementation is sketched in Algorithm 3. We thus obtain

    NP(G_{i,j}) = ‖1 − w̃‖_p,    (7)

where we use

    ‖w̃‖_p ≤ ‖( 1, w_c^T, w_{c̄,φ}^T, vec(A_φ)^T, vec(B_φ)^T )^T‖_p,    (8)

with

    φ = τ − dim(w_c) − 1,    (9)
    A_x = w_{⌊x/n⌋} ⊗ 1_{n−1},    (10)
    B_y = w_{⌊y/n⌋+1} ⊗ 1_{y mod n},    (11)

where 1_0 := 0. Following this notation, Equation 7 expresses the neural persistence of the bipartite graph G_{i,j}, with w̃ denoting the vector of selected weights (i.e. the destruction events) when calculating the persistence diagram. We use w to denote the flattened and sorted weight values (in descending order) of the convolutional filter H_j, while w_c represents the vector of all weights that are located in a corner of H_j, whereas w_{c̄,φ} is the vector of all weights which do not originate from the corner of the filter while still belonging to the first (and thus largest) ⌊φ/n⌋ weights in w, which we denote by w_{⌊φ/n⌋}.

Algorithm 3  Approximating neural persistence of convolutions per filter
Require: filter H ∈ ℝ^{p × q}; number of input and output neurons as m, n
  T ← ∅                              ▷ Initialize set of tuples for persistence diagram
  τ ← m + n, t ← 0, i ← 0            ▷ Initialize number of tuples, tuple counter, weight index
  h_max ← max_{h∈H} |h|              ▷ Determine largest absolute weight
  H′ ← { |h|/h_max : h ∈ H }         ▷ Transform weights for filtration
  s ← sort(vec(H′))                  ▷ Sort weights in descending order
  H′_c ← { h′_{0,0}, h′_{0,q−1}, h′_{p−1,0}, h′_{p−1,q−1} }  ▷ Determine the set of all corner weights of filter H′
  T ← (1, 1), t ← t + 1              ▷ Add tuple for surviving component
  for h′_c ∈ H′_c do                 ▷ Each corner of H′ merges components
      T ← (1, h′_c), t ← t + 1
  end for
  while true do                      ▷ Create the remaining tuples (approximation step)
      n′ = n − Ind(s[i] ∈ H′_c)      ▷ If the current weight is a corner weight, write one less tuple
      if t + n′ ≤ τ then             ▷ If there are at least n′ more tuples, set their merge value to s[i]
          repeat n′ times: T ← (1, s[i])  ▷ Approximative, as s[i] does not always add n′ merges due to loops
          t ← t + n′, i ← i + 1
      else                           ▷ Otherwise, process the remaining tuples similarly
          repeat (τ − t) times: T ← (1, s[i])
          break
      end if
  end while
  return ‖T‖_p                       ▷ Compute norm of approximated persistence diagram

For the subsequent experiments (see below), we use a simple CNN that employs 32 + 2048 filters. Hence, by using the shortcut described above, we do not have to unroll 2080 weight matrices explicitly, thereby gaining both in memory efficiency and run time, as compared to the naive approach: on average, a naive exact computation based on unrolling required 0.77 s per convolutional filter and evaluation step, whereas the approximation only took about 0.00038 s while showing very similar behaviour up to a constant offset.

For our experiments, we used an off-the-shelf 'LeNet-like' CNN model architecture (two convolutional layers, each with max pooling and ReLU, one fully-connected layer, and softmax) as described in Abadi et al. (2015). We trained the model on 'Fashion-MNIST' and included this setup in the early stopping experiments (100 runs of 20 epochs). In Figure A.9, we observe that stopping based on the neural persistence of a convolutional layer typically incurs a considerable loss of accuracy: stopping with this naive extension of our measure reduces accuracy by up to 4% relative to the final test accuracy. Furthermore, in contrast to early stopping on a fully-connected architecture, we do not observe any parameter combinations that stop early and increase accuracy. In fact, there is no configuration that results in an increased accuracy. This empirically confirms our theoretical scepticism towards naively applying our edge-focused filtration scheme to CNNs.
A.5 RELATIONSHIP BETWEEN NEURAL PERSISTENCE AND VALIDATION ACCURACY
Motivated by Figure 2, which shows the different 'regimes' of neural persistence for a perceptron network, we investigate a possible correlation of (high) neural persistence with (high) predictive accuracy. For deeper networks, we find that neural persistence measures structural properties that arise from different parameters (such as training procedures or initializations), and no correlation can be observed.

For our experiments, we constructed neural networks with a high neural persistence prior to training. More precisely, following the theorems in this paper, we initialized most weights of each layer with very low values and reserved high values for very few weights. This was achieved by sampling the weights from a beta distribution whose shape parameters α and β are both smaller than 1. Using this procedure, we are able to initialize [20,20,20] networks with a markedly higher NP than Xavier initialization yields for the same networks; we also report the mean validation accuracy of these untrained networks on the 'Fashion-MNIST' data set for both initializations. Figure A.10 depicts how both types of networks converge to similar regimes of validation accuracy, while the mean normalized neural persistence achieved at the end of training varies. For networks initialized with high NP (Figure A.10, left), networks within a narrow band of final NP values still span a wide range of validation accuracies. For Xavier initialization (Figure A.10, right), the same lack of correlation can be observed. Furthermore, comparing the two plots, there are no clear advantages in initializing networks with high NP. This observation further motivates the proposed early stopping criterion, which checks for changes in the NP value and considers stagnating values to be indicative of a trained network.
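As an illustration of this initialization scheme, the following sketch samples weights from a beta distribution. The shape parameters alpha = 0.1 and beta = 0.9 are hypothetical placeholders rather than the values used in the experiment; they are chosen only so that most weights land near 0 while a few land near 1:

import numpy as np

rng = np.random.default_rng(seed=0)

def high_np_init(fan_in, fan_out, alpha=0.1, beta=0.9):
    """Sample a weight matrix from Beta(alpha, beta).

    With alpha << 1 and alpha < beta < 1, most of the probability mass
    sits close to 0 while a small fraction of weights comes out close
    to 1, yielding a high neural persistence before any training."""
    return rng.beta(alpha, beta, size=(fan_in, fan_out))

# Hidden layers of a [20, 20, 20] network on 28 x 28 inputs:
weights = [high_np_init(784, 20), high_np_init(20, 20), high_np_init(20, 20)]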
(a) Accuracy and epoch differences. (b) Number of triggers.
Figure A.9: Additional visualizations for the 'Fashion-MNIST' data set, following the preliminary examination of convolutional layers. Here, the approximated neural persistence calculation for the first convolutional layer was used. However, we also ran a few runs of the same experiment using the exact method, which showed the same results. Employing the second convolutional layer, or both layers, did not improve this result.
(left) Networks initialized with high NP; (right) Xavier-initialized networks. Axes: validation accuracy against mean normalized neural persistence.
Figure A.10: Each cluster of points represents the last two training epochs (sampled every quarter epoch) of a [20,20,20] network trained on the 'Fashion-MNIST' data set. We observe no correlation between validation accuracy and normalized total persistence.

Figure A.11: (left) Histogram of the final normalized neural persistence of a [50, …, …] network for 100 runs and 25 epochs of training. (right) Normalized neural persistence after 15 epochs of training on MNIST for different architectures with increasing depth. Deeper architectures are denoted as [n × 20], where n is the number of hidden layers.

A.6 NEURAL PERSISTENCE FOR DIFFERENT DATA DISTRIBUTIONS AND DEEPER FCN ARCHITECTURES
Neural persistence captures information about different data distributions during training. The weights tuned via backpropagation are directly influenced by the input data (as well as their labels), and neural persistence tracks those changes. To demonstrate this, we trained the same architecture, i.e. [50, …, …], on two data sets with the same dimensions but different properties: MNIST and 'Fashion-MNIST'. Each data set has the same image size (28 × 28 pixels, one channel), but the data lie on different manifolds. Figure A.11 (left) shows a histogram of the mean normalized neural persistence (NP) after 25 epochs of training over 100 different runs. The distributions have a similar shape but are shifted, indicating that the two data sets lead the network to different topological regimes.

We also investigated the effect of depth on neural persistence. We selected a fixed layer size (20 hidden units) and increased the number of hidden layers. Figure A.11 (right) depicts the boxplots of mean NP for multiple architectures after 15 epochs of training on MNIST. Adding layers initially increases the variability of NP by enabling the network to converge to different regimes (essentially, there are many more valid configurations in which a trained neural network might end up). However, this effect is reduced after a certain depth: networks with deeper architectures exhibit less variability in NP.

A.7 EARLY STOPPING IN DATA SCARCITY SCENARIOS
Labelled data is expensive in most domains of interest, which results in small data sets or low label quality. We investigate the following experimental set-ups: (1) reducing the training data set size, and (2) permuting a fraction of the training labels. We train a fully-connected network ([500,500,200] architecture) on 'MNIST' and 'Fashion-MNIST'. In the experiments, we compare the following measures for stopping the training: i) stopping at the optimal test accuracy; ii) fixed stopping after the burn-in period; iii) the neural persistence patience criterion; iv) the training loss patience criterion; v) the validation loss patience criterion. For a description of the patience criterion, see Algorithm 2; a minimal sketch is also given at the end of this section. All measures, except validation loss, include the validation data in the training process to simulate a larger data set when no cross-validation is required. We report the accuracy on the non-reduced, non-permuted test sets. The batch size is fixed, and the stopping measures are evaluated every quarter epoch.

Figure A.12 shows the results averaged over multiple runs (error bars denote the standard deviation). The difference between the top and the bottom panel is the data set and the patience parameters. The x-axis depicts the fraction of the data set and is warped for better accessibility. In each panel, the left-hand subplots depict the results of the reduced data set experiment, while the right-hand subplots depict the results of the permutation experiments. The y-axis of the top subplot shows the accuracy on the non-reduced, non-permuted test set. The y-axis of the bottom subplot shows when the stopping criterion was triggered.

We note the following observations, which hold for both panels: more non-permuted data yields higher test accuracy. Also, as expected, optimal stopping gives the highest test accuracy. Fixed early stopping results in inferior test accuracy when only a fraction of the data is available. Neural persistence-based stopping is triggered late when only a fraction of the data is available, which results in slightly better test accuracy compared to training and validation loss. Training loss stopping achieves similar test accuracies compared to persistence-based stopping (for all regimes except the very small data set) with shorter training on average. We note that it is generally not advisable to use training loss as a measure for stopping, because the stability of this criterion also depends on the batch size. When only a fraction of the data is available, validation loss-based stopping stops on average after the same number of training epochs as the training loss, which results in inferior test accuracy because the network has seen fewer training samples in total. Most strikingly, validation loss-based stopping is triggered later (sometimes never) when most training and validation labels are randomly permuted, which results in overfitting and poor test accuracy.

To conclude, neural persistence-based stopping achieves good performance without being affected by the batch size and noisy labels. We also note that the result is consistent for multiple architectures and most patience parameters.
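Since Algorithm 2 is only referenced here, the following minimal sketch shows one way a patience criterion over mean normalized neural persistence could look; the class name, the min_delta threshold, and the exact bookkeeping are our assumptions, not the paper's reference implementation:

class PatienceCriterion:
    """Patience-style early stopping on a monitored quantity, here the
    mean normalized neural persistence NP. Training halts once NP has
    not improved for `patience` consecutive evaluations, skipping an
    initial `burn_in` period; evaluations occur every quarter epoch."""

    def __init__(self, patience, burn_in, min_delta=1e-4):
        self.patience = patience        # measured in evaluations (quarter epochs)
        self.burn_in = burn_in          # evaluations to ignore at the start
        self.min_delta = min_delta      # assumed minimum improvement threshold
        self.best = float("-inf")
        self.evals = 0
        self.stale = 0

    def should_stop(self, np_value):
        self.evals += 1
        if self.evals <= self.burn_in:  # still in the burn-in period
            return False
        if np_value > self.best + self.min_delta:
            self.best = np_value        # NP still increasing: reset patience
            self.stale = 0
        else:
            self.stale += 1             # NP stagnating or decreasing
        return self.stale >= self.patience

With the parameter settings reported in Figure A.12 (patience and burn-in in quarter epochs), one would call should_stop once per evaluation and end training when it first returns True.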
Figure A.12, top panel: Fashion-MNIST, [500,500,200], patience = 4, burn-in = 2. Legend: Optimal, Fixed, Persistence, Training, Validation; x-axes: fraction of data and fraction of permuted labels; y-axes: test accuracy and stopping epoch.
Figure A.12, bottom panel: MNIST, [500,500,200], patience = 16, burn-in = 0.

Figure A.12: On MNIST and Fashion-MNIST, NP (in blue) stops later than validation and training loss when fewer training samples are available (left-hand side), which results in a higher test accuracy. For increasing noise in the training labels (right-hand side), the stopping of NP remains stable, in contrast to the validation loss stopping, which leads to lower test accuracy after longer training at a high fraction of permuted labels. The patience and burn-in parameters are reported in quarter epochs.

Table A.1: Parameters and hyperparameters for the experiment on best practices and neural persistence. Dropout and batch normalization were applied after the first hidden layer. Throughout the networks, ReLU was the activation function of choice.

Data set | Runs | Epochs | Architecture | Optimizer | Batch size | Hyperparameters
MNIST    | …    | …      | […, …]       | Adam      | …          | η = …, β1 = …, β2 = …, ε = …
MNIST    | …    | …      | […, …]       | Adam      | …          | η = …, β1 = …, β2 = …, ε = …; batch normalization
MNIST    | …    | …      | […, …]       | Adam      | …          | η = …, β1 = …, β2 = …, ε = …; dropout (… %)

Table A.2: Parameters and hyperparameters for the experiment on early stopping. Throughout the networks, ReLU was the activation function of choice.

Data set        | Runs | Epochs | Architecture                 | Optimizer     | Batch size | Hyperparameters
(Fashion-)MNIST | …    | …      | Perceptron                   | Minibatch SGD | …          | η = …
(Fashion-)MNIST | …    | …      | […, …, …], […, …], […, …, …] | Adam          | …          | η = …, β1 = …, β2 = …, ε = …
CIFAR-10        | …    | …      | […, …, …]                    | Adam          | …          | η = …, β1 = …, β2 = …, ε = …
IMDB            | …    | …      | […, …, …]                    | Adam          | …          | η = … × 10^…, β1 = …, β2 = …, ε = …
Figure A.13: Comparison of test set accuracy for trained networks without modifications (green), with batch normalization (yellow), and with 50% of the neurons dropped out during training (red), for the MNIST data set.

A.8 T