Faster Convergence in Deep-Predictive-Coding Networks to Learn Deeper Representations
SUBMITTED TO IEEE TNNLS (SHORT PAPER), DISTRIBUTION A: UNLIMITED
Isaac J. Sledge, Member, IEEE, and José C. Príncipe, Life Fellow, IEEE
Abstract—Deep-predictive-coding networks (DPCNs) are hierarchical, generative models that rely on feed-forward and feed-back connections to modulate latent feature representations of stimuli in a dynamic and context-sensitive manner. A crucial element of DPCNs is a forward-backward inference procedure to uncover sparse states of a dynamic model, which are used for invariant feature extraction. However, this inference and the corresponding backwards network parameter updating are major computational bottlenecks. They severely limit the network depths that can be reasonably implemented and easily trained. We therefore propose an optimization strategy, with better empirical and theoretical convergence, based on accelerated proximal gradients. We demonstrate that the ability to construct deeper DPCNs leads to receptive fields that capture well the entire notions of objects on which the networks are trained. This improves the feature representations. It yields completely unsupervised classifiers that surpass convolutional and convolutional-recurrent autoencoders and are on par with convolutional networks trained in a supervised manner. This is despite the DPCNs having orders of magnitude fewer parameters.
Index Terms—Bio-inspired vision, predictive coding, unsupervised learning
1. Introduction
Predictive coding is a promising theory for sensory information processing. Under this theory, a dynamics-based, hierarchical, generative model [1, 2] of the world is formed and consistently updated to infer the possible physical causes of given stimuli while suppressing prediction errors at all levels [3]. Top-down connections carry predictions about activity, in the form of the causes, to the lower model levels. The propagated causes reflect past experience [4] and act as priors to disambiguate the incoming sensory inputs. Bottom-up connections relay prediction errors to higher levels to update the causes. The interaction of the feed-forward and feed-back connections [5] on the causes enables robust object analysis [6] from the observed stimuli.

Several predictive coding schemes have been created and their biological plausibility investigated [1, 4, 7, 8]. None of the early contributions, however, have been known to extract highly discriminative details, like sparse, invariant feature representations, that are helpful for high-level object analysis tasks. Our lab thus developed multi-layer, deep-predictive-coding networks (DPCNs) [9–11]; alternate networks later followed [12, 13].

DPCNs can be thought of as parameter-light, non-traditional autoencoders with feed-forward and recurrent feed-back connections. They do not have a corresponding decoder, though, so they require a self-organizing principle to be effective.

During training, DPCNs learn, according to a free-energy principle [14], to build spatio-temporal, transformation-invariant representations of dynamic input stimuli [15]. This yields an approximate identity mapping that preserves perceptual difference. The underlying feature representations are composed of hidden states and causes. Hidden states describe conditional dependencies over time in the stimuli. Hidden causes are transform-invariant versions of the states that mediate conditional state dependencies.
For static input stimuli, DPCNs behave similarly to sparse-coding models [16].

DPCNs have exhibited promise for implementing unsupervised object recognition in images and video, especially when the objects are well localized throughout space and time. It has been demonstrated that DPCNs learn sparse features that are better for classification than those from alternate convolutional and deconvolutional autoencoders on many benchmark datasets. Such behavior stemmed from the interaction of feed-forward and feed-back connections in the networks. It also arose due to the implicit supervision imposed from leveraging temporal information [17]. Moreover, multiple presentations of the stimuli facilitated the extraction of spatial and temporal regularities [18, 19], which we hypothesize permitted high-level object analysis [20], like class recognition.
Isaac J. Sledge is the Senior Machine Learning Scientist with the Advanced Signal Processing and Automated Target Recognition Branch, Naval Surface Warfare Center, Panama City, FL, USA (email: [email protected]). He is the director of the Machine Intelligence Defense (MIND) lab at the US Naval Sea Systems Command.

José C. Príncipe is the Don D. and Ruth S. Eckis Chair and the Distinguished Professor with both the Department of Electrical and Computer Engineering and the Department of Biomedical Engineering, University of Florida, Gainesville, FL, USA (email: principe@ufl.edu). He is the director of the Computational NeuroEngineering Laboratory (CNEL) at the University of Florida.

The work of the authors was funded by grants N00014-14-1-0542 (Marc Steinberg, ONR35), N00014-19-WX-00636 (Marc Steinberg, ONR35), and N00014-21-WX-00476 (J. Tory Cobb, ONR32) from the US Office of Naval Research. The first author was also supported by in-house laboratory independent research (ILIR) grant N00014-19-WX-00687 (Frank Crosby) from the US Office of Naval Research and a Naval Innovation in Science and Engineering (NISE) grant from NAVSEA.
Learning sufficiently robust features in a DPCN has been shown to be quite computationally intensive, though. A multi-stage optimization strategy, based on proximal gradients [21, 22], was used in [9–11] to conduct feed-forward and feed-back inference. Sub-quadratic function-value convergence rates were theoretically guaranteed for this strategy [23, 24], but only sub-linear rates were ever obtained due to severe oscillations; that is, the search was not a descent strategy and could lead to localized increases in the cost function across iterations. The DPCNs were thus practically limited to two layers, which, while sufficient for characterizing certain stimuli, may not yield representations that handle objects in complex environments. The networks exhibited poor empirical performance when extended beyond two layers due to being stymied by the lack of a reasonable convergence rate [25]. That is, the deeper layers did not reach a stable feature representation, which impacted the representations in preceding layers.

Here, we propose an alternate optimization approach for DPCNs to go beyond the two-layer network limitation. We replace the proximal gradient search with an accelerated version (see section 2). This approach possesses a sub-polynomial rate of function-value convergence and largely avoids cost rippling, thereby greatly improving the empirical convergence rate (see section 2). We are thus able to efficiently train both deeper and wider networks that stabilize well. The resulting deeper feature representations are far more robust than those previously obtained for DPCNs. In particular, the later-layer causes have receptive fields that embody the entirety of the objects being presented, despite the lack of training labels, on a variety of benchmark datasets. This yields unsupervised classifiers that are on par with supervised-trained convolutional and convolutional-recurrent deep networks (see section 3). The DPCNs have orders of magnitude fewer parameters, though, than these other deep networks.
2. Deep Predictive Coding
The objective of predictive coding is to approximate external sensory stimuli using generative, latent-variable models. Such models hierarchically encode only the residual prediction errors so that the internal representations are modified only for unexpected changes in the stimuli. The prediction errors are the differences between either the actual stimuli or a transformed version of it and the predicted stimuli produced from the underlying latent variables; we refer to these latent variables as causes.
Definition 1.
Let $y_t \in \mathbb{R}^p$ represent a time-varying sensory stimulus at time $t$. The stimuli can be described by an underlying cause, $\kappa_{1,t} \in \mathbb{R}^{d_1}$, and a time-varying intermediate state, $\gamma_{1,t} \in \mathbb{R}^{k_1}_+$, through a pair of $\theta$-parameterized mapping functions, $f_1 : \mathbb{R}^{k_1}_+ \to \mathbb{R}^p$, the cause-update function, and $g_1 : \mathbb{R}^{k_1}_+ \times \mathbb{R}^{d_1} \to \mathbb{R}^{k_1}_+$, the state-transition function. These functions define a latent-variable model,
$$y_t = f_1(\gamma_{1,t}; \theta) + \epsilon_{1,t}, \qquad \gamma_{1,t} = g_1(\gamma_{1,t-1}, \kappa_{1,t}; \theta) + \epsilon'_{1,t}.$$
Here, $\epsilon_{1,t} \in \mathbb{R}^p$ and $\epsilon'_{1,t} \in \mathbb{R}^{k_1}$ are noise terms that represent the stochastic and model uncertainty, respectively, in the predictions. This model can be extended to a multi-layer hierarchy by cascading additional $\theta$-parameterized mapping functions, $f_i : \mathbb{R}^{k_i}_+ \to \mathbb{R}^{d_{i-1}}$ and $g_i : \mathbb{R}^{k_i}_+ \times \mathbb{R}^{d_i} \to \mathbb{R}^{k_i}_+$, at each layer $i$ beyond the first,
$$\kappa_{i-1,t} = f_i(\gamma_{i,t}; \theta) + \epsilon_{i,t}, \qquad \gamma_{i,t} = g_i(\gamma_{i,t-1}, \kappa_{i,t}; \theta) + \epsilon'_{i,t}.$$
Here, $\kappa_{i,t} \in \mathbb{R}^{d_i}$, $\gamma_{i,t} \in \mathbb{R}^{k_i}_+$, $\epsilon_{i,t} \in \mathbb{R}^{d_{i-1}}$, and $\epsilon'_{i,t} \in \mathbb{R}^{k_i}$.

For a hierarchical, predictive-coding model, both feed-forward, bottom-up and feed-back, top-down processes are used to characterize observed stimuli. In the former case, the observed stimuli are propagated, in a feed-forward fashion through the model, to extract progressively abstract details. The stimuli are first converted to a series of states that encode spatio-temporal relationships. These states are then made invariant to various transformations, thereby forming the hidden causes. The causes at lower layers of the model form the observations to the layers above. Hidden causes therefore provide a link between the layers. The states, in contrast, both connect the dynamics over time, to ignore temporal discontinuities [26], and mediate the effects of the causes on the stimuli [4].
In the latter case, the model generates top-down predictions such that the neural activity at one layer predicts the activity at a lower layer. The predictions from a higher level are sent through feed-back connections to be compared to the actual activity. This yields a model uncertainty error that is forwarded to subsequent layers to update the population activity and improve prediction. Such a top-down process repeats until the bottom-up stimuli transformation process no longer imparts any new information. That is, there are no unexpected changes in the stimuli that the model cannot predict. Once this occurs, if the model is able to synthesize the input stimuli accurately using the uncovered features, then it means that it has previously seen a similar observation [27].

In the remainder of this section, we propose an efficient architecture for this hierarchical, latent-variable model that is suitable for uncovering discriminative details from the stimuli. We consider a convolutional, accelerated DPCN (ADPCN) that can extract highly sparse, invariant features for either dynamic or static stimuli (see section 2.1). We then show how to effectively infer the ADPCN's latent variables using a fast proximal gradient scheme (see section 2.2). This optimization process permits effectively forming deep hierarchies. Theoretical convergence properties are presented in the online appendix (see appendix A).
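The hierarchy of Definition 1 can be illustrated with a small numerical sketch. This is a toy stand-in, not the paper's implementation: random linear maps replace the learned, $\theta$-parameterized $f_i$ and $g_i$, a rectification stands in for whatever keeps the states nonnegative, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(k, d, p):
    """One layer: a cause-update map f (states -> observation below) and a
    state-transition map g (previous state + cause -> state). Random linear
    maps are stand-ins for the learned theta-parameterized functions."""
    F = rng.standard_normal((p, k)) * 0.1   # f: R^k -> R^p
    A = rng.standard_normal((k, k)) * 0.1   # state portion of g
    B = rng.standard_normal((k, d)) * 0.1   # cause portion of g
    return F, A, B

def step(layer, gamma_prev, kappa):
    """Advance one time step: gamma_t = g(gamma_{t-1}, kappa_t) + noise,
    then predict the layer-below observation via f(gamma_t)."""
    F, A, B = layer
    gamma = np.maximum(A @ gamma_prev + B @ kappa, 0.0)  # keep states nonneg.
    y_hat = F @ gamma
    return gamma, y_hat

k, d, p = 8, 4, 16
layer = make_layer(k, d, p)
gamma = np.zeros(k)
for t in range(5):
    kappa = rng.standard_normal(d)           # exogenous cause at time t
    gamma, y_hat = step(layer, gamma, kappa)
print(y_hat.shape)
```

Stacking further layers amounts to feeding each layer's causes as the "observation" of the layer above, exactly as the cascaded $f_i, g_i$ equations state.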
Figure 1: An overview of the DPCN architecture, which, for simplicity, is presented for only two layers. Note that the final layer of the DPCN has no output, unlike in a standard autoencoder network. The goal of the DPCN is to learn a series of causes that explain the input stimuli, in this case, frames from a video of a bird, and hence recreate them. Each layer can be roughly decomposed into two inference blocks, one for updating the states and the other for updating the causes. State and causal inference relies on intra-layer feed-forward (black lines) and feed-back processes (gray lines), along with intra-layer recurrent feed-back (blue lines). Much of the network is devoted to feed-back processes to provide self supervision. Inter-layer feed-back (red line) is used to update the sparsity parameter for the states. For the inference process, we denote multiplication of a quantity on a given path with a rounded square. Addition, subtraction, and multiplication of quantities along multiple paths are denoted using circular gated symbols. Pooling and un-pooling are denoted, respectively, with up and down arrows inside rounded squares. Lastly, the function blocks apply either a sparsity operation or an exponential-function operation to the quantities on the given path; the corresponding parameter values for these operations are offset from these blocks. We omit showing the actual optimization process but note that the feed-forward connections are largely devoted to computing gradients of the two cost functions, $\nabla_{\gamma_{i,t}} L$ and $\nabla_{\kappa_{i,t}} L$, that are used to update the states $\gamma_{i,t}$ and causes $\kappa_{i,t}$.

Our ADPCNs consist of two stages at each layer, which are outlined in fig. 1. The first stage entails inferring the hidden states, which essentially are a feature-based representation used to describe the stimuli.
States are formed, at the first layer, via sparse coding in conjunction with a temporal-state-space model to map the stimuli to an over-complete dictionary of convolutional filters. Subsequent DPCN layers follow the same process, with the only change being that the hidden causes assume the role of the observed stimuli.

We define state inference via a least-absolute-shrinkage-and-selection-operator (LASSO) cost. We present this, here, for the case of single-channel stimuli. The extension to multiple channels is straightforward.
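The non-smooth part of any LASSO cost is handled by the soft-thresholding proximal operator, which appears repeatedly in what follows. A minimal sketch of this standard operator (not code from the paper):

```python
import numpy as np

def soft_threshold(x, thresh):
    """Proximal operator of thresh * ||.||_1: shrinks every entry toward
    zero and clamps small entries exactly to zero, yielding sparsity."""
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

x = np.array([-1.5, -0.2, 0.0, 0.3, 2.0])
print(soft_threshold(x, 0.5))
```

Entries with magnitude below the threshold become exactly zero; the rest are shrunk by the threshold, which is the "shrinkage" that the later $L_0$-versus-$L_1$ discussion refers to.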
Definition 2.
Let $\gamma_{i,t} \in \mathbb{R}^{k_i}_+$ be the hidden states at time $t$ and at model layer $i$. Let $C_i \in \mathbb{R}^{k_i \times k_i}$ be a hidden-state-transition matrix. Let $D_i^{\top} \in \mathbb{R}^{d_{i-1} \times k_i}$ be a Toeplitz-form matrix with $q_i$ filters structured as in [28]. The state-inference cost function to be minimized, with respect to $\gamma_{i,t}$, $C_i$, and $D_i^{\top}$, is given by
$$L(\gamma_{i,t}, \kappa_{i,t}, C_i, D_i^{\top}; \alpha_i, \lambda_{i,t}) = \frac{1}{2}\Bigg( \|\kappa_{i-1,t} - D_i^{\top}\gamma_{i,t}\|_2^2 + \alpha_i \|\gamma_{i,t} - C_i\gamma_{i,t-1}\|_1 + \sum_{k'=1}^{k_i} [\lambda_{i,t}]_{k'}\, \big|[\gamma_{i,t}]_{k'}\big| \Bigg),$$
where $\kappa_{0,t} = y_t$. The first term in this cost quantifies the $L_2$ prediction error, $\epsilon_{i,t} = \kappa_{i-1,t} - D_i^{\top}\gamma_{i,t}$, at layer $i$. For the input layer, this error is $\epsilon_{1,t} = y_t - D_1^{\top}\gamma_{1,t}$. In either case, the aim is to ensure that the local reconstruction error between layers is minimized. The second term constrains the next-state dynamics to be described by the state-transition matrix. For static stimuli, indexed by $t$, the state feed-back is replaced by $\kappa_{i,t} - D_{i+1}^{\top}\gamma_{i+1,t}$. The strength of the recurrent feed-back connection is driven by $\alpha_i \in \mathbb{R}_+$. The transitions are $L_1$-sparse to make the state-space representation consistent. Without such a norm penalty, the innovations would not be sparse due to the feed-back. The final term enforces $L_1$-sparsity of the states, with the amount controlled by $\lambda_{i,t} \in \mathbb{R}^{k_i}_+$.

Proposition 1.
Let $\gamma_{i,t} \in \mathbb{R}^{k_i}_+$ be the hidden states at time $t$ and at model layer $i$. Let $D_i^{\top} \in \mathbb{R}^{d_{i-1} \times k_i}$ be a Toeplitz-form matrix of $q_i$ filters. The matrix-vector multiplication $D_i^{\top}\gamma_{i,t}$ is functionally equivalent to convolution for all layers $i$.

When projected back into the original visual space of the input, the dictionaries define a series of receptive fields. The hidden states, at least for the initial layers of the hierarchy, thus act as basic feature detectors. They often resemble the simple cells in the visual cortex [29].

Ideally, we would prefer $L_0$-sparsity to the $L_1$ variant used in the state-inference cost, as it does not impose shrinkage on the hidden-state values. However, by sacrificing this property, we gain cost convexity, which aids in efficient numerical optimization with provable convergence.

Proposition 2.
Let $\gamma_{i,t} \in \mathbb{R}^{k_i}_+$ be the hidden states at time $t$ and at model layer $i$. Let $C_i \in \mathbb{R}^{k_i \times k_i}$ be a hidden-state-transition matrix. Let $D_i^{\top} \in \mathbb{R}^{d_{i-1} \times k_i}$ be a Toeplitz-form matrix with $q_i$ filters. For $\alpha_i \in \mathbb{R}_+$ and $\lambda_{i,t} \in \mathbb{R}^{k_i}_+$, the hidden-state cost function $L(\gamma_{i,t}, \kappa_{i,t}, C_i, D_i^{\top}; \alpha_i, \lambda_{i,t})$ is convex.

The state-based feature representations constructed by the first stage are not guaranteed to be invariant to various transforms. Discrimination can be impeded, as a result. The second DPCN processing stage thus entails explicitly imposing this behavior. Local translation invariance is attained by leveraging the spatial relationships of the states in neighborhoods via the sum-pooling of states. Invariance to more complex transformations, like rotation and spatial frequency, is made possible through the inference of subsequent hidden causes.
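The sum-pooling step behind this local translation invariance can be sketched as follows. The helper is hypothetical (2-D, single-channel state maps), not the paper's code:

```python
import numpy as np

def sum_pool(states, win):
    """Sum-pool a 2-D state map over non-overlapping win x win windows.
    Activations within a window are merged, so a small translation of a
    feature inside its window leaves the pooled response unchanged."""
    h, w = states.shape
    states = states[: h - h % win, : w - w % win]          # trim ragged edge
    return states.reshape(h // win, win, w // win, win).sum(axis=(1, 3))

g = np.zeros((4, 4))
g[0, 1] = 1.0                     # a single active state
g_shift = np.zeros((4, 4))
g_shift[1, 0] = 1.0               # the same state, translated within a window
print(np.array_equal(sum_pool(g, 2), sum_pool(g_shift, 2)))
```

Both inputs pool to the same map, which is precisely the invariance the second stage exploits; invariance to richer transformations still requires the cause inference described next.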
Sparse cause inference is driven by a LASSO-based cost that captures non-linear dependencies between components in the pooled states.
Definition 3.
Let $\gamma_{i,t} \in \mathbb{R}^{k_i}_+$ be the hidden states and $\kappa_{i,t} \in \mathbb{R}^{d_i}$ be the bottom-up hidden causes at time $t$ and model layer $i$. Let $\kappa'_{i,t} \in \mathbb{R}^{d_i}$ be the top-down-inferred causes. Let $C_i \in \mathbb{R}^{k_i \times k_i}$ be a hidden-state-transition matrix. Let $D_i^{\top} \in \mathbb{R}^{d_{i-1} \times k_i}$ be a Toeplitz-form matrix of $q_i$ filters. Let $G_i \in \mathbb{R}^{d_i \times k_i}$ be an invariant matrix. The hidden-cause cost to be minimized, with respect to $\kappa_{i,t}$ and $G_i$, is given by
$$L(\gamma_{i,t}, \kappa_{i,t}, G_i; \alpha'_i, \lambda'_i, \eta'_i, \lambda_{i,t}) = \frac{1}{2}\Bigg( \sum_{j=1}^{n} \sum_{k'=1}^{k_i} [\lambda_{i,t}]_{k'}\, \big|[\gamma^j_{i,t}]_{k'}\big| + \eta'_i \|\kappa_{i,t} - \kappa'_{i,t}\|_2^2 + \lambda'_i \|\kappa_{i,t}\|_1 \Bigg),$$
where $[\lambda_{i,t}]_{k'} = \alpha'_i (1 + \exp(-[G_i \kappa_{i,t}]_{k'}))$, with $\alpha'_i \in \mathbb{R}_+$. The first term in this cost models the multiplicative interaction of the causes $\kappa_{i,t}$ with the sum-pooled states $\gamma^j_{i,t}$ through the invariant matrix $G_i$. This characterizes the shape of the sparse prior on the states. That is, the invariant matrix is adapted such that each component of the causes is connected to element groups in the accumulated states that co-occur frequently. Co-occurring components typically share common statistical regularities, thereby yielding locally invariant representations [30]. The second term specifies that the difference between the bottom-up causes $\kappa_{i,t}$ and the top-down-inferred causes $\kappa'_{i,t}$ should be small, with the term weight specified by $\eta'_i \in \mathbb{R}_+$. The final term imposes $L_1$ sparsity, with the amount controlled by $\lambda'_i \in \mathbb{R}_+$, to prevent the intermediate representations from being dense.

The causes obtained by solving the above LASSO cost will behave somewhat like complex cells in the visual cortex [31].
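The three terms of the hidden-cause cost can be written out directly. The sketch below is illustrative only: it uses dense arrays with hypothetical shapes (here the invariant matrix maps the $d$ causes to $k$ state-sparsity weights) and a single pooled state map:

```python
import numpy as np

def cause_cost(gamma_pooled, kappa, kappa_td, G, alpha_p, lam_p, eta_p):
    """Hidden-cause cost: a cause-modulated sparse prior on the pooled
    states, a top-down consistency term, and an L1 penalty on the causes."""
    lam = alpha_p * (1.0 + np.exp(-G @ kappa))         # state sparsity weights
    prior = np.sum(lam * np.abs(gamma_pooled))         # multiplicative interaction
    topdown = eta_p * np.sum((kappa - kappa_td) ** 2)  # bottom-up vs. top-down
    sparsity = lam_p * np.sum(np.abs(kappa))           # keep causes sparse
    return 0.5 * (prior + topdown + sparsity)

rng = np.random.default_rng(1)
k, d = 12, 5
G = rng.standard_normal((k, d)) * 0.5   # invariant matrix (causes -> weights)
gamma_pooled = np.abs(rng.standard_normal(k))
kappa = rng.standard_normal(d)
kappa_td = np.zeros(d)                   # no top-down prediction yet
c = cause_cost(gamma_pooled, kappa, kappa_td, G, 1.0, 0.1, 1.0)
print(c > 0.0)
```

Note how a strongly active cause (large $[G\kappa]_{k'}$) drives its associated weight toward $\alpha'_i$, cheapening the state group it explains; inactive causes leave weights near $2\alpha'_i$, penalizing unexplained states.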
Similar results are found in temporally coherent networks [32], albeit without guaranteed feature invariance.

As with the state-inference cost, we employ $L_1$ sparsity in the hidden-cause cost for practical reasons, even though we would prefer $L_0$ sparsity for its theoretical appeal.

Proposition 3.
Let $\gamma_{i,t} \in \mathbb{R}^{k_i}_+$ be the hidden states and $\kappa_{i,t} \in \mathbb{R}^{d_i}$ be the hidden causes at time $t$ and model layer $i$. Let $C_i \in \mathbb{R}^{k_i \times k_i}$ be a hidden-state-transition matrix, $D_i^{\top} \in \mathbb{R}^{d_{i-1} \times k_i}$ be a Toeplitz-form matrix of filters, and $G_i \in \mathbb{R}^{d_i \times k_i}$ be an invariant matrix. For $\alpha'_i, \lambda'_i, \eta'_i \in \mathbb{R}_+$ and $\lambda_{i,t} \in \mathbb{R}^{k_i}_+$, the hidden-cause cost function $L(\gamma_{i,t}, \kappa_{i,t}, G_i; \alpha'_i, \lambda'_i, \eta'_i, \lambda_{i,t})$ is convex.

Defining states and causes as we have specified above has significant advantages. DPCNs are, for instance, incredibly parameter efficient compared to standard recurrent-convolutional autoencoders. Often, few filters are needed to adequately synthesize an observed stimulus under varying conditions, which is a byproduct of the explicit feature invariance imposed by the non-linear, sparse cause inference.

The propagation and transformation of observed stimuli in a DPCN is more involved than for standard network architectures. At any layer in the ADPCN, the hidden, sparse states and unknown, sparse causes that minimize the two-part LASSO cost must be inferred to create the feed-forward observations for the next DPCN layer. Joint inference of the states and causes can be done in a manner similar to block coordinate descent. That is, for a given mini-batch of stimuli, the states can be updated by solving the corresponding LASSO cost while holding the causes fixed. The causes can then be updated while holding the states fixed. Altering either of these representations amounts to solving a convolutional, $L_1$-sparse-coding problem. The presence of discontinuous, $L_1$-based terms in the LASSO costs complicates the application of standard optimization techniques, though.

Here, we consider a fast proximal-gradient-based approach for separating and accounting for the smooth and non-smooth components of the LASSO costs. This approach is motivated and analyzed in the appendix.

Definition 4.
Let $\gamma_{i,t} \in \mathbb{R}^{k_i}_+$ be the hidden states and $\pi_{i,t} \in \mathbb{R}^{k_i}_+$ be the auxiliary hidden states at time $t$ and model layer $i$. These auxiliary hidden states will be linear combinations of hidden states across different times. Let $C_i \in \mathbb{R}^{k_i \times k_i}$ be a hidden-state-transition matrix. As well, let $D_i^{\top} \in \mathbb{R}^{d_{i-1} \times k_i}$ be a Toeplitz-form matrix of $q_i$ filters. For an inertial sequence $\beta_m \in \mathbb{R}_+$ and an adjustable step size $\tau^m_{i,t} \in \mathbb{R}_+$, the hidden-state inference process, indexed by iteration $m$, is given by the following expressions:
$$\gamma^m_{i,t+1} = \mathrm{PROX}_{\lambda_{i,t}\tau^m_{i,t}}\Big( \pi^m_{i,t} - \tau^m_{i,t}\,\nabla_{\pi^m_{i,t}} L(\pi^m_{i,t}, \kappa_{i,t}, C_i, D_i^{\top}; \alpha_i, \lambda_{i,t}) \Big), \qquad \pi^{m+1}_{i,t} = \gamma^m_{i,t+1} + \beta_m\big(\gamma^m_{i,t+1} - \gamma^{m-1}_{i,t+1}\big),$$
with $\nabla_{\pi^m_{i,t}} L(\pi^m_{i,t}, \kappa_{i,t}, C_i, D_i^{\top}; \alpha_i, \lambda_{i,t}) = -D_i\big(\kappa_{i-1,t} - D_i^{\top}\pi^m_{i,t}\big) + \alpha_i\,\Omega_i(\pi^m_{i,t})$. Here, we use a Nesterov smoothing, $\Omega_i(\pi^m_{i,t}) = \arg\max_{\|\Omega_{i,t}\|_\infty \le 1}\, \Omega^{\top}_{i,t}\big(\pi^m_{i,t} - C_i\pi^{m-1}_{i,t}\big)$, $\Omega_{i,t} \in \mathbb{R}^{k_i}$, to approximate the non-smooth state transition. Small values for the hidden states are clamped via a soft-thresholding function implicit to the proximal operator, which leads to a sparse solution. The states are then spatially max pooled over local neighborhoods, using non-overlapping windows, to reduce their resolution, $\gamma_{i,t+1} \leftarrow \mathrm{POOL}(\gamma_{i,t+1})$.

Definition 5.
Let $\gamma_{i,t} \in \mathbb{R}^{k_i}_+$ be the hidden states and $\kappa_{i,t} \in \mathbb{R}^{d_i}$ be the hidden causes at time $t$ and model layer $i$. Let $C_i \in \mathbb{R}^{k_i \times k_i}$ be a hidden-state-transition matrix, $D_i^{\top} \in \mathbb{R}^{d_{i-1} \times k_i}$ be a Toeplitz-form matrix of $q_i$ filters, and $G_i \in \mathbb{R}^{d_i \times k_i}$ be an invariant matrix. For an adjustable step size $\tau'^m_{i,t} \in \mathbb{R}_+$ and inertial sequence $\beta'_m \in \mathbb{R}_+$, the hidden-cause inference process, indexed by $m$, is given by the following expressions:
$$\kappa^m_{i,t+1} = \mathrm{PROX}_{\lambda'_i\tau'^m_{i,t}}\Big( \pi'^m_{i,t} - \tau'^m_{i,t}\,\nabla_{\pi'^m_{i,t}} L(\gamma_{i,t+1}, \pi'^m_{i,t}, G_i; \alpha'_i, \lambda'_i, \eta'_i, \lambda_{i,t}) \Big), \qquad \pi'^{m+1}_{i,t} = \kappa^m_{i,t+1} + \beta'_m\big(\kappa^m_{i,t+1} - \kappa^{m-1}_{i,t+1}\big),$$
with $\nabla_{\pi'^m_{i,t}} L(\gamma_{i,t+1}, \pi'^m_{i,t}, G_i; \alpha'_i, \lambda'_i, \eta'_i, \lambda_{i,t}) = -\alpha'_i G_i^{\top}\big(\exp(-G_i\pi'^m_{i,t}) \odot |\gamma^j_{i,t+1}|\big) + 2\eta'_i\big(\pi'^m_{i,t} - \kappa'_{i,t}\big)$. Small values for the hidden causes are clamped via an implicit soft-thresholding function, which leads to a sparse solution. The inferred causes are then used to update the sparsity parameter, $\lambda_{i,t+1} = \alpha'_i\big(1 + \exp(-\mathrm{UNPOOL}(G_i\kappa_{i,t+1}))\big)$, using spatial max unpooling.

In both cases, the step size is bounded by the Lipschitz constant of the LASSO cost to be solved. The choice of the inertial sequence greatly affects the convergence properties of the optimization.

Proposition 4.
Let $\gamma_{i,t} \in \mathbb{R}^{k_i}_+$ be the hidden states and $\kappa_{i,t} \in \mathbb{R}^{d_i}$ be the hidden causes. The state iterates $\{\gamma^m_{i,t+1}\}_{m=1}^{\infty}$ strongly converge to the global solution of $L(\gamma_{i,t}, \kappa_{i,t}, C_i, D_i^{\top}; \alpha_i, \lambda_{i,t})$ for the accelerated proximal gradient scheme. Likewise, the cause iterates $\{\kappa^m_{i,t+1}\}_{m=1}^{\infty}$ for the accelerated proximal gradient scheme strongly converge to the global solution of $L(\gamma_{i,t+1}, \kappa_{i,t}, G_i; \alpha'_i, \lambda'_i, \eta'_i, \lambda_{i,t})$ at a sub-polynomial rate. This occurs when using the inertial sequences $\beta_m, \beta'_m = (k_m - 1)/k_{m+1}$, where $k_m$ depends polynomially on $m$.

In this bottom-up inference process, there is an implicit assumption that the top-down predictions of the causes are available. This, however, is not the case for each iteration of a mini batch being propagated through the DPCN. We therefore consider an approximate, top-down prediction using the states from the previous time instance and, starting from the first layer, perform bottom-up inference using this prediction.

Definition 6.
At the beginning of every time step, using the state-space model at each layer, the likely top-down causes, $\kappa'_{i-1,t+1} \in \mathbb{R}^{d_{i-1}}$, are predicted using the previous states $\gamma_{i,t} \in \mathbb{R}^{k_i}_+$ and the causes $\kappa_{i,t} \in \mathbb{R}^{d_i}$. That is, for the filter dictionary matrix, the following top-down update is performed:
$$\kappa'_{i-1,t+1} = D_i^{\top}\gamma'_{i,t+1}, \qquad \gamma'_{i,t+1} = \arg\min_{\gamma_{i,t+1}} \Big( \alpha_i\big\|\gamma_{i,t+1} - C_i\gamma_{i,t}\big\|_1 + \alpha'_i\big\|\gamma_{i,t+1} \odot \exp(-\mathrm{UNPOOL}(G_i\kappa_{i,t}))\big\|_1 \Big),$$
except for the last layer, wherein $\kappa'_{i,t+1} = \kappa_{i,t}$. This minimization problem has an algebraic expression for the global solution: $[\gamma'_{i,t+1}]_{k'} = [C_i\gamma_{i,t}]_{k'}$ whenever $\alpha'_i[\lambda_{i,t}]_{k'} < \alpha_i$, and zero otherwise.

These top-down predictions serve an important role during inference, as they transfer abstract knowledge from higher layers into lower ones, thereby improving the overall representation quality. The predictions also modulate the representations due to state zeroing by the sparsity hyperparameter.

Alongside the state and cause inference is a learning process for fitting the ADPCN parameters to the stimuli. Here, we consider layer-wise, gradient-descent training without top-down information, which is performed once inference has stabilized for a given mini batch. An overview of this procedure, within the variable inference, is presented in the online appendix (see appendix A).
3. Simulation Results
We now assess the capability of our inference strategy for unsupervised ADPCNs. We focus on static visual stimuli (see section 3.1). We demonstrate that ADPCNs uncover stable, meaningful feature representations more quickly than DPCNs (see section 3.2). This permits ADPCNs to often exceed the performance of deep unsupervised and deep supervised networks.
We relied on five datasets for our simulations. Two of these, MNIST and FMNIST, were of single-channel, static visual stimuli. The remaining three, CIFAR-10/100 and STL-10, contained multi-channel, static visual stimuli. We whitened each dataset and zeroed their means. We relied on the default training and test set definitions for each dataset.
Training and Inference Protocols.
For learning the DPCN and ADPCN parameters, we relied on ADAM-based gradient descent with mini batches [33]. We set a small initial learning rate, $\eta$, which helped prevent overshooting the global optimum. The learning rate was decreased by half every epoch. We used exponential decay rates of $9.0 \times 10^{-1}$ and $9.9 \times 10^{-1}$ for the first- and second-order gradient moments in ADAM, respectively, which were employed to perform bias correction and adjust the per-parameter learning rates. A small epsilon additive factor was used to preempt division by zero. We used an initial forgetting factor value of $\theta$.
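The ADAM moment estimates, bias correction, and epsilon-guarded per-parameter step referenced above follow the standard update rule [33]; a minimal sketch with generic hyperparameter names (not the paper's exact settings):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.99, eps=1e-8):
    """One ADAM update: exponentially decayed first/second gradient moments,
    bias correction, then a per-parameter learning-rate adjustment."""
    m = b1 * m + (1.0 - b1) * grad             # first-moment estimate
    v = b2 * v + (1.0 - b2) * grad ** 2        # second-moment estimate
    m_hat = m / (1.0 - b1 ** t)                # bias correction (t starts at 1)
    v_hat = v / (1.0 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # eps preempts /0
    return theta, m, v

# sanity check: minimize f(theta) = 0.5*||theta||^2, whose gradient is theta
theta = np.array([1.0, -2.0])
m = np.zeros(2)
v = np.zeros(2)
for t in range(1, 501):
    theta, m, v = adam_step(theta, theta.copy(), m, v, t, lr=0.05)
print(np.abs(theta).max())
```

The decay rates `b1` and `b2` correspond to the first- and second-moment rates quoted above.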
Figure 2: A comparison of accelerated proximal gradient inference and learning (left, blue) and proximal gradient inference and learning (right, red) on the MNIST dataset. The presented results are shown after training with mini batches for two epochs. (a) Polar scatter plots of the orientation angles versus spatial frequency for the first-layer causes. (b) Line plots of the normalized center positions with included orientations for the first-layer causes. For both (a) and (b), we fit Gabor filters to the first-layer causes; locally optimal filter parameters were selected via a gradient-descent scheme. The plots are color-coded according to the connection strength between the invariant matrix and the observation matrix in the first network layer. Higher connection strengths indicate subsets of dictionary elements from the observation matrix that are most likely active when a column of the invariance matrix is also active. If a DPCN has been trained well, then the filters should have a small orientation-angle spread. Each plot represents a randomly chosen column of the first-layer invariance matrix. (c)–(e) Back-projected causes from the first, second, and third layers of the networks, respectively. Each plot represents a randomly chosen cause. The back-projected causes can be interpreted as receptive fields, with darker colors indicating a higher degree of activation. For each layer, we assess the filter similarity and provide VAT similarity plots. In these plots, low similarities are denoted using gray while higher similarities are denoted using progressively more vivid shades of either blue (accelerated proximal gradients) or red (proximal gradients). If a DPCN has been trained well, then there should be few to no duplicate filters. There, hence, should not be any conspicuous blocky structures along the main diagonal of the VAT similarity plots. (f)–(g) Reconstructed instances from a random batch at the first and third layers, respectively. For each layer, we also assess the feature similarity between the original training samples and the reconstructed versions and provide corresponding scatter plots. If a DPCN reconstructs the input samples well, then there should be a strong linear relationship between the features. Higher distributional spreads and shifts away from the main diagonal indicate larger reconstruction errors.
For DPCN and ADPCN inference, we relied on variable numbers of filters along with variable filter sizes, with the actual amounts determined by the specific stimuli dataset. We set the layer-wise sparsity parameters $\lambda_i, \lambda'_i$ to values that permitted retaining much of the visual content in the first two layers while compressing it more pronouncedly in the third. Some stimuli datasets have slightly altered parameter values. Due to the static nature of the stimuli, we did not have temporal state feed-back. We did, however, propagate the cause-state difference between layers and set the feed-back strengths, for most simulations, to $\alpha_1, \alpha_2 = 1$ and $\alpha_3 = 3$. The stronger feed-back amount in the third layer aided in suppressing noise without adversely impacting the priors in earlier layers. We fixed the causal sparsity constants to $\alpha'_1, \alpha'_2, \alpha'_3 = 1$. We terminated the accelerated and regular proximal inference processes after 50 and 1000 iterations, respectively, per mini batch. A significantly lower number of iterations was used in the former case, since the chosen inertial sequence facilitated quick convergence. We considered the same architecture for the DPCNs and ADPCNs.
At the first layer, we used 32 states with 7 × 7 filters.
Figure 3: A comparison of accelerated proximal gradient inference and learning (left, blue) and proximal gradient inference and learning (right, red) on the FMNIST dataset. The presented results are shown after training with mini batches for two epochs. See fig. 2 for descriptions of the plots.

Simulation Results.
Simulation findings are presented in fig. 2 and fig. 3 for the single-channel MNIST and FMNIST datasets, respectively. Findings for the multi-channel CIFAR-10/100 and STL-10 datasets are, respectively, shown in fig. 4 and fig. 5. The results for both ADPCNs and DPCNs were obtained after two epochs.

For these datasets, the ADPCNs were successful in quickly uncovering invariant representations. Most of the columns in the invariance matrix grouped together dictionary elements that had very similar orientation and frequency while being insensitive to translation (see fig. 2(a) to fig. 5(a)). Likewise, for each active invariance-matrix column, a subset of the dictionary elements were grouped together by orientation and spatial position, which indicated invariance to other properties like spatial frequency and center position (see fig. 2(b) to fig. 5(b)). The DPCNs, in comparison, had representations that would be significantly altered by transformations other than translation. This occurred because subsets of the dictionary elements were not grouped according to various characteristics. Discrimination performance hence suffered for stimuli samples that were slightly altered.

As well, the ADPCNs learned meaningful filters from the stimuli. The first two layers of our ADPCNs had causal receptive fields that mimicked the behavior of simple and complex cells in the primate vision system (see fig. 2(c)–(d) to fig. 5(c)–(d)). The fields for the first layer were predominantly divided into two types: low-frequency and high-frequency, localized band-pass filters. The former mainly encoded regions of uniform intensity and color along with slowly varying texture. The latter described contours and hence sharp boundaries. Such filters permitted accurately reconstructing the input stimuli (see fig. 2(f) to fig. 5(f)).
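The Gabor fits referenced in fig. 2(a)–(b) can be illustrated concretely. The sketch below is ours and does not reproduce the authors' gradient-descent fitting; it generates an idealized Gabor patch for a given orientation and spatial frequency, and computes a circular dispersion measure for a set of fitted orientation angles, a small value of which corresponds to the small orientation-angle spread expected of a well-trained network.

```python
import numpy as np

def gabor(size, theta, freq, sigma=2.0):
    """A 2-D Gabor patch with orientation theta (radians) and spatial
    frequency freq (cycles per pixel), used as an idealized fit target."""
    r = (size - 1) / 2.0
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # coordinate along the orientation
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    return envelope * np.cos(2.0 * np.pi * freq * xr)

def orientation_spread(angles):
    """Circular dispersion of fitted orientation angles. Orientations are
    defined modulo pi, so the angles are doubled before averaging."""
    angles = np.asarray(angles, dtype=float)
    c = np.mean(np.exp(2j * angles))
    r = min(float(np.abs(c)), 1.0)               # guard against float overshoot
    return float(np.sqrt(-np.log(r)))
```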
The second-layer receptive fields were non-linear combinations of those in the first that became activated by more complicated visual patterns, such as curves and junctions. A similar division of receptive fields into two categories was encountered in the second ADPCN layer. More filters were activated by contours, however, than in the first layer. For both layers, the filters were mostly unique, which is captured in the ordered similarity plots (see fig. 2(c)–(d) to fig. 5(c)–(d)). Beyond two layers, the ADPCN receptive fields encompassed entire objects (see fig. 2(e) to fig. 5(e)). They were, however, average representations, not highly specific ones, due to the limited number of causes (see fig. 2(g) to fig. 5(g)). The backgrounds in the visual stimuli were often suppressed at this layer, which greatly enhanced performance. The ordered similarity plots indicate that none of the third-layer filters appeared to be duplicated for either dataset. This trend also held for the first- and second-layer receptive fields, implying that the ADPCNs emphasized the extraction of non-redundant features.

Figure 4: A comparison of accelerated proximal gradient inference and learning (left, blue) and proximal gradient inference and learning (right, red) on the CIFAR-10/100 datasets. The presented results are shown after training with mini batches for two epochs. See fig. 2 for descriptions of the plots.

DPCNs, in contrast, did not stabilize to viable receptive fields at the same rate as the ADPCNs (see fig. 2(c)–(d) to fig. 5(c)–(d)). For MNIST, the first-layer DPCN receptive fields had some localized band-pass structure that was similar to Gabor filters. The overall spread of the fields made it difficult to accurately detect abrupt transitions and hence recreate the input stimuli, though. The reconstructions thus were heavily distorted and blurred (see fig. 2(f)). For FMNIST, the first-layer receptive fields focused on either low-frequency details, such as constant grayscale values or slow-changing grayscale gradients, or higher-frequency details, such as periodic texture. While some of the causes became specialized band-pass-like filters, there were not enough to adequately preserve sharp edges. The stimuli reconstructions were thus also distorted, with much of the high-frequency content completely removed (see fig. 3(f)). Similar results were encountered for CIFAR-10/100 and STL-10 (see fig. 4(f) to fig. 5(f)). For all of the datasets, the second-layer receptive fields became even less organized than in the first layer. They were mostly activated by blob-like visual patterns, which did not preserve enough visual content for recreating a close resemblance of the input stimuli beyond the first layer. The DPCNs were unable to learn relevant representations in the third layer, as a consequence. The receptive fields for this network layer were unique, by virtue of being essentially random, but were largely useless in extracting stimuli-specific details (see fig. 2(e) to fig. 5(e)). This further degraded the reconstruction quality to where the inputs were unrecognizable (see fig. 2(g) to fig. 5(g)). Discrimination was adversely impacted due to this severe lack of identifying characteristics. The filter redundancy also impacted performance, as there was not enough unique information to be back-propagated to earlier layers to inform the choice of better receptive fields.

Overall, the layer-aggregated ADPCN features yielded high-performing unsupervised classifiers (see appendix B). They achieved state-of-the-art unsupervised recognition rates for each dataset. They also were on par with deep networks trained in a supervised fashion, despite having orders of magnitude fewer parameters. Although the features from all layers had a positive net contribution, those from the third layer yielded the largest performance boost. The DPCNs exhibited poor performance, in comparison. Only the first-layer features aided classification. The remainder largely worsened the recognition capabilities. For both the DPCNs and ADPCNs, we relied on a five-nearest-neighbor classifier with an unsupervised-learned distance metric [35] to label the stimuli samples.
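The five-nearest-neighbor labeling step can be sketched as follows. This is our own illustration: the paper pairs the classifier with a metric learned without supervision [35], for which plain Euclidean distance stands in here.

```python
import numpy as np

def knn_predict(train_feats, train_labels, query_feats, k=5):
    """Label each query by majority vote among its k nearest training
    features. Euclidean distance is a stand-in for the learned metric."""
    train_feats = np.asarray(train_feats, dtype=float)
    train_labels = np.asarray(train_labels)
    preds = []
    for q in np.asarray(query_feats, dtype=float):
        d = np.linalg.norm(train_feats - q, axis=1)   # distance to every sample
        nearest = train_labels[np.argsort(d)[:k]]     # labels of k closest
        values, counts = np.unique(nearest, return_counts=True)
        preds.append(values[np.argmax(counts)])       # majority vote
    return np.array(preds)
```

In the experiments, `train_feats` would hold the layer-aggregated states and causes inferred for each stimulus.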
Simulation Discussions.
Our simulations indicate that ADPCNs were more effective at uncovering highly discriminative feature representations of visual stimuli than the original DPCN inference strategy. A trait that contributed greatly to the ADPCNs' success was their significantly improved search rate.

Figure 5: A comparison of accelerated proximal gradient inference and learning (left, blue) and proximal gradient inference and learning (right, red) on the STL-10 dataset. The presented results are shown after training with mini batches for two epochs. See fig. 2 for descriptions of the plots.

As noted in the appendix, proximal-gradient-type schemes can undergo four separate search phases, some of which have different local convergence rates. In one of the phases, the constant-step regime, both the states and causes underwent rapid improvements. However, in two phases, the local convergence rate was slow whenever the largest eigenvalue of a certain recurrence matrix was less than the current inertial-sequence magnitude. For linear inertial sequences, like those found in the proximal-gradient-based DPCNs, this condition occurred early during the optimization process. That is, for such sequences, the growth was initially very rapid before settling to a logarithmic rate. Within just a few iterations, the sequence magnitude had exceeded the eigenvalue, which preempted the fast constant-step regime. The rate of convergence became worse than sublinear. A large number of search steps was thus needed to move toward the global solution. However, when this happened, the total number of proximal-gradient iterations had been reached and the search was terminated. The search was, alternatively, stopped early due to a lack of progress across consecutive iterations. In either case, the states and causes did not adequately stabilize for a given mini batch.
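The contrast between the inertial-sequence growth rates can be made concrete. In this sketch, which is ours rather than the paper's exact sequences, the classic FISTA coefficients climb toward one quickly, while a Chambolle–Dossal-style sequence [23] with a large offset `a` stays muted for longer:

```python
import numpy as np

def fista_momentum(n):
    """Classic FISTA inertial coefficients (t_k - 1)/t_{k+1}, which climb
    toward one within a few tens of iterations."""
    t, coeffs = 1.0, []
    for _ in range(n):
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        coeffs.append((t - 1.0) / t_next)
        t = t_next
    return np.array(coeffs)

def chambolle_dossal_momentum(n, a=20.0):
    """Inertial coefficients (k - 1)/(k + a - 1); a larger offset a keeps
    the sequence magnitude muted for longer."""
    k = np.arange(1, n + 1, dtype=float)
    return (k - 1.0) / (k + a - 1.0)
```

Under the eigenvalue condition described above, the muted sequence is the one less likely to preempt the fast constant-step regime.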
The poor state and cause representations, naturally, were integrated into the filter dictionary matrices and invariance matrices during the learning updates, which disrupted the priors in the early hierarchy for future stimuli. The convergence was further stymied by cost rippling; proximal-gradient-based optimization thus did not behave like a pure descent method in two out of the four phases. Such behavior was caused by the eigenvalues of another recurrence matrix being a pair of complex conjugates, which necessitated oscillating between the two. All of these factors made it difficult to propagate meaningful bottom-up information [36] beyond the first layer. The top-down details from higher layers were hence ineffective at modifying the priors to disambiguate the stimuli [37].

The ADPCNs largely avoided these issues. After the constant-step regime, the search switched to one of two potentially slower phases. However, the ADPCNs' inertial-sequence growth rate was rather muted, as opposed to that of the DPCNs. The chance of exceeding the largest eigenvalue of the augmented, auxiliary-variable recurrence matrix was low for the ADPCNs. The search thus could proceed unhindered toward the global solution. Moreover, since the eigenvalue-magnitude threshold was often not reached, the largest eigenvalue of the augmented, mapping recurrence matrix was real-valued, not complex. This meant that the accelerated proximal gradients would behave like a descent method; they thus would not experience localized cost rippling due to alternating between conjugate pairs. Both properties promoted rapid stabilization of the states and causes, which expedited the formation of beneficial priors throughout much of the early-layer network inference. These priors facilitated the construction of transformation-insensitive feature abstractions in deeper network layers, due to the bottom-up forwarding of stimuli-relevant signals. The recurrent, top-down connections, in turn, suitably altered the representation sparsity, thereby ignoring extraneous details that did not contribute greatly to the reconstruction quality. This biased the inference in favor of valuable stimuli, which is analogous to observed functionality in the frontal and parietal cortices [38]. The descending pathways also carried predictive responses, in the form of templates of expected stimuli that would be matched to the current and future mini batches to aid recognition [39]. That is, the templates conveyed contextual information for extracting task-specific information, which made recognition more reliable [40, 41].

There were other traits that contributed to the success of the ADPCNs. For instance, the first-layer receptive fields were largely similar to those of simple cells in the primary visual cortex. Simple cells often implement Gabor-like filters with generic preferred stimuli that correspond to oriented edges [42, 43]. Such filters were highly useful within the ADPCNs, since relations between activations for specific spatial locations tended to be distinctive between objects in visual stimuli. Activations were also obtained, in a Gabor space, that facilitated the construction of naturally sparse representations and could then be hierarchically extended, which was what we ultimately sought in the ADPCNs. Changes in object location, scale, and orientation could be reliably detected, within this Gabor space, thereby aiding in the creation of transformation-invariant features that also permitted near-perfect stimuli reconstructions at the first layer.
The uniqueness of the band-pass filters was beneficial, as it permitted the ADPCNs to fixate on non-redundant stimuli characteristics. How these filters arose and the rate at which they formed were also important aspects that aided in the ADPCNs' success. They emerged due to feed-back from high-level visual network areas, which quickly reduced activity in lower areas to simplify the description of the stimuli to some of its most basic elements [44]. In doing so, alternate explanations were suppressed, and only the most dominant, fundamental causes of the stimuli remained [3], which were oriented luminance contours. This rapid stabilization of filters was exactly the functionality that we would expect to encounter in an efficient predictive coding process. It was hence well aligned with contemporary neurophysiological theory [45].

Beyond the first layer, the ADPCNs exhibited an increase in specificity, abstraction, and invariance for deeper hierarchies [46], which aligned well with the current understanding of the vision system [47]. This also aided in the ADPCNs' success. The second-layer receptive fields, in the case of MNIST, became sensitive to curved sub-strokes of the hand-written digits, which is similar to prestriate cortex functionality [48]. For FMNIST, they emphasized abrupt transitions found in regions of the apparel and fashion accessories. This mimics aspects of shape-pattern-selectivity behaviors found in the extrastriate cortices [49, 50]. For STL-10, the receptive fields were often elongated Gabor-like filters, which can be found in the prestriate cortex [51]. All of these features systematically focused on object-relevant visual details, thereby aiding recognition. At the deepest layer, the receptive fields were entirely object specific, which is functionality somewhat akin to that of neurons in the primate inferotemporal cortex [52]. It is also related to memory activity [53, 54].
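The whole-object receptive fields described above are obtained by projecting deep causes back to the input space. The following is a deliberately linearized sketch of that back-projection, our simplification in which each layer is represented only by its dictionary matrix and nonlinearities, pooling, and the invariance matrices are ignored:

```python
import numpy as np

def back_project(dictionaries, cause):
    """Map a deep cause back to input space by chaining the per-layer
    dictionary matrices D1, D2, ...; the result is viewed as that cause's
    receptive field. Nonlinearities and pooling are ignored here."""
    field = np.asarray(cause, dtype=float)
    for D in reversed(dictionaries):   # apply the deepest dictionary first
        field = D @ field
    return field
```

A unit cause at layer three, under this simplification, yields a receptive field that is a linear composition of the earlier-layer filters, which is why deeper fields span larger, object-scale regions of the input.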
The representations were additionally translation and rotation insensitive, similar to inferotemporal cortex neurons [55–57], and changed little with respect to scale and spatial-frequency changes, similar to neurons in the middle temporal area [58, 59]. To our knowledge, such invariant, whole-object sensitivity within a single layer has not been witnessed in any existing predictive-coding model. Based on contemporary theories [60], we believe that the receptive-field feed-back from the final layer contributed to the effective connectivity in the earlier network layers [61]. That is, the role of this final layer was to disambiguate local image components by creating a template that was fed back, which then selectively enhanced object components and suppressed interfering background components [62]. This biologically plausible behavior was crucial for achieving high discrimination rates on CIFAR-10/100 and STL-10, since the objects of interest were scattered amongst cluttered scenes.
4. Conclusions
Here, we have revisited the problem of unsupervised predictive coding. We considered a hierarchical, generative network, the ADPCN, for reconstructing observed stimuli. This network was composed of temporal sparse-coding models with bi-directional connectivity. The interaction of the information passed by top-down and bottom-up connections across the models permitted the extraction of invariant, increasingly abstract feature representations.

Our contribution in this paper was an effective means of inferring the underlying components of the feature representations, which are the sparse states and causes. Previously, a proximal-gradient-type approach was used for this purpose. Despite its promising theoretical guarantees, though, it exhibited poor empirical performance. This practically limited the number of layered models that could be considered in the DPCNs. It also extensively curtailed the quality of the stimuli features that could be extracted. Here, we considered a parallelizable, vastly accelerated proximal-gradient strategy that overcame these issues. It allowed us to go beyond the existing two-layer limitation, facilitating the construction of arbitrary-layered DPCNs. Each layer led to increasingly enhanced performance for object analysis, as the information from higher layers was propagated to earlier ones to form more effective stimuli priors for bottom-up processing. Most crucially, our optimization strategy immensely streamlined inference.
Often, only one or two presentations of the stimuli were necessary to reach a stable filter dictionary, and hence a corresponding set of sparse states and sparse causes, with good object analysis performance. The previous optimization approach required many times more presentations before a stable set of filters was uncovered. The resulting features were also not nearly as discriminative as those from our proposed strategy.

We applied our ADPCNs to static-image datasets. For MNIST and FMNIST, the ADPCNs learned initial-layer receptive fields that mimicked aspects of the early stages of the primate vision system. The later network stages implemented receptive fields that encompassed entire objects. In the case of MNIST, the back-projected filters became pseudo averages of the hand-written numerical digits. Predominant writing styles were modeled well. For FMNIST, the back-projected filters resembled the various types of clothing and personal articles. General styles and some nuances were captured. To our knowledge, this is the first time that such object-scale receptive fields have been learned for predictive coding. This behavior helped yield unsupervised classifiers that achieved state-of-the-art performance. Such classifiers also outperformed supervised-trained deep networks, which lent credence to the complicated feature inference and invariance process that we employed. Similar results were witnessed for more complex natural-image datasets, such as CIFAR-10/100. The later-layer receptive fields again encompassed entire object categories. This yielded promising features that achieved state-of-the-art generalization performance compared to supervised-trained deep networks. This was despite our use of simple nearest-neighbor classifiers. It was also despite the fact that our ADPCNs had many times fewer convolutional filters than these other deep networks.

Our ADPCNs are readily applicable to video processing, much like the original DPCNs. This is because the ADPCNs implement a recurrent state-space model. For video modalities, the ADPCN performance may be even better than for static images, as the top-down, feed-back connections impose temporal constraints on the learning process. We will investigate this in our future research.
References

[1] R. P. N. Rao and D. H. Ballard, "Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects," Nature Neuroscience, vol. 2, no. 1, pp. 79–87, 1999. Available: http://dx.doi.org/10.1038/4580
[2] K. J. Friston and S. Kiebel, "Predictive coding under the free energy principle," Philosophical Transactions of the Royal Society B, vol. 364, no. 1521, pp. 1211–1221, 2009. Available: http://dx.doi.org/10.1098/rstb.2008.0300
[3] M. W. Spratling, "Unsupervised learning of generative and discriminative weights encoding elementary image components in a predictive coding model of cortical functions," Neural Computation, vol. 24, no. 1, pp. 60–103, 2011. Available: http://dx.doi.org/10.1162/NECO_a_00222
[4] K. J. Friston, "Hierarchical models of the brain," PLOS Computational Biology, vol. 4, no. 11, pp. e1000211(1–24), 2008. Available: http://dx.doi.org/10.1371/journal.pcbi.1000211
[5] T. Hosoya, S. A. Baccus, and M. Meister, "Dynamic predictive coding by the retina," Nature, vol. 436, no. 7047, pp. 71–77, 2005. Available: http://dx.doi.org/10.1038/nature03689
[6] R. Auksztulewicz and K. J. Friston, "Repetition suppression and its contextual determinants in predictive coding," Cortex, vol. 80, no. 1, pp. 125–140, 2016. Available: http://dx.doi.org/10.1016/j.cortex.2015.11.024
[7] M. V. Srinivasan, S. B. Laughlin, and A. Dubs, "Predictive coding: A fresh view of inhibition in the retina," Proceedings of the Royal Society B: Biological Sciences, vol. 216, no. 1205, pp. 427–459, 1982. Available: http://dx.doi.org/10.1098/rspb.1982.0085
[8] J. F. M. Jehee, C. Rothkopf, J. M. Beck, and D. H. Ballard, "Learning receptive fields using predictive feedback," Journal of Physiology, vol. 100, no. 1-3, pp. 125–132, 2006. Available: http://dx.doi.org/10.1016/j.jphysparis.2006.09.011
[9] R. Chalasani and J. C. Príncipe, "Deep predictive coding networks," in Proceedings of the International Conference on Learning Representations (ICLR), Scottsdale, AZ, USA, May 2-4 2013, pp. 1–13. Available: https://arxiv.org/abs/1301.3541
[10] J. C. Príncipe and R. Chalasani, "Cognitive architectures for sensory processing," Proceedings of the IEEE, vol. 102, no. 4, pp. 514–525, 2014. Available: http://dx.doi.org/10.1109/JPROC.2014.2307023
[11] R. Chalasani and J. C. Príncipe, "Context dependent encoding using convolutional dynamic networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 9, pp. 1992–2004, 2015. Available: http://dx.doi.org/10.1109/TNNLS.2014.2360060
[12] W. Lotter, G. Kreiman, and D. Cox, "Deep predictive coding networks for video prediction and unsupervised learning," in Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, April 24-26 2017, pp. 1–13. Available: https://arxiv.org/abs/1605.08104
[13] K. Han, H. Wen, Y. Zhang, D. Fu, E. Culurciello, and Z. Liu, "Deep predictive coding network with locally recurrent processing for object recognition," in Advances in Neural Information Processing Systems (NIPS), S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds. Red Hook, NY, USA: Curran Associates, 2018, pp. 9201–9213.
[14] K. J. Friston, J. Kilner, and L. Harrison, "A free energy principle for the brain," Journal of Physiology, vol. 100, no. 1-3, pp. 70–87, 2006. Available: http://dx.doi.org/10.1016/j.jphysparis.2006.10.001
[15] P. Földiák, "Learning invariance from transformation sequences," Neural Computation, vol. 3, no. 2, pp. 194–200, 1991. Available: http://dx.doi.org/10.1162/neco.1991.3.2.194
[16] A. Hyvärinen and P. O. Hoyer, "A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images," Vision Research, vol. 41, no. 18, pp. 2413–2423, 2001. Available: http://dx.doi.org/10.1016/S0042-6989(01)00114-6
[17] R. Baker, M. Dexter, T. E. Hardwicke, A. Goldstone, and Z. Kourtzi, "Learning to predict: Exposure to temporal sequences facilitates prediction of future events," Vision Research, vol. 99, no. 1, pp. 124–133, 2014. Available: http://dx.doi.org/10.1016/j.visres.2013.10.017
[18] P. Perruchet and S. Pacton, "Implicit learning and statistical learning: One phenomenon, two approaches," Trends in Cognitive Sciences, vol. 10, no. 5, pp. 233–238, 2006. Available: http://dx.doi.org/10.1016/j.tics.2006.03.006
[19] R. N. Aslin and E. L. Newport, "Statistical learning: From acquiring specific items to forming general rules," Current Directions in Psychological Science, vol. 21, no. 3, pp. 170–176, 2012. Available: http://dx.doi.org/10.1177/0963721412436806
[20] T. F. Brady and M. M. Chun, "Spatial constraints on learning in visual search: Modeling and contextual cuing," Journal of Experimental Psychology: Human Perception and Performance, vol. 33, no. 4, pp. 798–815, 2007. Available: http://dx.doi.org/10.1037/0096-1523.33.4.798
[21] O. Güler, "New proximal point algorithms for convex minimization," SIAM Journal on Optimization, vol. 2, no. 3, pp. 649–664, 1992. Available: http://dx.doi.org/10.1137/0802032
[22] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009. Available: http://dx.doi.org/10.1137/080716542
[23] A. Chambolle and C. Dossal, "On the convergence of the iterates of the 'fast iterative shrinkage/thresholding algorithm'," Journal of Optimization Theory and Applications, vol. 166, no. 1, pp. 968–982, 2015. Available: http://dx.doi.org/10.1007/s10957-015-0746-4
[24] H. Attouch and J. Peypouquet, "The rate of convergence of Nesterov's accelerated forward-backward method is actually faster than 1/k^2," SIAM Journal on Optimization, vol. 26, no. 3, pp. 1824–1834, 2016. Available: http://dx.doi.org/10.1137/15M1046095
[25] E. Santana, M. S. Emigh, P. Zegers, and J. C. Príncipe, "Exploiting spatio-temporal structure with recurrent winner-take-all networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 8, pp. 3738–3746, 2018. Available: http://dx.doi.org/10.1109/TNNLS.2017.2735903
[26] J. V. Stone, "Learning perceptually salient visual patterns using spatiotemporal smoothness constraints," Neural Computation, vol. 8, no. 7, pp. 1463–1492, 1996. Available: http://dx.doi.org/10.1162/neco.1996.8.7.1463
[27] H. E. Schendan and C. E. Stern, "Where vision meets memory: Prefrontal-posterior networks for visual object constancy during categorization and memory," Cerebral Cortex, vol. 18, no. 7, pp. 1695–1711, 2008. Available: http://dx.doi.org/10.1093/cercor/bhm197
[28] J. Sulam, V. Papyan, Y. Romano, and M. Elad, "Multilayer convolutional sparse modeling: Pursuit and dictionary learning," IEEE Transactions on Signal Processing, vol. 66, no. 15, pp. 4090–4104, 2018. Available: http://dx.doi.org/10.1109/TSP.2018.2846226
[29] B. A. Olshausen and D. J. Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, vol. 381, no. 6583, pp. 607–609, 1996. Available: http://doi.org/10.1038/381607a0
[30] Y. Karklin and M. S. Lewicki, "A hierarchical Bayesian model for learning nonlinear statistical regularities in nonstationary natural signals," Neural Computation, vol. 17, no. 2, pp. 397–423, 2005. Available: http://dx.doi.org/10.1162/0899766053011474
[31] M. Ito and H. Komatsu, "Representation of angles embedded within contour stimuli in area V2 of macaque monkeys," Journal of Neuroscience, vol. 24, no. 13, pp. 3313–3324, 2004. Available: http://dx.doi.org/10.1523/jneurosci.4364-03.2004
[32] J. Hurri and A. Hyvärinen, "Temporal coherence, natural image sequences, and the visual cortex," in Advances in Neural Information Processing Systems (NIPS), S. Becker, S. Thrun, and K. Obermayer, Eds. Cambridge, MA, USA: MIT Press, 2003, pp. 157–164.
[33] D. P. Kingma and J. Ba, "ADAM: A method for stochastic optimization," in Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, May 7-9 2015, pp. 1–15. Available: https://arxiv.org/abs/1412.6980
[34] M. Hardt, B. Recht, and Y. Singer, "Train faster, generalize better: Stability of stochastic gradient descent," in Proceedings of the International Conference on Machine Learning (ICML), New York, NY, USA, June 19-24 2016, pp. 1225–1234.
[35] O. Sener, H. O. Song, A. Saxena, and S. Savarese, "Learning transferrable representations for unsupervised domain adaptation," in Advances in Neural Information Processing Systems (NIPS), T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds. Red Hook, NY, USA: Curran Associates, 2016, pp. 2110–2118.
[36] A. Mechelli, C. J. Price, K. J. Friston, and A. Ishai, "Where bottom-up meets top-down: Neuronal interactions during perception and imagery," Cerebral Cortex, vol. 14, no. 11, pp. 1256–1265, 2004. Available: http://dx.doi.org/10.1093/cercor/bhh087
[37] M. Bar, "Visual objects in context," Nature Reviews Neuroscience, vol. 5, no. 8, pp. 617–629, 2004. Available: http://dx.doi.org/10.1038/nrn1476
[38] J. T. Serences, "Value-based modulations in human visual cortex," Neuron, vol. 60, no. 6, pp. 1169–1181, 2008. Available: http://dx.doi.org/10.1016/j.neuron.2008.10.051
[39] S. Ullman, "Sequence seeking and counter streams: A computational model for bidirectional information flow in the visual cortex," Cerebral Cortex, vol. 5, no. 1, pp. 1–11, 1995. Available: http://dx.doi.org/10.1093/cercor/5.1.1
[40] M. C. Potter, "Meaning in visual search," Science, vol. 187, no. 4180, pp. 965–966, 1975. Available: http://dx.doi.org/10.1126/science.1145183
[41] S. E. Palmer, "The effects of contextual scenes on the identification of objects," Memory and Cognition, vol. 3, no. 1, pp. 519–526, 1975. Available: http://dx.doi.org/10.3758/BF03197524
[42] D. H. Hubel and T. N. Wiesel, "Receptive fields and functional architecture in two nonstriate visual areas (18 and 19) of the cat," Journal of Neurophysiology, vol. 28, no. 2, pp. 229–289, 1965. Available: http://dx.doi.org/10.1152/jn.1965.28.2.229
[43] ——, "Receptive fields and functional architecture of monkey striate cortex," Journal of Physiology, vol. 195, no. 1, pp. 215–243, 1968. Available: http://dx.doi.org/10.1113/jphysiol.1968.sp008455
[44] S. O. Murray, P. Schrater, and D. Kersten, "Perceptual grouping and the interactions between visual cortical areas," Neural Networks, vol. 17, no. 5-6, pp. 695–705, 2004. Available: http://dx.doi.org/10.1016/j.neunet.2004.03.010
[45] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1?" Vision Research, vol. 37, no. 23, pp. 3311–3325, 1997. Available: http://doi.org/10.1016/S0042-6989(97)00169-7
[46] M. Riesenhuber and T. Poggio, "Hierarchical models of object recognition in the cortex," Nature Neuroscience, vol. 2, no. 11, pp. 1019–1025, 1999. Available: http://dx.doi.org/10.1038/14819
[47] B. S. Tjan, V. Lestou, and Z. Kourtzi, "Uncertainty and invariance in the human visual cortex," Journal of Neurophysiology, vol. 96, no. 3, pp. 1556–1568, 2005. Available: http://dx.doi.org/10.1152/jn.01367.2005
[48] J. Hegdé and D. C. van Essen, "Selectivity for complex shapes in primate visual area V2," Journal of Neuroscience, vol. 20, no. 5, pp. 1–6, 2000. Available: http://dx.doi.org/10.1523/jneurosci.20-05-j0001.2000
[49] A. Pasupathy and C. E. Connor, "Responses to contour features in macaque area V4," Journal of Neurophysiology, vol. 82, no. 5, pp. 2490–2502, 1999. Available: http://dx.doi.org/10.1152/jn.1999.82.5.2490
[50] ——, "Population coding of shape in area V4," Nature Neuroscience, vol. 5, no. 12, pp. 1332–1338, 2002. Available: http://dx.doi.org/10.1038/nn972
[51] L. Liu, L. She, M. Chen, T. Liu, H. D. Lu, Y. Dan, and M. Poo, "Spatial structure of neuronal receptive field in awake monkey secondary visual cortex (V2),"
Nature Neuroscience , vol. 5, no. 12, pp. 1332–1338, 2002.Available: http://dx.doi.org/10.1038/972[51] L. Liu, L. She, M. Chen, T. Liu, H. D. Lu, Y. Dan, and M. Poo, “Spatial structure of neuronal receptive field inawake monkey secondary visual cortex (V2),”
Proceedings of the National Academy of Sciences , vol. 113, no. 7,pp. 1913–1918, 2016, (accepted, in press). Available: http://dx.doi.org/10.1073/pnas.1525505113[52] C. Bruce, R. Desimone, and C. G. Gross, “Visual properties of neurons in a polysensory area in superior temporalsulcus of the macaque,”
Journal of Neurophysiology , vol. 46, no. 2, pp. 369–384, 1981. Available:http://dx.doi.org/10.1152/jn.1981.46.2.369[53] Y. Miyashita and H. S. Chang, “Neuronal correlate of pictorial short-term memory in the primate temporalcortex,”
Nature , vol. 331, no. 6151, pp. 68–70, 1988. Available: http://dx.doi.org/10.1038/331068a0[54] V. Yakovlev, S. Fusi, E. Berman, and E. Zohary, “Inter-trial neuronal activity in inferior temporal cortex: Aputative vehicle to generate long-term visual associations,”
Nature Neuroscience , vol. 1, no. 4, pp. 310–317, 1998.Available: http://dx.doi.org/10.1038/1131[55] E. T. Rolls, “Neurophysiological mechanisms underlying face processing within and beyond the temporal corticalvisual areas,”
Philosophical Transactions of the Royal Society of London , vol. 335, no. 1273, pp. 11–20, 1992.Available: http://dx.doi.org/10.1098/rstb.1992.0002[56] M. J. Tovee, E. T. Rolls, and P. Azzopardi, “Translation invariance in the responses to faces of single neurons inthe temporal visual cortical areas of the alert macaque,”
Journal of Neurophysiology , vol. 72, no. 3, pp.1049–1060, 1994. Available: http://dx.doi.org/10.1152/jn.1994.72.3.1049[57] E. Salinas and L. F. Abbott, “Invariant visual responses from attention gain fields,”
Journal of Neurophysiology ,vol. 77, no. 6, pp. 3267–3272, 1997. Available: http://dx.doi.org/10.1152/jn.1997.77.6.3267[58] E. T. Rolls, G. C. Baylis, and C. M. Leonard, “Role of low and high spatial frequencies in the face-selectiveresponses of neurons in the cortex in the superior temporal sulcus in the monkey,”
Vision Research , vol. 25, no. 8,pp. 1021–1035, 1985. Available: http://dx.doi.org/10.1016/0042-6989(85)90091-4[59] L. Liu, J. A. Bourne, and M. G. P. Rosa, “Spatial and temporal frequency selectivity of neurons in the middletemporal visual area of new world monkeys (
Callithrix jacchus ),”
European Journal of Neuroscience , vol. 25,no. 6, pp. 1780–1792, 2007. Available: http://dx.doi.org/10.1111/j.1460-9568.2007.05453.x[60] H. Liang, X. Gong, M. Chen, Y. Yan, W. Li, and C. D. Gilbert, “Interactions between feedback and lateralconnections in the primary visual cortex,”
Proceedings of the National Academy of Sciences , vol. 114, no. 32, pp.8637–8642, 2017. Available: http://dx.doi.org/10.1073/pnas.1706183114[61] N. Ramalingam, J. N. J. McManus, W. Li, and C. D. Gilbert, “Top-down modulation of lateral interactions invisual cortex,”
Journal of Neuroscience , vol. 33, no. 5, pp. 1773–1789, 2013. Available:http://dx.doi.org/10.1523/jneurosci.3825-12.2013[62] B. Epshtein, I. Lifshitz, and S. Ullman, “Image interpretation by a single bottom-up top-down cycle,”
Proceedingsof the National Academy of Sciences , vol. 105, no. 38, pp. 14 298–14 303, 2008. Available:http://dx.doi.org/10.1073/pnas.0800968105
UBMITTING TO IEEE TNNLS (SHORT PAPER) DISTRIBUTION A: UNLIMITED 14
Appendix A
Below, we outline the ADPCN training and inference process.
Algorithm 1:
Accelerated Deep Prediction Network (ADPCN) Training and Inference

Inputs: an initial dictionary matrix $D_i^\top \in \mathbb{R}_+^{k_i \times k_{i-1}}$, state-transition matrix $C_i \in \mathbb{R}^{k_i \times k_i}$, and invariant matrix $G_i \in \mathbb{R}^{d_i \times k_i}$ for each network layer; a set of time-varying stimuli $Y_t$, with $y_t \in Y_t$, $y_t \in \mathbb{R}^p$; and a set of initial states $\gamma_{i,0} \in \mathbb{R}_+^{k_i}$ and causes $\kappa_{i,0} \in \mathbb{R}^{d_i}$.

for $t = 0, 1, 2, \ldots$ do
  Initialize the bottom-up cause for the first layer as $\kappa_{0,t} = y_t$.
  for $i = 0, 1, 2, \ldots$ do
    For all layers but the last, initialize the most likely top-down causes, $\kappa_{i-1,t} \in \mathbb{R}^{d_{i-1}}$, at each iteration from the previous states $\gamma_{i,t} \in \mathbb{R}^{k_i}$ and the causes $\kappa_{i,t} \in \mathbb{R}^{d_i}$,
    $$\kappa_{i-1,t} = D_i^\top \gamma_{i,t}, \qquad \gamma_{i,t} = \arg\min_{\gamma_{i,t}} \Big( \lambda_i \|\gamma_{i,t+1} - C_i \gamma_{i,t}\|_1 + \alpha_i \|\gamma_{i,t}\|_1 \exp\!\big({-\mathrm{UNPOOL}(G_i \kappa_{i,t})}\big) \Big).$$
    This minimization problem has an algebraic expression for the global solution: $[\gamma_{i,t}]_k = [C_i \gamma_{i,t-1}]_k$ whenever $[\lambda_{i,t}]_k < \alpha_i$, and zero otherwise. For the last layer, $\kappa_{i,t+1} = \kappa_{i,t}$.
  for $i = 0, 1, 2, \ldots$ do
    Let $\beta_m \in \mathbb{R}_+$ be an inertial sequence, $\beta_m = (k_m - 1)/k_{m+1}$, where $k_m = 1 + m^{r-1}/d$ with $r, d \in \mathbb{R}_+$. Given an adjustable step size $\tau_{i,t}^m \in \mathbb{R}_+$, update the states using proximal-gradient steps, indexed by $m$, until either convergence or a pre-set number of iterations has been reached,
    $$\gamma_{i,t+1}^m = \mathrm{PROX}_{\lambda_{i,t}}\!\Big(\pi_{i,t}^m - \lambda_{i,t}\tau_{i,t}^m \big(D_i^\top(\kappa_{i-1,t} - D_i \pi_{i,t}^m) + \alpha_i \Omega_i(\pi_{i,t}^m)\big)\Big),$$
    where $\pi_{i,t}^{m+1} = \gamma_{i,t+1}^m + \beta_m(\gamma_{i,t+1}^m - \gamma_{i,t+1}^{m-1})$. The term $\Omega_i(\pi_{i,t}^m)$ quantifies the contribution of the non-smooth state transition. Use Nesterov smoothing, with $\mu_i \in \mathbb{R}_+$, to approximate it,
    $$\Omega_i(\pi_{i,t}) = \arg\max_{\|\Omega_{i,t}\|_\infty \le 1} \Omega_{i,t}^\top(\pi_{i,t}^m - C_i \gamma_{i,t-1}) - \mu_i \|\Omega_{i,t}\|_2^2/2.$$
    Max-pool the states using non-overlapping windows, $\gamma_{i,t+1} = \mathrm{POOL}(\gamma_{i,t+1})$.
    With the same inertial sequence $\beta_m$ and an adjustable step size $\widehat{\tau}_{i,t}^m \in \mathbb{R}_+$, update the causes using proximal-gradient steps until either convergence or a pre-set number of iterations has been reached,
    $$\kappa_{i,t+1}^m = \mathrm{PROX}_{\lambda_i}\!\Big(\widehat{\pi}_{i,t}^m - \lambda_i \widehat{\tau}_{i,t}^m\big(2\eta_i(\kappa_{i,t+1}^m - \kappa_{i,t}) - \alpha_i G_i^\top \exp(-G_i \widehat{\pi}_{i,t}^m) \odot |\gamma_{i,t+1}|\big)\Big),$$
    where $\widehat{\pi}_{i,t}^{m+1} = \kappa_{i,t+1}^m + \beta_m(\kappa_{i,t+1}^m - \kappa_{i,t+1}^{m-1})$. Update the sparsity parameter using spatial max unpooling after the cause update has concluded, $\lambda_{i,t+1} = \alpha_i(1 + \exp(-\mathrm{UNPOOL}(G_i \kappa_{i,t+1})))$.
  for $i = 0, 1, 2, \ldots$ do
    Update the filter dictionary matrix $D_i^\top \in \mathbb{R}_+^{k_i \times k_{i-1}}$ and the state-transition matrix $C_i \in \mathbb{R}^{k_i \times k_i}$ independently, until either convergence or a pre-set number of iterations has been reached, via dual-estimation filtering, with steps indexed by $m$,
    $$D_i^{m+1,\top} = D_i^{m,\top} + \sigma_t + \psi_i^m\big(\gamma_{i-1,t+1} - D_i^{m,\top}\gamma_{i,t+1}\big)\gamma_{i,t+1}^\top + \theta_i^m\big(D_i^m - D_i^{m-1}\big),$$
    $$C_i^{m+1} = C_i^m + \sigma_t + \widehat{\psi}_i^m \,\mathrm{SIGN}\big(\gamma_{i,t+1} - C_i^m \gamma_{i,t}\big)\gamma_{i,t}^\top + \theta_i^m\big(C_i^m - C_i^{m-1}\big),$$
    where $\psi_i^m, \widehat{\psi}_i^m \in \mathbb{R}_+$ are step sizes, $\theta_i^m \in \mathbb{R}_+$ is a momentum coefficient, and $\sigma_t \in \mathbb{R}$ is Gaussian transition noise over the parameters. Normalize $D_i^{m+1,\top}$ to avoid returning a trivial solution.
    Update the causal invariance matrix $G_i \in \mathbb{R}^{d_i \times k_i}$ via dual-estimation filtering, with steps indexed by $m$,
    $$G_i^{m+1} = G_i^m + \sigma_t + \widetilde{\psi}_i^m\big(\exp(-G_i^m \kappa_{i,t+1}) \odot |\gamma_{i,t+1}|\big)\kappa_{i,t+1}^\top + \theta_i^m\big(G_i^m - G_i^{m-1}\big),$$
    where $\widetilde{\psi}_i^m \in \mathbb{R}_+$ is a step size, $\theta_i^m \in \mathbb{R}_+$ is a momentum coefficient, and $\sigma_t \in \mathbb{R}$ is Gaussian-distributed transition noise over the parameters. Normalize $G_i^{m+1}$ to avoid returning a trivial solution.

The convergence of dual-estimation filtering is straightforward to demonstrate. For the proximal-gradient inference process, it is much more involved, and we build up to it in what follows. We first prove a weak convergence result that facilitates demonstrating a much stronger one when relying on properties of Cauchy sequences. We then quantify the global convergence rate for our chosen inertial sequence and compare it to that of a Nesterov-style inertial sequence, which was used in the original DPCN, to illustrate the advantages of the former for ADPCNs. Lastly, we outline local convergence properties of both inertial sequences to explain the results presented for DPCNs and ADPCNs in the main part of the paper.

Toward this end, it is important to show that the distance between a given iterate and the solution set for either inference cost can be bounded by the norm of the proximal residual. This occurs whenever the norm of the residual is small and the iterate is sufficiently close to the solution set.
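Before turning to the analysis, the accelerated proximal-gradient state update of Algorithm 1 can be made concrete with a minimal numerical sketch. It keeps only the reconstruction term of the state-inference cost (the state-transition and cause-modulated sparsity terms are omitted for clarity), and the function and variable names (`infer_states`, `soft_threshold`) are ours, not from the ADPCN implementation.

```python
import numpy as np

def soft_threshold(x, theta):
    """Elementwise soft-thresholding: the proximal operator of theta*||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def infer_states(D, kappa_prev, gamma0, lam=0.1, tau=None, r=3.0, d=1.0, iters=200):
    """Accelerated proximal-gradient inference sketch for one layer's states.

    Minimizes 0.5*||kappa_prev - D @ gamma||^2 + lam*||gamma||_1 with the
    polynomial inertial sequence beta_m = (k_m - 1)/k_{m+1}, k_m = 1 + m**(r-1)/d.
    """
    if tau is None:
        # Step size bounded by the inverse Lipschitz constant of the smooth term.
        tau = 1.0 / np.linalg.norm(D, 2) ** 2
    k = lambda m: 1.0 + m ** (r - 1.0) / d
    gamma, gamma_old = gamma0.copy(), gamma0.copy()
    for m in range(1, iters + 1):
        beta = (k(m) - 1.0) / k(m + 1)
        pi = gamma + beta * (gamma - gamma_old)      # inertial (auxiliary) iterate
        grad = -D.T @ (kappa_prev - D @ pi)          # gradient of the smooth term
        gamma_old, gamma = gamma, soft_threshold(pi - tau * grad, lam * tau)
    return gamma
```

With a random dictionary and a sparse ground truth, the iterates drive the composite objective well below its value at the zero initialization.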
Proposition A.1.
Let $\gamma_{i,t} \in \mathbb{R}_+^{k_i}$ be the hidden states and $\kappa_{i,t} \in \mathbb{R}^{d_i}$ be the hidden causes. Let $\pi_{i,t} \in \mathbb{R}_+^{k_i}$ be the auxiliary states and $\widehat{\pi}_{i,t} \in \mathbb{R}^{d_i}$ be the auxiliary causes at layer $i$ and time $t$. Let $\omega_{1,i} \in \mathbb{R}$ satisfy $\omega_{1,i} \ge \mathcal{L}(\gamma_i^*, \kappa_{i,t}, C_i, D_i^\top; \alpha_i, \lambda_i)$; as well, let $\omega_{2,i} \in \mathbb{R}$ satisfy $\omega_{2,i} \ge \mathcal{L}(\gamma_{i,t}, \kappa_i^*, G_i; \alpha_i, \lambda_i, \eta_i, \lambda_{i,t})$. For some $\omega_{1,i}$, there are $\epsilon_{1,i}, \epsilon_{2,i} \in \mathbb{R}_+$ such that, for step size $\tau_{i,t}^m \in \mathbb{R}_+$,
$$\mathrm{dist}(\gamma_{i,t+1}^m, \Gamma_i^*) \le \epsilon_{1,i}\Big\| \mathrm{PROX}_{1/\ell_{1,i}}\!\big(\pi_{i,t}^m - \lambda_{i,t}\tau_{i,t}^m \nabla_{\pi_{i,t}^m}\mathcal{L}(\pi_{i,t}^m, \kappa_{i,t}, C_i, D_i^\top; \alpha_i, \lambda_i)\big) - \pi_{i,t}^m \Big\|_2$$
whenever the conditions $\|\mathrm{PROX}_{\lambda_{i,t}}(\pi_{i,t}^m - \lambda_{i,t}\tau_{i,t}^m \nabla_{\pi_{i,t}}\mathcal{L}(\pi_{i,t}^m, \kappa_{i,t}, C_i, D_i^\top; \alpha_i, \lambda_{i,t})) - \gamma_{i,t}^m\|_2 < \epsilon_{2,i}$ and $\omega_{1,i} \le \mathcal{L}(\gamma_{i,t}^m, \kappa_{i,t}, C_i, D_i^\top; \alpha_i, \lambda_i)$ are satisfied. Here, $\mathrm{dist}(\gamma_{i,t+1}^m, \Gamma_i^*)$ denotes the distance of a given iterate to the solution set. Likewise, for some $\omega_{2,i}$, there are $\epsilon_{3,i}, \epsilon_{4,i} \in \mathbb{R}_+$ such that, for step sizes $\widehat{\tau}_{i,t}^m \in \mathbb{R}_+$,
$$\mathrm{dist}(\kappa_{i,t}^m, K_i^*) \le \epsilon_{3,i}\Big\| \mathrm{PROX}_{1/\ell_{2,i}}\!\big(\widehat{\pi}_{i,t}^m - \lambda_i \widehat{\tau}_{i,t}^m \nabla_{\widehat{\pi}_{i,t}^m}\mathcal{L}(\gamma_{i,t+1}, \widehat{\pi}_{i,t}^m, G_i; \alpha_i, \lambda_i, \eta_i, \lambda_{i,t})\big) - \widehat{\pi}_{i,t}^m \Big\|_2$$
whenever the conditions $\|\mathrm{PROX}_{\lambda_i}(\widehat{\pi}_{i,t}^m - \lambda_i\widehat{\tau}_{i,t}^m \nabla_{\widehat{\pi}_{i,t}^m}\mathcal{L}(\gamma_{i,t+1}, \widehat{\pi}_{i,t}^m, G_i; \alpha_i, \lambda_i, \eta_i, \lambda_{i,t})) - \kappa_{i,t}^m\|_2 < \epsilon_{4,i}$ and $\omega_{2,i} \le \mathcal{L}(\gamma_{i,t+1}, \widehat{\pi}_{i,t}^m, G_i; \alpha_i, \lambda_i, \eta_i, \lambda_{i,t})$ are satisfied. Here, $\Gamma_i^*$, with $\gamma_i^* \in \Gamma_i^*$, $\gamma_i^* \in \mathbb{R}_+^{k_i}$, denotes the solution set for the state-inference cost function. As well, $K_i^*$, with $\kappa_i^* \in K_i^*$, $\kappa_i^* \in \mathbb{R}_+^{d_i}$, denotes the solution set for the cause-inference cost function. Both $\ell_{1,i}, \ell_{2,i} \in \mathbb{R}_+$ denote Lipschitz constants of the state and cause costs. Proof:
In what follows, for ease of presentation, we suppress the variables of the inference cost that remain fixed across inference iterations. Re-write the $\ell_1$-sparsity term in the state-inference cost equivalently as $\sum_{k=1}^{k_i} [\lambda_{i,t}]_k [\xi_{i,t}]_k$, with the polyhedral-set constraint $|[\gamma_{i,t}]_k| - [\xi_{i,t}]_k \le 0$, $\xi_{i,t} \in \mathbb{R}^{k_i}$. There exists some $\omega_i \in [0,\infty)^{k_i}$ such that
$$\big(\mathrm{PROX}_{1/\ell_{1,i}}(\pi_{i,t}^m - \lambda_{i,t}\tau_{i,t}^m\nabla_{\pi_{i,t}^m}\mathcal{L}(\pi_{i,t}^m)) - \gamma_{i,t}^m\big) + \big(\nabla_{\pi_{i,t}^m}\mathcal{L}(\pi_{i,t}^m) - \omega_i\big) = 0,$$
where $\mathrm{PROX}_{1/\ell_{1,i}}(\pi_{i,t}^m - \lambda_{i,t}\tau_{i,t}^m\nabla_{\pi_{i,t}^m}\mathcal{L}(\pi_{i,t}^m)) = \xi_{i,t}$. Here, we assume that the choice of $\pi_{i,t}^m$ is such that the inequality conditions are satisfied for $\epsilon_{1,i}, \epsilon_{2,i} \in \mathbb{R}_+$. As well, there exists some optimal $\pi_i^* \in \Gamma_i^*$, $\pi_i^* \in \mathbb{R}_+^{k_i}$, and a corresponding $\omega_i^* \in [0,\infty)^{k_i}$ such that $\nabla_{\pi_i^*}\mathcal{L}(\pi_i^*) - \omega_i^* = 0$, where $\pi_i^* = \xi_{i,t}$. We also have, for $\sigma \in \mathbb{R}_+$, that $\sigma\|\pi_{i,t}^m - \pi_i^*\|_2^2 \le \langle \pi_{i,t}^m - \pi_i^*, \nabla_{\pi_{i,t}^m}\mathcal{L}(\pi_{i,t}^m) - \nabla_{\pi_i^*}\mathcal{L}(\pi_i^*)\rangle$. Hence, for some $\omega_{3,i} \in \mathbb{R}$ that depends on $\omega_{1,i} \in \mathbb{R}$, we have
$$\sigma\|\pi_{i,t}^m - \pi_i^*\|_2^2 \le \big(\omega_{3,i} + (\omega_{3,i}^2 + 4\omega_{3,i})^{1/2}\big)\big\|\pi_{i,t}^m - \mathrm{PROX}_{1/\ell_{1,i}}(\pi_{i,t}^m - \lambda_{i,t}\tau_{i,t}^m\nabla_{\pi_{i,t}^m}\mathcal{L}(\pi_{i,t}^m))\big\|_2/2.$$
When combined with $\|(\pi_{i,t}^m, \omega_{i,t}) - (\pi_i^*, \omega_i^*)\|_2 \le \delta_i(\|\pi_{i,t}^m - \pi_i^*\|_2 + \|\pi_{i,t}^m - \mathrm{PROX}_{\lambda_{i,t}}(\pi_{i,t}^m - \lambda_{i,t}\tau_{i,t}^m\nabla_{\pi_{i,t}}\mathcal{L}(\pi_{i,t}^m))\|_2)$, for $\delta_i \in \mathbb{R}_+$, we get that $\min_{\pi^* \in \Gamma_i^*}\|\pi_{i,t}^m - \pi^*\|_2 \le \epsilon_{1,i}\|\pi_{i,t}^m - \pi_i^*\|_2$. A similar argument to the one above can be used for the causes. $\blacksquare$

As Luo and Tseng [1] have shown, such locally held bounds are useful for analyzing the rate of convergence of iterative algorithms. They can be used to demonstrate, albeit very weakly, that the sequence of functional iterates will eventually reach the global functional solution, independent of strong inertial-sequence properties. Analogous bounds have been derived by Pang [2], albeit for strongly convex problems. Here, we do not impose the condition of strong convexity on the inference costs, which makes these results applicable to a broader set of functionals, much like the ones that we employ.

To provide a more general convergence result, we need to consider specific inertial sequences and incorporate them into the analysis. Toward this end, we first bound the squared residual between the primary iterates, the states $\gamma_{i,t}$ and the causes $\kappa_{i,t}$, and their corresponding auxiliary iterates $\pi_{i,t}$ and $\widehat{\pi}_{i,t}$.

Proposition A.2.
Let $\gamma_{i,t} \in \mathbb{R}_+^{k_i}$ be the hidden states and $\kappa_{i,t} \in \mathbb{R}^{d_i}$ be the hidden causes. Let $\pi_{i,t} \in \mathbb{R}_+^{k_i}$ be the auxiliary states at layer $i$ and time $t$. Assume that the state update, for a positive step size $\tau_{i,t}^m \in \mathbb{R}_+$, is given by the relation $\gamma_{i,t+1}^m = \mathrm{PROX}_{\lambda_{i,t}}(\pi_{i,t}^m - \lambda_{i,t}\tau_{i,t}^m\nabla_{\pi_{i,t}^m}\mathcal{L}(\pi_{i,t}^m, \kappa_{i,t}, C_i, D_i^\top; \alpha_i, \lambda_{i,t}))$, with the auxiliary state update $\pi_{i,t}^{m+1} = \gamma_{i,t+1}^m + \beta_m(\gamma_{i,t+1}^m - \gamma_{i,t+1}^{m-1})$. Likewise, assume that the cause update, for a positive step size $\widehat{\tau}_{i,t}^m \in \mathbb{R}_+$, is given by the relation $\kappa_{i,t+1}^m = \mathrm{PROX}_{\lambda_i}(\widehat{\pi}_{i,t}^m - \lambda_i\widehat{\tau}_{i,t}^m\nabla_{\widehat{\pi}_{i,t}^m}\mathcal{L}(\gamma_{i,t+1}, \widehat{\pi}_{i,t}^m, G_i; \alpha_i, \lambda_i, \eta_i, \lambda_{i,t}))$, with the auxiliary cause update $\widehat{\pi}_{i,t}^{m+1} = \kappa_{i,t+1}^m + \beta_m(\kappa_{i,t+1}^m - \kappa_{i,t+1}^{m-1})$. In both cases, let $\beta_m = (k_m - 1)/k_{m+1}$, with elements $k_m = 1 + m^{r-1}/d$ for $r, d \in \mathbb{R}_+$. There are some $\epsilon_{1,i}, \epsilon_{2,i} \in \mathbb{R}_+$ such that
$$\|\pi_{i,t}^m - \gamma_{i,t+1}^m\|_2^2 \ge \tau_{i,t}^m\epsilon_{1,i}\Big(\mathcal{L}(\gamma_{i,t+1}^m, \kappa_{i,t}, C_i, D_i^\top; \alpha_i, \lambda_{i,t}) - \mathcal{L}(\gamma_i^*, \kappa_{i,t}, C_i, D_i^\top; \alpha_i, \lambda_{i,t})\Big),$$
$$\|\widehat{\pi}_{i,t}^m - \kappa_{i,t+1}^m\|_2^2 \ge \widehat{\tau}_{i,t}^m\epsilon_{2,i}\Big(\mathcal{L}(\gamma_{i,t+1}, \kappa_{i,t+1}^m, G_i; \alpha_i, \lambda_i, \eta_i, \lambda_{i,t}) - \mathcal{L}(\gamma_{i,t+1}, \kappa_i^*, G_i; \alpha_i, \lambda_i, \eta_i, \lambda_{i,t})\Big),$$
where $\gamma_i^* \in \Gamma_i^*$, $\gamma_i^* \in \mathbb{R}_+^{k_i}$, is a solution of the state-inference cost and $\kappa_i^* \in K_i^*$, $\kappa_i^* \in \mathbb{R}_+^{d_i}$, is a solution of the cause-inference cost. Proof:
We focus on the case of the hidden states; that for the hidden causes has only slight differences. In what follows, for ease of presentation, we suppress the variables of the inference cost that remain fixed across inference iterations. We have, for $m$ sufficiently large, that
$$\mathcal{L}(\gamma_{i,t+1}^m) - \mathcal{L}(\gamma_i^*) \le \frac{\|\gamma_{i,t+1}^{m+1} - \pi_{i,t}^{m+1}\|_2^2}{\tau_{i,t}^m} + \mathrm{dist}(\gamma_{i,t+1}^{m+1}, \Gamma_i^*)\frac{\|\gamma_{i,t+1}^{m+1} - \pi_{i,t}^{m+1}\|_2}{\tau_{i,t}^m} \le \xi_{1,i}^{-1}(4\epsilon_{1,i} + \xi_{2,i})\frac{\|\gamma_{i,t+1}^{m+1} - \pi_{i,t}^{m+1}\|_2^2}{2\tau_{i,t}^m},$$
where $\epsilon_{1,i} \in \mathbb{R}_+$ and $\xi_{1,i}, \xi_{2,i} \in \mathbb{R}_+$. There exists some $\epsilon_{2,i} \ge \xi_{1,i}^{-1}(4\epsilon_{1,i} + \xi_{2,i})/2$, which must naturally be positive, such that the proposition is true. Here, the second inequality follows from proposition A.1,
$$\mathrm{dist}(\gamma_{i,t+1}^m, \Gamma_i^*) \le \epsilon_{1,i}\big\|\mathrm{PROX}_{\tau_{i,t}^m}(\gamma_{i,t+1}^m - \lambda_{i,t}\tau_{i,t}^m\nabla_{\gamma_{i,t+1}^m}\mathcal{L}(\gamma_{i,t+1}^m)) - \gamma_{i,t+1}^m\big\|_2 \big/ \ell_{i,t}\tau_{i,t}^m.$$
This arises from the non-decreasing nature of the proximal-norm function and the Cauchy-Schwarz inequality, which implies that dividing by $\ell_{i,t}\tau_{i,t}^m$ leads to a non-increasing function. The iterate-solution distance can thus be further bounded from above as $\mathrm{dist}(\gamma_{i,t+1}^m, \Gamma_i^*) \le \epsilon_{1,i}\xi_{2,i}\|\gamma_{i,t+1}^m - \pi_{i,t}^m\|_2$. Since the relationship holds for arbitrary $m$ sufficiently large, it can be increased by one iteration. $\blacksquare$

We can now bound the functional-value difference between arbitrary iterates, $\gamma_{i,t}$ and $\kappa_{i,t}$, and optimal solutions, $\gamma_i^* \in \Gamma_i^*$ and $\kappa_i^* \in K_i^*$. Note that, due to the convexity of the two inference costs, every solution is guaranteed to be a globally optimal one. From this result, we will be able to obtain convergence of the function values and iterates.

Proposition A.3.
Let $\gamma_{i,t} \in \mathbb{R}_+^{k_i}$ be the hidden states and $\kappa_{i,t} \in \mathbb{R}^{d_i}$ be the hidden causes. Let the inertial sequences used, respectively, for the state-inference and cause-inference costs be $\beta_m = (k_m - 1)/k_{m+1}$, where $k_m = 1 + m^{r-1}/d$ with $r, d \in \mathbb{R}_+$. We have that
$$\sum_{m=1}^\infty k_{m+1}\Big(\mathcal{L}(\gamma_{i,t+1}^m, \kappa_{i,t}, C_i, D_i^\top; \alpha_i, \lambda_{i,t}) - \mathcal{L}(\gamma_i^*, \kappa_{i,t}, C_i, D_i^\top; \alpha_i, \lambda_{i,t})\Big),$$
$$\sum_{m=1}^\infty k_{m+1}\Big(\mathcal{L}(\gamma_{i,t+1}, \kappa_{i,t+1}^m, G_i; \alpha_i, \lambda_i, \eta_i, \lambda_{i,t}) - \mathcal{L}(\gamma_{i,t+1}, \kappa_i^*, G_i; \alpha_i, \lambda_i, \eta_i, \lambda_{i,t})\Big)$$
are convergent. Here, $\gamma_i^* \in \Gamma_i^*$, $\gamma_i^* \in \mathbb{R}_+^{k_i}$, is a solution of the state-inference cost and $\kappa_i^* \in K_i^*$, $\kappa_i^* \in \mathbb{R}_+^{d_i}$, is a solution of the cause-inference cost. We have assumed, here, that the state update, for a positive step size $\tau_{i,t}^m \in \mathbb{R}_+$, was given by the relation $\gamma_{i,t+1}^m = \mathrm{PROX}_{\lambda_{i,t}}(\pi_{i,t}^m - \lambda_{i,t}\tau_{i,t}^m\nabla_{\pi_{i,t}^m}\mathcal{L}(\pi_{i,t}^m, \kappa_{i,t}, C_i, D_i^\top; \alpha_i, \lambda_{i,t}))$, with the auxiliary update $\pi_{i,t}^{m+1} = \gamma_{i,t+1}^m + \beta_m(\gamma_{i,t+1}^m - \gamma_{i,t+1}^{m-1})$. As well, the cause update, for a positive step size $\widehat{\tau}_{i,t}^m \in \mathbb{R}_+$, was $\kappa_{i,t+1}^m = \mathrm{PROX}_{\lambda_i}(\widehat{\pi}_{i,t}^m - \lambda_i\widehat{\tau}_{i,t}^m\nabla_{\widehat{\pi}_{i,t}^m}\mathcal{L}(\gamma_{i,t+1}, \widehat{\pi}_{i,t}^m, G_i; \alpha_i, \lambda_i, \eta_i, \lambda_{i,t}))$, with the auxiliary cause update $\widehat{\pi}_{i,t}^{m+1} = \kappa_{i,t+1}^m + \beta_m(\kappa_{i,t+1}^m - \kappa_{i,t+1}^{m-1})$. Proof:
For ease of presentation, we suppress the variables of the inference cost that remain fixed across inference iterations. It can be shown that, for some $\xi_{i,t} \in \mathbb{R}_+$,
$$\mathcal{L}(\gamma_{i,t+1}^{m+1}) \le \mathcal{L}(\widetilde{\gamma}_{i,t+1}^m) - \frac{\|\widetilde{\gamma}_{i,t+1}^m - \gamma_{i,t+1}^{m+1}\|_2^2}{2\tau_{i,t}^m} + \frac{\|\widetilde{\gamma}_{i,t+1}^m - \pi_{i,t}^{m+1}\|_2^2}{2\tau_{i,t}^m} - (1 - \xi_{i,t})\frac{\|\gamma_{i,t+1}^{m+1} - \pi_{i,t}^{m+1}\|_2^2}{\tau_{i,t}^m}$$
$$\le (1 - \beta_{m+1}^{-1})\mathcal{L}(\gamma_{i,t}^m) + \beta_{m+1}^{-1}\mathcal{L}(\gamma_i^*) + \beta_{m+1}^{-2}\frac{\|\beta_m\gamma_{i,t+1}^{m+1} - (\beta_{m+1} - 1)\gamma_{i,t+1}^m - \gamma_i^*\|_2^2}{2\tau_{i,t}^m} + \beta_{m+1}^{-2}\frac{\|\beta_m\gamma_{i,t+1}^m - (\beta_m - 1)\gamma_{i,t+1}^{m-1} - \gamma_i^*\|_2^2}{2\tau_{i,t}^m} - (1 - \xi_{i,t})\frac{\|\gamma_{i,t+1}^{m+1} - \pi_{i,t}^{m+1}\|_2^2}{\tau_{i,t}^m},$$
where $\widetilde{\gamma}_{i,t+1}^m = \beta_{m+1}^{-1}\gamma_i^* + (1 - \beta_{m+1}^{-1})\gamma_{i,t}^m$. Multiplying both sides by $k_{m+1}^2$ and re-arranging terms yields
$$k_{m+1}^2\big(\mathcal{L}(\gamma_{i,t+1}^m) - \mathcal{L}(\gamma_i^*)\big) - k_{m+1}^2\big(\mathcal{L}(\gamma_{i,t+1}^{m+1}) - \mathcal{L}(\gamma_i^*)\big) \ge \big(k_m^2 - k_{m+1}^2 - k_{m+1}\big)\big(\mathcal{L}(\gamma_{i,t+1}^m) - \mathcal{L}(\gamma_i^*)\big) + k_{m+1}^2(1 - \xi_{i,t})\frac{\|\gamma_{i,t+1}^m - \pi_{i,t}^m\|_2^2}{2\tau_{i,t}^m} - \frac{\|k_{m+1}\gamma_{i,t+1}^{m+1} - (k_{m+1} - 1)\gamma_{i,t+1}^m - \gamma_i^*\|_2^2}{\tau_{i,t}^m} - \frac{\|k_m\gamma_{i,t+1}^m - (k_m - 1)\gamma_{i,t+1}^{m-1} - \gamma_i^*\|_2^2}{\tau_{i,t}^m}.$$
The result derived in proposition A.2 can be applied to show that $k_{m+1}^2(1 - \xi_{i,t})\|\gamma_{i,t+1}^m - \pi_{i,t}^m\|_2^2/2\tau_{i,t}^m$ is bounded above by $k_{m+1}^2(1 - \xi_{i,t})(\mathcal{L}(\gamma_{i,t+1}^{m+1}) - \mathcal{L}(\gamma_i^*))/\epsilon_{1,i}$. Continuing from above, we have that
$$\big(2k_m^2 - k_{m+1}^2 - k_{m+1}\big)\big(\mathcal{L}(\gamma_{i,t+1}^m) - \mathcal{L}(\gamma_i^*)\big) - \big(k_m^2 - k_{m+1}^2\big)\big(\mathcal{L}(\gamma_{i,t+1}^{m+1}) - \mathcal{L}(\gamma_i^*)\big) \ge \big(k_m^2 - k_{m+1}^2 - k_{m+1}\big)\big(\mathcal{L}(\gamma_{i,t+1}^m) - \mathcal{L}(\gamma_i^*)\big) + k_{m+1}^2(1 - \xi_{i,t})\frac{\|\gamma_{i,t+1}^m - \pi_{i,t}^m\|_2^2}{2\tau_{i,t}^m} + k_{m+1}^2(1 - \xi_{i,t})\frac{\|\gamma_{i,t+1}^{m+1} - \pi_{i,t}^{m+1}\|_2^2}{2\tau_{i,t}^m}.$$
From this, we can see that $(2k_m^2 - k_{m+1}^2 - k_{m+1})(\mathcal{L}(\gamma_{i,t+1}^m) - \mathcal{L}(\gamma_i^*)) + k_{m+1}^2(1 - \xi_{i,t})\|\gamma_{i,t+1}^m - \pi_{i,t}^m\|_2^2/2\tau_{i,t}^m$ is a non-increasing sequence in $m$. It is bounded below, which implies convergence of the sequence in $m$ and hence for $m+1$. This takes care of the two terms on the left-hand side and the first two terms on the right-hand side. This leaves the final term on the right-hand side, $k_{m+1}^2(1 - \xi_{i,t})\|\gamma_{i,t+1}^{m+1} - \pi_{i,t}^{m+1}\|_2^2/2\tau_{i,t}^m$, which is also convergent in $m$. Applying proposition A.2 to this final term proves the proposition for the hidden states. A similar argument to the one above can be used for the causes. $\blacksquare$

Based on properties of the inertial series $\{\beta_m\}_{m=1}^\infty$, particularly that the series of inverse inertial subcomponents $\sum_{m=1}^\infty k_m^{-1}$ is convergent, we immediately obtain that the state $\{\gamma_{i,t}^m\}_{m=1}^\infty$ and cause $\{\kappa_{i,t}^m\}_{m=1}^\infty$ iterates are Cauchy. The state and cause iterates are thus bounded. The Bolzano-Weierstrass theorem implies convergence of iterate subsequences for complete spaces, which applies to our case. The iterates themselves are also strongly convergent to global solutions. Convergence of proximal-gradient-type schemes is not new; it did, however, need to be verified for our accelerated case.

We are now able to prove the main convergence result of the paper.

Proposition 4.
Let $\gamma_{i,t} \in \mathbb{R}_+^{k_i}$ be the hidden states and $\kappa_{i,t} \in \mathbb{R}^{d_i}$ be the hidden causes. The state iterates $\{\gamma_{i,t+1}^m\}_{m=1}^\infty$ strongly converge to the global solution of $\mathcal{L}(\gamma_{i,t}, \kappa_{i,t}, C_i, D_i^\top; \alpha_i, \lambda_{i,t})$ for the accelerated proximal-gradient scheme. Likewise, the cause iterates $\{\kappa_{i,t+1}^m\}_{m=1}^\infty$ for the accelerated proximal-gradient scheme strongly converge to the global solution of $\mathcal{L}(\gamma_{i,t+1}, \kappa_{i,t}, G_i; \alpha_i, \lambda_i, \eta_i, \lambda_{i,t})$ at a sub-$r$-polynomial rate. This occurs when using the inertial sequences $\beta_m = (k_m - 1)/k_{m+1}$, where $k_m$ depends polynomially on $m$. Proof:
Strong convergence of the states $\{\gamma_{i,t+1}^m\}_{m=1}^\infty$ and causes $\{\kappa_{i,t+1}^m\}_{m=1}^\infty$ to the optimal solutions $\gamma_i^* \in \Gamma_i^*$ and $\kappa_i^* \in K_i^*$ can be obtained from an extension of proposition A.3. For the convergence rate, we note that there is some $\zeta_i \in \mathbb{R}_+$ such that $\zeta_i m^{-r} \ge \|\gamma_{i,t+1}^{m-1} - \gamma_{i,t+1}^m\|_2$. For $\widehat{m} > 0$, we have that $\|\gamma_{i,t+1}^{m+\widehat{m}} - \gamma_{i,t+1}^m\|_2$ is bounded above by $\sum_{j=m+1}^{m+\widehat{m}} \|\gamma_{i,t+1}^j - \gamma_{i,t+1}^{j-1}\|_2 \le \zeta_i \sum_{j=m+1}^{m+\widehat{m}} j^{-r}$. As $\widehat{m} \to \infty$, $\|\gamma_{i,t+1}^m - \gamma_i^*\|_2 \le \zeta_i r/((r-1)m^{r-1})$, which implies a sub-$r$-polynomial rate of convergence for the state iterate sequence. A similar result holds for the cause iterates. $\blacksquare$

The choice of the inertial sequence greatly affects convergence properties. The classical sequence proposed by Nesterov, for instance, yields iterates $\{\gamma_{i,t+1}^m\}_{m=1}^\infty$ and $\{\kappa_{i,t+1}^m\}_{m=1}^\infty$ that only weakly converge to global solutions $\gamma_i^* \in \Gamma_i^*$ and $\kappa_i^* \in K_i^*$, which stems from the fact that $\sum_{m=1}^\infty k_m^{-1}$, with $k_{m+1} = (1 + (1 + 4k_m^2)^{1/2})/2$, is divergent. In finite-dimensional Euclidean spaces, this is not a shortcoming, since weak convergence implies componentwise convergence and is thus equivalent to strong convergence.

The original DPCN relied on a Nesterov-style sequence, so we analyze its convergence.

Proposition A.4.
Let the inertial sequences used, respectively, for the state-inference and cause-inference costs be $\beta_m = (k_m - 1)/k_{m+1}$, where $k_{m+1} = (1 + (1 + 4k_m^2)^{1/2})/2$. We have that
$$\sum_{m=1}^\infty k_m\Big(\mathcal{L}(\gamma_{i,t+1}^m, \kappa_{i,t}, C_i, D_i^\top; \alpha_i, \lambda_{i,t}) - \mathcal{L}(\gamma_i^*, \kappa_{i,t}, C_i, D_i^\top; \alpha_i, \lambda_{i,t}) + \frac{1}{2\tau_{i,t}^m}\|\gamma_{i,t+1}^m - \gamma_{i,t+1}^{m-1}\|_2^2\Big),$$
$$\sum_{m=1}^\infty k_m\Big(\mathcal{L}(\gamma_{i,t+1}, \kappa_{i,t+1}^m, G_i; \alpha_i, \lambda_i, \eta_i, \lambda_{i,t}) - \mathcal{L}(\gamma_{i,t+1}, \kappa_i^*, G_i; \alpha_i, \lambda_i, \eta_i, \lambda_{i,t}) + \frac{1}{2\widehat{\tau}_{i,t}^m}\|\kappa_{i,t+1}^m - \kappa_{i,t+1}^{m-1}\|_2^2\Big)$$
are convergent. Here, $\gamma_i^* \in \Gamma_i^*$, $\gamma_i^* \in \mathbb{R}_+^{k_i}$, is a solution of the state-inference cost and $\kappa_i^* \in K_i^*$, $\kappa_i^* \in \mathbb{R}_+^{d_i}$, is a solution of the cause-inference cost. We have assumed, here, that the state update, for a positive step size $\tau_{i,t}^m \in \mathbb{R}_+$, was given by the relation $\gamma_{i,t+1}^m = \mathrm{PROX}_{\lambda_{i,t}}(\pi_{i,t}^m - \lambda_{i,t}\tau_{i,t}^m\nabla_{\pi_{i,t}^m}\mathcal{L}(\pi_{i,t}^m, \kappa_{i,t}, C_i, D_i^\top; \alpha_i, \lambda_{i,t}))$, with the auxiliary update $\pi_{i,t}^{m+1} = \gamma_{i,t+1}^m + \beta_m(\gamma_{i,t+1}^m - \gamma_{i,t+1}^{m-1})$. As well, the cause update, for a positive step size $\widehat{\tau}_{i,t}^m \in \mathbb{R}_+$, was $\kappa_{i,t+1}^m = \mathrm{PROX}_{\lambda_i}(\widehat{\pi}_{i,t}^m - \lambda_i\widehat{\tau}_{i,t}^m\nabla_{\widehat{\pi}_{i,t}^m}\mathcal{L}(\gamma_{i,t+1}, \widehat{\pi}_{i,t}^m, G_i; \alpha_i, \lambda_i, \eta_i, \lambda_{i,t}))$, with the auxiliary cause update $\widehat{\pi}_{i,t}^{m+1} = \kappa_{i,t+1}^m + \beta_m(\kappa_{i,t+1}^m - \kappa_{i,t+1}^{m-1})$. Proof:
For ease of presentation, we suppress the variables of the inference cost that remain fixed across inference iterations. It can be shown that
$$\mathcal{L}(\gamma_{i,t+1}^m) - \mathcal{L}(\gamma_i^*) + \beta_m\frac{\|\gamma_{i,t+1}^m - \gamma_{i,t+1}^{m-1}\|_2^2}{2\tau_{i,t}^m} \ge \mathcal{L}(\gamma_{i,t+1}^{m+1}) - \mathcal{L}(\gamma_i^*) + \frac{\|\gamma_{i,t+1}^m - \gamma_{i,t+1}^{m+1}\|_2^2}{2\tau_{i,t}^{m+1}}.$$
Multiplying both sides by $k_{m+1}$, performing an addition by zero, and re-arranging terms yields
$$k_{m+1}\big(\mathcal{L}(\gamma_{i,t+1}^m) - \mathcal{L}(\gamma_i^*)\big) + k_{m+1}\frac{\|\gamma_{i,t+1}^m - \gamma_{i,t+1}^{m-1}\|_2^2}{2\tau_{i,t}^m} \le \big(k_{m+1} + k_m^2 - k_m\big)\big(\mathcal{L}(\gamma_{i,t+1}^m) - \mathcal{L}(\gamma_i^*)\big) - k_{m+1}(k_m - 1)\frac{\|\gamma_{i,t+1}^m - \gamma_{i,t+1}^{m-1}\|_2^2}{2\tau_{i,t}^m}$$
$$\le k_m^2\big(\mathcal{L}(\gamma_{i,t+1}^m) - \mathcal{L}(\gamma_i^*)\big) + \bar{k}\big(\mathcal{L}(\gamma_{i,t+1}^m) - \mathcal{L}(\gamma_i^*)\big) - \big(k_m(2 - \bar{k}) + k_m^2(2\bar{k}^{-1} - \bar{k})\big)\frac{\|\gamma_{i,t+1}^m - \gamma_{i,t+1}^{m-1}\|_2^2}{2\tau_{i,t}^m} + k_m\frac{\|\gamma_{i,t+1}^m - \gamma_{i,t+1}^{m-1}\|_2^2}{2\tau_{i,t}^m}.$$
The last inequality follows because $\sum_{m=1}^\infty k_m^{-1}$ is divergent. In this case, there exists some $0 < \bar{k} < 1$ such that $k_{m+1} - k_m \le \bar{k}$ for all $m > m_0$, $m_0 > 0$. We therefore have that
$$\bar{k}\big(k_m + k_{m+1}\big)\big(\mathcal{L}(\gamma_{i,t+1}^m) - \mathcal{L}(\gamma_i^*)\big) \ge k_{m+1}\Big(\mathcal{L}(\gamma_{i,t+1}^{m+1}) - \mathcal{L}(\gamma_i^*) + \frac{\|\gamma_{i,t+1}^{m+1} - \gamma_{i,t+1}^m\|_2^2}{2\tau_{i,t}^{m+1}}\Big) + k_m\Big(\mathcal{L}(\gamma_{i,t+1}^m) - \mathcal{L}(\gamma_i^*) + \frac{\|\gamma_{i,t+1}^m - \gamma_{i,t+1}^{m-1}\|_2^2}{2\tau_{i,t}^m}\Big).$$
Re-organizing terms allows us to demonstrate that $\sum_{m=1}^\infty \|\gamma_{i,t+1}^m - \gamma_{i,t+1}^{m-1}\|_2^2/2\tau_{i,t}^m$ is convergent via a Cauchy test. This implies that $\sum_{m=1}^\infty k_m(\mathcal{L}(\gamma_{i,t+1}^m) - \mathcal{L}(\gamma_i^*) + \|\gamma_{i,t+1}^m - \gamma_{i,t+1}^{m-1}\|_2^2/2\tau_{i,t}^m)$ is also convergent. A similar argument to the one above can be used for the causes. $\blacksquare$

The rate of convergence, though, is limited when choosing a Nesterov-style inertial sequence, both locally and globally. In the global case, we have that
Proposition A.5.
Let $\gamma_{i,t} \in \mathbb{R}_+^{k_i}$ be the hidden states and $\kappa_{i,t} \in \mathbb{R}^{d_i}$ be the hidden causes. The state iterates $\{\gamma_{i,t+1}^m\}_{m=1}^\infty$ strongly converge to the global solution of $\mathcal{L}(\gamma_{i,t}, \kappa_{i,t}, C_i, D_i^\top; \alpha_i, \lambda_{i,t})$ for the accelerated proximal-gradient scheme. Likewise, the cause iterates $\{\kappa_{i,t+1}^m\}_{m=1}^\infty$ for the accelerated proximal-gradient scheme strongly converge to the global solution of $\mathcal{L}(\gamma_{i,t+1}, \kappa_{i,t}, G_i; \alpha_i, \lambda_i, \eta_i, \lambda_{i,t})$ at a sub-quadratic rate. This occurs when using the inertial sequences $\beta_m = (k_m - 1)/k_{m+1}$, where $k_{m+1} = (1 + (1 + 4k_m^2)^{1/2})/2$. Proof:
Strong convergence of the states $\{\gamma_{i,t+1}^m\}_{m=1}^\infty$ and causes $\{\kappa_{i,t+1}^m\}_{m=1}^\infty$ to the optimal solutions $\gamma_i^* \in \Gamma_i^*$ and $\kappa_i^* \in K_i^*$ can be obtained from an extension of proposition A.4. For the convergence rate, we note that there is some $\zeta_i \in \mathbb{R}_+$ such that $\zeta_i m^{-3} \ge \|\gamma_{i,t+1}^{m-1} - \gamma_{i,t+1}^m\|_2$. For $\widehat{m} > 0$, we have that $\|\gamma_{i,t+1}^{m+\widehat{m}} - \gamma_{i,t+1}^m\|_2$ is bounded above by $\sum_{j=m+1}^{m+\widehat{m}} \|\gamma_{i,t+1}^j - \gamma_{i,t+1}^{j-1}\|_2 \le \zeta_i \sum_{j=m+1}^{m+\widehat{m}} j^{-3}$. As $\widehat{m} \to \infty$, $\|\gamma_{i,t+1}^m - \gamma_i^*\|_2 \le \zeta_i/2m^2$, which implies a sub-quadratic rate of convergence for the state and cause iterate sequences. $\blacksquare$

To better understand why convergence is better with a polynomial inertial sequence, it is helpful to re-cast the proximal-gradient updates in a way that permits understanding local convergence behaviors using spectral analysis. We do this first for the state-inference process.
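The dichotomy underlying these results, that $\sum_m k_m^{-1}$ converges for the polynomial sequence yet diverges for the Nesterov recursion, can also be checked numerically. The sketch below is illustrative only; the function names and the parameter choices $r = 3$, $d = 1$ are ours.

```python
import numpy as np

def polynomial_k(m, r=3.0, d=1.0):
    # Polynomial subcomponent k_m = 1 + m**(r - 1)/d; sum(1/k_m) converges when r > 2.
    return 1.0 + m ** (r - 1.0) / d

def nesterov_k(count):
    # Classical Nesterov recursion k_{m+1} = (1 + sqrt(1 + 4*k_m**2))/2.
    # k_m grows only linearly in m, so the partial sums of 1/k_m grow without bound.
    ks = [1.0]
    for _ in range(count - 1):
        ks.append((1.0 + np.sqrt(1.0 + 4.0 * ks[-1] ** 2)) / 2.0)
    return np.array(ks)

M = 10_000
poly_sum = np.sum(1.0 / np.array([polynomial_k(m) for m in range(1, M + 1)]))
nest_sum = np.sum(1.0 / nesterov_k(M))
# poly_sum stays bounded as M grows; nest_sum keeps increasing (roughly like 2*ln(M)).
print(poly_sum, nest_sum)
```

The bounded partial sum in the first case is exactly the property used above to show that the iterates are Cauchy.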
Proposition A.6.
The state update $\gamma_{i,t+1}^m = \mathrm{PROX}_{\lambda_{i,t}}(\pi_{i,t}^m - \lambda_{i,t}\tau_{i,t}^m\nabla_{\pi_{i,t}}\mathcal{L}(\pi_{i,t}, \kappa_{i,t}, C_i, D_i^\top; \alpha_i, \lambda_{i,t}))$, with $\pi_{i,t}^{m+1} = \gamma_{i,t}^m + \beta_m(\gamma_{i,t}^m - \gamma_{i,t}^{m-1})$, is equivalent to $\gamma_{i,t+1}^m = \mathrm{SHRINK}(w_{i,t}^m; \lambda_{i,t}/\ell_{i,t})$ for the auxiliary variable
$$w_{i,t}^m = \big(I_{k_i \times k_i} - \ell_{i,t}^{-1}D_i^\top D_i\big)\pi_{i,t}^m + \ell_{i,t}^{-1}D_i^\top\kappa_{i-1,t} + \alpha_i\ell_{i,t}^{-1}\,\mathrm{PROJ}_\infty\!\big((\pi_{i,t}^m - C_i\gamma_{i,t-1})/\mu_i\big),$$
where $I_{k_i \times k_i} \in \mathbb{R}_+^{k_i \times k_i}$ is the identity matrix. This, in turn, is equivalent to the homogeneous matrix recurrence
$$\begin{pmatrix} w_{i,t}^{m+1} \\ w_{i,t}^m \\ 1 \end{pmatrix} = \underbrace{\begin{pmatrix} W_{i,t}^m & \ell_{i,t}^{-1}D_i^\top\kappa_{i-1,t} + (I_{k_i \times k_i} - \ell_{i,t}^{-1}D_i^\top D_i)z_{i,t}^m + \alpha_i\ell_{i,t}^{-1}f_{i,t}^m \\ 0_{1 \times 2k_i} & 1 \end{pmatrix}}_{S_{i,t}^m} \begin{pmatrix} w_{i,t}^m \\ w_{i,t}^{m-1} \\ 1 \end{pmatrix},$$
where $z_{i,t}^m = -(1+\beta_m)\lambda_{i,t}s_{i,t}^m/\ell_{i,t} + \beta_m\lambda_{i,t}s_{i,t}^{m-1}/\ell_{i,t}$, with $s_{i,t}^m = \mathrm{SIGN}(\mathrm{SHRINK}(w_{i,t}^m; \lambda_{i,t}/\ell_{i,t}))$. The term $f_{i,t}^m$ accounts for the Nesterov-smoothed component, $f_{i,t}^m = \mathrm{PROJ}_\infty(((H_{i,t}^m)^2 w_{i,t}^m - \lambda_{i,t}\ell_{i,t}^{-1}s_{i,t}^m - C_i\gamma_{i,t-1})/\mu_i)$, which is given by projecting the $\ell_1$-sparse state-transition component onto an $\ell_\infty$ ball. The matrix $W_{i,t}^m \in \mathbb{R}^{2k_i \times 2k_i}$ is
$$W_{i,t}^m = \begin{pmatrix} (1+\beta_m)\big(I_{k_i \times k_i} - \ell_{i,t}^{-1}D_i^\top D_i\big)(H_{i,t}^m)^2 & -\beta_m\big(I_{k_i \times k_i} - \ell_{i,t}^{-1}D_i^\top D_i\big)(H_{i,t}^{m-1})^2 \\ I_{k_i \times k_i} & 0_{k_i \times k_i} \end{pmatrix}.$$
Here, $\ell_{i,t} \in \mathbb{R}_+$ is the Lipschitz constant of the state-inference cost at a given layer $i$ and for the current batch $t$. The flag matrix $H_{i,t}^m = \mathrm{DIAG}(\mathrm{SIGN}(\mathrm{SHRINK}(w_{i,t}^m; \lambda_{i,t}/\ell_{i,t})))$ is diagonal and relies on a sparse shrinkage process for the auxiliary variable. Proof:
The underlying update for accelerated proximal gradients can be re-written as
$$\gamma_{i,t+1}^m = \arg\min_\pi\Big(\ell_{i,t}\big\|\pi - \big(\pi_{i,t}^m - \ell_{i,t}^{-1}\nabla_{\pi_{i,t}^m}\mathcal{L}(\pi_{i,t}^m)\big)\big\|_2^2/2 + \lambda_{i,t}\|\pi\|_1\Big)$$
$$= \mathrm{SHRINK}\Big(\big(I_{k_i \times k_i} - \ell_{i,t}^{-1}D_i^\top D_i\big)\pi_{i,t}^m + \ell_{i,t}^{-1}D_i^\top\kappa_{i-1,t} + \alpha_i\ell_{i,t}^{-1}\,\mathrm{PROJ}_\infty\big((\pi_{i,t}^m - C_i\gamma_{i,t-1})/\mu_i\big);\; \lambda_{i,t}/\ell_{i,t}\Big),$$
where $\mathcal{L}(\pi_{i,t}^m)$ represents the state-inference cost without the $\ell_1$-sparsity constraint on the states. Here, we have used a Nesterov smoothing approach, with $\mu_i \in \mathbb{R}_+$, to deal with the $\ell_1$-sparse state-transition update,
$$\arg\max_{\|\Omega_{i,t}\|_\infty \le 1} \Omega_{i,t}^\top\big(\pi_{i,t}^m - C_i\gamma_{i,t-1}\big) - \mu_i\|\Omega_{i,t}\|_2^2/2 = \mathrm{PROJ}_\infty\big((\pi_{i,t}^m - C_i\gamma_{i,t-1})/\mu_i\big).$$
This projection onto an $\ell_\infty$-ball has the componentwise closed-form solution
$$\mathrm{PROJ}_\infty\big((\pi_{i,t}^m - C_i\gamma_{i,t-1})/\mu_i\big) = \begin{cases} 1, & (\pi_{i,t}^m - C_i\gamma_{i,t-1})/\mu_i > 1, \\ (\pi_{i,t}^m - C_i\gamma_{i,t-1})/\mu_i, & -1 \le (\pi_{i,t}^m - C_i\gamma_{i,t-1})/\mu_i \le 1, \\ -1, & (\pi_{i,t}^m - C_i\gamma_{i,t-1})/\mu_i < -1. \end{cases}$$
We replace the states by the auxiliary variable $w_{i,t}^m$ and note that $\gamma_{i,t+1}^m = \mathrm{SHRINK}(w_{i,t}^m; \lambda_{i,t}/\ell_{i,t})$, where
$$\mathrm{SHRINK}(w_{i,t}^m; \lambda_{i,t}/\ell_{i,t}) = \mathrm{DIAG}\big(\mathrm{SIGN}(\mathrm{SHRINK}(w_{i,t}^m; \lambda_{i,t}/\ell_{i,t}))\big)^2 w_{i,t}^m - \lambda_{i,t}\,\mathrm{SIGN}\big(\mathrm{SHRINK}(w_{i,t}^m; \lambda_{i,t}/\ell_{i,t})\big)/\ell_{i,t},$$
$$\mathrm{SIGN}\big(\mathrm{SHRINK}(w_{i,t}^m; \lambda_{i,t}/\ell_{i,t})\big) = \begin{cases} 1, & w_{i,t}^m > \lambda_{i,t}/\ell_{i,t}, \\ 0, & -\lambda_{i,t}/\ell_{i,t} \le w_{i,t}^m \le \lambda_{i,t}/\ell_{i,t}, \\ -1, & w_{i,t}^m < -\lambda_{i,t}/\ell_{i,t}, \end{cases}$$
with both expressions applied componentwise.
We can systematically re-write the auxiliary-variable update as
$$w_{i,t}^{m+1} = \big(I_{k_i \times k_i} - \ell_{i,t}^{-1}D_i^\top D_i\big)\big(\gamma_{i,t+1}^m + \beta_m(\gamma_{i,t+1}^m - \gamma_{i,t+1}^{m-1})\big) + \ell_{i,t}^{-1}D_i^\top\kappa_{i-1,t} + \alpha_i\ell_{i,t}^{-1}\,\mathrm{PROJ}_\infty\big((\gamma_{i,t+1}^m - C_i\gamma_{i,t})/\mu_i\big)$$
$$= \big(I_{k_i \times k_i} - \ell_{i,t}^{-1}D_i^\top D_i\big)\big((1+\beta_m)(H_{i,t}^m)^2 w_{i,t}^m - (1+\beta_m)\lambda_{i,t}s_{i,t}^m/\ell_{i,t}\big) - \big(I_{k_i \times k_i} - \ell_{i,t}^{-1}D_i^\top D_i\big)\big(\beta_m(H_{i,t}^{m-1})^2 w_{i,t}^{m-1} - \beta_m\lambda_{i,t}s_{i,t}^{m-1}/\ell_{i,t}\big) + \ell_{i,t}^{-1}D_i^\top\kappa_{i-1,t} + \alpha_i\ell_{i,t}^{-1}\,\mathrm{PROJ}_\infty\big((\gamma_{i,t+1}^m - C_i\gamma_{i,t})/\mu_i\big)$$
$$= (1+\beta_m)\big(I_{k_i \times k_i} - \ell_{i,t}^{-1}D_i^\top D_i\big)(H_{i,t}^m)^2 w_{i,t}^m - \beta_m\big(I_{k_i \times k_i} - \ell_{i,t}^{-1}D_i^\top D_i\big)(H_{i,t}^{m-1})^2 w_{i,t}^{m-1} + \ell_{i,t}^{-1}D_i^\top\kappa_{i-1,t} + \big(I_{k_i \times k_i} - \ell_{i,t}^{-1}D_i^\top D_i\big)z_{i,t}^m + \alpha_i\ell_{i,t}^{-1}f_{i,t}^m.$$
The matrix recurrence follows from this update. $\blacksquare$
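The two closed-form operators that appear throughout proposition A.6, elementwise soft-thresholding and projection onto the unit $\ell_\infty$ ball, reduce to simple clipping rules. A small sketch (the function names are ours):

```python
import numpy as np

def shrink(w, theta):
    """SHRINK(w; theta): elementwise soft-thresholding, the prox of theta*||.||_1."""
    return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)

def proj_inf(x):
    """PROJ_inf(x): closed-form projection onto the unit L-infinity ball (clipping)."""
    return np.clip(x, -1.0, 1.0)

w = np.array([-2.0, -0.3, 0.0, 0.5, 3.0])
out = shrink(w, 1.0)        # gives [-1, 0, 0, 0, 2] elementwise
clipped = proj_inf(w / 0.5) # smoothing parameter mu = 0.5; entries clipped to [-1, 1]
```

The flag-matrix identity used in the proof, $\mathrm{SHRINK}(w;\theta) = \mathrm{DIAG}(\mathrm{SIGN}(\mathrm{SHRINK}(w;\theta)))^2 w - \theta\,\mathrm{SIGN}(\mathrm{SHRINK}(w;\theta))$, can be verified directly on `out`, since squaring the diagonal sign matrix yields the support indicator of the shrunken vector.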
We now characterize the cause inference in a similar manner.
Proposition A.7.
The cause update $\kappa^{m+1}_{i,t} = \mathrm{PROX}_{\lambda_i}\big(\pi^{m}_{i,t} - \ell^{-1}_{i,t}\nabla_{\pi^{m}_{i,t}}L(\gamma_{i,t+1},\pi^{m}_{i,t},G_i;\alpha_i,\lambda_i,\eta_i,\lambda_{i,t})\big)$, with $\pi^{m+1}_{i,t} = \kappa^{m+1}_{i,t} + \beta_m(\kappa^{m+1}_{i,t} - \kappa^{m}_{i,t})$, is equivalent to $\kappa^{m+1}_{i,t} = \mathrm{SHRINK}(v^{m}_{i,t};\lambda_i/\ell_{i,t})$ for the auxiliary variable
$$v^{m}_{i,t} = \big(I_{d_i\times d_i} - 2\eta_i\ell^{-1}_{i,t}I_{d_i\times d_i}\big)\pi^{m}_{i,t} + 2\eta_i\ell^{-1}_{i,t}\kappa_{i,t-1} - \alpha_i\ell^{-1}_{i,t}G_i^{\top}\exp(-G_i\pi^{m}_{i,t})\,|\gamma^{j}_{i,t+1}|,$$
where $I_{d_i\times d_i}\in\mathbb{R}^{d_i\times d_i}_{+}$ is the identity matrix and $0_{1\times 2d_i}$, below, is a row vector of zeros; the final homogeneous coordinate is fixed at one. This is equivalent to the homogeneous matrix recurrence
$$\begin{pmatrix} v^{m+1}_{i,t}\\ v^{m}_{i,t}\\ 1 \end{pmatrix} = \underbrace{\begin{pmatrix} V^{m}_{i,t} & 2\eta_i\ell^{-1}_{i,t}\kappa_{i,t-1} + (I_{d_i\times d_i} - 2\eta_i\ell^{-1}_{i,t}I_{d_i\times d_i})z^{m}_{i,t} - \alpha_i\ell^{-1}_{i,t}g^{m}_{i,t}\\ 0_{1\times 2d_i} & 1 \end{pmatrix}}_{T^{m}_{i,t}} \begin{pmatrix} v^{m}_{i,t}\\ v^{m-1}_{i,t}\\ 1 \end{pmatrix},$$
where $z^{m}_{i,t} = -(1+\beta_m)\lambda_i q^{m}_{i,t}/\ell_{i,t} + \beta_m\lambda_i q^{m-1}_{i,t}/\ell_{i,t}$, with $q^{m}_{i,t} = \mathrm{SIGN}(\mathrm{SHRINK}(v^{m}_{i,t};\lambda_i/\ell_{i,t}))$. The term $g^{m}_{i,t}$ accounts for the invariant-matrix component, $g^{m}_{i,t} = G_i^{\top}\exp(-G_iv^{m}_{i,t})\,|\gamma^{j}_{i,t+1}|$. The matrix $V^{m}_{i,t}\in\mathbb{R}^{2d_i\times 2d_i}$ is
$$V^{m}_{i,t} = \begin{pmatrix} (1+\beta_m)(I_{d_i\times d_i} - 2\eta_i\ell^{-1}_{i,t}I_{d_i\times d_i})M^{m}_{i,t} & -\beta_m(I_{d_i\times d_i} - 2\eta_i\ell^{-1}_{i,t}I_{d_i\times d_i})M^{m-1}_{i,t}\\ I_{d_i\times d_i} & 0_{d_i\times d_i} \end{pmatrix}.$$
Here, $\ell_{i,t}\in\mathbb{R}_{+}$ is the Lipschitz constant of the inference cost at a given layer $i$ and for the current batch $t$. The flag matrix $M^{m}_{i,t} = \mathrm{DIAG}(q^{m}_{i,t})^{2}$ is diagonal, with unit entries flagging the active, non-shrunk coordinates, and relies on a sparse shrinkage process for the auxiliary variable.

We now list spectral properties of the iteration matrices $W^{m}_{i,t}\in\mathbb{R}^{2k_i\times 2k_i}$ and $V^{m}_{i,t}\in\mathbb{R}^{2d_i\times 2d_i}$. The validity of these claims follows from extensions of the work in [3].

Lemma A.1.
Suppose that the flag matrices across consecutive iterations $m$ of the hidden-state and cause updates respectively satisfy $H^{m-1}_{i,t} = H^{m}_{i,t} = H^{m+1}_{i,t}$ and $M^{m-1}_{i,t} = M^{m}_{i,t} = M^{m+1}_{i,t}$. The iteration matrices $W^{m}_{i,t}$ and $V^{m}_{i,t}$ are different at each step and satisfy:

(i) $\|W^{m}_{i,t}\|_2 \le 1$ and $\|V^{m}_{i,t}\|_2 \le 1$. Also, $\|(I_{k_i\times k_i} - \ell^{-1}_{i,t}D_i^{\top}D_i)H^{m}_{i,t}\|_2 \le 1$ and $\|(I_{d_i\times d_i} - 2\eta_i\ell^{-1}_{i,t}I_{d_i\times d_i})M^{m}_{i,t}\|_2 \le 1$.

(ii) For any $0 < \beta_m \le 1$, the eigenvalues of $W^{m}_{i,t}$ and $V^{m}_{i,t}$ lie in a closed circle in the real-complex plane that is centered at the origin and that has a radius of one. If either of these iteration matrices has eigenvalues with absolute values of $\rho(W^{m}_{i,t}) = 1$ or $\rho(V^{m}_{i,t}) = 1$, then those eigenvalues must have no imaginary component. If the step sizes are such that $\beta_m < 1$ and if $W^{m}_{i,t}$ and $V^{m}_{i,t}$ have eigenvalues of one, then these eigenvalues must have a complete set of eigenvectors.

Now, suppose that the flag matrices $H^{m}_{i,t}$ and $M^{m}_{i,t}$ are not necessarily equal across iterations $m$. The full iteration matrices have spectral decompositions $S^{m}_{i,t} = P^{m}_{i,t}J^{m}_{i,t}(P^{m}_{i,t})^{-1}$ and $T^{m}_{i,t} = Q^{m}_{i,t}R^{m}_{i,t}(Q^{m}_{i,t})^{-1}$, where the block-diagonal eigenvalue matrices $J^{m}_{i,t}$ and $R^{m}_{i,t}$ have the form
$$J^{m}_{i,t} = \begin{pmatrix} I^{m}_{i,t} & 0\\ 0 & \tilde{J}^{m}_{i,t} \end{pmatrix}, \qquad R^{m}_{i,t} = \begin{pmatrix} I^{m}_{i,t} & 0\\ 0 & \tilde{R}^{m}_{i,t} \end{pmatrix},$$
where $I^{m}_{i,t}$ is an appropriately sized identity matrix that depends on the flag matrix at iteration $m$ for layer $i$ and batch $t$. Here, the spectral radii satisfy $\rho(\tilde{J}^{m}_{i,t}) < 1$ and $\rho(\tilde{R}^{m}_{i,t}) < 1$. Some of these blocks may be missing, depending on whether or not the flag matrices are equal across iterations.

This lemma suggests that there are multiple local phases that can arise from the matrix-based proximal-gradient recurrence. These phases depend on how the flag matrices shape the spectral characteristics of the full iteration matrices that define the recurrence. If the flag matrices remain the same across consecutive iterations, then the total-iteration-matrix operator remains invariant, and the structure of that operator's spectrum controls the convergence behavior of the process. If a flag matrix changes, then the set of active constraints at the current pass has changed across consecutive iterations. The current iteration is thus a transition to a different operator with a different eigenstructure. The algorithm hence adopts a combinatorial aspect while it searches for the correct set of active constraints. The specific phases are distinguished by the eigenstructure of the total-iteration-matrix operator.

Proposition A.8.
Suppose that the flag matrices across consecutive iterations $m$ of the hidden-state and cause updates respectively satisfy $H^{m-1}_{i,t} = H^{m}_{i,t}$ and $M^{m-1}_{i,t} = M^{m}_{i,t}$. Let the eigendecompositions of the total-iteration matrices be such that $S^{m}_{i,t} = P^{m}_{i,t}J^{m}_{i,t}(P^{m}_{i,t})^{-1}$ and $T^{m}_{i,t} = Q^{m}_{i,t}R^{m}_{i,t}(Q^{m}_{i,t})^{-1}$, where $J^{m}_{i,t}$ and $R^{m}_{i,t}$ have block-diagonal forms. Then, the iterates can belong to one of the following phases:

(i) Let the spectral radii $\rho(W^{m}_{i,t}) < 1$ and $\rho(V^{m}_{i,t}) < 1$. In this case, the $2\times 2$ Jordan blocks in the upper-left corners of the eigenvalue matrices $J^{m}_{i,t}$ and $R^{m}_{i,t}$ are not present; the identity-matrix blocks $I^{m}_{i,t}\in\mathbb{R}^{1\times 1}$ are degenerate. As long as the flag matrices $H^{m}_{i,t}$ and $M^{m}_{i,t}$ do not change across $m$ and the iterates are close enough to the optimal solution, linear convergence to that solution is achieved. Such solutions are unique fixed points, which are the eigenvectors of $S^{m}_{i,t}$ and $T^{m}_{i,t}$ associated with unit eigenvalues. If the eigenvectors are non-negative, then they satisfy the Karush-Kuhn-Tucker conditions for the state- and cause-inference costs.

(ii) If $\rho(W^{m}_{i,t}) = 1$ and $\rho(V^{m}_{i,t}) = 1$, then $S^{m}_{i,t}$ and $T^{m}_{i,t}$ both have non-trivial $2\times 2$ Jordan blocks in the upper-left corners of $J^{m}_{i,t}$ and $R^{m}_{i,t}$. There are no other eigenvalues on the unit circle. The theory of the power method implies that the vector iterates will converge to an invariant subspace corresponding to the unit eigenvalue. The presence of a non-trivial Jordan block implies the existence of a Jordan chain. That is, there are non-zero vectors $\varphi_1,\varphi_2\in\mathbb{R}^{2k_i+1}_{0,+}$ and $\phi_1,\phi_2\in\mathbb{R}^{2d_i+1}_{0,+}$ such that the equivalence relations $(S^{m}_{i,t} - I_{2k_i+1\times 2k_i+1})\varphi_2 = \varphi_1$ and $(S^{m}_{i,t} - I_{2k_i+1\times 2k_i+1})\varphi_1 = 0$, along with $(T^{m}_{i,t} - I_{2d_i+1\times 2d_i+1})\phi_2 = \phi_1$ and $(T^{m}_{i,t} - I_{2d_i+1\times 2d_i+1})\phi_1 = 0$, are satisfied.
Any vector that includes a component of the form $a\varphi_1 + b\varphi_2$ or $a\phi_1 + b\phi_2$, for $a,b\in\mathbb{R}$, would add a constant factor $a\varphi_1$ or $a\phi_1$, respectively, at each application of $S^{m}_{i,t}$ or $T^{m}_{i,t}$, plus decaying lower-order terms from the other, lesser eigenvalues. If $H^{m}_{i,t}$ and $M^{m}_{i,t}$ do not change across $m$, then the state $w^{m}_{i,t}$ and cause $v^{m}_{i,t}$ iterates take constant-sized steps and either diverge or drive some component negative, which results in a change in the iteration matrices $W^{m}_{i,t}$ and $V^{m}_{i,t}$.

(iii) Suppose that $\rho(W^{m}_{i,t}) = 1$ and $\rho(V^{m}_{i,t}) = 1$, but $S^{m}_{i,t}$ and $T^{m}_{i,t}$ have no non-diagonal Jordan block for that eigenvalue. If we assume that the solution is unique, then the unit eigenvalues of $S^{m}_{i,t}$ and $T^{m}_{i,t}$ must be simple. There are no other eigenvalues on the unit circle. If the iterates are close enough to the optimal solution, then they linearly converge to it, as the inference process behaves like a Von Mises iteration. These unique, fixed-point solutions are, by definition, the eigenvectors of $S^{m}_{i,t}$ and $T^{m}_{i,t}$ for the unit eigenvalues. The convergence rate is determined by the next-largest eigenvalues in absolute value, that is, the largest eigenvalues of $\tilde{J}^{m}_{i,t}$ and $\tilde{R}^{m}_{i,t}$, as long as the flag matrices $H^{m}_{i,t}$ and $M^{m}_{i,t}$ do not change across $m$. This phase cannot be the last one in the inference process, as the search will eventually jump to a different phase due to the eigenvalue properties.
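The Von Mises behavior in phase (iii) is easy to illustrate numerically: with a simple unit eigenvalue and no other eigenvalues on the unit circle, the error contracts linearly at a rate set by the second-largest eigenvalue modulus. A small sketch with an illustrative diagonal operator (not the actual DPCN iteration matrix; any diagonalizable matrix with this spectrum behaves the same under a change of basis):

```python
import numpy as np

# Operator with a simple unit eigenvalue and a spectral gap: eigenvalues 1.0,
# 0.6, and 0.3. The fixed points are multiples of the first basis vector.
S = np.diag([1.0, 0.6, 0.3])
x = np.array([1.0, 1.0, 1.0])   # initial iterate with a unit-eigenvector part

errs = []
for _ in range(20):
    x = S @ x
    # Distance to the limit, i.e., the component outside the unit eigenspace.
    errs.append(np.linalg.norm(x - np.array([x[0], 0.0, 0.0])))

# The ratio of consecutive errors approaches the second-largest eigenvalue.
rates = [errs[m + 1] / errs[m] for m in range(len(errs) - 1)]
print(rates[-1])   # approaches 0.6
```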
Now, suppose that $H^{m-1}_{i,t} \neq H^{m}_{i,t}$ and $M^{m-1}_{i,t} \neq M^{m}_{i,t}$ for iterations $m$. In this case, the iteration operator does not remain invariant over more than one pass. The iteration matrices $W^{m}_{i,t}$ and $V^{m}_{i,t}$ could match one of the conditions in the above phases. They could also have the following eigenstructure, associated with a fourth phase:

(iv) $W^{m}_{i,t}$ and $V^{m}_{i,t}$ have eigenvalues with absolute value one, but not equal to one. This occurs when the iterates transition to a new set of active constraints. The next pass will result in a different operator with a different flag matrix.

Proof:
For $S^{m}_{i,t}$, the upper-left portion, $W^{m}_{i,t}$, contributes to the eigenvalue blocks $I^{m}_{i,t}$ and $\tilde{J}^{m}_{i,t}$ of $J^{m}_{i,t}$, where $S^{m}_{i,t} = P^{m}_{i,t}J^{m}_{i,t}(P^{m}_{i,t})^{-1}$. Both $W^{m}_{i,t}$ and $S^{m}_{i,t}$ have the same set of eigenvalues with equivalent geometric and algebraic multiplicities, except when an eigenvalue has an absolute value of one. No eigenvalue with an absolute value of one can have a non-diagonal Jordan block. Hence, the blocks $I^{m}_{i,t}$ and $\tilde{J}^{m}_{i,t}$ corresponding to those eigenvalues must be diagonal. If $W^{m}_{i,t}$ has no eigenvalue equal to one, then $S^{m}_{i,t}$ has a simple eigenvalue of one. In this case, the algebraic multiplicity increases by one, while the geometric multiplicity either increases by one or stays the same. ∎

Such results indicate that there are local search regimes in the inference process where the convergence is quicker than the global convergence rate. We can specify the conditions under which this will occur.
Proposition A.9.
Let the state- and cause-inference recurrences be defined as in Propositions A.6 and A.7, respectively, for the accelerated proximal-gradient search. For a non-accelerated search, they become
$$\begin{pmatrix} w^{m+1}_{i,t}\\ 1 \end{pmatrix} = \begin{pmatrix} \widetilde{W}^{m}_{i,t} & \ell^{-1}_{i,t}D_i^{\top}\kappa_{i-1,t} - \lambda_{i,t}\ell^{-1}_{i,t}(I_{k_i\times k_i} - \ell^{-1}_{i,t}D_i^{\top}D_i)s^{m}_{i,t} + \alpha_i\ell^{-1}_{i,t}f^{m}_{i,t}\\ 0 & 1 \end{pmatrix} \begin{pmatrix} w^{m}_{i,t}\\ 1 \end{pmatrix},$$
where $\widetilde{W}^{m}_{i,t} = (I_{k_i\times k_i} - \ell^{-1}_{i,t}D_i^{\top}D_i)H^{m}_{i,t}$, $s^{m}_{i,t} = \mathrm{SIGN}(\mathrm{SHRINK}(w^{m}_{i,t};\lambda_{i,t}/\ell_{i,t}))$, and $f^{m}_{i,t}$ is the smoothed state-transition projection term. The flag matrix is such that $H^{m}_{i,t} = \mathrm{DIAG}(s^{m}_{i,t})^{2}$. For the non-accelerated cause inference, we have the recurrence relation
$$\begin{pmatrix} v^{m+1}_{i,t}\\ 1 \end{pmatrix} = \begin{pmatrix} \widetilde{V}^{m}_{i,t} & 2\eta_i\ell^{-1}_{i,t}\kappa_{i,t-1} - \lambda_i\ell^{-1}_{i,t}(I_{d_i\times d_i} - 2\eta_i\ell^{-1}_{i,t}I_{d_i\times d_i})q^{m}_{i,t} - \alpha_i\ell^{-1}_{i,t}g^{m}_{i,t}\\ 0 & 1 \end{pmatrix} \begin{pmatrix} v^{m}_{i,t}\\ 1 \end{pmatrix},$$
where $\widetilde{V}^{m}_{i,t} = (I_{d_i\times d_i} - 2\eta_i\ell^{-1}_{i,t}I_{d_i\times d_i})M^{m}_{i,t}$ and $q^{m}_{i,t} = \mathrm{SIGN}(\mathrm{SHRINK}(v^{m}_{i,t};\lambda_i/\ell_{i,t}))$. The flag matrix is such that $M^{m}_{i,t} = \mathrm{DIAG}(q^{m}_{i,t})^{2}$. We have that:

(i) For the second phase, the constant step-size vector has the form $(1-\beta_m)^{-1}(\varphi_1^{\top},\varphi_1^{\top},0)^{\top}$, with $\widetilde{W}^{m}_{i,t}\varphi_1 = \varphi_1$, where $\varphi_1\in\mathbb{R}^{k_i}$ is a scaled eigenvector of the state-inference total-iteration matrix. Likewise, for the cause inference, the constant step-size vector has the form $(1-\beta_m)^{-1}(\phi_1^{\top},\phi_1^{\top},0)^{\top}$, where $\widetilde{V}^{m}_{i,t}\phi_1 = \phi_1$, with $\phi_1\in\mathbb{R}^{d_i}$ being a scaled eigenvector. Since the inertial weights $\beta_m$ have a limit of one, this constant step-size vector is larger than the one for the states, $(\varphi_1^{\top},0)^{\top}$ with $\widetilde{W}^{m}_{i,t}\varphi_1 = \varphi_1$, in the non-accelerated proximal-gradient case. The constant step-size vector in the accelerated case is also larger than the one for the causes, $(\phi_1^{\top},0)^{\top}$ with $\widetilde{V}^{m}_{i,t}\phi_1 = \phi_1$, in the case of non-accelerated proximal gradients.

(ii) In the first and third phases, if $1 > \rho(\widetilde{W}^{m}_{i,t}) > \beta_m$ and $1 > \rho(\widetilde{V}^{m}_{i,t}) > \beta_m$, then the accelerated proximal-gradient scheme will be faster than the non-accelerated case for the state and cause inference. It will be slower, however, if $1 > \beta_m > \rho(\widetilde{W}^{m}_{i,t})$ and $1 > \beta_m > \rho(\widetilde{V}^{m}_{i,t})$.
When $\beta_m > \rho(\widetilde{W}^{m}_{i,t})$ and $\beta_m > \rho(\widetilde{V}^{m}_{i,t})$, the largest eigenvalues of $W^{m}_{i,t}$ and $V^{m}_{i,t}$ must be a pair of complex conjugates. According to the theory of Von Mises iterations, the convergence will oscillate between the two complex numbers, and the search will not be monotonically decreasing across iterations. The accelerated case will hence be slower than the non-accelerated case, as $\rho(\widetilde{W}^{m}_{i,t})$ and $\rho(\widetilde{V}^{m}_{i,t})$ remain fixed within a specific phase, while the steps $\beta_m$ monotonically increase.

Proof:
For part (i), a single update has the form $(w^{m+1\top}_{i,t},w^{m\top}_{i,t},1)^{\top} + (\varphi_1^{\top},\varphi_2^{\top},0)^{\top}$. There exists a Jordan block in $J^{m}_{i,t}$ and hence a Jordan chain. Therefore, $(S^{m}_{i,t} - I_{2k_i+1\times 2k_i+1})(w^{m+1\top}_{i,t},w^{m\top}_{i,t},1)^{\top} = (\varphi_1^{\top},\varphi_2^{\top},0)^{\top}$ and $S^{m}_{i,t}(\varphi_1^{\top},\varphi_2^{\top},0)^{\top} = (\varphi_1^{\top},\varphi_2^{\top},0)^{\top}$. It can be seen that $\varphi_1 = \varphi_2$, which implies $\widetilde{W}^{m}_{i,t}\varphi_1 = \varphi_1$ for the accelerated case. Moreover,
$$(S^{m}_{i,t} - I_{2k_i+1\times 2k_i+1})\begin{pmatrix} w^{m+1}_{i,t}\\ w^{m}_{i,t}\\ 1 \end{pmatrix} = \begin{pmatrix} \big((1+\beta_m)\widetilde{W}^{m}_{i,t} - I_{k_i\times k_i}\big)w^{m}_{i,t} - \beta_m\widetilde{W}^{m}_{i,t}w^{m-1}_{i,t} + (I_{k_i\times k_i} - \ell^{-1}_{i,t}D_i^{\top}D_i)z^{m}_{i,t}\\ w^{m}_{i,t} - w^{m-1}_{i,t}\\ 0 \end{pmatrix},$$
where $(S^{m}_{i,t} - I_{2k_i+1\times 2k_i+1})(w^{m+1\top}_{i,t},w^{m\top}_{i,t},1)^{\top} = (1-\beta_m)^{-1}(\varphi_1^{\top},\varphi_1^{\top},0)^{\top}$. For this to occur, we must have that $w^{m-1}_{i,t} = w^{m}_{i,t} + (\beta_m - 1)^{-1}\varphi_1$. Similar arguments apply to the causes. The analysis is similar for the non-accelerated proximal-gradient case.

For part (ii), we prove properties for the states and note that they extend to the causes with few changes. Let $(v_1^{\top},v_2^{\top})^{\top}$ be an eigenvector of $W^{m}_{i,t}$. We have that
$$W^{m}_{i,t}\begin{pmatrix} v_1\\ v_2 \end{pmatrix} = \begin{pmatrix} (1+\beta_m)\widetilde{W}^{m}_{i,t}v_1 - \beta_m\widetilde{W}^{m}_{i,t}v_2\\ v_1 \end{pmatrix} = \rho(W^{m}_{i,t})\begin{pmatrix} v_1\\ v_2 \end{pmatrix},$$
since $v_1 = \rho(W^{m}_{i,t})v_2$. Hence, $\widetilde{W}^{m}_{i,t}v_1 = \rho(W^{m}_{i,t})^{2}v_1/\big((1+\beta_m)\rho(W^{m}_{i,t}) - \beta_m\big)$, where $\widetilde{W}^{m}_{i,t} = (I_{k_i\times k_i} - \ell^{-1}_{i,t}D_i^{\top}D_i)H^{m}_{i,t}$. This implies that
$$\rho(W^{m}_{i,t})^{2} + \beta_m\rho(\widetilde{W}^{m}_{i,t}) - (1+\beta_m)\rho(W^{m}_{i,t})\rho(\widetilde{W}^{m}_{i,t}) = 0.$$
It can be seen that $\rho(W^{m}_{i,t})$ has real-valued roots if $4\beta_m/(1+\beta_m)^{2} < \rho(\widetilde{W}^{m}_{i,t})$ and complex-valued roots otherwise. When there are real-valued roots, then $\rho(\widetilde{W}^{m}_{i,t}) > \beta_m$. Moreover,
$$\rho(W^{m}_{i,t}) = (1+\beta_m)\rho(\widetilde{W}^{m}_{i,t})/2 + \big((1+\beta_m)^{2}\rho(\widetilde{W}^{m}_{i,t})^{2}/4 - \beta_m\rho(\widetilde{W}^{m}_{i,t})\big)^{1/2} < \rho(\widetilde{W}^{m}_{i,t}),$$
which follows from the fact that $\beta_m < 1$, as per its definition.
When there are complex-valued roots, then $|\rho(W^{m}_{i,t})| = (\beta_m\rho(\widetilde{W}^{m}_{i,t}))^{1/2}$, and hence $|\rho(W^{m}_{i,t})| < \rho(\widetilde{W}^{m}_{i,t})$ whenever $\rho(\widetilde{W}^{m}_{i,t}) > \beta_m$. If, however, $\beta_m > \rho(\widetilde{W}^{m}_{i,t})$, then $|\rho(W^{m}_{i,t})| > \rho(\widetilde{W}^{m}_{i,t})$. Both the accelerated and non-accelerated proximal gradients reduce to a Von Mises iteration in the first and third phases. The rate of convergence is, respectively, determined by $|\rho(W^{m}_{i,t})|$ and $|\rho(\widetilde{W}^{m}_{i,t})|$. ∎

This result provides insight into why a Nesterov-style inertial sequence is worse than the one that we employed. It is straightforward to show that the step sizes for a Nesterov inertial sequence are strictly greater than those for our polynomial inertial sequence. The former will therefore reach the eigenvalue/step-size inequality conditions more quickly, yielding an inference slowdown whenever the search is in either the first or third phase. Such phases typically occur near the end of the inference process and occupy a majority of the overall process. Moreover, severe cost rippling will be encountered due to the emergence of complex-conjugate eigenvalues in the iteration matrices. Our inertial sequence will likely never reach this point, though, since the search is terminated after only a few hundred steps. If it is reached, then it will occur far later than in the Nesterov case, allowing our accelerated strategy to take advantage of the faster linear convergence for a greater number of iterations. Although there is a chance that a non-accelerated proximal-gradient scheme can locally converge more quickly than ours, this will rarely occur in practice; we often terminate the search process well before it can happen.
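The practical effect of acceleration can be seen on a toy $\ell_1$-regularized least-squares problem. The sketch below contrasts a plain proximal-gradient (ISTA-style) iteration with a Nesterov-accelerated (FISTA-style) one; the problem size, penalty, and inertial sequence are illustrative stand-ins, not the DPCN inference or our polynomial sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((40, 20))
x_true = np.zeros(20)
x_true[:3] = [1.0, -2.0, 1.5]                 # sparse ground truth
y = D @ x_true
lam = 0.1
ell = np.linalg.eigvalsh(D.T @ D).max()       # Lipschitz constant of the smooth part

def shrink(w, tau):
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def cost(x):
    return 0.5 * np.sum((D @ x - y) ** 2) + lam * np.abs(x).sum()

def ista(n):
    # Plain proximal gradient: gradient step on the smooth part, then shrink.
    x = np.zeros(20)
    for _ in range(n):
        x = shrink(x - (D.T @ (D @ x - y)) / ell, lam / ell)
    return cost(x)

def fista(n):
    # Accelerated proximal gradient with a Nesterov-style inertial sequence.
    x = x_prev = np.zeros(20)
    t = 1.0
    for _ in range(n):
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = x + ((t - 1.0) / t_next) * (x - x_prev)   # momentum extrapolation
        x_prev, x = x, shrink(z - (D.T @ (D @ z - y)) / ell, lam / ell)
        t = t_next
    return cost(x)

# For the same iteration budget, the accelerated variant reaches a lower
# objective on this seeded instance.
print(fista(60) < ista(60))
```

The plain iteration is monotone in the objective for step size $1/\ell$, while the accelerated one can ripple, which is exactly the oscillatory behavior the eigenvalue analysis above predicts once the inertial weight overtakes the relevant spectral radius.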
References

[1] Z.-Q. Luo and P. Tseng, “Error bound and convergence analysis of matrix splitting algorithms for the affine variational inequality problem,” SIAM Journal on Optimization, vol. 2, no. 1, pp. 43–54, 1992. Available: http://dx.doi.org/10.1137/0802004
[2] J.-S. Pang, “A posteriori error bounds for the linearly-constrained variational inequality problem,” Mathematics of Operations Research, vol. 12, no. 3, pp. 474–484, 1987. Available: http://dx.doi.org/10.1287/moor.12.3.474
[3] D. Boley, “Local linear convergence of the alternating direction method of multipliers on quadratic or linear programs,” SIAM Journal on Optimization, vol. 23, no. 4, pp. 2183–2207, 2013. Available: http://dx.doi.org/10.1137/120878951
Appendix B
Below, we report performance errors for the MNIST, FMNIST, CIFAR-10, CIFAR-100, and STL-10 datasets. For each of the referenced methods, we either report the test-set classification errors that the authors listed or report the best known test-set classification errors obtained using that approach. We also specify whether the chosen approaches were predominantly unsupervised, semi-supervised, or supervised. The best methods for each learning category are highlighted in green. The worst are highlighted in red.
MNIST Error Rate Comparison
Method       Learning Style   Error
DPCN [1]     Unsupervised     36.2%
AE [2]       Unsupervised     18.8%
GAN [3]      Unsupervised     17.2%
CRPN [4]     Unsupervised     12.6%
IMSAT [5]    Unsupervised     1.60%
IIC [6]      Unsupervised     1.60%
SCAE [7]     Unsupervised     1.00%
EBSR [8]     Unsupervised     0.39%
ADPCN        Unsupervised     0.31%
MON [9]      Supervised       0.45%
RCNN [10]    Supervised       0.31%
MCNN [11]    Supervised       0.23%

FMNIST Error Rate Comparison
Method       Learning Style   Error
DPCN [1]     Unsupervised     42.7%
kSCN [12]    Unsupervised     39.9%
CRPN [4]     Unsupervised     8.59%
ADPCN        Unsupervised     2.51%
TLML [13]    Semi-super.      11.2%
ZSDA [14]    Supervised       15.5%
DAGH [15]    Supervised       6.30%
DCAP [16]    Supervised       5.54%
DRBC [17]    Supervised       4.30%
DNET [18]    Supervised       4.60%
LES [19]     Supervised       4.14%
JOUT [20]    Supervised       2.87%

CIFAR-10 Error Rate Comparison
Method       Learning Style   Error
DPCN [1]     Unsupervised     68.5%
NOMP [21]    Unsupervised     39.2%
CRPN [4]     Unsupervised     32.7%
RFL [22]     Unsupervised     16.9%
ADPCN        Unsupervised     7.73%
MON [9]      Supervised       9.38%
DASN [23]    Supervised       9.22%
ACN [24]     Supervised       9.08%
SPNET [25]   Supervised       8.60%
HNET [26]    Supervised       7.69%
LAF [27]     Supervised       7.51%
RCNN [10]    Supervised       7.09%

CIFAR-100 Error Rate Comparison
Method       Learning Style   Error
DPCN [1]     Unsupervised     95.6%
AEVB [28]    Unsupervised     84.8%
DEC [29]     Unsupervised     81.5%
DAIC [30]    Unsupervised     76.2%
DCCM [31]    Unsupervised     67.3%
CRPN [4]     Unsupervised     59.7%
ADPCN        Unsupervised     29.7%
PMO [32]     Supervised       38.1%
TREE [33]    Supervised       36.8%
SBO [34]     Supervised       27.4%
INIT [35]    Supervised       26.3%
DNET [18]    Supervised       17.1%

STL-10 Error Rate Comparison
Method       Learning Style   Error
DPCN [1]     Unsupervised     87.1%
SWAE [36]    Unsupervised     72.9%
JULE [37]    Unsupervised     72.3%
DCNN [38]    Unsupervised     70.1%
DEC [29]     Unsupervised     64.1%
CRPN [4]     Unsupervised     55.5%
SRF [39]     Unsupervised     39.9%
ADPCN        Unsupervised     17.3%
CKN [40]     Supervised       37.6%
MTBO [41]    Supervised       29.9%
SSTN [42]    Supervised       23.4%
RESN [43]    Supervised       14.2%
References [1] J. C. Príncipe and R. Chalasani, “Cognitive architectures for sensory processing,”
Proceedings of the IEEE , vol. 102, no. 4, pp. 514–525, 2014. Available: http://dx.doi.org/10.1109/JPROC.2014.2307023 [2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” in
Advances in Neural Information Processing Systems (NIPS) , B. Schölkopf, J. C. Platt, and T. Hoffman, Eds. Cambridge, MA, USA: MIT Press, 2007, pp. 153–160. [3] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” in
Proceedings of the International Conference on Learning Representations (ICLR) , San Juan, Puerto Rico, May 2-4 2016, pp. 1–16. Available: https://arxiv.org/abs/1511.06434 [4] R. Chalasani and J. C. Príncipe, “Context dependent encoding using convolutional dynamic networks,”
IEEE Transactions on Neural Networks and Learning Systems , vol. 26, no. 9, pp. 1992–2004, 2015. Available: http://dx.doi.org/10.1109/TNNLS.2014.2360060 [5] W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama, “Learning discrete representations via information maximizing self-augmented training,” in
Proceedings of the International Conference on Machine Learning(ICML) , Sydney, Australia, August 6-11 2017, pp. 1558–1567. Available:http://dx.doi.org/10.5555/3305381.3305542[6] X. Ji, A. Vedaldi, and J. Henriques, “Invariant information clustering for unsupervised image classification andsegmentation,” in
Proceedings of the IEEE International Conference on Computer Vision (ICCV) , Seoul, South Korea, October 27-November 2 2019, pp. 9864–9873. Available: http://dx.doi.org/10.1109/ICCV.2019.00996 [7] A. Kosiorek, S. Sabour, Y. W. Teh, and G. E. Hinton, “Stacked capsule autoencoders,” in
Advances in Neural Information Processing Systems (NIPS) , H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Red Hook, NY, USA: Curran Associates, 2019, pp. 15 512–15 522. [8] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun, “Efficient learning of sparse representations with an energy-based model,” in
Advances in Neural Information Processing Systems (NIPS) , B. Schölkopf, J. C. Platt, and T. Hoffman, Eds. Cambridge, MA, USA: MIT Press, 2007, pp. 1137–1144.
[9] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” in
Proceedings ofthe International Conference on Machine Learning (ICML) , Atlanta, GA, USA, June 16-21 2013, pp. 1319–1327.Available: http://dx.doi.org/10.5555/3042817.3043084[10] M. Liang and X. Hu, “Recurrent convolutional neural network for object recognition,” in
Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) , Boston, MA, USA, June 7-12 2015, pp. 3367–3375. Available: http://dx.doi.org/10.1109/CVPR.2015.7298958 [11] D. Ciregan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” in
Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) , Providence, RI, USA, June 16-21 2012, pp. 3642–3649. Available: http://dx.doi.org/10.1109/CVPR.2012.6248110 [12] T. Zhang, P. Ji, M. Harandi, R. Hartley, and I. Reid, “Scalable deep k-subspace clustering,” in Proceedings of the Asian Conference on Computer Vision (ACCV) , Perth, Australia, December 2-6 2018, pp. 466–481. Available: http://dx.doi.org/10.1007/978-3-030-20873-8 [13] B. Yu, T. Liu, M. Gong, C. Ding, and D. Tao, “Correcting the triplet selection bias for triplet loss,” in
Proceedingsof the European Conference on Computer Vision (ECCV) , Munich, Germany, September 8-14 2018, pp. 71–86.Available: http://dx.doi.org/10.1007/978-3-030-01231-1[14] K.-C. Peng, Z. Wu, and J. Ernst, “Zero-shot deep domain adaptation,” in
Proceedings of the European Conferenceon Computer Vision (ECCV) , Munich, Germany, September 8-14 2018, pp. 793–810. Available:http://dx.doi.org/10.1007/978-3-030-01252-6[15] Y. Chen, Z. Lai, Y. Ding, L. Lin, and W. Wong, “Deep supervised hashing with anchor graph,” in
Proceedings of the IEEE International Conference on Computer Vision (ICCV) , Seoul, South Korea, October 27-November 2 2019, pp. 9795–9803. Available: http://dx.doi.org/10.1109/ICCV.2019.00989 [16] J. Rajasegaran, V. Jayasundara, S. Jayasekara, H. Jayasekara, S. Seneviratne, and R. Rodrigo, “DeepCaps: Going deeper with capsule networks,” in
Proceedings of the IEEE International Conference on Computer Vision andPattern Recognition (CVPR) , Long Beach, CA, USA, June 15-20 2019, pp. 10 717–10 725. Available:http://dx.doi.org/10.1109/CVPR.2019.01098[17] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” in
Advances in Neural InformationProcessing Systems (NIPS) , I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, andR. Garnett, Eds. Red Hook, NY, USA: Curran Associates, 2017, pp. 3856–3866.[18] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in
Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) ,Honolulu, HI, USA, July 21-26 2017, pp. 2261–2269. Available: http://dx.doi.org/10.1109/CVPR.2017.243[19] A. Nøkland and L. H. Eidnes, “Training neural networks with local error signals,” in
Proceedings of theInternational Conference on Machine Learning (ICML) , Long Beach, CA, USA, June 10-15 2019, pp.4839–4850.[20] S. Wang, T. Zhou, and J. Bilmes, “Jumpout: Improved dropout for deep neural networks with ReLUs,” in
Proceedings of the International Conference on Machine Learning (ICML) , Long Beach, CA, USA, June 10-152019, pp. 6668–6676.[21] T.-H. Lin and H. T. Kung, “Stable and efficient representation learning with nonnegativity constraints,” in
Proceedings of the International Conference on Machine Learning (ICML) , Beijing, China, June 22-24 2014, pp.1323–1331.[22] Y. Jia, C. Huang, and T. Darrell, “Beyond spatial pyramids: Receptive field learning for pooled image features,” in
Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) ,Providence, RI, USA, June 16-21 2012, pp. 3370–3377. Available:http://dx.doi.org/10.1109/CVPR.2012.6248076[23] M. F. Stollenga, J. Masci, F. Gomez, and J. Schmidhuber, “Deep networks with internal selective attentionthrough feedback connections,” in
Advances in Neural Information Processing Systems (NIPS) , Z. Ghahramani,M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Red Hook, NY, USA: Curran Associates,2014, pp. 3545–3553.[24] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutionalnet,” in
Proceedings of the International Conference on Learning Representations (ICLR) , San Diego, CA, USA,May 7-9 2015, pp. 1–14. Available: https://arxiv.org/abs/1412.6806[25] O. Rippel, J. Snoek, and R. P. Adams, “Spectral representations for convolutional neural networks,” in
Advancesin Neural Information Processing Systems (NIPS) , C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, andR. Garnett, Eds. Red Hook, NY, USA: Curran Associates, 2015, pp. 2449–2457.[26] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in
Advances in NeuralInformation Processing Systems (NIPS) , C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett,Eds. Cambridge, MA, USA: MIT Press, 2015, pp. 2377–2385.[27] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi, “Learning activation functions to improve deep neuralnetworks,” in
Proceedings of the International Conference on Learning Representations (ICLR) , San Diego, CA,USA, May 7-9 2015, pp. 1–9. Available: https://arxiv.org/abs/1412.6830[28] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in
Proceedings of the International Conferenceon Learning Representations (ICLR) , Scottsdale, AZ, USA, May 2-4 2013, pp. 1–14. Available:https://arxiv.org/abs/1312.6114[29] J. Xie, R. Girshick, and A. Farhadi, “Unsupervised deep embedding for clustering analysis,” in
Proceedings of theInternational Conference on Machine Learning (ICML) , New York, NY, USA, June 20-22 2016, pp. 478–487.[30] J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan, “Deep adaptive image clustering,” in
Proceedings of the IEEEInternational Conference on Computer Vision (ICCV) , Venice, Italy, October 22-29 2017, pp. 5880–5888.Available: http://dx.doi.org/10.1109/ICCV.2017.626
[31] J. Wu, K. Long, F. Wang, C. Qian, C. Li, Z. Lin, and H. Zha, “Deep comprehensive correlation mining for image clustering,” in
Proceedings of the IEEE International Conference on Computer Vision (ICCV) , Seoul, SouthKorea, October 27-November 2 2019, pp. 8149–8158. Available: http://dx.doi.org/10.1109/ICCV.2019.00824[32] J. T. Springenberg and M. Riedmiller, “Improving deep neural networks with probabilistic maxout units,” in
Proceedings of the International Conference on Learning Representations (ICLR) , Banff, Canada, April 14-162014, pp. 1–10. Available: https://arxiv.org/abs/1312.6116[33] N. Srivastava and R. R. Salakhutdinov, “Discriminative transfer learning with tree-based priors,” in
Advances inNeural Information Processing Systems (NIPS) , C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q.Weinberger, Eds. Red Hook, NY, USA: Curran Associates, 2013, pp. 2094–2102.[34] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. P. Adams,“Scalable Bayesian optimization using deep neural networks,” in
Proceedings of the International Conference onMachine Learning (ICML) , Lille, France, July 6-11 2015, pp. 2171–2180. Available:https://arxiv.org/abs/1502.05700[35] D. .Mishkin and J. Matas, “All you need is a good init,” in
Proceedings of the International Conference on Machine Learning (ICML) , Lille, France, July 6-11 2015, pp. 2171–2180. Available: https://arxiv.org/abs/1502.05700 [35] D. Mishkin and J. Matas, “All you need is a good init,” in
Proceedings of theInternational Conference on Learning Representations (ICLR) , San Juan, Puerto Rico, May 2-4 2016, pp. 1–12.Available: https://arxiv.org/abs/1506.02351[37] J. Yang, D. Parikh, and D. Batra, “Joint unsupervised learning of deep representations and image clusters,” in
Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) , LasVegas, NV, USA, June 27-30 2016, pp. 5147–5156. Available: http://dx.doi.org/10.1109/CVPR.2016.556[38] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Deconvolutional networks,” in
Proceedings of the IEEEInternational Conference on Computer Vision and Pattern Recognition (CVPR) , San Francisco, CA, USA, June13-18 2010, pp. 2528–2535. Available: http://dx.doi.org/10.1109/CVPR.2010.5539957[39] A. Coates and A. Y. Ng, “Selecting receptive fields in deep networks,” in
Advances in Neural InformationProcessing Systems (NIPS) , J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, Eds.Red Hook, NY, USA: Curran Associates, 2011, pp. 2528–2536.[40] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid, “Convolutional kernel networks,” in
Advances in Neural Information Processing Systems (NIPS) , Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Red Hook, NY, USA: Curran Associates, 2014, pp. 2627–2635. [41] K. Swersky, J. Snoek, and R. P. Adams, “Multi-task Bayesian optimization,” in
Advances in Neural Information Processing Systems (NIPS) , C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds. Red Hook, NY, USA: Curran Associates, 2013, pp. 2004–2012. [42] E. Oyallon, E. Belilovsky, and S. Zagoruyko, “Scaling the scattering transform: Deep hybrid networks,” in
Proceedings of the IEEE International Conference on Computer Vision (ICCV) , Venice, Italy, October 22-29 2017, pp. 5619–5628. Available: http://dx.doi.org/10.1109/ICCV.2017.599 [43] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in