Hindsight Network Credit Assignment
Kenny Young
Department of Computing Science, University of Alberta, Edmonton, AB, Canada
[email protected]
Abstract
We present Hindsight Network Credit Assignment (HNCA), a novel learning method for stochastic neural networks, which works by assigning credit to each neuron's stochastic output based on how it influences the output of its immediate children in the network. We prove that HNCA provides unbiased gradient estimates while reducing variance compared to the REINFORCE estimator. We also experimentally demonstrate the advantage of HNCA over REINFORCE in a contextual bandit version of MNIST. The computational complexity of HNCA is similar to that of backpropagation. We believe that HNCA can help stimulate new ways of thinking about credit assignment in stochastic compute graphs.

The idea of using discrete stochastic neurons within neural networks is appealing for a number of reasons, including representing complex multimodal distributions, modeling discrete choices within a compute graph, providing regularization, and enabling nontrivial exploration. However, training such neurons efficiently presents challenges, as backpropagation is not directly applicable. A number of techniques have been proposed for producing either biased or unbiased estimates of gradients for stochastic neurons. Bengio et al. (2013) propose an unbiased REINFORCE (Williams, 1992) style estimator, as well as a biased but low-variance estimator obtained by treating a threshold function as constant during backpropagation. Tang and Salakhutdinov (2013) propose an EM procedure which maximizes a variational lower bound on the loss. Maddison et al. (2016) and Jang et al. (2016) each propose a biased estimator based on a continuous relaxation of discrete outputs.
Tucker et al. (2017) use such a continuous relaxation to derive a control variate for a REINFORCE style estimator, resulting in a variance-reduced unbiased gradient estimator.

We introduce a novel, unbiased, and computationally efficient estimator for the gradients of stochastic neurons which reduces variance by assigning credit to each neuron based on how much it impacts the outputs of its immediate children. Our technique is inspired by the recently proposed Hindsight Credit Assignment (Harutyunyan et al., 2019) approach to reinforcement learning, hence we call it Hindsight Network Credit Assignment (HNCA). Aside from the immediate application to stochastic neural networks, we believe this line of thinking can help pave the way for new approaches to credit assignment in stochastic compute graphs (see the work of Weber et al. (2019) and Schulman et al. (2015) for some current techniques).
Problem setting.
We consider the problem of training a network of stochastic neurons. The network consists of a directed acyclic graph where each node is either an input node, a hidden node, or an output node. Each hidden and output node corresponds to a stochastic neuron that generates output according to a parameterized stochastic policy conditioned on its incoming edges. Let Φ be a random variable corresponding to the output of a particular neuron. We define B(Φ) to be the set of random variables corresponding to the outputs of the parent nodes of Φ (i.e., nodes with incoming edges to the neuron Φ); we similarly use C(Φ) to denote the children of Φ. We assume each neuron's output takes a discrete set of possible values. Let π_Φ(φ|b) be the policy of the neuron Φ, defined as the probability that Φ = φ conditioned on the neuron's inputs B(Φ) = b; that is, π_Φ(φ|b) ≐ P(Φ = φ | B(Φ) = b). π_Φ(φ|b) is a parameterized function with a set of learnable parameters θ_Φ. For simplicity, assume the network has a single output node Φ̂.

Beyond Backpropagation Workshop at NeurIPS 2020.

We focus on a contextual bandit setting, where the network selects an action conditioned on the input with the aim of maximizing an unknown reward function; the approach generalizes straightforwardly to other settings, such as supervised learning. Define the reward function R(Φ₀, Φ̂, ε), where ε is unobserved i.i.d. noise and Φ₀ is a random variable corresponding to the network input. We will also use R to represent the random variable corresponding to the output of the reward function. At each timestep the network receives an i.i.d. input Φ₀. The goal is to tune the network parameters to make E[R] as high as possible.
Towards this, we are interested in constructing an unbiased estimator of the gradient ∂E[R]/∂θ_Φ for the parameters of each unit, such that we can update the parameters according to the estimator to improve the reward in expectation.

Directly computing the gradient of the output probability with respect to the parameters for a given input, as we do in backpropagation, is intractable for most stochastic networks: computing the output probability P(Φ̂ | Φ₀) (or the associated gradient) would require marginalizing over all possible output values of each node. Instead, we can define a local REINFORCE estimator as

Ĝ^RE_Φ = (∂ log π_Φ(Φ|B(Φ)) / ∂θ_Φ) R.

Ĝ^RE_Φ is an unbiased estimator of the gradient ∂E[R]/∂θ_Φ (see Appendix A). However, Ĝ^RE_Φ tends to have significant variance.

Hindsight Network Credit Assignment.
We now come to the main contribution of this report: HNCA, a variance-reduced policy gradient estimator for learning in stochastic neural networks. HNCA works by exploiting the causal structure of the stochastic compute graph to assign credit to each node's output based on how it impacts the output of its immediate children.

To aid in deriving HNCA, define the action value for a particular neuron as follows:

Q_Φ(φ, b) = E[R | B(Φ) = b, Φ = φ].  (1)

HNCA involves constructing a stochastic estimator of Q_Φ(φ, b); that is, a random variable Q̂_Φ(φ) such that

E[Q̂_Φ(φ) | B(Φ) = b] = Q_Φ(φ, b).  (2)

Given any such estimator, we can construct a policy gradient estimator

Ĝ_Φ = Σ_φ (∂π_Φ(φ|B(Φ)) / ∂θ_Φ) Q̂_Φ(φ).  (3)

Any such estimator is unbiased in the sense that E[Ĝ_Φ] = ∂E[R]/∂θ_Φ (see Appendix B). Choosing Q̂_Φ(φ) = Q̂^RE_Φ(φ) ≐ (1(Φ=φ) / π_Φ(φ|B(Φ))) R recovers the local REINFORCE estimator Ĝ^RE_Φ. Before deriving HNCA, we introduce an additional assumption.

Assumption 1.
Parents of children are not descendants. More precisely, let C⁺(Φ) be the set of descendants of Φ; for every node in the network we assume that B(C(Φ)) ∩ C⁺(Φ) = ∅.

This assumption guarantees that the parents of the children of Φ provide no additional information relevant to predicting Φ beyond the parents of Φ themselves. It holds for typical feedforward networks as well as more general architectures; we leave open the question of whether it can be relaxed. The figures in Appendix C show a simple example where Assumption 1 holds and another where it is violated.

With this assumption in place, for any non-output node (Φ ≠ Φ̂), we can rewrite Q_Φ(φ, b) as follows:

Q_Φ(φ, b)
 (a) = E[ (1(Φ=φ) / π_Φ(φ|B(Φ))) R | B(Φ) = b ]
 (b) = E[ E[ (1(Φ=φ) / π_Φ(φ|B(Φ))) R | C(Φ), B(Φ), B(C(Φ))\Φ, R ] | B(Φ) = b ]
     = E[ (E[1(Φ=φ) | C(Φ), B(Φ), B(C(Φ))\Φ, R] / π_Φ(φ|B(Φ))) R | B(Φ) = b ]
 (c) = E[ (E[1(Φ=φ) | C(Φ), B(Φ), B(C(Φ))\Φ] / π_Φ(φ|B(Φ))) R | B(Φ) = b ]
     = E[ (P(Φ=φ | C(Φ), B(Φ), B(C(Φ))\Φ) / π_Φ(φ|B(Φ))) R | B(Φ) = b ]
 (d) = E[ (P(Φ=φ | C(Φ), B(Φ), B(C(Φ))\Φ) / P(Φ=φ | B(C(Φ))\Φ, B(Φ))) R | B(Φ) = b ]
 (e) = E[ (P(C(Φ) | B(Φ), B(C(Φ))\Φ, Φ=φ) / P(C(Φ) | B(C(Φ))\Φ, B(Φ))) R | B(Φ) = b ]
 (f) = E[ (P(C(Φ) | B(C(Φ))\Φ, Φ=φ) / P(C(Φ) | B(C(Φ))\Φ, B(Φ))) R | B(Φ) = b ],  (4)

where (a) uses Q̂^RE_Φ; (b) uses the law of total expectation; (c) follows from the fact that the conditioning variables besides R form a Markov blanket (Pearl, 2014) for Φ, hence we can drop R knowing it provides no additional information; (d) follows from Assumption 1; (e) applies Bayes' rule; and (f) follows from the fact that B(C(Φ)) separates C(Φ) from B(Φ). (All expectations and probabilities are taken with respect to all random variables in the network, the input, and the reward function. For Bernoulli neurons, the local REINFORCE estimator is equivalent to the unbiased estimator proposed by Bengio et al. (2013).)

The final expression gives rise to what we call the HNCA action-value estimator:

Q̂^HNCA_Φ(φ) = (P(C(Φ) | B(C(Φ))\Φ, Φ=φ) / P(C(Φ) | B(C(Φ))\Φ, B(Φ))) R.  (5)

Effectively, HNCA assigns credit to a particular action choice φ based on the relative likelihood of its children's actions had φ been chosen, independent of the actual value of Φ. This provides a variance reduction, because an action only receives credit relative to other choices if it makes a significant difference further downstream. Note that Equation 4 applies only to nodes for which C(Φ) ≠ ∅, and thus excludes the output node Φ̂.
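As a concrete illustration of the estimator in Equation 5, the following sketch (all numbers and weights hypothetical) computes both the local REINFORCE and the HNCA action-value estimates for a single Bernoulli neuron with two Bernoulli children:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Hypothetical setup: one Bernoulli neuron Phi with fire probability p,
# feeding two Bernoulli children whose logits depend on Phi's output.
p = 0.7                            # pi_Phi(1 | B(Phi))
w_child = np.array([2.0, -1.0])    # each child's weight on Phi
b_child = np.array([0.5, 0.0])     # contribution of the children's other parents

def child_fire_probs(phi_val):
    return sigmoid(w_child * phi_val + b_child)

phi = rng.binomial(1, p)                    # forward sample of Phi
c = rng.binomial(1, child_fire_probs(phi))  # children's sampled outputs
R = 1.0                                     # observed reward (placeholder)

def prob_children_given(phi_val):
    # P(C(Phi) | B(C(Phi)) \ Phi, Phi = phi_val): under Assumption 1 the
    # children are conditionally independent, so probabilities multiply.
    q = child_fire_probs(phi_val)
    return float(np.prod(np.where(c == 1, q, 1.0 - q)))

q1, q0 = prob_children_given(1), prob_children_given(0)
q_bar = p * q1 + (1.0 - p) * q0    # denominator: Phi marginalized out

# Local REINFORCE action-value estimate: nonzero only at the sampled phi.
Q_re = {a: float(phi == a) / (p if a == 1 else 1.0 - p) * R for a in (0, 1)}
# HNCA action-value estimate: defined for every phi, sampled or not.
Q_hnca = {1: q1 / q_bar * R, 0: q0 / q_bar * R}
```

Note that both estimators satisfy p·Q̂(1) + (1−p)·Q̂(0) = R by construction, but the HNCA estimate varies only through the children's outputs, not through the sampled value of Φ itself.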
For the output node Φ̂ we simply use the REINFORCE estimator Q̂^RE_Φ(φ).

As stated in the following theorem, the HNCA action-value estimator always has variance lower than or equal to that of the local REINFORCE estimator. Let V denote variance.

Theorem 1. V(Q̂^HNCA_Φ(φ) | B(Φ) = b) ≤ V(Q̂^RE_Φ(φ) | B(Φ) = b).

Theorem 1 follows from the law of total variance; the proof is available in Appendix D. In Appendix E, we further analyze how this affects the variance of Ĝ^HNCA_Φ ≐ Σ_φ (∂π_Φ(φ|B(Φ)) / ∂θ_Φ) Q̂^HNCA_Φ(φ); a key result is that, for Bernoulli neurons, which we use in our experiments, we also have V(Ĝ^HNCA_Φ | B(Φ) = b) ≤ V(Ĝ^RE_Φ | B(Φ) = b).

A Computationally Efficient Implementation.
We provide a computationally efficient pseudo-code implementation of HNCA for a network consisting of Bernoulli hidden neurons with a softmax output neuron. The implementation is similar to backprop in that each node receives information from its parents in the forward pass, and passes information back to its parents in a backward pass to compute a gradient estimate. Like backprop, the required compute is proportional to the number of edges in the graph. See Appendix F for pseudo-code and further details on this implementation.
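To make the message passing concrete, here is a small sketch (hypothetical sizes and weights) of the counterfactual-probability trick detailed in Appendix F: for a Bernoulli neuron with logit l = θ·x + b, the fire probability with parent i counterfactually fixed to 1 or 0 is recovered by adjusting the single term θ[i]·x[i], so all counterfactuals cost time linear in the number of parents:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
rng = np.random.default_rng(1)

# Hypothetical Bernoulli hidden neuron with n binary parents.
n = 5
theta = rng.normal(size=n)                        # incoming weights
bias = 0.3
x = rng.binomial(1, 0.5, size=n).astype(float)    # parents' 0/1 outputs

l = theta @ x + bias        # logit from the actual forward pass
# Counterfactual logits had parent i fired (1) or not (0), reusing l:
l1 = l + theta * (1.0 - x)  # adds theta[i] only where x[i] was 0
l0 = l - theta * x          # removes theta[i] only where x[i] was 1
p1, p0 = sigmoid(l1), sigmoid(l0)

# Naive O(n^2) recomputation agrees with the O(n) shortcut:
for i in range(n):
    x1, x0 = x.copy(), x.copy()
    x1[i], x0[i] = 1.0, 0.0
    assert np.isclose(p1[i], sigmoid(theta @ x1 + bias))
    assert np.isclose(p0[i], sigmoid(theta @ x0 + bias))
```

In the full algorithm these p1 and p0 vectors are exactly the messages a Bernoulli neuron passes back to its parents.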
Experiments.
We evaluate HNCA on MNIST (LeCun et al., 2010), formulated as a contextual bandit. At each training step, the model outputs a class and receives a reward of 1 if the class is correct and 0 otherwise.

Our architecture consists of a stochastic feedforward neural network with 1, 2, or 3 hidden layers, each with 64 Bernoulli nodes, followed by a softmax output layer. For training, we use batched versions of the algorithms in Appendix F with a batch size of 16. We map the output of the Bernoulli nodes to one or negative one, instead of one or zero, as we found this significantly improved performance. For completeness, results using a zero-one mapping can be found in Appendix G.

As a baseline, we use the REINFORCE estimator with the same architecture. As two further baselines, we use backpropagation through analogous deterministic architectures with tanh and ReLU activations, trained using REINFORCE with the policy gradient of the network taken as a whole. All models, both deterministic and stochastic, use Glorot initialization (Glorot & Bengio, 2010), and all are trained for a total of 100 epochs. For each architecture we sweep a wide range of learning-rates in powers of 2.

[Figure 1: Learning curves and learning-rate sensitivity for HNCA and baselines on contextual bandit MNIST. Panels (a)-(c) show learning-rate sensitivity curves and panels (d)-(f) show learning curves at the best learning-rate, for 1, 2, and 3 hidden layers respectively. Green curves are HNCA, red are REINFORCE, blue are backprop with tanh activations, and orange are backprop with ReLU activations. The architecture is a small neural network with 64 units per layer with different numbers of hidden layers. All plots show 10 random seeds with error bars showing 95% confidence intervals. In order to show the fastest convergence among settings where final performance is similar, the best learning-rate is taken to be the highest learning-rate that is no more than one standard error from the learning-rate that gives the highest final accuracy.]

The results are shown in Figure 1. As expected, we find that HNCA generally improves on REINFORCE, with the reduced variance facilitating stability at larger learning-rates and leading to significantly faster learning. We also found that HNCA was generally competitive with the deterministic network with ReLU activations. However, with tanh activations the deterministic network drastically outperforms all other tested approaches in this domain. This is a rather surprising result given the conventional wisdom that ReLU activations are superior. We speculate that this may be due to tanh allowing more flexible exploration, offering a benefit in the contextual bandit setting relative to ReLU, where saturation may lead to premature convergence. However, given that this is not the main focus of this work, we do not investigate it further. It would likely be possible to improve the performance of each approach through regularization, increased layer width, and other enhancements. Here, we choose to demonstrate the relative benefit of HNCA over REINFORCE in a simple setting.
Conclusion.
We introduce HNCA, a novel technique for computing unbiased gradient estimates for stochastic neural networks. HNCA works by estimating the gradient for each neuron based on how it contributes to the output of its immediate children. We prove that HNCA provides lower variance gradient estimates compared to REINFORCE, and provide a computationally efficient pseudo-code implementation with complexity on the same order as backpropagation. We also experimentally demonstrate the benefit of HNCA over REINFORCE on a contextual bandit version of MNIST.

Note that, unlike backpropagation, HNCA does not propagate credit information multiple steps: it only leverages information on how a particular node impacts its immediate children to produce a variance-reduced gradient estimate. Extending this to propagate information multiple steps, while maintaining computational efficiency, appears nontrivial, as a node's impact on descendants multiple steps away depends on intermediate nodes in a complex way. Nonetheless, there may be ways to extend this approach to propagate limited information multiple steps, and we see this as a very interesting direction for future work.

References
Bengio, Y., Léonard, N., & Courville, A. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256.

Harutyunyan, A., Dabney, W., Mesnard, T., Azar, M. G., Piot, B., Heess, N., van Hasselt, H. P., Wayne, G., Singh, S., Precup, D., et al. (2019). Hindsight credit assignment. In Advances in Neural Information Processing Systems, 12488–12497.

Jang, E., Gu, S., & Poole, B. (2016). Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144.

LeCun, Y., Cortes, C., & Burges, C. (2010). MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist

Maddison, C. J., Mnih, A., & Teh, Y. W. (2016). The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.

Pearl, J. (2014). Probabilistic reasoning in intelligent systems: Networks of plausible inference. Elsevier.

Schulman, J., Heess, N., Weber, T., & Abbeel, P. (2015). Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, 3528–3536.

Tang, C., & Salakhutdinov, R. R. (2013). Learning stochastic feedforward neural networks. In Advances in Neural Information Processing Systems, 530–538.

Tucker, G., Mnih, A., Maddison, C. J., Lawson, J., & Sohl-Dickstein, J. (2017). REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, 2627–2636.

Weber, T., Heess, N., Buesing, L., & Silver, D. (2019). Credit assignment techniques in stochastic computation graphs. arXiv preprint arXiv:1901.01761.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 229–256.

A The Local REINFORCE Estimator is Unbiased
Here we show that the local REINFORCE estimator Ĝ^RE_Φ = (∂ log π_Φ(Φ|B(Φ)) / ∂θ_Φ) R is an unbiased estimator of the gradient of the expected reward with respect to θ_Φ:

E[Ĝ^RE_Φ]
 = E[ (∂ log π_Φ(Φ|B(Φ)) / ∂θ_Φ) R ]
 = Σ_b P(B(Φ)=b) E[ (∂ log π_Φ(Φ|B(Φ)) / ∂θ_Φ) R | B(Φ) = b ]
 = Σ_b P(B(Φ)=b) Σ_φ π_Φ(φ|b) (∂ log π_Φ(φ|b) / ∂θ_Φ) E[R | B(Φ)=b, Φ=φ]
 = Σ_b P(B(Φ)=b) Σ_φ (∂π_Φ(φ|b) / ∂θ_Φ) E[R | B(Φ)=b, Φ=φ]
 (a) = (∂/∂θ_Φ) Σ_b P(B(Φ)=b) Σ_φ π_Φ(φ|b) E[R | B(Φ)=b, Φ=φ]
 = ∂E[R]/∂θ_Φ,

where (a) follows from the fact that the probability of the parents of Φ, P(B(Φ) = b), does not depend on the parameters θ_Φ controlling Φ itself, nor does the expected reward once conditioned on Φ.

B Any Gradient Estimator Based on an Action-Value Estimator Obeying Equation 2 is Unbiased
Assume we have access to an action-value estimator Q̂_Φ(φ) obeying Equation 2, and construct a gradient estimator Ĝ_Φ = Σ_φ (∂π_Φ(φ|B(Φ)) / ∂θ_Φ) Q̂_Φ(φ) as specified by Equation 3. Then we can rewrite E[Ĝ_Φ] as follows:

E[Ĝ_Φ]
 = E[ Σ_φ (∂π_Φ(φ|B(Φ)) / ∂θ_Φ) Q̂_Φ(φ) ]
 = Σ_b P(B(Φ)=b) Σ_φ (∂π_Φ(φ|b) / ∂θ_Φ) E[ Q̂_Φ(φ) | B(Φ) = b ]
 (a) = Σ_b P(B(Φ)=b) Σ_φ (∂π_Φ(φ|b) / ∂θ_Φ) E[R | B(Φ)=b, Φ=φ]
 = (∂/∂θ_Φ) Σ_b P(B(Φ)=b) Σ_φ π_Φ(φ|b) E[R | B(Φ)=b, Φ=φ]
 = ∂E[R]/∂θ_Φ,

where (a) follows from the assumption that Q̂_Φ(φ) obeys Equation 2.

C Examples where Assumption 1 Holds and is Violated
Figure 2 illustrates the meaning of Assumption 1 by showing a simple example where it holds and another where it does not.

[Figure 2: (a) An example where Assumption 1 holds. (b) An example where Assumption 1 fails to hold. Each panel marks Φ, its descendants C⁺(Φ), and the parents of its children B(C(Φ)).]

D The HNCA Action-Value Estimator has Lower Variance than the REINFORCE Estimator
Here we provide the proof of Theorem 1.
Theorem 1. V(Q̂^HNCA_Φ(φ) | B(Φ) = b) ≤ V(Q̂^RE_Φ(φ) | B(Φ) = b).

Proof.
The proof follows from the law of total variance. First, note that by inverting up to step (b) in Equation 4 we know:

Q̂^HNCA_Φ(φ) = E[ (1(Φ=φ) / π_Φ(φ|B(Φ))) R | C(Φ), B(Φ), B(C(Φ))\Φ, R ].

Now apply the law of total variance to rewrite the variance of the REINFORCE estimator as follows:

V(Q̂^RE_Φ(φ) | B(Φ) = b)
 = V( (1(Φ=φ) / π_Φ(φ|B(Φ))) R | B(Φ) = b )
 = E[ V( (1(Φ=φ) / π_Φ(φ|B(Φ))) R | C(Φ), B(Φ), B(C(Φ))\Φ, R ) | B(Φ) = b ]
   + V( E[ (1(Φ=φ) / π_Φ(φ|B(Φ))) R | C(Φ), B(Φ), B(C(Φ))\Φ, R ] | B(Φ) = b )
 ≥ V( E[ (1(Φ=φ) / π_Φ(φ|B(Φ))) R | C(Φ), B(Φ), B(C(Φ))\Φ, R ] | B(Φ) = b )
 = V(Q̂^HNCA_Φ(φ) | B(Φ) = b).

E Variance of the HNCA Gradient Estimator
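Theorem 1 can also be illustrated numerically. The sketch below (hypothetical parameters) uses a single Bernoulli neuron feeding one Bernoulli child whose output is taken as the reward; both estimators agree in expectation, but the HNCA estimate has visibly smaller sample variance:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
rng = np.random.default_rng(2)

# Hypothetical toy chain: Phi ~ Bernoulli(p) feeds a single Bernoulli child
# C with fire probability sigmoid(w*Phi + b); the reward is simply R = C.
p, w, b = 0.6, 2.0, -1.0
N = 200_000

phi = rng.binomial(1, p, size=N)
c = rng.binomial(1, sigmoid(w * phi + b))
R = c.astype(float)

# Both are unbiased estimates of Q_Phi(1, b) = E[R | Phi = 1].
q_re = (phi == 1) / p * R                          # local REINFORCE

# HNCA: likelihood of the child's observed output had Phi been 1, relative
# to its marginal likelihood with Phi integrated out.
p_c_given_1 = np.where(c == 1, sigmoid(w + b), 1.0 - sigmoid(w + b))
p_c_given_0 = np.where(c == 1, sigmoid(b), 1.0 - sigmoid(b))
q_hnca = p_c_given_1 / (p * p_c_given_1 + (1.0 - p) * p_c_given_0) * R

print(q_re.mean(), q_hnca.mean())   # both close to sigmoid(w + b)
print(q_re.var(), q_hnca.var())     # HNCA variance is strictly smaller here
```

The variance gap arises because, whenever the child's output is observed, the HNCA estimate no longer depends on the sampled value of Φ, which is exactly the conditioning step in the proof above.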
Theorem 1 shows that the HNCA action-value estimator Q̂^HNCA_Φ(φ) has lower variance than Q̂^RE_Φ(φ). In this section we discuss how this impacts the variance of the associated gradient estimator Ĝ_Φ = Σ_φ (∂π_Φ(φ|B(Φ)) / ∂θ_Φ) Q̂_Φ(φ). We can write this variance using the law of total variance as follows:

V(Ĝ_Φ) = E[ V(Ĝ_Φ | B(Φ)) ] + V( E[Ĝ_Φ | B(Φ)] ).

E[Ĝ_Φ | B(Φ)] is the same for both estimators, so we focus on E[V(Ĝ_Φ | B(Φ))]. Let C denote covariance:

E[ V(Ĝ_Φ | B(Φ)) ] = Σ_b P(B(Φ)=b) ( Σ_φ (∂π_Φ(φ|B(Φ))/∂θ_Φ)² V(Q̂_Φ(φ) | B(Φ))
   + Σ_φ Σ_{φ'≠φ} (∂π_Φ(φ|B(Φ))/∂θ_Φ)(∂π_Φ(φ'|B(Φ))/∂θ_Φ) C(Q̂_Φ(φ), Q̂_Φ(φ') | B(Φ)) ).  (6)

We have already shown that V(Q̂_Φ(φ) | B(Φ)) is lower for Q̂^HNCA_Φ(φ). Let us now look at C(Q̂_Φ(φ), Q̂_Φ(φ') | B(Φ)). For Q̂^RE_Φ, only one φ takes a nonzero value at a time, hence the covariance can be expressed as:

C(Q̂^RE_Φ(φ), Q̂^RE_Φ(φ') | B(Φ))
 = −E[Q̂^RE_Φ(φ) | B(Φ)] E[Q̂^RE_Φ(φ') | B(Φ)]
 = −E[R | B(Φ), Φ=φ] E[R | B(Φ), Φ=φ'].

For Q̂^HNCA_Φ(φ) we can express the covariance as follows:

C(Q̂^HNCA_Φ(φ), Q̂^HNCA_Φ(φ') | B(Φ))
 = E[Q̂^HNCA_Φ(φ) Q̂^HNCA_Φ(φ') | B(Φ)] − E[Q̂^HNCA_Φ(φ) | B(Φ)] E[Q̂^HNCA_Φ(φ') | B(Φ)]
 = E[ (P(C(Φ) | B(C(Φ))\Φ, Φ=φ) P(C(Φ) | B(C(Φ))\Φ, Φ=φ') / P(C(Φ) | B(C(Φ))\Φ, B(Φ))²) R² | B(Φ) ]
   − E[R | B(Φ), Φ=φ] E[R | B(Φ), Φ=φ'].

Note that the first term in this expression is always nonnegative, while the second is equal to the covariance expression for Q̂^RE_Φ. Thus C(Q̂^HNCA_Φ(φ), Q̂^HNCA_Φ(φ') | B(Φ)) ≥ C(Q̂^RE_Φ(φ), Q̂^RE_Φ(φ') | B(Φ)) for all φ, φ'. Putting this together with Equation 6, we can conclude that V(Ĝ^HNCA_Φ) ≤ V(Ĝ^RE_Φ) as long as (∂π_Φ(φ|B(Φ))/∂θ_Φ)(∂π_Φ(φ'|B(Φ))/∂θ_Φ) is always negative. This is the case for Bernoulli neurons, since changing any given parameter increases the probability of one action and decreases the probability of the other. Here we use Bernoulli neurons in all but the final layer, where we cannot apply Ĝ^HNCA_Φ anyway since C(Φ̂) = ∅. For more complex parameterizations (including softmax) this will not hold exactly. However, roughly speaking, since the probabilities π_Φ(φ|B(Φ)) must still sum to one over all φ, the gradients with respect to different actions will tend to be negatively correlated, and thus Ĝ^HNCA_Φ will tend to have lower variance.

F Details on Computationally Efficient Implementation
The computational complexity of HNCA depends on how difficult it is to compute P(C(Φ) | B(C(Φ))\Φ, Φ=φ) and P(C(Φ) | B(C(Φ))\Φ, B(Φ)) in Equation 5. When the individual neurons use a sufficiently simple parameterization, the method can be implemented as a backpropagation-like message-passing procedure. In this case, the overall computation is proportional to the number of connections, as in backpropagation itself.

In our experiments, we apply HNCA to solve a classification task formulated as a contextual bandit (the agent must guess the class from the input and receives a reward of 1 only if the guess is correct; otherwise it does not receive the true class).

Our stochastic neural network model consists of a number of hidden layers, wherein each neuron outputs according to a Bernoulli distribution. The policy of each Bernoulli neuron is parametrized as a linear function of its inputs, followed by a sigmoid activation. The policy of the output layer is a distribution over class labels, parameterized as a softmax. We now separately highlight the implementations for the softmax output and Bernoulli hidden nodes.

Algorithm 1: HNCA algorithm for a Bernoulli hidden neuron

 1: Receive x⃗ from parents
 2: l = θ⃗ · x⃗ + b
 3: p = σ(l)
 4: φ ∼ Bernoulli(p)
 5: Pass φ to children
 6: Receive q⃗₁, q⃗₀, R from children
 7: q₁ = Π_i q⃗₁[i]
 8: q₀ = Π_i q⃗₀[i]
 9: q̄ = p·q₁ + (1 − p)·q₀
10: l⃗₁ = l + θ⃗ ⊙ (1 − x⃗)
11: l⃗₀ = l − θ⃗ ⊙ x⃗
12: p⃗₁ = σ(l⃗₁)
13: p⃗₀ = σ(l⃗₀)
14: Pass p⃗₁, p⃗₀, R to parents
15: θ⃗ = θ⃗ + α·σ′(l)·x⃗·((q₁ − q₀)/q̄)·R
16: b = b + α·σ′(l)·((q₁ − q₀)/q̄)·R

Algorithm 2: HNCA algorithm for a softmax output neuron

 1: Receive x⃗ from parents
 2: l⃗ = Θx⃗ + b⃗
 3: p⃗ = exp(l⃗) / Σ_i exp(l⃗[i])
 4: Output φ ∼ p⃗
 5: Receive R from environment
 6: for all i do
 7:   L₁[i] = l⃗[i] + Θ[i] ⊙ (1 − x⃗)
 8:   L₀[i] = l⃗[i] − Θ[i] ⊙ x⃗
 9: end
10: p⃗₁ = exp(L₁[φ]) / Σ_i exp(L₁[i])
11: p⃗₀ = exp(L₀[φ]) / Σ_i exp(L₀[i])
12: Pass p⃗₁, p⃗₀, R to parents
13: for all i do
14:   Θ[i] = Θ[i] + α·x⃗·(1(φ=i) − p⃗[i])·R
15:   b⃗[i] = b⃗[i] + α·(1(φ=i) − p⃗[i])·R
16: end

Algorithm 1 shows the implementation of HNCA for the Bernoulli hidden nodes. The pseudo-code provided is for Bernoulli nodes with a zero-one mapping, but is straightforward to modify to use negative one and one instead, as we do in our main experiments. Lines 1-5 implement the forward pass, which simply takes input from the parents, uses it to compute the fire probability p, and samples φ ∈ {0, 1}. The backward pass receives two vectors of probabilities q⃗₁ and q⃗₀, each with one element for each child of the node. A given element represents q⃗₁/₀[i] = P(C_i | B(C_i)\Φ, Φ = 1/0) for a given child C_i ∈ C(Φ). Lines 7 and 8 take the product of all these child probabilities to compute P(C(Φ) | B(C(Φ))\Φ, Φ = 1/0). Note that computing P(C(Φ) | B(C(Φ))\Φ, Φ = 1/0) in this way is made possible by Assumption 1.
Due to Assumption 1, no C_i ∈ C(Φ) can influence another C_j ∈ C(Φ) via a downstream path. Thus P(C(Φ) | B(C(Φ))\Φ, Φ = 1/0) = Π_i P(C_i | B(C_i)\Φ, Φ = 1/0). Line 9 uses these products along with the fire probability to compute q̄ = P(C(Φ) | B(C(Φ))\Φ, B(Φ)).

Lines 10-13 use the already computed logit l to efficiently compute vectors of probabilities p⃗₁ and p⃗₀, where each element corresponds to a counterfactual fire probability if all else were the same but a given parent's value were fixed to 1 or 0. Here, ⊙ represents the element-wise product. Note that computing the counterfactuals this way only requires compute time on the order of the number of parents (while naively computing each counterfactual probability from scratch would require time on the order of the number of parents squared). Line 14 passes the necessary information to the node's parents. Lines 15 and 16 finally update the parameters using Ĝ^HNCA_Φ with learning-rate hyperparameter α.

Algorithm 2 shows the implementation for the softmax output node. Note that the output node itself uses the REINFORCE estimator in its update, as it has no children. Nonetheless, the output node still needs to provide information to its parents, which use HNCA. Lines 1-4 implement the forward pass, in this case producing an integer φ drawn from the possible classes. Lines 6-11 compute counterfactual probabilities of the given output class conditional on fixing the value of each parent. Note that Θ[i] refers to the ith row of the matrix Θ. In this case, computing these counterfactual probabilities requires computation on the order of the number of parents times the number of possible classes. Line 12 passes the necessary information back to the parents. Lines 13-16 update the parameters according to Ĝ^RE_Φ.

Overall, the output node requires computation proportional to the number of parents times the number of classes, which is the same as the number of parameters in that node.
Similarly, the hidden nodes each require computation proportional to their number of parents, which again is the same as the number of parameters. Thus the overall computation required for HNCA is on the order of the number of parameters, the same order as doing a forward pass, and the same order as backpropagation.

It's worth noting that, in principle, HNCA is also more parallelizable than backpropagation. Since no information from a node's children is needed to compute the information passed to its parents, the backward pass could be done in parallel across all layers. However, this can also be seen as a limitation of the proposed approach, since the ability to condition on information from further along the graph could lead to further variance reduction.

G Experiments with Zero-One Output Mapping
In the main body of the paper, our experiments mapped the binary output of Bernoulli neurons to one or negative one. We found that this worked much better than mapping to zero or one. In Figure 3, we show the results of REINFORCE and HNCA with a zero-one mapping. We also use a sigmoid activation instead of tanh for the backpropagation baseline, as this offers a closer analog to the stochastic zero-one mapping. In each case, the use of a zero-one mapping significantly hurt performance. Replacing tanh with sigmoid for the backpropagation baseline makes the difference between significantly outperforming ReLU and barely learning in the 2 and 3 hidden layer cases.

[Figure 3: Learning-rate sensitivity curves (panels a-c) and learning curves at the best learning-rate (panels d-f) for 1, 2, and 3 hidden layers, using the zero-one output mapping.]