Some Theoretical Properties of a Network of Discretely Firing Neurons
Stephen Luttrell
September 24, 2018
Abstract:
The problem of optimising a network of discretely firing neurons is addressed. An objective function is introduced which measures the average number of bits that are needed for the network to encode its state. When this is minimised, it is shown that this leads to a number of results, such as topographic mappings, piecewise linear dependence on the input of the probability of a neuron firing, and factorial encoder networks.
1 Introduction

In this paper the problem of optimising the firing characteristics of a network of discretely firing neurons will be considered. The approach adopted will not be based on any particular model of how real neurons operate, but will focus on theoretically analysing some of the information processing capabilities of a layered network of units (which happen to be called neurons). Ideal network behaviour is derived by choosing the ideal neural properties that minimise an information theoretic objective function which specifies the number of bits required by the network to encode the state of its layers. This is done in preference to assuming a highly specific neural behaviour at the outset, followed by optimisation of a few remaining parameters such as weight and bias values.

Why use an objective function in the first place? An objective function is a very convenient starting point (a set of "axioms", as it were), from which everything else can, in principle, be derived (as "theorems", as it were). An objective function has the same status as a model, which may be falsified should some counterevidence be discovered. The objective function used in this paper is the simplest that is consistent with predicting a number of non-trivial results, such as topographic mappings, and factorial encoders (which are discussed in this paper). However, it does not include any temporal information, nor any biological plausibility constraints (other than the fact that the network is assumed to be layered). More complicated objective functions will be the subject of future publications.

In section 2 an objective function is introduced, and its connection with discretely firing neural networks is derived. In section 3 some examples are presented which show how this theory of discretely firing neural networks leads to some non-trivial results.

∗ This paper was submitted to the Special Issue of Neurocomputing on Theoretical Analysis of Real-Valued Function Classes on 19 January 1998.
It was not accepted for publication, but it underpins several subsequently published papers.

2 Theory
In this section a theory of discretely firing neural networks is developed. Section 2.1 introduces the objective function for optimising an encoder, and section 2.2 shows how this can be applied to the problem of optimising a discretely firing neural network.
2.1 Objective function for optimising an encoder

The inspiration for the approach that is used here is the minimum description length (MDL) method [5]. In this paper, a training set vector (which is unlabelled) will be denoted as x, a vector of statistics which are stochastically derived from x will be denoted as y, and their joint probability density function (PDF) will be denoted as Pr(x, y). The problem is to learn the functional form of Pr(x, y), so that vectors (x, y) sampled from Pr(x, y) can be encoded using the minimum number of bits on average. It is unconventional to consider the problem of encoding (x, y), rather than x alone, but it turns out that this leads to many useful results.

Thus Pr(x, y) is approximated by a learnt model Q(x, y), in which case the average number of bits required to encode an (x, y) sampled from the PDF Pr(x, y) is given by the objective function D, which is defined as

D ≡ − ∫ dx Σ_y Pr(x, y) log Q(x, y)    (1)

Now split D into two contributions by using Pr(x, y) = Pr(x) Pr(y|x) and Q(x, y) = Q(x|y) Q(y):

D = − ∫ dx Pr(x) Σ_y Pr(y|x) log Q(x|y) − Σ_y Pr(y) log Q(y)    (2)

The first term is the cost (i.e. the average number of bits), averaged over all possible values of y, of encoding an x sampled from Pr(x|y) using the model Q(x|y). This interpretation uses the fact that Pr(x) Pr(y|x) = Pr(y) Pr(x|y). The second term is the cost of encoding a y sampled from Pr(y) using the model Q(y). Together these two terms correspond to encoding y (the second term), then encoding x given that y is known (the first term).

The model Q(x, y) may be optimised so that it minimises D, and thus leads to the minimum cost of encoding (x, y) sampled from Pr(x, y). Ideally Q(x, y) = Pr(x, y), but in practice this is not possible because insufficient information is available to determine Pr(x, y) exactly (i.e. the training set does not contain an infinite number of (x, y) vectors).
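The decomposition in equation 2, and the fact that D is minimised by a perfectly learnt model, can be checked numerically on a small discrete toy problem (a sketch; the random distributions and the discretisation of x are purely illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution Pr(x, y) over 4 discrete x values and 3 y values.
P = rng.random((4, 3))
P /= P.sum()

# A mismatched model Q(x, y).
Q = rng.random((4, 3))
Q /= Q.sum()

def cost(P, Q):
    """Average number of nats needed to encode (x, y) ~ P using model Q."""
    return -np.sum(P * np.log(Q))

# Equation 2: the cost splits into an x-given-y part and a y part,
# using Q(x, y) = Q(x|y) Q(y).
Qy = Q.sum(axis=0)                 # Q(y)
Qx_given_y = Q / Qy                # Q(x|y)
D = cost(P, Q)
D_split = -np.sum(P * np.log(Qx_given_y)) - np.sum(P.sum(axis=0) * np.log(Qy))
assert np.isclose(D, D_split)

# D is minimised when the model matches the true distribution (Gibbs'
# inequality), where it equals the joint entropy of Pr(x, y).
assert cost(P, P) <= cost(P, Q)
```

The same decomposition holds for any factorisation of Q, which is what licenses the recognition/generative split used later in section 2.2.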
It is therefore necessary to introduce a parametric model Q(x, y), and to choose the values of the parameters so that D is minimised. If the number of parameters is small enough, and the training set is large enough, then the parameter values can be accurately determined.

A further simplification may be made if y can occupy many fewer states than x (given y) can, because then the cost of encoding y is much less than the cost of encoding x (given y) (i.e. the second and first terms in equation 2, respectively). In this case, it is a good approximation to retain only the first term in equation 2. This approximation becomes exact if Q(y) assigns equal probability to all states y, because then the second term is a constant. The reason for defining the objective function D as in equation 1, rather than defining it to be the first term of equation 2, is that equation 1 may be readily generalised to more complex systems, such as (x, y, z) in which Pr(x, y, z) = Pr(x) Pr(y|x) Pr(z|y), and so on. An example of this is given in section 3.1.

It is possible to relate the minimisation of D to the maximisation of the mutual information I between x and y. If the cost of encoding an x sampled from Pr(x) using the model Q(x) (i.e. −∫ dx Pr(x) log Q(x)) and the cost of encoding a y sampled from Pr(y) using the model Q(y) (i.e. −Σ_y Pr(y) log Q(y)) are both subtracted from D, then the result is −∫ dx Σ_y Pr(x, y) log (Q(x, y)/(Q(x) Q(y))). When Q(x, y) → Pr(x, y) this reduces to (minus) the mutual information I between x and y. Thus, if the cost of encoding the correlations between x and y is much greater than the cost of separately encoding x and y (i.e. the log(Q(x) Q(y)) term can be ignored in I), then D-minimisation approximates I-maximisation, which is another commonly used objective function.
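The relation between D-minimisation and I-maximisation can also be checked numerically: with a perfectly learnt model Q = Pr, subtracting the two marginal coding costs from D leaves exactly minus the mutual information (a sketch; the joint distribution is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy joint distribution Pr(x, y) (x discretised for illustration).
P = rng.random((5, 3))
P /= P.sum()
Px = P.sum(axis=1)      # Pr(x)
Py = P.sum(axis=0)      # Pr(y)

# With Q = Pr, D is the joint entropy; subtract the costs of separately
# coding x and y (their marginal entropies).
D = -np.sum(P * np.log(P))
residual = D + np.sum(Px * np.log(Px)) + np.sum(Py * np.log(Py))

# The residual is minus the mutual information I(x; y), which is non-negative.
I = np.sum(P * np.log(P / np.outer(Px, Py)))
assert np.isclose(residual, -I)
assert I >= 0.0
```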
2.2 Discretely firing neural networks

In order to apply the above coding theory results to a 2-layer discretely firing neural network, it is necessary to interpret x as a pattern of activity in the input layer, and y as the vector of locations in the output layer of a finite number of firing events. The objective function D is then the cost of using the model Q(x, y) of the network behaviour to encode the state (x, y) of the neural network (i.e. the input pattern and the locations of the firing events), which is sampled from the Pr(x, y) that describes the true network behaviour. For instance, a second neural network can be used solely for computing the model Q(x, y), which is then used to encode the state (x, y) of the above first neural network. Note that no temporal information is included in this analysis, so the input and output of the network is a static (x, y) vector containing no time variables.

These two neural networks can be combined into a single hybrid network, in which the machinery for computing the model Q(x, y) is interleaved with the neural network whose true behaviour is described by Pr(x, y). The notation of equation 2 can now be expressed in more neural terms, where Pr(y|x) is then a recognition model (i.e. bottom-up) and Q(x|y) is then a generative model (i.e. top-down), both of which live inside the same neural network. This is an unsupervised neural network, because it is trained with examples of only x-vectors, and the network uses its Pr(y|x) to stochastically generate a y from each x.

Now introduce a Gaussian parametric model Q(x|y)

Q(x|y) = (1/(√(2π) σ)^{dim x}) exp(−‖x − x′(y)‖²/(2σ²))    (3)

where x′(y) is the centroid of the Gaussian (given y), and σ is its standard deviation.
Also define a soft vector quantiser (VQ) objective function D_VQ as

D_VQ ≡ 2 ∫ dx Pr(x) Σ_y Pr(y|x) ‖x − x′(y)‖²    (4)

which is (twice) the average Euclidean reconstruction error that results when x is probabilistically encoded as y and then deterministically reconstructed as x′(y). These definitions of Q(x|y) and D_VQ allow D to be written as

D = (1/(4σ²)) D_VQ + dim x log(√(2π) σ) − Σ_y Pr(y) log Q(y)    (5)

where the second term is constant, and the third term may be ignored if y can occupy many fewer states than x (given y) can. The conditions under which the third term can be ignored are satisfied in a neural network, because x is an activity pattern, and y is the vector of locations of a finite number of firing events.

The first term of D is proportional to D_VQ, whose properties may be investigated using the techniques in [3]. Assume that there are n firing events, so that y = (y_1, y_2, ..., y_n); then the marginal probabilities of the symmetric part S[Pr(y|x)] of Pr(y|x) under interchange of its (y_1, y_2, ..., y_n) arguments are given by

Pr(y_1|x) ≡ Σ_{y_2,...,y_n=1}^M S[Pr(y_1, y_2, ..., y_n|x)]
Pr(y_1, y_2|x) ≡ Σ_{y_3,...,y_n=1}^M S[Pr(y_1, y_2, ..., y_n|x)]    (6)

where Pr(y_1|x) may be interpreted as the probability that the next firing event occurs on neuron y_1 (given x). Also define two useful integrals, D_1 and D_2, as

D_1 ≡ (2/n) ∫ dx Pr(x) Σ_{y=1}^M Pr(y|x) ‖x − x′(y)‖²
D_2 ≡ (2(n−1)/n) ∫ dx Pr(x) Σ_{y_1,y_2=1}^M Pr(y_1, y_2|x) (x − x′(y_1))·(x − x′(y_2))    (7)

where x′(y) is any vector function of a single firing location y (i.e. not necessarily related to the x′(y) of equation 4), to yield the following upper bound on D_VQ

D_VQ ≤ D_1 + D_2    (8)

where D_1 is non-negative but D_2 can have either sign, and the inequality reduces to an equality in the case n = 1.
Thus far nothing specific has been assumed about Pr(y|x), other than the fact that it contains no temporal information, so the upper bound on D_VQ applies whatever the form of Pr(y|x).

If the firing events occur independently of each other (given x), then Pr(y_1, y_2|x) = Pr(y_1|x) Pr(y_2|x), which allows D_2 to be redefined as

D_2 ≡ (2(n−1)/n) ∫ dx Pr(x) ‖x − Σ_{y=1}^M Pr(y|x) x′(y)‖²    (9)

where D_2 is now non-negative.

In summary, the assumptions which have been made in order to obtain the upper bound on D_VQ in equation 8, with D_1 as defined in equation 7 and D_2 as defined in equation 9, are: no temporal information is included in the network state vector (x, y), y can occupy many fewer states than x (given y) can, and firing events occur independently of each other (given x). In reality, there is always temporal information available, and the firing events are correlated with each other, so a more realistic objective function could be constructed. However, it is worthwhile to consider the consequences of equation 8, because it turns out that it leads to many non-trivial results.

The upper bound on D_VQ may be minimised with respect to all free parameters in order to obtain a least upper bound. In the case of independent firing events, the free parameters are the x′(y) and the Pr(y|x). These two types of parameter cannot be independently optimised, because they correspond to the generative and recognition models implicit in the neural network, respectively. A gradient descent algorithm for optimising the parameter values may readily be obtained by differentiating D_1 and D_2 with respect to x′(y) and Pr(y|x). Given the freedom to explore the entire space of functions Pr(y|x), the optimum neural firing behaviour (given x) can in principle be determined, and in certain simple cases this can be determined by inspection.
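The step from equation 7 to equation 9 rests on the identity that, for independent firing events, the double sum over (y_1, y_2) collapses to a squared norm (which is manifestly non-negative). This is easy to verify numerically (a sketch with arbitrary random values):

```python
import numpy as np

rng = np.random.default_rng(2)

M, d = 6, 4                    # number of neurons and input dimension
x = rng.normal(size=d)         # a single input vector
xp = rng.normal(size=(M, d))   # reference vectors x'(y)
p = rng.random(M)
p /= p.sum()                   # Pr(y|x), normalised over the M neurons

# Equation 7's integrand for D2, with Pr(y1, y2|x) = Pr(y1|x) Pr(y2|x):
double_sum = sum(p[y1] * p[y2] * np.dot(x - xp[y1], x - xp[y2])
                 for y1 in range(M) for y2 in range(M))

# Equation 9's integrand: || x - sum_y Pr(y|x) x'(y) ||^2
norm_form = np.sum((x - p @ xp) ** 2)

assert np.isclose(double_sum, norm_form)   # identical, and manifestly >= 0
```

The identity holds because Σ_y Pr(y|x) = 1, so the double sum factorises into the square of a single sum.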
If this option is not available, such as would be the case if biological constraints restricted the allowed functional form of Pr(y|x), then a limited search of the entire space of functions Pr(y|x) can be made by invoking a parametric model of the neural firing behaviour (given x).

3 Examples

In this section several examples are presented which illustrate the use of D_1 + D_2 in the optimisation of discretely firing neural networks. In section 3.1 a topographic mapping network is derived from D_1 alone, in section 3.2 the Pr(y|x) that minimises D_1 + D_2 is shown to be piecewise linear, and a solved example is presented. Finally, in section 3.3 a more detailed worked example is presented, which demonstrates how a factorial encoder emerges when D_1 + D_2 is minimised.

3.1 Topographic mappings

When an appropriate form of D_VQ is considered, it can be seen that it leads to a network that is closely related to Kohonen's topographic mapping network [1]. The derivation of a topographic mapping network that was given in [2] will now be recast in the framework of section 2.2. Thus, consider the objective function for a 3-layer network (x, y, z), in which (compare equation 1)

D = − ∫ dx Σ_z Pr(x, z) log Q(x, z)    (10)

where the cost of encoding y has been ignored, so that effectively only a 2-layer network (x, z) is visible, and D_VQ is given by

D_VQ = 2 ∫ dx Pr(x) Σ_{z=1}^{M_z} Pr(z|x) ‖x − x′(z)‖²    (11)

This expression for D_VQ explicitly involves (x, z), but it may be manipulated into a form that explicitly involves (x, y).
In order to simplify this calculation, D_VQ will be replaced by the equivalent objective function

D_VQ = ∫ dx Pr(x) Σ_{z=1}^{M_z} Pr(z|x) ∫ dx′ Pr(x′|z) ‖x − x′‖²    (12)

Now introduce dummy summations over y to obtain

D_VQ = ∫ dx Pr(x) Σ_{y=1}^{M_y} Pr(y|x) Σ_{z=1}^{M_z} Pr(z|y) Σ_{y′=1}^{M_y} Pr(y′|z) ∫ dx′ Pr(x′|y′) ‖x − x′‖²    (13)

and rearrange to obtain

D_VQ = ∫ dx Pr(x) Σ_{y′=1}^{M_y} Pr(y′|x) ∫ dx′ Pr(x′|y′) ‖x − x′‖²    (14)

where

Pr(y′|y) = Σ_{z=1}^{M_z} Pr(y′|z) Pr(z|y)
Pr(y′|x) = Σ_{y=1}^{M_y} Pr(y′|y) Pr(y|x)    (15)

which may be replaced by the equivalent objective function

D_VQ = 2 ∫ dx Pr(x) Σ_{y′=1}^{M_y} Pr(y′|x) ‖x − x′(y′)‖²    (16)

By manipulating D_VQ from the form it has in equation 11 to the form it has in equation 16, it becomes clear that optimisation of the (x, z) network involves optimisation of the (x, y′) subnetwork, for which an objective function can be written that uses the Pr(y′|x) defined in equation 15. When optimising the (x, y′) subnetwork, Pr(y′|y) takes account of the effect that z has on y.

If n = 1, so that only one firing event is observed, then D_VQ = D_1, and the optimum Pr(y|x) must ensure that y depends deterministically on x, so that Pr(y|x) = δ_{y,y(x)}, where y(x) is an encoding function that converts x into the index of the neuron that fires in response to x. This allows D_VQ to be simplified to

D_VQ = 2 ∫ dx Pr(x) Σ_{y′=1}^{M_y} Pr(y′|y(x)) ‖x − x′(y′)‖²    (17)

where Pr(y′|y(x)) is Pr(y′|y) with y replaced by y(x).
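Equations 15 and 17 can be made concrete with a small numerical sketch: take a deterministic Pr(z|y) that maps groups of y onto the same z, Bayes-invert it to obtain Pr(y′|z), compose the two to get the neighbourhood function Pr(y′|y), and then minimise equation 17 by gradient descent. The uniform prior on y, the group structure, the 1-d training data, and the learning rate are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# --- Equation 15: build a neighbourhood function Pr(y'|y) from Pr(z|y). ---
My, Mz = 6, 2
# Deterministic Pr(z|y): neurons {0,1,2} map to z = 0, neurons {3,4,5} to z = 1.
Pz_given_y = np.zeros((Mz, My))
Pz_given_y[0, :3] = 1.0
Pz_given_y[1, 3:] = 1.0
# Bayes-inverted Pr(y'|z) under a uniform prior on y (illustrative assumption).
Py_given_z = Pz_given_y.T / Pz_given_y.sum(axis=1)
Pnb = Py_given_z @ Pz_given_y          # Pr(y'|y) = sum_z Pr(y'|z) Pr(z|y)

# All y that can give rise to the same z become neighbours with equal weight,
# and y in different z-groups are not neighbours at all.
assert np.allclose(Pnb[:3, :3], 1.0 / 3.0) and np.allclose(Pnb[:3, 3:], 0.0)
assert np.allclose(Pnb.sum(axis=0), 1.0)   # each column is a distribution

# --- Equation 17 by stochastic gradient descent on the weights x'(y'). ---
eps = 0.1
xp = rng.normal(scale=0.1, size=My)    # weight vectors x'(y'), 1-d input
for _ in range(2000):
    x = rng.uniform(0.0, 1.0)          # illustrative uniform training data
    # Encode by minimising the neighbourhood-weighted squared error.
    y = np.argmin(Pnb.T @ (x - xp) ** 2)
    # Update every weight in proportion to its neighbourhood weight.
    xp += eps * Pnb[:, y] * (x - xp)

# The weights remain inside (a neighbourhood of) the data interval [0, 1].
assert xp.min() > -0.5 and xp.max() < 1.5
```

Because every y in a group shares the same z, the group's weights receive identical updates whenever the group wins, so the code for z is distributed evenly over its y-states.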
Note that if Pr(y′|y) = δ_{y,y′} then D_VQ reduces to the objective function 2 ∫ dx Pr(x) ‖x − x′(y(x))‖² for a standard vector quantiser (VQ).

The optimum y(x) is given by y(x) = arg min_y Σ_{y′=1}^{M_y} Pr(y′|y) ‖x − x′(y′)‖² (which is not quite the same as the y(x) = arg min_y ‖x − x′(y)‖ used by Kohonen in his topographic mapping neural network [1]), and a gradient descent algorithm for updating x′(y′) is x′(y′) → x′(y′) + ε Pr(y′|y(x)) (x − x′(y′)) (which is identical to Kohonen's prescription [1]). The Pr(y′|y) may thus be interpreted as the neighbourhood function, and the x′(y′) may be interpreted as the weight vectors, of a topographic mapping. Because all states y that can give rise to the same state z (as specified by Pr(z|y)) become neighbours (as specified by Pr(y′|y) in equation 15), Pr(y′|y) includes a much larger class of neighbourhood functions than has hitherto been used in topographic mapping neural networks.

Because of the principled way in which the topographic mapping objective function has been derived here, it is the preferred way to optimise topographic mapping networks. It also allows the objective function to be generalised to the case n > 1, where more than one firing event is observed.

3.2 Piecewise linear Pr(y|x)

The optimal
Pr(y|x) has some interesting properties that can be obtained by inspecting its stationarity condition. For instance, the Pr(y|x) that minimises D_1 + D_2 will be shown to be a piecewise linear function of x.

Thus, functionally differentiate D_1 + D_2 with respect to log Pr(y|x), where logarithmic differentiation implicitly imposes the constraint Pr(y|x) ≥ 0, and use a Lagrange multiplier term L ≡ ∫ dx′ λ(x′) Σ_{y′=1}^M Pr(y′|x′) to impose the normalisation constraint Σ_{y=1}^M Pr(y|x) = 1 for each x, to obtain

δ(D_1 + D_2 − L)/δ log Pr(y|x) = (2/n) Pr(x) Pr(y|x) ‖x − x′(y)‖² − (4(n−1)/n) Pr(x) Pr(y|x) x′(y) · (x − Σ_{y′=1}^M Pr(y′|x) x′(y′)) − λ(x) Pr(y|x)    (18)

The stationarity condition implies that Σ_{y=1}^M Pr(y|x) δ(D_1 + D_2 − L)/δ Pr(y|x) = 0, which may be used to determine the Lagrange multiplier function λ(x). When λ(x) is substituted back into the stationarity condition itself, it yields

(4/n) Pr(x) Pr(y|x) Σ_{y′=1}^M (Pr(y′|x) − δ_{y,y′}) x′(y′) · ( (1/2) x′(y′) − n x + (n−1) Σ_{y′′=1}^M Pr(y′′|x) x′(y′′) ) = 0    (19)

There are several classes of solution to this stationarity condition, corresponding to one (or more) of the three factors in equation 19 being zero.

1. Pr(x) = 0 (the first factor is zero). If the input PDF is zero at x, then nothing can be deduced about Pr(y|x), because there is no training data to explore the network's behaviour at this point.

2. Pr(y|x) = 0 (the second factor is zero). This factor arises from the differentiation with respect to log Pr(y|x), and it ensures that Pr(y|x) < 0 cannot be attained. The singularity in log Pr(y|x) when Pr(y|x) = 0 is what causes this solution to emerge.

3. Σ_{y′=1}^M (Pr(y′|x) − δ_{y,y′}) x′(y′) · ( · · · ) = 0 (the third factor is zero). The solution to this equation is a Pr(y|x) that has a piecewise linear dependence on x.
This result can be seen to be intuitively reasonable because D_1 + D_2 is of the form ∫ dx Pr(x) f(x), where f(x) is a linear combination of terms of the form x^i Pr(y|x)^j (for i = 0, 1, 2 and j = 0, 1, 2), which is a quadratic form in x (ignoring the x-dependence of Pr(y|x)). However, the terms that appear in this linear combination are such that a Pr(y|x) that is a piecewise linear function of x guarantees that f(x) is a piecewise linear combination of terms of the form x^i (for i = 0, 1, 2), which is again a quadratic form in x within each piece (the normalisation constraint Σ_{y=1}^M Pr(y|x) = 1 is used to remove a contribution to f(x) that is potentially quartic in x). Thus a piecewise linear dependence of Pr(y|x) on x does not lead to any dependencies on x that are not already explicitly present in D_1 + D_2. The stationarity condition on Pr(y|x) (see equation 19) then imposes conditions on the allowed piecewise linearities that Pr(y|x) can have.

For the purpose of doing analytic calculations, it is much easier to obtain results with the ideal piecewise linear Pr(y|x) than with some other functional form. If the optimisation of Pr(y|x) is constrained, by introducing a parametric form which has some biological plausibility, for instance, then analytic optimum solutions are not in general possible to calculate, and it becomes necessary to resort to numerical simulations. Piecewise linear Pr(y|x) should therefore be regarded as a convenient theoretical laboratory for investigating the properties of idealised neural networks.

A simple example illustrates how the piecewise linearity property of
Pr(y|x) may be used to find optimal solutions. Thus consider a 1-dimensional input coordinate x ∈ [−∞, +∞], with Pr(x) = P_0. Assume that the number of neurons M tends to infinity in such a way that there is 1 neuron per unit length of x, so that Pr(y|x) = p(y − x), where the piecewise linear property gives p(x) as

p(x) = 1                        for |x| ≤ 1/2 − s
p(x) = (1/2 + s − |x|)/(2s)     for 1/2 − s ≤ |x| ≤ 1/2 + s
p(x) = 0                        for |x| ≥ 1/2 + s    (20)

and by symmetry x′(y) = y.

This Pr(y|x) and x′(y) allow D_1 to be derived as

D_1 (per neuron) = (2 P_0/n) ( ∫_{−(1/2−s)}^{+(1/2−s)} dx x² + 2 ∫_{1/2−s}^{1/2+s} dx ((1/2 + s − x)/(2s)) x² ) = (P_0/n) (1/6 + 2s²/3)    (21)

and D_2 to be derived as

D_2 (per unit length) = (2(n−1) P_0/n) ( ∫_{−(1/2−s)}^{+(1/2−s)} dx x² + ∫_{1/2−s}^{1/2+s} dx ((2s−1)(x − 1/2)/(2s))² ) = ((n−1) P_0/(6n)) (2s − 1)²    (22)

Because there is one neuron per unit length, the contribution per unit length to D_1 + D_2 is the sum of the above two results

D_1 + D_2 (per unit length) = (P_0/(6n)) ( (n−1)(2s − 1)² + 4s² + 1 )    (23)

If D_1 + D_2 is differentiated with respect to s, then the stationarity condition d(D_1 + D_2)/ds = 0 yields the optimum value of s as

s = (n − 1)/(2n)    (24)

and the stationary value of D_1 + D_2 as

D_1 + D_2 (per unit length) = (2n − 1) P_0/(6n²)    (25)

When n = 1 the stationary solution reduces to s = 0 and D_1 + D_2 (per unit length) = P_0/6, which is a standard vector quantiser with non-overlapping neural response regions which partition the input space into unit width quantisation cells, so that for all x there is exactly one neuron that responds. Although the neurons have been manually arranged in topographic order by imposing Pr(y|x) = p(y − x), any permutation of the neuron indices in this stationary solution will also be a stationary solution.
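The results of equations 20-25 can be verified numerically by integrating D_1 + D_2 over one unit cell on a fine grid (a sketch; the grid resolution and the choice n = 4 are illustrative assumptions):

```python
import numpy as np

P0, n = 1.0, 4       # uniform input density P_0 and number of firing events n

def p(x, s):
    """The piecewise linear neural response of equation 20."""
    ax = np.abs(x)
    ramp = (0.5 + s - ax) / (2.0 * s)
    return np.where(ax <= 0.5 - s, 1.0, np.where(ax >= 0.5 + s, 0.0, ramp))

# With one neuron per unit length, the responses of neighbouring neurons
# sum to one for every x (the normalisation constraint on Pr(y|x)).
xs = np.linspace(-0.5, 0.5, 1001)
assert np.allclose(sum(p(xs - y, 0.3) for y in range(-2, 3)), 1.0)

def D1_plus_D2(s, num=200000):
    """Numerical D1 + D2 per unit length for Pr(y|x) = p(y - x), x'(y) = y."""
    xs = (np.arange(num) + 0.5) / num - 0.5      # midpoint grid, one unit cell
    ys = np.arange(-2, 3)[:, None]               # nearby neurons suffice
    pr = p(xs - ys, s)                           # Pr(y|x)
    d1 = (2.0 * P0 / n) * np.mean((pr * (xs - ys) ** 2).sum(axis=0))
    rec = (ys * pr).sum(axis=0)                  # sum_y Pr(y|x) x'(y)
    d2 = (2.0 * (n - 1) * P0 / n) * np.mean((xs - rec) ** 2)
    return d1 + d2

# The optimum half-overlap s and the stationary value derived in the text:
s_opt = (n - 1) / (2.0 * n)
assert np.isclose(D1_plus_D2(s_opt), (2 * n - 1) * P0 / (6 * n ** 2), atol=1e-5)
# s_opt is indeed a minimum: perturbing s increases the objective.
assert D1_plus_D2(s_opt) < min(D1_plus_D2(s_opt - 0.05), D1_plus_D2(s_opt + 0.05))
```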
This derivation could be generalised to the type of 3-layer network that was considered in section 3.1, in which case a neighbourhood function Pr(y′|y) would emerge automatically.

As n → ∞ the stationary solution behaves as s → 1/2 and D_1 + D_2 (per unit length) → P_0/(3n), with overlapping linear neural response regions which cover the input space, so that for all x there are exactly two neurons that respond with equal and opposite linear dependence on x. As n → ∞ the ratio of the number of firing events that occur on these two neurons is sufficient to determine x to O(1/√n). When n = ∞ this stationary solution is s = 1/2 and D_1 + D_2 (per unit length) = 0. However, when n = ∞ there are infinitely many other ways in which the neurons could be used to yield D_1 + D_2 (per unit length) = 0, because only the D_2 term contributes, and it is 0 when x = Σ_{y=1}^M Pr(y|x) x′(y). This is possible for any set of basis elements x′(y) that span the input space, provided that the expansion coefficients Pr(y|x) satisfy Pr(y|x) ≥ 0. In this 1-dimensional example only two basis elements are required (i.e. M = 2), which are x′(1) = −∞ and x′(2) = +∞. More generally, for this type of stationary solution, M = dim x + 1 is required to span the input space in such a way that Pr(y|x) ≥ 0, and if M < dim x + 1 then the stationary solution will span the input subspace (of dimension M − 1) that has the largest variance.

The n = 1 and n → ∞ limiting cases are very different. When n = 1 the optimum network splits up the input space into non-overlapping quantisation cells, and as n → ∞ the optimum network does a linear decomposition of the input space using non-negative expansion coefficients. This behaviour occurs because for n > 1 the neurons can cooperate when encoding the input x, so that by allowing more than one neuron to fire in response to x, the encoded version of x is distributed over more than one neuron.
In the above 1-dimensional example, the code is spread over one or two neurons depending on the value of x. This cooperation amongst neurons is a property of the coherent part D_2 of the upper bound on D_VQ (see equation 8).

3.3 Factorial encoding

For certain types of distribution of data in input space the optimal network consists of a number of subnetworks, each of which responds to only a subspace of the input space. This is called factorial encoding, where the encoded input is distributed over more than one neuron, and this distributed code typically has a much richer structure than was encountered in section 3.2.

The simplest problem that demonstrates factorial encoding will now be investigated (this example was presented in [4], but the derivation given here is more direct). Thus, assume that the data in input space uniformly populate the surface of a 2-torus S¹ × S¹. Each of the S¹ is a plane unit circle embedded in R² and centred on the origin, and S¹ × S¹ is the Cartesian product of a pair of such circles. Overall, the 2-torus lives in a 4-dimensional input space whose elements are denoted as x = (x_1, x_2, x_3, x_4), where one of the circles lives in (x_1, x_2) and the other lives in (x_3, x_4). These circles may be parameterised by angular degrees of freedom θ_1 and θ_2, respectively.

The optimal Pr(y|x) (i.e. a piecewise linear stationary solution of the type that was encountered in section 3.2) could be derived from this input data PDF Pr(x). However, the properties of the sought-after optimal Pr(y|x) are preserved if one restricts the solution space to the following types of Pr(y|x)

Pr(y|x) = δ_{y,y_1(θ_1)} or δ_{y,y_2(θ_2)}           (type 1)
Pr(y|x) = δ_{y,y_{12}(θ_1,θ_2)}                      (type 2)
Pr(y|x) = (1/2) (δ_{y,y_1(θ_1)} + δ_{y,y_2(θ_2)})    (type 3)    (26)

where y_1(θ_1) encodes θ_1, y_2(θ_2) encodes θ_2, and y_{12}(θ_1, θ_2) encodes (θ_1, θ_2). The allowed ranges of the code indices are 1 ≤ y_1(θ_1) ≤ M (and similarly for y_2(θ_2)) in the type 1 case, 1 ≤ y_{12}(θ_1, θ_2) ≤ M in the type 2 case, and 1 ≤ y_1(θ_1) ≤ M/2 with M/2 + 1 ≤ y_2(θ_2) ≤ M in the type 3 case.
The type 1 solution assumes that all M neurons respond only to θ_1 (or, alternatively, all respond only to θ_2), the type 2 solution assumes that all M neurons respond to (θ_1, θ_2), and the type 3 solution (which is a very simple type of factorial encoder) assumes that M/2 neurons respond only to θ_1, and the other M/2 neurons respond only to θ_2.

In order to derive explicit results for the stationary value of D_1 + D_2, it is necessary to optimise the x′(y). The stationarity condition on x′(y) may readily be deduced from ∂(D_1 + D_2)/∂x′(y) = 0 as

n ∫ dx Pr(x|y) x = x′(y) + (n−1) ∫ dx Pr(x|y) Σ_{y′=1}^M Pr(y′|x) x′(y′)    (27)

If a Pr(y|x) (and hence Pr(x|y)) is inserted into this stationarity condition, then it may be solved for the corresponding x′(y).

Assuming that the encoding functions partition the 2-torus symmetrically, the three types of solution may be optimised as described in the following three sections.

3.3.1 Type 1 solution

Assume that
Pr(y|x) = δ_{y,y_1(θ_1)}, and that the y = 1 quantisation cell is the Cartesian product of the arc |θ_1| ≤ π/M and the whole circle 0 ≤ θ_2 < 2π, then the stationarity condition for x′(1) becomes

n (M/(2π)) ∫_{−π/M}^{π/M} dθ_1 (1/(2π)) ∫_0^{2π} dθ_2 (cos θ_1, sin θ_1, cos θ_2, sin θ_2) = x′(1) + (n−1) (M/(2π)) ∫_{−π/M}^{π/M} dθ_1 (1/(2π)) ∫_0^{2π} dθ_2 x′(1)    (28)

which yields the solution x′(1) = ((M/π) sin(π/M), 0, 0, 0). The first two components are the centroid of the arc |θ_1| ≤ π/M of a unit circle centred on the origin. All of the x′(y) can be obtained by rotating x′(1) about the origin by multiples of 2π/M. Using the assumed symmetry of the solution, the expression for D_1 + D_2 becomes

D_1 + D_2 = 2 ( (M/(2π)) ∫_{−π/M}^{π/M} dθ_1 ‖(cos θ_1, sin θ_1) − ((M/π) sin(π/M), 0)‖² + (1/(2π)) ∫_0^{2π} dθ_2 ‖(cos θ_2, sin θ_2) − (0, 0)‖² )    (29)

where the first (or second) term corresponds to the subspace to which the neurons respond (or do not respond). This gives the stationary value of D_1 + D_2 as D_1 + D_2 = 4 − (2M²/π²) sin²(π/M). Because only one neuron can fire (given x), since Pr(y|x) = δ_{y,y_1(θ_1)} or δ_{y,y_2(θ_2)}, no further information about x can be obtained after the first firing event has occurred, so this result for D_1 + D_2 is independent of n, as expected.

3.3.2 Type 2 solution

Assume that the y = 1 quantisation cell is the Cartesian product of the arcs |θ_1| ≤ π/√M and |θ_2| ≤ π/√M of the two unit circles that form the 2-torus. The stationarity condition for x′(1) can be deduced from the type 1 case with the replacement M → √M, which gives x′(1) = ((√M/π) sin(π/√M), 0, (√M/π) sin(π/√M), 0). The expression for D_1 + D_2 may similarly be deduced from the type 1 case as twice the first term in equation 29 with the replacement M → √M, to yield the stationary value of D_1 + D_2 as D_1 + D_2 = 4 − (4M/π²) sin²(π/√M).
As in the type 1 case, this result for D_1 + D_2 is independent of n.

3.3.3 Type 3 solution

The stationarity condition for x′(1) can be written by analogy with the type 1 case, with the replacement M → M/2, and modifying the last term to take account of the more complicated form of Pr(y|x), to yield

n (M/(4π)) ∫_{−2π/M}^{2π/M} dθ_1 (1/(2π)) ∫_0^{2π} dθ_2 (cos θ_1, sin θ_1, cos θ_2, sin θ_2) = x′(1) + (1/2)(n−1) (M/(4π)) ∫_{−2π/M}^{2π/M} dθ_1 (1/(2π)) ∫_0^{2π} dθ_2 x′(1)    (30)

where (1/(2π)) ∫_0^{2π} dθ_2 Σ_{y′=M/2+1}^M δ_{y′,y_2(θ_2)} x′(y′) = 0 has been used (this follows from the assumed symmetry of the solution). This yields the solution x′(1) = (n/(n+1)) ((M/π) sin(2π/M), 0, 0, 0). Using the assumed symmetry of the solution, the expression for D_1 becomes

D_1 = (2/n) ( (M/(4π)) ∫_{−2π/M}^{2π/M} dθ_1 ‖(cos θ_1, sin θ_1) − (n/(n+1)) ((M/π) sin(2π/M), 0)‖² + (1/(2π)) ∫_0^{2π} dθ_2 ‖(cos θ_2, sin θ_2)‖² )    (31)

and the expression for D_2 becomes

D_2 = (4(n−1)/n) ( (M/(4π)) ∫_{−2π/M}^{2π/M} dθ_1 ‖(cos θ_1, sin θ_1) − (n/(n+1)) ((M/(2π)) sin(2π/M), 0)‖² )    (32)

This gives the stationary value of D_1 + D_2 as D_1 + D_2 = 4 − (nM²/((n+1)π²)) sin²(2π/M).

Because Pr(y|x) = (1/2)(δ_{y,y_1(θ_1)} + δ_{y,y_2(θ_2)}), one firing event has to occur in each of the intervals 1 ≤ y ≤ M/2 and M/2 + 1 ≤ y ≤ M for all of the information to be collected about x. However, the random nature of the firing events means that the probability with which this condition is satisfied increases with n, so this result for D_1 + D_2 decreases as n increases.

Collect the above results together for comparison.
D_1 + D_2 = 4 − (2M²/π²) sin²(π/M)             (type 1)
D_1 + D_2 = 4 − (4M/π²) sin²(π/√M)             (type 2)
D_1 + D_2 = 4 − (nM²/((n+1)π²)) sin²(2π/M)     (type 3)    (33)

For constant M and letting n → ∞, the value of D_1 + D_2 for the type 3 solution asymptotically behaves as D_1 + D_2 (type 3) → 4 − (M²/π²) sin²(2π/M), in which case the relative stability of the three types of solution is: type 3 (most stable), type 2 (intermediate), type 1 (least stable). Similarly, for constant n and letting M → ∞, the relative stability of the three types of solution is: type 2 (most stable), type 3 (intermediate), type 1 (least stable).

In both of these limiting cases the type 1 solution is least stable. If there is a fixed number of firing events n, and there is no upper limit on the number of neurons M, then the type 2 solution is most stable, because it can partition the 2-torus into lots of small quantisation cells. If there is a fixed number of neurons M (which is the usual case), and there is no upper limit on the number of firing events n, then the type 3 solution is most stable, because the limited size of M renders the type 2 solution inefficient (the quantisation cells would be too large), so the 2-torus S¹ × S¹ is split into two S¹ subspaces, each of which is assigned a subset of M/2 neurons. If n is large enough, then each of these two subsets of neurons has a high probability of occurrence of a firing event, which ensures that both of the S¹ subspaces are encoded.

More generally, when there is a limited number of neurons they will tend to split into subsets, each of which encodes a separate subspace of the input. The assumed form of Pr(y|x) in equation 26 does not allow an unrestricted search of all possible Pr(y|x). If the global optimum solution (which has piecewise linear Pr(y|x), as proved in section 3.2) cuts up the input space into partially overlapping pieces, then it is well approximated by a solution such as one of those listed in equation 26.
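The three stationary values in equation 33 can be evaluated numerically, and the type 1 value can additionally be checked by Monte Carlo sampling of the 2-torus; the stability orderings quoted above then follow directly (a sketch; the choices of M, n, and sample size are illustrative):

```python
import numpy as np

def type1(M):       # all M neurons encode one circle only
    return 4 - 2 * (M / np.pi) ** 2 * np.sin(np.pi / M) ** 2

def type2(M):       # all M neurons jointly encode (theta1, theta2)
    return 4 - (4 * M / np.pi ** 2) * np.sin(np.pi / np.sqrt(M)) ** 2

def type3(M, n):    # factorial code: M/2 neurons per circle
    return 4 - (n * M ** 2 / ((n + 1) * np.pi ** 2)) * np.sin(2 * np.pi / M) ** 2

# Monte Carlo check of the type 1 value, D1 + D2 = 2 E||x - x'(y(x))||^2.
rng = np.random.default_rng(4)
M, N = 8, 200000
th1 = rng.uniform(-np.pi, np.pi, N)
th2 = rng.uniform(-np.pi, np.pi, N)
x = np.stack([np.cos(th1), np.sin(th1), np.cos(th2), np.sin(th2)], axis=1)
y = np.floor((th1 + np.pi) / (2 * np.pi / M)).astype(int) % M  # winner (theta1)
phi = -np.pi + (y + 0.5) * (2 * np.pi / M)                     # cell centre angle
c = (M / np.pi) * np.sin(np.pi / M)                            # arc centroid radius
xp = np.stack([c * np.cos(phi), c * np.sin(phi),
               np.zeros(N), np.zeros(N)], axis=1)
mc = 2 * np.mean(np.sum((x - xp) ** 2, axis=1))
assert abs(mc - type1(M)) < 0.05

# Fixed M, many firing events: the factorial (type 3) code is most stable.
assert type3(8, n=100) < type2(8) < type1(8)

# Fixed n, many neurons: the joint (type 2) code is most stable.
assert type2(400) < type3(400, n=2) < type1(400)
```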
Typically, curved input spaces lead to such solutions, because a piecewise linear Pr(y|x) can readily quantise such spaces by slicing off the curved "corners" that occur in such spaces.

4 Conclusions

In this paper an objective function for optimising a layered network of discretely firing neurons has been presented, and three non-trivial examples of how it is applied have been shown: topographic mapping networks, piecewise linear dependence on the input of the probability of a neuron firing, and factorial encoder networks. Many other examples could be given, such as combining the first and third of the above results to obtain factorial topographic networks, or extending the theory to multilayer networks, or introducing temporal information.
Acknowledgements

I thank Chris Webber for many useful conversations that we had during the course of this research.
References

[1] T. Kohonen. Self-organisation and Associative Memory. Springer-Verlag, Berlin, 1989.
[2] S. P. Luttrell. Hierarchical self-organising networks. In Proceedings of the IEE Conference on Artificial Neural Networks, pages 2-6, London, 1989. IEE.
[3] S. P. Luttrell. A theory of self-organising neural networks. In Mathematics of Neural Networks: Models, Algorithms and Applications, pages 240-244. Kluwer, Boston, 1997.
[4] S. P. Luttrell. Self-organisation of multiple winner-take-all neural networks. Connection Science, 9(1):11-30, 1997.
[5] J. Rissanen. Modelling by shortest data description. Automatica, 14(5):465-471, 1978.