BHN: A Brain-like Heterogeneous Network
Liu, Tao
Abstract
The human brain works in an unsupervised way, and more than one brain region is essential for lighting up intelligence. Inspired by this, we propose a brain-like heterogeneous network (BHN) that cooperatively learns many distributed representations and one global attention representation. By optimizing distributed, self-supervised, and gradient-isolated objective functions in a minimax fashion, our model improves its representations, which in our experiments are generated from patches of pictures or frames of videos.
1 Introduction

It is a mystery how different brain regions are optimized jointly. In this article, we propose a brain-like heterogeneous network (BHN) that simulates the multi-module structure of the brain. We rely on three hypotheses:

1. The brain is a machine that maximizes the information of its internal representations. This hypothesis is known as Efficient Coding [Barlow et al., 1961] or Efficient Information Representation [Linsker, 1990, Atick, 1992].
2. The brain learns by optimizing certain objective functions, and different brain regions optimize different objective functions [Lake et al., 2017].
3. The brain works by fusing top-down predictions with bottom-up perceptions. This hypothesis effectively enables the brain to process information recursively.

We view hypothesis 1 as the first principle for understanding the brain, and we obtain the desired objective functions by formalizing it. The objective is a sum of many objective functions, each applied to an individual module, and all the modules together make up BHN. We also seek to understand the brain's information-processing scheme, which we name Recursive Modeling in this article.

The rest of the article is organized as follows. Section 2 gives the objective functions derived from the first hypothesis. Sections 3 and 5 elaborate on BHN and Recursive Modeling respectively, and Sections 4 and 6 provide demonstration experiments for each of the former two sections.
2 Objective Functions

The brain collects information from the environment (x) and then generates internal representations (z). It is inferred that an important function of the brain is to maximize the information entropy of its representations. It is generally believed that these representations are distributed over the cerebral cortex, so it is essential to ensure the independence of the information they represent. Previous solutions include sparse coding [Olshausen and Field, 1996], independent component analysis [Hyvärinen and Oja, 2000], and end-to-end deep learning. In this article we propose our solution as follows.

We use \{z_1, z_2, \cdots, z_n\} to denote the representations distributed over the cerebral cortex, and H(z_1 z_2 \cdots z_n) to denote their joint information entropy. We then formalize the objective as \max H(z_1 z_2 \cdots z_n). Considering

H(z_1 z_2 \cdots z_n) = \sum_i H(z_i) + [H(z_1 z_2 \cdots z_n) - \sum_i H(z_i)]   (1)

the objective function can be roughly decomposed into two sub-objectives [Atick, 1992]:

\max_z H(z_i), \qquad \min_z I(z_i; z_j) \ \text{if } i \neq j   (2)

The second sub-objective is intractable because of its \Omega(n^2) computational complexity, so we introduce a global attention [Graves et al., 2014, Vaswani et al., 2017] representation (a) into (2) by reforming the expression in a minimax fashion:

\max_z \sum_i H(z_i), \qquad \min_z \max_a \sum_i I(z_i; a)   (3)

Then, by re-composing the two expressions above into a single one, we obtain the objective function:

\min_z \max_a \sum_i [-H(z_i) + I(a; z_i)]   (4)

We use contrastive losses [Hadsell et al., 2006] to formalize H(z_i). Contrastive losses measure the similarities of sample pairs in a representation space.
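As a small numerical sanity check of the decomposition in Equation (1), the bracketed redundancy term equals minus the mutual information between the representations. The sketch below verifies this for two correlated binary representations; the distribution and helper names are illustrative, not from the paper.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a (possibly joint) probability table."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(P):
    """I(z1; z2) computed directly from the joint table P."""
    p1, p2 = P.sum(axis=1), P.sum(axis=0)
    return sum(P[i, j] * np.log2(P[i, j] / (p1[i] * p2[j]))
               for i in range(P.shape[0]) for j in range(P.shape[1])
               if P[i, j] > 0)

# Two correlated binary representations z1, z2 (rows: z1, columns: z2).
P = np.array([[0.4, 0.1],
              [0.1, 0.4]])
H_joint = entropy(P)
H_marginals = entropy(P.sum(axis=1)) + entropy(P.sum(axis=0))
# The bracketed term in Equation (1): negative whenever the z_i are redundant.
redundancy = H_joint - H_marginals
```

The redundancy term is exactly -I(z1; z2), which is why maximizing the joint entropy splits into maximizing each H(z_i) while minimizing the pairwise mutual information, as in Equation (2).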
A particular contrastive loss, InfoNCE [Oord et al., 2018], is considered in this article:

H(z_i) \propto \log \frac{\exp(f(z_i, z_{i+}))}{\sum_{z_{i-} \in Z_i} \exp(f(z_i, z_{i-}))}   (5)

where f is a density ratio, which preserves the mutual information between a positive or negative pair of samples.

The next step is to formalize I(a; z_i). To stabilize minimaxing on I(a; z_i), we do not formalize it directly. Instead, we use a to produce a probability distribution P(z_i) as the prediction of z_i. We call a an attention representation because it is used to generate shared Query/Key vectors a_i, each of which is paired with a representation z_i; these Q/K vectors are used to calculate each sample's probability/weight. The details are as follows.

We maintain a memory pool of N paired samples:

X_i = (A_i, Z_i) = \{(a_{i1}, z_{i1}), (a_{i2}, z_{i2}), \cdots, (a_{iN}, z_{iN})\}   (6)

where Z_i is the sample space of P(z_i). The probability P(z_i = z_{ij}) is equal to the attention weight w_{ij}, calculated by

P(z_i = z_{ij}) = w_{ij} = \mathrm{softmax}(\mathrm{similarity}(a_i, a_{ij})) \big|_{a_{ij} \in A_i}   (7)

Now we can formalize I(a; z_i) as

I(a; z_i) \propto \log \frac{\sum_{z_{ij} \in Z_i} w_{ij} \exp(f(z_i, z_{ij}))}{\sum_{z_{i-} \in Z_i} \exp(f(z_i, z_{i-}))}   (8)

Eventually, by bringing (5) and (8) into (4), we obtain the objective function

\min_z \max_a \sum_i \left[ -\log \frac{\exp(f(z_i, z_{i+}))}{\sum_{z_{ij} \in Z_i} w_{ij} \exp(f(z_i, z_{ij}))} \right]   (9)

Notably, this objective function suggests a probabilistic inference machine [von Helmholtz, 1925], and it is a corollary of our hypotheses. It is biologically plausible, and we can say that the attention representation makes predictions by activating selective replays of representations in the cerebral cortex.

3 Brain-like Heterogeneous Network
In this section we propose the architecture of BHN, which applies the objective function in Equation 9. It has a cortex-network composed of basic units, with unit i generating the corticocerebral representation z_i, and an attention-network generating the global attention representation a. As the name suggests, we use artificial neural networks (ANNs) to implement these two components. Unlike the popular approach of end-to-end back-propagation with a global loss function, in our model there is gradient isolation between the units and between the two networks.

Cortex-network
In each unit i, there is an encoder g^i_enc that encodes the input x into a latent representation z_i. In the image task below, a unit contains only g^i_enc. In the video tasks, each unit has another network, called the aggregator g^i_ar, which outputs a unit context c_i to act as the positive partner of z_i. In effect, we apply Contrastive Predictive Coding [Oord et al., 2018] within each unit, as shown in Figure 1.

Figure 1: Architecture of the cortex-network in video tasks

Attention-network
The attention-network generates the global attention representation a, like the medial temporal lobe in the mammalian brain. Its architecture resembles a traditional encoder-decoder network, where the encoder generates a and the decoder generates the a_i, as shown in Figure 2a.

Figure 2: Architecture of the attention-network in video tasks. (a) full architecture; (b) control group 2.

The input of the attention-network comes from the output of the cortex-network. In our image task, it is natural to take all z_i as input because they are the only outputs of the cortex-network. In our video tasks, the attention-network takes all c_i as input, because we want to obtain a in advance of z_i.

The attention representation a should not retain all the information fed into it; it only needs to capture the information shared by multiple units. To achieve this, one option is to make a act as an information bottleneck, meaning that a has a lower dimension than the input vector. The other option is to arbitrarily drop out some units' outputs. In our image task we adopt the second option, and in our video tasks the first.

Neural Interface

A neural interface is not an essential component of BHN, but we mention it here in advance because it is important for Recursive Modeling. Unlike the cortex-network and the attention-network, neural interfaces have no biological counterparts; the name comes from Brain-Computer Interfaces (BCIs) [Wolpaw et al., 2000]. By processing information from the cortex-network, neural interfaces perform various functions, such as controlling attention, controlling actions, and whatever else is needed.
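The two networks described above can be sketched minimally as follows. This is an illustrative sketch only: the class names, weight shapes, and the plain leaky recurrence (standing in for the GRU that the paper uses as g_ar) are my own, and the gradient isolation between units is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(h, slope=0.01):
    return np.where(h > 0, h, slope * h)

class CortexUnit:
    """One unit of the cortex-network: an encoder g_enc producing z and,
    for video tasks, an aggregator g_ar producing the unit context c."""
    def __init__(self, in_dim, z_dim):
        self.W_enc = rng.normal(scale=0.1, size=(z_dim, in_dim))
        self.W_ar = rng.normal(scale=0.1, size=(z_dim, z_dim))
        self.c = np.zeros(z_dim)  # recurrent context state

    def step(self, x):
        z = leaky_relu(self.W_enc @ x)             # g_enc
        self.c = np.tanh(self.W_ar @ self.c + z)   # g_ar (GRU stand-in)
        return z, self.c

class AttentionNetwork:
    """Encoder-decoder attention-network: the encoder compresses all unit
    contexts into the global attention a (an information bottleneck), and
    the decoder emits one Query/Key vector a_i per unit."""
    def __init__(self, n_units, z_dim, a_dim):
        self.W_enc = rng.normal(scale=0.1, size=(a_dim, n_units * z_dim))
        self.W_dec = rng.normal(scale=0.1, size=(n_units * z_dim, a_dim))
        self.n_units = n_units

    def forward(self, contexts):
        x = np.concatenate(contexts)     # bottom-up input from the cortex-network
        a = np.tanh(self.W_enc @ x)      # bottleneck: dim(a) < dim(x)
        a_units = (self.W_dec @ a).reshape(self.n_units, -1)  # one a_i per unit
        return a, a_units
```

With 16 units and 2-dimensional representations, as in the video task, the attention-network compresses a 32-dimensional bottom-up signal into a small a and decodes one a_i per unit.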
We download ten landscape pictures from the internet and crop them into 8000 patches, and each patch is gray-scaled and normalized. We then design a BHN model to learn on this dataset. The model has 64 units in its cortex-network. The encoder in each unit contains 128 hidden units with leaky-ReLU activation, and the attention-network contains 256 hidden units, which is also the dimension of a, with leaky-ReLU activation. The dimensions of z_i and a_i are both set to 1. The batch size, which is also the size of X_i, is 512.

The density ratio f is formalized as

f(z, z') = -\mathrm{clamp\_max\_5}(|z - z'|)   (10)

The similarity between a_i and the pool samples a_{ij} is formalized as

\mathrm{similarity}(a_i, a_{ij}) = -|a_i - a_{ij}| / \tau   (11)

where \tau is a temperature optimized together with the attention-network.

The inputs are perturbed with zero-mean Gaussian white noise, both for data augmentation and to produce positive sample pairs for the contrastive loss function. Dropout is applied in the attention-network. We use the SGD optimizer with momentum and weight decay. The model is light, and training runs fast even on a laptop without GPU acceleration.

In addition to the normal experiment, we also establish a control experiment where the objective function is \max \sum_i H(z_i) only. After 40 epochs of training, we visualize all 64 units by maximizing their outputs. The results are shown in Figure 3.

Figure 3: Visualized features of units. (a) untrained; (b) normal; (c) control.

For clarity, we use red and green to indicate light and shade. As can be seen from the figure, the visualized features are noisy if the model is not trained. In both the normal and control experiments, the units show intensified responses to certain image modes after training. The images in the control experiment are fuzzy; in contrast, the visualized features in the normal experiment are sharper and more diverse.
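For the scalar representations of this image task, the density ratio of Equation (10), the attention weights of Equations (7) and (11), and one summand of the objective in Equation (9) can be sketched as follows. The function names are mine, and the temperature, learned jointly with the attention-network in the paper, is fixed here for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def density_ratio(z, z_prime, cap=5.0):
    """Equation (10): f(z, z') = -clamp_max_5(|z - z'|)."""
    return -min(abs(z - z_prime), cap)

def objective_term(z, z_pos, z_pool, a, a_pool, tau=1.0):
    """One summand of Equation (9) for scalar z and a:
    -log[ exp(f(z, z+)) / sum_j w_j exp(f(z, z_j)) ],
    with w_j = softmax(similarity(a, a_j)) from Equations (7) and (11)."""
    w = softmax(-np.abs(a - a_pool) / tau)   # attention weights over the pool
    ratios = np.array([density_ratio(z, zj) for zj in z_pool])
    return -density_ratio(z, z_pos) + np.log(np.sum(w * np.exp(ratios)))
```

The clamp caps the distance at 5, so a distant negative contributes only a bounded term to the loss; the loss itself is smaller when z sits close to its positive partner.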
We build a video set containing 64 episodes recording play of the CarRacing game in OpenAI Gym. Each episode lasts 512 frames, and each frame has a size of (96, 96) pixels. The frames are converted to gray scale and rescaled to (-1, 1). At each time step, 4 consecutive frames with additional noise are fed to the input.

A linear layer, shared by all g^i_enc to reduce the number of parameters, first reduces the dimensionality of the 4 × (96 × 96) input. Each encoder g^i_enc contains 32 hidden units with leaky-ReLU activations. We then use a GRU-RNN [Cho et al., 2014] for the autoregressive part of the unit, g^i_ar, with a 32-dimensional hidden state. The cortex-network has 16 units, and the attention-network is a simple unbiased linear network with one hidden layer. The dimensions of z^i_t, c^i_t, a^i_t, and a_t are all set to 2. The batch size, which is also the size of X_i, is 256.

In our experiment, z^i_{t+4} and c^i_t are used as the positive pair for the contrastive loss function. The delay of 4 steps is somewhat arbitrary; it serves to quantify the directional information between z and c.

The density ratio f is formalized as

f(z^i_{t+4}, c^i_t) = -\cos\langle z^i_{t+4}, c^i_t \rangle / T   (12)

where T is a temperature optimized together with the cortex-network. The similarity between a_i and the pool samples a_{ij} is formalized as

\mathrm{similarity}(a_i, a_{ij}) = -\cos\langle a_i, a_{ij} \rangle / \tau   (13)

where \tau is a temperature optimized together with the attention-network.

We choose the Adam optimizer. For data augmentation, each episode is folded into 16 segments of 256 frames. We train each model for 20 epochs; in our experience, much longer training does not lead to over-fitting.

We use deconvolutional networks [Zeiler et al., 2011] to reconstruct images from the representations z_t, c_t, and a_t respectively. The mean square error (mse) of the reconstructed images is used to evaluate the quality of the source representations.
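Equations (12) and (13) can be written as small helper functions. The signs follow the equations above verbatim, and the temperatures, which the paper learns jointly with each network, are fixed here for illustration.

```python
import numpy as np

def cosine(u, v):
    """cos<u, v>: cosine of the angle between two representation vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def density_ratio_video(z_future, c, T=1.0):
    """Equation (12): f(z_{t+4}, c_t) = -cos<z_{t+4}, c_t> / T."""
    return -cosine(z_future, c) / T

def similarity_video(a, a_j, tau=1.0):
    """Equation (13): similarity(a_i, a_ij) = -cos<a_i, a_ij> / tau."""
    return -cosine(a, a_j) / tau
```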
Given that a trivial solution could achieve a loss of 0.0225 if no information were provided, in the following we use a score, a scaled version of (0.0225 - mse), to indicate quality.

We also establish two control groups to demonstrate the performance of adversarial training.

Control group 1: We abandon the attention-network and only optimize H(z_i), in the same way as the control group established in section 4.2.

Control group 2: We design a restricted attention-network architecture by cutting off the links via a between units, as shown in Figure 2b.

Table 1 gives the scores of z_t, c_t, and a_t before and after training, for the experimental group and the two control groups. The scores of the experimental group surpass those of its competitors.

Table 1: Scores of the representations z_t, c_t, and a_t before and after training

5 Recursive Modeling

Model building, arguably, is the approach to general intelligence [Lake et al., 2017]. Additionally, we think recursion is essential in the design of strong artificial intelligence, just as it is for many Turing-complete machines [Turing, 1936]. So we propose the approach of Recursive Modeling, which means that the agent should not only build causal models of the environment, but also recursively build causal models on top of the early-built ones. The environment is where negentropy [Schrodinger, 1944] flows in.

Figure 4: Schematic diagram of Recursive Modeling

As shown in the schematic diagram (Figure 4), the Recursive Modeling approach has two requirements. The first is to build a mental space where models run. If we think of a model as a collection of regularities (or schemas [Piaget, 1929, Bartlett and Bartlett, 1932]), then the mental space is the collection of all regularities. Regularities are usually obtained from information bottlenecks, like the linguistic regularities found in the word-vector space [Mnih and Kavukcuoglu, 2013] and the disentangled representations generated by generative models [Bengio et al., 2013, Larsen et al., 2015]. Existing low-level representations should be recursively distilled by the information bottleneck.

The second requirement of Recursive Modeling is to allow the agent to perceive and intervene in the mental space, just as it does with the environment in the physical world. Perception and intervention are two necessities for building causal models at any time.

Among the models that have been built, the early ones simulate the relations between real entities in the environment, while the later ones are responsible for abstract thinking tasks, such as calculus in a symbolic system. We do not mean that there is a clear hierarchy between models.
In fact, the notion of a "model" is only a fictitious concept describing a set of closely related regularities, and many of those regularities are actually intertwined and shared, and reappear at different levels. Units in the cortex-network can also cluster into functional regions, and regions can be organized in a hierarchy-like pattern. Different models can then correspond to different regions of the cortex. However, this is future work, and this article does not pursue it further.
Figure 5: BHN adapted for Recursive Modeling

Figure 5 gives a schematic diagram of BHN adapted for Recursive Modeling, in which the three loops marked in Figure 4 are also marked at roughly the corresponding positions.

BHN meets the two requirements of Recursive Modeling. Firstly, the attention-network can serve as an information/attention bottleneck [Felleman and Van, 1991], and the global attention (a) can be regarded as representations in the mental space. Secondly, the agent can perceive the mental space by fusing bottom-up perceptions with top-down predictions, which is detailed in section 5.2.

We think that much of the intelligence of the human brain resides in its sophisticated architecture, and our BHN model is currently oversimplified and lacks many essential functions, such as dopaminergic neurons for reward and prediction-error learning [Hollerman and Schultz, 1998], a realization of the attention-control interface, and a hippocampus forming mental maps and episodic memories. There is no doubt that we need more inspiration from the human brain to proceed with this work [Lake et al., 2017].

We think that the human brain works by continuously mixing real perceptions with imaginary predictions, and in extreme cases it is like "hearing one's thoughts spoken out aloud" [Schneider, 1939]. If z^i_t represents what is heard, then the expectation e^i_t = \sum_{z_{ij} \in Z_i} w^i_{tj} z_{ij} can represent what the brain predicts to hear. By replacing z^i_t with e^i_t at some time steps, the agent can, to some extent, perceive the mental space just as it perceives the external environment.

z_t and e_t are homologous, and both can be used as the output of a unit, so the information flow within the net is actually a mixture of perceptions (z_t) and predictions (e_t). z_t is involuntary and volatile, but e_t is processed recurrently and remains somewhat locked inside Loop (3) (marked in both Figure 4 and Figure 5); in this way, e_t can provide gain for z_t in a sense.
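The expectation e_t defined above is a weighted replay of stored representations, with weights from the attention similarity of Equation (7). A minimal sketch for scalar representations, with an Equation (11)-style similarity and a fixed temperature (both illustrative choices of mine):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def expectation(a, a_pool, z_pool, tau=0.1):
    """Top-down prediction e_t = sum_j w_tj * z_j over the memory pool,
    with attention weights w_tj = softmax(similarity(a, a_j))."""
    w = softmax(-np.abs(a - a_pool) / tau)  # sharper weights for small tau
    return float(np.sum(w * z_pool))
```

When the attention a closely matches one stored key, e_t collapses onto that sample's representation, which is the "selective replay" reading of the objective in section 2.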
We speculate that this mechanism corresponds to the brain's working memory, and that its gain level determines whether the representations in the cortex are suppressed or enhanced [Miller et al., 1991].

We follow the basic setup of the simple model in section 4.2 to test the working-memory hypothesis by mixing z^i_t with e^i_{t-1}. First, in the training phase, we feed z_t to c^i_t = g^i_ar(\cdot) at even time steps and feed e_{t-1} at odd time steps. A deconvolutional network reconstructing images from e, used to compute a score, is also trained in this phase.

Next, in the testing phase, taking a certain time step as the boundary, z_t is used before it and e_{t-1} is used after it. We judge performance by how long the score of e stays positive in the testing phase. Figure 6 gives the result: the working memory effectively lasts for about 30 frames, much longer than the single frame the system was adapted to in the earlier training phase.

Figure 6: The score of e over time in the testing phase

7 Conclusions
In this article, we propose three hypotheses on the learning and working mechanisms of the human brain. By formalizing these hypotheses, we obtain a computable objective, which is a sum of many objective functions. We then build and test a model (BHN), which couples several artificial neural networks, to optimize the objective functions obtained. Finally, we propose the approach of Recursive Modeling and test a hypothesis on working memory.
Broader Impact
Our work has no direct ethical or societal implications.
References

[Atick, 1992] Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network: Computation in Neural Systems, 3(2):213–251.

[Barlow et al., 1961] Barlow, H. B. et al. (1961). Possible principles underlying the transformation of sensory messages. Sensory Communication, 1:217–234.

[Bartlett and Bartlett, 1932] Bartlett, F. C. and Bartlett, F. C. (1932). Remembering: A Study in Experimental and Social Psychology. Cambridge University Press.

[Bengio et al., 2013] Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828.

[Cho et al., 2014] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

[Felleman and Van, 1991] Felleman, D. J. and Van, D. E. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex (New York, NY: 1991), 1(1):1–47.

[Graves et al., 2014] Graves, A., Wayne, G., and Danihelka, I. (2014). Neural Turing machines. arXiv preprint arXiv:1410.5401.

[Hadsell et al., 2006] Hadsell, R., Chopra, S., and LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. Volume 2, pages 1735–1742. IEEE.

[Hollerman and Schultz, 1998] Hollerman, J. R. and Schultz, W. (1998). Dopamine neurons report an error in the temporal prediction of reward during learning. Nature Neuroscience, 1(4):304–309.

[Hyvärinen and Oja, 2000] Hyvärinen, A. and Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411–430.

[Lake et al., 2017] Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40.

[Larsen et al., 2015] Larsen, A. B. L., Sønderby, S. K., Larochelle, H., and Winther, O. (2015). Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.

[Linsker, 1990] Linsker, R. (1990). Perceptual neural organization: some approaches based on network models and information theory. Annual Review of Neuroscience, 13(1):257–281.

[Miller et al., 1991] Miller, E. K., Li, L., and Desimone, R. (1991). A neural mechanism for working and recognition memory in inferior temporal cortex. Science, 254(5036):1377–1379.

[Mnih and Kavukcuoglu, 2013] Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pages 2265–2273.

[Olshausen and Field, 1996] Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609.

[Oord et al., 2018] Oord, A. v. d., Li, Y., and Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

[Piaget, 1929] Piaget, J. (1929). The child's conception of the world (J. & A. Tomlinson, Trans.). Savage, Maryland: Littlefield Adams.

[Schneider, 1939] Schneider, K. (1939). Psychischer Befund und psychiatrische Diagnose. Thieme.

[Schrodinger, 1944] Schrodinger, E. (1944). What is Life.

[Turing, 1936] Turing, A. M. (1936). On computable numbers, with an application to the Entscheidungsproblem. J. of Math, 58(345-363):5.

[Vaswani et al., 2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

[von Helmholtz, 1925] von Helmholtz, H. (1925). Physiological Optics, volume 3. Optical Society of America.

[Wolpaw et al., 2000] Wolpaw, J. R., Birbaumer, N., Heetderks, W. J., McFarland, D. J., Peckham, P. H., Schalk, G., Donchin, E., Quatrano, L. A., Robinson, C. J., and Vaughan, T. M. (2000). Brain-computer interface technology: a review of the first international meeting. IEEE Transactions on Rehabilitation Engineering, 8(2):164–173.

[Zeiler et al., 2011] Zeiler, M. D., Taylor, G. W., and Fergus, R. (2011). Adaptive deconvolutional networks for mid and high level feature learning. Pages 2018–2025. IEEE.