A Framework for Learning Invariant Physical Relations in Multimodal Sensory Processing
Du Xiaorui, Yavuzhan Erdem, Immanuel Schweizer, Cristian Axenie
1st Du Xiaorui
Fakultät Informatik, Technische Hochschule Ingolstadt
Ingolstadt, [email protected]

2nd Yavuzhan Erdem
Fakultät Elektro- und Informationstechnik, Technische Hochschule Ingolstadt
Ingolstadt, [email protected]

3rd Immanuel Schweizer
Artificial Intelligence Research, Merck KGaA
Darmstadt, [email protected]

4th Cristian Axenie
Audi Konfuzius-Institut Ingolstadt, Technische Hochschule Ingolstadt
Ingolstadt, [email protected]
Abstract—Perceptual learning enables humans to recognize and represent stimuli invariant to various transformations and build a consistent representation of the self and the physical world. Such representations preserve the invariant physical relations among the multiple perceived sensory cues. This work is an attempt to exploit these principles in an engineered system. We design a novel neural network architecture capable of learning, in an unsupervised manner, relations among multiple sensory cues. The system combines computational principles, such as competition, cooperation, and correlation, in a neurally plausible computational substrate. It achieves that through a parallel and distributed processing architecture in which the relations among the multiple sensory quantities are extracted from time-sequenced data. We describe the core system functionality when learning arbitrary non-linear relations in low-dimensional sensory data. Here, an initial benefit arises from the fact that such a network can be engineered in a relatively straightforward way without prior information about the sensors and their interactions. Moreover, alleviating the need for tedious modelling and parametrization, the network converges to a consistent description of any arbitrary high-dimensional multisensory setup. We demonstrate this through a real-world learning problem where, from standard RGB camera frames, the network learns the relations between physical quantities such as light intensity, spatial gradient, and optical flow, describing a visual scene. Overall, the benefits of such a framework lie in the capability to learn non-linear pairwise relations among sensory streams in an architecture that is stable under noise and missing sensor input.
Index Terms—Multisensory Processing, Self-Organising Maps, Hebbian Learning, Neural Networks, Invariant Relations
I. INTRODUCTION
Perception is a process of acquiring information that happens over time [1]. The primary source of perceptual information are events, or "happenings over time" [2]. These events are the critical component of what is kept during perceptual learning and development. Moreover, multimodal processing emerges in the context of an event [3]. For example, optic flow and motion parallax emerge as one moves through the world, whereas accretion and deletion of visual texture elements occur when an object or part of the scene becomes progressively uncovered or occluded as a result of motion [4].

In any multi-sensory system some aspects of the world and the self will change during a given event. But some aspects will not; they stay invariant under the transformation. These invariances must be extracted from the multi-sensory system to learn more about the world and the self. One such invariance is the set of physical relations between the different modalities. These relations specify what is permanent, what is changing, and how the change is occurring [3]. For example, one's motion generates particular transformation patterns in the optic array, and it is through these patterns that one identifies unitary objects and, simultaneously, the trajectory and type of motion through the world.

Such invariant relations bind different sensory modalities into what is a consistent perception of the scene. But in most engineered systems these relations are specified at design time. It would be more plausible for the relations between multiple sensory systems to be learned, allowing the system to adapt should concept drift or concept shift occur.

Starting from these principles, we propose a biologically plausible neural network capable of extracting, in an unsupervised manner, the invariant physical relations among multiple sensory cues.
The network employs neural computing principles such as competition, cooperation, and correlation in neural populations to learn multimodal sensory relations without any prior knowledge about the type of sensory data and the underlying relations.

We describe the capabilities of such a network in a series of experiments on amodal data (e.g. timeseries) in order to give the reader an understanding of the processing mechanisms of the framework. Subsequently, we demonstrate the framework's applicability in learning invariant physical relations in the visual scene autonomously, without describing the scene mathematically. Once such relations have been learned, the system can simultaneously de-noise potentially perturbed sensory streams and even infer missing ones without any modification to its structure.

II. NETWORK ARCHITECTURE
The adaptive development of human perceptual capabilities seems to depend strongly on the available sensory inputs, which gradually sharpen their interaction during development, given the constraints imposed by multisensory relations [5]. Following this principle, we propose a model based on Self-Organizing Maps (SOM) [6] and Hebbian Learning (HL) [7] as the main components for learning underlying relations among multiple sensory cues. In order to introduce the proposed network, we provide a simple example in Figure 1. In this simple example, we consider two input sensor streams following a power-law dependency relation (cf. Figure 1a). The system has no prior information about the sensory data distributions and the generating processes, but learns the underlying (i.e. hidden) relation directly from the input data in an unsupervised manner.
Fig. 1. Basic functionality. a) Input data resembling a non-linear relation and its distribution; the relation is hidden in the data. b) Basic architecture of the network.
A. Core model
The input SOMs extract the distribution of the incoming sensory data, depicted in Figure 1a, and encode sensory samples in a distributed activity pattern, as shown in Figure 1b. This activity pattern is generated such that the neuron whose preferred value is closest to the input sample is strongly activated, and the activation decays, proportionally with distance, for neighbouring units. Each SOM neuron specialises to represent a certain (preferred) value in the sensory space and learns its sensitivity (i.e. tuning curve shape).

Fig. 2. Detailed network functionality.

Given an input sample s^p(k) from the sensor timeseries at time step k, the network follows the processing stages depicted in Figure 2. For the i-th neuron in the p-th input SOM, with preferred value w^p_{in,i} and tuning curve size ξ^p_i(k), the elicited activation is given by

$$a^p_i(k) = \frac{1}{\sqrt{\pi}\,\xi^p_i(k)}\, e^{-\frac{(s^p(k) - w^p_{in,i}(k))^2}{2\,\xi^p_i(k)^2}}. \quad (1)$$

The winning neuron of the p-th population, b^p(k), is the one which elicits the highest activation given the sensory input at time k:

$$b^p(k) = \operatorname{argmax}_i\, a^p_i(k). \quad (2)$$

The competition for the highest activation in the SOM is followed by cooperation in representing the input sensory space. Given the winning neuron b^p(k), the (cooperation) interaction kernel

$$h^p_{b,i}(k) = e^{-\frac{\|r_i - r_b\|^2}{2\,\sigma(k)^2}} \quad (3)$$

allows neighbouring cells (found at position r_i in the network) to precisely represent the sensory input sample given their location in the neighbourhood σ(k) of the winning neuron. The interaction kernel in Equation 3 ensures that specific neurons in the network specialise on different areas in the sensory space, such that the input weights (i.e. preferred values) of the neurons are pulled closer to the input sample:

$$\Delta w^p_{in,i}(k) = \alpha(k)\, h^p_{b,i}(k)\, \big(s^p(k) - w^p_{in,i}(k)\big). \quad (4)$$

This corresponds to updating the tuning curves, or sensitivities. Each neuron's tuning curve is modulated by the spatial location of the neuron in the network, the distance to the input sample, the interaction kernel size, and a decaying learning rate α(k):

$$\Delta \xi^p_i(k) = \alpha(k)\, h^p_{b,i}(k)\, \big((s^p(k) - w^p_{in,i}(k))^2 - \xi^p_i(k)^2\big). \quad (5)$$

As an illustration of the process, let us consider the learned tuning curve shapes for 5 neurons in the input SOMs (i.e. neurons 1, 6, 13, 40, 45), depicted in Figure 3.
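The competition-cooperation-adaptation loop of Equations 1-5 can be sketched as follows. This is an illustrative Python sketch with hypothetical names (the paper does not provide an implementation); neuron positions r_i are assumed to be the grid indices of a 1-D SOM.

```python
import numpy as np

def som_step(w, xi, s, alpha, sigma):
    """One adaptation step of a 1-D input SOM (sketch of Eqs. 1-5).

    w, xi : preferred values and tuning-curve sizes of the N neurons
    s     : current scalar sensory sample s^p(k)
    alpha : decaying learning rate alpha(k)
    sigma : neighbourhood size sigma(k)
    """
    # Eq. 1: Gaussian tuning-curve activation of every neuron
    a = np.exp(-(s - w) ** 2 / (2.0 * xi ** 2)) / (np.sqrt(np.pi) * xi)
    # Eq. 2: competition - the winner is the most active neuron
    b = int(np.argmax(a))
    # Eq. 3: cooperation kernel centred on the winner
    # (neuron positions r_i are taken as the grid indices)
    r = np.arange(len(w))
    h = np.exp(-(r - b) ** 2 / (2.0 * sigma ** 2))
    # Eqs. 4-5: pull preferred values toward the sample and
    # adapt the tuning-curve sizes accordingly
    d = s - w
    w = w + alpha * h * d
    xi = xi + alpha * h * (d ** 2 - xi ** 2)
    return w, xi, a, b
```

Iterated over the sensory timeseries with decaying alpha(k) and sigma(k), the preferred values converge toward a density-matched tiling of the input space, which is the behaviour described next.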
We observe that higher input probability regions are represented by dense and sharp tuning curves (e.g. neurons 1, 6, 13 in SOM1), whereas lower or uniform probability regions are represented by sparser and wider tuning curves (e.g. neurons 40, 45 in SOM1).

Fig. 3. Extracted sensory relation and data statistics for the data in Figure 1a.

Using this mechanism, the network optimally allocates resources (i.e. neurons): a higher number of neurons to areas in the input space which need a finer resolution, and a lower number to more coarsely represented areas.

Neurons in different SOMs are then linked by a fully (all-to-all) connected matrix of synaptic connections, where the weights in the matrix are computed using Hebbian learning. In the end, the matrix encodes the co-activation patterns between the input layers (i.e. SOMs), as shown in Figure 1b, and, eventually, the learned relation between the sensory cues, as shown in Figure 3. The connections between uncorrelated (or weakly correlated) neurons in each population (i.e. w_cross) are suppressed (i.e. darker colour), while the connections between correlated neurons are enhanced (i.e. brighter colour). The effective correlation pattern encoded in the w_cross matrix depicts the actual learnt relation. Mathematically speaking, the physical relation imposes constraints upon possible sensory values, hence supporting de-noising and inference. One can see the network dynamics as a constraint satisfaction problem, where the constraint is imposed by the underlying invariant relation among the sensor data (i.e. describing the physics laws). Formally, the connection weight w^p_{cross,i,j} between neurons i, j in the different input SOMs is updated with a Hebbian learning rule as follows:

$$\Delta w^p_{cross,i,j}(k) = \eta(k)\, \big(a^p_i(k) - \bar{a}^p_i(k)\big)\, \big(a^q_j(k) - \bar{a}^q_j(k)\big), \quad (6)$$

where

$$\bar{a}^p_i(k) = (1 - \beta(k))\, \bar{a}^p_i(k-1) + \beta(k)\, a^p_i(k), \quad (7)$$

and η(k), β(k) are monotonically decaying functions. Hebbian learning ensures that when neurons fire synchronously their connection strengths increase, whereas if their firing patterns are anti-correlated the weights decrease.

Self-organisation and Hebbian correlation learning processes evolve simultaneously, such that both the representation and the extracted relation are continuously refined as new data becomes available.
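The covariance-style Hebbian update of Equations 6-7 can be sketched as below. This is an illustrative Python sketch with hypothetical names, not the paper's implementation.

```python
import numpy as np

def hebbian_step(w_cross, a_p, a_q, abar_p, abar_q, eta, beta):
    """One Hebbian update of the cross-modal weight matrix w_cross
    linking two SOMs (sketch of Eqs. 6-7).

    a_p, a_q       : current activation vectors of the two SOMs
    abar_p, abar_q : their running averages (Eq. 7 state)
    eta, beta      : current values of the decaying functions
    """
    # Eq. 7: exponentially decaying average of the activities
    abar_p = (1.0 - beta) * abar_p + beta * a_p
    abar_q = (1.0 - beta) * abar_q + beta * a_q
    # Eq. 6: correlated deviations from the average strengthen a
    # connection; anti-correlated deviations weaken it
    w_cross = w_cross + eta * np.outer(a_p - abar_p, a_q - abar_q)
    return w_cross, abar_p, abar_q
```

Subtracting the running averages makes the rule covariance-like, so only co-fluctuations above baseline activity strengthen the linkage.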
III. EXPERIMENTS

A. Learning in low-dimensional multimodal sensory space
In the first batch of experiments we look at how the proposed network can learn arbitrary relations among multiple sensory cues m_i given sensory timeseries s_i. For this, we look at two scenarios where the dependencies between sensory cues are organised differently, namely as a tree (Figure 4) or in a circular structure (Figure 5). Such architectures impose different local dynamics and a different data flow when extracting the hidden relations. In both cases, data from the sensory timeseries s_i is fed to the network, which encodes each sensory cue m_i in the SOMs and learns the underlying relation in the Hebbian linkage. Using a set of arbitrary relations we demonstrate that the network learns the invariant relations among the interacting sensory cues (Figure 4b, Figure 5b), a fact validated by the decoded outcome overlaid in the figures over the original (i.e. hidden) relation in the data (Figure 4a, Figure 5a). As decoding mechanism we use an optimisation method that recovers the real-world value given the self-calculated bounds of the input sensory space. The bounds are obtained as the minimum and maximum of a cost function of the distance between the current preferred value of the winning neuron and the input sample at the SOM level. Depending on the position of the winning neuron in the N-dimensional SOM, the recovered value y(t) is computed as

$$y(t) = w^p_{in,i} \pm d^p_i,$$

where the sign is chosen according to the winning neuron's position, and, consistent with Equation 1,

$$d^p_i = \xi^p_i(k)\, \sqrt{-2\, \log\!\big(\sqrt{\pi}\, a^p_i(k)\, \xi^p_i(k)\big)}$$

for the most active neuron with index i in the SOM, preferred value w^p_{in,i}, and tuning curve size ξ^p_i(k). The optimiser is based on Brent's method [8], a recursive method to find the global optimum of a function for which the analytical form of its derivative is not available or too complex.

Fig. 4. Multimodal relations learning in low-dimensional space with the network of relations organised as a tree. a) Input data and decoded learnt data. b) Learnt relations among the sensory cues.
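The decoding step can be illustrated by inverting Equation 1 for the winning neuron. The following is an illustrative Python sketch with hypothetical names; it returns both sign candidates, whereas the actual system resolves the ambiguity with its Brent-based optimiser over the self-calculated bounds.

```python
import numpy as np

def decode_candidates(w, xi, a):
    """Recover the distance d of the hidden sample from the winning
    neuron's preferred value by inverting Eq. 1 (sketch).

    Returns the two candidate real-world values w_i - d and w_i + d.
    """
    i = int(np.argmax(a))
    arg = np.sqrt(np.pi) * a[i] * xi[i]
    # From Eq. 1: a = exp(-d^2 / (2 xi^2)) / (sqrt(pi) xi),
    # so d = xi * sqrt(-2 log(sqrt(pi) a xi)); clamp for safety.
    d = xi[i] * np.sqrt(max(0.0, -2.0 * np.log(arg)))
    return w[i] - d, w[i] + d
```

For example, a sample at 1.2 encoded by neurons with preferred values [0.0, 1.0, 2.0] and widths 0.5 yields candidates 0.8 and 1.2, one of which the optimiser then selects.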
B. Learning in high-dimensional multimodal sensory space
In this section we demonstrate that our system is capable of learning invariant sensory relations also in high-dimensional spaces, where the complexity of the interactions goes beyond simple algebraic relations. For our experiments we considered visual perception, where sets of invariant relations within stimulus and self-motion are described through spatio-temporal differentiation and integration [9]. For instance, the relation between the radius of curvature of the motion path and the two components of acceleration (i.e. tangential and normal) describes consistent visual motion perception through differentiation. Such accurate mathematical descriptions encourage us to hypothesise that a relational model explains visual motion perception based on physical constraints that keep the multiple sensory inputs describing the scene in agreement [4]. Moreover, "information from different modalities belongs together when it is unified by the same invariant relations" [10].

Fig. 5. Multimodal relations learning in low-dimensional space with a circular structure of the network of relations. a) Input data and decoded learnt data. b) Learnt relations among the sensory cues.

Such premises motivate our second batch of real-world experiments, where we demonstrate that the proposed network can learn invariant relations between physical quantities such as light intensity, spatial gradient, and optical flow, describing a visual scene from standard RGB camera frames.

Similar work was carried out by [11] for fast event-based (frameless) visual interpretation. There, the authors hard-coded the relations among the sensory cues by deducing simple update formulations from the vectorial representations in the geometry of the scene. In our experiments we alleviate the need to geometrically describe the problem and the difficulty this analytical formulation poses on the system design. We show how our network learns the underlying relations among the visual scene quantities and, through its dynamics, is able to provide a consistent scene interpretation that brings all sensory quantities into agreement.

In our real-world instantiation the network learns the relations between multiple visual cues: optic flow F, light intensity I, spatial light intensity gradient G, and temporal light intensity derivative V. The physics of the visual scene already imposes that G should be the gradient of I, and that the spatial variation G in brightness should match the time variation V according to the local optic flow F. Such constraints obey three-dimensional geometry and describe the mathematics of the scene [9].

Fig. 6. Network structure for visual scene interpretation. Physical relations between optical flow F, light intensity I, light intensity spatial gradient G, and intensity temporal derivative V.
For instance, the network describing the relation among V, F, and G (Figure 6), namely

$$-V = F \cdot G,$$

shows that the change in brightness over time is given by the speed of the optic flow times the change in brightness in the direction of the optic flow (i.e. the optical flow constraint equation [12]). Similarly, the relation between I and G requires that G = ∇I holds. Yet, the equation -V = FG cannot be solved uniquely for F, due to the aperture problem. Given a spatial gradient G and a temporal gradient V, the solutions for F lie on a vector perpendicular to G (i.e. for a limited aperture size and an edge structure, motion can only be estimated normal to that edge). This theoretical treatment is meant to help the reader understand the output of the network, which, again, had no prior knowledge about the relations or the sensory quantities fed to it.

In our experiments we subsequently fed the network pairs of: light intensity (I) and light intensity temporal derivative (V); light intensity (I) and light intensity spatial gradient (G); and light intensity temporal derivative (V), spatial gradient (G), and optical flow (F), respectively. The network learns the individual relations among these quantities from the incoming RGB camera frames (resolution: 200x200 px, grayscale (1 channel), framerate: 30 FPS). Note that in order to generate all quantities for learning the relations we used existing operators in state-of-the-art libraries (e.g. the Sobel operator and the Lucas-Kanade optic flow method in OpenCV). After the network learned the individual relations among pairs of sensory streams, we connected all of the SOMs corresponding to the sensors and their respective Hebbian linkages.

The network converges to a stable representation, as shown in Figure 7. Note that the relations are learned in a pairwise manner, subsequently allowing the network to converge given all relations.
The network allows for such a decoupling, as shown also in the tree and circular experiments in the previous section.

Fig. 7. Inferred quantities given the learned relations among multimodal cues in visual scene interpretation.
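The optical flow constraint discussed above can be checked numerically on a synthetic translating 1-D intensity profile. This is an illustrative NumPy sketch; the paper's experiments use real camera frames and OpenCV operators instead.

```python
import numpy as np

# A 1-D brightness pattern I(x, t) = sin(x - F * t) translating
# with constant optic flow speed F.
F = 2.0
dt = 1e-3
x = np.linspace(0.0, 10.0, 1001)

I0 = np.sin(x)           # intensity at t = 0
I1 = np.sin(x - F * dt)  # intensity at t = dt (pattern shifted by F*dt)

V = (I1 - I0) / dt       # temporal intensity derivative V
G = np.gradient(I0, x)   # spatial intensity gradient G

# Optical flow constraint equation [12]: -V = F * G
err = np.max(np.abs(-V - F * G))
```

Here err stays on the order of the discretisation error of the finite differences, confirming the invariant relation the network is expected to discover from data.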
The network learns the hidden relations among the multiple modalities describing the visual scene, as shown in Figure 8.
Fig. 8. Hidden relations among the multimodal sensory cues in visual scene interpretation.
IV. RELATED WORK
In order to frame our contribution, we comparatively describe other state-of-the-art approaches addressing the extraction of (unknown) relations among multiple data streams. We briefly address aspects such as the amount of prior information used in the model and the inference capabilities of the models.

Related work by Cook et al. [13] used a combination of simple biologically plausible mechanisms, such as Winner-Take-All (WTA) circuitry, Hebbian learning, and homeostatic activity regulation, to extract relations in artificially generated sensory data. After learning, the model was able to infer missing quantities given the learned relations and available sensors, clean up noisy sensory inputs, and integrate the sensory data consistently.

Using a different neurally inspired substrate, Weber et al. [14] combined competition and cooperation in a self-organizing network of processing units to extract coordinate transformations in a robotic visual object localization scenario.

Moving away from biological inspiration, Mandal et al. [15] used a nonlinear canonical correlation analysis method, termed alpha-beta divergence correlation analysis (ABCA), to extract relations between sets of multidimensional random variables. The model employed a probabilistic method based on nonlinear correlation analysis using a more flexible metric (i.e. divergence / distance) than typical canonical correlation analysis.

Using a neurally inspired computing substrate for implementing canonical correlation analysis, Hsieh et al. [16] proposed a model able to extract the underlying structures between two sets of variables under moderate noise conditions, basically employing nonlinear PCA.
Priors
Although less intuitive, the purely mathematical approaches [15] (i.e. using canonical correlation analysis) need less tuning effort, as the parameters are the result of an optimisation procedure. On the other side, the neurally inspired approaches [13], [14] or the hybrid approaches [16] (i.e. combining neural networks and correlation analysis) need a more judicious parameter tuning, as their dynamics are more sensitive and can either reach instability (e.g. recurrent networks) or local minima. Aside from parametrisation, prior information about the inputs is generally needed when instantiating the state-of-the-art systems for a certain scenario. Sensory value bounds and probability distributions must be explicitly encoded in the models through explicit tiling of tuning values over a population of neurons [13], [14], linear coefficients in vector combinations [15], or standardisation routines of input variables [16]. Our model does not require any specification of the bounds or other information about the input sensory data. It autonomously learns the statistics of the data required for learning (i.e. the probability density).
Inference
We consider here the capability to infer (i.e. predict) missing quantities once the hidden relation among sensory quantities is learned. The capability to use the learned functional relations to determine missing quantities is not available in all presented models, e.g. [15], due to the fact that the divergence and correlation coefficient expressions might be non-invertible functions. On the other side, using either the learned co-activation weight matrix [13], [14] or the known standard deviations of the canonical variates [16], these models are able to predict missing quantities to some extent. Our model is able, once it has learnt the hidden relation, to use it to infer possible values of a sensor given the other available ones. When addressing non-invertible relations (e.g. power law), our system uses the learnt domain and range (i.e. tuning curves and learnt data distribution) to identify the correct inverse function. As the domain and range of the inverse relation come from the range and domain of the hidden relation, respectively, the system can perform the swapping of domain and range to converge to a correct solution.

V. CONCLUSION
Looking at perception as a hierarchy of physical relations to fulfil is an interesting biological hypothesis. In this work we demonstrated that an engineered system, using principles of neural computation, is able to learn, without supervision, such relations describing low-dimensional sensory streams as well as high-dimensional visual scenes. The network has the potential to learn from arbitrary sensory streams and provide an interpretation of the multisensory scene without the need to model either the sensors or their interactions. We believe that such a framework can leverage multisensory processing by exploiting the underlying relations in real-world sensory streams and their interactions, employing simple computational principles towards efficient implementations across various scenarios.
REFERENCES

[1] K. E. Adolph and K. S. Kretch, "Gibson's theory of perceptual learning," International Encyclopedia of the Social and Behavioral Sciences, vol. 10, pp. 127–134, 2015.
[2] E. J. Gibson, Principles of Perceptual Learning and Development. Appleton-Century-Crofts, 1969.
[3] J. Gibson, The Ecological Approach to Visual Perception. Boston, MA, US, 1979.
[4] G. Johansson, "Visual perception of biological motion and a model for its analysis," Perception & Psychophysics, vol. 14, no. 2, pp. 201–211, 1973.
[5] G. Westermann, D. Mareschal, M. H. Johnson, S. Sirois, M. W. Spratling, and M. S. Thomas, "Neuroconstructivism," Developmental Science, vol. 10, no. 1, pp. 75–83, 2007.
[6] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biological Cybernetics, vol. 43, no. 1, pp. 59–69, 1982.
[7] Z. Chen, S. Haykin, J. J. Eggermont, and S. Becker, Correlative Learning: A Basis for Brain and Adaptive Systems. John Wiley & Sons, 2008, vol. 49.
[8] R. P. Brent, "An algorithm with guaranteed convergence for finding a zero of a function," The Computer Journal, vol. 14, no. 4, pp. 422–425, 1971.
[9] G. Johansson, "Spatio-temporal differentiation and integration in visual motion perception," Psychological Research, vol. 38, no. 4, pp. 379–393, 1976.
[10] E. J. Gibson and A. S. Walker, "Development of knowledge of visual-tactual affordances of substance," Child Development, pp. 453–460, 1984.
[11] M. Cook, L. Gugelmann, F. Jug, C. Krautz, and A. Steger, "Interacting maps for fast visual interpretation," in The 2011 International Joint Conference on Neural Networks. IEEE, 2011, pp. 770–776.
[12] B. K. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence, vol. 17, no. 1–3, pp. 185–203, 1981.
[13] M. Cook, F. Jug, C. Krautz, and A. Steger, "Unsupervised learning of relations," in International Conference on Artificial Neural Networks. Springer, 2010, pp. 164–173.
[14] C. Weber and S. Wermter, "A self-organizing map of sigma–pi units," Neurocomputing, vol. 70, no. 13–15, pp. 2552–2560, 2007.
[15] A. Mandal and A. Cichocki, "Non-linear canonical correlation analysis using alpha-beta divergence," Entropy, vol. 15, no. 7, pp. 2788–2804, 2013.
[16] W. W. Hsieh, "Nonlinear canonical correlation analysis by neural networks,"